Specifications for Digitizing the Newspapers of Connecticut

Master files when scanning from microfilm

Thumbnail image of page 1 of the newspaper: Sunday Herald, Feb. 9, 1890 (Waterbury)

  • TIFF 6.0, uncompressed, representing the original size of the page
  • Grayscale, not black-and-white
    • Use 8-bit grayscale
    • Avoid 1-bit bitonal, which can reduce the accuracy of OCR
  • 300 to 400 dpi (relative to the size of original page and considering the size of lettering)
  • “Maximum resolution possible, between 300 and 400 dpi relative to the physical dimensions of the original newspaper” (quoted from NDNP Technical guidelines for 2016-18, p. 7)
  • “Image processing will not be applied to the TIFF, except for deskewing. The TIFF will be as close to the original produced by the scanner as possible. Deskewing will be applied if the skew is greater than 3 degrees” (quoted from NDNP Technical guidelines for 2016-18, p. 36)
  • “The image should be cropped to the page edge (not to the text block boundaries), retaining the actual edge and up to 1/4″ beyond” (quoted from NDNP Technical guidelines for 2016-18, p. 7)
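
To see what “300 to 400 dpi relative to the size of the original page” means in practice, the following sketch estimates pixel dimensions and uncompressed file size. The 15 × 22 inch page size is a hypothetical example, not a measurement from any particular title, and the size estimate ignores the small TIFF header.

```python
def pixel_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a scan that represents the original page size."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int = 8) -> int:
    """Approximate size of the uncompressed image data (header not counted)."""
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical 15 x 22 inch page scanned at 400 dpi in 8-bit grayscale.
width_px, height_px = pixel_size(15, 22, 400)
size = uncompressed_bytes(width_px, height_px)
print(f"{width_px} x {height_px} pixels, about {size / 1_000_000:.1f} MB")
```

When scanning from microfilm, the film frame is much smaller than the page, so the scanner setting has to be chosen so that the result still works out to 300 to 400 dpi against the original page dimensions.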

Master files when scanning from original pages

Thumbnail image of page 1 of the newspaper: Daily Norwalk Gazette, Aug. 28, 1895

  • TIFF 6.0, uncompressed, representing the original size of the page
  • Grayscale vs. color
    • If printed in black-and-white, use 8-bit grayscale
    • If printed with all or some pages in color, use 24-bit color
  • 300 to 400 dpi
  • Disbind, put the issues and pages in proper order, make a note of missing issues and prepare original newspapers for scanning
    • Whether you do this or the vendor does it, this is a vital step.
    • “Curves and bound volume newspapers will always inhibit OCR accuracy.” “Bound volumes, if they can be disbound, should always be disbound. … [Y]ou will never get the pristine results you’re looking for with a bound page.” (quoted from Digital Newspapers Technical specifications, in the section on “File Format and Resolution Recommendations” then “Digitizing from Paper” by the Newspaper Digitization Interest Group.)
    • Preparing to get the best scanned images requires the same level of effort as preparing to microfilm. See the sections “Collation and Bibliographic Preparation” and “Physical Preparation” in the USNP Preservation Microfilming Guidelines from the United States Newspaper Program.

Organizing your Files

Put the TIFF files for each issue in their own subfolder. The name of the folder is the issue date in this format: YYYY-MM-DD (with hyphens).

Put the Derivative files all in one subfolder called PDFs.
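
As a rough illustration of this layout, the sketch below works out where a delivered file belongs. It assumes file names follow the scheme described under “File Names and Folders”; the function name `destination` and the root folder name `delivery` are made up for this example.

```python
from datetime import date
from pathlib import PurePosixPath


def destination(root: str, file_name: str) -> PurePosixPath:
    """Return the folder path where a TIFF or PDF should be stored."""
    stem, _, ext = file_name.rpartition(".")
    ymd = stem.split("_")[1]  # second part of the name is yyyymmdd
    # date() raises ValueError if the name does not hold a real calendar date.
    issue = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:8]))
    if ext.lower() == "pdf":
        # All derivative files share one subfolder called PDFs.
        return PurePosixPath(root) / "PDFs" / file_name
    # TIFFs go in a subfolder named for the issue date, YYYY-MM-DD.
    return PurePosixPath(root) / issue.isoformat() / file_name


print(destination("delivery", "sn84022517_18810415_01.tif"))
print(destination("delivery", "sn84022517_18810415.pdf"))
```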

Derivative Files

Thumbnail image of page 1 of the newspaper: Thompsonville Press, Feb. 23, 1922 (Enfield)

For now, we ask for a multi-page searchable PDF, in addition to the TIFF files. We’ll use these in our CONTENTdm-based Newspapers of Connecticut collection.

In time, the online collection will migrate to the Connecticut Digital Archive (CTDA). At that point we expect to only need TIFF files. Over time, we expect CTDA to adapt to new technology and access tools by re-using the TIFF files and the metadata.

Use a high-quality OCR engine to make your searchable files. However, we understand that historical newspapers and old microfilm present challenges for OCR engines, and we won’t turn down images with poor OCR.

File Names and Folders

Use the Library of Congress Control Number (LCCN) or OCLC accession number for each title, followed by the issue date (yyyymmdd) and page number. This scheme keeps all issues of the same title together, in order by date and then by page number. Watch out: if the title changes, there will be a new LCCN or OCLC number. Do not use the file name scheme given in the NDNP Technical Guidelines.

TIFF files                    Multi-page PDF
sn84022517_18810415_01.tif    sn84022517_18810415.pdf
sn84022517_18810415_02.tif
sn84022517_18810415_03.tif
sn84022517_18810415_04.tif

  • Find the right LCCN in the catalog record in the US Newspaper Directory, 1690-Present
    • Remove the space. That is, change it from “sn 84022380” to “sn84022380”.
    • There will be a new catalog record and new LCCN each time the title changes.
      • But if a title change lasted for less than a year, it will just be noted on the record for the former title. Use the LCCN for the former title.
      • Sometimes the Sunday paper has a different title. If this is noted on the record for the daily title, use the LCCN for the daily.
    • To find the preceding and succeeding titles in a newspaper family, follow the trail in the US Newspaper Directory, 1690-Present. To help to visualize this see if your newspaper is included in Charts of Related Newspapers Published in Connecticut. Start with the index on page 3.
    • Ask us for help if it is hard to determine the right LCCN.
    • We also accept the OCLC accession number in place of the LCCN. Find the right record in WorldCat to get the OCLC number.
  • Give the date next using this format: yyyymmdd
    • Use four digits for the year, two for the month and two for the day.
  • Add zeros before the page number so all pages will sort in the directory in proper order. If you don’t, page 1 will be followed by page 10. Instead, call it page 01. If a special anniversary edition has more than 99 pages, call the first page 001.
  • See examples above.
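
The naming rules above can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the rule of widening the page number to three digits is keyed to the total page count of the issue, which is our reading of the anniversary-edition note.

```python
from datetime import date


def normalize_lccn(lccn: str) -> str:
    """Remove spaces, so "sn 84022517" becomes "sn84022517"."""
    return "".join(lccn.split())


def date_part(issue_date: date) -> str:
    """Format the issue date as yyyymmdd."""
    return issue_date.isoformat().replace("-", "")


def tiff_name(lccn: str, issue_date: date, page: int, total_pages: int) -> str:
    """File name for one page; pads to 2 digits, or 3 for issues over 99 pages."""
    width = max(2, len(str(total_pages)))
    return f"{normalize_lccn(lccn)}_{date_part(issue_date)}_{page:0{width}d}.tif"


def pdf_name(lccn: str, issue_date: date) -> str:
    """File name for the multi-page PDF of a whole issue."""
    return f"{normalize_lccn(lccn)}_{date_part(issue_date)}.pdf"


issue = date(1881, 4, 15)
print(tiff_name("sn 84022517", issue, 1, 4))
print(pdf_name("sn 84022517", issue))
```

An OCLC number can be passed in place of the LCCN; the functions treat it as an opaque identifier.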

If your vendor cannot follow these practices or if you have existing digital files that don’t meet this standard, call us. We will discuss how to handle the situation on a case-by-case basis. We’ll figure out if we can make use of your file names.

Do Quality Control

Thumbnail image of page 1 of the newspaper: Wethersfield Weekly Farmer, June 4, 1887

Plan to do a quality check when scanning is completed. At least do a spot check, especially if you know of any problem areas.

Tell us of anything that can’t be fixed and would make it hard for patrons to find what they need.

  • Do you have the right number of issues? Were any issues on the film not scanned?
  • Do any issues have an odd number of pages? Newspapers should always have an even number of pages (with extremely rare exceptions). Look at the issue to see why or ask the vendor about this.
    • Tell us if a page was missing from the film.
  • If an issue was filmed with pages in the wrong order, was this fixed in the TIFF files and PDF?
  • If some pages were filmed twice, were they kept (because both copies are useful) or was one deleted from the TIFF files and PDF?
  • Do the files have the correct LCCN? If the wrong LCCN was used, the metadata won’t match the issues. This can get tricky if the title changed part way through a reel or if the film had a few extra issues of a different title.
  • Does the file name have the wrong date? This can get tricky if an issue was on the film in the wrong place.
    • Get your vendor to fix this or tell us if a file name is wrong.
  • Is the date on page 1 wrong?
    • Tell us so we can add a note to the metadata and make sure the issue can be properly identified. See example below.
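
Some of these checks can be partly automated from the file names alone. The sketch below is one possible approach, not a required tool: it flags names that don’t fit the scheme, impossible dates, gaps in page numbering, and issues with an odd number of pages. The pattern used for the identifier (optional lowercase letters followed by digits) is an assumption meant to cover both LCCNs and OCLC numbers. It cannot tell you whether the LCCN or date is actually correct for the issue; that still takes a person looking at the pages.

```python
import re
from collections import defaultdict
from datetime import date

# Assumed pattern: identifier, yyyymmdd, then a 2- or 3-digit page number.
NAME = re.compile(r"^([a-z]*\d+)_(\d{8})_(\d{2,3})\.tif$")


def check(file_names: list[str]) -> list[str]:
    """Return a list of human-readable problems found in TIFF file names."""
    problems = []
    pages = defaultdict(list)  # (identifier, yyyymmdd) -> page numbers
    for name in file_names:
        match = NAME.match(name)
        if not match:
            problems.append(f"{name}: does not match the naming scheme")
            continue
        ident, ymd, page = match.groups()
        try:
            date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
        except ValueError:
            problems.append(f"{name}: {ymd} is not a real calendar date")
            continue
        pages[(ident, ymd)].append(int(page))
    for (ident, ymd), nums in sorted(pages.items()):
        nums.sort()
        if nums != list(range(1, len(nums) + 1)):
            problems.append(f"{ident}_{ymd}: page numbers have gaps or repeats")
        elif len(nums) % 2:
            problems.append(f"{ident}_{ymd}: odd number of pages ({len(nums)})")
    return problems


for problem in check([
    "sn84022517_18810415_01.tif",
    "sn84022517_18810415_02.tif",
    "sn84022517_18810415_03.tif",
    "sn84022517_1881045_04.tif",  # date is one digit short
]):
    print(problem)
```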

Example of Misdated Issue

Before this issue of the Thomaston Express was microfilmed, someone corrected the date on page 1. It was printed as the Jan. 4, 1934 issue, but the date was changed to Jan. 4, 1935. You can’t see the volume and issue number in the picture, but they confirm that the printer forgot to change the year on page 1 after the New Year’s celebration.

1935 issue misdated as 1934. Thomaston Express Jan. 4, 1935

The file name for this issue should have the date 19350104, not 19340104.

Give us a list of misdated issues so we can add a note to the metadata and make sure the issue can be properly identified.
