Friday, September 18, 2009

Week 3 reading notes

Michael Lesk, Understanding Digital Libraries - sections 2.1, 2.2, 2.7, and Chapter 3



2.1 Computer Typesetting

The history of text standards for both software and printing... WYSIWYG formats developed in the 1970s brought a closer match between what appears on screen and what appears on the printed page... PostScript & SGML were both important developments for computer typesetting


2.2 Text Formats

Discussion of tags and searches within the three most notable standards (MARC, SGML, and HTML).... Difficulty of entering SGML labels in a WYSIWYG document... HTML, of course, supports hypertext links. Certain questions and issues arise when information professionals try to describe content, deduce keywords, rank search terms, etc.


2.7 Document Conversion


There are two primary strategies for converting text: scanning w/ Optical Character Recognition (OCR), and keying in materials (from the original or from a transcription; transcriptions usually preserve the original's errors).

I was struck by the number of errors in certain scans of the newspaper text, but I see that progress is being made in OCR systems and software. This is clearly very important for digital archiving of older materials.
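
To put a number on "how many errors," here is a minimal sketch (not from Lesk) of a character error rate: the edit distance between a hand-checked transcription and the OCR output, divided by the length of the transcription. The function names and the sample strings are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance normalized by the length of the trusted text."""
    if not truth:
        raise ValueError("the reference transcription must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# Hypothetical example: "h" misread as "b", and "m" misread as "rn".
truth = "the morning news"
ocr = "tbe rnorning news"
print(character_error_rate(truth, ocr))
```

Even a small rate matters for search, since one wrong character is enough to keep a word from matching a query.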

Chapter 3: Images of Pages


There are sizable differences in the quality of scanned images (depending on equipment) and in the costs of such efforts (keying vs. machine scanning). Review of specific equipment, etc.

Various image formats (compression algorithms, etc.)
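
As a rough illustration of why black-and-white page scans compress well, here is a minimal run-length coding sketch. This is a simplified stand-in chosen for the example, not any specific format from the chapter; the function names and the sample row are assumptions.

```python
from itertools import groupby


def run_length_encode(row):
    """Collapse a row of pixels into (pixel_value, run_length) pairs."""
    return [(pixel, len(list(group))) for pixel, group in groupby(row)]


def run_length_decode(runs):
    """Expand (pixel_value, run_length) pairs back into a row of pixels."""
    row = []
    for pixel, length in runs:
        row.extend([pixel] * length)
    return row


# Hypothetical scan line: 0 = white paper, 1 = black ink.
row = [0] * 5 + [1] * 3 + [0] * 4
runs = run_length_encode(row)
print(runs)
print(run_length_decode(runs) == row)
```

A mostly white page turns into a few long runs, which is where the savings come from; photographs with constantly changing pixel values would not benefit in the same way.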

Display requirements should allow for easy legibility on a variety of computer equipment

PDFs help retain the author's originally intended format & appearance.
PostScript --> Adobe Acrobat, PDFs

Indexing: thumbnails are helpful for users, but not machine-readable

Print indexes can be based on other standards

Again, OCR helps for searchability

For old newspapers' digitization, sometimes clipping story-by-story as image files works best

Some files share texts & images, tables, other formats, etc. -- a page can be broken into separate regions
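
One simple way to picture breaking a page into regions is to look for blank rows between blocks of ink. The sketch below is only an illustration under that assumption (real layout analysis also has to handle columns, skew, and noise); the function name and the tiny sample page are hypothetical.

```python
def split_into_row_bands(page):
    """Return (start_row, end_row) pairs for each run of rows containing ink.

    page is a list of rows, each a list of 0 (white) or 1 (ink).
    end_row is exclusive, so page[start_row:end_row] is the band.
    """
    bands = []
    start = None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                 # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))  # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(page)))  # band runs to the bottom edge
    return bands


# Hypothetical page: two inked bands separated by blank rows.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(split_into_row_bands(page))
```

Running the same idea on columns inside each band would split it further, for example separating a table from the text beside it.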

There are three library storage options: scanning, on-campus storage, and an off-site depository, each w/ pros & cons (off-site for rarer stuff)

Arms, Chapter 9. http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter9.html


Again, proof that OCR is vital to searchability of scanned texts.
