Michael Lesk, Understanding Digital Libraries - sections 2.1, 2.2, 2.7, and Chapter 3
2.1 Computer Typesetting
The history of text standards for both software and printing... WYSIWYG formats developed in the 1970s led to more user-friendly transparency between screen and page... PostScript & SGML were both important developments for computer typesetting.
2.2 Text Formats
Discussion of tags and searches within the three most notable standards (MARC, SGML, and HTML).... Difficulty of entering SGML labels in a WYSIWYG document... HTML, of course, supports hypertext links. Certain questions and issues arise when information professionals try to describe content, deduce keywords, rank search terms, etc.
2.7 Document Conversion
There are two primary strategies for converting text: scanning w/ Optical Character Recognition (OCR), and keying in materials (original or transcribed, with original errors usually included in transcriptions).
I was struck by the number of errors in certain scans of the newspaper text, but I see that progress is being made in OCR systems and software. This is clearly very important for digital archiving of older materials.
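One common way to put a number on "how many errors" a scan has is a character error rate: the edit distance between a hand-keyed reference and the OCR output, divided by the length of the reference. The sketch below is only an illustration of that idea, not something taken from Lesk; the function names and the sample strings are my own assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: inserts, deletes, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the trusted reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: a keyed-in line and an imagined OCR reading of it.
reference = "The quick brown fox"
ocr_output = "Tbe quick hrown fox"
print(f"errors: {edit_distance(reference, ocr_output)}")
print(f"character error rate: {character_error_rate(reference, ocr_output):.3f}")
```

Comparing this rate across scanners or OCR packages on the same sample page would give a rough sense of the progress being made.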
Chapter 3: Images of Pages
There are sizable differences in the quality of scanned images (depending on equipment) and in the costs of such efforts (keying vs. machine scanning). Review of specific equipment, etc...
Various image formats (compression algorithms, etc.)
Display requirements should allow for easy legibility on a variety of computer equipment
PDFs help retain the author's originally intended format & appearance.
PostScript --> Adobe Acrobat, PDFs
Indexing: thumbnails are helpful for users, but not machine-readable
Print indexes can be based on other standards
Again, OCR helps for searchability
When digitizing old newspapers, clipping story-by-story as image files sometimes works best
Some files combine text & images, tables, other formats, etc. -- a page can be broken into separate regions
There are three library storage options: scanning, on-campus storage, and an off-site depository, each w/ pros & cons (off-site for rarer stuff)
ARMS. Chapter 9. http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter9.html.
Again, proof that OCR is vital to searchability of scanned texts.
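To make the searchability point concrete: a page image by itself is just pixels, but once OCR has produced text for each page, a simple inverted index (word -> page numbers) can answer keyword queries. This is a minimal sketch of that idea, not anything described in the readings; the page texts, function names, and tokenization rule are all my own assumptions.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each word to the set of page numbers whose OCR text contains it."""
    index: dict[str, set[int]] = defaultdict(set)
    for page_number, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_number)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output for three scanned newspaper pages.
pages = {
    1: "Library opens new reading room",
    2: "City council funds library scanning project",
    3: "Weather: rain expected",
}
index = build_index(pages)
print(sorted(search(index, "library")))
print(sorted(search(index, "library scanning")))
```

OCR mistakes matter here too: a misread word on a page simply never makes it into the index, so that page will not be found under the correct spelling.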