Sunday, December 6, 2009
Joseph E. Stiglitz -- "Intellectual Property Rights and Wrongs"
I've been thinking a lot about intellectual property for the other class I'm taking, Kip Currier's Legal Issues - Copyright class. So yes, I agree with Stiglitz's contention that sharing intellectual property is crucial for advances in medicine, science, technology, and other research. The author raises some of the key issues we've been considering about how abuses of legal power can impede innovation merely to benefit corporations or other rights holders. As a prospective digital librarian, I of course tend to favor open access to research and information. But in the last few months I've become aware of many unfortunately thorny issues that librarians must face in attempting to provide information to the public.
Clifford Lynch -- "Where Do We Go from Here? The Next Decade for Digital Libraries"
Preservation concerns are also something I've been thinking about lately, with regard to both physical and digital materials. One angle of this article that I found interesting was the idea that long-term preservation of intellectual property is too important to be entrusted to librarians alone, since they may be considered only "one group among a broad array of stakeholders." I'm glad to see that funding for digitization initiatives has increased over the last decade or so, thereby "validating" the mission and forming communities among diverse organizations, as the article points out. And indeed, the article reminds us that digital collection creation and management are essential to a huge range of industries and institutions, such as (just for example) engineering firms, homeland security, museums, personal archives, schools, laboratories, and historical societies.
I appreciate the author's effort to consider a "long time horizon perspective" in integrating digital information management technologies for multiple purposes across people's lifetimes. As a side note, this makes me think of the excellent book "The Clock of the Long Now" by Stewart Brand, which I read as an optional assignment for Dr. Richard Cox's class in archival ethics. The book was a fascinating meditation on ultra-long-term preservation. Highly recommended!
Sunday, November 29, 2009
Reading notes for Security and Economics
W. Y. Arms -- Implementing Policies for Access Management
This article addresses issues in access management of electronic documents. Many institutions wish to restrict access to online documents for reasons of privacy, security, or payment. The simplest framework places policies at the center, such that every user and collection has an associated policy; however, under this model any policy change would require altering every document, a time-consuming and error-prone prospect. Alternatively, another approach uses containers of information to encapsulate policies and more easily transmit, change, and enforce them. The article's table in section two was helpful for illustrating a simple breakdown of users, attributes, and operations, which together comprise a policy.
Of course, digital materials' metadata allows for many attributes to be associated with every item, and users' logins may easily demarcate different populations. However, interoperability is still a challenge when dealing with multiple libraries' collections. The article shows why it's best to keep attributes, policies, operations, etc. separate for easy management.
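To make that separation concrete, here's a minimal sketch (in Python, with entirely hypothetical names -- the article itself gives no code) of a policy object that lives apart from the documents it governs; a rule change then touches one object instead of every file:

# A minimal sketch of keeping policies separate from documents; all names
# here are hypothetical illustrations, not from the article.
class Policy:
    """Maps a user attribute to the set of operations it permits."""
    def __init__(self, permitted):
        self.permitted = permitted  # e.g. {"student": {"read"}}

    def allows(self, user_attribute, operation):
        return operation in self.permitted.get(user_attribute, set())

# One policy object can govern an entire collection, so changing access
# rules means editing one policy, not every document in the collection.
campus_only = Policy({
    "student": {"read"},
    "staff": {"read", "annotate"},
})

print(campus_only.allows("student", "read"))      # True
print(campus_only.allows("student", "annotate"))  # False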
Lesk Chapter Nine
The chapter begins by pointing out that traditional libraries have often been financially extravagant, and questioning whether digital libraries offer a more economically reasonable alternative.
Funding models for digital libraries include:
* institutional support
* charging users
* advertisers
* other, such as pledge drives for donations
Of course, the traditional library has not been monetized, so users may be resistant to paying for digital library services. Although costs for digital copying are low to nil, if consumers expect instant and unlimited copies, publishers stand to lose money.
Costs of academic texts and journals are particularly high, causing many libraries to reduce their offerings -- a loss to scholars. Sometimes libraries switch to on-demand acquisition only. I was interested in the notion of libraries as "buyers' clubs," wherein people pool their money to buy a single copy of something, but the article mentions many problems (and even paradoxes) with this approach.
Subscription libraries are one option, with parallels both in history and in organizations such as video lending services. A per-item or a per-month/year fee model may be used.
One problem libraries may face with digital materials is the loss of access to previously acquired materials. When a library cancels a subscription to a print journal, it still owns the titles it has already bought; in the digital realm, all access to back issues may be denied with the cancellation of a subscription.
Yet another issue is the difficulty of obtaining access to copyrighted work in order to digitize it. This process can consume far too much time and money to be worthwhile.
Worse than a muddy point...
This past week I compiled the metadata for all of my group's documents on Greenstone, and everything seemed to be going fine at first. However, at (what I thought would be) the very end of the project, when I tried to build the digital library, Greenstone kept giving me messages such as "The file [filename.pdf] was recognised but could not be processed by any plugin." Then at the end of processing, it gave me the following message:
120 documents were considered for processing:
46 documents were processed and included in the collection.
74 were rejected.
From then on, I no longer had access to the rejected documents via Greenstone. They were still on my computer, but had been relegated to an "archives" folder in the Greenstone file structure, with names like Hash9bb4.dir in place of the more recognizable file or folder names. I have no idea what to do about this! Are two thirds of our metadata-enriched PDFs totally unusable in Greenstone? And why??? I tried looking up the issue online, but only found a single site with a vague suggestion to convert the PDFs to HTML using third-party software.
If anyone reading this has any ideas, please comment or email me...
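For the record, here's roughly what that one suggested workaround would look like as a script -- a sketch only, assuming the third-party pdftohtml utility is installed and that the rejected files have been copied into a folder (the folder names are made up):

# A sketch of the workaround I found: batch-convert the rejected PDFs to
# HTML with the pdftohtml command-line tool (installed separately), then
# re-import the HTML into Greenstone. Folder names are hypothetical.
import subprocess
from pathlib import Path

rejected_dir = Path("rejected_pdfs")   # the 74 rejects, copied out by hand
output_dir = Path("converted_html")
output_dir.mkdir(exist_ok=True)

for pdf in sorted(rejected_dir.glob("*.pdf")):
    target = output_dir / pdf.stem
    # -s writes one HTML document containing all pages of the PDF
    subprocess.run(["pdftohtml", "-s", str(pdf), str(target)], check=True)
    print(f"converted {pdf.name}")

No promises that this fixes whatever the PDF plugin is choking on, but it's the only lead I have so far.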
Sunday, November 8, 2009
Reading notes for Evaluation
The link to Arms' chapter didn't work, so I looked it up elsewhere and found a 1999 version at the following web address: http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter8.html
I assume it's not too outdated!
It gives a good basic introduction to the principles of a user interface, and the ways that an interface should change along with technology over time. The conceptual model that this chapter proposed was pretty straightforward, as was its review of browser technology. Most of the chapter's points are ones I was already familiar with, e.g. that a web designer must balance effective use of advanced or sophisticated features with the ability to offer simplicity and speed for less well-equipped users. I was also already familiar with mirroring and caching -- and was interested to see that when this article was written (presumably in 1999) video skimming was still mostly an idea for future development. What I found most interesting were the chapter's brief references to the writer's own experiences, such as the fact that his online magazine redesigned its interface yearly.
Kling and Elliott's article brings a focus to usability concerns in designing an interface; they recognize that ease of use improves users' performance. They break ease of use down into four components:
Learnability - which also concerns the speed with which a user can begin using the software
Efficiency - how productively a user can make use of the system
Memorability - whether the user can easily return to using the system after an absence
Low error rate - no catastrophic errors and easy recovery from the minor ones
Clearly an intuitive system organization that performs well in practice will lead to the best results. Given that users of digital libraries will have all sorts of different goals and intentions, it's probably best for system developers to survey users frequently to determine the areas in need of improvement.
For organizations, the authors break down concerns as follows:
Accessibility - the ease with which people can locate specific systems and content, both physically and administratively
Compatibility - of file transfers between systems
Integrability into work practices - how smoothly the system fits existing practices
Social-organizational expertise - how well people can obtain training and consulting to learn to use systems and troubleshoot
Unsurprisingly, many of the authors' recommendations for digital libraries involve testing systems, surveying users, exploring multiple design alternatives, etc. They implore us to pay attention to cultural models of user bases, reminding us that a system appropriate for elementary schoolchildren will not be as appropriate for graduate-level science laboratories.
Finally, Tefko Saracevic's article evaluates evaluation: analyzing the methods and contexts of the (relatively rare) evaluations of several different digital libraries. The article goes into detail about the evaluative methods that were used, and highlights the variety of approaches possible: usability-centered (as with the article above), ethnographic, anthropological, sociological, and economic. The article highlights many distinct metrics of assessment, and briefly acknowledges that alongside the many factual criteria there is also a role for human judgment in certain evaluations. Digital libraries are fairly new, so it is understandable that not much evaluation has been done on them, but one of the take-home messages of this article is that despite an apparent lack of interest and a definite lack of funding, evaluation is important and should become a bigger part of digital library culture.
Labels:
design,
digital libraries,
evaluation,
interface
A note on "muddy points"
I've found all of the recent lectures to be clear and comprehensive; thanks for that! I'll post if I think of any questions over the course of the project I'm working on, but for now, I don't really have any muddy points that would be useful to address in tomorrow's lecture. I almost feel disappointed about that!
Tuesday, November 3, 2009
I didn't mean to fall behind...
My poor, neglected blog. Well, assuming that a late posting is better than none, here are my reading notes for the Preservation of Digital Materials unit.
Preservation in the Age of Large-Scale Digitization -- A White Paper by Oya Y. Rieger
I appreciated the wide scope and long-term perspective of this paper. Given the pace at which technologies change, it's essential to ask questions such as "who will ensure that digital content remains accessible over time?"
The article points out the difference between digital backups (to ensure against destruction of physical texts) and a bona fide digital library (searchable, indexed, copyright-cleared, etc.). It's a reminder of considerations to keep in mind when transitioning a "backup" repository into a digital library.
The paper gives a rundown of some of the key players in digitization, including OCLC, the OCA, Google, Microsoft, and the Million Book Project. I appreciated Table 1, which lays out the essential aspects of various initiatives (their distinguishing features & goals) -- personally I'm most interested in Google Book Search, simply because of its hugely ambitious intentions; I'll continue to follow news about it.
As a side note, I was interested to hear the figure that (at least in Cornell's study) about 10% of library books accounted for about 90% of circulation. While this may make digitization priorities a bit easier to set, on another level it makes me a bit sad; I'd like to think of people reading a wider variety of things!
I agree with criticisms (mentioned in the article) of the Google Book project over uploading scans with poor image quality, missing text, or other defects. For example, when I recently looked up a Chaucer poem on Google Books, the copy that came up was covered in extensive handwritten notes; it seemed odd to me that this adulterated text was selected as the copy to be scanned. Surely they could have easily found a cleaner copy? (Or at least digitally removed the margin notes?) I know Chaucer's words won't ever be lost, but as for lesser-known texts, I do fear -- as the article mentioned -- that once digital copies are uploaded, some originals will be discarded, even if their contents were not always properly preserved.
The article details some storage and retrieval concerns, as well as some security and environmental considerations. Throughout the article, I was aware of how today's decisions will affect tomorrow's library conditions; it's worth making careful quality assessments while we're at this crucial point of transition into digitized formats.
Finally, all of the registry and copyright concerns that were mentioned here dovetail interestingly with the other class that I'm taking this semester, Legal Issues and Copyright. I've become increasingly aware of the ways in which legal concerns can curtail a library's practices, and I hope to find ways to circumvent perceived restrictions and allow access of library texts as widely as is legally feasible.
Research Challenges in Digital Archiving and Long-term Preservation by Margaret Hedstrom
This article starts with an interesting point, that "many of the digital resources we are creating today will be re-purposed and re-used for reasons that we cannot imagine today" -- while at the same time, evolving technologies make traditional paradigms obsolete. Just as preservation of library materials has always focused on the very long term, digital preservation should be enacted with an eye to long-term feasibility, including the adaptability of metadata, ease of restructuring, and sustainability of infrastructure.
Actualized Preservation Threats -- Practical Lessons from Chronicling America by Justin Littman
Chronicling America is an initiative to digitize several historical American newspapers and provide public access. This article focused on several of the things that can go wrong in digital preservation efforts, such as software errors, operator errors, hardware failure, and problems with media drives and file corruption. At best, such issues slow down a data transfer process, and at worst, data is lost. But this article led me to believe that even the worst cases of data loss are remediable as long as operators are paying close attention to issues of data integrity.
Labels:
Chronicling America,
Google Books,
preservation
Sunday, October 25, 2009
Reading notes for Access in Digital Libraries II
Chapter 1. Definition and Origins of OAI-PMH --
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH): a relatively simple protocol for sharing descriptive data, broadly useful (esp. for digital libraries)
-Created to aid the development of services across similar items (e.g. journal articles, video clips, etc.)
-Allows transfer of metadata online
-It's important not to assume a context that would be obvious within an institution but not to outsiders; once the collection is shared, there will be no metadata to convey what was obvious only within the institution
-The OAI technical committee worked throughout 2001 to establish the metadata issues most in need of consideration
-While OAI-PMH enables searches across repositories, it is not itself a protocol for searching
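To see how simple the protocol really is, here's a sketch of a single ListRecords harvest in Python; the repository URL is hypothetical, but the verb, metadataPrefix, and XML namespaces are the standard OAI-PMH ones:

# A minimal OAI-PMH harvest, to make the "relatively simple" claim
# concrete. The endpoint URL is made up; any OAI-PMH repository would do.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://example.org/oai"  # hypothetical repository endpoint
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Dublin Core fields arrive inside each <record>'s <metadata> element.
ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for record in tree.iterfind(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    print(title.text if title is not None else "(no title)")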
Todd Miller -- Federated Searching: Put It in Its Place --
This article posits the idea that "only librarians like to search; everyone else likes to find." (This may be an oversimplification; just for example, surely many users benefit from playing around with search terms, or find interesting new materials within a search for other items...) It points out that library searches limited to cataloged metadata pertaining to books are insufficient for the twenty-first century, when searchability should extend into the full text of a broader range of materials (especially digital documents). Thus, the article draws a distinction between catalog searches of books (relying on metadata) and Google searches, which can more thoroughly index a text's entire content. The article argues for simplicity and access, claiming that efforts to make information more secure usually make it less accessible. This would seem to be an obvious point.
The Truth About Federated Searching--
The article debunks the five most common myths about federated searching. In doing so, it highlights the importance for libraries of using their own authentication when possible in order to keep authentication problems from preventing effective searches for remote users. The article also helped me see that federated searching is not just software, but a service that constantly updates itself and helps a library avoid the need to update translators for its search terms (which can result in disruption of service).
The Z39.50 Information Retrieval Standard--
This article gives a helpful overview and history of Z39.50. I was most interested in the section about the role of content semantics, which allow for more abstract associations in searching. There are endless classes of information mapping, and I'm wondering how consensus is reached regarding the structure of content semantics. Also, this is a fairly old article; I'm wondering what may have changed in the last twelve years?
Search Engine Technology and Digital Libraries-
One of the interesting things this article pointed out was that libraries still see themselves as repositories of collections, rather than "gateways" to information that already exists online. The article highlights the importance of libraries' awareness of existing digital resources, and argues for their role evolving to include serving as portals to the academic web. It claims that the younger generations express a strong preference for "Google-like" access to information over traditional catalogs, and examines libraries' resistance to commercial search engines while suggesting ways in which such search technologies could be integrated into sustainable system architecture for library collections and digital materials. Are more libraries indeed creating their own local search engine infrastructures in order to build further indexes? And if so, is interoperability a great concern?
Labels:
digital libraries,
federated searching,
OAI-PMH
Monday, October 19, 2009
Week 7 reading notes, etc.
I'm back from Pittsburgh, and although it was a rushed trip, I'm glad to have had the chance to meet with some of the people in my classes.
And yes, I'm a bit late in posting these reading notes, but better late than never, I suppose.
David Hawking, Web Search Engines Part I--
I had never heard the term "politeness" to describe a way to prevent too many server requests from forming a bottleneck, but it makes sense to introduce delays when necessary in order for a server not to be overwhelmed (in the same way highway on-ramps at rush hour only allow one car per green light). Parallelism is clearly important for making maximal use of the server's capacities; I was surprised to hear how complex (and prone to crash) these systems are.
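As a toy sketch of that on-ramp idea (my own illustration, not Hawking's code): enforce a minimum delay between successive requests to any one host.

# A toy version of crawler "politeness": a minimum delay between
# successive requests to the same host, like a metering light.
import time
from urllib.parse import urlparse

MIN_DELAY = 2.0    # seconds between hits to one host (illustrative value)
last_hit = {}      # host -> time of our last request to it

def polite_fetch(url):
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)         # back off so the server isn't swamped
    last_hit[host] = time.time()
    print(f"fetching {url}")     # a real crawler would download here

for u in ["http://example.org/a", "http://example.org/b"]:
    polite_fetch(u)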
David Hawking, Web Search Engines Part II--
I was interested to read about some of the ways that algorithms aim to improve result quality within search engines. This reminded me of some of the things Professor He said in the lecture when I was on campus about how the highest number of hits doesn't always equal the "right" search result: for example, "Java" might mean a programming language, a country, or coffee, and a good search engine will show all three of those on the front page of results. (Similarly, the article mentioned the importance of distinguishing a search for the political satire magazine "The Onion" from countless recipes and gardening sites.) Programmers clearly have to think of many subtleties when designing search engines; I was glad to learn about some aspects that I hadn't considered (skipping, caching, assigning document numbers intentionally, etc.).
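Here's a toy sketch of that front-page diversification idea -- my own illustration with made-up sense labels, since real engines infer the topics statistically rather than reading them off a tag:

# A toy re-ranker for the "Java" problem: never let one sense of an
# ambiguous query fill the whole front page.
def diversify(results, page_size=3):
    """results: list of (title, sense) pairs, best-scored first."""
    page, seen = [], set()
    for title, sense in results:
        if sense not in seen:        # take the top hit from each sense first
            page.append(title)
            seen.add(sense)
        if len(page) == page_size:
            break
    return page

hits = [
    ("Java tutorial", "language"),
    ("Java SE downloads", "language"),
    ("Visit Java, Indonesia", "place"),
    ("Best java in town", "coffee"),
]
print(diversify(hits))  # one result per sense on the first page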
Lesk Chapter 4 --
Text searching seems utterly simple compared to searching other digital media files' content. I know the technology is still developing, but it's remarkable to me that automatic recognition programs (for indexing images) work at all, and it's unsurprising that they currently work only imperfectly. I'm also not surprised that media formats whose content unfolds over time, such as video and audio, are still harder to search by content than images are. I know that textual tagging helps greatly, and can be used in numerous ways, as when Pandora.com uses musically relevant textual tags such as "downtempo beats" and "female vocalist" to home in on a user's musical preferences.
Friday, October 9, 2009
"Muddy point" for Week 5
A fairly basic question this week. I've never worked with Dublin Core, so I'm not sure how far its purview extends. For example, does the metadata by default feed into search engines, if the digital library exists online? Could the cataloger specify that as an option, or is Dublin Core's searchability always limited only to an institution's own network? It seems like it could be especially helpful for a commercial search engine to pick up on Dublin Core's metadata, but I could also understand why search engines might want to avoid metadata that has been pre-programmed by users (for example, to avoid spammers using false metadata as hooks to give their own sites a higher priority).
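From what I can tell, Dublin Core doesn't feed into search engines by itself; it has to be exposed somewhere a crawler actually looks, for example embedded as meta tags in a page's head (the DCMI convention for encoding Dublin Core in HTML). A sketch, with a record I made up for illustration -- and, as I speculated above, a crawler is free to read or ignore it:

# Exposing a Dublin Core record to web crawlers as HTML meta tags.
# The record itself is hypothetical.
record = {
    "DC.title": "Sunset over South Mountain",
    "DC.creator": "Example Photographer",
    "DC.date": "2009-10-04",
    "DC.format": "image/jpeg",
}

head = "\n".join(
    f'<meta name="{name}" content="{value}">'
    for name, value in record.items()
)
print(head)  # paste into the page's <head>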
Labels:
Dublin Core,
Metadata,
search engines
Sunday, October 4, 2009
Assignment 2, Part 1
For the first part of Assignment 2 I chose five digital images I had taken from the upstairs windows of my house in New Mexico, and saved the large master copies in their own folder. I then used Microsoft Office Picture Manager and Pixlr.com (a great site; I'm glad to know about it now!) to reduce the photographs to smaller versions and create extra-small thumbnails. Finally, I uploaded the reduced-dpi photos, the thumbnails, and (just for good measure) the master copies all to my Flickr page, creating a set for this assignment. Here's the link to the whole set:
http://www.flickr.com/photos/43242999@N03/sets/72157622394273731/
And here are copies of the images, just to enhance the look of my web page!
First, a sunset viewed from the balcony:
[photo]
Next, a sunset detail:
[photo]
Thirdly, a small rainstorm sweeping over South Mountain at sunset:
[photo]
Fourthly, the full moon setting over the Sandia Mountains one morning:
[photo]
And finally, a rainbow ending at South Mountain:
[photo]
Enjoy!
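(For anyone curious, the resizing step could also be scripted instead of clicked through. Here's a sketch using the Pillow imaging library -- an assumption on my part, since my actual workflow used the GUI tools above, and the folder names are made up.)

# The resizing I did by hand in Picture Manager / Pixlr, as a script:
# shrink each master image to a reduced copy and an extra-small thumbnail.
from pathlib import Path
from PIL import Image

masters = Path("masters")            # folder of full-size originals
for size, folder in [((1024, 1024), "reduced"), ((150, 150), "thumbnails")]:
    out = Path(folder)
    out.mkdir(exist_ok=True)
    for jpg in masters.glob("*.jpg"):
        with Image.open(jpg) as im:
            im.thumbnail(size)       # shrinks in place, keeping aspect ratio
            im.save(out / jpg.name)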
Week 4 reading notes
Witten 2.2 --
It appears that information overload was a problem even as far back as 1674; the quotation from Hyde illustrates that it has always been problematic to try to condense vast amounts of information into mere subject headings and organize the headings helpfully. I'm glad to learn the term collocation, a more specific term for the methods of organizing information in a library, and I'm interested in the author's implication that confirmation is now an essential middle stage in digital information retrieval, along with increased priorities on acquisition and navigation.
The article also points out the more fluid boundaries of digital objects, which can be so easily copied/altered. This makes me think about how important it is to annotate versions and provide adequate metadata. Other interesting points were raised, too... for instance, the chart showing the dozens of spellings of Muammar Qaddafi's name illustrates how difficult it can be to acquire comprehensive metadata when so many variables are in place, as well as the need for programming the variants as cross-references.
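A toy illustration of those cross-references (abridged; the chart in Witten lists dozens of spellings):

# A toy cross-reference table: every recorded spelling variant points at
# one authorized form, so a search on any variant collocates with the rest.
AUTHORITY = "Qaddafi, Muammar"
VARIANTS = ["Gaddafi", "Gadhafi", "Kaddafi", "Qadhdhafi", "Khadafy"]

cross_refs = {v.lower(): AUTHORITY for v in VARIANTS}

def lookup(name):
    return cross_refs.get(name.lower(), "(no authority record)")

print(lookup("GADHAFI"))  # -> Qaddafi, Muammar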
Witten 5.4-5.7 --
Part of this chapter reiterates what I've learned in previous classes about how different bibliographic metadata format standards fill different needs: some (e.g. MARC) providing rich details (for example, for the unique documents held by archivists) and some (e.g. Dublin Core) aiming more for breadth and interoperability. I was interested to learn more about BibTeX and Refer, as well, since I had never even heard of these standards before. So many standards! I'm glad the article addressed the possibilities for their interoperability. This article also gave a good rundown of multimedia file formats, so I'll keep it in mind as a useful resource.
I was interested to hear that key phrase metadata can be obtained automatically from digital documents with some degree of success. This surely solves some of the problems Hyde was worrying about more than 300 years ago! And the article illustrated why it's helpful to build a key phrase hierarchy for enhanced data retrieval.
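As a crude stand-in for what those automatic extractors do (real systems, e.g. the Kea tool from the Greenstone group, are far more sophisticated), here's a frequency-based sketch:

# Rank a document's two-word phrases by frequency, skipping stopwords --
# a deliberately naive sketch of automatic key phrase extraction.
from collections import Counter
import re

STOP = {"the", "of", "and", "a", "to", "in", "is", "for"}

def keyphrases(text, k=5):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    pairs = Counter(zip(words, words[1:]))
    return [" ".join(p) for p, _ in pairs.most_common(k)]

doc = ("digital libraries preserve digital documents; digital libraries "
       "index digital documents for retrieval")
print(keyphrases(doc))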
Gilliland - Introduction to Metadata --
This article provided a pretty good summary of metadata, finding aids, and the structure of information systems. I'm especially interested in what the author said about user-created metadata systems that are flourishing on the Web. I understand that lack of quality control is a concern with such grassroots-level tagging, but I also see how this is a helpful way for huge amounts of data to be collectively organized by multiple people.
Much of what the author said (e.g. about the value of metadata) was already clear to me after 1.5 years of library school, but I do appreciate the author's charts illustrating the various types/characteristics of metadata, which help me to see categorical distinctions among descriptors.
Weibel - Border Crossings: Reflections on a Decade of Metadata Consensus Building --
I was glad to read (albeit briefly; there weren't many details) about an attempt to involve representatives from so many different communities in bringing together disparate metadata standards. Interacting and networking among various professions and disciplines is crucial to effective Information Science practices, and this was a brief glimpse into some of the concerns/confusions/challenges that arise when various institutions attempt to collaborate, each employing its own standards and assumptions. As I said, I might have liked more details rather than just summary statements, but I'm glad at least to see that collaboration is happening.
Labels:
collaboration,
descriptors,
Metadata,
standards
Friday, October 2, 2009
No muddiness so far...
My small group has tentatively decided to use DSpace for our digital library project, so I've been poking around DSpace.org and taking notes on this week's guest lecture, preparing to install and experiment with it. It seems like a great way to archive materials once we learn how to use it, so I'm willing to put in the initial work that it will apparently require to make it happen.
So far I've been gathering materials for our project, but it hasn't yet come time to post them... I'll make another blog post if or when I run into any DSpace difficulties.
Friday, September 25, 2009
Just wondering...
I realize that Prof. He said at the end of lecture we didn't have to write "muddiest points" for this week; I'm just making a quick blog post while the course is on my mind. I'm still not clear on the logistics of the exam next month -- how it will be administered, how much time we'll have, etc. -- but I suppose there will be enough time to figure it out before it occurs. And I assume there's only one exam this semester?
The lecture clarified a lot of points for me about digitization, the pros and cons of various standards and formats, and hosting content online. I'll have to check out the PURL examples at the OCLC site.
Friday, September 18, 2009
Week 3 reading notes
Michael Lesk, Understanding Digital Libraries - sections 2.1, 2.2, 2.7, and Chapter 3
2.1 Computer Typesetting
The history of text standards for both software and printing... WYSIWYG formats developed in the 1970s led to more user-friendly transparency between screen and page... PostScript & SGML both important developments for computer typesetting
2.2 Text Formats
Discussion of tags and searches within the three most notable standards (MARC, SGML, and HTML).... Difficulty of entering SGML labels in a WYSIWYG document... HTML, of course, supports hypertext links. Certain questions and issues arise when information professionals try to describe content, deduce keywords, rank search terms, etc.
2.7 Document Conversion
There are two primary strategies for text: scanning w/ Optical Character Recognition, and keying in materials (original or transcribed, with original errors usually included in transcriptions).
I was struck by the number of errors in certain scans of the newspaper text, but I see that progress is being made in OCR systems and software. This is clearly very important for digital archiving of older materials.
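For reference, here's what the scan-plus-OCR strategy looks like in miniature today, using the pytesseract wrapper around the open-source Tesseract engine (both must be installed separately; the file name is hypothetical) -- old newsprint will still produce plenty of the errors Lesk shows:

# A sketch of the scan-plus-OCR conversion strategy.
from PIL import Image
import pytesseract

page = Image.open("newspaper_page.png")   # hypothetical scanned page
text = pytesseract.image_to_string(page)
print(text[:200])  # raw OCR output, ready for indexing or correction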
Chapter 3: Images of Pages
There are sizable differences among the quality of scanned images (depending on equipment) and the costs for such efforts (keying vs. machine scanning). Review of specific equipment, etc...
Various image formats (compression algorithms, etc.)
Display requirements should allow for easy legibility on a variety of computer equipment
PDFs help retain the author's originally intended format & appearance.
PostScript --> Adobe Acrobat, PDFs
Indexing: thumbnails are helpful for users, but not machine-readable
Print indexes can be based on other standards
Again, OCR helps for searchability
For old newspapers' digitization, sometimes clipping story-by-story as image files works best
Some files share texts & images, tables, other formats, etc. -- a page can be broken into separate regions
There are three library storage options: scanning, on-campus storage, and an off-site depository, each w/ pros & cons (off-site for rarer stuff)
Arms, Chapter 9. http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter9.html
Again, proof that OCR is vital to searchability of scanned texts.
Week 2 - the muddiest point from lecture
I wouldn't say any points from lecture were particularly "muddy," but certain things did surprise me. For example, I was disappointed to hear that there's not much funding these days for digital library development from the government; I would have hoped that such funding would be ongoing, given how very much printed material there is to put online.
In lecture we also heard a bit more about the aspect of "community" that I had questioned in the paper I wrote last week. It seems that community is considered a central starting-point for digital library development: you start with the society you want to serve, and develop everything according to that community's social, economic, and legal issues. The lecture helped me understand a few of the questions I raised about the necessity of developing a DL for a specific community, but I still wonder whether it's always essential.
Labels:
community,
digital libraries,
government funding
Monday, September 14, 2009
A couple of odds and ends
The link within the syllabus to one of the Week 2 readings didn't work, so I had to get there through another route: http://www.cs.cornell.edu/wya/DigLib/MS1999/Chapter2.html. The article provided a brief introduction to the birth of the internet, which I had already read about, but it also clarified a few of the finer points regarding the particulars of the internet, such as the difference between IP and TCP and how these protocols work together to send packets. I was interested in the sidebar about the relatively early formation of the Los Alamos E-Print Archives, since it sounds like such a seamless way for researchers themselves to contribute to a digital library without having to make any adjustments for format, protocol, or tools. Its use of an open archive format, wherein the authors retain copyright of their materials, makes good sense and seems an interesting precursor to the open sharing of research we now take for granted on the Web.
In other LIS 2670 news, I didn't see an assignment uploading tool within CourseWeb for this class, so I emailed my first brief essay to the professor, and I also hosted a copy of the file via Google Docs, just to be sure. It can be accessed at
http://docs.google.com/View?id=dfddpdf6_2fhffsfcx.
Labels:
assignment,
digital libraries,
Open Archives
Friday, September 11, 2009
Week 2 Reading Notes
Just a few notes on the articles I read for this week...
A Framework for Building Open Digital Libraries
The technology behind Digital Libraries is evolving at a faster pace than the already well-established practices of library science, so standards such as the Dublin Core Metadata Element Set and the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH) aim to make library and computer/data systems more interoperable.
Digital libraries are often custom-built, and therefore not designed to be interoperable with/within other systems. Program logic (which can be complex) varies according to community needs. As of the article date (Dec. 2001), few software tool kits existed for the express purpose of building digital libraries; in the absence of such unified starting-points, the OAI Protocol for Metadata Harvesting treats multiple DLs as searchable Open Archives. The Open Digital Library design allows for interoperability at the functional level across physically separate collections. Various components of an ODL network allow users to browse by categories, combine metadata from multiple sources, search and filter search results, sort by date, and so on. Ultimately, the authors hope to influence the development of DLs starting from the design stage.
Interoperability for Digital Objects and Repositories
This article describes the Cornell/Corporation for National Research Initiatives' efforts to develop and employ a design for interoperable digital repositories. The authors' approach is a hybrid of some of the traditional approaches to achieving interoperability, such as standardization (e.g., schema definition, data models, protocols), distributed object request architectures (e.g., CORBA), remote procedure calls, mediation (e.g., gateways, wrappers), and mobile computing (e.g., Java applets).
The article mentions the authors' focus on the preservation of digital items, standardization of format, and attention to access issues. The article lists certain components of accessing digital objects: disseminator types (usually outside operations), servlets (executable programs), and the notion of extensibility (how easily the new digital object can be used with additional functionality).
The authors performed experiments to test extensibility, interoperability, functionality, and other access/compatibility issues. The tests were mostly successful, and will guide the authors' future research on the implementation of programs designed to increase interoperability.
An Architecture for Information in Digital Libraries
This article gives an overview of the structure of stored information in digital libraries. I aim to fill in details as soon as I get the chance!
Labels:
digital libraries,
interoperability,
Metadata,
Open Archives
Tuesday, September 8, 2009
An Initial Post...
And so commences my blog for LIS 2670, Digital Libraries, at Pitt. Welcome!
We were asked to submit our "muddiest points" from lectures and/or readings. Most of the first week's readings were pretty straightforward, although in the Dewey Meets Turing article, I wasn't entirely clear about how or whether the grant money for the natural sciences translated into adequate funding for the digital library initiative. I assume that the natural sciences grants were stretched to fund the DLI, but that makes me wonder about funding for digital libraries in disciplines that don't traditionally receive as much grant money.
I'll post again soon with reactions to this week's readings.