Sunday, November 29, 2009

Worse than a muddy point...

This past week I compiled the metadata for all of my group's documents in Greenstone, and at first everything seemed to be going fine. However, at (what I thought would be) the very end of the project, when I tried to build the digital library, Greenstone kept giving me messages such as "The file [filename.pdf] was recognised but could not be processed by any plugin." Then, at the end of processing, it gave me the following message:

120 documents were considered for processing:
46 documents were processed and included in the collection.
74 were rejected.

From then on, I no longer had access to the rejected documents via Greenstone. They were still on my computer, but had been relegated to an "archives" folder in the Greenstone file structure, with names like Hash9bb4.dir in place of the more recognizable file or folder names. I have no idea what to do about this! Are nearly two thirds (74 of 120) of our metadata-enriched PDFs totally unusable in Greenstone? And why??? I tried looking up the issue online, but found only a single site with a vague suggestion to convert the PDFs to HTML using third-party software.

If anyone reading this has any ideas, please comment or email me...

1 comment:

  1. Hi Elizabeth

    Greenstone uses third-party software to convert PDFs into HTML. This software can only handle PDF versions up to 1.4.

    We are hoping to upgrade our PDF/DOC etc. handling next year.

    In the meantime, you could try the following:
    Add UnknownPlugin to your plugin list (after PDFPlugin), and set its process_extension option to pdf.
    This will pick up any PDFs that PDFPlugin has failed to process. You won't get any text extracted from them, but if you have assigned metadata through GLI (metadata.xml files), they will still appear in classifiers and can be searched via metadata.

    If you have Greenstone problems, please join the Greenstone mailing list (greenstone-users) and ask your questions there.
    https://list.scms.waikato.ac.nz/mailman/listinfo/greenstone-users

    Regards,
    Katherine Don,
    Greenstone developer
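
The version limit mentioned in the comment above can be checked before building a collection. The following is a minimal Python sketch, not part of Greenstone: it reads the "%PDF-x.y" header that PDF files normally begin with and lists files whose declared version is newer than 1.4. The function names (pdf_header_version, newer_than) and the folder layout are assumptions made for illustration. Note also that a PDF can declare a different version inside its document catalog, so the header is only a first approximation of what the converter will see.

```python
import re
from pathlib import Path

# PDF files normally start with a header such as b"%PDF-1.6".
HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(path):
    """Return (major, minor) from the %PDF-x.y header, or None if not found."""
    with open(path, "rb") as f:
        head = f.read(1024)  # the header sits at (or very near) the start
    match = HEADER_RE.search(head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def newer_than(folder, limit=(1, 4)):
    """List (path, version) for *.pdf files under folder declaring a version above limit."""
    flagged = []
    for path in sorted(Path(folder).rglob("*.pdf")):
        version = pdf_header_version(path)
        # Files without a readable header are skipped rather than guessed at.
        if version is not None and version > limit:
            flagged.append((path, version))
    return flagged


# Hypothetical usage: print every PDF in an import folder that declares a version above 1.4.
# for path, (major, minor) in newer_than("import"):
#     print(f"{path}: PDF {major}.{minor}")
```

Files flagged this way are the likely candidates for the UnknownPlugin workaround described in the comment, or for re-saving in an older PDF version with whatever tool is available.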
