Porting data overview

Implementation scenarios: where does the metadata come from?

OAI-PMH repository implementations consist of request- and error-handling scripts that draw on a relational database (RDBMS) for the metadata they deliver. In practice, this is because an implementation that extracted metadata in real time from documents in a file tree would be unacceptably slow. For practical purposes, then, the question becomes how the metadata finds its way into the RDBMS, which in turn depends on the nature of the documents themselves and how they have been encoded.

Many of you will already be stashing metadata for various purposes, maybe in a relational database, or maybe as a series of properties stored on a full-text index of some kind. The idea behind the Cornell prototype is that it will be much easier for you to crosswalk your existing metadata into the prototype schema than it would be for either of us to write adapter code for use with a galaxy of legacy setups. Taking a look at the conceptual overview of the OAI4Courts schema will probably help you. Various schema diagrams and documentation are also available.

Metadata enters RDBMS via direct extraction from documents themselves

If well-structured documents are available, extraction into a relational database is trivial, provided that the markup standard used in the documents identifies the same document features we wish to use as metadata. The documents need not have reached full XML markup for this to work; the metadata might be embedded in the document with distinct presentational cues and conventions ("the name of the primary author of the opinion is always the first thing that appears in <h4> tags") or be marked up in the document's <head> section using <meta> tags. These are not hard scripts to write -- ours took a day only because our metadata setup is the product of 12 years' accumulation of flaky design decisions.
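A minimal sketch of such an extraction script follows, in Python using only the standard library. It assumes HTML documents that follow the two conventions mentioned above: <meta> tags carrying name/content pairs, and the author's name in the first <h4>. The doc_metadata table, the "author" field name, and the function names are hypothetical placeholders for illustration; they are not the OAI4Courts schema, and a real script would crosswalk into that schema instead.

```python
import sqlite3
from html.parser import HTMLParser


class OpinionMetadataParser(HTMLParser):
    """Collects <meta name=... content=...> pairs and the text of the first <h4>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.first_h4 = None
        self._in_h4 = False
        self._h4_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Field names are lowercased so "Court" and "court" match.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "h4" and self.first_h4 is None:
            self._in_h4 = True

    def handle_data(self, data):
        if self._in_h4:
            self._h4_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h4" and self._in_h4:
            self._in_h4 = False
            # Collapse runs of whitespace inside the heading text.
            self.first_h4 = " ".join("".join(self._h4_parts).split())


def extract_metadata(html_text):
    """Return a dict of metadata fields found in one document."""
    parser = OpinionMetadataParser()
    parser.feed(html_text)
    parser.close()
    record = dict(parser.meta)
    if parser.first_h4:
        # Assumed convention: the first <h4> holds the primary author.
        record.setdefault("author", parser.first_h4)
    return record


def store_metadata(conn, doc_id, record):
    """Write the fields to a hypothetical doc_metadata table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_metadata ("
        "doc_id TEXT, field TEXT, value TEXT, PRIMARY KEY (doc_id, field))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO doc_metadata VALUES (?, ?, ?)",
        [(doc_id, field, value) for field, value in record.items()],
    )
    conn.commit()
```

To use it, call extract_metadata() on each document's text and pass the result to store_metadata() along with whatever identifier the collection already uses. An explicit <meta name="author"> tag, if present, takes precedence over the <h4> convention.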

Metadata enters RDBMS as part of the conversion of documents into a structured format

If the document ingestion process involves conversion into a structured format, the conversion process becomes a handy place to extract metadata. In effect this is the same as the first scenario above (direct extraction from the documents themselves), but with extraction taking place at the same time that the document features are marked up.

Metadata is extracted from a document known to be related, such as a headsheet

Many document-management systems used in courts make use of a headsheet: a document that contains the metadata and is separate from the document containing the judgement itself, but related to it either by storage method (e.g. file folder or paperclip) or by an identifier scheme of some sort. Headsheets may be parsed and the metadata added to the RDBMS.
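As a rough illustration of the parsing step, the Python sketch below assumes a headsheet stored as plain text with one "Field: value" pair per line, and assumes that one of those fields carries the identifier linking it to the judgement. Real headsheet formats vary widely; the "document-id" field name, the doc_metadata table, and the function names are all hypothetical.

```python
import sqlite3


def parse_headsheet(text):
    """Parse 'Field: value' lines into a dict; lines without a colon are skipped."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip().lower()] = value.strip()
    return record


def load_headsheet(conn, text, id_field="document-id"):
    """Store headsheet fields under the identifier of the related document."""
    record = parse_headsheet(text)
    doc_id = record.pop(id_field, None)
    if not doc_id:
        raise ValueError("headsheet has no %r field" % id_field)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_metadata ("
        "doc_id TEXT, field TEXT, value TEXT, PRIMARY KEY (doc_id, field))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO doc_metadata VALUES (?, ?, ?)",
        [(doc_id, field, value) for field, value in record.items()],
    )
    conn.commit()
    return doc_id
```

Where the relationship is expressed by storage location rather than by an identifier inside the headsheet, the identifier would instead have to be derived from the file path or folder name before calling the loader.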

Metadata is separately placed in the RDBMS by human agency

This scenario is particularly relevant to legacy collections where the documents are available only in an imaged (rather than full-text) format. In such cases metadata might be entered directly into the RDBMS by document analysts; the activity would be similar to (but much simpler than) the preparation of a library accession or catalog record.