Guided tour of Layer 1 prototype

What's here

[NB: this is a guided tour of an experimental service. The service bounces around a lot, so if it's not working for you, try again. Frustration at repeated, unsuccessful attempts can be vented at the author, tom-dot-bruce[somewhere in the vicinity of]cornell-dot-edu.]

This "guided tour" will walk you through the prototype OAI repository server at the Legal Information Institute. It includes a little bit of explanatory text, but if you don't understand anything about OAI you'll probably want to read the very basic overview of OAI-PMH and OAI-PMH for beginners. As of this writing the server validates cleanly against the validator at openarchives.org (thanks, Simeon!), which means it's in full technical compliance with the standard.

The data


The data underlying this demo is a static snapshot taken from our Supreme Court collection as of about April 4, 2008. We are not currently updating it, so if it doesn't have the most current Supreme Court case you can think of, just chill out. We will add regular updating, and a larger collection of more diverse data, soon. Also, most of the categorization that is implied by set membership is completely bogus -- we just populated the sets using a search-based technique that in some cases is little better than random.

The Big Idea


The basic idea behind OAI-PMH is this: the world is divided into repositories and the service providers who build services on top of what those repositories offer (such services might include multi-site search, current awareness services, and an incalculable number of mashups). Given the base URL for a repository, you should be able to fire a set of well-understood queries at it and get well-formed answers in valid, documented XML that you can process easily to build your service -- say by feeding the XML data to a spidering engine or some such. All of the questions and all of the answers are essentially variations on three questions: "Who are you?", "What have you got in your repository?", and "How is your repository organized?".
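To make "fire a query at a base URL" a bit more concrete, here is a minimal sketch that assembles request URLs for the three questions. The function name build_request and the example.org base URL are made up for illustration; only the verb and argument names come from the protocol.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    # The verb goes first; any other arguments follow in the order given.
    query = {"verb": verb, **arguments}
    return base_url + "?" + urlencode(query)


# Hypothetical base URL, used only for illustration.
BASE = "http://example.org/oai2"

print(build_request(BASE, "Identify"))                              # "Who are you?"
print(build_request(BASE, "ListRecords", metadataPrefix="oai_dc"))  # "What have you got?"
print(build_request(BASE, "ListSets"))                              # "How are you organized?"
```

Everything a harvester does starts from URLs of this shape, so a helper like this tends to be the first thing a client needs.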

Take a look

Who are you?

Look at http://hula.law.cornell.edu:51254/oai2?verb=Identify
OAI repositories answer the Identify request with a well-formed answer. Right now the prototype gives a pretty unsophisticated answer -- we will build it out to include information about rights-management policies, print citations for the contents of the server, and a welter of configurable stuff. The OAI spec lays out some of the possibilities. That's the "Who are you?" part.
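If you'd rather read an Identify answer with a program than with your eyes, something like the sketch below should do. The sample response is invented (placeholder values, not what the prototype actually returns), and parse_identify is a hypothetical helper name; the element names and the namespace are the ones the OAI-PMH 2.0 spec defines.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up Identify response; every value here is a placeholder.
SAMPLE_IDENTIFY = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2008-04-10T12:00:00Z</responseDate>
  <request verb="Identify">http://example.org/oai2</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://example.org/oai2</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2008-04-04T00:00:00Z</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>
"""


def parse_identify(xml_text):
    """Return the children of <Identify> as a plain dict of tag -> text."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI_NS + "Identify")
    # Strip the namespace from each tag. Repeated elements (a repository may
    # list several adminEmail values) collapse to the last one in this sketch.
    return {child.tag.replace(OAI_NS, ""): child.text for child in identify}


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["protocolVersion"], info["granularity"])
```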

What have you got?

Now, look at http://hula.law.cornell.edu:51254/oai2?verb=ListRecords&metadataPrefix=oai_dc
Obviously, this is an answer (a partial one) to the "What have you got?" question. You can see that it contains fairly rich information -- far more than you'd need just to drive a spider, and enough to put decent metadata in a search-engine result set created by such a spider. As you might guess from the parameters in the request, you can set up any number of metadata formats and OAI will use them, provided that they have valid XML schemas. There is a complex explanation of how we're going about that here, and of what was actually done in this implementation here.
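Here is a rough sketch of how a service provider might pull search-result metadata out of an oai_dc response. The sample record is invented for illustration (the identifier, title, and URL are placeholders), and summarize_records is a hypothetical name; the namespaces are the standard OAI-PMH and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up ListRecords response containing a single oai_dc record.
SAMPLE_RECORDS = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:case-1</identifier>
        <datestamp>2008-04-04T18:57:29Z</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example v. Example</dc:title>
          <dc:identifier>http://example.org/case-1.html</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def summarize_records(xml_text):
    """Return one small dict per record: OAI identifier, title, and URL."""
    root = ET.fromstring(xml_text)
    summaries = []
    for record in root.iterfind(".//oai:record", NS):
        summaries.append({
            "id": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "url": record.findtext(".//dc:identifier", namespaces=NS),
        })
    return summaries


for summary in summarize_records(SAMPLE_RECORDS):
    print(summary)
```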

I said this was an incomplete response, and it is -- OAI sets up a flow-control mechanism. You can catch a glimpse of it if you scroll to the bottom of these results -- the server has issued a <resumptionToken> that the harvesting client can use to get another page of records.
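The flow control can be sketched as a simple loop: ask for records, read the <resumptionToken>, and ask again with only the verb and the token until the server sends back no token or an empty one. The names harvest_identifiers and fetch are assumptions for this sketch; fetch is any function you supply that takes a URL and returns the response XML, which keeps the loop testable without a live server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, base_url, metadata_prefix="oai_dc"):
    """Yield every record identifier, following resumption tokens page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token means this was the last page.
        if token is None or not (token.text or "").strip():
            return
        # Follow-up requests carry only the verb and the token;
        # metadataPrefix must not be repeated alongside it.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Against a live repository, fetch could be as simple as lambda url: urllib.request.urlopen(url).read(), though a real harvester would also want error handling and some politeness about request rates.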

You can limit result sets by date (that the metadata was posted) or by "sets", which are arbitrary collections set up by the repository. Here's a complicated date-based retrieval:
http://hula.law.cornell.edu:51254/oai2?verb=ListRecords&from=2008-04-04T18:57:29Z&until=2008-04-04T18:57:29Z&metadataPrefix=oai_dc
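For what it's worth, a date-limited request like that one can be put together from ordinary datetime values. This is only a sketch: oai_datestamp and list_records_between are hypothetical helper names, and it assumes the repository accepts second-level UTC datestamps, as the URL above suggests.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def oai_datestamp(moment):
    """Format a timezone-aware datetime as a UTC datestamp, e.g. 2008-04-04T18:57:29Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def list_records_between(base_url, start, end, metadata_prefix="oai_dc"):
    """Build a ListRecords URL limited to datestamps from start through end."""
    query = {
        "verb": "ListRecords",
        "from": oai_datestamp(start),
        "until": oai_datestamp(end),
        "metadataPrefix": metadata_prefix,
    }
    # safe=":" leaves the colons in the timestamps unescaped, for readability.
    return base_url + "?" + urlencode(query, safe=":")


moment = datetime(2008, 4, 4, 18, 57, 29, tzinfo=timezone.utc)
print(list_records_between("http://hula.law.cornell.edu:51254/oai2", moment, moment))
```

Both bounds are inclusive, which is why using the same value for from and until selects exactly one second's worth of records.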

That date-based request returns all the records that we stuffed into the repository during a single second a few days ago. You can get single records, too:

http://hula.law.cornell.edu:51254/oai2?verb=GetRecord&identifier=oai_lii%3Awww%2Elaw%2Ecornell%2Eedu%3Aus/federal/scotus/06-5754-ZC-html&metadataPrefix=oai_dc

That's a really big URL, but mostly because of the URL escaping ;-). Incidentally, we are suggesting a more-or-less formal identifier scheme; it's more or less formal right now because it needs to mature a bit before being set in stone.
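To see what the escaping is hiding, you can decode the identifier from the GetRecord URL above and re-encode it. This sketch uses only Python's standard urllib.parse; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

# The identifier exactly as it appears in the GetRecord URL.
ESCAPED = "oai_lii%3Awww%2Elaw%2Ecornell%2Eedu%3Aus/federal/scotus/06-5754-ZC-html"

# Decoding turns %3A back into ":" and %2E back into ".".
raw = unquote(ESCAPED)
print(raw)

# Re-encoding with Python's defaults escapes the colons but leaves "." and "/"
# alone, so the result is shorter than the original yet decodes to the same string.
re_escaped = quote(raw, safe="/")
print(re_escaped)
```

Dots never need escaping in a query string, so a client is free to send the shorter form.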

How is your repository organized?

Any repository can be partitioned into an organized, arbitrary group of sets. Sets may be grouped into hierarchies. The architecture of the prototype allows you to do this pretty much any way you want, by populating a database table with set-membership information. Here's how to see what sets we have:

http://hula.law.cornell.edu:51254/oai2?verb=ListSets

And you can get the contents of a single set with our old friend ListRecords:

http://hula.law.cornell.edu:51254/oai2?verb=ListRecords&set=lii:scotus:copyright&metadataPrefix=oai_dc

Such a "set setup" might make a good basis for a federated current-awareness service.
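A harvester can recover the set hierarchy from a ListSets answer by splitting each setSpec on its colons, which is how OAI-PMH expresses nesting. In the sketch below the sample response is invented: lii:scotus:copyright is the set used above, but the other entries and all the set names are placeholders, and set_tree is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Made-up ListSets response; only lii:scotus:copyright appears in this tour.
SAMPLE_SETS = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>lii</setSpec><setName>Placeholder top-level set</setName></set>
    <set><setSpec>lii:scotus</setSpec><setName>Placeholder court set</setName></set>
    <set><setSpec>lii:scotus:copyright</setSpec><setName>Placeholder topic set</setName></set>
    <set><setSpec>lii:scotus:patent</setSpec><setName>Placeholder topic set</setName></set>
  </ListSets>
</OAI-PMH>
"""


def set_tree(xml_text):
    """Return the sets as nested dicts keyed by each colon-separated setSpec part."""
    root = ET.fromstring(xml_text)
    tree = {}
    for set_element in root.iter(OAI_NS + "set"):
        spec = set_element.findtext(OAI_NS + "setSpec")
        node = tree
        # Walk down the tree, creating a level for each part as needed.
        for part in spec.split(":"):
            node = node.setdefault(part, {})
    return tree


print(set_tree(SAMPLE_SETS))
```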

That's all, folks

There are infinitely many combinations and variations on OAI-PMH verbs, but those are the basics -- enough to start you thinking about a repository of your own, I hope.