The OAI-PMH Protocol and Search Services

June 12, 2008 – 9:50 am

Created with support from the National Library of Sweden and its development program OpenAccess.se

Jörgen Eriksson, 2007
http://creativecommons.org/licenses/by-nc-sa/2.5/se/

Description
This section deals with searching for data in open archives. It presents the standard for harvesting data that dominates today, together with its advantages and disadvantages, and some of the established search services that use it. In addition, some other search services that are important for visibility are described.

The purpose of the section is to explain the OAI-PMH standard and to present the most important search services.

Introduction
One advantage of institutional repositories is that the organization building the archive can support its researchers by developing a local infrastructure (tools, practical assistance, copyright expertise,…) which the researchers have close at hand. The disadvantages of institutional repositories show up when it comes to the dissemination and accessibility of the publications. Going from archive to archive to search for publications is not a practical procedure. On 21 February 2007 OpenDOAR listed 843 open archives.

To really make the publications accessible you need search services which can collect and index the information in the local archives. Standardized descriptions of the publications are also required if the search services are to offer search possibilities that go beyond those that Google offers. 

Open Archives Initiative (OAI)
The OAI is an organization which develops and promotes standards. Its goal is to make the dissemination of content more efficient, for example the content of local open archives. The members of the organization are some of the most influential developers of digital information services with a focus on scholarly communication. In 2001 the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) was published, a protocol for harvesting descriptions of documents in open archives. The protocol has been widely adopted within the Open Access movement, where it has become an established standard. What follows here is a short survey of the fundamentals of OAI-PMH.

Two important concepts in OAI-PMH are “data providers” and “service providers”, and the relation between them. A data provider is a service which makes the data about its publications accessible in accordance with the OAI-PMH specification. A service provider is a service which harvests data from different data providers in accordance with the OAI-PMH specification, indexes the harvested information and makes it accessible in a search service. The Lund University Open Archive, LU:research, for example, functions as a data provider in relation to a service provider like OAIster, which harvests the information from LU:research and over 800 other archives and makes it searchable in one place.

The OAI-PMH standard describes how the harvesting robot and the open archive will communicate.
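
As an illustration of this communication, the following is a minimal sketch in Python of what a harvesting robot might do. It assumes a hypothetical data provider at http://archive.example.org/oai and uses a hand-written sample response instead of a real network call; the function names are made up for this example. The verb ListIdentifiers, the metadataPrefix and resumptionToken arguments and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical base URL of a data provider (an assumption for illustration).
BASE_URL = "http://archive.example.org/oai"

# Hand-written sample of what a data provider might return (not real data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-05-15T12:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc">http://archive.example.org/oai</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:archive.example.org:1</identifier>
      <datestamp>2007-01-10</datestamp>
    </header>
    <header>
      <identifier>oai:archive.example.org:2</identifier>
      <datestamp>2007-02-03</datestamp>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def build_request(base_url, verb, arguments=None):
    """Build a request URL: the base URL plus the verb and its arguments."""
    query = {"verb": verb}
    query.update(arguments or {})
    return base_url + "?" + urlencode(query)


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext(OAI_NS + "identifier")
        for header in root.iter(OAI_NS + "header")
    ]
    # A missing or empty resumptionToken means the list is complete.
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return identifiers, (token or None)


# First request, then the follow-up request for the next part of the list.
first_url = build_request(BASE_URL, "ListIdentifiers", {"metadataPrefix": "oai_dc"})
identifiers, token = parse_list_identifiers(SAMPLE_RESPONSE)
next_url = build_request(BASE_URL, "ListIdentifiers", {"resumptionToken": token})
```

A real harvesting robot would fetch each URL over HTTP and repeat the last step until no resumption token is returned. Error handling and polite waiting between requests are left out of this sketch.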

Service providers harvest only the descriptive information from a data provider, not the full-text documents. This means that any restrictions that publishing companies place on the right to parallel publish articles are not violated (the author may have permission to parallel publish on his or her own Web pages or in a local open archive, but not in other archives).

The descriptions use the 15 elements of unqualified Dublin Core (DC). You do not have to use all the DC fields, which also makes it possible to harvest “information-poor” records (for example, only author and title), as long as they are presented to the harvesting robot correctly in accordance with the DC standard.
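
To make this concrete, here is a small Python sketch that reads such an “information-poor” description containing only a title and an author. The sample record and the function name are invented for this example; the 15 element names and the two XML namespaces are those of unqualified Dublin Core as used in the oai_dc format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of unqualified Dublin Core. All are optional and repeatable.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Invented sample: a sparse description with only title and author.
SPARSE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example thesis</dc:title>
  <dc:creator>Example, Author</dc:creator>
</oai_dc:dc>"""


def read_dublin_core(xml_text):
    """Return a dict mapping each DC element that is present to its values."""
    root = ET.fromstring(xml_text)
    record = {}
    for name in DC_ELEMENTS:
        values = [
            element.text.strip()
            for element in root.iter("{" + DC_NS + "}" + name)
            if element.text and element.text.strip()
        ]
        if values:
            record[name] = values
    return record


record = read_dublin_core(SPARSE_RECORD)
# Elements that are absent are simply missing from the result.
absent = [name for name in DC_ELEMENTS if name not in record]
```

Because every element may be missing, a service provider that builds on this kind of data has to treat each field as optional, which is one reason why the resulting search services tend to stay simple.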

The advantage of using a simple metadata standard like DC, without making any fields compulsory, is that many archives can take part without major problems or costs. The disadvantage is that the collected metadata does not allow particularly sophisticated search services to be built.

Today it is almost a must for an open archive to be OAI-PMH compatible. Practically all software produced supports the standard and there is also free software that can be used to make other services compatible.

If you have a small collection of records (1-5,000 records) that rarely needs updating, there is a simplified way to make your collection harvestable by service providers. It is called the OAI Static Repository. Stargate is an English project dealing with applications of the OAI Static Repository, and the project has also developed tools.

Once you have made your open archive OAI-PMH compatible you have to validate it to make sure that it follows the standard. After that you register your archive with the OAI. This means that it is placed on the OAI list of data providers, where service providers can find it. You should also contact separately the service providers that you particularly want to be harvested by. A list of service providers can also be found on the OAI Web pages.
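
Before using a proper validation service, a quick local check can catch obvious omissions. The Python sketch below is not the OAI validator and covers only one small part of the standard: it checks that a response to the Identify verb contains the elements that the OAI-PMH 2.0 specification requires. The sample response, the archive name and the function name are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Elements that an Identify response must contain according to OAI-PMH 2.0.
REQUIRED = (
    "repositoryName", "baseURL", "protocolVersion", "adminEmail",
    "earliestDatestamp", "deletedRecord", "granularity",
)

# Invented sample response in which adminEmail has been forgotten.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-05-15T12:00:00Z</responseDate>
  <request verb="Identify">http://archive.example.org/oai</request>
  <Identify>
    <repositoryName>Example Open Archive</repositoryName>
    <baseURL>http://archive.example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>2001-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def missing_identify_elements(xml_text):
    """Return the required Identify elements that are absent or empty."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI_NS + "Identify")
    if identify is None:
        # No Identify section at all: everything is missing.
        return list(REQUIRED)
    return [
        name for name in REQUIRED
        if not (identify.findtext(OAI_NS + name) or "").strip()
    ]


missing = missing_identify_elements(SAMPLE_IDENTIFY)
```

An empty result only means that these elements are present. It says nothing about whether the other verbs behave correctly, so the full validation is still needed.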

OAI-PMH search services
Here follows a survey of some of the most important service providers. In general, the development of service providers has been weak, probably reflecting the difficulty of achieving a critical mass of content in the open archives. Subject-oriented archives built on OAI-PMH are few. The successful subject-oriented archives that exist, for example arXiv (physics) and PubMed Central (biomedicine), are based on authors and publishing companies providing publications and descriptions directly to the services. The tension between local institutional repositories and subject-oriented archives based on central deposit has been further accentuated by the decision of some major research funders in Great Britain to parallel publish all biomedical research which they fund in UK PubMed Central.

OAIster

URL: http://oaister.umdl.umich.edu/o/oaister/index.html

Host: University of Michigan, Digital Library Production Service.

Covers: all subjects and document types. “OAIster is a union catalogue of digital resources. Digital resources can range from an old-time advertisement of electric refrigerators (from the Library of Congress American Memory project) to Harriet Beecher Stowe memoirs (from the University of Michigan Digital Library Production Service Making of America collection).”

Records/Full text: mainly full text

Collection policy: “harvest everything and use anything that has a link to a digital object, whether freely available or restricted”. Also collects from publishing companies like the Institute of Physics and HighWire.

Size: 11,737,670 records from 811 archives (15 May 2007). The number of records includes a large quantity of duplicates, as OAIster also harvests from the service provider CiteBase.

Search possibilities: Boolean search (AND, OR, NOT). Truncation with *. Phrase search by default.

Search limitations: author, title, subject, language, resource type.

Hit sorting: title, author, date, number of hits within a record

Comment: The largest of the OAI-PMH search services and the one with the broadest coverage.

BASE - Bielefeld Academic Search Engine

Host: Bielefeld University Library

URL: http://www.base-search.net/index.php?i=b

Covers: “multi-disciplinary search engine for scientifically relevant web resources”

Records/full text: mixed. Also contains commercial resources, which can be filtered out in the advanced search.

Size: 4,715,354 records from 363 archives

Search possibilities: Possible to filter by document type (journal article/preprint) and to filter institutional repositories by geographical location. Possible to limit the search to author, title, subject, publishing company and part of URL.

CiteBase

URL: http://www.citebase.org/

An interesting experimental service which indexes metadata and full text from the major subject-oriented archives and from institutional repositories. References are extracted from the full-text documents and linked where possible, and citation lists are created. Since the material the service works with is a very limited part of the total (that which is freely accessible and can be collected by means of OAI-PMH), the citation analysis should be seen as a development project indicating future possibilities rather than as a practically usable service. See Citebase Help. Full-text articles in CiteBase are primarily retrieved from arXiv.

Search possibilities: search by author, words from title/abstract, publication and publication year, and combinations of these using AND.

Other specialized search services

Here follow some other specialized search services that your archive should try to get indexed by in order to increase its visibility.

Google Scholar

URL: http://scholar.google.com/schhp?ie=UTF-8&oe=UTF-8&hl=en&tab=ws&q=

A broad service which indexes many types of free and commercial publications. Just like Google it has many users.

Thomson ISI - Current Web Contents (CWC)

URL: http://scientific.thomson.com/products/cwc/faq/

CWC is a service which describes specially selected Web resources. The service is included in the Thomson ISI database selection. The link above goes to information about the service, its selection criteria and how to suggest the inclusion of your service.

Scirus

URL: http://www.scirus.com/

Scirus is a search service which indexes a selection of free and commercial Web services and archives.

The service is owned by Elsevier.

Google (et al.)

Being visible and well indexed in, above all, Google is also very important for making the contents of an institutional repository visible. The following visit statistics may serve as an example. They come from the Lund University dissertation database, and the numbers show where the visitors came from in 2005.

Google: 77,559

Via the university’s main entrance to research: 15,436     

Via the Web of the libraries: 1,448
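
To put the three numbers in proportion, the short Python calculation below works out each source's share of their sum. It assumes that only these three sources are counted; any visitors from other sources are not included in the figures above and therefore not in the percentages.

```python
# Visit counts for 2005 as given above (short labels chosen for this example).
visits = {
    "Google": 77559,
    "University research entrance": 15436,
    "Library Web": 1448,
}

total = sum(visits.values())

# Share of each source in percent of the three sources together.
shares = {
    source: round(100 * count / total, 1)
    for source, count in visits.items()
}

for source, share in shares.items():
    print(f"{source}: {share} %")
```

Of the 94,443 visits counted here, roughly 82 % came via Google, about 16 % via the university's entrance to research and under 2 % via the library Web pages.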

A short compilation of things to consider in order to optimize your visibility on Google is Peter Suber’s “How to facilitate Google crawling: Notes for open-access repository maintainers”.
