Outline for Final Report

Charge (Adam C)

Problem statement

Digital collection identifiers (Rick)
A long-standing issue in the presentation of our digital collections is the lack of a reliable identifier for the collections themselves. Collections move to new machines, change their delivery platform, and change their default behaviors. We need to be able to locate collections reliably over a long period of time and discover aspects of collections for interoperating with other collections.

Persistent IDs for digital preservation (Bill)
- We are working on preserving the digital objects in two large collections, the Euclid journals and the arXiv.org preprints. In general, we can view the objects as having at least one content file and an object descriptor file containing metadata about the object. Most digital objects contain multiple content and metadata files. We need to be able to identify and locate the files for a long time, regardless of where they are located. Rather than changing the metadata in the descriptor file every time a file is moved, we would like to create persistent identifiers that can be mapped to the files' current locations.
- The number of component files to be preserved will be several times larger than the number of digital objects. With processing efficiency in mind, we would prefer a solution that will allow us to resolve the identifiers locally, without going out over the internet for each request for resolution.
- The digital objects' component files in our preservation system will not be directly accessible to the public; access will occur through a gated interface. The persistent identifier mechanism we use should be able to produce identifiers that are private and not discoverable.
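
The three points above can be illustrated with a minimal sketch, assuming an in-process lookup table rather than any particular identifier system. The names `mint_pid` and `LocalResolver`, the `example` prefix, and the file paths are hypothetical and serve only as illustration. The sketch shows an opaque identifier that stays fixed in the descriptor file while its mapped location changes, and that is resolved without a network request.

```python
import secrets


def mint_pid(prefix: str) -> str:
    """Mint an opaque identifier: a naming prefix plus a random suffix.

    The suffix carries no location or content semantics, so it cannot be
    derived from a file's path. The prefix value is a placeholder.
    """
    return f"{prefix}/{secrets.token_hex(8)}"


class LocalResolver:
    """Hypothetical in-process table from PID to current file location."""

    def __init__(self) -> None:
        self._locations = {}

    def register(self, pid: str, location: str) -> None:
        """Add a new PID; refuse to overwrite an existing one."""
        if pid in self._locations:
            raise ValueError(f"PID already registered: {pid}")
        self._locations[pid] = location

    def move(self, pid: str, new_location: str) -> None:
        """Record a file move; the PID held in descriptor files is unchanged."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_location

    def resolve(self, pid: str) -> str:
        """Look up the current location without any network request."""
        return self._locations[pid]


resolver = LocalResolver()
pid = mint_pid("example")
resolver.register(pid, "/archive/disk1/euclid/article.pdf")
resolver.move(pid, "/archive/disk2/euclid/article.pdf")
print(resolver.resolve(pid))
```

A production system would persist the table and place it behind the gated interface; the sketch only shows the separation between identifier and location.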

Requirements for an implementation (John)

Backward compatible: we don't want to break any system currently used at CUL that uses persistent identifiers, such as the PURL server.
Provide a mechanism for persistent IDs for OAI-PMH.
Optionally resolvable only within a constrained environment: a secured name service, for archiving. The individual can ensure confidentiality.
The system should define a mechanism for client authentication/authorization to ensure data integrity and authority control.
No dependencies on external systems in order to resolve local PIDs.
Every PID should be globally unique.
PID should be free of location semantics.
PID must be able to refer to multiple aspects, attributes, or behaviors of the digital object, but with a default aspect that conforms to conventional use.
Globally resolvable.
Fine-grained control of PIDs, so that groups can maintain their own sets, without having to maintain multiple PID resolvers/servers.
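
The requirement for multiple aspects with a default could be modeled as in the following minimal sketch. The class name `PidRecord`, the aspect names, and the `example.org` URLs are assumptions made for illustration; they are not drawn from any of the systems the group reviewed.

```python
class PidRecord:
    """Hypothetical record: one PID, several named aspects, one default."""

    def __init__(self, pid, aspects, default_aspect):
        if default_aspect not in aspects:
            raise ValueError("default aspect must be one of the aspects")
        self.pid = pid
        self.aspects = dict(aspects)
        self.default_aspect = default_aspect

    def resolve(self, aspect=None):
        """Return the target for the named aspect, or the default if none is given."""
        name = self.default_aspect if aspect is None else aspect
        return self.aspects[name]


record = PidRecord(
    pid="example/0001",
    aspects={
        "home": "https://example.org/collections/0001/",
        "metadata": "https://example.org/collections/0001/metadata.xml",
        "search": "https://example.org/collections/0001/search",
    },
    default_aspect="home",
)

print(record.resolve())            # default aspect, matching conventional use
print(record.resolve("metadata"))  # an explicitly requested aspect
```

A client that knows nothing about aspects still gets a useful result, while a client that does can ask for a specific behavior of the same object.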

Methodology (Adam S)

Bill Kehoe and Rick Silterra began discussing the desire for a consistent PID strategy based on the integral role that PIDs would play in several ongoing projects that included Priority Teams One and Two and the MathArc project. They desired a lightweight, focused and efficient process for vetting the different approaches to PIDs with the goal of finding the most appropriate PID strategy for use on these projects. Marcy Rosenkrantz, Nancy McGovern and Oya Rieger proceeded to help organize this group, which included (in alphabetical order):

Adam Chandler
John Fereira
Bill Kehoe
George Kozak
Rick Silterra
Adam Smith

An initial meeting of this group, attended by Marcy and Oya, clarified the charge and the approach to be taken. The group was to meet weekly for a planned period of four to six weeks to review PID approaches, ending in a "retreat" in which recommendations would be formulated and ultimately presented to stakeholders within the library, particularly the Digital Content Delivery Platforms Forum and the Metadata Working Group.

The group began a Wiki that gathered previous research into PID solutions by members of CUL, and outlined a number of those approaches as well as the relevant standards and other documentation for each. In the first meeting, the group surveyed the landscape of existing PID strategies and other approaches to identifying print and electronic resources, including the general issues and challenges underlying these strategies. Then, in each subsequent week, the group chose to look at a particular PID strategy, reading the relevant documentation and discussing the advantages and disadvantages of each approach. The group also examined the experiences of any existing implementations of each PID strategy at CUL and elsewhere. Throughout the group's explorations, the original scope and charge were reevaluated in light of the discussions, and the impact that the group's eventual recommendations would have on existing and future CUL projects was considered.

A rough list of topics discussed includes:

an overview of identifier strategies and issues,
PURL,
ARK,
Handle,
OpenURL, and
OAI-PMH requirements.

By the end of these discussions, the group began to reach a consensus around Handles based on that approach's general maturity and installed base, its fit for CUL's projects, and the existing knowledge about this approach within CUL. The group then began to plan and write this report, which recommends further exploration of Handles, on the group Wiki.

Recommendations
(embed the rationale for each recommendation)

CNRI Handle System (George)

We recommend that CUL undertake a proof-of-concept implementation of the CNRI Handle System. We would hope to test the system's ability to meet all of our requirements and gain insight and skill in the technical and organizational demands of maintaining a persistent identifier system. CNRI provides free handle server software and documentation. We envision a local system that can be administered in a distributed manner, so caretakers of different collections will be able to make autonomous decisions about the identifiers their collections use. We also recognize that a proof-of-concept system might prove to be of limited or no use in the long run. During the period of experimentation we would ensure that the digital objects we give identifiers to could be remapped to their original URLs or some other useful identifier system, in case we decide to discontinue using the Handle System. (Bill)

Resource Requirements -- we don't know what the requirements are (development? a standalone machine? server space? maintenance?) (John)

Evaluate the system and its use some time in the future. (Rick)

A deliverable: A Usage Document that explains how the system can be integrated into CUL collection building (Adam C)

We recommend that, embedded in mapping metadata, there be an explicit statement of the estimated lifespan of the PID and of the object it represents. This practice will add value to the identifier and help enforce a best practice in lifecycle management of our digital objects. (Bill)
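
As a minimal sketch of what such mapping metadata might look like, the following assumes a simple record with two optional dates. The field names, the `review_due` helper, and the sample values are hypothetical and not part of any existing CUL schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PidMapping:
    """Hypothetical mapping record carrying explicit lifespan estimates."""
    pid: str
    target: str
    pid_expected_until: Optional[date]     # None means no planned end
    object_expected_until: Optional[date]  # None means no planned end


def review_due(mapping: PidMapping, today: date) -> bool:
    """Return True once either estimated lifespan has passed."""
    limits = (mapping.pid_expected_until, mapping.object_expected_until)
    return any(limit is not None and today > limit for limit in limits)


mapping = PidMapping(
    pid="example/0001",
    target="https://example.org/objects/0001",
    pid_expected_until=None,
    object_expected_until=date(2030, 1, 1),
)

print(review_due(mapping, date(2029, 6, 1)))
print(review_due(mapping, date(2031, 6, 1)))
```

Recording the estimates as data, rather than in prose, would let a periodic job flag mappings whose objects have outlived their stated lifespan.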
