Friday 24 June 2011

Persistent identifiers: policy and ‘will’ vital ingredients (#kepoid)

The internet is changeable and volatile. If we are to secure long-term access to content on the internet, we have to find mechanisms to bring order to the seeming chaos. Standards, for instance – although I learned last month in Tallinn that we may be rushing into those (see blog post). Persistent identifiers (PIDs) are another building block for long-term access to digital objects, because they make sure that we can find the object that is being preserved, even if it is moved from one URL to another. But I learned last week that persistent identifiers are not as persistent as one might hope. Another illusion down the drain?


In front of a famous painting by Rembrandt (The Anatomy Lesson of Dr. Nicolaes Tulp), a working group led by Andrew Treloar (standing, at right) dissects the truth about persistent identifiers and their complex relationship with Linked Open Data.

The setting was a two-day seminar on persistent object identifiers (or POID, thus #kepoid) organized by Knowledge Exchange, the PersID project, SURFfoundation and Data Archiving and Networked Services (DANS) in The Hague (14-15 June). Regrettably, I managed to attend only the second day, but it was enough to make me understand how complicated this business is.

This is how it should work: a (national) organization (national library, scientific organization) assigns a unique identifier to a digital object, a so-called persistent identifier (PID). If the object is moved from one URL (internet location) to another, the PID remains the same and a resolver service maps the PID to the new URL.
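The resolution step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular PID system is implemented: the `Resolver` class, the `urn:example:` identifier and the example.org URLs are all made up for the purpose.

```python
class Resolver:
    """Toy in-memory resolver: maps a persistent identifier to its current URL."""

    def __init__(self):
        self._locations = {}  # PID -> current URL

    def register(self, pid, url):
        """Bind a new PID to the object's first location."""
        if pid in self._locations:
            raise ValueError(f"{pid} is already registered")
        self._locations[pid] = url

    def move(self, pid, new_url):
        """Record that the object moved; the PID itself never changes."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid):
        """Return the URL where the object lives right now."""
        return self._locations[pid]


resolver = Resolver()
pid = "urn:example:report-42"  # hypothetical identifier
resolver.register(pid, "http://old.example.org/reports/42.pdf")
resolver.move(pid, "http://new.example.org/archive/42.pdf")
print(resolver.resolve(pid))
```

The sketch also shows where the persistence really comes from: somebody has to call `move` every time the object changes location. The code is trivial; keeping the table up to date is the hard part.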

Borrowing from Andrew Treloar’s presentation (Australian National Data Service), here are the main complications associated with object identifiers:

  • Granularity: what do you assign a PID to? In FRBR terms: to the work? to the expression? to the manifestation? to the item? Or, I may add, to a chapter? to a paragraph? Perhaps we even need multiple PIDs at multiple levels.
  • How do you assign PIDs to objects that are not static, but that change all the time (e.g., databases)?
  • How trustworthy is the object that is being identified (e.g., short URL services)?
  • How to point to something inside the object?
  • Who owns the binding between the PID and the object?

And then there is the problem that there are a number of different PID systems (e.g., URN, DOI, PURL), which are not interoperable (comment by Juha Hakala: ‘It is encouraging that it is quite a long time since someone came up with a new PID system.’). And PIDs do not go well together with Linked Open Data (LOD).


‘Why is it so hard?’ – notes from Jeroen Rombouts’ computer (3TU.Datacenter)

Both Clifford Lynch and Andrew Treloar concluded that solving the technical problems of the PID challenge is the easiest part of the work to be done. Andrew built a pyramid of key success factors (photo above): at the bottom of the pyramid is a sustainability model, the second layer is about policies, the third is about procedures, and the top layer is about will or the intention of individuals to follow the rules and make the system work.


A room full of persistent identifiers – at right seminar chair Bas Cordewener (SURFfoundation).

In the end the attendees concluded that building interoperability between the existing PID systems is not a top priority. But getting PIDs to work with Linked Data is. Treloar proposed a ‘Den Haag manifesto’ to bring this about:

The Hague Manifesto on persistent identifiers and Linked Open Data (LOD) (draft version)

  1. Make sure PIDs can be referred to as HTTP URIs, including content negotiation.
  2. Use LOD vocabularies for schema elements.
  3. Identify the minimum common set of schema elements across identifiers in the scholarly communication space.
  4. Use same-as relations to help PID interoperability across PID systems/schemas.
  5. Work with the LOD community on simple policies/procedures to improve persistence of HTTP URIs.
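To make points 1 and 4 a little more concrete, here is a rough Python sketch. It is not part of the draft manifesto and not taken from any existing PID system. The first half picks a representation of an object based on an HTTP `Accept` header (content negotiation), and the second half groups identifiers that have been declared equivalent through same-as statements. All identifiers, URLs and function names are invented for the example.

```python
# Hypothetical representations of one object, keyed by media type.
REPRESENTATIONS = {
    "text/html": "http://example.org/page/42",
    "application/rdf+xml": "http://example.org/data/42.rdf",
}


def parse_accept(header):
    """Return the media types in an Accept header, highest q-value first."""
    ranked = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        ranked.append((q, media_type))
    # The sort is stable, so equal q-values keep their header order.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [media_type for q, media_type in ranked if q > 0]


def negotiate(header, representations):
    """Pick the best available (media type, URL) pair, or None."""
    for media_type in parse_accept(header):
        if media_type in representations:
            return media_type, representations[media_type]
        if media_type == "*/*" and representations:
            first = next(iter(representations))
            return first, representations[first]
    return None


def same_as_groups(pairs):
    """Group identifiers linked by same-as statements (simple union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we go
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Made-up identifiers from three different PID systems for the same object.
links = [
    ("doi:10.9999/example", "urn:nbn:xx:example-1"),
    ("urn:nbn:xx:example-1", "http://purl.example.org/obj/1"),
]
print(negotiate("application/rdf+xml;q=0.9, text/html;q=0.5", REPRESENTATIONS))
print(same_as_groups(links))
```

A browser asking for HTML and a Linked Data client asking for RDF would then reach different representations through the same identifier, while the same-as groups let a client treat a DOI, a URN and a PURL as names for one object without the three systems having to merge.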

Treloar will work with anybody who is ‘ready, willing and able’ to develop these principles.

Some other recommendations from the meeting:

  • Make an inventory of the different PID systems and make transparent how they work, so that organizations contemplating using PIDs know how to choose a system.
  • Find the common ground between the systems and use it to widen awareness of PID problems and systems.
  • Organize regular meetings between those who are involved in building PID infrastructures to facilitate alignment.

The work is being continued, within PersID and also within the European APARSEN project.


1 comment:

ChrisBellekom said

Hi Inge,

thanks for another interesting report!

I am not going to ask here whether you are for or against this phenomenon; I also realize that this is not something to be for or against.

I find the idea of persistent identifiers an interesting development, but when I think about reducing the complexity of technology (as far as I am concerned a precondition for sustainability), it does not belong on that list.

That is because I see no added value in the PID as a digital preservation solution. As soon as I, as a company or institution, am not willing to make an extra investment to provide all my digital objects, or locations within objects, with a DOI (I think I understand it correctly when I say that the PID is a further development of the 'digital object identifier') and I do not free up resources to subscribe to a PID-URL resolver service, the whole idea is doomed to fail. "What's in it for me?"

In my opinion the PID is a fail-safe for those who have done nothing about web archiving.

And as you already say, the weakest link in the whole story is the institution or company that is going to manage the 'web object location'.

Has anyone ever discussed making URLs immutable instead of developing PIDs further? I mean, the U in URL stands for 'Uniform'

Kind regards,

Chris Bellekom