Thursday 30 June 2011

Global Web of Data depends on machine-actionable XML (LIBER 3)

Two inspired keynotes today about the vast new possibilities that machine-readable data – or, more precisely, data that machines can act on – open up for the advancement of science. There is so much (digital) data out there that no human can comprehend it all. Fortunately, we have, or are developing, tireless machines that can publish, merge, search, reason over, predict and integrate information. They can establish relationships between fields of science that have never even contemplated getting together, making way for new cross-disciplinary work.


Herbert van de Sompel, at right, with yesterday’s keynote speaker Rick Luce.

It was, of course, Herbert van de Sompel of Los Alamos who treated the audience of 400 research librarians to a peek into this fascinating world of emerging research possibilities (slides available from slideshare). All based on some rather basic building blocks:


Enthusiastically, Van de Sompel reviewed some of the projects working to make all of this possible, starting with his own OAI Object Reuse and Exchange, Open Annotation, and Memento, and then moving on to other developments, such as the ‘nanopublication’, the smallest entity of information that machines can search, merge, read, etc. – a somewhat extended version of an RDF triple. And what to think of Executable Papers – articles that include the software and underlying data, so that readers can repeat the original experiments and draw their own conclusions.
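The nanopublication idea can be sketched in a few lines: a single RDF-style assertion bundled with provenance and publication metadata, so that a machine merging or reasoning over the claim also knows who made it and when. A minimal sketch in plain Python – all URIs and field names below are invented for the example, not a real nanopublication vocabulary:

```python
# A minimal, illustrative model of a nanopublication: one RDF-style
# assertion (subject, predicate, object) plus provenance and publication
# metadata, each kept in its own named part. Identifiers are made up.

def make_nanopub(subject, predicate, obj, author, date):
    """Bundle one machine-actionable assertion with its metadata."""
    assertion = [(subject, predicate, obj)]
    provenance = [("_:assertion", "attributedTo", author)]
    pubinfo = [("_:nanopub", "created", date)]
    return {"assertion": assertion, "provenance": provenance, "pubinfo": pubinfo}

nanopub = make_nanopub(
    subject="ex:malaria",
    predicate="ex:isTransmittedBy",
    obj="ex:Anopheles",
    author="ex:researcher-42",
    date="2011-06-30",
)

# A machine can now merge, search or reason over the assertion while
# still knowing who made the claim and when.
print(nanopub["assertion"][0])
```

Real nanopublications express each of these parts as an RDF named graph, but the division of labour is the same.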



Mind boggling! Van de Sompel explained that this is what we need to make it all happen, ‘and it irks me that we have that at our fingertips’:

  • open access to all data
  • permissive (i.e., non-restrictive) creative commons licences
  • money to pay for these tools
  • persistence in identifying the objects – and that is still a challenge (see last week’s post)

From a digital preservation viewpoint, however, there is a complication. As Alma Swan of Enabling Open Scholarship explained a few hours later, it does not work with PDF. PDF was developed for humans to read. Machines cannot read PDF; they need XML.
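The difference is easy to demonstrate: XML carries explicit structure that a program can query directly. A sketch using Python's standard library, on a simplified, made-up article record (real schemas such as JATS are far richer, but the principle is the same):

```python
import xml.etree.ElementTree as ET

# A simplified, invented article record: every field is explicitly
# labelled, so a machine can act on it without guessing.
article_xml = """
<article>
  <title>Machine-Actionable Scholarship</title>
  <author>A. Researcher</author>
  <abstract>Why machines need structure, not page images.</abstract>
</article>
"""

root = ET.fromstring(article_xml)
title = root.findtext("title")
author = root.findtext("author")

# These fields can now be indexed, merged and linked. A PDF of the same
# article would offer only a stream of positioned glyphs.
print(title)   # Machine-Actionable Scholarship
```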


Alma showed that universities are building digital repositories at great speed – there are now almost 2,000 of them. But what are we filling them with? Mainly PDFs … because they are nice and robust from a preservation viewpoint. And humans can read them.


 Alma Swan (seated) awaiting her turn to speak with session chair Bas Savenije.

There is good news as well. As Alma pointed out, digital repositories are attracting new users, mostly users from outside the university who do not have access to licensed digital content from major publishers. Companies, for instance, and private citizens who are making use of the new digital possibilities and getting involved in scientific efforts, such as these:


However, much of the material these new users are interested in is not available in open access, and thus cannot be used by either humans or machines.

So what are we preserving all of the stuff for?

I’m going to sleep on that one …

Yours truly will spare no effort to tell you everything; here is my paparazzi shot in the VIP room: LIBER President Paul Ayris (left) and Executive Director Wouter Schallier.

Wednesday 29 June 2011

More (digital) wake-up calls for academic libraries (LIBER 2)

Today it was Rick Luce (Emory University, US) who had the (questionable) honour of issuing a wake-up call to research libraries. This time the topic was not cultural heritage (see yesterday’s post), but the core business of academic libraries: serving researchers and the scientific research process. Check out his slides when they become available on the LIBER website. It is a dazzling summary of all the changes taking place in the sciences: zettabytes of data; dynamic, complex data objects that require management; communities and data flows becoming much more important than static library collections, etc. Luce’s warning: somebody will develop the services the new researcher needs. If the library does not develop them, there is no future for the research library. Luce called for a fundamental transformation process that will affect every aspect of the ‘library’ business.


In this vision, the role of the ‘library’ is to deliver a layer of middleware between the scientific process and IT infrastructure.

Luce’s advice for libraries on how to bring all this about: ‘Radical Cooperation’:


Such change, of course, does not come easy. Disruption of set patterns makes people nervous – but instability also contributes to disruptive innovation:


Luce warned that ‘Culture will have strategy for breakfast every time.’ It takes years to turn the culture of an organization around.

At the end of this inspiring session, Kurt de Belder of Leiden University asked a crucial question: ‘Are libraries in fact the type of organizations that can make such drastic changes?’ In his response, Luce implied that some will make the change successfully – those who do not will be out of business by 2020. One of those libraries might be that of the librarian who asked, during a data workshop this morning, how to make researchers deliver the metadata that the library needs. Or was the wake-up call clear enough?

Norbert Lossau of Göttingen asked for advice on how to convince the University Board and staff of all this. Luce’s advice: ‘Look for small victories; do it step by step; work with early adopters within your staff.’

Will research libraries have enough time to get all that done by 2020?

Neelie Kroes of the European Commission addressing the conference by video. Her agenda: a) open access; b) open data (all public-sector information); c) digital culture (‘By 2025 all European cultural heritage should be digitized and available through Europeana’)

Tuesday 28 June 2011

Tomorrow will be too late – (born) digital in library special collections (LIBER 1)

“Heritage collections in the digital future” was the revealing title of the first workshop I attended at the annual conference of LIBER, the Association of European Research Libraries, in Barcelona. The title was immediately attacked by Marco de Niet of Digital Heritage Netherlands: ‘Digital Future? Come on! The digital reality is today.’

Was it really necessary for De Niet to talk in terms of ‘Shame on you!’, I wondered as he started his presentation. Have we not reached the stage where we all know about the digital present and are making plans for it, even if we cannot act upon them immediately?

Unfortunately, the answer must be no. Starting with Ivan Boserup’s summary of the findings of a recent poll amongst LIBER libraries, followed by statistics gathered by De Niet himself (from projects such as Enumerate), and on through Jackie Dooley’s summary of recent OCLC research (Taking Our Pulse: the OCLC Research Survey of Special Collections and Archives) – the evidence is overwhelming that (research) libraries are still pretty much functioning within an analogue paradigm. This is not to say that they are all still about lending physical books. Of course not: quite a few libraries have digitized collections and provide online access. But their digitization efforts mostly lack strategic planning, access is still mostly provided in a controlled way (for a limited group of users), preservation issues are still not being addressed adequately, and born-digital material (including audiovisual content) is blatantly missing from collections and collection plans.

‘Shocking’, De Niet called these findings, and I second that. De Niet summarized the changing value propositions as follows:


Most libraries are still in their comfort zone, that of digital in a controlled network. According to De Niet, libraries cannot afford to stay there. If they do, their role in the information business becomes insignificant. The key factor is still the libraries’ desire to be in control, according to De Niet. ‘Libraries have to let go.’


De Niet sees opportunities for libraries, but not if they stick to their traditional values.


During the Q&A a library director remarked that he did not find this presentation particularly helpful. He thought it was rather confusing … Quod erat demonstrandum?

Spain is suffering from a heat wave; fortunately the conference pack includes a sponsored fan – Jasmine Honculada of WIPO was one of the first to discover how that funny plastic object could be put to excellent use.

(There is more to tell about this session – more to follow soon.)

Friday 24 June 2011

Persistent identifiers: policy and ‘will’ vital ingredients (#kepoid)

The world of the internet is changeable and volatile. If we are to secure long-term access to content on the internet, we have to find mechanisms to bring order to the seeming chaos. Standards, for instance – although I learned last month in Tallinn that we may be rushing into those (see blog post). Persistent identifiers are another type of building block for long-term access to digital objects, because PIDs make sure that we can find the object that is being preserved, even if it is moved from one URL to another. But I learned last week that persistent identifiers are not as persistent as one might hope. Another illusion down the drain?


In front of a famous painting by Rembrandt (The Anatomy Lesson of Dr. Nicolaes Tulp), a working group led by Andrew Treloar (standing, at right) dissects the truth about persistent identifiers and their complex relationship with Linked Open Data.

The setting was a two-day seminar on persistent object identifiers (or POID, thus #kepoid) organized by Knowledge Exchange, the PersID project, SURFfoundation and Data Archiving and Networked Services (DANS) in The Hague (14–15 June). Regrettably, I managed to attend only the second day, but it was enough to make me understand how complicated this business is.

This is how it should work: a (national) organization (a national library, a scientific organization) assigns a unique identifier to a digital object – a so-called persistent identifier. If the object is moved from one URL (internet location) to another, the PID remains the same and a resolver service links the PID to the new URL.
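The mechanism can be sketched in a few lines. This is a toy resolver with an in-memory table and an invented URN; real services such as the Handle System or DOI infrastructure do the same job at scale:

```python
# Toy persistent-identifier resolver: the PID never changes; only the
# resolver's table is updated when the object moves to a new URL.
# The PID and URLs below are invented for the example.

class Resolver:
    def __init__(self):
        self._table = {}

    def register(self, pid, url):
        """Bind a newly assigned PID to the object's current location."""
        self._table[pid] = url

    def move(self, pid, new_url):
        """The object moved; the PID stays stable, only the binding changes."""
        self._table[pid] = new_url

    def resolve(self, pid):
        """Return the current location for a PID."""
        return self._table[pid]

resolver = Resolver()
resolver.register("urn:nbn:nl:ui:12-345", "http://old-host.example/object/345")
resolver.move("urn:nbn:nl:ui:12-345", "http://new-host.example/objects/345")

# Citations that use the PID keep working after the move.
print(resolver.resolve("urn:nbn:nl:ui:12-345"))
```

Note that everything here depends on someone actually updating the binding when the object moves – which is why the seminar kept returning to policy and ‘will’ rather than technology.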

Borrowing from Andrew Treloar’s presentation (Australian National Data Service), here are the main complications associated with object identifiers:

  • Granularity: what do you assign a PID to? In FRBR terms: to the work? to the expression? to the manifestation? to the item? Or, I may add, to a chapter? to a paragraph? Perhaps we even need multiple PIDs at multiple levels.
  • How do you assign PIDs to objects that are not static, but that change all the time (e.g., databases)?
  • How trustworthy is the object that is being identified (e.g., short URL services)?
  • How to point to something inside the object?
  • Who owns the binding between the PID and the object?
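The granularity question becomes concrete as soon as one tries to model it: every FRBR level could carry its own identifier, and the levels need to be linked to one another. A sketch, with all identifiers invented for the example:

```python
# One scholarly work identified at several FRBR levels. Whether all of
# these levels need a PID, and how they should link up, is exactly the
# open question. All identifiers below are invented.

record = {
    "work":          {"pid": "ex:work/hamlet", "parent": None},
    "expression":    {"pid": "ex:expr/hamlet-en", "parent": "ex:work/hamlet"},
    "manifestation": {"pid": "ex:mani/hamlet-en-pdf", "parent": "ex:expr/hamlet-en"},
    "item":          {"pid": "ex:item/hamlet-en-pdf-copy1", "parent": "ex:mani/hamlet-en-pdf"},
}

def chain(level):
    """Walk from a given FRBR level up to the work, collecting PIDs."""
    pids = []
    while level is not None:
        entry = record[level]
        pids.append(entry["pid"])
        parent_pid = entry["parent"]
        # Find the level whose PID is our parent (None at the top).
        level = next((k for k, v in record.items() if v["pid"] == parent_pid), None)
    return pids

print(chain("item"))
```

Add chapter- or paragraph-level identifiers and the tree grows further – each extra level is another binding that someone has to own and maintain.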

And then there is the problem that there are a number of different PID systems (e.g., URN, DOI, PURL), which are not interoperable (comment by Juha Hakala: ‘It is encouraging that it is quite a long time since someone came up with a new PID system.’). And PIDs do not go well together with Linked Open Data (LOD).


‘Why is it so hard?’ – notes from Jeroen Rombouts’ computer (3TU.Datacenter)

Both Clifford Lynch and Andrew Treloar concluded that solving the technical problems of the PID challenge is the easiest part of the work to be done. Andrew built a pyramid of key success factors (photo above): at the bottom of the pyramid is a sustainability model, the second layer is about policies, the third is about procedures, and the top layer is about will or the intention of individuals to follow the rules and make the system work.


A room full of persistent identifiers – at right seminar chair Bas Cordewener (SURFfoundation).

In the end the attendees concluded that building interoperability between the existing PID systems is not a top priority. But getting PIDs to work with Linked Data is. Treloar proposed a ‘Den Haag manifesto’ to bring this about:

The Hague Manifesto on persistent identifiers and Linked Open Data (LOD) (draft version)

  1. Make sure PIDs can be referred to via HTTP URIs, including content negotiation
  2. Use LOD vocabularies for schema elements
  3. Identify the minimum common set of schema elements across identifiers in the scholarly communication space
  4. Use same-as relations to help PID interoperability across PID systems/schemas
  5. Work with the LOD community on simple policies/procedures to improve the persistence of HTTP URIs
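The first point – PIDs dereferencing to HTTP URIs with content negotiation – can be illustrated with a toy handler that inspects the Accept header to decide between a human-readable page and machine-readable RDF. The paths and media-type table are invented for the example:

```python
# Toy content negotiation for a PID landing service: the same identifier
# yields a human-readable page or machine-readable data, depending on the
# client's Accept header. Paths below are invented.

REPRESENTATIONS = {
    "text/html": "/page/12-345.html",            # for human readers
    "application/rdf+xml": "/data/12-345.rdf",   # for machines / Linked Data
}

def negotiate(accept_header):
    """Pick a representation for the requested media types; HTML as fallback."""
    for media_type in accept_header.split(","):
        media_type = media_type.split(";")[0].strip()  # drop q-values
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    return REPRESENTATIONS["text/html"]

print(negotiate("application/rdf+xml"))                     # machine client
print(negotiate("text/html,application/xhtml+xml;q=0.9"))   # browser
```

Real HTTP content negotiation also weighs the q-values rather than taking the first match, but the principle – one stable identifier, multiple representations – is what ties PIDs to Linked Open Data.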

Treloar will work with anybody who is ‘ready, willing and able’ to develop these principles.

Some other recommendations from the meeting:

  • Do an inventory of the different PID systems and make transparent how they work, so that organizations contemplating using PIDs know how to choose a system
  • Find the common ground between the systems and use it to widen awareness of PID problems and systems
  • Organize regular meetings between those who are involved in building PID infrastructures to facilitate alignment

The work is being continued, within PersID and also within the European APARSEN project.


Saturday 18 June 2011

On alignment (when we do and don’t need it), and on the Elephants in the Room (ANADP11–10, Evaluation)

ANADP11 was a conference about (inter)national alignment, and organizing a conference around such a theme implies that we need more of it. I offer you my tentative conclusions, a fortnight after the conference:

Sometimes aligning is as easy as hopping on a plane and attending a conference in Tallinn. Nothing tops meeting people face to face to exchange ideas and information. Much of the (technical) information can, of course, be found on the internet, but in our day-to-day lives we rarely have or make the time to actually find and read it all. Also, having an actual conversation about something is much more informative than one-way traffic. This type of alignment really requires no more than a bunch of enthusiastic people who are willing to put a lot of time into putting conferences such as these together (thanks, Matt Schultz, Katherine Skinner, Martin Halbert, Aaron Trehub, Abigail Potter, Martha Anderson, Michelle Galinger and, last but not least, Mari Kannusaar of the Estonian National Library!). The Tallinn conversation was an important type of alignment in and of itself, in the way it was organized around themes, with panels discussing the issues before the conference.

Alignment networks become a little more formal when organizations actually start doing projects together: the Open Planets Foundation, the Alliance for Permanent Access, the US National Digital Information Infrastructure and Preservation Program (NDIIPP), the national digital preservation coalitions (nestor, DPC, NCDD). These are foundations with by-laws. Typically, they will run outreach and R&D programs together. A more informal, but very successful type of network is the International Internet Preservation Consortium (IIPC). It is fluid, it is informal. One wonders whether these need more international alignment than they already have. Perhaps more of these groupings are called for, especially in countries that have not yet embraced the issue of digital preservation, but if so, then these must really be bottom-up initiatives.


The organizational panel’s breakout session pleaded for fluid, informal international alignment rather than an international steering committee, as proposed by Laura Campbell in her keynote address.

Alignment of necessity becomes more formal when organizations start sharing the burden of digital preservation: LOCKSS, the MetaArchive, etc. Taking care of each other’s collections requires governance contracts to be drawn up. However, one wonders if these types of initiatives are in need of more or further international alignment; they seem to work best when similar organizations (similar as in: close together, with similar remits, of similar size, within the same scientific discipline) group together to do a very practical job. I thought the Alabama Digital Preservation Network (blog post) was a particularly powerful example which can inspire others. We definitely need to continue to organize conferences to highlight such initiatives for others to learn from (and enable our staff to come to these conferences despite budget cuts …), but whether they need more (international) alignment as such … I wonder.

Sharing the burden of preservation requires more robust support, but often at a local and/or disciplinary level (St. Catherine’s passage, Tallinn)

Technical development, then, and standards. Are we reinventing the wheel over and over again? Will alignment help us save money? Yes and no. Of course we would save a lot of money in the short run if we were all to adopt the same technology. However, digital preservation is a moving target. Michael Seadle and Andreas Rauber assured us that all our present systems are untested; they are, indeed, “a leap of faith”. So this is no time to “rush into standards” (Bram van der Werf). In other words: exchanging experiences and exchanging test data, yes; throwing all of our eggs into one basket at this (early) point in time, definitely not. We need to allow for diversity in preservation strategies and tools to develop. This will, of course, lead to some redundancy, but that is part of the game.

So far, we have made a strong case for more conferences, more conversations, more debates, more projects, more comparing of notes, more sharing of experiences (test data, best and, yes please, worst practices), the establishment of fluid, informal affinity groups to share knowledge. Plus more formal alignment on a local or national scale when organizations actually start collaborating in preserving their digital collections. But we have not yet made the case for more international alignment. Unless … one thinks about …

An international digital preservation registry?

During the conference the call was heard for some type of international registry of projects, initiatives, knowledge, competences, etc. I have heard that call before, and it would be great to have such a facility: a one-stop portal where everything that we know about what is being preserved by whom, everything we are studying (including everything we have found out that does not work), and, perhaps even more importantly, everybody involved in DP and their special expertise, is linked up, and a brokering service brings supply and demand together. Not a project, but a sustained effort by the international community.

The trouble with that is that we had something like that: the PADI service, run by the National Library of Australia. But, as Maurizio Lunghi (photo left) explained at the conference, that service was discontinued last year. The website is still up, but it is not being maintained any more. PADI went down, Lunghi told us, because it started out as an international effort, but in the course of time it became isolated, and the National Library of Australia was doing all the work alone. And that is the trouble with this type of registry/competence centre. Everybody wants it, but establishing and maintaining one is very labour-intensive. Especially in difficult economic times such as the present, our institutions’ management will not prioritise active knowledge sharing. Which is understandable in the short term, but obviously unwise in the long term. I have been involved in organizing something like this at a national scale, and apart from the staff effort involved, it is also very difficult to make all that knowledge and experience available in a form that is useful to users – simply because there are so many users with so many different needs.

But wouldn’t it be great … I am not giving up the idea and I welcome anybody to think this through with me!

The Digital Elephants in the Room

So much for the problems that we did discuss. Now for the ones we did not discuss (enough). Clifford Lynch called them the Elephants in the Room: the issues we chose not to talk about because … yes, well, why didn’t we discuss them? I have a couple of theories. For one, the conference room was dominated by librarians – there were very few people from archives, museums, or scientific repositories. That was a shame, because it limited the scope of issues to be discussed. In particular, the issues that transcend traditional borders between domains and sectors were not talked about nearly enough. And I would argue that these are the very issues where we absolutely need international alignment, because there is no other way to deal with them. Here is my list of these Key Issues (which I will be happy to develop further with anybody who wants to join the debate):

  • Lots and lots of digital content is being produced at the moment that no heritage institution is collecting, because it does not fall within traditional collection profiles. It is high time we get together at a high policy level, with representatives from a broad range of organizations (museums, libraries, archives, scientific institutions) to talk about this huge challenge. This involves talking about data deluges (in science, but also in social media; audiovisual output for which there is no Public Records Act and no Legal Deposit Scheme), it definitely involves talking about selection (what to keep?), and who is going to do the keeping (distribute the work? establish new organizations?)
  • Making the case for digital preservation to funders and the public at large. This is a big one. It may force us to take another look at the assumptions underlying our work before we can get this right. Which is not something people particularly like doing. But we’ve got to do it, and make sure that we speak with one voice. Can we write up the success cases to prove our point?
  • Where do we put our money? Both in Europe and in the US substantial amounts of money are invested in digital preservation research. But is R&D attacking the right problems? Where should we put our money to really start bringing costs of digital preservation down?
  • Should effective copyright action be in this list? Clifford Lynch suggested that we get some smart people from both sides of the ocean together to really get to the bottom of the issue and organize an effective lobby to truly bring about a fundamental change in thinking about the individual (producer’s) rights versus public rights.
  • Dare we think of an international registry that will really allow the international community to reinvent wheels only where alternatives really are beneficial?

At the end of the day I think I can say that no, we did not get everything right at this first ANADP conference. But we made a good start at filtering out the issues and deciding where they should be addressed and by whom. That in itself is important. There is talk of a follow-up workshop at iPRES and there is talk of a follow-up ANADP. Let’s try to include more parties there (more countries, more domains), let’s try to bring more decision makers to the debate, and let’s try to narrow the issues down to those areas where alignment is truly essential.


My favorite conference recommendation, by Jeremy York of HathiTrust: ‘Developing killer digital preservation apps’

This is the last of a series of 11 blog posts about the Tallinn conference. The others were:


Martin Halbert (left) and Matt Schultz of Educopia, two of the driving forces behind this initiative to open up a transcontinental policy debate.


Many thanks to Mari Kannusaar of the Estonian National Library (right), thanks to her colleague Leila and all the other colleagues for a smooth conference!
