Sunday 3 July 2011

How can we prove that digital preservation systems will deliver? (LIBER 5)

This blog post is about the LIBER2011 workshop with the poorest attendance (14 out of 400 conference participants, who had a choice between three parallel sessions). Attendance may have been poor, but the subject matter was important, so I can only conclude that I and others who plead the cause of digital preservation still have a lot of work to do. (Or are the other 386 counting on me to blog about it in sufficient detail? ;-)

Why testing?

Over the past 15 years or so we have been building preservation systems and putting our digital collections (or, more precisely, ‘digitally encoded information’) into them. But how do we know that they will deliver? Last month in Tallinn, Michael Seadle called our present systems ‘a leap of faith’ and with Andreas Rauber he pleaded for more testing and more exchanges of testing data (see post).

But what do you test? And how?

That was what David Giaretta’s workshop was about, in the context of APARSEN, a major European project with 32 partners (slides in this post courtesy of David Giaretta).

Giaretta explaining the four phases of APARSEN: Trust, Sustainability, Usability and Access. Testing is part of the trust package.

‘We need more than migration and emulation.’

The best-known preservation techniques are migration and emulation. The results are tested on the basis of ‘significant properties’: is the information that an organization regards as essential still there after the object has been changed (migration) or in the new computer environment that purports to emulate the old one (emulation)?

Giaretta asserts that these techniques are useful for some digital objects, and that they have a role to play in determining authenticity, but that they do not work for all objects. APARSEN has developed a three-dimensional model to characterize objects technically, so that one can determine which tools can be applied:


Which leads to these conclusions:


So, we need other techniques in addition to migration and emulation, especially if we want our information to be part of the Global Brain of Linked Open Data that Herbert van de Sompel spoke about on Wednesday.

First question: who do we preserve for?

Giaretta has developed a very elegant way of describing what we do all this work for: we have ‘unfamiliar’ stuff (rows of ones and zeros) which we must make ‘familiar’ for people to be able to use it. We must do that now, and we must continue to do it in the future. Over time, the job will become more difficult.

Second question: what do they need to use the object?

What ‘familiar’ means depends on the context, and specifically on a central concept from the OAIS reference model: the ‘designated community’, the user group an institution works for, and its knowledge base. If the target audience is a group of five-year-olds, our rendering techniques must be very sophisticated so that the five-year-old only has to push a button. If the audience is a group of computer specialists, less help will be needed.

In this view, the representation information becomes the focus of testing the systems (see OAIS Information Model). In OAIS, representation information is included in the AIP, the archival information package that holds both the object itself and all the extra information needed to process and render it. Is everything there that the designated community needs to be able to use the information?


The ‘representation information network’

As we saw above, the representation information varies between designated communities. But it will also change over time. A present-day computer will understand the information ‘this is XML’. But by 2080 XML may be an archaic file format, and the rendering information will have to be much more specific in telling the computer how to render XML so that a human (or machine) can use it. And if the manual for the programme happens to be in PDF, the package will need to include the same kind of information about PDF. Discipline-specific information must also be included, such as vocabularies and ontologies. And when the information package contains a series of dates, one must be able to determine the time zone, summer or winter time, et cetera.

It is a network, to which new information must be added as time goes on:


This network can be tested. Is all the required information being preserved?
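To make that question concrete, here is a minimal sketch of such a test in Python. It is an illustration only, not APARSEN’s tooling: the network contents, the item names and the function `missing_rep_info` are all hypothetical. The idea is to walk the network from an object and report every piece of representation information that is neither preserved nor already part of the designated community’s knowledge base.

```python
from typing import Dict, Set

# Hypothetical network: each entry maps an item to the representation
# information needed to understand it. All names are illustrative.
REP_INFO_NETWORK: Dict[str, Set[str]] = {
    "dataset.xml": {
        "XML specification",
        "discipline vocabulary",
        "time zone convention",
    },
    "XML specification": {"Unicode specification"},
}


def missing_rep_info(
    obj: str,
    network: Dict[str, Set[str]],
    preserved: Set[str],
    known: Set[str],
) -> Set[str]:
    """Return the representation information reachable from obj that is
    neither preserved in the archive nor already known to the community."""
    missing: Set[str] = set()
    seen: Set[str] = set()
    stack = [obj]
    while stack:
        item = stack.pop()
        for dep in network.get(item, set()):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in known:
                # The community already understands this; stop here.
                continue
            if dep not in preserved:
                missing.add(dep)
            # Whatever explains dep must be understandable too.
            stack.append(dep)
    return missing


preserved = {"discipline vocabulary"}

# Assumed present-day community: knows XML and the time zone convention.
gaps_today = missing_rep_info(
    "dataset.xml", REP_INFO_NETWORK, preserved,
    known={"XML specification", "time zone convention"},
)

# Assumed community in 2080: none of this is common knowledge any more.
gaps_2080 = missing_rep_info(
    "dataset.xml", REP_INFO_NETWORK, preserved, known=set()
)

print(sorted(gaps_today))
print(sorted(gaps_2080))
```

With these made-up inputs the present-day community has no gaps, while the 2080 community is missing the XML specification, the Unicode specification it depends on, and the time zone convention. The same archive can pass the test today and fail it later, which is why the network has to keep growing.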

In the past month, APARSEN has been doing a series of test audits in Europe in preparation for the ISO 16363 standard, which is still in the making. The audits were also designed to test prospective auditors. The provisional conclusions are as follows:

  • most audited organizations do a good job at preserving the bits;
  • quite a few organizations lack succession plans (what happens to the data when my organization ceases to exist?);
  • quite a few have not defined their designated communities;
  • typically, the representation information networks are insufficient or non-existent.

Giaretta concluded:



Plenty of empty chairs … (Photo: Jordi Aguilar)

Here is David’s impressive list of references for those of you who want to know more:

Friday 1 July 2011

Setting priorities: value shift from printed to digital … and vice versa (LIBER 4)

‘Digital and print is an and-and proposition for libraries’, said Graham Jefcoate during the special collections session. But the budgets have not increased. So how should we allocate resources? How should we prioritize?

The Dutch KB has been working on a model that takes an integral view of printed and digital collections. The model can help decide what to spend our money on. At LIBER it was presented by Sophie Ham (photo left) and Tanja de Boer. As it has not been published yet, I gladly present it here in some detail, because I think it can really help libraries make tough decisions (slides courtesy of Sophie Ham).

The model rates the value of (parts of) collections to determine which ones should be conserved or preserved with priority. Here are the rating criteria:

Primary criteria:


Secondary criteria:


And this is how the process works:


Multiplying primary and secondary criteria might look like this (just an example):
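As a rough illustration of the arithmetic only: the post does not spell out the KB’s actual formula, scales or criteria lists, so everything in this Python sketch is an assumption. That includes the function `priority_score`, which criteria count as primary or secondary, the 1-to-3 scale, and the choice to sum each group before multiplying.

```python
from typing import Dict, Tuple

Ratings = Dict[str, int]


def priority_score(primary: Ratings, secondary: Ratings) -> int:
    """Multiply the summed primary ratings by the summed secondary ratings.

    This is one possible reading of 'multiplying primary and secondary
    criteria'; it is not the published KB formula.
    """
    return sum(primary.values()) * sum(secondary.values())


# Invented ratings on an assumed 1 (low) to 3 (high) scale.
collections: Dict[str, Tuple[Ratings, Ratings]] = {
    "collection A": ({"informational value": 3, "historic value": 1},
                     {"use value": 3}),
    "collection B": ({"informational value": 1, "historic value": 3},
                     {"use value": 1}),
}

# Rank collections from highest to lowest priority.
ranked = sorted(
    collections,
    key=lambda name: priority_score(*collections[name]),
    reverse=True,
)

for name in ranked:
    print(name, priority_score(*collections[name]))
```

With these invented numbers, collection A scores 4 × 3 = 12 and collection B scores 4 × 1 = 4, so A would be treated first. Multiplying rather than adding means a low secondary rating pulls the whole score down, however high the primary ratings are.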


Sophie gave two extreme examples of how such a value assessment can turn out. This example concerns printed versions. Obviously, newspapers score higher on informational value and medieval manuscripts score higher on uniqueness and historic value.


Table: medieval manuscripts rated on informational value, aesthetic value and historic value.

What happens when these collections are digitized? Some values are transferred to the digital copy (e.g., much of the use and informational value for newspapers), but other values cannot be transferred (e.g., the uniqueness of a medieval manuscript). As Claudia Fabian of the Bayerische Staatsbibliothek demonstrated, the use value of a physical object may even increase when it is digitized, because more people become aware of its existence and become interested.


And here is the end result of this (extreme) example:


As you can see, the value of the physical medieval manuscript has remained unchanged, whereas some of the value of the printed newspaper collection has been transferred to the digital copy, especially the use value; the KB’s online newspaper database is in great demand by users of all kinds in the Netherlands.

Discussing the KB model, from the left: Sophie Ham, Claudia Fabian and workshop chair Graham Jefcoate.

So, if push comes to shove, and painful decisions have to be made about whether to build new stacks for physical newspapers or invest in expanding the e-Depot that holds the digital newspapers, this (limited) analysis clearly points in the direction of investing in the e-Depot and perhaps deciding to keep only representative selections of printed newspapers.

Obviously, lots of questions remain. At what granularity should one assess collections? How can one make the assessments as objective as possible? And so on. But it is a promising beginning. If you want to know more or contribute to developing the model, please get in touch with sophie.ham@kb.nl.

KB colleagues at Sophie’s presentation: from the left, Els van Eijck van Heslinga, Lotte Wilms, Lieke Ploeger, Victor-Jan Vos.


