Sunday 3 July 2011

How can we prove that digital preservation systems will deliver? (LIBER 5)

This blog post is about the LIBER2011 workshop with the poorest attendance (14 out of 400 conference participants, who had a choice between three parallel sessions). Attendance may have been poor, but the subject matter was important, so I can only conclude that I and others who plead the cause of digital preservation still have a lot of work to do. (Or are the other 386 counting on me to blog about it in sufficient detail? ;-)

Why testing?

Over the past 15 years or so we have been building preservation systems and putting our digital collections (or, more precisely, ‘digitally encoded information’) into them. But how do we know that they will deliver? Last month in Tallinn, Michael Seadle called our present systems ‘a leap of faith’ and with Andreas Rauber he pleaded for more testing and more exchanges of testing data (see post).

But what do you test? And how?

That was what David Giaretta’s workshop was about, in the context of the APARSEN project (a major European project with 32 partners). The slides in this post are courtesy of David Giaretta.

Giaretta explaining the four phases of APARSEN: Trust, Sustainability, Usability and Access. Testing is part of the trust package.

‘We need more than migration and emulation.’

The most well-known preservation techniques are migration and emulation. The results are tested on the basis of ‘significant properties’: Is the information an organization regards as essential still there after the object has been changed or, alternatively, in the new computer environment that purports to emulate the old computer?

Giaretta asserts that these techniques are useful for some digital objects, and that they have a role to play in determining authenticity, but they do not work for all objects. APARSEN has developed a three-dimensional model to characterize objects technically, so that one can determine which tools can be applied:

[Slide: APARSEN’s three-dimensional model for characterizing digital objects]

Which leads to these conclusions:

[Slide: conclusions drawn from the model]

So, we need other techniques in addition to migration and emulation, especially if we want our information to be part of the Global Brain of Linked Open Data that Herbert van de Sompel spoke about on Wednesday.

First question: who do we preserve for?

Giaretta has developed a very elegant way of describing what we do all this work for: we have ‘unfamiliar’ stuff (rows of ones and zeros) which we must make ‘familiar’ for people to be able to use it. We must do that now, and we must continue to do it in the future. Over time, the job will become more difficult.

Second question: what do they need to use the object?

What ‘familiar’ means depends on the context, and in particular on a central concept from the OAIS reference model: the ‘designated community’, the user group an institution works for, and its knowledge base. If the target audience is a group of five-year-olds, our rendering techniques must be very sophisticated so that a five-year-old only has to push a button. If the audience is a group of computer specialists, less help will be needed.

In this view, the representation information, which in OAIS is part of the AIP (the OAIS term for the archival information package that includes both the object itself and all the extra information needed to process and render it), becomes the focus of testing the systems (see OAIS Information Model). Is everything there that the designated community needs to be able to use the information?

[Slide: OAIS Information Model]

The ‘representation information network’

As we saw above, the representation information varies between designated communities. But it will also change over time. A present-day computer will understand the information ‘this is XML’. But in 2080 XML is perhaps an archaic file format, and the rendering information will have to be much more specific in telling the computer how it can render XML so a human (or machine) can use it. And if the manual for the programme happens to be in PDF, it will need to include the same information about PDF. Discipline-specific information must also be included, such as vocabularies and ontologies. And when the information package contains a series of dates one must be able to determine the time zone, summer or winter time, etcetera.

It is a network, to which new information must be added as time goes on:

[Slide: the representation information network]

This network can be tested. Is all the required information being preserved?
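As a rough illustration of what such a test could look like, here is a minimal Python sketch. It is not APARSEN's tooling: the function name, the dictionary layout and all the example entries are my own assumptions. It treats the network as a dependency graph and reports every piece of representation information that is neither stored in the archive nor part of the designated community's knowledge base.

```python
from collections import deque


def missing_representation_info(network, archived, known, start):
    """Return items needed to use `start` that are neither archived nor known.

    network  -- dict: item -> list of representation information it depends on
    archived -- set of items actually held in the archive
    known    -- set of items the designated community already understands
    (All three structures are hypothetical, chosen for this illustration.)
    """
    missing = set()
    seen = {start}
    queue = deque([start])
    while queue:
        item = queue.popleft()
        for dep in network.get(item, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in known:
                continue  # the community understands this; no need to go deeper
            if dep not in archived:
                missing.add(dep)  # a gap in the network
                continue
            queue.append(dep)  # archived, but needs its own explanation
    return missing


# Hypothetical network for one archived dataset.
network = {
    "dataset.xml": ["XML specification", "domain vocabulary", "time zone convention"],
    "XML specification": ["PDF specification", "Unicode"],
    "domain vocabulary": ["English"],
}
archived = {"dataset.xml", "XML specification", "domain vocabulary"}

# Today's users know a lot, so only one gap shows up.
known_today = {"English", "Unicode", "PDF specification"}
print(sorted(missing_representation_info(network, archived, known_today, "dataset.xml")))
# → ['time zone convention']

# A future community with a smaller knowledge base exposes more gaps.
known_2080 = {"English"}
print(sorted(missing_representation_info(network, archived, known_2080, "dataset.xml")))
# → ['PDF specification', 'Unicode', 'time zone convention']
```

The second call shows why the network has to grow over time: the archive has not changed, but as the community's knowledge base shrinks, more representation information turns out to be missing.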

In the past month, APARSEN has been doing a series of test audits in Europe in preparation for the ISO 16363 standard, which is in the making. The tests were also designed to test prospective auditors. The provisional conclusions are as follows:

  • most audited organizations do a good job at preserving the bits;
  • quite a few organizations lack succession plans (what happens to the data when my organization ceases to exist?);
  • quite a few have not defined their designated communities;
  • typically, the representation information networks are insufficient or non-existent.

Giaretta concluded:

[Slide: Giaretta’s conclusions]

Plenty of empty chairs … (Photo: Jordi Aguilar)

Here is David’s impressive list of references for those of you who want to know more:

1. CCSDS. (2002), Reference model for an Open Archival Information System (OAIS). Retrieved from: http://public.ccsds.org/publications/archive/650x0b1.pdf

2. OAIS update (at the time of writing under CCSDS review), http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Attachments/650x0p11.pdf

3. Knight, G., 2008, Framework for the definition of significant properties. Retrieved from http://www.significantproperties.org.uk/documents/wp33-propertiesreport-v1.pdf

4. Wilson, A., 2007, Significant Properties Report. Retrieved from http://www.significantproperties.org.uk/documents/wp22_significant_properties.pdf

5. J. Rothenberg and T. Bikson, 1999, 'Carrying Authentic, Understandable and Usable Digital Records Through Time' report to the Dutch National Archives and Ministry of the Interior. Retrieved from http://www.digitaleduurzaamheid.nl/bibliotheek/docs/final-report_4.pdf

6. M. Hedstrom and C.A. Lee, “Significant properties of digital objects: definitions, applications, implications”, Proceedings of the DLM-Forum 2002. Retrieved from http://ec.europa.eu/transparency/archival_policy/dlm_forum/doc/dlm-proceed2002.pdf

7. Cedars project, http://www.leeds.ac.uk/cedars/

8. Investigating the Significant Properties of Electronic Content over time (InSPECT) http://www.significantproperties.org.uk/

9. The InterPARES project, http://www.interpares.org/

10. Wilson, A., 2008, Significant Properties of Digital Objects, presented at “What to preserve? Significant Properties of Digital Objects”. Retrieved from http://www.dpconline.org/docs/events/080407sigpropsWilson.pdf

11. DELOS Digital Preservation Testbed. Retrieved from http://www.ifs.tuwien.ac.at/dp/testbed.html

12. OCLC/RLG Working Group on Preservation Metadata, 2002, Preservation Metadata and the OAIS Information Model, A Metadata Framework to Support the Preservation of Digital Objects. Retrieved from http://www.oclc.org/research/projects/pmwg/pm_framework.pdf

13. Derek Sergeant, 2002, Interpretation of the OAIS Model. Retrieved from http://www.erpanet.org/events/2002/copenhagen/presentations/dmserpanet.ppt

14. CASPAR Access Model, http://www.casparpreserves.eu/Members/cclrc/Deliverables/report-on-oais-access-model/at_download/file especially section 2.

15. Michael Factor, Ealan Henis, Dalit Naor, Simona Rabinovici-Cohen, Petra Reshef, Shahar Ronen, IBM Research Lab in Haifa, Israel and Giovanni Michetti, Maria Guercio, University of Urbino, Authenticity and Provenance in Long Term Digital Preservation: Modelling and Implementation in Preservation Aware Storage, TaPP ’09. First Workshop on the Theory and Practice of Provenance. San Francisco, 23 February 2009, http://www.usenix.org/event/tapp09/tech/full_papers/factor/factor.pdf

16. CASPAR Conceptual Model, http://www.casparpreserves.eu/Members/cclrc/Deliverables/caspar-conceptual-model-phase-1-1/at_download/file

17. Giaretta, D., 2007, The CASPAR Approach to Digital Preservation, The International Journal of Digital Curation, Issue 1, Volume 2, http://www.ijdc.net/index.php/ijdc/article/viewFile/29/18

18. CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. See http://www.casparpreserves.eu

19. Mike Coyne, David Duce, Bob Hopgood, George Mallen, Mike Stapleton. The Significant Properties of Vector Images. JISC report, 27 November 2007. http://www.jisc.ac.uk/media/documents/programmes/preservation/vector_images.pdf

20. Mike Coyne, Mike Stapleton. The Significant Properties of Moving Images. JISC report, 26 March 2008. http://www.jisc.ac.uk/media/documents/programmes/preservation/spmovimages_report.pdf

21. Brian Matthews, Brian McIlwrath, David Giaretta, Esther Conway. The Significant Properties of Software: A Study. JISC report, March 2008 http://www.jisc.ac.uk/media/documents/programmes/preservation/spsoftware_report_redacted.pdf

22. Kevin Ashley, Richard Davis, Ed Pinsent. Significant Properties of E-learning Objects. JISC report, March 2008. http://www.jisc.ac.uk/media/documents/programmes/preservation/spelos_report.pdf

23. PARADIGM project, Workbook on Digital Private Papers. http://www.paradigm.ac.uk/workbook/preservation-strategies/file-properties.html

Friday 1 July 2011

Setting priorities: value shift from printed to digital … and vice versa (LIBER 4)

‘Digital and print is an “and-and” proposition for libraries’, said Graham Jefcoate during the special collections session. But the budgets have not increased. So how should we allocate resources? How should we prioritize?

The Dutch KB has been working on a model that takes an integral view of printed and digital collections. The model can help decide what to spend our money on. At LIBER it was presented by Sophie Ham (photo left) and Tanja de Boer. As it has not been published yet, I gladly offer it here in some detail, because I think it can really help libraries make tough decisions (slides courtesy of Sophie Ham).

The model rates the value of (parts of) collections to determine which ones should be conserved or preserved with priority. Here are the rating criteria:

Primary criteria:

[Slide: primary criteria]

Secondary criteria:

[Slide: secondary criteria]

And this is how the process works:

[Slide: the assessment process]

Multiplying primary and secondary criteria might look like this (just an example):

[Slide: example of multiplying primary and secondary criteria]

Sophie gave two extreme examples of how such a value assessment can turn out. This example concerns printed versions. Obviously, newspapers score higher on informational value and medieval manuscripts score higher on uniqueness and historic value.

                       Newspapers    Medieval manuscripts
Informational value         8                 5
Aesthetic value             2                 9
Historic value              3                 9
Use                         8                 3
Uniqueness                  4                10
Condition                   2                 8
TOTAL                      27                44
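The totals in this example are plain sums of the six ratings, which is easy to check. The short Python sketch below does only that; the variable names are mine, and it leaves out the multiplication by secondary criteria mentioned earlier, because those weights are on the slides and not in this post.

```python
# Ratings copied from the example above, in the same order as the criteria.
criteria = [
    "Informational value",
    "Aesthetic value",
    "Historic value",
    "Use",
    "Uniqueness",
    "Condition",
]
ratings = {
    "Newspapers": [8, 2, 3, 8, 4, 2],
    "Medieval manuscripts": [5, 9, 9, 3, 10, 8],
}

# Unweighted total per collection (an assumption: the real model may weight criteria).
totals = {name: sum(values) for name, values in ratings.items()}

# List collections from highest to lowest total value.
for name, total in sorted(totals.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {total}")
# prints:
# Medieval manuscripts: 44
# Newspapers: 27
```

Keeping the ratings per criterion, rather than only the totals, makes it possible to redo the sum later with whatever weights an institution decides on.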

What happens when these collections are digitized? Some values are transferred to the digital copy (e.g., much of the use and informational value for newspapers), but other values cannot be transferred (e.g., the uniqueness of a medieval manuscript). As Claudia Fabian of the Bayerische Staatsbibliothek demonstrated, the use value of a physical object may even increase when it is digitized, because more people become aware of its existence and become interested.

[Slide: values transferred from the physical to the digital copy]

And here is the end result of this (extreme) example:

[Slide: resulting values for the physical and digital collections]

As you can see, the value of the physical medieval manuscript has remained unchanged, whereas some of the value of the printed newspaper collection has been transferred to the digital copy, especially the use value; the KB’s online newspaper database is in great demand by users of all kinds in the Netherlands.

Discussing the KB model, from the left: Sophie Ham, Claudia Fabian and workshop chair Graham Jefcoate.

So, if push comes to shove, and painful decisions have to be made about whether to build new stacks for physical newspapers or invest in expanding the e-Depot that holds the digital newspapers, this (limited) analysis clearly points in the direction of investing in the e-Depot and perhaps deciding to keep only representative selections of printed newspapers.

Obviously, lots of questions remain. At what granularity should one assess collections? How can one make the assessments as objective as possible? And so on. But it is a promising beginning. If you want to know more or contribute to developing the model, please get in touch with sophie.ham@kb.nl.

KB colleagues at Sophie’s presentation: from the left, Els van Eijck van Heslinga, Lotte Wilms, Lieke Ploeger, Victor-Jan Vos.

Value shift: cool conference bag being put to alternative use. Those with short legs thank the sponsors for the abundance of printed promotional material.