Duurzame toegang (long-term access): September 2011

Wednesday 21 September 2011

'The Really Foolproof Solution for Digital Preservation ....

... is ... money ... enough of it, and for an indefinite period. If this cannot be guaranteed, the rest of this book will be essential for you', writes David Giaretta (Alliance for Permanent Access/STFC) at the beginning of his new book. I am not giving you the full title yet, because I am afraid that it might scare you off. Certainly, if I had just seen the title without knowing the book or the background, I would have thought that this book was not for me. And that would have been a mistake, because, although I am not a technical expert, I am thoroughly enjoying it and learning a lot in the process. And so may you, especially if you are one of the many readers who read my post about Giaretta's workshop at the LIBER conference in June. This book offers much more where that came from.

David Giaretta with his book.

Contrary to what the title would have you expect (all right then, here it is: Advanced Digital Preservation), there is lots of good solid basic digital preservation information in this book. OAIS, for example. Everybody has seen the functional model's diagram, but how many of us have actually read the standard and understood the philosophy behind it? Giaretta guides us through it in detail. And in pleasantly understandable language. Plus: the importance attached to a clear definition of the designated community in the OAIS model - this is crucial to what follows.

Inevitably, things do get technical in the course of the book; after all, if we did not have technical problems we would not have a digital preservation problem. But the not-too-technical reader is always warned in good time that a particular section or chapter is perhaps one to skip. Yet the essence of Giaretta's theory is worth noting for everybody. In his view, migration and emulation, our best-known preservation strategies, are perhaps good enough for simple objects (PDFs, TIFFs, JPEGs), but are inadequate for many of the complex objects found among research data, Giaretta's main focus (hence the title Advanced Digital Preservation). No-one will doubt that scientific research often generates objects that are very difficult to preserve - they are complex, dynamic, often non-renderable, and so forth.

If you do not preserve research data, this book is still important for you, because other sectors (cultural heritage, archives) that started out with simple objects will increasingly be faced with more complex varieties, as content producers are discovering the extra possibilities and putting them to good use.

The 'Droste' effect

To tackle the problems of more complex objects, Giaretta and the CASPAR project team developed a theory around the Representation Information Network. Simply put: a (or rather: any) data object is nothing but ones and zeros; these must be accompanied by representation information in the metadata that tells you what you need to 'independently interpret, understand and use' (in OAIS language) the data object. The data object can be a single file or multiple files, and the representation information can be anything from a scribbled handwritten note to a complex machine-readable formal description (pp. 17 ff.). In Giaretta's more accessible advocacy language: you have something that is unfamiliar (ones and zeros) and the representation information gives you what you need to make it familiar. However, representation information is not a straightforward thing: it is more like a set of Russian babushka dolls (in Dutch we would refer to the 'Droste effect', after the cocoa nurse who serves from a cocoa tin that has her own image on it, who serves from a cocoa tin that ...): a Word document cannot be understood with Microsoft Office software alone; you will also need the operating system, and the programming language, and so forth and so forth. You will need every dictionary, every definition, every standard, every specification that is used somewhere along the line - until you connect with the knowledge base of your designated community, that is: until you make the connection with what your designated community has at its disposal in terms of software, hardware and the knowledge to work with them.

Over time, as technology evolves, the 'unfamiliarity' of a digital object will increase and the amount of representation information needed to connect with your designated community will increase with it. Our job is to manage that process and make sure there is always enough representation information to connect with our users. Preferably in an automated way, because there is no way we can do this manually (unless of course we have a truly endless flow of money ...).
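For readers who think in code, the nesting described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not CASPAR's actual tooling or anything taken from the book: the network contents, the `REP_INFO` table and the function `missing_rep_info` are all hypothetical. The sketch walks the chain of representation information for an object and stops descending as soon as it reaches something the designated community already has in its knowledge base.

```python
from typing import Dict, List, Set

# Hypothetical representation information network (illustrative names only):
# each item maps to the further representation information needed to use it.
REP_INFO: Dict[str, List[str]] = {
    "report.doc": ["Word file format"],
    "Word file format": ["word processor software"],
    "word processor software": ["operating system"],
    "operating system": ["hardware architecture"],
    "hardware architecture": [],
}


def missing_rep_info(
    obj: str, network: Dict[str, List[str]], knowledge_base: Set[str]
) -> List[str]:
    """List the representation information that must be kept with `obj`.

    The walk stops descending at anything the designated community
    already has at its disposal (its knowledge base).
    """
    needed: List[str] = []
    seen: Set[str] = set()
    stack = list(network.get(obj, []))
    while stack:
        item = stack.pop()
        if item in seen or item in knowledge_base:
            continue  # already handled, or already familiar to the community
        seen.add(item)
        needed.append(item)
        stack.extend(network.get(item, []))
    return needed


# Today: the community still has the word processor, so little is missing.
today = missing_rep_info("report.doc", REP_INFO, {"word processor software"})

# Later: only the hardware is still familiar, so the gap has grown.
later = missing_rep_info("report.doc", REP_INFO, {"hardware architecture"})

print(today)
print(later)
```

The second call shows the point about time: as the community's knowledge base shrinks, the same object needs more representation information to stay usable.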

Giaretta and his CASPAR team argue that this is the only method that will work for all digital objects, no matter how simple or complicated. The trick will of course be to build that automated process that will keep our digital objects "fresh".

More research is needed to turn this theory into something practical. Meanwhile there is this book to enjoy and learn from, including excursions into non-technical territory: repository audits, preservation chains, business models, stakeholder analysis, and more. Giaretta's fluid style of writing, the many cross-references, summaries, and warning signs have enabled me to delve deeper into the technical level than I thought possible. And I am still learning.

What I would like to see next, however, is more interaction between what Giaretta is developing and what the Open Planets Foundation led by Bram van der Werf (and the related SCAPE project) is working on. What would be really great for the community to have is their joint view on what works and what does not, and in which circumstances, and on the direction R&D should take. How about it, gentlemen?

David Giaretta [et al.], Advanced Digital Preservation (Springer, 2011, ISBN 978-3-642-16808-6, €99.95).