Wednesday 30 November 2011

Digital preservation basics in four online seminars

If you are new to digital preservation, you may want to check out four ‘webinars’ organized by the California State Library and the California Preservation Program. The one-hour webinars promise to give you a basic understanding of what digital preservation is all about, of interest especially to librarians and archivists who are involved in developing digital projects.

The first webinar is scheduled for December 8, 12 PM Pacific time (which is 21.00 hrs in Holland). Topics include: ‘storing digital objects, choosing and understanding risks in file formats, planning for migration and emulation, and the roles of metadata in digital preservation.’ See http://infopeople.org/training/digital-preservation-fundamentals.

Wednesday 23 November 2011

‘Mind the Gap’ and Archive-it – on web archiving (iPRES2011, 9)

At a reception the other day, I heard a rumour: because preserving web sites is so difficult, the Internet Archive was said to be considering printing all of its content. I will not disclose the informant’s name – he would not have a future in the digital library where he works. (OK, it was a guy, a young guy, and he works for a Dutch library.) Needless to say, it could not be done even if the Internet Archive wanted to do it. Lori Donovan told the iPRES audience that a single snapshot of the web nowadays runs to 3 billion pages [for the Dutch: 3 miljard pagina’s].

Mind-boggling numbers, especially if you think of the Internet Archive’s shoestring budget.

Anyway, iPRES2011 is over, but I still have some worthwhile stories waiting to be told. One of the issues tabled at iPRES was whether we can (and/or should) safely leave web archiving to the Internet Archive and national libraries.


Logistics put the panel members much further apart than their viewpoints would warrant: they agreed that web archiving is important, and not just for national libraries. From the left: Geoff Harder, University of Alberta, Tessa Fallon, Columbia University, and Lori Donovan, the Internet Archive.

No, said Geoff Harder of the University of Alberta and Tessa Fallon of Columbia University. There are compelling reasons for research libraries to get involved as well. Harder: “This is just another tool in collection building; we should not treat it any differently. You begin with a collection policy and an expanded view of what constitutes a research collection: build on existing collections; find collections where research is happening or will happen.”

I would say that perhaps there are even more compelling reasons to collect web content than, e.g., printed books, because web content is extremely fleeting. Harder told his audience: “Too much online (western) Canadian content is disappearing; this creates a research gap for future scholars and a hole in our collective memory.” He encouraged research libraries to “Mind the Gap – Own the Problem”.

The University of Alberta’s involvement in web archiving started with a rescue operation: a non-profit foundation which created some 80+ websites, including the Alberta Online Encyclopedia, went out of business. This was extremely valuable content, and it needed to be rescued fast.

When a time bomb is ticking …

The University of Alberta decided to use Archive-It, a service developed by the Internet Archive. It is a lightweight tool that is easy to get up and running immediately. Plus, said Harder, there is a well-established toolkit including a dashboard and workflows, you become part of an instant community of users, and your collection becomes part of a larger, global web archive. That last point is in fact a precondition for working with Archive-It: by default, everything that is harvested becomes publicly available worldwide. Harder: “It is an economical tool for saving orphaned and at-risk web content … where we know a time bomb is ticking.”

Have a look at the collections built with Archive-It, I would say to research libraries’ subject specialists. You can include anything that is interesting in your field, such as important blogs, for as long as they are relevant.
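For readers curious what a harvested page actually looks like inside such an archive: Archive-It’s crawls store each captured resource as a record in the standard WARC container format. Below is a toy sketch (not Archive-It’s actual code, and deliberately minimal – real crawlers also record HTTP headers, digests, etc.) of how one WARC 1.0 “resource” record is serialized:

```python
# Toy illustration of a minimal WARC 1.0 record, the container format
# used by web-archiving crawlers: a version line, named headers, a blank
# line, the payload block, and a trailing blank line pair.
import uuid
from datetime import datetime, timezone

def make_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Serialize one minimal WARC 'resource' record as bytes."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"

record = make_warc_record("http://example.org/", b"<html>snapshot</html>")
```

One record per captured URL, concatenated (usually gzipped) into large archive files – which is what makes a 3-billion-page snapshot manageable at all.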


Yunhyong Kim of HATII, Glasgow, takes blogging very seriously and is doing research into the dynamics of the blogosphere.


Is Archive-It durable enough? asked Yunhyong Kim of Glasgow (HATII). Donovan appeared confident that the Internet Archive would be able to continue developing the tool. And I would repeat Harder: when a time bomb is ticking, you have got to go with what is available.

How to prevent redundancy was another question. Should we not keep a register somewhere of what is being archived? Fallon thought that was a good idea, but perhaps it was too early for that: “There are many different reasons for web archiving, different frequencies.” Sorting out what overlaps exactly and what does not is perhaps more work than just accepting some “collateral damage”.

If you want to know more about Archive-It, you can sign up for one of their live online demos. There’s one scheduled for November 29 and one for December 6. See the website.

Archive-It Singapore-style

Monday 21 November 2011

PDF/A-2: what it is, what it can do, what it cannot do, and what to expect in the future

There is a new PDF ISO standard, 19005-2, or PDF/A-2, and therefore the Benelux PDF/A Competence Center decided to organize a seminar. When one of the organizers, Dominique Hermans of DO Consultancy, asked me to do the warming-up presentation, I readily agreed, because I had been hearing some bad things about PDF these last few months, and was eager to find out more. While preparing my own talk (slides at the end of this post) I decided to quote those very criticisms (see LIBER2011 blog post), just to get the ball rolling and challenge the experts to comment:


This slide of mine is a mash-up of three slides by Alma Swan at the LIBER 2011 conference, Open Access, repositories and H.G. Wells

These criticisms come from people who want machines to analyse large quantities of data in a semantic-web/Linked Data-type environment. Are the criticisms justified? For those of you who, like me, are sometimes confused about what is and what is not possible, I will summarize what the experts told the seminar.


The key one-liner came from Carsten Heinemann of LuraTech:

“PDF was designed as electronic paper”

‘It was designed to reproduce a visual image across different platforms (PC, Mac, operating systems), and for a limited period of time.’ As such, PDF was a really good product, because it was compact and complete and it allowed for random access. But there were also many issues, and Adobe has been working on fixing those ever since. This has resulted in an entire family of PDF formats with different functionalities.

PDF/A is the file format most suited for archiving purposes. The new standard, PDF/A-2, is not a new version of PDF/A-1 in the sense that one would need to migrate from 1 to 2, but rather a new member of the PDF family tree with improved functionality over PDF/A-1. In other words: migrating from PDF/A-1 to PDF/A-2 is senseless, but if you are creating new PDF documents you may want to consider PDF/A-2 because of the new functionality to incorporate more features from the original document (e.g., JPEG2000 compression, the possibility to embed one file into another, larger page sizes, support for transparency effects and layers).

To make matters more complicated, PDF/A-2 comes in two varieties: compliance level 2a and compliance level 2b. Level 2a allows for better access by search engines and semantic-web techniques, because it requires that files not only provide a visual image, but are also structured and tagged and include Unicode character maps.
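How does a tool tell which variety it is looking at? A PDF/A file declares its own compliance claim in its embedded XMP metadata, via the `pdfaid:part` and `pdfaid:conformance` entries. Here is a rough sketch of that self-identification – emphatically not a validator, just a sniff of the declared claim (a real checker parses the XMP packet properly rather than regexing raw bytes):

```python
# Sketch: read the PDF/A self-identification (pdfaid:part, pdfaid:conformance)
# from the XMP metadata embedded in a PDF. Part 2 + conformance "A" means
# PDF/A-2a; part 2 + "B" means PDF/A-2b. Declaring it does not prove the
# file actually complies -- that is what validators are for.
import re

def pdfa_identification(pdf_bytes: bytes):
    """Return (part, conformance) from embedded XMP, or None if absent."""
    part = re.search(rb'pdfaid:part(?:>|=")(\d+)', pdf_bytes)
    conf = re.search(rb'pdfaid:conformance(?:>|=")(\w)', pdf_bytes)
    if not (part and conf):
        return None
    return int(part.group(1)), conf.group(1).decode().upper()

# A fragment of the kind of XMP packet a PDF/A-2b file carries:
xmp = b'<pdfaid:part>2</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>'
```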

Heinemann concluded: XML is for transporting data; PDF is for transporting visual representations. To which I might add: XML is for use by machines, PDF is for use by humans.

Misuse of PDF is easy

Raph de Rooij of Logius (Ministry of the Interior) told his audience that one should not be too quick to say that something is “impossible” with PDF. A lot is possible, but you have to use the tools the right way – and that is where things often go wrong.


Raph demonstrated that most PDFs put online by government agencies do not meet the government’s own requirements for web usability – including access by those who are, e.g., visually impaired. “The many nuances of the PDF discussion often get lost in translation,” he said. The trick is to pay a lot of attention to organizing the work flow that ends in PDFs.

PDF is no silver bullet

Ingmar Koch, a well-known (blogging) Dutch public records inspector, has seen many examples of PDF misuse. “Public officials tend to think of PDF as a silver bullet that solves all of their archiving problems.” But PDF was never designed to include anything that is not static (Excel sheets with formulas, movies, interactive communications, etc.).


From the left: Caroline vd Meulen, Ingmar Koch, Bas from Krimpen a/d IJssel and Robert Gillesse of the DEN Foundation.

From a preservation point of view, I heard some shocking case studies from public offices. An official will type the minutes of a council meeting in Word, make a print-out, have the print-out signed physically, then scan and OCR the signed copy and convert it to PDF for archiving. I dare not imagine how much information gets lost in the process. But then again, we all know that data producers’ interests are often different from archives’ interests. Public offices just want to make a “quick PDF” and not be bothered by all the nuances.

How about validation?

There is a lot of talk about “validating” PDF documents. First of all, PDFs are created by all sorts of software, and what that software produces often does not conform to the ISO standards and is thus rejected by validators. Things get more confusing when different validators return different verdicts. Heinemann explained: “That’s because some validators only check 30% of the requirements, whereas others check 80%. The latter may find something the first did not see.”

At the end of the day …

It seems that, indeed, there are millions and millions of PDFs out there that can only provide a visual representation and are no good when it comes to Linked Data and the Semantic Web. But PDF is catching up, adding new features all the time. I understand that we may even expect a PDF/A-3, which supports embedding the original source file in the digital object. Ingmar Koch did not seem to be too happy about such functionality. It would make his life as a public records inspector even harder. But from a preservation point of view, that just might be as close to a silver bullet for archiving as we will ever get.

Meanwhile, if you want to use PDF in your workflow, getting some advice from an expert about what type of PDF is appropriate in your case is called for!

Comments by Adobe

Adobe itself was very quick to respond to this blog post in an e-mail I found this morning. Leonard Rosenthol, PDF Architect, was not very pleased with the picture painted by the above workshop – as a matter of fact, he used the word “appalled”. He asserted that PDF and XML/Linked Data go very well together and that various countries and government agencies have already adopted a scenario that ‘presents a best of two worlds’. Here is the link he sent to a recent blog post by James C. King that describes how it is done: http://blogs.adobe.com/insidepdf/2011/10/my-pdf-hammer-revision.html.

That blog post is an interesting addition to the workshop results (confirming Raph de Rooij’s assertion that “nothing is impossible”), but it does not take away the fact that PDF is often misused. I would guess that is because it is complicated stuff. “Making a quick PDF” just does not do it. The recommendation to seek expert advice, therefore, stands!

Lastly, here is my own presentation: a broad overview of developments in the digital information arena to start off the day – in Dutch:

For the Dutch fans: Ingmar Koch has blogged about this event here, and the slides will become available here. Thanks also to KB colleague Wouter Kool for helping me understand PDF.

Sunday 20 November 2011

‘Bewaar als …’ (‘Save as …’): crystal-clear advice on digital archiving


Karin van der Heiden, together with Premsela (the Netherlands Institute for Design and Fashion), has developed a crystal-clear leporello (fold-out brochure) that gives designers practical guidance on organizing and storing their information properly – and that is where all long-term access begins. Important not only for designers, but for everyone who creates digital documents and wants to keep them safe!

Congratulations, Karin, on this production!

Have a look at the accompanying website, http://bewaarals.nl/, and spread the word!

PS: Below are the full sheets, as .jpeg files.



An English edition will be made available in the US in a few months. I will keep you posted.

Tuesday 8 November 2011

Aligning with most of the world (iPRES2011, 8)

iPRES is organized alternately in Europe, North America and Asia in order to include people and discussions from all continents – Africa and South America are still on the Steering Committee’s wish list. However, when you looked at the list of presenters at iPRES2011, the usual suspects dominated: Europe, North America, Australia/New Zealand. I asked a Programme Committee member about that, and he told me that some papers had been submitted from Asia, but they were deemed not good enough to make it into the programme.

To my mind, there is a bit of a contradiction here. Of course we want high-quality papers at iPRES, but it is risky to take our (western) stage of development as a yardstick for what constitutes “quality”. As Cal Lee phrased it: “Digital preservation tends to be quite regionally myopic.” I would suggest that the next iPRES organize a special track or workshop day for those who are just beginning to think about digital preservation, or who work in a very different context than a “western” one, focusing on their specific circumstances and challenges.


Fortunately, there was one workshop that expressly invited members from “other” countries. It was the workshop “Aligning national approaches to digital preservation”, a follow-up from last May’s Tallinn conference (see my blog posts), put together by Cal Lee from the University of North Carolina. Yes, there were usual suspects presenting as well (including yours truly), but in this post I shall mostly ignore them in favour of new input:

Özgür Külcü, from Hacettepe University, Ankara, Turkey, described Turkish participation in the AccessIT project, which developed an online education module with practical information about digitisation and the protection of cultural heritage. And in the context of the InterPares 3 project the Turkish team is helping translate digital preservation theory into concrete action plans for organizations with limited resources. But many issues remain:


Masaki Shibata from Japan revealed the results of a DRAMBORA 2.0 test audit carried out at the National Diet Library (NDL) in Japan:



Shibata admitted that, unfortunately, the risks mentioned in the final report largely remain unsolved. ‘We were caught up in an illusion that there was an ideal solution to ensure long-term digital preservation,’ he said. ‘We tried to address the risks only by means of systems development.’ Also, specific Japanese and NDL circumstances played a role, such as the rigidness of the fiscal, budget, employment and personnel system; language difficulties and geographical constraints; lack of digital conservators; and a cultural context of preservation. Shibata concluded that an international alliance for digital preservation ‘would become a boost/tailwind for national policymaking in Japan.’

Daisy Selematsela from the National Research Foundation of South Africa described the outcomes of An audit of South African digitisation initiatives before focussing on “Managing Digital Collections: a collaborative initiative on the South African Framework”, a report published earlier this year which is meant to provide data producers with high-level principles for managing data throughout the digital collection life cycle, and on the train-the-trainer programme:


As for international alignment, Selematsela concluded:


Raju Buddharaju of the National Library of Singapore (photo right) suggested that we first need a better understanding of what we mean by “alignment” and what we mean by digital preservation (what do we include, what do we exclude) before we can try and come to workable initiatives.

The workshop was originally designed as a one-day event, but in the end the conference organizers only gave us 3 hours on Friday afternoon. The good news was that despite the time of day and conference fatigue, more than forty participants showed up, and they conducted animated discussions on topics such as costs; public policy and society; and preservation & access.

But it was difficult to reach any concrete conclusions. There are many good intentions, but it continues to be difficult to find the common ground that leads to practical results. Steve Knight of the National Library of New Zealand (photo left) questioned whether there is any real will to collaborate, e.g., on putting together a much-needed international (technical) format registry. Talking about education, finally, Andi Rauber suggested that because there is no well-defined body of knowledge, we might prefer a range of “friendly competing curricula” rather than an aligned body – for the time being.

Which only goes to show that, like Singapore itself, alignment comes in many shapes and sizes.


Disaster planning and enabling smaller institutions (iPRES2011, 7)


As this iPRES was moved from Tsukuba, Japan, to Singapore because of the earthquake and tsunami in Japan in March this year, it was only fitting that iPRES2011 should include a panel session on disaster planning. Neil Grindley (JISC) asked if digital preservation does not implicitly include disaster planning, but Angela Dappert (DPC) argued that with an entire infrastructure going down, the problems will be massively larger. Plus, as Arif Shaon (STFC) observed, ‘Grade A preservation should include it, but we have not reached that stage yet.’

Shigeo Sugimoto of Tsukuba, who would have been iPRES’s host in Japan, took a forward-looking view of disaster planning. Many physical artefacts were lost during the earthquake, and having lots of digital copies at different locations can certainly help rescue cultural heritage, provided the metadata are kept at different locations as well.

Shigeo Sugimoto (right) with José Barateiro of Portugal during the disaster planning session.

There is one catch, though: many smaller institutions do not have the means (money, staff) to build digital archives. Therefore, in Japan the idea has been put forward to design a robust and easy-to-use cloud-based service for small institutions:


In the Netherlands, I am involved in two working groups of the Dutch digital preservation coalition (NCDD) that are looking at the same problem: how to enable smaller institutions to preserve their digital objects. Professor Sugimoto and I have agreed to stay in touch and exchange information and experiences.

Sunday 6 November 2011

‘At scale, storage is the dominant hardware cost’ (iPRES2011, 6)

It is not uncommon for conferences to be ‘interrupted’ by sponsor presentations. When I say ‘interrupted’, I do not necessarily mean that such talks are unwelcome. Conference days tend to be packed from early morning to late at night, and such sponsor interventions can be quite pleasant – a moment to doze off or to check your e-mail. Robert Sharpe (photo) of Tessella (vendors of the Safety Deposit Box or SDB system) gave us no such respite. In an entertaining presentation he shared some scalability experiences with us.

The case study was FamilySearch, which ingests no less than 20 terabytes of images a day. That was quite a scalability test for Tessella’s Safety Deposit Box system, and it tested some of Sharpe’s own assumptions:

  • Tessella expected that they would need faster, more efficient tools, but it turned out that existing tools (DROID, Jhove, etc.) were easily fast enough.
  • Tessella expected reading and writing of content to be fast compared to processing, but it turned out that reading and writing were not fast enough; the process required parallel reads and parallel writes. Thus the hardware cost is dominated by non-processing costs.
  • Tessella (and most of us) expected storage to be cheap, but at scale it turned out to be the dominant hardware cost. Reading and writing hardware came to about GBP 80,000. Storage came to GBP 100 per terabyte of content (3 copies), which amounts to GBP 730,000 a year, every year, excluding refreshment costs.
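A quick back-of-the-envelope check shows how those figures hang together (using only the numbers quoted from the talk; the per-terabyte price covers three replicated copies):

```python
# Back-of-the-envelope check of Sharpe's storage numbers.
TB_PER_DAY = 20      # FamilySearch ingest rate, terabytes of images per day
GBP_PER_TB = 100     # quoted storage cost per TB of content (3 copies)

annual_ingest_tb = TB_PER_DAY * 365          # 7,300 TB of new content a year
annual_storage_cost = annual_ingest_tb * GBP_PER_TB

print(annual_storage_cost)  # 730000 -- matching the quoted GBP 730,000 a year
```

And that cost recurs: each year's 7,300 TB stays on disk, so year two pays for two years' worth of accumulated content, which is exactly why storage dominates at scale.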

Sharpe concluded that we do not need faster tools – but we do need better & more comprehensive tools. We need systems engineering, not just software engineering. And we need enterprise solutions: automation, multi-threading, efficient workflow management and automated issue handling.

All of which, of course, Rob will be happy to talk to you about.

PS: In response to this blog post, Rob wrote to me: ‘A further point I was trying to make in the rest of my talk is you don't need especially powerful application servers to do this: you can do it fairly cheaply (certainly when compared to other costs at such scale).’


Scale Singapore-style: the Marina Bay Sands Hotel. The ship-like contraption on top of the three towers holds lush tropical gardens, a 150-meter swimming pool, restaurants, and a bar.

Friday 4 November 2011

Taking emulation another couple of steps further (iPRES11, 5)

It is Friday morning, 9 am. In the other room they are talking about cost modelling. I have opted for the emulation session, continuing the thread from the KEEP workshop I blogged about last week. Judging by the number of participants, emulation is not yet a “hot” topic in the DP community. But perhaps it is only because it is Friday morning, the third conference day – people keep sneaking in with cups of black coffee in their hands.

The first paper conveys a really bright idea, which has also come up in the Netherlands (Maurice van den Dobbelsteen at the National Archives). If your job is archiving large amounts of data from a controlled environment (e.g., a government ministry), would it not be a great idea to simply make a virtual copy of the entire hardware/software environment in which the objects are produced? Then, when the data come into the archive, all you have to know is when they were produced and at which ministry, and your emulation environment is ready to go. That would save loads of work at the preservation stage.

A similar procedure could work if you want to harvest the archives of significant persons. For example, there was a project to emulate Salman Rushdie’s computer environment, in order to be able to access his files later.


Euan Cochrane (left) and Dirk von Suchodoletz

Euan Cochrane of Archives New Zealand and Dirk von Suchodoletz of the University of Freiburg present this approach at iPRES2011. They ran some tests and inevitably found some technical problems, but nothing that cannot be overcome. And I can imagine that, if it works, it can really be a time and money saver, especially in the long run. However, as always, there are challenges. Some are inherent to all emulation strategies: you need workable emulators, and emulators themselves become obsolete – for which, of course, there is the KEEP approach which I blogged about earlier. Bram Lohman is presenting KEEP here in a minute.

One obstacle is unique to this particular approach: the data producers should include ‘emulatability’ in their calls for bids for computer systems. Technically this should not be too difficult, but in terms of licenses, there may be catches. I will blog about the copyright problems later. The KEEP project did a lot of work on that.

The next presentation raises the level of complexity a bit – at least for non-techies like me. It is entitled: “Using emulation as a tool for migration”, by Klaus Rechert et al.


Emulation developers: from the left, Dirk von Suchodoletz, Klaus Rechert and Euan Cochrane

Because I did not understand it very well, I asked Klaus during the coffee break. This is my layman’s version of how it works: almost every software programme comes with “little” migration tools, e.g., between Word 2003 and Word 2010. If you have a file that does not run on your present software/hardware combination and there is no direct migration tool, you can re-create the original environment (emulation), perform the “little” migration there, and then use the resulting file on your present system. The advantages are that you do not need to write new migration routines and you can use the file within your present context – without the old “handicaps”, if you will. See the slide below.
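My layman’s version above can be sketched as a search problem: chaining “little” converters together until you reach a current format, with emulation supplying the environment each step actually runs in. The converter graph below is entirely hypothetical (made-up format names, not from the paper), just to show the idea:

```python
# Toy sketch: finding a chain of "little" migration tools is a
# shortest-path search over a graph of converters. Emulation's role
# (not modelled here) is to provide the old environment in which each
# intermediate conversion step can actually be executed.
from collections import deque

# Hypothetical converter graph: format -> formats its native tools can save to.
CONVERTERS = {
    "word95": ["word2003"],
    "word2003": ["word2007"],
    "word2007": ["word2010", "odf"],
}

def migration_path(src: str, dst: str):
    """Breadth-first search for the shortest converter chain, or None."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in CONVERTERS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

So a Word 95 file would travel word95 → word2003 → word2007 → word2010, each hop performed by a converter that shipped with the software of its day.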

The case studies in this session include Mark Guttenbrunner on home computer software emulation, Roman Graf & Reinhold Huber-Mörk on braille conversion, and Geoffrey Brown (Indiana University) on emulating some interactive Voyager (1989-1997) publications on CD-ROM: classic Mac applications such as Robert Winter’s interactive Beethoven’s 9th (you could play themes, play notes, play synthesizer versions without certain rhythms, etc.). Let me give you the screenshot as a reminder of what we do all our hard work for (you have to imagine the music). You can find the technical details in the proceedings.

With vintage American pragmatism, Brown said: ‘Our goal is to demonstrate that emulation is practical’ – and that is what we all want. But he also said: ‘At the moment, a lot of this is hobby stuff.’ And there seems to be a lot of that in the emulation world. To really make it work, we need more sustainable initiatives such as, perhaps, the Open Planets Foundation (OPF Director Bram van der Werf is in the room …).

Geoffrey Brown answering questions.

After this semi-live blogging session, we are off to lunch. This afternoon there is the workshop on International Alignment in Digital Preservation, in which I am involved myself. As it is Friday afternoon, the last conference day, and as we have competition from a session on web analytics and from organized visits to the National Library and National Archives, we are not expecting the greatest turnout. But quality can do a lot to make up for quantity ;-)

Expect a few more posts, though. Because of time constraints I had to forego some good stuff. I will report on that in the course of the next few days.

What to Wear is not only a ladies’ question here – the air conditioning sometimes works too well for some of us.


Thursday 3 November 2011

‘Metadata is a love note to the future’ (iPRES2011, 4)

The quote in the title comes from a tiny sticker I found here in the conference room. I almost missed it, it is that small. I do not know who put the stickers out there. Perhaps the barcode could tell me, but I haven’t downloaded the app yet. In any case, the quote is one to pass on to you, and in a (bit of a creative) way, it brings together the two keynote speeches on day 2 of iPRES2011. [Post script: Henk Koning of DANS tells me that this is the URL: http://mialnttf.tumblr.com/].

Mick Newnham of the Australian National Film and Sound Archive (photo left) spoke about ‘Preserving motion film, so much to do and so little time …’. Mick made no secret of his ‘lovely chauvinism’ for analogue film – for him, the film experience includes a buzzing reel of acetate film (nitrate being a bit too flammable, after all). Like Richard Wright (BBC) last week in Hamburg, Newnham reported that the transition to digital is slow. And it is not as easy as, e.g., for documents, because films are complex objects containing high-quality images and high-quality sound, and an ISO file format such as Motion JPEG 2000 works for the image but does a mediocre job with the sound. And the files are huge: a 35-mm film with 350,000 frames will turn into a 10 TB digital object. A complex object at that, with formidable metadata challenges (here is the link to this post’s title), and with complicated intellectual property regimes for script, sound, music, set design, etc., causing Newnham to quote Kara Van Malssen: ‘It doesn’t matter what the question is, the answer is metadata.’

At the end of his talk, Newnham showed a slide with the pros and cons of analogue versus digital preservation, and the very fact that he showed the slide at all was notable. In the library and archives community nobody thinks in those either/or terms anymore. But then again, libraries tend to keep the physical originals, whereas Newnham made clear that preserving analogue films is very demanding and thus expensive.


In the next keynote the focus shifted to research data, with Ross Wilkinson of the Australian National Data Service (ANDS). Like Seamus Ross yesterday morning (see earlier post), Wilkinson made the case for preservation in terms of added value rather than controlling risks (see last week’s post). With research driving innovation, however, I think his argument is a lot more compelling (towards funders, that is) than just referring to our ‘memory’ and ‘identity’ – which are hard to express in dollars and euros.

Wilkinson (photo left) focussed on increasing the value of research data, among other things by connecting different databases to generate new information and enable new types of research (and some of that value must come from metadata – hence another creative link to my title). Wilkinson’s added-value list is as follows:

‘Data is more valuable, if …

  • it can be used later
  • it is able to be used by more researchers
  • it is able to be used to answer new questions
  • it is able to be integrated to explore new data spaces …’

‘To do so, it must be managed, connected, discovered, and then re-used – it has to move out of the lab.’

Who is going to do all that work? Wilkinson believes in partnerships – many have a role to play in the process. The researcher, the institution, ‘data carers’ in all shapes and sizes (data librarians, data scientists, etc.), discipline-specific repositories and more general archiving institutions such as ANDS (and DANS in the Netherlands). 


Moving research data from the private to the public sphere: each partner has a role to play in adding value to the data

Plus, of course, a number of stakeholders have a role to play in motivating researchers to share: funders, and key players in scholarly communication such as publishers. If data become citable, more researchers will be willing to share.

To conclude, this is Wilkinson’s favourite case study of research data reuse:


Q&A: Arif Shaon of STFC in the UK

Conference dinner in the library’s plaza. Yes, the food was again wonderful. And the lemonade kept us sharp for the next day (the barley flavour was especially tasty).


Our kind hosts put an umbrella in the conference pack …

Wednesday 2 November 2011

On governance, trust and certification (iPRES2011, 3)

I am going to blog about this parallel session in reverse order, because that way it makes more sense to me.

At the end of the session (but at the beginning of this post!) Devan Ray Donaldson of the University of Michigan reminded us what ‘trust’ (as in Trustworthy Digital Repositories, or TDRs) is all about: end users (those who have had no involvement in either the production or the archiving of a document) need some assurance that the document they are getting from an archive is, in fact, authentic – that it is what it is supposed to be, and has not been tampered with or altered in any way. [BTW: that does not mean that the archive guarantees that the information in the document is reliable. The archive does not know that. The only thing an archive can do is assure that what the end user gets is the same thing that originally came into the archive.]
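One common way archives underpin exactly that assurance – that what goes out is bit-for-bit what came in – is fixity checking. A generic sketch (not Donaldson’s study design): record a cryptographic checksum in the preservation metadata at ingest and recompute it at delivery, so any alteration since ingest is detectable:

```python
# Sketch of fixity checking: a checksum recorded at ingest lets the
# archive demonstrate later that a delivered copy is unaltered.
import hashlib

def fixity(data: bytes) -> str:
    """SHA-256 checksum of a document's bytes."""
    return hashlib.sha256(data).hexdigest()

# At ingest: compute and store the checksum in preservation metadata.
ingested = b"council minutes, 12 May 2011"
stored_checksum = fixity(ingested)

# At delivery (perhaps decades later): recompute and compare.
def still_authentic(delivered: bytes) -> bool:
    return fixity(delivered) == stored_checksum
```

Note that this only proves bit-level integrity, not reliability of the content – precisely the distinction made in the bracketed aside above.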

Archives know that end users care about trust, about authenticity. So Donaldson wants to study how we communicate with the end user about that authenticity. If we put some seal of approval on a document, will the end user trust it more than if we do not? That is an interesting question. Donaldson intends to use HathiTrust documents to test this, and, to me, that is the only ‘flaw’ in his plan – if such is the word. HathiTrust contains digitized book pages, and that type of document is a lot easier to trust and regard as ‘authentic’ than, e.g., e-mail. Donaldson agreed, but, as he said: you’ve got to start somewhere.

Next (in whichever order) came Olivier Rouchon of CINES, a large data centre in France (photo right: ‘Can’t I even have lunch without being photographed?’ :-)). CINES finds itself in a strange political situation: as an organization CINES has a remit for only four years, but it also has the express mandate to do long-term preservation, and its clients ask for 30-year guarantees. This is a strange dichotomy, and CINES has decided to seek certification as a trusted repository to a) lock in its mission, and b) attract larger volumes of data to be preserved.

CINES went through various (self-)audits to attain ever higher levels of certification. That took a lot of work. Rouchon estimates that 1 FTE of his 11 FTEs is constantly busy with audits. But, says Rouchon, ‘that should not stop you from doing it.’ First of all, it is mostly a lot of work the first time around. Once you have a good system in place, the next audits become business as usual. Secondly, CINES is using the audit system as an internal quality assessment instrument to keep improving the quality of the service. By comparing the outcomes of audits over time, the organization can measure its progress.

The EU is now building a three-tiered certification system: the first level is the relatively lightweight Data Seal of Approval, then comes a self-audit, and the highest level of certification is awarded by an external audit. The APARSEN project recently did a number of test audits, including one at CINES, and will publish the results shortly.

Steve Knight from the National Library of New Zealand enquired how we know that we can trust the auditors doing the auditing. Rouchon trusts his own (internal) auditors, and part of the aim of the APARSEN test audits was to train auditors.

Having talked about trust, and about auditing trust, I now come to the last (first) presentation. Basically, it was about building all the capabilities you need to assure trust and prove trustworthiness into your system. It was also about not treating digital preservation as an issue (and a system!) that stands apart from the rest of your organization, but building an information system for your organization that integrates digital preservation requirements – making them ‘ubiquitous’. Christoph Becker of TU Wien (photo right) told his audience that we have lots of models and concepts and frameworks (OAIS, TRAC, RAC, Drambora, Platter, etc. etc.), but ‘we still lack a holistic view.’ His team takes its cue from frameworks from the IT industry, such as ‘enterprise architecture’ and COBIT (goal-oriented, process-oriented, control-based), to build a Maturity Model based on CMM: you measure your maturity against a set of criteria to identify places for improvement … and then I lost the story. My mind tends to switch off when the discussion becomes abstract and high-level. It is a flaw, I know, but one I have to learn to live with. The basic idea, however – integrating digital preservation – is a good one, and so is using existing industry frameworks, so for those of you who are better at high-level discussions, do check out Becker’s paper in the proceedings, which will come online soon. The paper is called “A Capability Model for Digital Preservation: Analysing Concerns, Drivers, Constraints, Capabilities and Maturities”.

Parallel session ‘Governance’