NEDLIB CONTRIBUTION TO THE REVIEW OF OAIS

June 2000

Applying the OAIS Reference Model to the Deposit System for Electronic Publications (DSEP)

Applying the OAIS Reference Model

This is a discussion of the Open Archival Information System (OAIS), a proposed ISO archiving standard. It describes how project NEDLIB has applied, scoped and extended this Reference Model to the needs of digital deposit libraries. The findings of NEDLIB are reported to the Consultative Committee for Space Data Systems (CCSDS) through the Centre National d'Etudes Spatiales (CNES), as a contribution to the review process of the OAIS Reference Model.

Application domain

Project NEDLIB - Networked European Deposit Library - is a collaborative project of European national libraries. It aims to construct the basic infrastructure upon which a networked European deposit library can be built. The objectives of NEDLIB concur with the mission of national deposit libraries to ensure that electronic publications of the present can be used now and in the future.

Project NEDLIB was launched on 1st January 1998 with funding from the European Commission's Telematics Application Programme. The project ends on 31st December 2000.

Applicability of the OAIS Model

The OAIS Model is applicable to any archive. It is specifically applicable to organisations with a responsibility to make information available for the long term.

The NEDLIB consortium adopted this Reference Model as a basis for modelling the DSEP. The decision to adopt OAIS was taken at the NEDLIB Paris meeting in December 1998. By then the OAIS Reference Model (issue 4.0, dated September 1998) had developed into a mature conceptual framework, providing a coherent and consistent view of functions and data flows pertaining to digital archives. At that time, NEDLIB had started to map the workflow for handling electronic publications onto a structure of functional entities [ref. 2] and the project decided to extend this effort by mapping the workflow onto the OAIS functional entities. Not surprisingly, it turned out that the functional structure of a deposit system could be appropriately represented by the OAIS Model.

Advantages of applying the OAIS Model

From the start it was recognised that, by applying the OAIS Model, deposit libraries can benefit from the advantages of international standardisation. By using a common reference model, a common terminology and a common conceptual framework, it is much easier to share ideas and exchange experiences, not only between deposit libraries but also across institutional boundaries, for example between libraries and archives.

In the NEDLIB project all work has been related to OAIS: process design was done on the basis of OAIS modelling, tools were described in terms of OAIS functional entities and tested according to an OAIS-based scorecard, and metadata were specified in the context of the OAIS data model. This has facilitated the consensus-building process considerably, both within the project and during concertation meetings with non-NEDLIB deposit libraries, such as the British Library and the National Library of Australia, and with other related initiatives and projects, such as CEDARS.

In the longer term it is hoped that IT vendors and system developers will adopt the OAIS framework as a basis for implementing deposit systems and for developing ready-to-market products. This would facilitate open systems development for the benefit of a much larger community than would be the case if archival institutions invested in tailor-made systems on an individual basis.

This is also mentioned as part of the objectives of the OAIS Reference Model [see ref. 1, section 1.1 Purpose and scope].

NEDLIB is keen to help progress the OAIS standardisation process and to provide feedback in order to ensure that generic deposit library requirements are catered for by the Reference Model.

Description of the OAIS Model

The Reference Model addresses a full range of archival functions including Ingest, Archival Storage, Data Management, Access, and Administration (Figure 1).

Figure 1. OAIS Functional Entities

[OAIS, Figure 4-1]

It also addresses the data models used to represent digital information in archives from a preservation perspective. The OAIS defines an Information Object as a Data Object interpreted using its Representation Information. This is shown schematically in Figure 2.

Figure 2. Obtaining Information from Data

[OAIS, Figure 2-2]

In order for this Information Object to be successfully preserved, it is critical for an OAIS to clearly identify and understand the Data Object (the bits) and its associated Representation Information (implicitly hidden in the interpreting/rendering software).
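The definition above can be illustrated with a minimal Python sketch. The class and function names are illustrative assumptions and do not come from the OAIS document, and the Representation Information is reduced here to a character-encoding name, which is a strong simplification. The point is only that the same bit-stream yields different information when interpreted with different Representation Information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    """The bit-stream itself (hypothetical class name)."""
    bits: bytes


@dataclass(frozen=True)
class RepresentationInformation:
    """Illustrative stand-in: only a character-encoding name."""
    encoding: str


def interpret(data: DataObject, rep: RepresentationInformation) -> str:
    """Return the information obtained by interpreting the Data Object
    with the given Representation Information."""
    return data.bits.decode(rep.encoding)


# One bit-stream, two interpretations.
data = DataObject(bits=b"caf\xc3\xa9")
as_utf8 = interpret(data, RepresentationInformation("utf-8"))
as_latin1 = interpret(data, RepresentationInformation("latin-1"))
```

If only the bits were preserved and the encoding were lost, a future reader could not tell which of the two interpretations was the intended one.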

Information transmissions internal and external to the OAIS archive functions occur by way of information packages. The information package contains the information object that needs to be preserved for future access. Three different types of Information Packages are defined: the Submission Information Package (SIP), the Archival Information Package (AIP) and the Dissemination Information Package (DIP). This is shown in Figure 3 below.

Figure 3. OAIS Archive External Data Flows

[Adapted from OAIS, Figure 2-4]

An Information Package (IP) is a conceptual container of two types of information called Content Information and Preservation Description Information (PDI). The Content Information and PDI are viewed as being encapsulated and identifiable by the Packaging Information. The resulting package is viewed as being discoverable by virtue of the Descriptive Information.

These Information Package relationships are shown schematically in Figure 4.

Figure 4. Information Package Concepts and Relationships

[OAIS, Figure 2-3]
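As a rough sketch of these concepts, an Information Package could be modelled as a record with four parts, tagged with one of the three package types. The field names, the use of dictionaries and the sample values are assumptions made for illustration only; OAIS defines the package conceptually and prescribes no data structure. The hypothetical discover function shows that finding a package relies on the Descriptive Information alone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class PackageKind(Enum):
    SIP = "Submission Information Package"
    AIP = "Archival Information Package"
    DIP = "Dissemination Information Package"


@dataclass
class InformationPackage:
    kind: PackageKind
    content_information: bytes                        # what is to be preserved
    preservation_description_information: Dict[str, str]  # PDI
    packaging_information: Dict[str, str]             # binds and identifies the parts
    descriptive_information: Dict[str, str]           # makes the package discoverable


def discover(packages: List[InformationPackage], **criteria: str) -> List[InformationPackage]:
    """Find packages by matching their Descriptive Information only."""
    return [
        package for package in packages
        if all(package.descriptive_information.get(key) == value
               for key, value in criteria.items())
    ]


# Sample values are invented for the example.
aip = InformationPackage(
    kind=PackageKind.AIP,
    content_information=b"%PDF-1.3 ...",
    preservation_description_information={"provenance": "deposited by publisher"},
    packaging_information={"container": "single file"},
    descriptive_information={"title": "Example Journal, vol. 1"},
)
hits = discover([aip], title="Example Journal, vol. 1")
```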

Preservation perspectives

The concept of Information Object, with the explicit distinction between the Data Object (the actual bit-stream) and the Representation Information (which enables interpretation of the bit-stream into meaningful information), is central to the OAIS model. Preserving both the bit-stream and its Representation Information through time is a crucial requirement.

The OAIS document provides some perspectives on preserving information through digital migration across media and across new formats or representations, but it is not clear which processes are needed and which functionality is required. It discusses medium migration (refreshing or copying a publication) as a preservation procedure belonging to Archival Storage. As formats become obsolete, and the viewers needed to interpret and render them become obsolete as well, measures need to be taken to preserve the content of a publication and all related aspects such as look and feel, layout, structure and functionality. To this end, several strategies may be followed, such as migration and emulation.

The OAIS model does not discuss different preservation strategies and how they affect the model. It implicitly accepts data migration, i.e. "transformation" of digital content, as the preferred strategy. In all cases, transformation leads to a "new version" of the original publication. However, even with this strategy, it is not clear where transformation processes take place in OAIS. They do not belong to Archival Storage, which is understandable because Archival Storage does not have (and does not need to have) any knowledge of the content of a publication. The Administration entity has an "Archival Information Update" function that provides a mechanism for updating the contents of an AIP stored in Archival Storage, by accessing it as a DIP, updating its content and resubmitting it as a SIP to Ingest. However, the Reference Model does not clarify whether, and in what way, this function belongs to a preservation process.
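The "Archival Information Update" round trip described above (access as a DIP, update, resubmit as a SIP) might be sketched as follows. This is a toy model under stated assumptions: the Archive class, its method names and the version counter are hypothetical, and keeping the earlier version in storage is a choice made for the example, not something the Reference Model prescribes.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Package:
    kind: str        # "SIP", "AIP" or "DIP"
    identifier: str
    version: int
    content: bytes


class Archive:
    """Toy archive with only the functions needed for the round trip."""

    def __init__(self) -> None:
        self._storage: Dict[Tuple[str, int], Package] = {}

    def ingest(self, sip: Package) -> Package:
        # Ingest turns a SIP into an AIP and places it in storage.
        aip = replace(sip, kind="AIP")
        self._storage[(aip.identifier, aip.version)] = aip
        return aip

    def access(self, identifier: str, version: int) -> Package:
        # Access hands out a DIP derived from the stored AIP.
        return replace(self._storage[(identifier, version)], kind="DIP")

    def archival_information_update(
        self, identifier: str, version: int,
        transform: Callable[[bytes], bytes],
    ) -> Package:
        # Access the AIP as a DIP, update its content, resubmit as a SIP.
        dip = self.access(identifier, version)
        sip = Package("SIP", dip.identifier, dip.version + 1, transform(dip.content))
        return self.ingest(sip)


archive = Archive()
archive.ingest(Package("SIP", "pub-1", 1, b"old format"))
# bytes.upper stands in for a real format transformation.
updated = archive.archival_information_update("pub-1", 1, bytes.upper)
```

The sketch makes the open question visible: nothing in the round trip itself says which entity decides that a transformation is needed, or when.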

Scoping OAIS to the DSEP

The OAIS Reference Model defines the environment surrounding the archive and the interactions with Producers, Consumers and Management. Similarly, the DSEP Model defines the Digital Library environment surrounding the DSEP and the interactions between the DSEP and the Digital Library System (DLS).

NEDLIB has established that much of the OAIS functionality maps onto the broader digital library configuration. Many sub-functions of OAIS map (partly) to equivalent DLS functions, such as:

  • receive submission (partly covered by acquisition),
  • description (cataloguing),
  • creating finding aids (the National Bibliography, the library OPAC, subject guides and other indexes),
  • user authorisation and managing access controls (library user registration and authentication, authorisation and access control system),
  • access provision to customers (service delivery through the library web site).

These mappings concern general DLS functions.

In other words, OAIS functionality bundled in entities such as Ingest, Data Management and Access overlaps the general functionality of a digital library system (DLS). Consequently, OAIS functionality is situated partly outside and partly inside the actual limits of a DSEP.

In the model developed by NEDLIB, the Ingest, Data-Management and Access functionality belonging to the DSEP system is much more limited than in the OAIS Reference Model. In fact, Ingest, Data-Management and Access have functionality only directly relating to the storage-handling and preservation processes.

The NEDLIB model defines how the DSEP and the DLS interact. Two interfacing processes have been defined through which all input and output interactions with the DSEP take place. The interfacing processes have been defined as "Delivery&Capture" at the input side and "Access&Delivery" at the output side of the DSEP. These processes interact with the DSEP on the basis of well-defined input and output standards and conventions, such as a SIP and a DIP. They take care of the interactions with the publishers and other information providers at the input level, and with the library customers at the output level. In this way the interfacing processes cater to the particular and changing requirements for interaction with the outside world whilst the DSEP can be considered to operate in a more controlled environment. These modelling considerations are expanded upon below.

It is acknowledged that some processes that create finding aids, such as content indexing and cataloguing, need to interact with the DSEP to some extent to perform their tasks, but this interaction need not be direct and can occur through the input/output interfaces.

Figure 5 shows the result of this scoping exercise.

Figure 5. OAIS functional entities scoped to DSEP processes

(the numbers refer to workflow steps - see note below)

Extending the OAIS model

NEDLIB has looked into preservation issues in some detail from different perspectives. It has made an overview of preservation strategies and characteristics of electronic publications. It has commissioned Jeff Rothenberg to initiate experiments to test the hardware emulation approach with real digital deposit publications and to provide procedural models to support emulation-based preservation in the context of OAIS and DSEP [ref. 3]. It has looked in some detail at the metadata requirements for preservation and how they relate to the OAIS Information Model.

What NEDLIB found missing in the OAIS Model was a conceptual entity representing the preservation processes required of an OAIS, whatever the preservation strategies followed. NEDLIB has therefore added to its DSEP model a Preservation entity that manages the preservation processes required of a DSEP. Although it is recognised that the preservation function affects all DSEP processes, NEDLIB has added this separate entity to make the function more visible and more explicit in the model. OAIS itself does much the same for metadata: although metadata processing affects all functions, OAIS has defined a separate Data Management entity to make that function visible.

Both transformation and emulation approaches are worked out in some detail in the DSEP model. The resulting output is either a new version of a formerly deposited publication, in which case it is ingested anew in the system, or it is a set of specifications for interpreting or emulating the interpretation of the publication. In both cases, new preservation metadata are generated and managed by the Data-Management process.
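The two kinds of output can be sketched as two record types handled by one routine. All names below are hypothetical and the metadata record is deliberately minimal; the sketch only mirrors the statement that a new version is ingested anew, a specification is kept as such, and either case produces new preservation metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class NewVersion:
    """Output of a transformation: a new version of the publication."""
    publication_id: str
    content: bytes


@dataclass
class EmulationSpecification:
    """Output of the emulation approach: specifications, not new content."""
    publication_id: str
    specification: str


PreservationOutput = Union[NewVersion, EmulationSpecification]


@dataclass
class ToyDsep:
    ingested: List[NewVersion] = field(default_factory=list)
    specifications: List[EmulationSpecification] = field(default_factory=list)
    preservation_metadata: List[Dict[str, str]] = field(default_factory=list)

    def handle(self, output: PreservationOutput) -> None:
        if isinstance(output, NewVersion):
            self.ingested.append(output)          # ingested anew
            action = "transformation"
        else:
            self.specifications.append(output)    # kept as a specification
            action = "emulation"
        # In both cases new preservation metadata are generated.
        self.preservation_metadata.append(
            {"publication_id": output.publication_id, "action": action}
        )


dsep = ToyDsep()
dsep.handle(NewVersion("pub-1", b"converted content"))
dsep.handle(EmulationSpecification("pub-2", "emulate original viewer environment"))
```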

Modelling considerations

This section discusses some design considerations encountered during NEDLIB's work on OAIS and DSEP. It explains why certain less obvious choices have been made. The choices have been discussed with experts in the field, with other national libraries not participating in NEDLIB and with representatives from the OAIS standardisation effort.

Conceptual design vs implementation design

The Reference Model document assumes that implementers will use the OAIS model as a guide while developing a specific archival system implementation. The document stresses that it "does not assume or endorse any specific computing platform, system environment, system design paradigm, system development methodology, database management system, database design paradigm, data definition language, command language, system interface, user interface, technology, or media required for implementation." [see ref. 1, section 1.4]

In his preservation test-bed report for NEDLIB [ref. 3] Jeff Rothenberg rightly stresses how important it is "to keep in mind that the OAIS is intended as a reference model rather than a system design model. One implication of this (both for the OAIS and for the DSEP specification, which is derived from it) is that the functions or processes shown in these models do not necessarily correspond directly to the functional modules of a system that would implement that model. The functional decomposition of a system into appropriate modules is a design issue, and various implementations may well lend themselves to functional decompositions that are quite different from the 'reference processes' of the OAIS." [ref. 3, section 2.1]

In order to apply the OAIS model to the DSEP, the actual logical processes belonging to the DSEP have been identified and mapped to the OAIS functional entities. The DSEP model is a logical process design of the DSEP system. It refers to the OAIS functional entities without assuming that these correspond to an optimal functional decomposition of the desired system. In fact it introduces a new functional entity, the preservation entity, as a conceptual framework for the preservation process. This is not done to suggest that the preservation process is an isolated process that can be implemented as a separate preservation module, but rather to stress the importance of the preservation process as part and parcel of the DSEP. Even though the preservation process may pervade several other DSEP processes it is conceptualised in the DSEP model as a separate entity.

Similarly, NEDLIB has interpreted the OAIS concept of the Archival Information Package (AIP) as a conceptual package rather than an actual data structure. In this way it was possible to disassociate the different component parts of an Information Package (data and metadata) and to consider where various kinds of metadata are needed or generated by the different DSEP processes. The DSEP data model therefore indicates which metadata subsets logically "belong to" which DSEP processes. The AIP, containing the information object, remains the unit to be preserved and retrieved for access in a DSEP, but actual implementations of an AIP are free to decide which metadata subsets are to be stored together with the information object in one data structure for preservation. It is suggested, however, that only those metadata that are considered to belong to the original publication be preserved together with the information object - as a bitstream - in the AIP. The metadata required during the DSEP processes should not be preserved as part of the AIP, but be functionally accessible at all times. They need to be updated on a frequent basis and migrated to new data management software or data structures, as necessary, without having to unpack and modify the preserved publication itself, which remains unchanged in its AIP.
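One possible reading of this suggestion is sketched below: an immutable record holds the bit-stream together with the metadata belonging to the original publication, while metadata used by the DSEP processes live in a separate, freely updatable store. The class names, the keying by process name and the sample values are assumptions for illustration; the DSEP data model itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class StoredAip:
    """Preserved unit: bit-stream plus metadata of the original publication."""
    identifier: str
    bitstream: bytes
    original_metadata: Mapping[str, str]


@dataclass
class ProcessMetadataStore:
    """Metadata needed by DSEP processes, kept outside the stored AIP."""
    _records: Dict[Tuple[str, str], Dict[str, str]] = field(default_factory=dict)

    def update(self, identifier: str, process: str, **values: str) -> None:
        self._records.setdefault((identifier, process), {}).update(values)

    def get(self, identifier: str, process: str) -> Dict[str, str]:
        return dict(self._records.get((identifier, process), {}))


aip = StoredAip("pub-1", b"<bit-stream>", {"title": "Example Publication"})
store = ProcessMetadataStore()
# Process metadata change over time; the stored AIP is never unpacked.
store.update("pub-1", "storage-handling", location="shelf-A")
store.update("pub-1", "storage-handling", location="shelf-B")
```

Because the store is separate, it could be migrated to new data management software without touching any preserved publication.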

Modularity

Two interfacing processes have been defined through which all input and output interactions with the DSEP take place. This has been done to protect the DSEP from being exposed to changing external requirements that can be less easily controlled. From a design point of view it is best to ensure that the DSEP is a system operating autonomously, independently of external variables. Other digital library processing systems belonging to the DLS should similarly be able to operate independently and without any knowledge of internal DSEP technical solutions and conventions. The DSEP should operate like a black box to the other systems. In this way changes in one system do not affect other systems. The DLS is conceived as a modular system, in which the cataloguing system, the acquisition system, the DSEP and other library systems operate individually, each with well-defined responsibilities. Because the internal workings of the DSEP are private to itself, it can be developed independently of other existing DLS modules.

Generic vs local requirements

In order to minimise dependency, the number of interfaces for interaction with the DSEP has been reduced to a minimum of two. These are basically defined as input and output interfaces. The input interface (Delivery&Capture) enables a publication to be ingested into the DSEP. The output interface (Access&Delivery) enables a publication to be retrieved from the DSEP. The interfacing processes bundle all functionality that is necessary to address local, non-generic particularities of the DSEP environment. The process "Delivery&Capture" defines the non-generic part of Ingest: it is tailor-made to accommodate the different local deposit formats and procedures agreed between deposit library and publishers within each country. The process "Access&Delivery" defines the non-generic part of Access: it is tailor-made to accommodate the different local access conditions agreed between the deposit library and publishers. In addition, the deposit formats and procedures and the access conditions may alter significantly over time, making it necessary to adapt the interfacing processes regularly to changing external conditions. Separating the changing and local features of a DSEP from the more fixed and generic ones makes it possible to focus on a generic system that is relevant for all deposit libraries and that can be implemented everywhere without the need for drastic localisations.
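The separation of generic core and local interfaces could look roughly like this in code. It is a sketch under assumptions: the class names, the dictionary used as a local deposit format and the on-site access rule are all invented for the example. The generic core sees only SIPs and DIPs, while the two local interface classes absorb the deposit format and the access condition.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Sip:
    identifier: str
    content: bytes


@dataclass(frozen=True)
class Dip:
    identifier: str
    content: bytes


class Dsep:
    """Generic core: knows SIPs and DIPs, nothing about local conventions."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def ingest(self, sip: Sip) -> None:
        self._store[sip.identifier] = sip.content

    def retrieve(self, identifier: str) -> Dip:
        return Dip(identifier, self._store[identifier])


class LocalInputInterface:
    """Input side: converts a (hypothetical) local deposit format to a SIP."""

    def __init__(self, dsep: Dsep) -> None:
        self._dsep = dsep

    def deposit(self, delivery: Dict[str, object]) -> None:
        self._dsep.ingest(Sip(str(delivery["id"]), bytes(delivery["file"])))


class LocalOutputInterface:
    """Output side: applies a (hypothetical) local access condition."""

    def __init__(self, dsep: Dsep, on_site_only: bool) -> None:
        self._dsep = dsep
        self._on_site_only = on_site_only

    def deliver(self, identifier: str, user_on_site: bool) -> bytes:
        if self._on_site_only and not user_on_site:
            raise PermissionError("access limited to library premises")
        return self._dsep.retrieve(identifier).content


dsep = Dsep()
LocalInputInterface(dsep).deposit({"id": "pub-1", "file": b"content"})
reading_room = LocalOutputInterface(dsep, on_site_only=True)
```

A library with different agreements would replace only the two interface classes; the Dsep class would stay the same.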

Process Model of the DSEP

The resulting DSEP process model is described in the upcoming NEDLIB report entitled "The Deposit System for Electronic Publications (DSEP). A Process Model".

A published article in D-LIB Magazine [ref. 4] discusses interim modelling results that led to the DSEP process model.

Figure 6. Top-level view of the DSEP Process Model


REFERENCES

[1] Reference Model for an Open Archival Information System (OAIS), Don Sawyer / NASA and Lou Reich / CSC.

In the course of project NEDLIB the following successive versions of the OAIS Reference Model have appeared and been used as a basis for this NEDLIB report:

  • White Book, Issue 4, September 1998
  • White Book, Issue 5, April 1999
  • Red Book, Issue 1, May 1999

URL: http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html

[2] Jose Borbinha, Fernando Cardoso, "High Level Design", NEDLIB project Document 6048/DEL-35, 25-09-1998

URL: http://nedlib.kb.nl/high-level-design/doc0000.htm

[3] Jeff Rothenberg, "An experiment in using emulation to preserve digital publications", NEDLIB Report, April 2000

URL: http://nedlib.kb.nl/results/emulationpreservationreport.pdf

[4] Titia van der Werf, "Long-term Preservation of Electronic Publications", D-Lib Magazine, September 1999

URL: http://www.dlib.org/dlib/september99/vanderwerf/09vanderwerf.html


NOTE

Digital Deposit Library Workflow

The workflow identified by NEDLIB is characterised by the following steps:

  • 1. selection for collection building: this process is mainly human-driven and involves the decision-making process for including or excluding electronic material from the deposit collection. The decision-making process is based on national deposit policies and regulations and agreements made with publishers and other providers. The selection process is therefore highly dependent on local conditions.
  • 2. acquisition: this process involves all administrative transactions needed to receive new electronic publications in deposit – including receipt of new title information (metadata) from the publisher, ordering and recall. This process requires an exchange of bibliographic and administrative information between deposit library and publisher. In general, publications are acquired free of charge under deposit regulations or against agreed prices.
  • 3. delivery/capture/harvest: the actual deposit process involves actions required to get a copy of an electronic publication from the publisher's production/distribution system to the library's deposit system, according to agreed procedures. The procedures may include delivery via electronic transfer or by conventional shipping procedures, or capture by means of web harvesting techniques.
  • 4. registration: this is the process by which the new incoming electronic publication is checked into the deposit system. It may involve actions such as registration, acceptance/rejection and notification of receipt.
  • 5. verification: these are required control routines for checking the authenticity of data transfer, the physical integrity of the medium and the file formats and the logical integrity of the document. The procedures may involve authentication, installation and de-installation. The procedures may also result in returning the electronic publication to the sender with error messages and a notification to acquisition.
  • 6. description: this is the stage in the publication handling process where an entry is created in the library catalogue, to ensure the publication can be found in the library search systems. Cataloguing is done according to national cataloguing rules. This process may involve (automatic) re-use of primary metadata supplied by the publisher/author, but it may also involve manual value-adding by the librarian, such as performing authority control on author names, adding annotations and subject description. The bibliographic records are integrated in the library’s online catalogue (OPAC) and published in the National Bibliography (NB). The National Bibliography contains the authoritative bibliographic descriptions of publications included in the deposit collection. Content-indexing is the (automatic) procedure that creates full-text indexes and other finding aids for the electronic publication. Subject-indexing is the intellectual process whereby subject specialists classify publications according to a subject scheme and qualify them with the help of controlled vocabularies. The description process may require installation of the electronic publication.
  • 7. storage-handling: this is the process that takes care of the storage of the electronic publication in the deposit system. It involves medium migration of the electronic publication from its former carrier to the physical storage of the deposit system, regular refreshing and duplication for backup purposes. It also involves actions such as regular integrity checks and quality control monitoring. Part of this process also consists of the storage location handling, ensuring the management of the physical location of all files in the storage system.
  • 8. preservation: this process consists of all actions required for the long-term preservation of the deposit collections. This may involve specification and emulation of document behaviour, regular medium refreshing and document re-formatting, other transformations to be performed on (parts of) the electronic publication, preservation metadata updates, integrity checks, authenticity assessments and quality assurance procedures.
  • 9. packaging&delivery: this is the process needed for making the deposited electronic publication available in such a way that it is fit for consumption by library users. The process allows for the retrieval of the electronic publication from the deposit store. This involves making a copy of the item and transferring it to the access stage. It may entail extracting parts of the electronic publication, or adding a full-text index to it, or re-formatting (parts of) the publication for viewing, printing or downloading. It may involve providing for a viewing configuration, etc. Delivery may be embedded in services such as document delivery or print-on-demand.
  • 10. access: access is a whole set of facilities, as part of the library end user environment, to support access to the library deposit collection. It includes access to finding aids, it supports user-authorisation, user-rights management, user-profiles, etc. Access conditions to the deposit collections vary with each library because of different deposit regimes and agreements with publishers. It is also anticipated that access conditions will change over time. The access-environment supports the access process initiated by the library user.
  • 11. monitoring: the whole workflow pertaining to the deposit collection, as defined by processes 1 to 10, needs to be monitored for quality control.
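As an illustration of one small part of step 5 (verification), the sketch below checks the integrity of a transferred file against a checksum and, on failure, produces the message that would accompany its return to the sender. The use of a SHA-256 checksum supplied with the delivery is an assumption made for the example; the workflow above does not prescribe any particular technique, and the names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Delivery:
    title: str
    content: bytes
    declared_sha256: str  # assumption: the sender supplies a checksum


def verify(delivery: Delivery) -> Tuple[bool, str]:
    """Check transfer integrity; return (accepted, message)."""
    actual = hashlib.sha256(delivery.content).hexdigest()
    if actual != delivery.declared_sha256:
        return False, "checksum mismatch: return to sender and notify acquisition"
    return True, "accepted"


good = Delivery("Example", b"payload", hashlib.sha256(b"payload").hexdigest())
# Same declared checksum, but the content was altered in transfer.
corrupted = Delivery("Example", b"payl0ad", good.declared_sha256)
```

Checks on file formats and on the logical integrity of the document would need format-specific tools and are outside this sketch.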