Genesis
At the IVOA meeting #11 on 23 June 2004, a request was made for the creation of an Interest Group for Data Curation and Preservation. Reagan Moore (SDSC, USA) and Francoise Genova (CDS, France) were tasked with developing the initial description of the activities for the interest group. The goal of the group is to identify both mechanisms for the long-term preservation of astrophysics collections and sustainability procedures to ensure continued access to astrophysics collections that are at risk.
Mechanisms for long-term preservation address challenges related to:
- Authenticity, how to guarantee that preservation information (provenance, description, administrative) will remain associated with astrophysics data, while Uniform Content Descriptors evolve and data are moved to new storage systems.
- Integrity, how to validate that data have not been corrupted, manage the collection name spaces, and manage replicas.
- Technology evolution, how to guarantee a collection will remain accessible while the underlying storage systems and database technologies evolve.
- Disaster recovery, how to guarantee that a collection will remain accessible in the event of a natural catastrophe such as an earthquake or hurricane.
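As an illustration of the integrity challenge above, the following is a minimal sketch of checksum-based validation across replicas. It assumes a digest was recorded when the data were ingested. The function names, replica names, and sample bytes are hypothetical and not part of any IVOA specification, and SHA-256 is only one possible choice of digest.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_corrupted(replicas: dict[str, bytes], recorded_digest: str) -> list[str]:
    """Return the names of replicas whose digest differs from the recorded one."""
    return sorted(
        name
        for name, data in replicas.items()
        if sha256_digest(data) != recorded_digest
    )


# Hypothetical content; the digest is recorded once, at ingest time.
original = b"SIMPLE  =                    T"
recorded = sha256_digest(original)

# Three hypothetical replicas, one of which has a flipped byte.
replicas = {
    "site-a": original,
    "site-b": original,
    "deep-archive": b"SIMPLE  =                    F",
}

corrupted = find_corrupted(replicas, recorded)
print(corrupted)  # ['deep-archive']
```

In a real preservation environment the recorded digest would itself be part of the preservation metadata, so that a damaged replica can be detected and then restored from an intact copy.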
Sustainability procedures for continued support address challenges related to:
- Identification of collections at risk of being lost due to lack of support.
- Development of procedures for validating whether an at-risk collection should be preserved.
- Development of procedures for preserving at-risk collections.
- Development of procedures for identifying communities that will continue to support at-risk collections.
The IVOA Data Curation and Preservation interest group will develop:
- A white paper discussing the concepts involved in preservation, and providing a description of existing preservation environments.
- A white paper discussing the sustainability procedures.
A combined document will then be circulated to the IVOA as an IVOA Working Draft.
The technologies that are available to build preservation environments come from the grid and digital library communities. Examples include:
- Storage Resource Broker (SRB) data grid, which provides mechanisms to assert authenticity, validate integrity, manage technology evolution, package data for storage, replicate files, and federate preservation environments.
- integrated Rule-Oriented Data System (iRODS) data grid, which automates the application of preservation processes. Management policies are expressed as rules that control the execution of preservation processes, which are in turn implemented through sets of micro-services. Queries on persistent state information are used to validate assertions about preservation properties such as trustworthiness, authenticity, integrity, and chain of custody.
- Fedora, which provides relationship management that can be used to track UCD evolution.
- DSpace, which provides life-cycle management procedures for creating preservation metadata.
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), for supporting name space registries.
- Metadata Encoding and Transmission Standard (METS), for structuring metadata.
- Open Archival Information System (OAIS), for packaging data for preservation (Archival Information Package).
- Federated data grids, which provide deep archives for disaster recovery.
- Dataflow management systems, for applying preservation procedures.
One expectation is that multiple preservation models may be used, and that interoperability mechanisms will be developed for exchange of data and metadata between preservation systems. This capability will be required if for no other reason than to manage migration between different versions of technology over time.
Interface of the WG with other parts of the IVOA
Preservation environments can be assembled by federating existing collections with deep archives. Interactions with existing archives are therefore essential for building a viable system.
Interactions will be needed with the Data Access Layer WG for providing uniform interfaces to the preservation environment, and with the UCD WG to provide relevant discovery information for items in the preservation environment.
Requirements from other WGs
We will seek requirements from other WGs through the specification of use cases. A first use case is being defined in collaboration with InterPARES (International Research on Permanent Authentic Records in Electronic Systems): the Canadian MOST image collection is being analyzed for preservation requirements, and a report will be generated in the first quarter of 2005. Other examples of interactions with IVOA working groups include integration of preservation environments with processing pipelines, and support for discovery through portals.
Namespaces
The preservation environment will use at least four name spaces for identifying files:
- Logical file name. This is a location independent identifier that is unique within a preservation environment.
- Global Unique Identifier (GUID). This is an identifier that is unique across existing archives.
- Physical file name. This is the path name on the physical storage system where the data resides.
- UCDs. Attributes will be associated with each logical file name that can be used to discover relevant data.
The preservation environment provides additional name spaces to describe:
- Storage resources
- Users
- Metadata
By managing each of these name spaces independently from the underlying storage repositories and administrative domains, the preservation environment can control the authenticity and integrity of the collections.
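The separation of name spaces described above can be sketched as a small in-memory catalog. This is an illustration only: the class names, paths, GUID string, storage resource names, and the UCD-style attribute are all hypothetical. It is meant to show that a file can be moved to a new storage system without changing its logical name, its GUID, or its discovery attributes.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    logical_name: str  # location-independent, unique within the environment
    guid: str          # unique across archives
    physical_paths: dict[str, str] = field(default_factory=dict)  # storage resource -> path
    ucds: dict[str, str] = field(default_factory=dict)            # discovery attributes


class Catalog:
    """Hypothetical catalog that keeps the name spaces independent of storage."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        if entry.logical_name in self._entries:
            raise ValueError(f"logical name already in use: {entry.logical_name}")
        self._entries[entry.logical_name] = entry

    def lookup(self, logical_name: str) -> CatalogEntry:
        return self._entries[logical_name]

    def migrate(self, logical_name: str, old_resource: str,
                new_resource: str, new_path: str) -> None:
        """Move a file to a new storage resource; only the physical name changes."""
        entry = self._entries[logical_name]
        del entry.physical_paths[old_resource]
        entry.physical_paths[new_resource] = new_path

    def find_by_ucd(self, ucd: str, value: str) -> list[str]:
        """Return logical names whose attribute `ucd` equals `value`."""
        return [e.logical_name for e in self._entries.values()
                if e.ucds.get(ucd) == value]


catalog = Catalog()
catalog.register(CatalogEntry(
    logical_name="/collection/images/frame0001.fits",
    guid="guid-0001",
    physical_paths={"disk-a": "/vol1/frame0001.fits"},
    ucds={"instr.bandpass": "V"},
))

# Technology evolution: the file moves from disk to a new storage system.
catalog.migrate("/collection/images/frame0001.fits",
                "disk-a", "tape-b", "/archive/2004/frame0001.fits")

entry = catalog.lookup("/collection/images/frame0001.fits")
matches = catalog.find_by_ucd("instr.bandpass", "V")
print(entry.guid, entry.physical_paths, matches)
```

Because users and applications refer only to the logical name or the GUID, the migration is invisible to them, which is the property the independent name spaces are intended to provide.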
Federation
To ensure preservation against natural disasters, operator error, and malicious users, replicas of the data must be kept with differentiated levels of access. One copy should be accessible by users, a second user-accessible copy should be created at a geographically remote site for fault tolerance, and a third copy should be kept in a deep archive that is not user-accessible. Each of the copies should be managed by an independent metadata catalog. Federated data grids provide this level of support.
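The replication policy above can be expressed as a simple check over the set of replicas held for a collection. The sketch below is illustrative: the `Replica` fields, site names, and catalog names are hypothetical, and "geographically remote" is approximated by requiring the user-accessible copies to be at different sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str              # where the copy is stored
    user_accessible: bool  # False for the deep archive
    catalog: str           # metadata catalog that manages this copy


def policy_violations(replicas: list[Replica]) -> list[str]:
    """Return a list of ways in which the replicas fall short of the policy."""
    problems = []
    accessible = [r for r in replicas if r.user_accessible]
    deep = [r for r in replicas if not r.user_accessible]

    if len(accessible) < 2:
        problems.append("need two user-accessible copies")
    elif len({r.site for r in accessible}) < 2:
        problems.append("user-accessible copies must be at different sites")
    if not deep:
        problems.append("need a deep-archive copy that is not user-accessible")
    if len({r.catalog for r in replicas}) < len(replicas):
        problems.append("each copy needs its own metadata catalog")
    return problems


compliant = [
    Replica(site="site-a", user_accessible=True, catalog="catalog-a"),
    Replica(site="site-b", user_accessible=True, catalog="catalog-b"),
    Replica(site="site-c", user_accessible=False, catalog="catalog-c"),
]
# Two copies at one site sharing one catalog, and no deep archive.
deficient = [
    Replica(site="site-a", user_accessible=True, catalog="catalog-a"),
    Replica(site="site-a", user_accessible=True, catalog="catalog-a"),
]

ok_report = policy_violations(compliant)
bad_report = policy_violations(deficient)
print(ok_report)
print(bad_report)
```

A check of this kind could be run periodically, so that the loss of a replica or of a catalog is noticed before the remaining copies are also at risk.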