IVOA Data Curation and Preservation
This Interest Group is chaired by Alberto Accomazzi
(appointed May 2010), who succeeds Bob Hanisch
(appointed May 2007).
The Data Curation and Preservation (DCP) Interest Group is established to
share best practices and engage IVOA member projects in the long-term
curation and preservation of astronomical data. Discussion topics include:
- Identification of at-risk data and data collections
- Processes for engaging the community in curation and preservation activities
- Coordination with strategic partners such as astronomical data centers, research libraries, and scholarly publishers
- Technology reviews concerning all aspects of curation and preservation (metadata, provenance, data integrity, media migration, replication, disaster recovery, assessment criteria, etc.)
The Interest Group will from time to time produce white papers and/or bring
proposed actions to the attention of the IVOA Technical Coordination Group
and IVOA Executive.
- General working-group discussion archive.
May 2010, Victoria Interop InterOpMay2010DCP
May 2008, Trieste Interop InterOpMay2008DCP
October 2007, Cambridge Interop InterOpSep2007CP
There is no fixed membership; anyone interested in data curation and preservation is welcome to participate.
People who are currently involved include Bob Hanisch, Francoise Genova, Pepi Fabbiano, Arnold Rots, Wolfgang Voges, Fabio Pasian and Bob Mann.
Please add your name to this list!
At the IVOA meeting #11 on 23 June 2004, a request was made for the creation of an Interest Group for Data Curation and Preservation. Reagan Moore (SDSC, USA) and Francoise Genova (CDS, France) were tasked with developing the initial description of the activities for the interest group. The goal of the group is to identify both mechanisms for the long-term preservation of astrophysics collections and sustainability procedures to ensure continued access to astrophysics collections that are at risk.
Mechanisms for long-term preservation address challenges related to:
- Authenticity, how to guarantee that preservation information (provenance, description, administrative) will remain associated with astrophysics data, while Uniform Content Descriptors evolve and data are moved to new storage systems.
- Integrity, how to validate that data has not been corrupted, manage the collection name spaces, and manage replicas.
- Technology evolution, how to guarantee a collection will remain accessible while the underlying storage systems and database technologies evolve.
- Disaster recovery, how to guarantee that a collection will remain accessible in the event of a natural catastrophe such as an earthquake or hurricane.
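The integrity challenge above can be illustrated with a checksum comparison across replicas. This is a minimal sketch, not a description of any particular preservation system: the function names, the file names, and the choice of SHA-256 are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(replicas: list[Path], recorded_digest: str) -> list[Path]:
    """Return the replicas whose current digest differs from the recorded one."""
    return [p for p in replicas if sha256_of(p) != recorded_digest]


def demo() -> list[str]:
    """Create two replicas, corrupt one, and report which copy fails validation."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "image_site_a.fits"  # hypothetical file names
        replica = Path(tmp) / "image_site_b.fits"
        payload = b"SIMPLE  =                    T"
        original.write_bytes(payload)
        replica.write_bytes(payload)
        recorded = sha256_of(original)  # digest stored at ingest time
        replica.write_bytes(b"SIMPLE  =                    F")  # simulate corruption
        return [p.name for p in find_corrupted([original, replica], recorded)]


if __name__ == "__main__":
    print(demo())
```

The recorded digest is part of the preservation metadata, so a check of this kind is only meaningful if the digest is stored and managed separately from the data it describes.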
Sustainability procedures for continued support address challenges related to:
- Identification of collections at risk of being lost due to lack of support.
- Development of procedures for validating whether an at-risk collection should be preserved.
- Development of procedures for preserving at-risk collections.
- Development of procedures for identifying communities that will continue to support at-risk collections.
The IVOA Data Curation and Preservation interest group will develop:
(1) A white paper discussing the concepts involved in preservation, and providing a description of existing preservation environments.
(2) A white paper discussing the sustainability procedures.
A combined document will then be circulated to the IVOA as an IVOA Working Draft.
The technologies that are available to build preservation environments come from the grid and digital library communities. Examples include:
- Storage Resource Broker data grid which provides mechanisms to assert authenticity, validate integrity, manage technology evolution, package data for storage, replicate files, and federate preservation environments.
- integrated Rule Oriented Data System (iRODS) data grid which automates application of preservation processes. Management policies are expressed as rules that control the execution of preservation processes that are implemented through sets of micro-services. Queries on persistent state information are used to validate assertions about preservation properties such as trustworthiness, authenticity, integrity, and chain of custody.
- Fedora which provides relationship management that can be used to track UCD evolution.
- DSpace which provides life-cycle management procedures for creating preservation metadata
- Open Archives Initiative Protocol for Metadata Harvesting for supporting name space registries
- Metadata Encoding and Transmission Standard for structuring metadata
- Open Archival Information System for packaging data for preservation (Archival Information Package)
- Federated data grids which provide deep archives for disaster recovery
- Dataflow management systems for applying preservation procedures
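The rule-oriented approach described in the list above (policies expressed as rules, rules invoking micro-services, and queries on persistent state validating preservation assertions) can be sketched generically. The sketch below is an illustration in plain Python and does not use iRODS syntax or APIs; every name in it (`Rule`, `apply_rules`, `record_checksum`, `make_replica`, `unmet_assertions`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A micro-service is modelled here as a function that updates the
# persistent state recorded for one digital object.
MicroService = Callable[[Dict[str, object]], None]


def record_checksum(state: Dict[str, object]) -> None:
    """Hypothetical micro-service: note that a checksum was stored."""
    state["checksum_recorded"] = True


def make_replica(state: Dict[str, object]) -> None:
    """Hypothetical micro-service: note that one more replica exists."""
    state["replica_count"] = int(state.get("replica_count", 0)) + 1


@dataclass
class Rule:
    """A management policy: when `event` occurs, run `actions` in order."""
    event: str
    actions: List[MicroService] = field(default_factory=list)


def apply_rules(rules: List[Rule], event: str, state: Dict[str, object]) -> None:
    """Run the micro-services of every rule that matches the event."""
    for rule in rules:
        if rule.event == event:
            for action in rule.actions:
                action(state)


def unmet_assertions(state: Dict[str, object]) -> List[str]:
    """Query the persistent state and list preservation assertions that fail."""
    problems = []
    if not state.get("checksum_recorded"):
        problems.append("integrity: no checksum recorded")
    if int(state.get("replica_count", 0)) < 2:
        problems.append("replication: fewer than 2 replicas")
    return problems


if __name__ == "__main__":
    state: Dict[str, object] = {"name": "obs_001.fits"}
    ingest_policy = [Rule("ingest", [record_checksum, make_replica, make_replica])]
    apply_rules(ingest_policy, "ingest", state)
    print(unmet_assertions(state))
```

Keeping the policy (rules), the procedures (micro-services), and the verification (state queries) separate is what lets each of them evolve independently.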
One expectation is that multiple preservation models may be used, and that interoperability mechanisms will be developed for exchange of data and metadata between preservation systems. This capability will be required if for no other reason than to manage migration between different versions of technology over time.
Interface of the WG with other parts of the IVOA
Preservation environments can be assembled by federating existing collections with deep archives. This means that interaction with existing archives is essential for building a viable system.
Interactions will be needed with the Data Access Layer WG for providing uniform interfaces to the preservation environment, and with the UCD WG to provide relevant discovery information for items in the preservation environment.
Requirements from other WGs
We will seek requirements from other WGs through specification of use cases. A first use case is being defined in collaboration with Interpares (International Research on Permanent Authentic Records in Electronic Systems). The Canadian MOST image collection is being analyzed for preservation requirements. A report will be generated in the first quarter of 2005.
Other examples of interactions with IVOA working groups include integration of preservation environments with processing pipelines, and support for discovery through portals.
Currently, requirements are submitted by email to the IVOA PE mailing list.
The preservation environment will use at least four name spaces for identifying files:
- Logical file name. This is a location independent identifier that is unique within a preservation environment.
- Globally Unique Identifier (GUID). This is an identifier that is unique across existing archives.
- Physical file name. This is the path name on the physical storage system where the data resides.
- UCDs. Attributes will be associated with each logical file name that can be used to discover relevant data.
The preservation environment provides additional name spaces to describe:
- Storage resources
By managing each of these name spaces independently from the underlying storage repositories and administrative domains, the preservation environment can control the authenticity and integrity of the collections.
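One way to picture the four file name spaces is a catalog that keeps them as separate fields, so that a physical path can change while the logical name, GUID, and descriptive attributes stay fixed. This is a minimal sketch under stated assumptions: the `Catalog` and `CatalogEntry` classes, the use of `uuid4` for GUIDs, and the sample paths and attribute values are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    logical_name: str  # location-independent, unique within this environment
    guid: str  # unique across archives
    physical_paths: List[str] = field(default_factory=list)  # where the bytes reside
    ucds: Dict[str, str] = field(default_factory=dict)  # discovery attributes


class Catalog:
    """Hypothetical catalog keyed by logical file name."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, logical_name: str, physical_path: str,
                 ucds: Dict[str, str]) -> CatalogEntry:
        """Add a new file; logical names must be unique in the environment."""
        if logical_name in self._entries:
            raise ValueError(f"logical name already in use: {logical_name}")
        entry = CatalogEntry(logical_name, str(uuid.uuid4()), [physical_path], dict(ucds))
        self._entries[logical_name] = entry
        return entry

    def migrate(self, logical_name: str, old_path: str, new_path: str) -> None:
        """Record a move to new storage; only the physical name space changes."""
        entry = self._entries[logical_name]
        entry.physical_paths = [new_path if p == old_path else p
                                for p in entry.physical_paths]

    def find_by_ucd(self, ucd: str, value: str) -> List[str]:
        """Discover logical names whose attribute `ucd` has the given value."""
        return [e.logical_name for e in self._entries.values()
                if e.ucds.get(ucd) == value]


if __name__ == "__main__":
    catalog = Catalog()
    entry = catalog.register("/most/images/obs_001.fits",
                             "/tape01/vol7/obs_001.fits",
                             {"instr.bandpass": "V"})
    catalog.migrate(entry.logical_name, "/tape01/vol7/obs_001.fits",
                    "/disk02/archive/obs_001.fits")
    print(entry.guid, entry.physical_paths, catalog.find_by_ucd("instr.bandpass", "V"))
```

Because the migration touches only `physical_paths`, references that use the logical name or GUID remain valid after the storage system is replaced.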
To ensure preservation against natural disasters, operator error, and malicious users, replicas of the data must be kept with differentiated levels of access. One copy should be accessible by users, a second user-accessible copy should be created at a geographically remote site for fault-tolerance, and a third copy should be kept in a deep archive that is not user-accessible. Each of the copies should be managed by an independent metadata catalog. Federated data grids provide this level of support.
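The three-copy arrangement can be expressed as a simple policy check. The sketch below is illustrative only: the `Replica` fields, the site and catalog labels, and the wording of the violations are assumptions, and "geographically remote" is approximated here by requiring distinct site names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    site: str  # where the copy is held
    user_accessible: bool  # False for the deep-archive copy
    catalog: str  # metadata catalog that manages this copy


def policy_violations(replicas: List[Replica]) -> List[str]:
    """List the ways a set of replicas falls short of the three-copy policy."""
    problems = []
    user_copies = [r for r in replicas if r.user_accessible]
    deep_copies = [r for r in replicas if not r.user_accessible]
    if len({r.site for r in user_copies}) < 2:
        problems.append("need user-accessible copies at two distinct sites")
    if not deep_copies:
        problems.append("need a deep-archive copy that is not user-accessible")
    if len({r.catalog for r in replicas}) < len(replicas):
        problems.append("each copy needs its own metadata catalog")
    return problems


if __name__ == "__main__":
    compliant = [
        Replica("site_a", True, "catalog_a"),
        Replica("site_b", True, "catalog_b"),
        Replica("deep_archive", False, "catalog_c"),
    ]
    print(policy_violations(compliant))
```

A check like this could run periodically against each federated catalog, so that the loss of a copy is detected before a second failure makes it unrecoverable.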
The project will track preservation approaches across multiple communities:
- Smith, M., R. Moore, "Digital Archive Policies and Trusted Digital Repositories", proceedings of The 2nd International Digital Curation Conference: Digital Data Curation in Practice, November 2006, Glasgow, Scotland.
- Moore, R., "Building Preservation Environments with Data Grid Technology", American Archivist, vol. 69, no. 1, pp. 139-158, July 2006.
- Rajasekar, A., M. Wan, R. Moore, W. Schroeder, "A Prototype Rule-based Distributed Data Management System", HPDC workshop on "Next Generation Distributed Data Management", May 2006, Paris, France.
- Moore, R., J. Jaja, R. Chadduck, "Mitigating Risk of Data Loss in Preservation Environments", NASA / IEEE MSST2005, Thirteenth NASA Goddard / Twenty-second IEEE Conference on Mass Storage Systems and Technologies, April 2005.
- Moore, R., "Persistent Collections," book chapter in "Databasing the Brain," editors S. H. Koslow and S. Subramaniam, John Wiley & Sons, 2005.
- Moore, R., R. Marciano, "Prototype Preservation Environments", submitted to Library Trends, Dec. 2004.
- Moore, R., A. Rajasekar, M. Wan, "Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data", submitted to IEEE, Dec. 2004.
- Moore, R., "Building Preservation Environments with Data Grid Technology", submitted to American Archivist, Oct. 2004.
- Moore, R., W. Underwood, "Preservation Environments for Digital Entities," Interpares II report, June 2004.