Parquet in IVOA
VOParquet Note
A Note is under preparation at https://github.com/ivoa/voparquet (formatted versions available: html, pdf).
The Note describes how to embed a VOTable in a Parquet file so that rich semantic (VOTable) metadata is associated with the (Parquet) data.
Implementation Status
The following implementations of the VOParquet convention exist. Please add to this list if you know of others.
- TOPCAT/STILTS/STIL: implementation in pre-release at topcat-extra.jar.
- Writes VOParquet metadata into output parquet files by default; uses VOParquet metadata from input files when reading if present.
- Believed fully compliant with VOParquet Note v1.0.
- A small example output file is attached: skysim10.parquet. It was produced like this:
java -jar topcat-extra.jar -stilts tpipe in=:skysim:10 out=skysim10.parquet
- OpenCADC TAP server implementation (in progress, expected to be ready in early January 2025). PR here: https://github.com/opencadc/tap/pull/181
- For queries where RESPONSEFORMAT specifies Parquet, a Parquet file is produced with its metadata in an embedded VOTable, as per the VOParquet Note.
- If the server encounters an error while constructing and streaming out the Parquet file, the client will receive a text/plain message describing the error. Since the output stream is already open at this point (and the response code cannot be modified), we think this is the best the server can do, but we are open to other suggestions.
- Will be available in all instances of CADC TAP services including YouCat for user-managed tables.
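Because an error can occur after streaming has begun, a client cannot rely on the HTTP status code alone to know that it received a complete file. The following is a minimal client-side sketch, not part of the OpenCADC implementation. It relies on the Parquet file layout, in which an unencrypted file starts and ends with the 4-byte magic PAR1 and stores a 4-byte little-endian footer length just before the trailing magic. The function name looks_like_complete_parquet is hypothetical, and the check is a cheap sanity test rather than a validation.

```python
import struct

PARQUET_MAGIC = b"PAR1"

def looks_like_complete_parquet(payload: bytes) -> bool:
    """Cheap structural check: leading/trailing magic and a plausible footer length.

    Hypothetical helper for illustration; it does not parse the footer itself.
    """
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if len(payload) < 12:
        return False
    if not payload.startswith(PARQUET_MAGIC) or not payload.endswith(PARQUET_MAGIC):
        return False
    # The 4 bytes before the trailing magic hold the footer size (little-endian).
    (footer_len,) = struct.unpack("<I", payload[-8:-4])
    # The footer must fit between the leading magic and the length field.
    return 0 < footer_len <= len(payload) - 12
```

A response that was cut short and followed by an error message would normally fail the trailing-magic test, so the client can then treat the body as a possible error report instead of passing it to a Parquet reader.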
Validator
A VOParquet validator is available:
- There is a new STILTS command parqlint in the VOParquet-capable TOPCAT pre-release mentioned above (topcat-extra.jar).
- Invoke it like this:
java -jar topcat-extra.jar -stilts parqlint <voparquet-filename>
- It checks that the key-value entries look OK, does full votlint validation on the embedded VOTable, and reports on any discrepancies between the VOTable metadata and the Parquet data.
- It has a few optional parameters; use help or help=<param-name> for more details.
- One option useful for debugging is the ability to test an external data-less VOTable as if it were attached to the Parquet file; see the votable parameter.
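To give a feel for the kind of discrepancy a validator might report, here is a minimal sketch that compares the FIELD names in a data-less VOTable with a list of Parquet column names. It is not how parqlint is implemented. It assumes that the FIELDs are meant to correspond one-to-one, in order, with the Parquet columns; the helper names field_names and name_discrepancies, the sample VOTable and the column list are all hypothetical. Reading the column names from a real Parquet file would need a Parquet library and is omitted here.

```python
import xml.etree.ElementTree as ET

def field_names(votable_xml: str) -> list[str]:
    """Return FIELD names in document order, ignoring the XML namespace."""
    root = ET.fromstring(votable_xml)
    return [
        el.attrib.get("name", "")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "FIELD"
    ]

def name_discrepancies(votable_xml: str, parquet_columns: list[str]) -> list[str]:
    """List human-readable mismatches between VOTable FIELDs and Parquet columns."""
    fields = field_names(votable_xml)
    problems = []
    if len(fields) != len(parquet_columns):
        problems.append(
            f"column count differs: VOTable has {len(fields)}, "
            f"Parquet has {len(parquet_columns)}"
        )
    # Compare names position by position over the common prefix.
    for i, (f, c) in enumerate(zip(fields, parquet_columns)):
        if f != c:
            problems.append(f"column {i}: VOTable FIELD '{f}' != Parquet column '{c}'")
    return problems

# Hypothetical data-less VOTable, as it might be embedded in a Parquet file.
SAMPLE = """<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE><TABLE>
    <FIELD name="ra" datatype="double" unit="deg" ucd="pos.eq.ra"/>
    <FIELD name="dec" datatype="double" unit="deg" ucd="pos.eq.dec"/>
  </TABLE></RESOURCE>
</VOTABLE>"""

for problem in name_discrepancies(SAMPLE, ["ra", "decl"]):
    print(problem)
```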
Parquet and DALI
The current draft of DALI 1.2 has added a new row to the RESPONSEFORMAT table, introducing the alias parquet for the Parquet MIME type application/vnd.apache.parquet. That means that for DALI services offering Parquet output, clients can request that format by providing the parameter RESPONSEFORMAT=parquet. See DALI PR#43.
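As an illustration of the parameter in use, the sketch below builds a synchronous TAP query URL that requests Parquet output. The helper name tap_sync_url and the base URL https://example.org/tap are hypothetical; the REQUEST, LANG, QUERY and RESPONSEFORMAT parameters are the standard TAP ones. Whether a given service accepts the short alias depends on whether it follows the DALI 1.2 draft, so the full MIME type is shown as an alternative.

```python
from urllib.parse import urlencode

def tap_sync_url(base_url: str, adql: str, response_format: str = "parquet") -> str:
    """Build a TAP sync query URL asking for the given RESPONSEFORMAT."""
    params = {
        "REQUEST": "doQuery",
        "LANG": "ADQL",
        "QUERY": adql,
        "RESPONSEFORMAT": response_format,
    }
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"

# Hypothetical service endpoint, used only for illustration.
BASE = "https://example.org/tap"
QUERY = "SELECT TOP 10 * FROM TAP_SCHEMA.tables"

# Short alias from the DALI 1.2 draft.
alias_url = tap_sync_url(BASE, QUERY)

# Full MIME type, for services that do not recognise the alias.
mime_url = tap_sync_url(BASE, QUERY, "application/vnd.apache.parquet")

print(alias_url)
print(mime_url)
```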
Documents and Presentations
Meetings
5 November 2024 19:00 UTC Online meeting
22 Zoom participants
Agenda
The purpose of the meeting is to learn about the Parquet-related efforts currently under way and to identify synergies that could be channeled into new IVOA standards.
Meeting minutes
Active groups using Parquet technology:
- Jeff Burke (CADC) - Parquet format for TAP
- Mario Juric (U of Washington) - Upload and download large Rubin catalogues
- Vandana Desai (Caltech/IPAC) - Science with large catalogues inspired by Mario's work.
- Jos De Bruijne (ESA) - ESA Gaia DR4 - Use Parquet format (instead of the current CSV format)
- Pierre Le Sidaner (Observatoire de Paris) - Also Gaia mission. Appreciates data access by column
- Gregory Dubois-Felsmann (Rubin) - Rubin makes very heavy use of Parquet internally and externally. Catalogue releases. Rich metadata in VOTable goes beyond UCDs and UTypes
- Trey Roby (IPAC) - Firefly reads and writes Parquet files. Very good format. Pushing the edge of how big the files can be.
- Mark Taylor presentation on adding VOTable rich metadata in Parquet
Questions and comments on the presentation
- Gregory D-F: VOTable metadata is important for DataLink; no objection to adding encoding; version support is a must. Valid VOTable metadata must comply with the spec. Parquet is not great for working with multiple datasets, but VOTable should support multiple tables. What happens if there’s a conflict between VOTable metadata and Parquet metadata?
- Mark T.: Cannot have an empty data element. VOTable Resource has a flag that can be used. Main table can be of type result and other resources can have other types (metadata).
- Trey R.: How to resolve inconsistencies between Parquet and VOTable. Parquet should have the last word.
- Mario J: Important to specify how to deal with inconsistencies between VOTable metadata and Parquet metadata
- Brigitta S: Prefers Parquet support first, before generalization (FITS). Example files would be useful for implementations such as astroquery, e.g. multiple tables within the same file.
- Jos DB: Compression algorithms. Should start thinking about this as it's one of the core strengths of Parquet
- Gregory D-F: We need to define new TAP result formats. Is there a Parquet MIME type?
- A few participants stressed the importance of moving fast with the adoption of Parquet in IVOA as a lot of projects are currently under way.
- Marco M: Proposed the faster route of writing an IVOA Note for VOTable and Parquet and then explore generalizing it for other types (FITS) and turn it into a standard.
Actions:
- Mark T. and Gregory D-F will start drafting the note
- Mark T. will make his presentation available so that others can comment on it - done
- Mark T. will present progress during the Apps session in Malta
- Apps WG to support the effort: create TWiki page with useful info, organize meetings post Interop as required, etc.