Parquet in IVOA


VOParquet Note

A Note is under preparation at https://github.com/ivoa/voparquet (formatted versions available: html, pdf).

The Note describes how to embed a VOTable within a Parquet file so that rich semantic (VOTable) metadata is associated with the (Parquet) data.
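
As a rough illustration of what a reader does with the embedded VOTable, the sketch below pulls the FIELD descriptions out of a file's key-value metadata. The key name `IVOA.VOTable-Parquet.content` is an assumption here and should be checked against the Note; the function name and the example metadata are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed key for the embedded data-less VOTable; check the Note for the normative name.
VOTABLE_KEY = b"IVOA.VOTable-Parquet.content"


def votable_fields(key_value_metadata):
    """Return one dict of attributes per FIELD in the embedded VOTable.

    key_value_metadata maps bytes keys to bytes values, like the
    file-level key-value metadata in a Parquet footer.
    Returns an empty list if no VOTable is embedded.
    """
    xml_bytes = key_value_metadata.get(VOTABLE_KEY)
    if xml_bytes is None:
        return []
    root = ET.fromstring(xml_bytes)
    fields = []
    for elem in root.iter():
        # Strip any XML namespace so the match does not depend on the VOTable version.
        if elem.tag.rsplit("}", 1)[-1] == "FIELD":
            fields.append(dict(elem.attrib))
    return fields


# Hypothetical metadata for a two-column table.
example = {
    VOTABLE_KEY: b"""<VOTABLE version="1.4" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
      <RESOURCE><TABLE>
        <FIELD name="ra" datatype="double" unit="deg" ucd="pos.eq.ra"/>
        <FIELD name="dec" datatype="double" unit="deg" ucd="pos.eq.dec"/>
      </TABLE></RESOURCE>
    </VOTABLE>"""
}

for field in votable_fields(example):
    print(field["name"], field["unit"], field["ucd"])
```

With pyarrow installed, a dictionary of this shape should be obtainable from `pyarrow.parquet.read_metadata(path).metadata`, which is `None` when a file carries no key-value metadata.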

Implementation Status

The following implementations of the VOParquet convention exist. Please add to this list if you know of others.

  • TOPCAT/STILTS/STIL: implementation in pre-release at topcat-extra.jar.
    • Writes VOParquet metadata into output parquet files by default; uses VOParquet metadata from input files when reading if present.
    • Believed fully compliant with VOParquet Note v1.0.
    • A small example output file is attached: skysim10.parquet. (Produced like this: java -jar topcat-extra.jar -stilts tpipe in=:skysim:10 out=skysim10.parquet)
  • OpenCADC TAP server implementation (in-progress, ready early Jan 2025). PR here: https://github.com/opencadc/tap/pull/181
    • For queries where RESPONSEFORMAT specifies Parquet, a Parquet file is produced with metadata in an embedded VOTable, as per the VOParquet Note.
    • If the server encounters an error while constructing and streaming the Parquet file, the client receives a text/plain message describing the error. Because the output stream is already open at that point and the response code can no longer be changed, we think this is the best the server can do, but we are open to other suggestions.
    • Will be available in all instances of CADC TAP services including YouCat for user-managed tables.
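
Because an error can arrive after the response has started, a client may want to sanity-check the body before handing it to a Parquet reader. The sketch below is one heuristic way to do that, relying only on the Parquet file layout (the 4-byte magic `PAR1` at both ends, preceded at the tail by a little-endian footer length). The function name is hypothetical, and this is not a description of what any CADC client does.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def looks_like_complete_parquet(body):
    """Heuristic check that a response body is a whole Parquet file."""
    # Smallest possible layout: magic + 4-byte footer length + magic.
    if len(body) < 12:
        return False
    if not body.startswith(PARQUET_MAGIC) or not body.endswith(PARQUET_MAGIC):
        return False
    # The footer length is a little-endian uint32 just before the trailing magic.
    (footer_len,) = struct.unpack("<I", body[-8:-4])
    return footer_len <= len(body) - 12
```

A body that fails this check can be shown to the user as text, since under the behaviour described above it is likely to contain the error message. Passing the check does not guarantee that the file is valid.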

Validator

A VOParquet validator is available:

  • There is a new STILTS command parqlint in the VOParquet-capable TOPCAT pre-release mentioned above (topcat-extra.jar).
  • Invoke it like this: java -jar topcat-extra.jar -stilts parqlint <voparquet-filename>
  • It checks that the key-value entries look OK, does full votlint validation on the embedded VOTable, and reports any discrepancies between the VOTable metadata and the Parquet data.
  • It has a few optional parameters; use help or help=<param-name> for more details.
  • One option useful for debugging is being able to test an external data-less VOTable as if it were attached to the parquet file - see the votable parameter.
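
For scripted checks, the invocation above can be wrapped as in the following sketch. The helper names are hypothetical; the `votable=<file>` form and its position after the filename are assumptions based on the name=value parameter syntax mentioned above, so consult the command's help output before relying on them.

```python
import subprocess


def parqlint_command(parquet_path, jar="topcat-extra.jar", votable=None):
    """Build the argument list for a parqlint run."""
    cmd = ["java", "-jar", jar, "-stilts", "parqlint", parquet_path]
    if votable is not None:
        # Assumed name=value form for the votable parameter.
        cmd.append(f"votable={votable}")
    return cmd


def run_parqlint(parquet_path, **kwargs):
    """Run parqlint and return its report text (requires Java and the jar)."""
    result = subprocess.run(
        parqlint_command(parquet_path, **kwargs),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr
```

For example, `run_parqlint("skysim10.parquet")` would validate the attached example file, provided `topcat-extra.jar` is in the current directory.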

Parquet and DALI

The current draft of DALI 1.2 has added a new row to the RESPONSEFORMAT table, introducing the alias parquet for the parquet MIME type application/vnd.apache.parquet. That means that for DALI services offering parquet output, clients can request that format by providing the parameter RESPONSEFORMAT=parquet. See DALI PR#43.
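
A minimal sketch of such a request against a TAP synchronous endpoint is shown below. The base URL and function name are hypothetical, and whether a given service accepts the `parquet` alias depends on it having adopted the DALI 1.2 change.

```python
from urllib.parse import urlencode


def tap_sync_url(base_url, adql, response_format="parquet"):
    """Build a TAP sync query URL requesting the given output format."""
    params = {
        "LANG": "ADQL",
        "QUERY": adql,
        "RESPONSEFORMAT": response_format,
    }
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"


# https://example.org/tap is a placeholder, not a real service.
print(tap_sync_url("https://example.org/tap", "SELECT TOP 10 * FROM tap_schema.tables"))
```

Passing `response_format="application/vnd.apache.parquet"` would request the same format by its full MIME type instead of the alias.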

Documents and Presentations


Meetings

5 November 2024 19:00 UTC Online meeting

22 Zoom participants

Agenda

The purpose of the meeting is to learn about the Parquet-related efforts currently under way and to identify synergies that can be channeled into possible new IVOA standards.

Meeting minutes

Active groups using Parquet technology:

  • Jeff Burke (CADC) - Parquet format for TAP
  • Mario Juric (U of Washington) - Upload and download large Rubin catalogues
  • Vandana Desai (Caltech/IPAC) - Science with large catalogues inspired by Mario's work.
  • Jos De Bruijne (ESA) - ESA Gaia DR4 - Use Parquet format (instead of the current CSV format)
  • Pierre Le Sidaner (Observatoire de Paris) - Also Gaia mission. Appreciates data access by column
  • Gregory Dubois-Felsmann (Rubin) - Rubin very heavy use of Parquet internally and externally. Catalogue releases. Rich metadata in VOTable goes beyond UCDs and UTypes
  • Trey Roby (IPAC) - Firefly reads and writes Parquet files. Very good file. Pushing the edge of how big the files can be.
  • Mark Taylor presentation on adding VOTable rich metadata in Parquet

Questions and comments on the presentation

  • Gregory D-F: VOTable metadata is important for DataLink. No objection to adding encoding; version support is a must. Valid VOTable metadata must comply with the spec. Parquet is not great for working with multiple datasets, but VOTable should support multiple tables. What happens if there’s a conflict between VOTable metadata and Parquet metadata?
  • Mark T.: Cannot have an empty data element. VOTable Resource has a flag that can be used. Main table can be of type result and other resources can have other types (metadata).
  • Trey R.: How to resolve inconsistencies between Parquet and VOTable? Parquet should have the last word.
  • Mario J: Important to specify how to deal with inconsistencies between VOTable metadata and Parquet metadata.
  • Brigitta S: Prefers Parquet support first, before generalization (FITS). Example files would be useful for implementations such as astroquery, e.g. multiple tables within the same file.
  • Jos DB: Compression algorithms. Should start thinking about this as it's one of the core strengths of Parquet.
  • Gregory D-F: We need to define new TAP result formats. Is there a Parquet MIME type?
  • A few participants stressed the importance of moving fast with the adoption of Parquet in IVOA as a lot of projects are currently under way.
  • Marco M: Proposed the faster route of writing an IVOA Note for VOTable and Parquet and then explore generalizing it for other types (FITS) and turn it into a standard.

Actions:

  • Mark T. and Gregory D-F will start drafting the note
  • Mark T. will make his presentation available so that others can comment on it - done
  • Mark T. will present progress during the Apps session in Malta
  • Apps WG to support the effort: create a TWiki page with useful info, organize meetings post Interop as required, etc.


Topic attachments

  • Bulk_download_for_DR4-public.pdf (PDF, 903.4 K, r1, 2024-11-06 10:01, MarkTaylor): Gaia DR4 bulk download plans
  • skysim10.parquet (Parquet, 2.9 K, r1, 2024-12-17 12:45, MarkTaylor)
  • votparquet-telecon-2024-11-05.pdf (PDF, 139.5 K, r1, 2024-11-05 22:37, MarkTaylor)

Topic revision: r13 - 2025-01-20 - MarkTaylor