*DAL 1
Date: Tuesday 5 June 2025
Time: 9:00-10:30 EDT
Moderator: Joshua Fraustro
Notetaker: Grégory Mantelet
Participants: Joshua Fraustro (JF), Massimo Costantini (MC), Pat Dowler (PD), Grégory Mantelet (GM), Markus Demleitner (MD), Stephane Erhard (SE), Tess Jaffe (TJ), François Bonnarel (FB), Daniel Durand (DD), Mark Taylor (MT), Alberto Micol (AM), Theresa Dower (TD), ...

--------------------------------------------------------------------------------------------------------------------------------------

*Spoke3 Archive Infrastructure (Massimo Costantini)

* Main data workflows: ingestion / extraction
  * Ingestion
    * Python component implemented as a device and monitored by the TANGO Controls framework
    * Different data description models (MariaDB + mapping tables)
    * Metadata in PostgreSQL
    * Data storage and replication use S3
  * Supports cut & paste operations for Gaia and access to Fermi tools
  * Custom portals: astronomers can query them to search data
* Gaia use case:
  * A query returns 2 tables: one for metadata (including the DID) and another for data
    * Result of merging two queries
  * Cut-and-merge operations are possible and significantly decrease file size (GB to KB)
  * Gaia data is not yet openly accessible: users must log in to the portal
* Additional features:
  * Auto-correction of transit selections that do not match any source
  * Comprehensive logging with ElasticSearch / Kibana
  * Aggregation of observations and simulations
* One microservice compliant with TAP + another for DataLink
  * Accessible at the Spoke3 Portal URL + /tap
  * The access_url column embeds 2 UUIDs (managed by Redis):
    * Store the selected rows and queries
    * Create a personal cache for the user's queries
* DataLink:
  * #this => cut & merge; HDF5 => opened in the browser
  * #progenitor => full file
  * #counterpart => transits table
    * The transits table provides a recursive DataLink solution
* Jupyter notebook to demonstrate how to use the TAP endpoint
  * Uses Astropy and Pandas DataFrames
  * Possible to reproduce the TOPCAT behaviour with the DataLink document (see the sketch after this section)
* Next steps:
  * Performance comparison between Oracle (for Gaia) and PostgreSQL
  * Evaluation of sharding for distributing data across multiple databases/servers
  * Portal extension for new datasets / use cases
  * Resources are currently not being registered

Q: SE: Curious about Solar System targets, is there a way to access Solar System/moving targets?
A: The project is not driven by all Gaia use cases; no use cases in mind for Solar System objects.
MC: Don't think moving objects are supported.
Q: TJ: HEASARC has load problems, interested in ElasticSearch / Kibana for performance
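A minimal sketch of what such a notebook could look like with pyvo (the service URL and table name below are hypothetical; the session demo used Astropy and Pandas directly):

    import pyvo

    # Hypothetical endpoint: the real one is the Spoke3 Portal URL + /tap
    service = pyvo.dal.TAPService("https://spoke3.example.org/portal/tap")

    # Synchronous ADQL query; the table name is a placeholder
    results = service.search("SELECT TOP 5 * FROM gaia.source_metadata")

    # Follow the DataLink document attached to each result row,
    # which is essentially what TOPCAT does interactively
    for datalink in results.iter_datalinks():
        for row in datalink.bysemantics("#this"):
            print(row.access_url)  # cut & merge product (HDF5)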
--------------------------------------------------------------------------------------------------------------------------------------

*WD-DALI 1.2 update (Pat Dowler)

* Pretty close to being ready for the RFC process
* Changes since 1.1:
  * Clarifications/improvements
  * Availability
  * New xtypes
* GitHub issues:
  * Issues for 1.2 tagged as such
  * 14 issues completed
* Changes since DALI-1.1:
  * VOSI-availability now optional, perhaps deprecate?
  * OVERFLOW query status
  * Data types promoted
  * Update the architecture diagram (forward reference for standards using DALI)
  * Boolean serialization
    * Currently refers to the XML schema rather than VOTable
    * Should be based on VOTable's boolean datatype
  * POST parameter encoding (still under discussion)
    * Form-encoded vs multipart form data
    * Both should be accepted as POST parameters (see the sketch after this section)
* New xtypes:
  * uuid, uri
  * moc (ASCII serialization; see the MOC standard)
  * range, now defined as in SIA2
  * hms, dms (sexagesimal angle)
  * shape (polymorphic simple region; designed based on existing database implementations)
  * multishape: a region made of component shapes, implicitly a UNION. Complex footprints for surveys and complex instruments; covers about 90% of the use cases for STC-S
  * multiinterval (implicitly a UNION; most of the time for energy and time footprints)
  * json: a parseable string value with no specific meaning; a lower-level xtype than the others, but allows custom use cases (round-tripping)
* OpenAPI components
  * Small, describing individual parameters:
    * MAXREC
      * 64-bit integer (long, 0 to positive int)
    * UPLOAD
      * Constrained by a regex pattern matching a comma-separated (name, location) pair
      * Not valuable to attempt to capture all variations of location strings
  * Advice to clients to reduce pain in the future:
    * Parameters in OpenAPI are case sensitive; so a decision must be taken on parameter case folding: UPPER or lower?
      * Uppercase is currently used in many text documents just for clarity.
      * Style is typically lowercase.
* Out of scope / Future:
  * Include MOC in shape (see GitHub issue #53)
  * DALI-sync requires GET and POST (#50)
    * Causes repetitiveness/duplication in path endpoints in OpenAPI documents
    * Limits how payloads other than parameters can be used in requests
  * Include YAML as well as JSON as an xtype? (see #35 for the answer)
  * Case-sensitive params (#21)
    * Option 1: give advice that the OpenAPI document spec isn't strict, but a recommendation.
    * May have to require strict compliance re: parameters in the future if OpenAPI adoption continues / pays off.
* Authors think that WD-DALI-1.2 is almost ready for RFC

Q-FB: 2 questions on xtypes: 1/ About "moc": did you take into account the other MOC flavors?
PD: Spatial MOCs are the generally preferred type of MOC. Not sure about the other flavors (radio and time). Let's see what's possible with them first, and decide later whether to include them.
FB: 2/ Is there a plan to support legacy xtypes?
PD: If you mean the SIA2 xtypes (like STC-S regions): STC-S did not become a standard. Some parts of STC-S should be in columns. Transitioning to multishape should be trivial in most cases.
Q-MD: Wants MOC in shape. Thinks the only way to deal with regions in a database is through MOC.
Q-DD: Do you think about having negative shapes (holes)?
PD: Possible, but probably not worth the complexity.
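To make the POST-encoding point concrete, a small sketch (hypothetical endpoint) of the two encodings a DALI-sync service would then have to accept:

    import requests

    # Hypothetical DALI-sync endpoint
    url = "https://example.org/tap/sync"
    params = {"LANG": "ADQL", "QUERY": "SELECT TOP 1 * FROM ivoa.obscore"}

    # application/x-www-form-urlencoded (the common case for clients)
    r1 = requests.post(url, data=params)

    # multipart/form-data (what browsers send for forms with file inputs;
    # per the discussion, services should accept this too)
    r2 = requests.post(url, files={k: (None, v) for k, v in params.items()})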
--------------------------------------------------------------------------------------------------------------------------------------

*ADQL PEG grammar and validation (Grégory Mantelet)

*Grammar
* PEG will replace BNF to write the ADQL grammar (see the ADASS poster in Tucson in 2023)
  * The BNF is not machine-readable and is difficult to validate.
* A PEG grammar already exists on GitHub (lyonetia repository); written by Jon Juaristi Campillo
* It needs to be rewritten with:
  * syntax guidelines (rules to respect when writing and formatting the PEG grammar)
  * fixes for some elements (e.g. strict identifier syntax, forbidding reserved keywords, ...)
  * Since identifiers are the core of the grammar, it needed to be rewritten piece by piece
  * Rule names are likely to differ from those of the BNF
* As with BNF, there are different variants of PEG; the variant generally depends on the parser tool used
  * Each comes with a different syntax
  * Parser compatibilities:
    * Mouse: Java
    * Arpeggio: Python
    * peg/leg: C
  * Canopy vs Ford syntax:
    * Canopy requires a name for the grammar
    * String literals are usually between single quotes; Canopy uses back-ticks
    * Canopy identifiers are closer to regexes
    * But the differences are so small that they can easily be adapted with substitutions based on regular expressions
      * Regexes were created to convert the Ford grammar to Canopy.
* For ADQL, let's use a tool-agnostic, unique syntax: the one used by the author of PEG, Bryan Ford. Why? Because:
  * it does not depend on a specific tool that may disappear in the future
  * it is well defined in a published paper
  * it is convertible to other parsers' grammars via regular expressions and a Makefile
  * PEG is machine-readable
  * it works with various programming languages
  * it is easy to generate parsers, so the grammar can be validated (unlike the existing ADQL BNF grammar)

*Validator
* Currently, there is a validation suite for ADQL. It is based on a set of test queries stored in the lyonetia GitHub repository.
* There is also a validator, but it is implementation dependent.
  * We just want to validate the grammar itself.
* After validation, a report is stored in the Wiki section of the lyonetia repository
* 2 possible architectures:
  * one validator for each tool:
    * very heavy to maintain
    * prone to implementation differences
    * So, this is not the adopted solution.
  * one validator for all:
    * it runs a simple command-line parser
    * So, a different command for each targeted parser.
    * For this to work, a common API is required (see the toy sketch after this section):
      * parameters (at least the ADQL query to validate)
      * a JSON document as output (2 fields: success and error)
    * This validator must be multi-threaded in order to scale over all test queries, ADQL language versions and parsers.

*Next
1. Finish the validator (almost done)
   1.1. A few tests left to run
2. Complete the grammar (just started)
   2.1. The only way to test the grammar efficiently is to start with the grammar, and test after changes
3. Eventually add a few features (keeping in mind that a light ADQL minor revision is desirable; we do not want to wait for years before having a new ADQL revision)
   => see the next part of this session for discussions about the next desired features
4. Release a new WD -> PR -> REC version of ADQL (the RFC should be easier now, by relying on the validator)

MT: There was a BNF-to-HTML converter that allowed navigating around links, etc. Is there such a thing for PEG?
GM: Don't know! But I'll search for this feature.
PD: Curious about the number of syntax variations for PEG grammars. Why do all the parsers have different syntaxes?
GM: Perhaps each parser likes to implement its own features, so they use their own syntax.
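A toy illustration of the proposed one-validator-for-all contract, using Arpeggio (one of the Python PEG parsers listed above). The grammar fragment is a placeholder, not the actual ADQL grammar:

    import json
    import sys

    from arpeggio.cleanpeg import ParserPEG

    # Placeholder grammar in Arpeggio's "clean PEG" dialect -- NOT the
    # real ADQL grammar, just enough to show the validator contract
    GRAMMAR = """
        query      = select_kw identifier EOF
        select_kw  = "SELECT"
        identifier = r'[a-zA-Z_][a-zA-Z0-9_]*'
    """

    def validate(query):
        try:
            ParserPEG(GRAMMAR, "query").parse(query)
            return {"success": True, "error": None}
        except Exception as err:
            return {"success": False, "error": str(err)}

    if __name__ == "__main__":
        # One command-line parameter: the query to validate;
        # output: a JSON document with the 2 agreed fields
        print(json.dumps(validate(sys.argv[1])))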
--------------------------------------------------------------------------------------------------------------------------------------

*ADQL-next discussions (All)

*Optional vs mandatory features
* SE: We already have problems with optional features in ADQL: we now have two portals that send ADQL to many services. There are a number of problems; not all services understand the same query. That breaks down interoperability, since ADQL is not uniform. Simple things like LOWER, UPPER.
* GM: Kind of agree. Some features are broadly supported in all DBMSs, like LOWER, UPPER, ILIKE. Would like to go through the databases and see if they are all supported. Others, like IN_UNIT, have only one implementation. So some must stay optional.
* SE: LOWER, UPPER, ILIKE, OFFSET are important
* AM: There are plenty of reserved SQL functions that are not available in ADQL (string manipulations); we can't ask developers to re-implement these. I hope you'll include many of them.
* GM: These features (on the slide) are only optional in ADQL, but if they are simple to integrate, "why not add them to ADQL". However, let's keep the next release simple; I'd prefer several small but quick ADQL revisions to one long release in a few years.
* AM: In DALI, we see INTERVAL, which is native in HE (high-energy) data and others; we see the promise in DALI, and the same should go into ADQL.
* PD: Re: HE usage of intervals, multi-intervals on the time axis: a T-MOC is probably better for a more complex footprint, e.g. finding intersections (a whole survey); INTERVAL/multiinterval are good for simpler cases. Moving forward with ObsCore, keep in mind that different missions may want to make different choices re: which datatypes to use. Flexibility, of course, means optional, and that is hard on interoperability, but mandatory things are hard on implementation costs. It differs across underlying servers; it is difficult to finesse into what the database server supports. Implementing functions/datatypes is considerably more complicated. Balance implementation barriers in servers against interoperability.
* TD: No reason a RegTAP server needs all of these; on the other hand, with Roman and very large catalogs, some will need to be functionally required. Need to implement them in the backend. The balance of required vs optional is difficult. Thanks for the translator!
* MD: Refocusing Pat's point about optional/required:
  * Optional features are good for the server side: implement what you want.
  * Poison for the client side: clients must determine what is available (see the capabilities sketch after this section).
  * The problem is: most people here are server-side people, ergo so many optional features.

*MOC
* Changed from polygons to MOC, a really simple way to describe geometry. All the optional parameters are there to focus the data provider on a specific parameter name, to ensure all data providers use the same one.

*INTERSECT

*Array
* GM: We have heard some presentations (SPHEREx, Gaia) that need array data; these are not isolated cases. Something we should work on. MD has a prototype; arrays should be supported in most databases. For those that don't, there are tricks. Should we make these optional?
* MT: No harm in making it optional, if data providers don't have array-valued data. Better than not being there at all.
* MD: The additional functions are what make arrays useful. On optionality, we don't have much of a choice. If we make them optional, it may just be an issue for UPLOAD (tables with arrays may be uploaded).

*Boolean

*Binary

*Numerical interval
* Discussed in the ObsCore discussion: use a UDF to compute?
* Typically, we might have thousands to millions of time intervals in one data product. There must be a way to represent that robustly, i.e. a T-MOC
* MD: Would like to have this implemented. Implementing it in database code is not trivial. As long as there is no support for T-MOCs in databases yet, don't include it in ADQL.
* GM: Agreed, I think it's too early for T-MOC. Let's first support spatial MOCs. It will already be a challenge to support that in most of the existing DBMSs; only PostgreSQL + pgSphere is ready. It is not so easy to implement on the DBMS side.
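On MD's client-side point above: a service can already declare its optional ADQL features via TAPRegExt. A rough sketch of a client reading them (hypothetical URL; namespace handling deliberately simplified):

    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical service; /capabilities is standard for any TAP service
    URL = "https://example.org/tap/capabilities"

    with urllib.request.urlopen(URL) as resp:
        root = ET.parse(resp).getroot()

    # TAPRegExt groups optional ADQL features under
    # <languageFeatures type="..."> / <feature> / <form>
    for node in root.iter():
        if node.tag.endswith("languageFeatures"):
            group = node.get("type", "?")
            forms = [f.text for feat in node for f in feat
                     if f.tag.endswith("form")]
            print(group, forms)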
*Unit syntax
* IN_UNIT: if your units have an inverse relationship, you must recognize that the relations must be flipped (see the worked illustration below). That either happens automatically or becomes confusing for the end user.
* MD: The current behavior is not to flip the intervals (...
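A worked illustration of the flipping issue (not from the session; a sketch using astropy.units): converting a wavelength interval to frequency reverses the ordering of the bounds, which is exactly what IN_UNIT would have to handle.

    from astropy import units as u

    # Wavelength interval, lower bound first
    lo, hi = 400 * u.nm, 700 * u.nm

    # Converting to frequency inverts the ordering: 700 nm is the *lower*
    # frequency, so a naive [lo, hi] interval comes out flipped
    print(lo.to(u.THz, equivalencies=u.spectral()))  # ~749 THz
    print(hi.to(u.THz, equivalencies=u.spectral()))  # ~428 THz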