VOQL WG Meeting, 14 May 2003 Chairman: Masatoshi Ohishi Present: Masatoshi Ohishi, Wil O'Mullane, Bob Mann, Masahiro Tanaka, Francoise Oschenbein, Naoki Yasuda, Clive Page, Mark Taylor, Pedro Osuna, David Schade, Alasdair Allan, Pat Dowler, Ed Shaya, Gerard Lemson. Notes by Bob Mann 1. Naoki Yasuda (JVO QL) Want to describe query condition, what to retrieve and analysis of data. Need to cover distributed databases and federate catalogues and images. Requirements: Platform independent. Sufficient for what astronomers want to do. Easy to understand by humans and machines Format: JVOQL based on SQL, but with astronomy-specific extensions: XMATCH: for cross-matching AREA: image cutout and celestial area limit XMATCH: Spatial matching similar to join in SQL. - e.g. where XMATCH(a,b) < 3 arcsec (same as SkyQuery syntax) Image Cutout: AStronomers want images associated with objects extracted from catalogues. Express in SQL-like manner: A.BOX(POINT(A.ra,a.dec,J2000), 3arcmin, 3arcmin) after defining POINT and AREA; AREA can be box, circle, oval, triangle Complex queries over multiple databases will be split up and component parts processed on different databases independently, where possible. Use count statistics to decide the order in which to do federated parts - just like SkyQuery. One use of AREA is to search for coverage in particular sets of passbands, or multiple observations - i.e. operations on image metadata. Page: coverage queries can be done in registry. Schade: needs dynamic and detailed registries Page: inevitable? O'Mullane: can handle sparse coverage of large are via "fill factor" Dowler: don't want to make the registry the full thing - some info should be left in the database. Dowler: cross-mtaching individual objects, not 10s of millions Yasuda: yes O'Mullane: SkyQuery not for matching whole catalogues 2. Wil O'Mullane (SkyQuery) ~5 staff-months of effort for demonstrator. All SQLServer databases, despite being physically remote. Want simple framework that matches what astronomers want to do. Web services chosen partly because it was for a WS competition. Services for meta-data, query, information, cross-match. Portal manages the query, different sky nodes hold individual databases. Not covering replication, etc. Now adding more spatial structures - polygons, with holes, etc, etc: need to match region specification in registry. Cross-match: good for small areas now - optimization in pipeline. Oschenbein: does this return only one result? - can be several candidates within the error radius? O'Mullane: can have more than one candidate, but hasn't happened Mann: this only works when error ellipses are small, and there is less than one object on average in the error ellipses. The general case can't be processed on a source-by-source basis. O'Mullane: clearly, more to do scientifically - this is more about getting a working prototype, which can be extended later. Dowler: count queries are vital for optimising performance O'Mullane: exactly - that's the essence of SkyQuery. Running the count query also brings the relevant data into cache. Ohishi: want to include model of network speed in generation of query plan. Page: important if there are replicas of catalogues O'Mullane: could poll replicas and choose which looks best. Mann: might have different indexing in different replicas O'Mullane: can clearly extend the cost-function used. At the moment, the subqueries are not executed in parallel, which they could/should be. Lessons: it works! - web services provide a good framework for this sort of work. Adding SkyNodes is easy. Uses Microsoft Dataset type in C#, which makes things easy. Intend to produce a shrinkwrapped package for other people to use. Questions: user interface: XML, GUI, SQL - or all of them? participant DBs: what functionality is required? Allan: astronomers don't want to use SQL. Many people: yes they do. O'Mullane: SDSS now allow users to generate temporary tables, but on a first-come-first-served disk usage. 3. Francois Oschenbein (A few notes on JVOQL) Especially interested in cross-matching between large catalogues. Want to select catalogues by author or wavelength - should this be at the registry level or the QL level. O'Mullane: in SkYQuery model, need to register with the portal before the SkyNode can be accessed. Oschenbein: will this scale to a situation with 10,000 small tables? Dowler: can ignore irrelevant skynodes as they give count=0 in the first step Oschenbein: what does "union" mean here? - implies same schema in both tables Yasuda: yes Oschenbein: what about UCD usage in JVOQL? In general, won't know the column names. O'Mullane: want to use data model, not UCDs - as UCDs can match to more than one column. Oschenbein: Aladin uses UCDs to filter, to find likely useful data. But they aren't unique, so have to use additional rules - e.g. take first column matching the UCD, or the first with non-null contents. [Discussion of relationship between UCDs and Data Models: off topic?] XMATCH: XMATCH (a,b..)< something - does the something have to be a constant or can it be a function of table entries (e.g. the error ellipses of different sources)? O'Mullane: the XMATCH procedure uses the positional errors in a chi-squared function. ...and is this limited to positional criteria? O'Mullane: it is now - but can add extra constraints in the WHERE clause. Schade: need to think about the general (i.e. non-positional) case - should have a workshop on that. O'Mullane: wider issue of how do you run user code in the VO. Mann: not sure whether you want to have image extraction in the query language. Image extraction and catalogue attribute extraction are different sorts of operation. Shaya: but some images are stored in tabular formats - e.g. COBE 4. Clive Page (AstroGrid thoughts) What is the QL to do? - a language for the user or an internal language in the VO system? Simple Queries: * cone search for catalogues, simple image access, search for observations (e.g. querying logs), resource discovery query (does this need a separate registry query language? - hopefully not), extending cone-search to running on multiple catalogues (laborious, if using column names of all tables - better to use UCDs?), cross-matching query (e.g. SkyQuery, Astrobrowse, MAST). Search parameters: * simple cone search: name or posn; shape and size of search area; limits on named columns (or UCDs); waveband(s) of interest; list of sites to search. Output options: * data formats for tables and images; which column names; max no of rows; ordering of output records; etc Query Language considerations: Xpath, Xquery,...etc etc - many variants: XQuery looks like market leader and is a superset of many of the others, in some sense. Shaya: XQueryX is a pure XML version of XQuery Taylor: do these only work on XML? Shaya: expected that RDBMSs have a virtual XML front end - Oracle does that now, for example Shaya: XQuery is not a distributed system - only works on a single database at a time. Dependent on the schema of a particular database. Page: could add on a layer that translated from UCDs, say, into the column names of individual databases. Lemson: isn't this where the data model WG comes in? Shaya: can use XQuery on WSDL to determine how to get desired data from a particular database. Ohishi: may be a performance issue with XQuery. 5. Pat Dowler (CVO Constraint System) CVO prototype has a constraint-based system. Constraint: a statement the user makes which can be true or false - want to get back all the entries for which the constraint is true. Capture concept of a condition. Want an object-based way of developing SQL queries. Components are expressions (constants, properties, operators) & constraints (e.g. eq, leq, contains, intersect, etc) DB implementer chooses what to do with this for a particular database - this is just an abstract representation...different databases can have their own User Defined Functions, etc. The query language has to reflect the types in the data model. What you're really doing is probing the data model. No explicit FROM clause - assumes that the service can work that out on the basis of the properties involved. Initial prototype used Sybase, but took just 1 day to convert all the transformation mechanism to use DB2. Haven't bothered to include an explicit SELECT clause either - expect the user/application to ignore the unwanted columns that are returned. Ohishi: is it intended that users can write queries in this language? Dowler: could have a programmatic approach, using libraries. Shaya: what about federated queries on distributed databases? Dowler: can do that in a layered approach, with an integrator to do the integration of data from physically separate databases. Demands of portability drive the design, regarding the parse tree. 6. Ed Shaya (VO Query Language) Requirements: (include) providing a means for users to submit general requests for astronomical information from a distributed set of repositories; plug-and- play with metadata and ontology; independent of internal arrangement of data at repositories; span a distributed set of heterogeneous services - including workflow of interactions between registries, services and user; easy to learn and use (hiding obvious and tedious steps from the user) - this may require several levels of language, e.g. one being easy web form; integration of multiple responses; must handle transformation to multiple queries in multiple dialects; extensible - e.g. allow use of namespaces for future-proofing. XQuery: many nice features. if/then/else, FLWR, case switch. basic functions, plus ability to define functions. Much astronomical data is really hierarchical - we may use relational databases, but the structure is not really as flat as that. Allan: RTML is an XML dialect for requesting data from remote telescopes. It is an existing standard within the robotic telescope community. Should look at it as a possible VOQL before starting on inventing a new XML query language. Oschenbein: this seems to be a rapidly-developing standard - difficult to keep up with it, and didn't recall seeing the data return aspects of RTML. Lemson: what about federated queries? Allan: being discussed for RTML3. Shaya: should at least allow for interoperability with RTML - e.g. transforming an RTML document using XSLT XQuery intended for querying one database with known schema. XQueryX is a version of XQuery written in XML. OLAP: online analytical processes - analysis of data cubes, available as add on to many databases. XMLA is an XML version of OLAP. Data mining is not part of VOQL - it's a higher level on top. ADQL (astronomy data query language): includes searching by bibliographic query, obtaining individual tables from data collection. Needs knowledge of how the data are laid out. Knowledge Based Query - higher level, based on scientist's view of the problem, which can be transformed into ADQL. Build up a Property Space from properties, which provides a way to select species of object - e.g. find quasars in colour-colour space. Needs the ability to do fairly complicated mathematical operations on these properties. Problem Statement Language: allow generation of queries about objects, or data objects, or classes - querying very different types of data within the same system. Have put ReadMes for 1400 catalogues into XML, to allow metadata queries. Map from properties to column names. Have a UI which translates PSL into natural language, allowing checking and editing of queries before execution. 7. Masahiro Tanaka (VOQL usage examples) Major query items: fundamental dimensions (posn, wavelength, time, flux), plus maybe redshift and attributes such as object names, types, instrument, etc. Use Cases: i. Follow-up Obtain supplemental data of objects that you have observed (e.g. find GRB counterparts) Query by position, time, cross-match ii. Search for particular objects Find unknown objects with patricular features. (e.g. find brown dwarfs) Query by color/line, cross-match. iii. Search for moving objects Find moving objects (e.g. comets). Query on posn, time, cross-matching Shaya: need to handle some obvious things (e.g. plate limits) automatically Osuna: should the VOQL include a generic macro for matching moving objects, if it a generic operation? Ohishi: the VO should accelerate the astronomer's science, not do it! iv. Line matching Identify lines in spectra, by comparison with model spectra Envisage LINEMATCH function to do this - including possible Doppler shifting of observed lines from rest wavelength. Summary: These use cases suggest need for XMATCH and LINEMATCH functions. Should VOQL include specification of data analysis functionality? O'Mullane: need to do the analysis near the database, so that you don't have to shift lots of data around, but not clear how much of that should go into the VOQL - maybe leave it up to each database to decide how much it will implement. Need to start simple, and then will be able to meet our goals, and have some hope of building the next level on top of that. If we start too ambitious, we'll never get anywhere. 8. Masatoshi Ohishi (WG plans) Finalise first version of VOQL (inc. reference implementation) by the end of 2004. Need first implementation by end of this year. Definition of VOQL: VOQL expresses only what the astronomers want to "observe" on VO(s). Need to offer different routes into the data - eg. GUIs and command line or programmatic interfaces. [Presentation of JVO prototype] Dowler: SQL doesn't hide the details enough to allow the service implementers enough freedom to do their work. Much better to have an abstraction layer. Shaya: want to make queries about astronomical objects, not about observations in databases. Schade: how much conformity to you require of databases for integration in the VO? - stronger requirements will mean less take-up. Shaya: why not use an SQL-like top level language? - maybe better to use XML, because it's easier to transform... Page: data model is hierarchical, while most databases are relational - so there's a fundamental mismatch. Dowler: data model needs two levels - detailed and superficial - and only the superficial one needs to be considered in the VOQL context. Shaya: surely agree that we need one language for data queries and one for queries about astronomical objects. The former should be something like SQL, but preferably in an XML format. Ohishi: that would require databases to implement two interfaces. O'Mullane: best way to proceed in the absence of a data model is to build services using a simple query language - can build more later. Shaya: need data centres to define WSDLs that can be understood by the prototype system. O'Mullane: it looks like a queryable web service, which will later need a portal, to allow the joint querying of a number of databases. Shaya: implementing JVOSQL in a set of web services would speed up many VO demos. Ohishi: the JVO prototype is already using web services - so much of what is wanted is already there. 9. Masatoshi Ohishi (Summary) Have to get definite definition of scope of VOQL. That needs a data model. One possible solution is the two-level approach. V1 is the lower level implementation, with work on higher level language progressing in parallel. Then, in two years, can implement v2 with both levels. Dowler: need join WS, cross-match WS. Shaya: a possible VOQL definition: "A language for the exchange of astronomers' queries with the VO system, which allows expression of the science user scenarios, within reason, and output structured data."