TAPImplementationNotes < IVOA

IVOA Web>IvoaDAL>TAPImplementationNotes (2014-05-13, PatrickDowler)

This page was intended to collect points that should be clarified/fixed in future versions of the UWS/TAP/ADQL combo of standards. By now, this material has been moved into a draft note on Volute; see http://volute.googlecode.com/svn/trunk/projects/dal/TAPNotes/TAPNotes-fmt.html and the containing repository.

Additional material on TAP reform is found at:

UWS

(see also UWSEnhancement -- can we merge the material?)

From Paul's mail

The following points are discussed in the Mail B36FEF85-E316-436E-AC69-2F92D0E0FC5C@manchester.ac.uk dated 2011-05-23 to the GWS list by PaulHarrison

Paul promised to update the UWS in volute to reflect much of this, but it certainly wouldn't hurt to have some of it in an "implementation note"-type document.

Section 2.1.3 HELD status - whilst this might appear to have little utility in current implementations, in future versions where there might be quotas or priorities in the UWS then HELD is a way of expressing within the UWS that the job is accepted in principle, but will not be run until some action (like freeing up some of the quota) is taken.
It is probably not made clear enough that the initial values of the parameters (and certainly the possible parameter names) are all established during the initial POST that creates the job, and in most cases this is how the job should be driven. The ability to set an individual parameter after job creation is an additional capability that the UWS may offer. It should not offer the ability to create new parameters or delete existing ones; in this way a client that just creates the job with the initial POST does not "miss" out on setting a crucial parameter. We could make this clearer by removing the ability to set the individual parameter, as I believe that it was added as a "would be nice" feature without a strong use case. There is only one guaranteed way to set a parameter that all UWS services must implement: in the initial POST that creates the job.
Section 2.2.3.2 & 2.2.3.3, Changing execution duration & destruction time: if a service chooses not to implement these features, the standard is clear that a value of 0 should be returned for the execution duration, but I agree it is not clear what should be returned for the destruction time. In the job schema the DestructionTime element is nillable, so that would be an appropriate representation in the job XML. However, for the value returned at the resource URL, I agree that there is no description of what should be returned in the case where the UWS never deletes a job; you could return a value far in the future.
A job can be deleted at any time; it is up to the UWS server side to clean up appropriately.
Although the current wording of the document does not make this clear enough in every case, the intention is that changing the PHASE of the job is a request by the client to the server, and the client sees whether it has been successful by examining the XML returned by the redirect to the URI /{jobs}/(job-id)/. The allowable transitions are shown by the state diagram within the document. TODO: Decide if invalid transitions should be an error
Attempt to update a parameter on a job that's not PENDING: a 403 [Forbidden] status should be returned
- 403 should be reserved for cases where a request is not authorised and implies that the request could be allowed with the right credentials; this should be 400 "bad request" if it is always illegal -- PatrickDowler - 2014-05-13
The text needs updating to say that creating a parameter at any stage other than the initial job creation POST is not allowed.
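
The parameter rules discussed above can be sketched as a small decision function. This is only an illustration: the function name, the in-memory dict of parameters, the 303 returned on success and the 400 returned for an unknown parameter name are assumptions, not text from the UWS standard. The status for a job that is no longer PENDING is left configurable because the notes above disagree on 403 versus 400.

```python
from enum import Enum


class Phase(Enum):
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    EXECUTING = "EXECUTING"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"
    ABORTED = "ABORTED"
    HELD = "HELD"


def update_parameter(params, phase, name, value, locked_status=403):
    """Return (http_status, params) for a request to set one parameter.

    params: dict of the parameters established by the initial POST.
    locked_status: status for a job that is not PENDING (403 or 400,
    see the discussion above).
    """
    if phase is not Phase.PENDING:
        # The job has left PENDING: parameters can no longer be changed.
        return locked_status, params
    if name not in params:
        # Creating a parameter after the initial POST is not allowed.
        # 400 is an assumption here; the notes do not name a status.
        return 400, params
    updated = dict(params)  # leave the caller's dict untouched
    updated[name] = value
    # Assumed: success redirects back to the job, as for PHASE changes.
    return 303, updated
```

For example, `update_parameter({"QUERY": "SELECT 1"}, Phase.PENDING, "QUERY", "SELECT 2")` yields a 303 and the changed parameter set, while the same call on an EXECUTING job yields `locked_status` and the unchanged parameters.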

Other

Section 2.2.3.1 defines the HTTP response code for an accepted job as 303, but does not say what should happen for a rejected job. It should do (200 plus error document??) -- MarkTaylor - 08 Jun 2011
The content of the /quote resource is an integer number of seconds (sec 2.1.1), but the content of the uws:quote element is xs:dateTime (schema); this mismatch seems unnecessarily confusing unless there's some rationale I'm missing. -- MarkTaylor - 29 Jun 2011
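
To make the mismatch concrete, here is a minimal sketch of converting between the two representations. It assumes that the integer at /quote means "seconds from now" and that the element carries a UTC timestamp; both the interpretation and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"


def quote_seconds_to_datetime(seconds, now):
    """Turn a /quote value (assumed: seconds from `now`) into xs:dateTime text."""
    when = now.astimezone(timezone.utc) + timedelta(seconds=seconds)
    return when.strftime(ISO_UTC)


def quote_datetime_to_seconds(text, now):
    """Turn uws:quote text (assumed: UTC, 'Z' suffix) into whole seconds from `now`."""
    when = datetime.strptime(text, ISO_UTC).replace(tzinfo=timezone.utc)
    return int((when - now).total_seconds())
```

A client that has to handle both forms needs the current time to compare them, which is part of why the mismatch is awkward.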

TAP

Can we come up with a lightweight way of allowing some sort of (insecure) authentication ("don't publish my queries") while keeping available TAP results for uploads to other servers? --MD
- sufficiently obscure job ID values plus not letting anyone list jobs hides them -- PatrickDowler - 2014-05-13
UPLOAD parameter spec needs some clarifications. --MD
- Are quoted identifiers allowed as table names? (in DaCHS, they are not)
- What should happen if a URL or table name contains a comma or semicolon? (in DaCHS, they are effectively forbidden in both table names and in URLs, since there is no way to escape them)
- When people re-post an UPLOAD parameter, should uploads be added or replaced? (in DaCHS, they are added)
xtype=adql:REGION on upload: such columns will usually result in polygons, at least when implementing against pgsphere --MD
- this is really an ADQL problem; we should not have had the generic REGION datatype at all -- PatrickDowler - 2014-05-13
Require a filename header on inline uploads? (this would make it easy to tell them from "regular" parameters without having to parse all UPLOAD parameters first) --MD
One of the columns in the TAP_SCHEMA.columns table is named "size". This is an ADQL reserved word, which is unfortunate. Can be got round by quoting the column name in ADQL, but it's a gotcha which might be worth mentioning. -- MarkTaylor - 08 Jun 2011
UWS 2.1.11 discusses how parameters of an existing job can be updated, and says that it's up to the implementation to define what is permitted. As far as I can see this is not really done by TAP, though some examples in the Informative section 5 provide suggestions. It should be clarified. -- MarkTaylor - 24 Jun 2011
Should the table metadata (from /tables endpoint and TAP_SCHEMA tables) include metadata about the TAP_SCHEMA tables themselves? Should be made explicit in the TAP standard. -- MarkTaylor - 28 Jun 2011
- Agreed; since 2.6, second paragraph, says they should be in TAP_SCHEMA, I'd venture it's pretty much implied they should be in /tables, too. -- MarkusDemleitner - 29 Jun 2011
Is BOOLEAN a legal TAPType? VODataService sec 3.5.3 says yes, but TAP sec 2.5 says no. Probably the answer is no, but this should be clarified (see this mail) -- MarkTaylor - 13 Jul 2011
The wording in TAP section 2.9 is somewhat inconsistent about the format of VOTable error documents. Section 2.9 says "The VOTable must contain a RESOURCE element identified with the attribute type='results', containing a single TABLE element with the results of the query.", and Section 2.9.1 says "The RESOURCE element must contain, before the TABLE element, ...". However, it's clear that this section is discussing both successful and error outputs, and in the case of an error no TABLE element will normally be present, only one or more INFOs. The intention is clear from the fourth example in sec 2.9.1, but it should be reworded. -- MarkTaylor - 21 Jul 2011
There should be some language on what to do with oversized uploads; in the inline case, the server probably should send back a 413 status and just close the connection (which, for common client libraries, will just raise a connection reset exception or so, but there's nothing we can do about this as far as I know) --MD
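
The UPLOAD questions above become visible as soon as one tries to parse the parameter. The following sketch assumes the "name,uri;name,uri" form and is not how DaCHS or any other service actually does it; the function names and the `replace` flag are hypothetical. Splitting on ";" first and then on the first "," lets a comma survive inside a URI but not a semicolon, and neither can appear in a table name, since there is no escape mechanism.

```python
def parse_upload(value):
    """Split an UPLOAD value into a list of (table_name, uri) pairs."""
    uploads = []
    for item in value.split(";"):
        # Only the first comma separates the name from the URI.
        name, sep, uri = item.partition(",")
        if not sep or not name or not uri:
            raise ValueError(f"malformed UPLOAD item: {item!r}")
        uploads.append((name, uri))
    return uploads


def merge_uploads(existing, new_pairs, replace=False):
    """Combine uploads from a re-posted UPLOAD with those already on the job.

    existing: dict mapping table name to URI.
    replace=False adds to the existing uploads; replace=True discards them.
    """
    merged = {} if replace else dict(existing)
    merged.update(new_pairs)
    return merged
```

With this sketch, `parse_upload("t1,http://example.org/x?a=1;b=2")` raises `ValueError` because the semicolon in the URL is taken as an item separator, which is exactly the ambiguity the spec should address.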

ADQL

The spec omits language that says <separator> (and thus comments) is what actually separates tokens. Thus, a naive implementation of the grammar only allows comments between parts of split-up string literals. The spec needs to be improved, but meanwhile saying "<separator> is this grammar's token separator" or so should do. --MD
Decaying INTERSECTS with point arguments to CONTAINS is a major implementation effort without much benefit. Can we please just deprecate it? --MD
- I agree that this is mildly annoying at best (in pg_sphere you have to at least cast the point to a circle) and could make a lot of work for some implementations with no real benefit -- PatrickDowler - 2014-05-13
Can we recommend a simple positional crossmatch function like crossmatch(ra1, dec1, ra2, dec2, radius), all in degrees? People use that a lot, and asking them to write that CONTAINS mess all the time is not nice --MD
- In a service with geometry types and indexes on them, they will have 2 column references and a radius instead (3 args); if you define the function this way, queries will extract the longitude and latitude and completely foil any indexing... can we define crossmatch(point, point, double) instead? An implementation with separate columns can more easily take that apart. -- PatrickDowler - 2014-05-13
In general, the presence of coordinate system metadata in the geometry functions has been a big pain for ADQL and TAP; in ADQL it implies server-side transformations that lead to many new ways to fail and in TAP they lead to output metadata being encoded in the table cells instead of the header (of VOTable). This should all be replaced with simpler user friendly geometry (circle, coord range, simple convex polygon). Also, there is no need for the generic REGION type and the polymorphism of TAP output values (column could contain a mix of polygons and circles, for example) it implies -- PatrickDowler - 2014-05-13
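
For reference, this is roughly what the proposed five-argument crossmatch would stand for in terms of the existing geometry functions. The helper name is hypothetical, the 'ICRS' coordinate system argument is an assumption, and the helper only builds an ADQL string; it is a sketch of the expansion, not a proposal for how a service should implement the function.

```python
def crossmatch_condition(ra1, dec1, ra2, dec2, radius_deg):
    """Build the CONTAINS/POINT/CIRCLE condition a crossmatch() call would replace.

    ra1, dec1, ra2, dec2 are column references (strings); radius_deg is in degrees.
    """
    return (
        f"1 = CONTAINS(POINT('ICRS', {ra1}, {dec1}), "
        f"CIRCLE('ICRS', {ra2}, {dec2}, {radius_deg}))"
    )


# Hypothetical tables cat_a and cat_b, each with id, ra and dec columns.
query = (
    "SELECT a.id, b.id FROM cat_a AS a JOIN cat_b AS b ON "
    + crossmatch_condition("a.ra", "a.dec", "b.ra", "b.dec", 0.001)
)
```

The point-based signature suggested above would let a service pass its indexed geometry columns straight through instead of rebuilding points from separate coordinates.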

VOTable issues

[Started from E-mail by Tom McGlynn] See VOTableIssues

ObsTAP

s_region has units "deg" in Table 4, but is unitless in Tables 1, 5 and 6. Unitless is correct, I think. -- MarkTaylor - 28 Nov 2011
Some items are listed as "float" and others as "double" in Table 1. They are all "double" or "adql:DOUBLE" in Tables 4, 5 and Table 6. Is there a difference? -- MarkTaylor - 28 Nov 2011

Table metadata scalability

There is a scalability issue for the table metadata document (/tables endpoint) of large databases. The XML description is currently about 0.4 MB for GAVO, 5 MB for HEASARC, and predicted 80 MB for VizieR (see pages 5-6 of this presentation by Gilles Landais). An interactive TAP client will typically want to acquire table metadata from the service before offering the user options on which tables are available. An 80 MB download is too much. The other option as it stands is doing a TAP_SCHEMA query (e.g. SELECT table_name from TAP_SCHEMA.tables - ~0.5 MB for VizieR?), and acquiring column info in a similar way when the user has chosen a table. That's OK, but since it involves actual TAP queries, services may queue the query and delay before responding (can TAP service implementors comment on whether that's a legitimate concern?), while a flat file access from the tables endpoint can be expected to be served immediately. So, an extension/alternative to the existing tables endpoint format might be a good idea, maybe a practical necessity when VizieR TAP arrives. Gilles' talk quoted above suggests one way to do this, but other variations on the idea of storing metadata for list-of-tables and columns-per-table as separate static documents are possible. -- MarkTaylor - 11 May 2012

For querying the tap_schema using TAP, I would expect people to use the mandatory sync endpoint and not have to worry about queueing issues. In general sync can respond successfully for the kinds of queries one would use here as the where clause is typically pretty simple. -- PatrickDowler - 2014-05-13

As for the /tables endpoint, the data model is hierarchical so we could just have clients ask for subsets of the tree down to the individual table level (eg GET /tables/someSchema/someTable). For /tables and /tables/<schema_name> we would probably need a query parameter to limit the depth (eg GET /tables?depth=1 would return only the schema metadata but not child (table) lists; GET /tables?depth=2 would return all the schema (d=1) and tables (d=2), but not the columns; GET /tables/someSchema?depth=1 would return only the tables (d=1) below someSchema). Instead of depth=<int> we could also use detail=<schema|table|column> (kind of like vospace) to set the max item type to be returned. -- PatrickDowler - 2014-05-13
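
The depth idea can be sketched on a toy in-memory tree. Everything here is an assumption made for illustration: the nested dict/list layout, the sample names, and the helper functions; a real service would return VOSI tableset XML rather than Python structures.

```python
# Toy metadata tree: schema -> table -> list of column names.
TABLESET = {
    "someSchema": {"someTable": ["ra", "dec"], "otherTable": ["id"]},
    "TAP_SCHEMA": {"TAP_SCHEMA.tables": ["table_name"]},
}


def subtree(tree, path):
    """Walk down the tree, e.g. ["someSchema"] for /tables/someSchema."""
    node = tree
    for part in path:
        node = node[part]
    return node


def limit_depth(node, depth=None):
    """Return the node with at most `depth` levels of children (None = all)."""
    if depth is not None and depth < 1:
        raise ValueError("depth must be at least 1")
    if isinstance(node, list):
        return list(node)  # columns are leaves
    if depth == 1:
        return list(node)  # names only, no child lists
    next_depth = None if depth is None else depth - 1
    return {name: limit_depth(child, next_depth) for name, child in node.items()}
```

Here `limit_depth(TABLESET, 1)` corresponds to GET /tables?depth=1 and lists only the schemas, while `limit_depth(subtree(TABLESET, ["someSchema"]), 1)` corresponds to GET /tables/someSchema?depth=1 and lists only the tables below that schema.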

Notes on TAP+ during Gaia Archive implementation

This is a compilation done by the ESAC SAT Gaia Team (Juan Gonzalez, Raul Gutierrez, Juan Carlos Segovia and myself, Jesus Salgado) on TAP/ADQL/UWS specifications during the early phase of implementation of the Gaia Archive. These requirements are described as TAP+ functionalities in the Gaia Archive Requirements documents.

Some of them are already mentioned in this page but we have collected them all here for better tracking:

Authentication (and Authorization?): It is a requirement for us to maintain a certain level of security for TAP access. There are several aspects involved:
- There is a requirement to keep users' ADQL queries from being publicly visible. Normal use of our system will be through "anonymous" access. However, users will be able to access expert capabilities after login. For these users, the content of the science cases they are studying is, to some extent, expressed in the queries they run, and they would like to keep them hidden.
- Table upload will be done in the Gaia Archive using a persistent approach. Tables uploaded by the user could be stored in a user schema. User schemas will be designed so that they are visible only to the owner or to other users with whom the tables are shared. That implies authorization rules.
- We are implementing a per-user-id filter on the jobs returned by the /async method at TAP level. A typical filter can be done at client level (e.g. storing the job id list in memory) but, as part of the authentication approach, we needed something more secure for logged-in users. Also, queries on the Gaia data could take a long time, so users should be able to close their browser, open it the next day, log into the system and see the status of their pending and old jobs.
Pagination: This is a nice-to-have to allow tabular views of query results. We do not have a preference on where this should be added: into the language (e.g. as part of the ADQL query) or into the interface (through specific input parameters). There is also some discussion about this need, as it is not clear why a user should be interested in page "167" of "150221". It is clear that a query that returns the first n records of a certain query response is useful, but this functionality is already covered by the current specifications.
Data Model representation: The current TAP response does not have structure. The result content is a single table response. In Gaia, we have the requirement to produce a response compatible with a certain Gaia data model version, e.g. with fields that can be considered serializations of a certain object/fraction of an object. For the time being, this data model has been kept quite simple/flat, but the requirement is still there. Two possible approaches to fulfill this requirement were already proposed in the past by allowing a VO/DML preamble in the TAP response (implementation details at server and client sides to be defined)
Crossmatch support: For the Gaia Archive, we have extended the language to allow positional crossmatches. We are not fully sure whether this should be covered by the "official" ADQL (I remember that crossmatch support was removed during the ADQL specification definition). In any case, and although nobody stops you from publishing crossmatch functionality as User Defined Functions, if the community is asking for this globally, it could be a good idea to standardize it. Of course, we are not talking about covering complex crossmatch approaches... for us, the main requirement is just a q3c_join-like approach.
TAP schema too heavy: We also have the problem reported by Mark on TAP schema scalability. Our TAP schema metadata is normalized at database level so we can change it dynamically. This is particularly important for us in order to allow these user schemas (the public tables present in the system are more static). That means that, although our TAP schema is not so big for the time being, we will need to request it several times during a normal session, so if we can minimize the information to be propagated from the server to the client (perhaps by a lighter TAP schema serialization) we could benefit too.
Persistent Upload: This point has been already mentioned before in point 1. Users can upload tables into the Gaia system and maintain them in their own TAP schema for further use. This is a basic requirement we will need to fulfill.
ADQL queries without FROM: In ADQL, it is not possible to write queries without a FROM clause. Why do we need this? In the case of crossmatch operations, there are important performance issues that require calling user defined functions as queries without the FROM clause (for PostgreSQL) or using the DUAL table (in Oracle). If you decide to put a table into the FROM clause (e.g. one of the tables to be crossmatched), the result of the query will be a "void" (or a result table name, depending on how you define your crossmatch operations) repeated as many times as there are records in the FROM table (which is a killer). Either we go for the PostgreSQL approach (queries without FROM) or for the Oracle approach (the table name "DUAL" is reserved and is a dummy table with one record; if you have Oracle, you already have it). We would vote in favour of making the FROM clause optional.
UWS jobs destruction implementation recommendation: Jobs are properly destroyed on user request (including the DB query) and the UWS implementation takes care of it. We consider this a recommendation for a correct implementation rather than a real requirement for the protocols.

We hope this list could be of interest for the IVOA community. We can iterate with the rest of the people on the decisions taken by us on the previous points and we will try to update these notes during the rest of the archive implementation process. -- JesusSalgado - 2014-01-31

Topic revision: r22 - 2014-05-13 - PatrickDowler
