TAP VOResource Extension Schema

Step 1 3/4: Pre-Draft (2011-01-11)

We're running on a tight schedule: a working draft should be done by 2011-01. So, I'd suggest all interested parties just quickly comment in-line in the following pre-draft. Please reply in paragraphs, and a date and initials at the end would help; everything uninitialled would then come from the original pre-draft. -- MD 2011-01-11

Here's a first shot at defining the TAP capability element as an instance document with interspersed comments. From an internally circulated attempt, I've moved things from attribute values to elements, since that, on second deliberation, seemed more in line with the general VOResource style.

I've also added resource limits since they seemed easy, and I've added user-defined functions in a very "light" form since I think they are indispensable.

<?xml version="1.0"?>
<capability xmlns:tap="http://www.ivoa.net/xml/TAP/v1.0" xmlns:vr="http://www.ivoa.net/xml/VOResource/v1.0" xmlns:vs="http://www.ivoa.net/xml/VODataService/v1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" standardID="ivo://ivoa.net/std/TAP" xsi:schemaLocation="http://www.ivoa.net/xml/TAP/v1.0 http://vo.ari.uni-heidelberg.de/docs/schemata/TAP-v1.0.xsd http://www.ivoa.net/xml/VOResource/v1.0 http://vo.ari.uni-heidelberg.de/docs/schemata/VOResource-v1.0.xsd http://www.ivoa.net/xml/VODataService/v1.0 http://vo.ari.uni-heidelberg.de/docs/schemata/VODataService-v1.0.xsd">
  <interface role="std" xsi:type="vs:ParamHTTP">
    <accessURL use="base">http://localhost:8080/__system__/tap/run/tap</accessURL>
  </interface>

Ought there to be a suitable xsi:type attribute on the capability element? Some registries seem to use this to identify capability types, e.g. cone search capabilities are marked xsi:type='cs:ConeSearch' in the AstroGrid registry. I haven't been able to work out whether examining this is a good and/or recommended way to locate cone search services though (if the standardID attribute is present it does the same job), so the answer may well be "no". -- MarkTaylor - 11 Jan 2011

Up to here, it's generic. I've made up the namespace in parallel to the S*APs.
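On the xsi:type question above: a minimal Python sketch of how a client might pick out the TAP capability by its standardID attribute alone, without consulting xsi:type. The helper name find_capabilities and the cut-down sample record are made up for illustration; only the standardID value is taken from the instance document.

```python
import xml.etree.ElementTree as ET

TAP_STANDARD_ID = "ivo://ivoa.net/std/TAP"

# Hypothetical, cut-down resource record; only the TAP standardID value
# comes from the pre-draft, the other capabilities are placeholders.
SAMPLE_RECORD = """
<resource>
  <capability standardID="ivo://ivoa.net/std/TAP"/>
  <capability standardID="ivo://example.invalid/std/other"/>
  <capability/>
</resource>
"""

def find_capabilities(record_xml, standard_id):
    """Return all capability elements whose standardID equals standard_id."""
    root = ET.fromstring(record_xml)
    return [cap for cap in root.iter("capability")
            if cap.get("standardID") == standard_id]

tap_caps = find_capabilities(SAMPLE_RECORD, TAP_STANDARD_ID)
```

A capability without a standardID is simply skipped here, which is the case in which a registry would have to fall back on something like xsi:type.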

  <dataModel ivo-id="ivo://ivoa.net/std/ObsCore-1.0">ObsCore 1.0</dataModel>

dataModels have a "name" in the text content (intended for humans, for labels and such), and an ivo-id. I'd tend to make the ivo-id non-optional.

  <language ivo-id="ivo://tap/languages/ADQL-2.0">
    <parameter>ADQL-2.0</parameter>
    <label>ADQL 2.0</label>
  </language>
  <language ivo-id="ivo://tap/languages/ADQL-2.0">
    <parameter>ADQL</parameter>
    <label>ADQL 2.0</label>
  </language>

The languages supported by the service. ivo-id here probably should be optional so people can write stuff like "TurboSQL 23.3". In addition, we give a "parameter" element, which is the value actually passed to the service (in this case, in LANG), and again a "label" element intended to be shown to humans in UIs.

We should probably allow for "description" as well. It's not here since my software doesn't have that for languages and friends.
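To illustrate the split between "parameter" and "label", here is a small Python sketch of what a client could do with the language elements: show the label to the user and pass the parameter value in LANG. The function names are hypothetical, and the second language element (without an ivo-id) is an invented example of the "TurboSQL 23.3" case mentioned above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# First element is from the pre-draft; the second one is an invented
# example of a language declared without an ivo-id.
LANGUAGES_XML = """
<capability>
  <language ivo-id="ivo://tap/languages/ADQL-2.0">
    <parameter>ADQL-2.0</parameter>
    <label>ADQL 2.0</label>
  </language>
  <language>
    <parameter>TurboSQL 23.3</parameter>
    <label>TurboSQL 23.3</label>
  </language>
</capability>
"""

def read_languages(capability_xml):
    """Return one dict per language; ivo_id is None when not declared."""
    root = ET.fromstring(capability_xml)
    return [{"ivo_id": lang.get("ivo-id"),
             "parameter": lang.findtext("parameter"),
             "label": lang.findtext("label")}
            for lang in root.findall("language")]

def lang_argument(language):
    """URL-encode the value that goes into the LANG protocol parameter."""
    return urlencode({"LANG": language["parameter"]})

languages = read_languages(LANGUAGES_XML)
encoded = [lang_argument(lang) for lang in languages]
```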

  <outputFormat>
    <parameter>text/xml</parameter>
    <label>VOTable, binary</label>
    <mime>text/xml</mime>
  </outputFormat>
  <outputFormat>
    <parameter>votable/td</parameter>
    <label>VOTable, tabledata</label>
    <mime>application/x-votable+xml</mime>
  </outputFormat>
  <outputFormat>
    <parameter>text/csv</parameter>
    <label>CSV without column labels</label>
    <mime>text/csv</mime>
  </outputFormat>
  <outputFormat>
    <parameter>votable</parameter>
    <label>VOTable, binary</label>
    <mime>application/x-votable+xml</mime>
  </outputFormat>
  <outputFormat>
    <parameter>fits</parameter>
    <label>FITS binary table</label>
    <mime>application/fits</mime>
  </outputFormat>
  <outputFormat>
    <parameter>text/csv;header=present</parameter>
    <label>CSV with column labels</label>
    <mime>text/csv;header=present</mime>
  </outputFormat>
  <outputFormat>
    <parameter>text/tab-separated-values</parameter>
    <label>Tab separated values</label>
    <mime>text/tab-separated-values</mime>
  </outputFormat>
  <outputFormat>
    <parameter>text/html</parameter>
    <label>HTML table</label>
    <mime>text/html</mime>
  </outputFormat>
  <outputFormat>
    <parameter>application/fits</parameter>
    <label>FITS binary table</label>
    <mime>application/fits</mime>
  </outputFormat>
  <outputFormat>
    <parameter>html</parameter>
    <label>HTML table</label>
    <mime>text/html</mime>
  </outputFormat>
  <outputFormat>
    <parameter>tsv</parameter>
    <label>Tab separated values</label>
    <mime>text/tab-separated-values</mime>
  </outputFormat>
  <outputFormat>
    <parameter>csv</parameter>
    <label>CSV with column labels</label>
    <mime>text/csv;header=present</mime>
  </outputFormat>
  <outputFormat>
    <parameter>application/x-votable+xml</parameter>
    <label>VOTable, binary</label>
    <mime>application/x-votable+xml</mime>
  </outputFormat>

Output formats again have parameter, label and description as languages do. In addition, we give the mime type that will result when the parameter value is put in. Note that the preservation of VOTable mime type is reflected here.

I feel there's too much information here. The registration record is in a position of duplicating (or potentially contradicting) the standard, since, e.g., according to the standard if the parameter is "csv" then the MIME type must be "text/csv". Concerning the label element: presumably the intention is to provide a human-directed description of the option. I'm not sure this does much useful work in most cases - especially given the defined short forms the parameter values are quite readable on their own, and the fact that different services may give slightly different wordings meaning the same thing could end up being more confusing than otherwise to a user. I would suggest instead recording just the parameter value (short form or MIME type) for each known format; possibly a label or description element could be provided for optional use in the case that there was more to say about the format than the parameter value itself (e.g. whether CSV does or does not have headers, or in the case of a particularly obscure MIME type). Similar, though not identical, comments apply to use of label in the language and uploadMethod elements as well. -- MarkTaylor - 11 Jan 2011
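To make the redundancy under discussion concrete, the following Python sketch groups the declared parameter values by the MIME type they produce, using a few of the outputFormat elements from the instance document. The function name parameters_by_mime is made up; nothing here is meant as a proposal for how clients must process the record.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A subset of the outputFormat elements from the pre-draft instance.
FORMATS_XML = """
<capability>
  <outputFormat>
    <parameter>text/csv</parameter>
    <label>CSV without column labels</label>
    <mime>text/csv</mime>
  </outputFormat>
  <outputFormat>
    <parameter>votable</parameter>
    <label>VOTable, binary</label>
    <mime>application/x-votable+xml</mime>
  </outputFormat>
  <outputFormat>
    <parameter>text/csv;header=present</parameter>
    <label>CSV with column labels</label>
    <mime>text/csv;header=present</mime>
  </outputFormat>
  <outputFormat>
    <parameter>csv</parameter>
    <label>CSV with column labels</label>
    <mime>text/csv;header=present</mime>
  </outputFormat>
</capability>
"""

def parameters_by_mime(capability_xml):
    """Map each resulting MIME type to the parameter values yielding it."""
    root = ET.fromstring(capability_xml)
    groups = defaultdict(list)
    for fmt in root.findall("outputFormat"):
        groups[fmt.findtext("mime")].append(fmt.findtext("parameter"))
    return dict(groups)

groups = parameters_by_mime(FORMATS_XML)
```

In this sample, "csv" and "text/csv;header=present" end up in the same group, which is exactly the kind of statement that can duplicate or contradict what the standard says about the short forms.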

  <uploadMethod ivo-id="ivo://tap/uploadmethods/inline">
    <protocol>inline</protocol>
    <label>POST inline upload</label>
  </uploadMethod>
  <uploadMethod>
    <protocol>http</protocol>
    <label>http URL</label>
  </uploadMethod>
  <uploadMethod>
    <protocol>https</protocol>
    <label>https URL</label>
  </uploadMethod>
  <uploadMethod>
    <protocol>ftp</protocol>
    <label>ftp URL</label>
  </uploadMethod>

Upload methods. It would be nice if those could have ivo-ids as well, but giving ivo-ids to http (or, God forbid, ftp) seems wrong. Should we just agree on a controlled vocabulary, starting with inline, http, https, ftp?

It doesn't seem to me appropriate to annotate these options with label elements. If it's a controlled vocabulary the meaning of these terms is fixed by some standard or other and it's not the job of the service registration to explain what they mean. -- MarkTaylor - 11 Jan 2011

Then there's the whole VOSpace business. I'd be grateful if someone who actually did some real stuff with vos could come up with a proposal of how to represent that. If we could get by just saying "vos1", "vos1.2", "vos2.0" or somesuch I think that would be highly preferable.

Pat, on the other hand, has said:

Of course, for a vos URI, there is a whole other level of transfer protocol metadata: a service says it supports "vos" and knows how to talk to a vospace (well, that one would have to be versioned), but that does not mean it can get a file from a vospace that only uses SRB for transport. I don't think we can solve that here.
The extra thing we discussed was what kind of authentication the TAP service could do on the user's behalf (in order to get an input table from a URI), but I think that is already covered by the service having a registered and associated CDP service. This would potentially come up with vos and https schemes since that could require an X.509 certificate to authenticate and that is the exact case that is covered by the TAP and associated CDP service: the user knows if a certificate is needed and the TAP service can declare that it has this associated CDP service where the user can (in advance) store a proxy certificate. So, in my opinion we do not have to be able to specify what authentication the TAP service knows how to perform here.

  <retentionPeriod>
    <default>172800</default>
  </retentionPeriod>
  <executionDuration>
    <default>3600</default>
  </executionDuration>
  <rowLimit>
    <default>2000</default>
    <hard>20000000</hard>
  </rowLimit>

Resource limits. retentionPeriod and executionDuration are given in seconds (it's the SI unit, after all). There's a default limit and possibly a hard limit. Both are optional. A missing limit says "we didn't bother figuring it out" or "no enforced limit".
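A minimal Python sketch of how a client might read these limits, treating both "default" and "hard" as optional. The helper read_limit is hypothetical; None stands for "not declared", which, as said above, can mean either "not figured out" or "not enforced".

```python
import xml.etree.ElementTree as ET

# The limit elements as given in the pre-draft instance document.
LIMITS_XML = """
<capability>
  <retentionPeriod>
    <default>172800</default>
  </retentionPeriod>
  <executionDuration>
    <default>3600</default>
  </executionDuration>
  <rowLimit>
    <default>2000</default>
    <hard>20000000</hard>
  </rowLimit>
</capability>
"""

def read_limit(capability, name):
    """Return (default, hard) for the named limit; None means not declared."""
    element = capability.find(name)
    if element is None:
        return (None, None)

    def as_int(tag):
        text = element.findtext(tag)
        return int(text) if text is not None else None

    return (as_int("default"), as_int("hard"))

capability = ET.fromstring(LIMITS_XML)
row_limit = read_limit(capability, "rowLimit")
retention = read_limit(capability, "retentionPeriod")
```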

Pat said:

In our service, we try to estimate the result size in total by looking at the selected columns (and knowing the output size of each column) and then dynamically limit MAXREC: fewer columns -> more rows allowed. So this is a limit in megabytes, not rows. Also, the limit is different in different scenarios: sync queries have no default and no limit on MAXREC, async queries currently have a dynamic limit, and once we support output to VOSpace, async queries sending to VOSpace will also have no default or limit. [...] So, values for the attributes could be an integer, "none", or "dynamic".

I don't like this -- not only will it uglify the schema, it'll also make the client's life a lot harder when it actually tries to use this information. I'd rather suggest that people with dynamic limits are encouraged to put in some conservative estimate, and probably ignore limits on sync queries for this purpose.

I, in turn, don't like that, since then you don't know whether what you're looking at is a conservative estimate or a properly thought out value. On the other hand I don't think it matters much; my feeling is that clients aren't going to make much use of these limits in any case - a client is unlikely to know how many rows/bytes it's expecting to receive, so trying to do something intelligent with a probably incorrect nominal maximum supplied by the service sounds far-fetched. Possibly I'm missing something though - do we have a convincing use case for making use of these values? If the intended recipient of the information is human rather than machine, it might make sense to allow the limit to be specified in either rows or bytes as the service prefers. -- MarkTaylor - 11 Jan 2011

  <udf>
    <name>gavo_match</name>
    <signature>gavo_match(pattern TEXT, string TEXT) -&gt; INTEGER</signature>
    <description>The function returns 1 if the posix regular expression pattern matches
anything in string, 0 otherwise.</description>
  </udf>
</capability>

Finally, the user-defined functions. I think those are a must so users have some standard way of figuring them out; on the other hand, I think machines need not be too concerned about them. Therefore, in addition to the name, there's just a human-readable description (user agents are encouraged to reproduce them verbatim, i.e., preserving whitespace and such) and a signature. The signature should be machine-parseable to accommodate use cases in which this might be useful. The schema, of course, does not need to enforce this.

In the signature, only regular identifiers are allowed, no quoted identifiers. This is implied by the grammar for the name and the type names, so it only needs to be stipulated for the parameter names.
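As a rough illustration of "machine-parseable", here is a Python sketch that splits a signature of the form used above into name, parameters, and return type. It assumes that every type name is a single regular identifier (so something like a two-word type would be rejected) and that regular identifiers are a letter followed by letters, digits, or underscores; the function parse_signature and these restrictions are assumptions of the sketch, not part of the proposal. Note that the arrow appears as "-&gt;" in the XML source and as "->" once the document has been parsed.

```python
import re

# Regular (unquoted) identifier: a letter, then letters, digits, underscores.
IDENT = r"[A-Za-z][A-Za-z0-9_]*"
SIGNATURE_RE = re.compile(
    rf"^\s*(?P<name>{IDENT})\s*\((?P<params>[^)]*)\)"
    rf"\s*->\s*(?P<returns>{IDENT})\s*$")

def parse_signature(signature):
    """Split 'f(a TYPE, b TYPE) -> TYPE' into its parts or raise ValueError."""
    match = SIGNATURE_RE.match(signature)
    if match is None:
        raise ValueError(f"not a valid signature: {signature!r}")
    params = []
    for chunk in match.group("params").split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # empty parameter list, e.g. 'f() -> INTEGER'
        parts = chunk.split()
        if len(parts) != 2 or not all(re.fullmatch(IDENT, p) for p in parts):
            raise ValueError(f"bad parameter declaration: {chunk!r}")
        params.append((parts[0], parts[1]))
    return {"name": match.group("name"),
            "params": params,
            "returns": match.group("returns")}

parsed = parse_signature("gavo_match(pattern TEXT, string TEXT) -> INTEGER")
```

A quoted parameter name fails the identifier check and raises ValueError, in line with the rule that only regular identifiers are allowed.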

Open issues:

upload limits -- I realize it's a good idea to be able to express those. But how? I don't currently enforce any, but if I did, I'd probably enforce them by bytes. On the other hand limits on the number of rows clearly would make more sense...
quoteMethod -- I suspect that few people currently bother to come up with a good quote. If there's ever some kind of scheduling service, being able to communicate where the quote comes from would be useful. Maybe one should have a controlled vocabulary here ("plan", "queuelength", "sample", "thin air")?
standard inputParams for the protocol parameters (LANG, FORMAT, REQUEST, RUNID, etc)? I'm lazy, so I'd rather not include them, and they should more or less be the same for all services. I don't feel strongly about them, though, and declaring them might be useful for optional input parameters. Certainly, additional ("PQL") parameters should be declared, but VODataService already says how to do that.

About creating a VOResource extension

The VOResource spec has some specific recommendations about how to create an extension schema, and a "how-to create an extension" presentation gives an introduction to the process.

In summary, the RWG recommends the following steps for defining a new extension:

1. Name and define the concepts to be captured.
2. Create a prototype VOResource instance.
3. Create the Schema Extension.
4. Describe the extension in an IVOA document (preferably as a section of a protocol document).

Step 1: Concepts to Include

The following concepts should be captured within TAP capabilities (much of it based on grepping the UWS and TAP specs for "may" and "should"):

List of data models exposed -- as URIs, e.g., the ObsCore model: ivo://ivoa.net/std/ObsCore
List of query languages supported -- these should be well-known strings as used in LANG, e.g. ADQL, ADQL-2.0, etc. They should contain a human-readable description (as element content?). We should recommend a convention for SQL in the spirit of "SQL-Postgres", "SQL-MySQL", etc.
List of output formats -- specified with required MIME and optional shorthand. Again, a human-readable description (as element content?) would be nice.

The Upload Problem and VOSpace

From Pat's summary of the Nara discussion:

Controlled vocabulary for well-known protocols - I would suggest the protocol scheme in lower case as that is common usage, ivo URI for protocols described in the registry - e.g. vos.

For vos URI support, we also need to specify if the service can perform authentication, but that is already specified when a service specifies the endpoint for the associated CDP service, which would be required. So, in my opinion, one can just say they support "vos" (via the URI) and that means unauthenticated; if the service also has a supporting CDP then they can do authenticated (the CDP spec says explicitly how to do this; maybe we should at least explicitly refer to the CDP spec section).

Things we'd probably not want in the capability

Extended capabilities -- if they exist, create another capability element
format of table names: name vs. schema.name vs. cat.schema.name -- since table names are delivered in qualified form, this is irrelevant for clients
VOSI support -- this can be inferred from elsewhere in the registry record
Passing on the RUNID -- do people need to know this from the registry?
Further tables in TAP_SCHEMA -- can be taken from elsewhere in the registry record

Things deferred at Nara

List of settable parameters (probably open-ended as key-value pairs; for limits and such, absence would mean "unlimited", max==default would mean "changing not supported"):
Server settings
- default/maximum retention period (=destruction time-creation time)
- default/maximum run time
- default/maximum row limit
- uploadRowLimit, uploadByteLimit
- maybe quoteMethod -- how does the service come up with a quote: never, always artificial value, based on a query plan, based on the length of an input queue,...
List of user-defined functions -- with name, arguments (name, type, description), return type, and a short, human-readable documentation (does plain text suffice?)
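The convention sketched for the settable parameters above (absence means "unlimited", max==default means "changing not supported") could be interpreted by a client roughly as follows. This is only a Python sketch of that reading; the function name describe_limit and the three result strings are invented for illustration.

```python
def describe_limit(default, maximum):
    """Classify a default/maximum pair; None stands for an absent value."""
    if default is None and maximum is None:
        return "unlimited"              # nothing declared at all
    if default is not None and default == maximum:
        return "changing not supported"  # max == default
    return "adjustable"                 # client may ask for another value

examples = {
    "no limits declared": describe_limit(None, None),
    "fixed run time": describe_limit(3600, 3600),
    "row limit": describe_limit(2000, 20000000),
}
```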