+ Free Format data model fields and extensibility mechanism
When a data provider cannot use the proposed value fields (for instance in data product type), how can he/she still provide some information to characterise his/her data sets?
- How could we allow for extensibility? If we need to consider new kinds of science data product , not yet defined in the Obs/tap fields , how should we proceed?
- Description of a data set - obs_title
In Obs/Tap , it is not a mandatory field, it could become important in future releases if its usage is well-defined. At the moment it is just free-format, and that could be an issue.
M.Louys 2011/02/11
Dear all,
In a recent telecon for Obstap draft editing, we chose to shorten top elements.
Observation. --> Obs.
Characterisation --> Char.
We consider Curation ,
DataID, Access, etc. as top elements as in the Spectrum DM.
What I tried to introduce in the last version of the Utype Note was to allow automatic generation from UML to Utype list and xml schema.
This requires that we adopt a few rules for writing Utype strings:
- use lower case for first letter of attributes names
Examples
* Obs.calibLevel
* Obs.dataproductType
- use the name of the role when a class has a reference or an association to some other class. Code this role in lowercase.
This is not always easy to read , so this is why the
CamelCase writing was used in most DM text.
examples: Char.spatialAxis.coverage.bounds.extent
I think we can adopt the fact that a machine , a parser always sees the Utypes in lower case , and that we use
CamelCase in standards definition only in a parcimonious way for new models like
ObsTAP , and as defined in existing models like Spectrum and SSA.
I am currently editing a new version of the tables in the draft to follow these lines and give an idea of the simple utype shape proposed.
Could we agree on a simple strategy like that? Other suggestions?
Mireille
P.Dowler 2011/02/11
>
Could we agree on a simple strategy like that? Other suggestions?
Working from the end back to the source, the
ObsCore document should give the
utypes exactly as they should be set in the tap_schema. As values stored in a
RDBMS, we have to be aware that implementors have to actually insert the
values there and make sure that some value ends up in the VOTable output, so
there is no sense giving them more things to think about: we should be
explicit about the exact value.
In my opinion, it is possible to define simple rules for going from UML to
utypes and we should follow those rules. If there is value in making utypes
readable, then the rules should produce readable utypes everywhere (eg. that
may be a use case for the utype spec).
If, in the utypes spec, utypes are considered to be case-insensitive, we
should be aware that such a choice will force people to add extra code
(utype.equalsIgnoreCase(str) instead of utype.equals(str) in java,
LOWER(tap_schema.columns.utype) = 'abc.def' in
ADQL). And it will always
introduce extra decision making and confusion when people have to setup their
services.
my 2c,
D.Tody to Mireille 2011/02/21
Hi -
I think this looks reasonable; it is nice to have a well defined and
consistent scheme. The "role" business is pretty subtle however; we
should be able to do this consistently in our specifications and
reference implementations, but it is likely that users will often miss
these subtleties and get the case wrong.
Re Utypes being case-insensitive: I continue to think this is needed,
although as Pat notes it will complicate the code slightly. But it will
also improve reliability at the cost of an extra line of code (or merely
a function call) here and there.
The thing about Utypes is that this is the mechanism we use to refer to
data model elements all the way up in science code, with layer after
layer of software, the Internet, etc. between the application and the
eventual remote data service or DBMS. Aside from the user/scientist or
applications programmer being careless about case, there are many
opportunities for errors or translation issues. If this were all
controlled system software we might be ok to have it be case sensitive,
but given the amount of uncontrolled software in the path I think it
would be foolish to not plan on having the occasional toLowerCase() or
whatever in there for string comparisions when looking for Utypes. So
yes, Utypes should be defined as case insensitive, but we should use
case to improve readability.
- Doug
A.Micol to Doug 2011/02/22
The theory:
About string values stored in a database, the principle should be to
always use the same case, whether lowercase or uppercase.
That way it is clear to the user what to use when formulating a query,
and it is always possible for the DBMS to optimse the query by using
the relevant indeces. The (performant) query will look like this:
SELECT * FROM mytable
WHERE mystring = 'all my mystring values are lower-case.'
If instead a mix-case is adopted, and different people use different rules
(
UpperCamelCase vs lowerCamelCase), the only way to make a query working in
all cases is to use a function, like in Pat's example:
SELECT * FROM myTable
WHERE LOWER(mystring) = 'mystring values use some kind of camel-case.'
which does not allow the DBMS to use the index on the column utype. In that
case a table scan is performed by the DBMS, introducing an unnecessary
performance penalty.
The practice:
In the typical realistic case the only queries involving utypes will be
"discovery" queries trying to see which table contains which field
for a given utype.
That is achieved by querying the TAP_SCHEMA.columns table.
Such table is very likely a short table with a number of records
(= number of all columns in all tables served through a given TAP service)
that is not going to exceed by much a couple of thousands entries in the
typical case. With so few records there is no point in creating an index,
a table scan suffices, with no performance penalty, and the query:
SELECT * FROM TAP_SCHEMA.columns
WHERE LOWER(utype) = 'char.spatialaxis.coverage.resolution.resolutionrefval'
is as good.
Therefore in this particular, and typical, case it does not really matter
if utypes are stored as lower-case or not.
But there could be other cases!:
The real question is: are there other cases where querying by free-case utypes
could affect performance?
If that is the case, and we need to query many more (>1E5 or 1E6) records,
then I think there is no option available: there will be a compelling
reason to force lower-case utypes into our databases.
Counter examples?:
VIZIER:
Probably even the case of Vizier is not compelling enough:
Suppose VIZIER hosts 10,000 catalogues, with -say- 50 columns each:
that would total a number of 500,000 records in TAP_SCHEMA.columns.
I think that is not a huge number of records to scan through with modern
hardware, even without indeces.
"event" tables:
A more compelling (though probably rare) scenario could be a kind of
event table (sorry, a better name must exist for what I have in mind)
of the kind here illustrated:
Consider a table that contains all header keywords
of all (single extension FITS for the sake of simplicity) files
in a given archive:
the "id" uniquely identifies a file in the archive,
the "name" uniquely identifies a keyword in the file header
the "value" is the value of that keyword for that "id"
the "utype" is the utype associated to that keyword
In this case, the table could be considerably long, and contain millions
of records.
If the utype is not forcefully single (eg lower) case, then the
performance penalty, when querying, could be huge.
Conclusions:
I'm not stating that the "event" table above is the best system, but
it is
a system that data providers might be using.
What I'm trying to say is that:
* not forcing a single case for utypes in a database *
* might have an impact on data providers' architecture *
and that is not nice.
Therefore,
I vote for LOWER-CASE
UTYPES WHEN STORING THEM IN A DATABASE,
throughout the entire VO.
That is, to me this is a necessary requirement for healthy performance
of the VO in general, and should be a requirement expressed in the
UTYPE standard itself, something like:
if someone has to ingest a VOTable into a DBMS,
at INSERT time, s/he will have to always use the LOWER function
on the utype attribute.
Alberto
PS: the client could always render the output of a query in a more readable way.
P.Dowler 2011/02/22
>
Therefore,
>
> I vote for LOWER-CASE UTYPES WHEN STORING THEM IN A DATABASE,
>
> throughout the entire VO.
I also vote for this option; it means one more thing is explicit and one less
thing that implemetors (and users) have to think about and decide.
D.Tody to Pat and Alberto 2011/02/22
I don't think we should let the limitations of DBMS technology force
us to use monocase for all VO metadata (which is where this appears to
be headed). It reminds me of the old teletype days WHERE EVERYTHING
WAS UPPER CASE.
Utypes should be case-insensitive for the reasons noted in my
earlier email, as well as to conform to existing documentation and
implementions, none of which have had any problem whatsoever with
Utypes being case insensitive. Also just to be more user friendly
and readable.
It seems to me this is mainly just a technical issue of how to
perform an efficient string search/comparison in the DBMS. It may
well be DBMS-specific. I just did a quick check of
MySQL and Oracle
for example, and there is quite a bit online re case sensitive or
insensitive comparison with these two DBMSes. I have not looked into
it further, but it is clear that this is a fairly common issue.
Another issue is how common it will really be to search archives
by UType directly in
ADQL queries. We may find it more common for
example to read in the TAP_SCHEMA and compose queries on the client
side, using table column names. Or perhaps there is a translation
layer on the server side which comes into play when we start doing
higher level data model manipulation or queries.
Note this is not really an issue for
ObsTAP, since we are already
using fixed (lower case) column names in this use case.
- Doug
D.Tody 2011/02/22
Here is one example:
http://venutip.com/content/case-sensitivity-mysql-do-you-really-need-it
Case sensitivity in
MySQL (do you really need it?)
Submitted by venutip on July 24th, 2008
Not sure how this escaped my notice until now, but today
I realized the default collation for
MySQL columns is not
case-sensitive. Among other things, this means that if you're
searching for distinct values on a column with varchar, text,
or any other non-binary data type, values that differ only in
their cases (such as "apple" and "APpLE") are considered equal,
and you won't get both values back.
MySQL will return "apple" or
"APpLE", but not both.
So in this case the collation determines whether string data is
case sensitive or not, and evidently an index can be generated on
a character column with a case insenstive collation.
- Doug
F.Bonnarel to group of editors 2011/02/28
Hi All,
Sorry if I am putting some divergence in the discussion here but I will vote
AGAINST utypes being forced lower case into DBMS.
If it is to be usefull for interoperability ACROSS services utype definition should
allow some rule of construction and I think the one in Mireille's note seem to be the
best. If it is consistent with UML it will be consistent with xml serialization and
Java or C++ code generation where classes and Methods are generally using camelCase.
For those of developpers and users not aware or oblivious of these rules, published
utypes tables could be an absolute reference and this can be installed case sensitivitly in the
column description of the TAP schema without any difficulty. So for sure they will appear in the
Query response VOTABLE and this seems very important to make them understandable.
For future queries by utypes (
ADQL or
PQL) -which is not the intention of the current
ObsTAP draft, as we query only by column names - but is surely critical for IVOA-wide interoperability
in the future I am still not sure if we have to allow case insentiveness and add the piece of code Doug is
proposing or force the query to be written with
CamelCase utypes. But this can still
be discussed after the draft release as long as we correctly fill the TAP_SCHEMA.
Cheers
François
A.Micol to François 2011/02/28
Hi Francois,
While some RDBMes (as Doug mentioned) are natively case-insensitive (e.g. Postgres, mysql), others are not (e.g. Sybase).
Also, XPath statements are case sensitive.
It is always possible to get around this using some DBMS function (e.g. to lower-case) prior the comparison.
For example:
lower(utype) = "char.bla.bla.bla.location"
or
/bookstore/book[translate(
@category, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = '" + searchtext.ToLower() + "']
The fact is that such conversion will be needed by all services, clients, servers, and databases (e.g. at ingestion time)
to make sure that the proper match is found.
That is certainly doable, but it is:
(1) not performant (cannot use indeces in RDBMSes),
(2) prone to error,
(3) all software will need to add utype-case management, making it a bit more complicated than necessary.
Even checking if the utype is spelt with the correct case is something that will be required.
The motto is always: Keep it simple.
I do not see the reason to deviate from simplicity, if nothing else is gained other than presumed
readibility for developers that need to read a utype just for the time they invest in developing a tool.
Alberto
PS: I note that this is the first DAL service which does not utilise UCDs. I personally think
that neither UTYPEs are that relevant for this exercise (but I reserve to make my case at a later time).
D.Tody to Alberto 2011/02/28
>
The fact is that such conversion will be needed by all services, clients, servers, and databases (e.g. at ingestion time)
>
to make sure that the proper match is found.
Right - for Utype matching to work reliably given the many layers of
softtware and languages involved (and Utypes are NOT just an
ObsTAP
issue), case insensitive comparison will be needed. Consider the
alternative. Instead we have case sensititive Utypes; we query to see
if data supports a particular Utype, but do not find it due to a case
issue, even though the Utype is there and it is obvious to a human
looking at the data that we have a match. An important bit of data
model is ignored in the analysis and possibly results in subtle analysis
errors. This is unacceptable merely because it is inconvenient to do
a case-insensitive match in some cases of software implementation.
>
That is certainly doable, but it is:
>
(1) not performant (cannot use indeces in RDBMSes),
This is not true Alberto. In the cases I mentioned, where case
insensitive character data is supported, the data can still be indexed.
If the DBMS does not support this, you can do it in your
ObsTAP
implementation (not that we even search by Utype of course), by storing
data internally in lower case and forcing the externally-supplied Utype
value to lower case when you translate the
ADQL string. For a
PQL query
it will be no problem of course since the conversion can easily be
applied internally.
>
(2) prone to error,
NOT doing a case-insensitive compare to look for Utype tags is what is
prone to error (see above).
>
(3) all software will need to add utype-case management, making it a bit more complicated than necessary.
If the DBMS or other technology supports case-sensitive comparisons (and
virtually all do) then little or no extra code is required, e.g., it is
supported natively by the DBMS.
>
The motto is always: Keep it simple.
>
I do not see the reason to deviate from simplicity, if nothing else is gained other than presumed
>
readibility for developers that need to read a utype just for the time they invest in developing a tool.
So basically this is generalizing into an argument that all string
metadata in VO which is subject to comparision be lower case, even all
the way up into user interfaces. It won't end with Utype of course.
And you are casually suggesting that we invalidate several years of
practice and standarization within VO, where Utypes have long been
considered to be case-insensitive.
- Doug