FreeFormat < IVOA

+ Free Format data model fields and extensibility mechanism

When a data provider cannot use the proposed value fields (for instance in data product type), how can he/she still provide some information to characterise his/her data sets?

How could we allow for extensibility? If we need to consider new kinds of science data product , not yet defined in the Obs/tap fields , how should we proceed?
Description of a data set - obs_title

In Obs/Tap , it is not a mandatory field, it could become important in future releases if its usage is well-defined. At the moment it is just free-format, and that could be an issue.

M.Louys 2011/02/11 Dear all,

In a recent telecon for Obstap draft editing, we chose to shorten top elements. Observation. --> Obs. Characterisation --> Char.

We consider Curation , DataID, Access, etc. as top elements as in the Spectrum DM.

What I tried to introduce in the last version of the Utype Note was to allow automatic generation from UML to Utype list and xml schema. This requires that we adopt a few rules for writing Utype strings:

use lower case for first letter of attributes names

Examples

* Obs.calibLevel * Obs.dataproductType

use the name of the role when a class has a reference or an association to some other class. Code this role in lowercase.

This is not always easy to read , so this is why the CamelCase writing was used in most DM text.

examples: Char.spatialAxis.coverage.bounds.extent

I think we can adopt the fact that a machine , a parser always sees the Utypes in lower case , and that we use CamelCase in standards definition only in a parcimonious way for new models like ObsTAP , and as defined in existing models like Spectrum and SSA.

I am currently editing a new version of the tables in the draft to follow these lines and give an idea of the simple utype shape proposed.

Could we agree on a simple strategy like that? Other suggestions? Mireille

P.Dowler 2011/02/11

> Could we agree on a simple strategy like that? Other suggestions?

Working from the end back to the source, the ObsCore document should give the utypes exactly as they should be set in the tap_schema. As values stored in a RDBMS, we have to be aware that implementors have to actually insert the values there and make sure that some value ends up in the VOTable output, so there is no sense giving them more things to think about: we should be explicit about the exact value.

In my opinion, it is possible to define simple rules for going from UML to utypes and we should follow those rules. If there is value in making utypes readable, then the rules should produce readable utypes everywhere (eg. that may be a use case for the utype spec).

If, in the utypes spec, utypes are considered to be case-insensitive, we should be aware that such a choice will force people to add extra code (utype.equalsIgnoreCase(str) instead of utype.equals(str) in java, LOWER(tap_schema.columns.utype) = 'abc.def' in ADQL). And it will always introduce extra decision making and confusion when people have to setup their services.

my 2c,

D.Tody to Mireille 2011/02/21

Hi -

I think this looks reasonable; it is nice to have a well defined and consistent scheme. The "role" business is pretty subtle however; we should be able to do this consistently in our specifications and reference implementations, but it is likely that users will often miss these subtleties and get the case wrong.

Re Utypes being case-insensitive: I continue to think this is needed, although as Pat notes it will complicate the code slightly. But it will also improve reliability at the cost of an extra line of code (or merely a function call) here and there.

The thing about Utypes is that this is the mechanism we use to refer to data model elements all the way up in science code, with layer after layer of software, the Internet, etc. between the application and the eventual remote data service or DBMS. Aside from the user/scientist or applications programmer being careless about case, there are many opportunities for errors or translation issues. If this were all controlled system software we might be ok to have it be case sensitive, but given the amount of uncontrolled software in the path I think it would be foolish to not plan on having the occasional toLowerCase() or whatever in there for string comparisions when looking for Utypes. So yes, Utypes should be defined as case insensitive, but we should use case to improve readability.

- Doug

A.Micol to Doug 2011/02/22

The theory:

About string values stored in a database, the principle should be to always use the same case, whether lowercase or uppercase.

That way it is clear to the user what to use when formulating a query, and it is always possible for the DBMS to optimse the query by using the relevant indeces. The (performant) query will look like this:

SELECT * FROM mytable WHERE mystring = 'all my mystring values are lower-case.'

If instead a mix-case is adopted, and different people use different rules (UpperCamelCase vs lowerCamelCase), the only way to make a query working in all cases is to use a function, like in Pat's example:

SELECT * FROM myTable WHERE LOWER(mystring) = 'mystring values use some kind of camel-case.'

which does not allow the DBMS to use the index on the column utype. In that case a table scan is performed by the DBMS, introducing an unnecessary performance penalty.

The practice:

In the typical realistic case the only queries involving utypes will be "discovery" queries trying to see which table contains which field for a given utype. That is achieved by querying the TAP_SCHEMA.columns table. Such table is very likely a short table with a number of records (= number of all columns in all tables served through a given TAP service) that is not going to exceed by much a couple of thousands entries in the typical case. With so few records there is no point in creating an index, a table scan suffices, with no performance penalty, and the query:

SELECT * FROM TAP_SCHEMA.columns WHERE LOWER(utype) = 'char.spatialaxis.coverage.resolution.resolutionrefval'

is as good. Therefore in this particular, and typical, case it does not really matter if utypes are stored as lower-case or not.

But there could be other cases!:

The real question is: are there other cases where querying by free-case utypes could affect performance?

If that is the case, and we need to query many more (>1E5 or 1E6) records, then I think there is no option available: there will be a compelling reason to force lower-case utypes into our databases.

Counter examples?:

VIZIER: Probably even the case of Vizier is not compelling enough: Suppose VIZIER hosts 10,000 catalogues, with -say- 50 columns each: that would total a number of 500,000 records in TAP_SCHEMA.columns. I think that is not a huge number of records to scan through with modern hardware, even without indeces.

"event" tables: A more compelling (though probably rare) scenario could be a kind of event table (sorry, a better name must exist for what I have in mind) of the kind here illustrated:

Consider a table that contains all header keywords of all (single extension FITS for the sake of simplicity) files in a given archive:

the "id" uniquely identifies a file in the archive, the "name" uniquely identifies a keyword in the file header the "value" is the value of that keyword for that "id" the "utype" is the utype associated to that keyword

In this case, the table could be considerably long, and contain millions of records.

If the utype is not forcefully single (eg lower) case, then the performance penalty, when querying, could be huge.

Conclusions:

I'm not stating that the "event" table above is the best system, but it is a system that data providers might be using. What I'm trying to say is that: * not forcing a single case for utypes in a database * * might have an impact on data providers' architecture * and that is not nice.

Therefore, I vote for LOWER-CASE UTYPES WHEN STORING THEM IN A DATABASE, throughout the entire VO.

That is, to me this is a necessary requirement for healthy performance of the VO in general, and should be a requirement expressed in the UTYPE standard itself, something like:

if someone has to ingest a VOTable into a DBMS, at INSERT time, s/he will have to always use the LOWER function on the utype attribute.

Alberto PS: the client could always render the output of a query in a more readable way.

P.Dowler 2011/02/22

> Therefore,
> > I vote for LOWER-CASE UTYPES WHEN STORING THEM IN A DATABASE,
> > throughout the entire VO.

I also vote for this option; it means one more thing is explicit and one less thing that implemetors (and users) have to think about and decide.

D.Tody to Pat and Alberto 2011/02/22

I don't think we should let the limitations of DBMS technology force us to use monocase for all VO metadata (which is where this appears to be headed). It reminds me of the old teletype days WHERE EVERYTHING WAS UPPER CASE. smile

Utypes should be case-insensitive for the reasons noted in my earlier email, as well as to conform to existing documentation and implementions, none of which have had any problem whatsoever with Utypes being case insensitive. Also just to be more user friendly and readable.

It seems to me this is mainly just a technical issue of how to perform an efficient string search/comparison in the DBMS. It may well be DBMS-specific. I just did a quick check of MySQL and Oracle for example, and there is quite a bit online re case sensitive or insensitive comparison with these two DBMSes. I have not looked into it further, but it is clear that this is a fairly common issue.

Another issue is how common it will really be to search archives by UType directly in ADQL queries. We may find it more common for example to read in the TAP_SCHEMA and compose queries on the client side, using table column names. Or perhaps there is a translation layer on the server side which comes into play when we start doing higher level data model manipulation or queries.

Note this is not really an issue for ObsTAP, since we are already using fixed (lower case) column names in this use case.

- Doug

D.Tody 2011/02/22

Here is one example:

http://venutip.com/content/case-sensitivity-mysql-do-you-really-need-it Case sensitivity in MySQL (do you really need it?) Submitted by venutip on July 24th, 2008

Not sure how this escaped my notice until now, but today I realized the default collation for MySQL columns is not case-sensitive. Among other things, this means that if you're searching for distinct values on a column with varchar, text, or any other non-binary data type, values that differ only in their cases (such as "apple" and "APpLE") are considered equal, and you won't get both values back. MySQL will return "apple" or "APpLE", but not both.

So in this case the collation determines whether string data is case sensitive or not, and evidently an index can be generated on a character column with a case insenstive collation.

- Doug

F.Bonnarel to group of editors 2011/02/28

Hi All,

Sorry if I am putting some divergence in the discussion here but I will vote AGAINST utypes being forced lower case into DBMS. If it is to be usefull for interoperability ACROSS services utype definition should allow some rule of construction and I think the one in Mireille's note seem to be the best. If it is consistent with UML it will be consistent with xml serialization and Java or C++ code generation where classes and Methods are generally using camelCase. For those of developpers and users not aware or oblivious of these rules, published utypes tables could be an absolute reference and this can be installed case sensitivitly in the column description of the TAP schema without any difficulty. So for sure they will appear in the Query response VOTABLE and this seems very important to make them understandable. For future queries by utypes (ADQL or PQL) -which is not the intention of the current ObsTAP draft, as we query only by column names - but is surely critical for IVOA-wide interoperability in the future I am still not sure if we have to allow case insentiveness and add the piece of code Doug is proposing or force the query to be written with CamelCase utypes. But this can still be discussed after the draft release as long as we correctly fill the TAP_SCHEMA. Cheers François

A.Micol to François 2011/02/28

Hi Francois,

While some RDBMes (as Doug mentioned) are natively case-insensitive (e.g. Postgres, mysql), others are not (e.g. Sybase). Also, XPath statements are case sensitive. It is always possible to get around this using some DBMS function (e.g. to lower-case) prior the comparison. For example:

lower(utype) = "char.bla.bla.bla.location"

/bookstore/book[translate(@category, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = '" + searchtext.ToLower() + "']

The fact is that such conversion will be needed by all services, clients, servers, and databases (e.g. at ingestion time) to make sure that the proper match is found.

That is certainly doable, but it is: (1) not performant (cannot use indeces in RDBMSes), (2) prone to error, (3) all software will need to add utype-case management, making it a bit more complicated than necessary.

Even checking if the utype is spelt with the correct case is something that will be required.

The motto is always: Keep it simple. I do not see the reason to deviate from simplicity, if nothing else is gained other than presumed readibility for developers that need to read a utype just for the time they invest in developing a tool.

Alberto PS: I note that this is the first DAL service which does not utilise UCDs. I personally think that neither UTYPEs are that relevant for this exercise (but I reserve to make my case at a later time).

D.Tody to Alberto 2011/02/28

> The fact is that such conversion will be needed by all services, clients, servers, and databases (e.g. at ingestion time)
> to make sure that the proper match is found.

Right - for Utype matching to work reliably given the many layers of softtware and languages involved (and Utypes are NOT just an ObsTAP issue), case insensitive comparison will be needed. Consider the alternative. Instead we have case sensititive Utypes; we query to see if data supports a particular Utype, but do not find it due to a case issue, even though the Utype is there and it is obvious to a human looking at the data that we have a match. An important bit of data model is ignored in the analysis and possibly results in subtle analysis errors. This is unacceptable merely because it is inconvenient to do a case-insensitive match in some cases of software implementation.

> That is certainly doable, but it is:
> (1) not performant (cannot use indeces in RDBMSes),

This is not true Alberto. In the cases I mentioned, where case insensitive character data is supported, the data can still be indexed. If the DBMS does not support this, you can do it in your ObsTAP implementation (not that we even search by Utype of course), by storing data internally in lower case and forcing the externally-supplied Utype value to lower case when you translate the ADQL string. For a PQL query it will be no problem of course since the conversion can easily be applied internally.

> (2) prone to error,

NOT doing a case-insensitive compare to look for Utype tags is what is prone to error (see above).

> (3) all software will need to add utype-case management, making it a bit more complicated than necessary.

If the DBMS or other technology supports case-sensitive comparisons (and virtually all do) then little or no extra code is required, e.g., it is supported natively by the DBMS.

> The motto is always: Keep it simple.
> I do not see the reason to deviate from simplicity, if nothing else is gained other than presumed
> readibility for developers that need to read a utype just for the time they invest in developing a tool.

So basically this is generalizing into an argument that all string metadata in VO which is subject to comparision be lower case, even all the way up into user interfaces. It won't end with Utype of course. And you are casually suggesting that we invalidate several years of practice and standarization within VO, where Utypes have long been considered to be case-insensitive.

- Doug

This topic: IVOA > WebHome > IvoaDataModel > ObsDMCoreComponents > FreeFormat
Topic revision: r3 - 2011-03-03 - FrancoisBonnarel