STC2:Meas Proposed Recommendation: Request for Comments
Summary
Version 1 of STC was developed in 2007, prior to the development and adoption of vo-dml modeling practices. As we progress to the development of vo-dml compliant component models, it is necessary to revisit those models which define core content. Additionally, the scope of the STC-1.0 model is very broad, making a complete implementation, and the development of validators, very difficult. As such, it may be prudent to break the content of STC-1.0 into a set of component models which, as a group, cover the scope of the original.
This effort will start from first principles with respect to defining a specific project use-case, from which requirements will be drawn, satisfied by the model, and implemented in the use-case. We will make use of the original model to ensure that the coverage of concepts is complete and that the models will be compatible. However, the form and structure may be quite different. This model will use vo-dml modeling practices, and model elements may be structured differently to more efficiently represent the concepts.
This model covers the description of measured or determined astronomical data, and includes the following concepts:
- The association of the determined 'value' with corresponding errors. In this model, the 'value' is given by the various Coordinate types of the Coordinates data model (Rots and Cresitello-Dittmar et al., 2019).
- A description of the Error model.
The latest version of the model and supporting documents:
- Model document: here
- VO-DML/XML representation: here
- XML Schema: here
Implementation Requirements
(from DM Working group twiki):
The "IVOA Document Standards" standard has a broad outline of the implementation requirements for IVOA standards. These requirements fit the higher-level standards for applications and protocols better than data models themselves. At the Oct 2017 interop in Trieste, the following implementation requirements for Data Model standards were agreed upon, which allow the models to be vetted against their requirements and use cases without needing full science use cases to be implemented.
- VO-DML models must validate against schema
- Serializations which touch each entity of the model. These serializations may be 'fake' (i.e. not based on actual data files), and are to be produced by the modeler as unit tests/examples.
- Real world serializations covering use cases, produced by others following the model, in a mutually agreed upon format.
- Software which interprets these serializations and demonstrates proper interpretation of the content
Serializations:
- Modeler Generated Examples:
- Using home-grown Python code, the modeler has generated example serializations which span all elements of the model. The examples are generated in 4 formats:
- VOTable-1.3 standard syntax; Validates using votlint
- VOTable-1.3 annotated with VO-DML/Mapping syntax; Validates using xmllint to a VOTable-1.3 schema enhanced with an imported VO-DML mapping syntax schema
- XML format; Validates against the model schema
- An internal DOC format; XML/DOM structure representing the instances generated when interpreting the instance templates.
- Real world serializations:
- jovial: A set of example serializations of various Coordinate and Measurement instances generated by Omar Laurino using his Jovial DSL package
- TDIG: Working project of Time Series as Cube.
- Example serializations have been generated using various annotation schemes to identify the model elements: here.
- these include elements from the Measurement and Coordinates models
- Cube: Example files for Cube ( nD-Image and Sparse Cube ) incorporate Measurement and Coordinate model instances
Software:
- vodml Parser: Notebook developed by Omar Laurino parses vo-dml/Mapping annotation to generate instances of TimeSeries-Cube-Meas-Coord using AstroPy, and generates plots of the content using Matplotlib. Note: this was developed and presented in 2017 using earlier drafts of the models. These vary only in detail, and the demo could be updated to the current model suite.
- TDIG: Working project of Time Series as Cube.
- An ongoing project is underway to enhance SPLAT to load/interpret/analyze TimeSeries data. This was most recently presented at the IVOA Interop in Paris 2019 (see PDF).
- The tool currently uses a combination of TIMESYS, table column descriptions, UCDs and UTypes to identify and interpret the data automatically. Each of these is an annotation scheme which ties directly to model components.
- Delays in resolving a standard annotation syntax have slowed this project's progress toward fully realizing the possibilities. This is a high priority for upcoming work.
- pyVO: extract_skycoord_from_votable()
- Demonstrated at the IVOA Interop in Paris 2019, this product of the hack-a-thon generates AstroPy SkyCoord instances from VOTables using various elements embedded in the VOTable.
- Interrogates a VOTable-1.3 serialization, identifies key information and uses that to automatically generate instances of SkyCoord.
- UCD: 'pos.eq.ra', 'pos.eq.dec'
- COOSYS.system: "ICRS", "FK4", "FK5"
- COOSYS.equinox
- The COOSYS maps directly to SpaceFrame, with the value of its system attribute identifying the reference frame.
- The UCD 'pos.eq' maps directly to meas:EquatorialPosition; with 'pos.eq.ra|dec' identifying the corresponding attributes (EquatorialPosition.ra|dec) as coordinates coords:Longitude and coords:Latitude.
- This illustrates that even with minimal annotation, this sort of automatic discovery/instantiation can take place. With a defined annotation syntax, this utility could be expanded to generate other AstroPy objects very easily.
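The kind of lookup described above can be sketched with nothing more than the standard library. This is an illustrative reconstruction, not pyVO's actual implementation: the function name, the embedded VOTable, and its column names are hypothetical, and only the UCD values and COOSYS attributes come from the list above.

```python
# Hypothetical sketch of the UCD/COOSYS discovery step described above;
# pyVO's extract_skycoord_from_votable differs in detail and builds an
# AstroPy SkyCoord from the columns it finds.
import xml.etree.ElementTree as ET

VOTABLE = """<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE>
    <COOSYS ID="sys" system="ICRS"/>
    <TABLE>
      <FIELD name="ra" ucd="pos.eq.ra" datatype="double"/>
      <FIELD name="dec" ucd="pos.eq.dec" datatype="double"/>
      <FIELD name="mag" ucd="phot.mag" datatype="double"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

NS = {"vo": "http://www.ivoa.net/xml/VOTable/v1.3"}

def find_position_columns(votable_xml):
    """Return (frame, ra_column, dec_column) found via COOSYS and UCDs."""
    root = ET.fromstring(votable_xml)
    coosys = root.find(".//vo:COOSYS", NS)
    frame = coosys.get("system") if coosys is not None else None
    cols = {f.get("ucd"): f.get("name")
            for f in root.iter("{http://www.ivoa.net/xml/VOTable/v1.3}FIELD")}
    return frame, cols.get("pos.eq.ra"), cols.get("pos.eq.dec")

print(find_position_columns(VOTABLE))  # ('ICRS', 'ra', 'dec')
```

The COOSYS system value would map to the coords:SpaceFrame, and the two located columns to EquatorialPosition.ra and .dec, as described above.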
Validators
As noted above, the serializations may be validated to various degrees using the corresponding schema:
- VOTable-1.3 using votlint: verifies the serialization complies with VOTable syntax
- VOTable-1.3 + VODML: verifies the serialization is properly annotated
- XML using xmllint with model schema: verifies the serialization is a valid instance of the model.
- NOTE: The modeler examples undergo all levels of validation, showing that the VOTable serializations are also valid instances of the model.
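For orientation, the levels differ in what they check: the cheapest check is plain XML well-formedness, which the standard library can do; full instance validation against the model schema requires a schema-aware tool such as xmllint (e.g. `xmllint --schema Meas.xsd instance.xml`) or a validating library. A minimal sketch of the first level:

```python
# Minimal sketch: xml.etree (stdlib) checks well-formedness only.
# Schema validation, as used for the model instances above, needs an
# external tool (xmllint) or a schema-aware library; this is just the
# first rung of the validation ladder described in this section.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))  # True
print(is_well_formed("<a><b></a>"))   # False
```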
I don't believe there are validators for the various software utilities. Their purpose is to show that given an agreed serialization which can be mapped to the model(s), the data can be interpreted in an accurate and useful manner.
Links with Coords
The Measurement model is heavily dependent on the Coordinates model (also in RFC) for its core elements. Information about its relation to the Coordinates model, and how the requirements are distributed between them, can be found on the STC2 page.
Comments from the IVOA Community during RFC/TCG review period: 2019-09-17 - 2019-10-21
Comments by Markus Demleitner, 2019-09-20
First, I am very much in favour of extending the RFC for this until we have the annotation syntax defined, at least at the level of a PR. True, for DMs the question of what "implementation" means is always a bit tricky. In this particular case, however, it is very clear that most people will only properly look at things if they know what they will be doing with it. That is particularly true for client authors. I'd go as far as to say: I consider the DM implementation-proven if there's astropy-affiliated code consuming at least 60% of the model.
Then to individual points:
(1) I would at least like to see one "catalogue data" use case. I will contribute one if you ask me, but frankly, I'd say the VizieR people have the most comprehensive picture of what kinds of errors are out there and what people do with them. The very least, I guess, would be "A client wants to plot error bars without further user intervention".
(2) In requirement meas.003: After reading the standard, I think I understood what that means, although I'm not sure what the reason for the requirement is (let alone which use case it is derived from). Let me try: "Each error instance must only be referenced by a single measurement." Is that what you mean? If so, why?
(3) While the document certainly cannot be an introduction into error calculus, I have to say I can't tell the difference between Error.statError and Error.ranError. (I've looked things up in the Wikipedia, and it says: "Measurement errors can be divided into two components: random error and systematic error.") So... from my own experience I'd say it would be wise to either say a few words on what's a statError and what's a ranError or, if that's too long, perhaps point to some textbook.
(4) I don't think I quite understand what requirement makes you introduce "Time" over "Generic Measure" -- as far as errors are concerned, is there anything special about time? Why would I use it rather than Generic Measure, which, as far as I can tell, works just as well? If it's just about the value representation, I'd much prefer if it were left to the serialisation format (like VOTable) -- it's always evil if the same information is represented in two different ways. Similar considerations apply to position, velocity, and polarization (see below, though).
(5) I'm totally against having different classes for coordinates in different frames. It makes the model a lot larger without helping anything over the simple provision of a frame. And you'll have to say what should happen if you annotate a GalacticPosition with an equatorial frame. I may be swayed to accept that error modelling is a bit different in curvilinear coordinates (spherical, cylindrical, whatever) versus plain cartesian. But then the classes should be, perhaps, CurvilinearCoordinate versus CartesianCoordinate.
(6) In 8.1, you require "Velocity.coordinate frames == same SpaceFrame" – but measurement doesn't really talk about frames, nor should it, right? Is that, perhaps, a leftover from the monolithic model?
(7) Polarization -- again, I'd much rather not talk about concrete physics in Meas. If you really have to enumerate all the different things people can measure, that should be done somewhere else (and probably reuse resources like the UAT). Having some special treatment for Measurements over discrete domains sounds like a reasonable thing to want, though. For that, though, I'd say the diagram in sect. 9 is rather confusing. I read it as "Polarization inherits from Measure (which has 1:n to Error), and we somehow say n=0 here". To me, it'd feel a lot more natural if discrete observables inherited from something that doesn't have (numerical) errors in the first place. Also, of course, discrete distributions might make sound error models, too, so just saying "discrete values have no errors" is probably too limiting for a number of interesting use cases.
(8) I'm not really too happy about Bounds?D, Ellipse, Ellipsoid, and CovarianceMatrix all together in one model. True, CovarianceMatrix is a bit of an abstract concept, but it's really straightforward to translate all the specific cases into a covariance matrix (well, I admit I've not really understood the difference between Bounds?D and the Ellipses; see also my point 10), so straightforward that you could just put it into the model description.
(9) Matrix2x2 and Matrix3x3 -- it feels odd to have separate classes for arrays of different sizes. I think most people would model arrays as essentially all programming languages I'm aware of do: it's a tuple of a type and one or more axis lengths. Be that as it may, I'd not use a generic matrix for covariance matrices anyway, as they are, by construction, symmetric. Perhaps the one extra number (2D) or 3 extra numbers (3D) are not a big deal, but if you put in these extra numbers, there's a noticeable amount of extra work (e.g., validators will have to check for symmetry). So, I could very well see classes covMat2 and covMat3 that have the diagonal and, say, the upper diagonal elements.
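The covMat2 idea from this point can be sketched as follows. The class name and field names are illustrative only (taken from the comment's suggestion, not from any model): only the independent elements are stored, and the full matrix is expanded on demand, so symmetry holds by construction and needs no validator check.

```python
# Sketch of the suggested covMat2: store the two diagonal elements and
# the single off-diagonal one, then expand to a full (necessarily
# symmetric) 2x2 matrix on demand. Names are illustrative, not model terms.
from dataclasses import dataclass

@dataclass
class CovMat2:
    var_x: float   # diagonal element C[0][0]
    var_y: float   # diagonal element C[1][1]
    cov_xy: float  # off-diagonal element C[0][1] == C[1][0]

    def to_matrix(self):
        return [[self.var_x, self.cov_xy],
                [self.cov_xy, self.var_y]]

m = CovMat2(var_x=1.0, var_y=4.0, cov_xy=0.5).to_matrix()
assert m[0][1] == m[1][0]  # symmetric by construction
```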
(10) One thing I'm almost totally missing that I'd consider fairly important for a model concerned with errors is distributions (except in the AsymmetricalX classes which clearly imply there's no simple Gaussian). Sure, you can do a lot of interesting science with what's there, but there's much more that you can't when you make the implicit assumption of "aw, it's all gonna be Gaussian somehow" (which the "center+error" specification probably suggests to most people).
Doing this right is hard, so I'm sure we don't want to go too deeply into this at this point. I'm sure we should say something about assumed or non-assumed distributions, and I'm sure we should at least have some idea of what we'll do once we want to model them.
(11) Another rather fundamental thing: I'm not really happy with the way correlations are modelled. There's nothing fundamentally wrong with a covariance matrix, of course, but we'd be a lot more flexible in common cases if we could just declare a correlation as such. For instance, you'd say:
ra = GenericMeasure(value=20, Error(id=err-ra, statError=1e-7))
dec = GenericMeasure(value=30, Error(id=err-dec, statError=1e-7))
Correlation(err1=err-ra, err2=err-dec, coeff=0.5)
-- true, it means picking apart the covariance matrix into lots of individual components, but that's very natural in many applications anyway (I'll just mention Gaia), and it lets us grow into more complicated models of correlation if we need them. Also true, when people actually store the values aggregated (as in points), this becomes a bit tricky, but by adding array-index attributes to the correlation reference, that's easily healed.
I'll also note that the current model is insufficient to annotate the Gaia result catalog, because there, the errors of all five primary observables are correlated. Perhaps annotating Gaia DR2 in full depth is a lot to ask, but with a generic Correlation class you could at least sensibly annotate all the columns that we do have, which I'd count as an indication it may be preferable (not to mention it's simpler model-wise).
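The relationship between the two representations under discussion can be made concrete: given per-measure statistical errors and pairwise Correlation records, the covariance matrix follows from C[i][j] = rho_ij * sigma_i * sigma_j. The sketch below uses hypothetical names (the function and its dict arguments are not part of any IVOA model) and the ra/dec numbers from the example above.

```python
# Sketch of the per-pair Correlation idea: rebuild the full covariance
# matrix from per-measure statErrors and pairwise coefficients via
# C[i][j] = rho_ij * sigma_i * sigma_j. All names are illustrative.

def covariance_matrix(sigmas, correlations):
    """sigmas: {name: statError}; correlations: {(name1, name2): coeff}."""
    names = list(sigmas)
    n = len(names)
    cov = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i == j:
                cov[i][j] = sigmas[a] ** 2
            else:
                rho = correlations.get((a, b)) or correlations.get((b, a)) or 0.0
                cov[i][j] = rho * sigmas[a] * sigmas[b]
    return names, cov

# Two correlated observables, as in the ra/dec example above; extending
# the dicts to five entries would cover a Gaia-like correlated set.
names, cov = covariance_matrix({"ra": 1e-7, "dec": 1e-7},
                               {("ra", "dec"): 0.5})
assert cov[0][1] == cov[1][0]  # symmetry falls out of the construction
```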
Summing up: I'd make the model a lot smaller (thus creating space we'll need once we tackle non-Gaussian errors in earnest one day) and only keep Measure and Error with statError and sysError. From Uncertainty, I'd retain symmetrical and asymmetrical errors. Whether I'd keep the covariance matrix or rather explicitly model correlation as per 11 I don't know, though for the reason just mentioned I'm leaning towards separate correlation annotations. I'd probably try to see how either works out in implementation.