STC2:Meas Proposed Recommendation: Request for Comments

Summary

Version 1 of STC was developed in 2007, prior to the development and adoption of VO-DML modeling practices. As we progress to the development of VO-DML compliant component models, it is necessary to revisit those models which define core content. Additionally, the scope of the STC-1.0 model is very broad, making a complete implementation, and the development of validators, very difficult. It may therefore be prudent to break the content of STC-1.0 into a set of component models which, as a group, cover the scope of the original.

This effort will start from first principles with respect to defining a specific project use-case, from which requirements will be drawn, satisfied by the model, and implemented in the use-case. We will make use of the original model to ensure that the coverage of concepts is complete and that the models will be compatible. However, the form and structure may be quite different. This model will use vo-dml modeling practices, and model elements may be structured differently to more efficiently represent the concepts.

This model covers the description of measured or determined astronomical data, and includes the following concepts:

  • The association of the determined 'value' with corresponding errors. In this model, the 'value' is given by the various Coordinate types of the Coordinates data model (Rots and Cresitello-Dittmar et al., 2019).
  • A description of the Error model.
The latest version of the model and supporting documents:
  • Model document: here
  • VO-DML/XML representation: here
  • XML Schema: here

Implementation Requirements

(from DM Working group twiki):

The "IVOA Document Standards" standard gives a broad outline of the implementation requirements for IVOA standards. These requirements fit the higher-level standards for applications and protocols better than they fit data models themselves. At the Oct 2017 interop in Trieste, the following implementation requirements for Data Model standards were agreed upon, allowing the models to be vetted against their requirements and use cases without needing full science use cases to be implemented.

  • VO-DML models must validate against schema
  • Serializations which touch each entity of the model. These serializations may be 'fake' (i.e., not based on actual data files), and are to be produced by the modeler as unit tests/examples.
  • Real world serializations covering use cases, produced by others following the model, in a mutually agreed upon format.
  • Software which interprets these serializations and demonstrates proper interpretation of the content

Serializations:

  • Modeler Generated Examples:
    • Using home-grown Python code, the modeler has generated example serializations which span all elements of the model. The examples are generated in 4 formats:
      • VOTable-1.3 standard syntax; Validates using votlint
      • VOTable-1.3 annotated with VO-DML/Mapping syntax; Validates using xmllint to a VOTable-1.3 schema enhanced with an imported VO-DML mapping syntax schema
      • XML format; Validates against the model schema
      • An internal DOC format; XML/DOM structure representing the instances generated when interpreting the instance templates.
  • Real world serializations:

Software:

  • vodml Parser: Notebook developed by Omar Laurino parses vo-dml/Mapping annotation to generate instances of TimeSeries -Cube-Meas-Coord using AstroPy and generates plots of the content using Matplotlib. Note: this was developed and presented in 2017 using earlier drafts of the models. These vary only in detail, and the demo could be updated to the current model suite.
  • TDIG: Working project of Time Series as Cube.
    • An ongoing project is underway to enhance SPLAT to load/interpret/analyze TimeSeries data. This was most recently presented at the IVOA Interop in Paris 2019 (see PDF)
    • The tool currently uses a combination of TIMESYS, table column descriptions, UCDs and UTypes to identify and interpret the data automatically. Each of these are annotation schemes which tie directly to model components.
    • Delays in settling on a standard annotation syntax have slowed progress toward fully realizing this project's possibilities. This is a high priority for upcoming work.
  • pyVO: extract_skycoord_from_votable()
    • Demonstrated at the IVOA Interop in Paris 2019, this product of the hack-a-thon generates AstroPy SkyCoord instances from VOTables using various elements embedded in the VOTable.
      • Interrogates a VOTable-1.3 serialization, identifies key information and uses that to automatically generate instances of SkyCoord.
        • UCD: 'pos.eq.ra', 'pos.eq.dec'
        • COOSYS.system: "ICRS", "FK4", "FK5"
        • COOSYS.equinox
      • The COOSYS maps directly to SpaceFrame, with the value of its system attribute identifying the reference frame.
      • The UCD 'pos.eq' maps directly to meas:EquatorialPosition; with 'pos.eq.ra|dec' identifying the corresponding attributes (EquatorialPosition.ra|dec) as coordinates coords:Longitude and coords:Latitude.
      • This illustrates that even with minimal annotation, this sort of automatic discovery/instantiation can take place. With a defined annotation syntax, this utility could be expanded to generate other AstroPy objects very easily.
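As an illustration of the kind of scan described above, the following stdlib-only sketch locates the key information in a VOTable. This is not the actual pyVO code; the function name and return shape are purely illustrative, but the UCDs and COOSYS attributes are the ones listed above.

```python
# Illustrative sketch (not the actual pyVO code) of the scan that
# extract_skycoord_from_votable() performs: find the FIELDs carrying
# the pos.eq.ra / pos.eq.dec UCDs and read the COOSYS system, which is
# enough to construct an astropy SkyCoord from the table data afterwards.
import xml.etree.ElementTree as ET

NS = {"vot": "http://www.ivoa.net/xml/VOTable/v1.3"}

def find_position_columns(votable_xml: str):
    """Return (frame, ra_column_index, dec_column_index), or None if absent."""
    root = ET.fromstring(votable_xml)
    coosys = root.find(".//vot:COOSYS", NS)
    frame = coosys.get("system") if coosys is not None else None
    ra_idx = dec_idx = None
    for i, field in enumerate(root.findall(".//vot:FIELD", NS)):
        ucd = (field.get("ucd") or "").lower()
        if ucd.startswith("pos.eq.ra"):
            ra_idx = i
        elif ucd.startswith("pos.eq.dec"):
            dec_idx = i
    if ra_idx is None or dec_idx is None:
        return None
    return (frame, ra_idx, dec_idx)
```

Given the frame string and the two column indices, building the SkyCoord is a one-liner in AstroPy.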

Validators

As noted above, the serializations may be validated to various degrees using the corresponding schemas:

  • VOTable-1.3 using votlint: verifies the serialization complies with VOTable syntax
  • VOTable-1.3 + VODML: verifies the serialization is properly annotated
  • XML using xmllint with model schema: verifies the serialization is a valid instance of the model.
  • NOTE: The modeler examples undergo all levels of validation, showing that the VOTable serializations are also valid instances of the model.
I don't believe there are validators for the various software utilities. Their purpose is to show that given an agreed serialization which can be mapped to the model(s), the data can be interpreted in an accurate and useful manner.

Links with Coords

The Measurement model is heavily dependent on the Coordinates model (also in RFC) for its core elements. Information about its relation to the Coordinates model, and how the requirements are distributed between them, can be found on the STC2 page.

Comments from the IVOA Community during RFC/TCG review period: 2019-09-17 - 2019-10-21

Comments by Markus Demleitner, 2019-09-20

First, I am very much in favour of extending the RFC for this until we have the annotation syntax defined, at least at the level of a PR. True, for DMs the question of what "implementation" means is always a bit tricky. In this particular case, however, it is very clear that most people will only properly look at things if they know what they will be doing with it. That is particularly true for client authors. I'd go as far as to say: I consider the DM implementation-proven if there's astropy-affiliated code consuming at least 60% of the model.

Then to individual points:

(1) I would at least like to see one "catalogue data" use case. I will contribute one if you ask me, but frankly, I'd say the VizieR people have the most comprehensive picture of what kinds of errors are out there and what people do with them. The very least, I guess, would be "A client wants to plot error bars without further user intervention".

(2) In requirement meas.003: After reading the standard, I think I understood what that means, although I'm not sure what the reason for the requirement is (let alone which use case it is derived from). Let me try: "Each error instance must only be referenced by a single measurement." Is that what you mean? If so, why?

(3) While the document certainly cannot be an introduction to error calculus, I have to say I can't tell the difference between Error.statError and Error.ranError. (I've looked things up in Wikipedia, and it says: "Measurement errors can be divided into two components: random error and systematic error.") So... from my own experience I'd say it would be wise to either say a few words on what's a statError and what's a ranError or, if that's too long, perhaps point to some textbook.

(4) I don't think I quite understand what requirement makes you introduce "Time" over "Generic Measure" -- as far as errors are concerned, is there anything special about time? Why would I use it rather than Generic Measure, which, as far as I can tell, works just as well? If it's just about the value representation, I'd much prefer if it were left to the serialisation format (like VOTable) -- it's always evil if the same information is represented in two different ways. Similar considerations apply to position, velocity, and polarization (see below, though).

(5) I'm totally against having different classes for coordinates in different frames. It makes the model a lot larger without helping anything over the simple provision of a frame. And you'll have to say what should happen if you annotate a GalacticPosition with an equatorial frame. I may be swayed to accept that error modelling is a bit different in curvilinear coordinates (spherical, cylindrical, whatever) versus plain cartesian. But then the classes should be, perhaps, CurvilinearCoordinate versus CartesianCoordinate.

(6) In 8.1, you require "Velocity.coordinate frames == same SpaceFrame" – but measurement doesn't really talk about frames, nor should it, right? Is that, perhaps, a leftover from the monolithic model?

(7) Polarization -- again, I'd much rather not talk about concrete physics in Meas. If you really have to enumerate all the different things people can measure, that should be done somewhere else (and probably reuse resources like the UAT). Having some special treatment for Measurements over discrete domains sounds like a reasonable thing to want, though. For that, though, I'd say the diagram in sect. 9 is rather confusing. I read it as "Polarization inherits from Measure (which has 1:n to Error), and we somehow say n=0 here". To me, it'd feel a lot more natural if discrete observables inherited from something that doesn't have (numerical) errors in the first place. Also, of course, discrete distributions might make sound error models, too, so just saying "discrete values have no errors" is probably too limiting for a number of interesting use cases.

(8) I'm not really too happy about Bounds?D, Ellipse, Ellipsoid, and CovarianceMatrix all together in one model. True, CovarianceMatrix is a bit of an abstract concept, but it's really straightforward to translate all the specific cases into a covariance matrix (well, I admit I've not really understood the difference between Bounds?D and the Ellipses; see also my point 10), so straightforward that you could just put it into the model description.

(9) Matrix2x2 and Matrix3x3 -- it feels odd to have separate classes for arrays of different sizes. I think most people would model arrays as essentially all programming languages I'm aware of do: it's a tuple of a type and one or more axis lengths. Be that as it may, I'd not use a generic matrix for covariance matrices anyway, as they are, by construction, symmetric. Perhaps the one extra number (2D) or 3 extra numbers (3D) are not a big deal, but if you put in these extra numbers, there's a noticeable amount of extra work (e.g., validators will have to check for symmetry). So, I could very well see classes covMat2 and covMat3 that have the diagonal and, say, the upper diagonal elements.
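The covMat2 idea can be sketched as follows. The class and attribute names here are hypothetical, not part of the model; the point is that storing only the independent elements makes symmetry hold by construction, so no validator needs to check it.

```python
# Sketch of the suggested covMat2: store the two variances and the single
# independent off-diagonal element; the full matrix is derived, so the
# symmetry constraint cannot be violated by a serialization.
from dataclasses import dataclass

@dataclass
class CovMat2:
    var1: float    # variance along axis 1
    var2: float    # variance along axis 2
    cov12: float   # the single independent off-diagonal element

    def as_matrix(self):
        """Expand to the full (symmetric) 2x2 matrix."""
        return [[self.var1, self.cov12],
                [self.cov12, self.var2]]
```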

(10) One thing I'm almost totally missing that I'd consider fairly important for a model concerned with errors is distributions (except in the AsymmetricalX classes which clearly imply there's no simple Gaussian). Sure, you can do a lot of interesting science with what's there, but there's much more that you can't when you make the implicit assumption of "aw, it's all gonna be Gaussian somehow" (which the "center+error" specification probably suggests to most people).

Doing this right is hard, so I'm sure we don't want to go too deeply into this at this point. I'm sure we should say something about assumed or non-assumed distributions, and I'm sure we should at least have some idea of what we'll do once we want to model them.

(11) Another rather fundamental thing: I'm not really happy with the way correlations are modelled. There's nothing fundamentally wrong with a covariance matrix, of course, but we'd be a lot more flexible in common cases if we could just declare a correlation as such. For instance, you'd say:

ra = GenericMeasure(value=20, Error(id=err-ra, statError=1e-7))
dec = GenericMeasure(value=30, Error(id=err-dec, statError=1e-7))
Correlation(err1=err-ra, err2=err-dec, coeff=0.5)

-- true, it means picking apart the covariance matrix into lots of individual components, but that's very natural in many applications anyway (I'll just mention Gaia), and it lets us grow into more complicated models of correlation if we need them. Also true, when people actually store the values aggregated (as in points), this becomes a bit tricky, but by adding array-index attributes to the correlation reference, that's easily healed.

I'll also note that the current model is insufficient to annotate the Gaia result catalog, because there, the errors of all five primary observables are correlated. Perhaps annotating Gaia DR2 in full depth is a lot to ask, but with a generic Correlation class you could at least sensibly annotate all the columns that we do have, which I'd count as an indication it may be preferable (not to mention it's simpler model-wise).
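The separate-Correlation idea from point (11) can be sketched with hypothetical Python classes (all names are illustrative, not the model's): errors carry ids, a Correlation references two of them by id, and the off-diagonal covariance is recoverable whenever a consumer wants a matrix.

```python
# Minimal sketch of point (11): model the correlation as its own object
# referencing two Error instances by id, instead of a covariance matrix.
from dataclasses import dataclass

@dataclass
class Error:
    id: str
    stat_error: float   # standard deviation

@dataclass
class Correlation:
    err1: str           # id of the first Error
    err2: str           # id of the second Error
    coeff: float        # correlation coefficient in [-1, 1]

def covariance(errors, corr):
    """Recover the off-diagonal covariance implied by a Correlation."""
    by_id = {e.id: e for e in errors}
    return corr.coeff * by_id[corr.err1].stat_error * by_id[corr.err2].stat_error
```

With this structure, annotating only the correlated column pairs a catalogue actually publishes (the Gaia case) needs no full matrix.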

Summing up: I'd make the model a lot smaller (thus creating space we'll need once we tackle non-Gaussian errors in earnest one day) and only keep Measure and Error with statError and sysError. From Uncertainty, I'd retain symmetrical and asymmetrical errors. Whether I'd keep the covariance matrix or rather explicitly model correlation as per 11 I don't know, though for the reason just mentioned I'm leaning towards separate correlation annotations. I'd probably try to see how either works out in implementation.

Comments by Mark C-D

In generating the XML format examples, I noticed the following issue which should be considered:

  • The specialized positions have [1] multiplicity on each of the coordinates, so without all of them, the element is considered invalid. I think this is OK for most, but for Cartesian, I typically want to include only x,y (e.g. chipx,chipy for CCD coordinates). In this case in particular, I think the multiplicities for x, y, and z should be [0..1] so that any combination may be given.
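The proposed relaxation can be sketched as follows (class and attribute names are illustrative, not the model's): with [0..1] multiplicity each axis becomes optional, so a chipx/chipy-only instance is valid.

```python
# Sketch of the [0..1] multiplicity proposal: each Cartesian axis is
# optional, so a 2-D detector position with z unset is a valid instance.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CartesianPosition:
    x: Optional[float] = None
    y: Optional[float] = None
    z: Optional[float] = None

# e.g. a CCD chip coordinate: only x and y are meaningful
chip_pos = CartesianPosition(x=512.0, y=1024.0)
```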

Comments by F.-X. Pineau, 2019-09-24

I tend to agree with most of the points made by Markus, and I would like both to push a few of them further and to add extra ones.

1 - "stat" and "rand" errors also seem to be redundant to me (I am in favour of keeping "rand", "stat" sounding more general to me). If not, an explanation on the differences would be welcome (maybe "stat" is the combination of "sys" and "rand", or one describes Gaussian errors and not the other, ...?).

2 - When we assume that errors are Gaussian (we probably need a way to state this explicitly), we can convert covariance matrices into elliptical errors. But to convert symmetrical or elliptical errors into covariance matrices we need an extra piece of information: the "confidence level" (or a number of sigma, ...) the symmetrical or elliptical error is associated with. For example, the error radius in ROSAT is associated with a 68% confidence interval. In the WGACAT or FIRST catalogues, it is a 90% confidence interval. A lot of catalogues in VizieR contain circular positional errors associated with confidence intervals different from 39% (the value implied by a 2D covariance matrix).
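Assuming a circular 2-D Gaussian, the conversion this point asks for follows from the Rayleigh CDF, p = 1 - exp(-r^2 / (2 sigma^2)), so sigma = r / sqrt(-2 ln(1 - p)). A sketch (the function name is illustrative):

```python
# Convert a circular positional error radius quoted at confidence level p
# into a per-axis standard deviation, assuming a circular 2-D Gaussian:
#   p = 1 - exp(-r^2 / (2 sigma^2))  =>  sigma = r / sqrt(-2 ln(1 - p))
import math

def sigma_from_radius(radius: float, confidence: float) -> float:
    """Per-axis sigma for a circular error radius at the given confidence."""
    return radius / math.sqrt(-2.0 * math.log(1.0 - confidence))
```

At p = 1 - exp(-1/2) (about 39.35%) the radius equals sigma, which is the 39% figure quoted above; a 90% radius is about 2.146 sigma.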

3 - I am concerned about the definition of "Ellipse.posAngle" (10.9.2) in the case of positional uncertainties: if I am correct, the definition is different from the IAU definition of a position angle. E.g. see the 2MASS doc for "err_ang": the position angle is defined as being East of North, not North of East (the North being the North Celestial Pole).

4 - Having Equatorial, Galactic, and Ecliptic positions on the one hand, and a generic CartesianPosition on the other, also seems a little odd to me (a mix of coordinate types(?) and systems). Why not just "Spherical" (lon, lat) (or Curvilinear, as suggested by Markus) and "Cartesian" positions?

5 - Positional error matrices in catalogues are provided as 3 columns. The first 2 are the standard deviations (I have not yet seen variances). The 3rd one may be: the covariance (I do not remember a catalogue using it), the correlation (see e.g. SDSS DR7 or Gaia), or the co-sigma (see "sigradec" in the AllWISE doc). The current model accounts only for the covariances, and Markus's suggestion is to use correlations. We should probably support both (plus co-sigma).
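The three conventions each reduce to a covariance term given the two standard deviations. A sketch with illustrative function names; the co-sigma rule assumes the sign-and-square convention described for sigradec in the AllWISE documentation.

```python
# Reduce the three published third-column conventions to a covariance term.
import math

def cov_from_covariance(cov: float) -> float:
    return cov                                  # already a covariance

def cov_from_correlation(sig1: float, sig2: float, rho: float) -> float:
    return rho * sig1 * sig2                    # SDSS DR7 / Gaia style

def cov_from_cosigma(cosig: float) -> float:
    # co-sigma: signed quantity whose square is |cov| (AllWISE sigradec style)
    return math.copysign(cosig * cosig, cosig)
```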

6 - Like Markus, I have the Gaia case in mind. One can consider only the positions, or the positions + proper motions (+ plx + Vr). Can we make a measurement which is the composition of positions + PMs (+ ...)? In that case, how do we add the covariances (or correlation factors) between positional and PM parameters?

Comments from TCG member during the RFC/TCG Review Period: 2019-09-17 - 2019-10-21

TCG Chair & Vice Chair

Applications Working Group

Data Access Layer Working Group

Data Model Working Group

Grid & Web Services Working Group

Registry Working Group

Semantics Working Group

Data Curation & Preservation Interest Group

Education Interest Group

Knowledge Discovery Interest Group

Solar System Interest Group

Theory Interest Group

Time Domain Interest Group

Operations

  • Sec 8.6.1: The ProperMotion.lon coordinate is described simply as "Velocity in angular distance per unit time along the longitude axis." It is common, though not universal, practice to quote longitudinal PM premultiplied by cos(lat) so that the magnitude of the quantity is not affected by its latitudinal position - for instance pmra in the Gaia source catalogue is always reported thus. I don't know if you've considered the possibility of defining it in that way, but I think this issue should be mentioned explicitly in the discussion of ProperMotion.lon. At the least say something like "this is not premultiplied by cos(lat)" to avoid uncertainty, but other possibilities, such as defining it here to include this factor, or providing some way to choose whether the cos(lat) factor is included, might be better.
  • Sec 4.1.1, 7.2.1, 8.2.1: Multiplicity is quoted as "0..*", which seems to be at odds with the notation in Section B.7; it looks like these should be written "*" instead.
-- MarkTaylor - 2019-09-27
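The ProperMotion.lon point above is a one-factor conversion between the two conventions. A sketch with illustrative names, assuming latitude in degrees:

```python
# The two pm-longitude conventions: Gaia's pmra column is premultiplied by
# cos(dec), while a raw dlon/dt is not; converting is a single cos factor.
import math

def pm_lon_cosdelta(pm_lon: float, lat_deg: float) -> float:
    """Convert raw dlon/dt to the cos(lat)-scaled convention (Gaia pmra style)."""
    return pm_lon * math.cos(math.radians(lat_deg))

def pm_lon_raw(pm_lon_cosd: float, lat_deg: float) -> float:
    """Invert: recover raw dlon/dt from the cos(lat)-scaled value."""
    return pm_lon_cosd / math.cos(math.radians(lat_deg))
```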

Standards and Processes Committee

TCG Vote:

If you have minor comments (typos) on the last version of the document please indicate it in the Comments column of the table and post them in the TCG comments section above with the date.

Group Yes No Abstain Comments
TCG        
Apps        
DAL        
DM        
GWS        
Registry        
Semantics        
DCP        
KDIG        
SSIG        
Theory        
TD        
Ops        
StdProc        


Topic revision: r9 - 2019-09-27 - MarkTaylor
 