IVOA November 2024 Interop Operations

Session 11/17/2024

Euro-VO Registry - Henrik Norman

Validator is polling services for metadata, working to feed that back to the registry entries.

Compliance over time

SCS

1700 new services in the last year
Incorrect UCDs are the most common issue
No ID_MAIN column, or more than one
Missing POS_EQ_RA_MAIN / POS_EQ_DEC_MAIN
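
The SCS issues listed above can be checked mechanically before a service is registered. A minimal sketch, assuming the response is available as a VOTable string and that each of the three UCDs should appear on exactly one FIELD; the function names and the sample document are made up for illustration and are not the Euro-VO validator's code:

```python
import xml.etree.ElementTree as ET

# UCDs that a cone search response is expected to carry exactly once each
REQUIRED_UCDS = ("ID_MAIN", "POS_EQ_RA_MAIN", "POS_EQ_DEC_MAIN")


def local_name(tag):
    """Strip an XML namespace such as '{http://...}FIELD' down to 'FIELD'."""
    return tag.rsplit("}", 1)[-1]


def check_scs_ucds(votable_xml):
    """Return a list of problems; an empty list means each UCD appears once."""
    root = ET.fromstring(votable_xml)
    counts = {ucd: 0 for ucd in REQUIRED_UCDS}
    for elem in root.iter():
        if local_name(elem.tag) == "FIELD":
            ucd = elem.get("ucd", "")
            if ucd in counts:
                counts[ucd] += 1
    problems = []
    for ucd, n in counts.items():
        if n == 0:
            problems.append(f"missing FIELD with ucd={ucd}")
        elif n > 1:
            problems.append(f"{n} FIELDs with ucd={ucd}, expected exactly one")
    return problems


# Invented sample: duplicate ID_MAIN and no declination column
SAMPLE = """<VOTABLE version="1.1"><RESOURCE><TABLE>
<FIELD name="id" datatype="char" arraysize="*" ucd="ID_MAIN"/>
<FIELD name="ra" datatype="double" ucd="POS_EQ_RA_MAIN"/>
<FIELD name="alt_id" datatype="char" arraysize="*" ucd="ID_MAIN"/>
</TABLE></RESOURCE></VOTABLE>"""

for problem in check_scs_ucds(SAMPLE):
    print(problem)
```

This only counts UCDs; it does not check datatypes or the rest of the response, which the registry validator covers.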

SIA1

Uptick in new services (NOIRLab)
VOTable must have a RESOURCE element
Consider adding <testQuery> to VO resource

SIA2 – IRSA accounts for 85% of the errors

SSA – non-compliant mostly due to down services, might clean up

TAP – very diverse errors, encourage providers to use taplint or registry validator

Column name collisions (size, date, value, area, position, level) – IRSA usually renames them but curious what general position is.
Markus Demleitner/Mark Taylor – use delimiting quotes in database and also registry records
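
A sketch of the delimited-identifier suggestion, assuming a client or metadata generator wants to quote column names on the way out. The set of colliding names below is only the handful mentioned in the notes, not the full ADQL reserved-word list, and `adql_identifier` is a hypothetical helper name:

```python
import re

# Illustrative subset only: the names mentioned in the notes.
# A real service would use the full reserved-word list from the ADQL spec.
COLLIDING_NAMES = {"size", "date", "value", "area", "position", "level"}

# ADQL regular identifiers: a letter followed by letters, digits or underscores
_REGULAR = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def adql_identifier(name):
    """Return name unchanged if it is a safe regular identifier,
    otherwise wrap it in double quotes as a delimited identifier."""
    if _REGULAR.match(name) and name.lower() not in COLLIDING_NAMES:
        return name
    # Embedded double quotes are escaped by doubling them
    return '"' + name.replace('"', '""') + '"'


columns = ["ra", "size", "obs date"]
print("SELECT " + ", ".join(adql_identifier(c) for c in columns) + " FROM my_table")
```

Delimited identifiers are case-sensitive, so the quoted form has to match the column name exactly as the database and the registry record declare it.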

Conclusions:

In 2025, will actively reach out to providers
Reach out to esavo.registry@cosmos.esa.int for help

QUESTIONS

Mark Taylor (University of Bristol): If people are using taplint: please feel free to email Mark for help understanding errors
Xiuqin Wu (NED): would love to hear about issues.
- Helps to make sure contact info in registry resources is correct.
- Markus Demleitner: notices when a registry goes down and eventually reaches out, but not when a service goes down. But Euro-VO does notice down services; maybe they can start reaching out to providers?
Tamara Civera (CEFCA): who will be contacted (fully non-compliant or partially?)
- First priority – offline services
- A few providers have about 80% of issues so will reach out to them next (partial compliance)

Survey of TAP Error Messages - Markus Demleitner

“Writing good error messages is great art”

Formal survey of TAP responses for various error conditions:

Non-existent table
Non-existent column
Mistyped SELECT or FROM
Non-VOTable inline upload
Referencing non-existing column in uploaded table

Tested wide variety of implementations

Markus Demleitner (GAVO) – do not recommend trying to write your own ADQL parser – several good ones out there already

Classified responses (good, ok, meh, no useful error, unrelated error, no error, IRSA – None as job.phase, bug in async implementation)

(See notebook attached to presentation for results)
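
For service operators who want to look at their own error responses in the same spirit, here is a minimal sketch. It assumes the service reports errors as a VOTable carrying an INFO element with name="QUERY_STATUS" and value="ERROR". The "does the message name the culprit" check is a crude stand-in invented here, not the classification used in the survey, and the sample message is made up:

```python
import xml.etree.ElementTree as ET


def tap_error_message(votable_xml):
    """Return the text of the QUERY_STATUS=ERROR INFO element, or None."""
    root = ET.fromstring(votable_xml)
    for elem in root.iter():
        is_info = elem.tag.rsplit("}", 1)[-1] == "INFO"  # ignore namespace
        if (is_info and elem.get("name") == "QUERY_STATUS"
                and elem.get("value") == "ERROR"):
            return (elem.text or "").strip()
    return None


def names_culprit(message, culprit):
    """Crude usefulness check: does the message mention the offending name?"""
    return message is not None and culprit.lower() in message.lower()


# Invented error document for a query against a non-existent table
SAMPLE_ERROR = """<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3" version="1.3">
<RESOURCE type="results">
<INFO name="QUERY_STATUS" value="ERROR">Unknown table: no_such_table</INFO>
</RESOURCE></VOTABLE>"""

msg = tap_error_message(SAMPLE_ERROR)
print(msg, names_culprit(msg, "no_such_table"))
```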

Note – this was as much a test of async mode and the pyVO TAP API

If you run a TAP service, might be good to include some pyVO usage in tests

Open discussion - All

Tom Donaldson (MAST) – Proposed topic: sometimes validators see “errors” due to firewall/security settings

Overprotective IT departments. At MAST, they repeatedly block TAP queries that look like potential SQL injection (appliances are configured to look for that). Happens for outbound traffic too. (errors very cryptic) Will eventually affect https traffic as well as http.
How many of our TAP services and ADQL translators ARE susceptible to sql injection?
Markus D. – doesn’t think SQL injection is actually a risk since only passing select statements
Tom D.: yes can do lots of things in ADQL parser to mitigate risks but… also has seen some extremely clever examples of SQL injection that could get around all that.
How do we communicate to IT departments that a certain type of query is ok?
Have a validator for the ADQL language – add some tests for catching injection attempts there?
CREATE TABLE (user uploaded table), DROP TABLE – much riskier
How about DDOS mitigation? (Killing jobs that run too long etc) Most have some policy of killing jobs that run too long / take too many resources. Heidelberg: 5-10 second timeout on TAP-sync
Steve Groom (IRSA) – helps to make sure account running the query on the database doesn’t have permissions to do dangerous things. Re DDOS – one of the reasons we will have to do authentication is that when queries coming to us are using the cloud, we can’t tell who they are and can’t block all of AWS because one user is being abusive.
Tamara C.: seconds the idea of having checks for SQL injection susceptibility in TAP validators. One type of DDOS mitigation is when a (normally) innocent and anonymous user runs a service for more than 1 million objects in parallel (for example a cutout service). In those cases we block the user by IP and, in most cases, the user contacts us to find out what is happening. It is the only way we have to learn who the user is, so that we can help him/her avoid this "abusive" type of use of our services.
Markus D.: most DDOSes are the result of innocent errors, not malicious actors. Maybe have services return tips/suggestions like “you’re doing it wrong, try it XXX way?”
Anastasia Laity (Caltech) – even so, need a way to connect incoming queries with individual person; can be distributed over cloud IPs such that we can’t just turn off access from a whole subnet.
Xiuqin W.: and users may not see the returned error messages in time to be useful – they just fire up a giant parallel script and walk away
HTTP 429 “Too Many Requests” is one straightforward thing that can be done. (Mark T - clients also need to trap this usefully to pass on to user)
All much easier to manage if we have authenticated users – for example, can route users to individual queues so one heavy user doesn’t prevent others from running queries
Anne Raugh (PDS) – will need to migrate to NASA mission cloud. DDOS in cloud = denial of funding attack. NASA first approach will be throttling. Suspect gonna wind up requiring auth.
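
On the client side of the 429 suggestion raised in this discussion, a minimal sketch of trapping the response and telling the user about it rather than failing cryptically. It assumes the server may send a Retry-After header in seconds. `fetch_with_backoff`, its defaults and the example URL are invented for illustration, and the fake opener exists only so the logic can be run offline:

```python
import io
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_tries=4, default_wait=5.0, max_wait=300.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, pausing and retrying when the server answers 429.

    opener and sleep are injectable so the logic can be exercised offline.
    """
    for attempt in range(1, max_tries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_tries:
                raise  # other errors, or out of tries: let the caller see it
            headers = err.headers or {}
            try:
                wait = float(headers.get("Retry-After"))
            except (TypeError, ValueError):
                # Header missing or given as an HTTP date: use a fixed pause
                wait = default_wait
            wait = min(max(wait, 0.0), max_wait)
            print(f"Server asked us to slow down (429); waiting {wait:.0f}s "
                  f"before try {attempt + 1} of {max_tries}")
            sleep(wait)


# Offline demonstration: a fake opener that answers 429 twice, then succeeds
calls, waits = [], []


def fake_opener(url):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.HTTPError(url, 429, "Too Many Requests",
                                     {"Retry-After": "2"}, io.BytesIO())
    return io.BytesIO(b"ok")


body = fetch_with_backoff("https://example.org/tap/sync",
                          opener=fake_opener, sleep=waits.append)
```

This does not help with the unattended parallel script case mentioned above unless the script itself goes through such a wrapper.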

Cloud and authentication:

NASA instructing providers to move content to the cloud (via NASA-managed platforms)
Will have to worry about more than just slow servers… what if expense limit totally exhausted 3 days into a payment period? Just no more data egress for anyone for weeks? Auth means you can communicate with or block a particular user before access to data is removed for everyone
Motivation to put in the cloud is principally so people can compute against data IN THE CLOUD
Anne R.: one backup plan if throttling not sufficient: unauthenticated users get a base level of access, with an account you get more support
How would authentication affect validators? Not about limiting them – just about knowing who they are. Validators would need a key / authentication.
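
The throttling and tiered-access ideas above (a base level for unauthenticated users, more for account holders) could be sketched with a token bucket per client. The tier names, rates and burst sizes are placeholders, not anyone's actual policy, and keying anonymous clients by IP is exactly the part the discussion flags as unreliable for cloud traffic:

```python
import time

# Hypothetical tiers: (requests per second, burst size). Placeholder numbers.
TIERS = {"anonymous": (1.0, 5), "authenticated": (10.0, 50)}


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last call, up to capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Throttle:
    """One bucket per (tier, client key); the key is an IP or an account name."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.buckets = {}

    def allow(self, client_key, authenticated):
        tier = "authenticated" if authenticated else "anonymous"
        key = (tier, client_key)
        if key not in self.buckets:
            rate, capacity = TIERS[tier]
            self.buckets[key] = TokenBucket(rate, capacity, self.clock)
        return self.buckets[key].allow()


# Demonstration with a fake clock so the outcome does not depend on timing
fake_now = [0.0]
throttle = Throttle(clock=lambda: fake_now[0])
anon = [throttle.allow("203.0.113.7", False) for _ in range(7)]
user = [throttle.allow("alice", True) for _ in range(7)]
fake_now[0] = 2.0  # two seconds later the anonymous bucket has refilled a bit
anon_later = [throttle.allow("203.0.113.7", False) for _ in range(3)]
print(anon, user, anon_later)
```

A request for which `allow` returns False is the natural place for the service to answer 429. A validator with its own key would simply get its own bucket.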

User agents:

Steve G – got some raw data from IRSA logs, happy to share / discuss popularity of different user agents
Tim Jenness (Vera Rubin) – is it a pretty even split or is it like 95% pyvo/firefly/topcat?
Varies a lot by service (TAP v SCS v SIA). 90% of SCS at IRSA is coming from TOPCAT. Most other stuff via python/java, probably astropy/firefly. We can only see hit counts, which aren’t an exact mapping to people/users. On the image retrieval side, still need to break down search v download; most incoming queries are via wget but that’s probably from running IRSA download scripts.
Suggest clients check the VO note about identifying operational components (useful user-agent)
CDS usage – 3 broad categories. Other institutions synchronizing/lookup (NED etc), public using wide variety of clients, or scientists using a small number of clients
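
A rough sketch of the kind of user-agent tally discussed above, assuming the User-Agent strings have already been pulled out of the access logs. The sample strings are made up, and reducing each string to its first product token is a simplification chosen here, not how IRSA or CDS do it:

```python
from collections import Counter


def client_family(user_agent):
    """Reduce a User-Agent string to the lower-cased name of its first
    product token, e.g. 'Wget/1.21.4' becomes 'wget'."""
    stripped = user_agent.strip()
    first = stripped.split(" ", 1)[0] if stripped else ""
    return first.split("/", 1)[0].lower() or "(none)"


def tally(user_agents):
    """Count hits per client family."""
    return Counter(client_family(ua) for ua in user_agents)


# Invented sample; real logs would supply one string per request
SAMPLE_AGENTS = [
    "Wget/1.21.4",
    "Wget/1.20.3 (linux-gnu)",
    "python-requests/2.31.0",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "",
]

for family, hits in tally(SAMPLE_AGENTS).most_common():
    print(f"{family}: {hits}")
```

As noted in the discussion, these are hit counts and do not map exactly to people or users.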