This text is intended as a starting point for the discussion.
We will edit the text together during the session and then transfer the final version back to the IVOA wiki afterwards.

Science Platforms: Topics to discuss

participants: 80
Main area of discussion and open questions:
  • The role of IVOA
  • Interoperable AAI
  • Data access and data proximity
  • [Gregory D-F] I'm interested in how to make data access both transparent, so that a reference to data works across multiple SPs (e.g., if the same notebook is executed on each one), but is appropriately optimized for efficient access to locally available data (delivering on the code-to-data aspect of SPs).
  • Data staging and userspace
  • Software metadata description and software registry
  • Massive data analysis and ML
There will be some contributions:
  1. from Antonio Disanto
  1. from Dave Morris
(Christophe Arviset) For reference, at ESA, we are also developing a Science Platform that we call ESA Datalabs: http://tinyurl.com/y8mlqyrn
Use cases were collected through various internal workshops at ESAC with space mission teams and scientists; a summary can be found at http://tinyurl.com/y9hq7glw, mainly articulated around 4 pillars:
  1. science data exploitation,
  1. pipeline development environment,
  1. collaborative research environment,
  1. software preservation
(Mark Allen) For reference, there is a white paper about "A Science Platform Network to Facilitate Astrophysics in the 2020s" https://ui.adsabs.harvard.edu/abs/2019BAAS...51g.146D/abstract

Matthew Graham - The steep learning curve with all these science platforms means I will only run at one and not many.

Christine Banek: Agreed, I think standardization of the user-facing interface and tools is as interesting, if not more so.

PeterT: I agree. I find them (I haven't used many) all very cumbersome; being used to a very fine-tuned laptop, this is like working in the stone age.

Here are my pet peeves from my last painful experience, and this skips the cumbersome steps just to get a terminal:
- did not have my own user account (wait, I can't use my own zsh setup?)
- very cumbersome/impossible to cut and paste from my laptop, since I was forced to work in a browser
- frustrating to see a load of 0.0 and still run slow; cannot see the other users and their memory usage
- many "standard" commands not working (csh, time, man, ...), though it was "nice" to be able to pick from many instantiations of your virtual machine
- cannot use VNC to do local graphics? (this may be peculiar to SciServer)
This "user experience" really needs to be worked on.
Mine was a special introduction to SciServer, and I've also used AWS, which was slightly better because I could install my own packages that I needed.
Could you let us know what platform you were using?

Matthew G.: We use Google Colab a lot and it's hard to see how this interfaces with astronomy resources, particularly data centers. It's the old question of how do I process TBs/PBs at data center 1 against data center 2 when my computing resources are in Cloud 3.

Andy: Following what the user wants to do is the crucial thing. It is not clear whether users want platforms to be interoperable. The challenge for the CSP is to find what the IVOA can do. Find the user requirements.

Mike Fitzpatrick: For the NOAO Data Lab, one design idea was that it would be user data (e.g. sub-selected tables in a "MyDB" or uploaded/generated files in a VOSpace) that would be moved around, not the petabytes in the main data center. So, aside from the custom interfaces peculiar to our SP, there are entry points for pure VO services like TAP and VOSpace to provide access points for other platforms/apps to pull data out of our SP into another in a standard way. Code is much harder (at this point) but can likewise mix local and remote data through VO service calls. Mixed auth systems are a bigger issue (for me) at this point, and CDP is not universally supported yet.
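Those standard entry points can be exercised with nothing more than an HTTP request. As a minimal sketch (the service URL and table name below are hypothetical), a TAP synchronous query is just a parameterised GET against a service's /sync endpoint, which is what lets another platform or app pull tabular user data out of an SP in a standard way:

```python
from urllib.parse import urlencode

def tap_sync_url(service_base, adql, maxrec=None):
    """Build a TAP synchronous query URL (a standard VO entry point)."""
    params = {
        "REQUEST": "doQuery",   # TAP operation selector
        "LANG": "ADQL",         # query language
        "QUERY": adql,
        "FORMAT": "votable",
    }
    if maxrec is not None:
        params["MAXREC"] = str(maxrec)
    return service_base.rstrip("/") + "/sync?" + urlencode(params)

# Hypothetical MyDB table on a science platform's TAP service:
url = tap_sync_url("https://datalab.example.org/tap",
                   "SELECT ra, dec FROM mydb.my_catalog", maxrec=1000)
```

Because the request is plain HTTP with standard parameters, any client (pyVO, curl, another platform's service) can issue it without knowing anything about the platform behind it.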

Groom: Users would like to interact more closely with the data, so they want to run "close to the data". What does that mean from the perspective of the data service? What do those applications need that they can't get remotely, aside from not traversing the network? Is it just about eliminating the network, or is it also about a richer API to the data objects held at the server?

(thanks :-)

Kai: We forget that we are tech people but astronomers are not. How can we package everything so that it works out of the box, so that users can focus on using the software and doing science?

Gregory: Existing VO standards tend to be single-item requests, with limited support for bulk data.

Kai, just adding: massive data is mandatory for a lot of deep-learning applications, and this limit is currently often a show stopper.
History behind different institutes means platforms will be different.
What is possible is optimised data access.
Write code locally on a laptop, then transfer the code to a science platform to get better/faster access.
So the code API is similar/the same, but the graphical user interface may be different.
Would massive-data applications like ML be satisfied using only VO APIs to access the data, if those methods were very fast because they were "local"?
User environments are complex; we want a higher level than we have now, but not as complex as a full user environment.
Standardise the libraries and use them at the platform level (e.g. astropy).
+1
Allow users to share their workspace with others, which may allow sharing different competences. An environment to enhance collaboration.
Providing a platform to work together as a larger team of experts.
+1 SPs should be collaborative research platforms where users can share data and code (Christophe)

Intra-platform: inside platforms (how to access data/services consistently from platform to platform, but only using one platform at a time)

Extra-platform: protocols to use e.g. SciServer from outside (how to access platform services programmatically from outside the platform)

Inter-platform: where we want to use resources of multiple platforms within a single analysis thread

Gregory: We develop having in mind the intra- and inter-platform capabilities. We can learn from the HPC community in terms of the Grid. The strength of containers is that I can build one and run it anywhere. But when I need to access data it is a problem: a POSIX filesystem must be visible in the container.

Gregory: Levels of standardisation:

Containers - user defines the whole stack - sites can't modify what is inside the container, the container is a 'black box'.
API access - sites can swap the low level libraries that implement the API to give similar behaviour, optimised for the site
(Question: isn't it a sort of extension of the standard, the libraries and APIs we use to build the SP?)
Inter-platform capabilities
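The "API access" level above could look something like the sketch below: user code targets one platform-neutral interface, while each site plugs in an implementation optimised for how it actually holds the data. All class and method names here are invented for illustration, not an IVOA standard.

```python
import abc
import os

class DataAccess(abc.ABC):
    """Hypothetical platform-neutral data-access API that user code targets."""

    @abc.abstractmethod
    def read_bytes(self, dataset_id: str) -> bytes:
        """Return the raw bytes of a named dataset."""

class PosixDataAccess(DataAccess):
    """Site where datasets are plain files on a locally visible filesystem."""

    def __init__(self, root):
        self.root = root

    def read_bytes(self, dataset_id):
        with open(os.path.join(self.root, dataset_id), "rb") as f:
            return f.read()

class RemoteDataAccess(DataAccess):
    """Site where the same call is backed by a remote service (e.g. TAP/VOSpace)."""

    def __init__(self, fetch):
        self.fetch = fetch  # injected transport, kept abstract in this sketch

    def read_bytes(self, dataset_id):
        return self.fetch(dataset_id)
```

The point of the design is that a notebook written against `DataAccess` runs unchanged on either site; only the implementation behind it is swapped, which is exactly the freedom the "container as black box" level does not give the site.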

Petr: Move code between platforms. But the original idea of the VO is that we have an agent moving from one site to another, over different data. Move data instead of software. Not running the same processing on different data, but sending data from place to place according to the applications. SPs are tuned to specific algorithms, and data is moved to the SP because of those algorithms.

Missing interfaces and things to improve.
JJ - Use containers to move SW into a SP. Build the container at home and move into the SP to process.

Gerard: +1 Maybe a container can be built interactively on one SP, then exported to run on another. For notebooks on their own it is hard to provide an environment that can run them without containerizing them somehow.

Main issue I think is that I will likely run on "your" SP because you have interesting data. How can I write my analysis against this data if I cannot try it until it is running on your SP? Honestly I don't think I would want to write IVOA protocols to access data that is basically local. But maybe we can standardize on something like (I guess) LSST's Butler, and have different SPs implement it as efficiently as possible?
Giuliano +1 I agree (this was +1 vs JJ's comment not mine necessarily)
JJ - Experience is that virtual machines are too heavyweight for users to transfer from laptop to platform. Containers may help to make this easier.

Dave Morris - When we were looking at access levels, one of the things we were thinking was that the external TAP service would have row limits, while the internal TAP service would have the same data but with faster access and row limits.

Antonio: There is a balance between usability and trustability in terms of users: trust in the outcome of the platform (the software).

Severin: Inter-platform interoperability. The SKA use case is to make data centers interoperable. We have use cases from SKA that we can use.

SPs are stand-alone right now, not designed for interoperability.

Marcos Lopez-Caniego: It makes a lot of sense to connect science platforms, for example to analyze Euclid and LSST data by connecting ESA Datalabs and the LSST Science Platform, or SKA data that will be split across different data centers.


Brian Major: Standardizing the software delivery is important so the same software can be sent to different platforms that have different data offerings.

Simon O'Toole: +1 Agreed. This is where containers are useful. Users want prebuilt software that they can simply run on their data, wherever that data is. But I don't mind saying "sudo brew install cfitsio".

- Sure, but not all users are comfortable with this. Also, users often break their systems mixing brew and conda and other package managers.
Giuliano: Containers can be an approach if we find a way to annotate containers as well.
Simon: A thin layer that is container-platform agnostic? Something that describes the container metadata: what it is and what it does, plus provenance information.
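One possible shape for such a thin layer is a machine-readable sidecar document that travels with the image but is independent of the container runtime. The sketch below is purely illustrative: every field name, URL, and value is hypothetical, not an existing IVOA schema; it just shows the kind of what/how/provenance metadata being discussed.

```python
import json

# Hypothetical container-description sidecar: what the container is,
# what it does (inputs/outputs), and where it came from (provenance).
descriptor = {
    "name": "ivoa/source-extraction",
    "version": "1.2.0",
    "image": {                      # runtime-specific pointer kept separate,
        "format": "docker",         # so the layer itself stays platform-agnostic
        "digest": "sha256:...",     # elided
    },
    "description": "Runs source extraction over a FITS image and emits a VOTable.",
    "inputs": [{"name": "image", "type": "fits"}],
    "outputs": [{"name": "catalog", "type": "votable"}],
    "provenance": {
        "built_by": "https://example.org/ci/build/42",
        "source": "https://example.org/repo/source-extraction",
    },
}

sidecar = json.dumps(descriptor, indent=2)
```

A registry of science platforms could index these documents to answer "which containers consume a FITS image and produce a VOTable?" without ever pulling or running the images themselves.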

Dave Morris: In response to Steve - what if we added 'POSIX access' or 'filesystem access' as protocols that a service could advertise?

+1

Ani Thakar: Should the IVOA define a set of core science capabilities (use cases) that science platforms should support? This should also boil down to a core set of libraries that each SP should support. IVOA also needs to define exactly what interoperability means in the SP context. What should users or agents be able to do seamlessly between SPs? Should Docker images be able to run on any SP?

Jesús: Should we have a set of IVOA docker containers that all the science platforms could offer? (including e.g. pyVO)+1

Stelios Voutsinas: I think that would be a good idea Jesus, perhaps we could use https://hub.docker.com/u/ivoa to provide an IVOA repo with a set of VO-related Docker images that we would maintain. (Although there is the danger that we get tied to one container technology) (true)+1

PeterT: Didn't we used to have this? There used to be a 'Linux for Astronomy' CD that had all the stuff you could dream about. Then ESO had a distro that contained a lot of tools. Now I see official Linux distros carry some tools (e.g. ds9, pgplot, cfitsio, libwcs). Maybe we as a community need to put more effort into making that easy. That would make building containers easier too (and help those working on their own laptop).

I think this would be resurrecting that idea but using containers. If all the science platforms have this set of containers, you can run your code on different platforms blindly (e.g. selecting the one that has the data closest).

Mathieu: another aspect would be to re-run an analysis executed on a science platform in a standard way (send a command with a given workflow+configuration that would re-execute a sequence). Alternatively, get in a standard way the provenance graph of what was done to get a result.
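A toy sketch of Mathieu's idea: if each step of an analysis is recorded as data (step name plus configuration) rather than opaque code, the sequence can be serialised, shipped to another platform and re-executed, and the execution log doubles as a simple provenance record. The step names and the registry below are invented for illustration.

```python
import json

# Hypothetical registry mapping step names to implementations;
# a real platform would resolve these to containers or services.
REGISTRY = {
    "calibrate": lambda x, gain: x * gain,
    "threshold": lambda x, cut: max(x - cut, 0),
}

def run_workflow(value, steps):
    """Execute a declarative workflow and record a provenance trail."""
    provenance = []
    for name, params in steps:
        value = REGISTRY[name](value, **params)
        provenance.append({"step": name, "params": params, "result": value})
    return value, provenance

# The workflow itself is plain data, so it survives a JSON round trip
# (i.e. it can be sent to another platform and re-executed verbatim).
steps = [("calibrate", {"gain": 2.0}), ("threshold", {"cut": 1.0})]
result, prov = run_workflow(3.0, steps)
replayed, _ = run_workflow(3.0, json.loads(json.dumps(steps)))
```

Getting the provenance graph "in a standard way", as Mathieu suggests, would then just mean agreeing on the schema of the `prov` records rather than on any execution machinery.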

Stelios Voutsinas: I think another (different) discussion, is in terms of reproducibility of the Science Platforms themselves. Meaning how does a (power) user take a Science platform (as a set of services) and recreate the environments on a generic cloud. (Something LSST and others are doing very well with Helm Charts / Containers).

This would potentially allow users to take such an environment and scale it as much as their funding allows (i.e. paying for resources in a burst, short-term model), in cases where their allocation on the original Science Platform is too limited for their requirements.

Kenny Lo: Another aspect to consider is to increase VISIBILITY of what's generally available in the science platforms. That'll go beyond the VO protocols, into things like file systems, datasets, system capacity, etc, etc.


Topic revision: r1 - 2020-05-05 - GiulianoTaffoni