| | Speaker | Title | Abstract | |
|---|---|---|---|---|
| 15 | JJ Kavelaars (CADC) | From Archives to Astronomical Foundation Models: A CADC View of Archive-Native AI | COSMIC-FM is a CADC proposal to build the infrastructure needed for archive-native machine learning and to use that infrastructure to train and evaluate a foundation model for astronomical archives. The motivation comes from earlier CADC experience with ML-derived archive products, image-quality assessment, and recent work using large language models to construct ADQL queries. These projects show both the promise of ML for archive use and the risk of treating archive data as generic arrays stripped of scientific context. COSMIC-FM treats the structure of the astronomical archive as part of the measurement. Calibration state, uncertainties, masks, quality flags, provenance, coordinates, metadata, access controls, and version history should be retained as inputs to learning and evaluation, not removed during preprocessing. The project therefore focuses as much on reusable archive infrastructure as on model training. In the ML context this is representation learning. The COSMIC-FM project will build the infrastructure needed for ML researchers to take advantage of the rich landscape of astronomical data while building a foundation model that delivers value to the astronomy researcher. The proposed deliverables are: an archive-native ingest framework for ML-ready but scientifically faithful data packaging; a multimodal foundation model trained across CADC imaging, spectroscopic, and time-domain holdings; a versioned discovery index for similarity search, anomaly discovery, conditional retrieval, and counterexample search; and reliability-gated analysis interfaces that expose deterministic APIs and evidence bundles before natural-language or agentic workflows are treated as operational. The talk will present COSMIC-FM as a proposed research and infrastructure program. I will identify what is grounded in existing CADC archive practice, what remains uncertain, and where IVOA standards for data access, provenance, semantics, and interoperability may be critical to making AI for astronomical archives scientifically trustworthy. | CADC_AI_IVOA.pdf |
| 15 | Miguel Doctor Yuste (ESA) | Exploring Multi-Agent Systems for Archival Data Discovery | Artificial intelligence is advancing rapidly, and its application across many scientific domains, including astronomy, is already delivering impressive results. Large Language Models (LLMs), with their ability to understand and generate human language, are transforming how researchers interact with complex digital infrastructures and scientific resources. Multi-Agent Systems (MAS) extend these capabilities by enabling multiple AI agents to collaborate within a shared environment. This cooperative approach allows the handling of tasks that are too large, distributed, or complex for a single system. Modern large-scale scientific missions are generating unprecedented volumes of heterogeneous data products, often reaching petabyte scales, and spanning hundreds of distinct data formats and processing levels. These datasets are typically distributed through dedicated science archives, data portals, and cloud-based analysis platforms. Ensuring that researchers can efficiently navigate documentation, understand data models, identify relevant products, and access computational resources has therefore become a major challenge. In this work, we investigate how MAS can support users in discovering and working with scientific data products more effectively. We present several proof-of-concept implementations leveraging different MAS orchestration techniques to automate parts of the scientific workflow and simplify access to mission data ecosystems. Through natural-language interaction, these tools assist users in exploring complex data models, retrieving relevant documentation, generating tailored ADQL queries, and producing ready-to-use Jupyter notebooks for cloud-based scientific analysis platforms. These approaches demonstrate how coordinated AI agents can facilitate more efficient access to and exploitation of large scientific datasets. | |
| 15 | Roman Machacek (Uni Be) | ADQL Generation using LLMs | Generating database queries from natural language remains a challenging task, particularly in specialized scientific domains. In this work, we study natural language to Astronomical Data Query Language (ADQL) generation using large language models (LLMs). We curate a high-quality dataset of natural language–ADQL pairs through an LLM-assisted filtering and validation pipeline and use it to fine-tune models of varying sizes and capabilities. To enable systematic evaluation, we construct a benchmark spanning a range of query complexities, from simple retrieval tasks to complex joins and aggregations. Finally, we compare fine-tuned models against retrieval-augmented generation (RAG) approaches, analyzing their effectiveness in terms of query correctness and robustness. Our results provide insights into the relative strengths of fine-tuning and retrieval augmentation for domain-specific scientific query generation. | |
| 15 | Nick Susemiehl (IPAC) | Accelerating Literature Data Extraction with AI at IPAC | The NASA/IPAC Extragalactic Database (NED) and NASA Exoplanet Archive (NEA), operated by IPAC at the California Institute of Technology, have been serving the scientific community since 1990 and 2011, respectively. Throughout this time, extracting data from journal articles has remained a labor-intensive task, and upcoming large data releases will further increase the demand for timely, accurate data ingestion. We describe a new suite of tools under development to improve the efficiency of transforming information in journal articles into structured database load files by leveraging recent advances in natural language processing enabled by AI. Manual methods for identifying, extracting, and preparing data are being supplemented with an AI-assisted workflow designed to streamline the process and increase throughput. The workflow includes pre-processing articles to identify and label astrophysical object names using regular expressions defined in the NED and NEA name resolvers; a specialized program for handling structured data tables; and inference using a large language model with in-context learning, guided by examples of human-generated load files for similar articles identified via Retrieval-Augmented Generation. All AI-generated outputs undergo expert human review to identify errors and provide feedback that informs iterative improvements to the automated methods. We present preliminary quantitative results comparing the accuracy of AI-assisted load-file generation with files prepared by human experts, and we summarize remaining challenges and next steps for improving performance. The combination of AI/ML techniques with traditional programming methods, supported by human oversight, is establishing a path toward substantially accelerating archive data ingestion. | |
| 15 | Liza Fretel (Paris Obs) | Language Models and Natural Language Processing applications in Astronomy: two Case Studies | We discuss the advantages and limitations of language models and NLP for astronomy, and illustrate them through two case studies: the building of the observation facilities' IVOA vocabulary and the assignment of UAT keywords to papers. | |
| 15 | Michele Delli Veneri (SKAO) | The MADIV Development Study and AI-Assisted Development at SKA Observatory | The recently launched MADIV development study, co-funded by ESO and SKAO, aims to develop a deep learning pipeline for interferometric imaging, building directly on lessons learned from the ESO BRAIN development study. BRAIN demonstrated that deep learning can achieve orders-of-magnitude speed-ups compared with classical CLEAN-based deconvolution. However, it also exposed a fundamental bottleneck: training on simulations alone does not generalise reliably to real ALMA observations. Closing this gap requires training on real archival data at scale, which in turn raises significant infrastructure and software challenges for the community. ALMASim automates much of the data workflow by querying ALMA metadata through TAP and retrieving data products via DataLink. However, DataLink was not designed for the bulk transfer of thousands of files in the TB or PB regime. In addition, the raw visibility products delivered by ESO require non-trivial calibration and processing before they can be used for deep learning model training. This level of domain-specific processing expertise may lie outside the capabilities of many machine learning groups. More broadly, this highlights a gap that goes beyond any single project: to enable large-scale deep learning in radio astronomy, observatories need to provide fast, standardised access to petabyte-scale, machine-learning-ready processed datasets, rather than raw products alone. This talk is structured in two parts. The first introduces the MADIV project, its scientific goals, architecture, and the data infrastructure challenges described above. The second focuses on AI-assisted development practices within MADIV and, more broadly, at SKA Observatory, where we are integrating AI coding agents for code generation and review, user-interface development, and agentic pipelines for automated code curation. | link to presentation |

| Attachment | History | Size | Date | Who | Comment |
|---|---|---|---|---|---|
| AI_plenary_IVOA_06-2026_Liza_FRETEL.pdf | r1 | 3141.4 K | 2026-06-09 - 22:03 | FrancescaCivano | |
| Accelerating_Literature_Data_Extraction_with_AI_at_IPAC.pdf | r2 r1 | 2928.6 K | 2026-06-09 - 09:53 | FrancescaCivano | |
| CADC_AI_IVOA.pdf | r3 r2 r1 | 1570.1 K | 2026-06-09 - 16:46 | JjKavelaars | State_of_AI_in_CADC |
| RM_adql_generation_updated.pdf | r1 | 1161.2 K | 2026-06-09 - 22:02 | FrancescaCivano | |