AstroCLIP: A Cross-Modal Foundation Model for Galaxies

2310.03024

Published 6/17/2024 by Liam Parker, Francois Lanusse, Siavash Golkar, Leopoldo Sarra, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Geraud Krawezik, Michael McCabe, Ruben Ohana and 5 others

cs.AI cs.LG


Abstract

We present AstroCLIP, a single, versatile model that can embed both galaxy images and spectra into a shared, physically meaningful latent space. These embeddings can then be used - without any model fine-tuning - for a variety of downstream tasks including (1) accurate in-modality and cross-modality semantic similarity search, (2) photometric redshift estimation, (3) galaxy property estimation from both images and spectra, and (4) morphology classification. Our approach to implementing AstroCLIP consists of two parts. First, we embed galaxy images and spectra separately by pretraining separate transformer-based image and spectrum encoders in self-supervised settings. We then align the encoders using a contrastive loss. We apply our method to spectra from the Dark Energy Spectroscopic Instrument and images from its corresponding Legacy Imaging Survey. Overall, we find remarkable performance on all downstream tasks, even relative to supervised baselines. For example, for a task like photometric redshift prediction, we find similar performance to a specifically-trained ResNet18, and for additional tasks like physical property estimation (stellar mass, age, metallicity, and sSFR), we beat this supervised baseline by 19% in terms of $R^2$. We also compare our results to a state-of-the-art self-supervised single-modal model for galaxy images, and find that our approach outperforms this benchmark by roughly a factor of two on photometric redshift estimation and physical property prediction in terms of $R^2$, while remaining roughly in-line in terms of morphology classification. Ultimately, our approach represents the first cross-modal self-supervised model for galaxies, and the first self-supervised transformer-based architectures for galaxy images and spectra.


Overview

Presents a single model called AstroCLIP that can encode both galaxy images and spectra into a shared, physically meaningful latent space
Enables a variety of downstream tasks without any model fine-tuning, including semantic similarity search, photometric redshift estimation, galaxy property estimation, and morphology classification
Pretrains separate transformer-based image and spectrum encoders in self-supervised settings, then aligns the encoders using a contrastive loss
Applies the method to spectra from the Dark Energy Spectroscopic Instrument and images from its corresponding Legacy Imaging Survey
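Performs similarly to a supervised ResNet18 on photometric redshift prediction, beats it by 19% in $R^2$ on physical property estimation, and outperforms a state-of-the-art self-supervised single-modal image model by roughly a factor of two in $R^2$ on both tasks while remaining roughly in line on morphology classification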

Plain English Explanation

The researchers have developed a powerful AI model called AstroCLIP that can understand both images and spectral data of galaxies. This allows it to be used for a wide range of tasks, from finding similar galaxies to estimating a galaxy's distance from Earth and its physical properties, all without needing to be retrained for each new task.

The key to AstroCLIP is that it first learns to encode images and spectra separately in a self-supervised way, then aligns these two encoding paths into a shared, meaningful latent space. This means the model can grasp the connections between the visual appearance of a galaxy and its underlying physical characteristics revealed by its spectrum.

AstroCLIP performs remarkably well across tasks, even against models trained with labels. For example, it can estimate a galaxy's redshift (a proxy for its distance) about as accurately as a model trained specifically for that task. And for estimating properties like stellar mass, age, metallicity, and specific star formation rate, it outperforms that supervised baseline by 19% in terms of $R^2$.

The researchers' approach of jointly modeling galaxy images and spectra in a self-supervised way represents a breakthrough in combining different data modalities for astronomy. This paves the way for more powerful and flexible AI tools to unlock insights from the vast amount of multi-modal data being collected about the universe.

Technical Explanation

The core of AstroCLIP is its ability to embed both galaxy images and spectra into a shared, physically meaningful latent space. This is achieved in two steps:

Separate Pretraining: The researchers first pretrain separate transformer-based image and spectrum encoders in self-supervised settings. This allows the encoders to learn useful representations from the data without any labels.
Encoder Alignment: The researchers then align the image and spectrum encoders using a contrastive loss function. This encourages the model to map related image and spectrum samples to nearby points in the latent space, capturing the underlying physical connections.
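The alignment step can be illustrated with a small sketch. The summary only says "a contrastive loss", so the symmetric, CLIP-style cross-entropy form below, the `temperature` value, and all function names are assumptions made for illustration, not the authors' implementation. Random arrays stand in for the outputs of the pretrained encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, spectrum_emb, temperature=0.1):
    """CLIP-style loss (assumed form). Row i of both arrays is the same galaxy."""
    img = l2_normalize(image_emb)
    spec = l2_normalize(spectrum_emb)
    logits = img @ spec.T / temperature          # (N, N) pairwise similarities
    idx = np.arange(len(logits))
    # Image -> spectrum: each image should pick out its own spectrum.
    loss_i2s = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Spectrum -> image: each spectrum should pick out its own image.
    loss_s2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2s + loss_s2i)

# Toy stand-ins for encoder outputs (hypothetical data, 8 galaxies, 16 dims).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
aligned = symmetric_contrastive_loss(image_emb, image_emb + 0.01 * rng.normal(size=(8, 16)))
shuffled = symmetric_contrastive_loss(image_emb, rng.normal(size=(8, 16)))
print(f"aligned pairs: {aligned:.3f}, unrelated pairs: {shuffled:.3f}")
```

The loss is low when each image embedding sits closest to the spectrum embedding of the same galaxy and high when the pairs are unrelated, which is the pressure that pulls the two encoders into a shared space.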

The researchers apply AstroCLIP to data from the Dark Energy Spectroscopic Instrument and its corresponding Legacy Imaging Survey. They find that the model achieves remarkable performance on a variety of downstream tasks, often outperforming supervised baselines.
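One of the downstream tasks, cross-modality similarity search, reduces to a nearest-neighbor lookup once both modalities live in the same space. The sketch below assumes cosine similarity as the metric and uses random vectors in place of real embeddings; `cosine_top_k` and the toy data are hypothetical names introduced here, not part of the paper's code.

```python
import numpy as np

def cosine_top_k(query, gallery, k=3):
    """Return indices and scores of the k gallery rows most similar to `query`."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                       # cosine similarity to every gallery row
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]

# Toy data: 100 "spectrum" embeddings and one "image" embedding that lies
# close to spectrum 42, mimicking an aligned image/spectrum pair.
rng = np.random.default_rng(1)
spectrum_gallery = rng.normal(size=(100, 32))
image_query = spectrum_gallery[42] + 0.05 * rng.normal(size=32)

top_idx, top_scores = cosine_top_k(image_query, spectrum_gallery, k=3)
print(top_idx, np.round(top_scores, 3))
```

Querying with an image embedding against a gallery of spectrum embeddings (or the reverse) needs no retraining, which is what makes the search cross-modal.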

For example, on photometric redshift estimation (predicting a galaxy's redshift, a proxy for its distance, from its image alone), AstroCLIP performs similarly to a ResNet18 trained specifically for that task. And for physical property estimation (stellar mass, age, metallicity, and sSFR), it beats this supervised baseline by 19% in terms of $R^2$.
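To make the "no fine-tuning" evaluation and the $R^2$ metric concrete, here is a minimal sketch that predicts a property from frozen embeddings. The choice of a k-nearest-neighbor regressor, the value of `k`, and the synthetic data are assumptions for illustration; the summary does not specify which predictor the authors place on top of the embeddings.

```python
import numpy as np

def knn_predict(train_emb, train_y, test_emb, k=5):
    # Squared Euclidean distance between every test and train embedding.
    d2 = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest galaxies
    return train_y[nearest].mean(axis=1)      # average their known property

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - residual variance / total variance.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy stand-in: frozen 4-dimensional embeddings and a property (say, a
# redshift-like quantity) that depends smoothly on them. Hypothetical data.
rng = np.random.default_rng(2)
emb = rng.normal(size=(600, 4))
y = emb[:, 0] + 0.5 * emb[:, 1] + 0.05 * rng.normal(size=600)

pred = knn_predict(emb[:500], y[:500], emb[500:], k=5)
score = r2_score(y[500:], pred)
print(f"R^2 on held-out galaxies: {score:.3f}")
```

An $R^2$ of 1 means perfect prediction and 0 means no better than predicting the mean, so a regressor this simple only scores well if the embedding already organizes galaxies by the property of interest.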

Compared to a state-of-the-art self-supervised single-modal model for galaxy images, AstroCLIP outperforms by roughly a factor of two on photometric redshift and physical property prediction, while remaining competitive on morphology classification.

Critical Analysis

The researchers acknowledge several limitations and avenues for future work:

The current version of AstroCLIP only supports 1D spectral data, while many modern instruments also collect 2D spectroscopic data. Extending the model to handle this more complex data type could further improve its capabilities.
The study is limited to a single dataset (DESI/Legacy Imaging Survey). Evaluating AstroCLIP on a wider range of astronomical datasets would help validate its broader applicability.
While the self-supervised pretraining approach is powerful, the researchers note that the subsequent contrastive alignment step is computationally intensive. Exploring more efficient ways to align the image and spectrum encoders could make the training process more scalable.

Overall, AstroCLIP represents a significant step forward in learning joint representations from multimodal astronomical data. As with any research, there are opportunities to build on this work and address its current limitations.

Conclusion

The AstroCLIP model presented in this paper is a versatile and high-performing solution for jointly encoding galaxy images and spectra into a shared, physically meaningful latent space. By combining self-supervised pretraining with contrastive alignment, the researchers have developed the first cross-modal self-supervised model for galaxies. It matches or beats supervised baselines and outperforms a state-of-the-art single-modal self-supervised model on photometric redshift and physical property prediction.

This breakthrough paves the way for more powerful and flexible AI tools to unlock insights from the increasingly rich and multimodal astronomical datasets being collected. As the field of machine learning for astronomy continues to advance, AstroCLIP and similar cross-modal models will likely play a crucial role in accelerating scientific discovery about the universe.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Galaxy spectroscopy without spectra: Galaxy properties from photometric images with conditional diffusion models

Lars Doorenbos, Eva Sextl, Kevin Heng, Stefano Cavuoti, Massimo Brescia, Olena Torbaniuk, Giuseppe Longo, Raphael Sznitman, Pablo Márquez-Neila

Modern spectroscopic surveys can only target a small fraction of the vast amount of photometrically cataloged sources in wide-field surveys. Here, we report the development of a generative AI method capable of predicting optical galaxy spectra from photometric broad-band images alone. This method draws from the latest advances in diffusion models in combination with contrastive networks. We pass multi-band galaxy images into the architecture to obtain optical spectra. From these, robust values for galaxy properties can be derived with any methods in the spectroscopic toolbox, such as standard population synthesis techniques and Lick indices. When trained and tested on 64x64-pixel images from the Sloan Digital Sky Survey, the global bimodality of star-forming and quiescent galaxies in photometric space is recovered, as well as a mass-metallicity relation of star-forming galaxies. The comparison between the observed and the artificially created spectra shows good agreement in overall metallicity, age, Dn4000, stellar velocity dispersion, and E(B-V) values. Photometric redshift estimates of our generative algorithm can compete with other current, specialized deep-learning techniques. Moreover, this work is the first attempt in the literature to infer velocity dispersion from photometric images. Additionally, we can predict the presence of an active galactic nucleus up to an accuracy of 82%. With our method, scientifically interesting galaxy properties, normally requiring spectroscopic inputs, can be obtained in future data sets from large-scale photometric surveys alone. The spectra prediction via AI can further assist in creating realistic mock catalogs.

6/27/2024

cs.AI

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

Sedigheh Eslami, Gerard de Melo

Contrastive Language-Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.

6/27/2024

cs.CV cs.AI cs.CL cs.LG

Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels

Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Large-scale 2D vision language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework: Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework to simultaneously process 2D views and 3D point clouds and generate joint pseudo labels to train a classifier and guide cross-model feature alignment. Thereby we demonstrate that 2D vision language models such as CLIP can be used to complement 3D representation learning to improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange resulting in both image and point cloud modalities learning from each other's rich representations.

4/17/2024

cs.CV

Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng Yu, Wanyu Chen, Stan Z. Li

Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matched pairs but traditional methods like pseudo-labeling are difficult to apply in domains with no label information. To address these problems, we transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP. Specifically, we design a novel semantic density distribution loss to explore implicit semantic alignment information from unpaired multimodal data by constraining the latent representation distribution with fine granularity, thus eliminating the need for numerous strictly matched pairs. Meanwhile, we introduce multi-kernel maximum mean discrepancy as well as self-supervised contrastive loss to pull separate modality distributions closer and enhance the stability of the representation distribution. In addition, the contrastive loss used in CLIP is employed on the supervised matched data to prevent negative optimization. Extensive experiments conducted on a range of tasks in various fields, including protein, remote sensing, and the general vision-language field, demonstrate the effectiveness of our proposed Gentle-CLIP.

6/11/2024

cs.LG cs.AI cs.CL cs.CV