Thinking aloud about the shape of scientific data

Some ideas on why some domains will benefit more from general-purpose models than others

Published January 9, 2025

Introduction

In many scientific fields, we are witnessing the emergence of “foundation models” - a term that, while widely used, often lacks precise definition. For our purposes, we consider foundation models to be those that can be readily adapted to diverse tasks within a domain, serving as a foundation for modeling various phenomena.

In chemistry, we observe two parallel trends. On one hand, there is growing enthusiasm for general-purpose large language models (LLMs), with some arguing that “The future of chemistry is language” (White 2023) - a perspective I largely share. On the other, we see the development of specialized foundation models, such as MACE-MP (Batatia et al. 2024) for molecular simulations and AlphaFold (Abramson et al. 2024) for protein structure prediction.

(It is also interesting to note that some equivariance features were dropped in AlphaFold 3 in favor of scale - one might read this as the Bitter Lesson striking again.)

This duality raises a crucial question: “When should we invest in specialized architectures that incorporate domain knowledge, and when might general-purpose approaches be more effective?” The question becomes particularly relevant as we observe both specialized models achieving remarkable success and general-purpose LLMs demonstrating unexpected capabilities across scientific domains.

In my research group, we’ve focused on applying general-purpose LLMs to chemistry - an approach that might seem counterintuitive. Here, I attempt a systematic (though admittedly preliminary) analysis of when different modeling approaches might be most appropriate by examining the fundamental structure of scientific data spaces.

Large parts of this discussion are inspired by the excellent Biology 2.0 post by Michael Bronstein and Luca Naef.

The Shape of Scientific Data

To understand why different modeling approaches succeed or fail, we need to examine the inherent structure of their data spaces. We’ll focus on four fundamental types of scientific data that represent distinct points along the spectrum of structure and complexity: molecular properties (governed by physical laws), chemical experiments (complex real-world scenarios), biological sequences (shaped by evolution), and code (human-created structure).

| Aspect | Molecular Properties | Chemical Experiments | Biological Sequences | Code |
|---|---|---|---|---|
| State Space | \(\psi \in L^2(\mathbb{R}^{3N})\) | \(\mathcal{R}(t)=\{(c_i, n_i, p_i)\}\) | \(\{0,1,\ldots,k\}^n\) | Discrete tree \(\mathcal{T}\) |
| Governing Distribution | \(P(\psi) \propto e^{-\beta E[\psi]}\) | Complex, multi-modal | \(\log P(s) \propto f(s)\) | \(P(\text{code}) = P(\text{syntax}) \cdot P(\text{semantics} \mid \text{syntax})\) |
| Structure-to-Noise Ratio | High | Low | Medium | Very high |
| Reproducibility | Very high | Low | High | Perfect |
| Hidden Variables | Few | Many | Few | None |
| Validation | Physical laws | Empirical | Functional tests | Compiler |
| Causality | Quantum mechanics | Partially hidden | Evolutionary forces | Explicit |
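
To make the “Governing Distribution” row concrete: for molecular systems, the Boltzmann form \(P(\psi) \propto e^{-\beta E[\psi]}\) says that low-energy states dominate. A minimal sketch over a handful of made-up conformer energies (the numbers are purely illustrative):

```python
import numpy as np

# Toy energies (in units of kT) for a handful of hypothetical conformers.
energies = np.array([0.0, 0.5, 1.0, 3.0])
beta = 1.0  # inverse temperature

# Boltzmann weighting: P(state) is proportional to exp(-beta * E)
weights = np.exp(-beta * energies)
probabilities = weights / weights.sum()
print(probabilities)  # low-energy conformers dominate the distribution
```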

Let’s examine each domain in detail to understand why they might require different modeling approaches.

Molecular Properties

The quantum mechanical description of molecular properties provides perhaps the cleanest example of a structured scientific data space. Here, the state space is described by wavefunctions \(\psi \in L^2(\mathbb{R}^{3N})\), representing the quantum state of N particles in three-dimensional space. Several key characteristics make this domain particularly amenable to specialized models:

  • High Structure-to-Noise Ratio: The underlying physics is well-understood and deterministic (up to quantum mechanical uncertainties)
  • Clear Symmetries: Physical laws impose translational and rotational invariance, providing strong inductive biases for model design (see the sketch after this list)
  • Few Hidden Variables: All molecular properties can, in principle, be determined from the wavefunction, requiring only atomic positions and types as input
  • Perfect Reproducibility: While numerical implementations introduce some noise, quantum mechanical expectation values are fully determined by their corresponding operators
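
To make the symmetry point concrete, here is a minimal sketch (plain NumPy, toy coordinates) showing that a descriptor built from pairwise distances is invariant under rotations and translations by construction - exactly the kind of inductive bias that equivariant architectures bake in:

```python
import numpy as np

def pairwise_distance_features(positions: np.ndarray) -> np.ndarray:
    """Sorted pairwise distances: invariant to rotation and translation."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(positions), k=1)
    return np.sort(dists[upper])

def random_orthogonal(rng: np.random.Generator) -> np.ndarray:
    """A random 3x3 orthogonal matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
atoms = rng.normal(size=(5, 3))   # toy "molecule": 5 atoms in 3D
R = random_orthogonal(rng)        # rigid rotation (possibly a reflection)
t = rng.normal(size=3)            # arbitrary translation

features_before = pairwise_distance_features(atoms)
features_after = pairwise_distance_features(atoms @ R.T + t)
assert np.allclose(features_before, features_after)  # identical descriptors
```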

Chemical Experiments

Chemical experiments present a striking contrast. Despite being fundamentally governed by quantum mechanics, “real world” experimental chemistry introduces numerous complexities (a toy simulation after this list illustrates the effect of a single hidden variable):

  • Complex State Space: While we can represent basic parameters as \(\mathcal{R}(t)=\{(c_i, n_i, p_i)\}\) (concentrations, stoichiometry, phase information), many crucial variables remain hidden
  • Low Structure-to-Noise Ratio: Hidden features and their interactions lead to high variability in outcomes
  • Hidden Variables: Critical factors often go unrecorded or unrecognized (impurities, atmospheric conditions, surface effects) and might only be implicitly captured in experimental protocols
  • Limited Reproducibility: Even carefully controlled experiments may yield different results due to uncontrolled variables
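
The effect of hidden variables can be simulated directly. In this toy model (all coefficients invented for illustration), yield depends strongly on an impurity level that is never logged, so a fit on the recorded parameters alone is left with a large unexplained residual:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

temperature = rng.uniform(280, 360, n)    # recorded in the protocol
concentration = rng.uniform(0.1, 1.0, n)  # recorded in the protocol
impurity = rng.uniform(0.0, 1.0, n)       # hidden: never logged

# Toy ground truth: yield depends strongly on the hidden impurity level.
yield_ = 0.5 * concentration + 0.002 * (temperature - 280) - 0.6 * impurity

# Best linear fit using only the *observed* variables.
X = np.column_stack([temperature, concentration, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, yield_, rcond=None)
residual = yield_ - X @ coef

structured = np.var(yield_) - np.var(residual)
print(f"structure-to-noise from observed variables: {structured / np.var(residual):.2f}")
```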

Biological Sequences

Biological sequences present a unique case where their distribution in sequence space (\(\{0,1,\ldots,k\}^n\)) is shaped by evolution, creating a direct link between sequence distribution and fitness (Sella and Hirsh 2005).

Notably, such a driving force does not exist in chemistry, where the space of synthetic molecules seems mostly shaped by human imagination.

Several characteristics set this domain apart (a small sketch of the fitness-weighted distribution follows the list):
  • Medium Structure-to-Noise Ratio: Evolution provides underlying structure, while neutral mutations introduce noise
  • Clear Alphabet: Fixed set of building blocks (amino acids, nucleotides) constrains the possible space
  • Evolutionary Causality: Natural selection provides a clear driving force for sequence distributions
  • High Reproducibility: Modern sequence determination is highly reliable
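
Here is a small sketch of the fitness-weighted distribution \(\log P(s) \propto f(s)\) over a toy sequence space. The motif-matching fitness function and the scale factor are assumptions made purely for illustration; in Sella and Hirsh (2005), the scale is related to effective population size:

```python
import itertools
import numpy as np

ALPHABET = "ACGT"
LENGTH = 4  # toy sequence length; real sequence spaces are astronomically larger

def fitness(seq: str) -> int:
    """Toy fitness: number of positions matching a fixed 'optimal' motif."""
    return sum(a == b for a, b in zip(seq, "ACCA"))

sequences = ["".join(chars) for chars in itertools.product(ALPHABET, repeat=LENGTH)]
f = np.array([fitness(s) for s in sequences], dtype=float)

scale = 2.0                    # plays the role of effective population size
log_p = scale * f              # log P(s) proportional to f(s)
p = np.exp(log_p - log_p.max())
p /= p.sum()

for i in np.argsort(p)[::-1][:3]:
    print(sequences[i], f"P = {p[i]:.3f}")  # high-fitness sequences dominate
```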

Code

Programming languages represent a fascinating case of highly structured but human-created information (the short example after this list makes the tree structure concrete):

  • Discrete, Tree-like Structure: Abstract syntax trees provide clear organization
  • Perfect Reproducibility: Same input consistently produces the same output
  • Explicit Causality: Control flow and data dependencies are explicit
  • Human-Created Rules: Unlike physical laws, programming language rules are human-designed and well-documented
  • Rich Training Data: Vast amounts of self-documenting code examples and error messages are available
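
Python’s standard `ast` module makes this structure tangible: source code parses into a documented, discrete tree whose nodes expose control flow and name references explicitly.

```python
import ast

source = """
def greet(name):
    return "Hello, " + name
"""

tree = ast.parse(source)         # syntax: a discrete, well-defined tree
print(ast.dump(tree, indent=2))  # every node type and field is documented

# The tree makes dependencies explicit: list each function definition
# and the names it references.
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        names = [n.id for n in ast.walk(node) if isinstance(n, ast.Name)]
        print(node.name, "references:", names)
```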


Implications for Model Choice

Our analysis of data spaces reveals a nuanced framework for choosing modeling approaches, one that goes beyond simple metrics to consider the fundamental nature of structure in each domain.

The Structure-to-Noise Ratio and Types of Structure

We can (somewhat handwavily) formalize the structure-to-noise ratio as:

\[ R = \frac{\text{structured\_information}}{\text{unstructured\_variation}} \]
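
One deliberately naive way to operationalize this (an assumption on my part, not a standard definition): treat the variance a model explains as the structured part and the residual variance as the unstructured part.

```python
import numpy as np

def structure_to_noise(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Crude estimate of R: variance explained by a model over residual variance.

    The estimate is bounded by the quality of the model producing y_pred;
    a weak model understates the structure that is really there.
    """
    residual_var = np.var(y_true - y_pred)
    structured_var = max(np.var(y_true) - residual_var, 0.0)
    return structured_var / residual_var
```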

However, this ratio alone is insufficient. We must distinguish between fundamentally different types of structure:

  1. Physical/Mathematical Structure (Molecular Properties):
    • Governed by immutable natural laws
    • Benefits from explicit architectural enforcement
    • Can be handled data-efficiently by specialized architectures (e.g., equivariant neural networks)
  2. Human-Created Structure (Code):
    • Well-documented in training data
    • Can be learned statistically
    • Amenable to general-purpose models like LLMs
  3. Mixed or Emergent Structure (Biological Sequences):
    • Combines physical constraints with evolutionary patterns
    • Benefits from hybrid approaches

This refined view explains several observed patterns in scientific machine learning:

  1. Domains with Physical Structure (Molecular Properties):
    • Specialized architectures effectively leverage conservation laws and symmetries
    • Investment in domain-specific inductive biases pays off
    • Example: Equivariant neural networks for molecular properties
  2. Domains with Human-Created Structure (Code):
    • General-purpose models can learn patterns effectively
    • Benefit from large amounts of self-documenting training data
    • Example: LLMs for code generation
  3. Low-Structure Domains (Chemical Experiments):
    • General-purpose models may be more effective, as we often do not even know which inductive biases to design for, and many factors are hidden or implicit
    • Pattern recognition and statistical approaches shine
    • Example: LLMs leveraging implicit knowledge from literature
  4. Mixed-Structure Domains (Biological Sequences):
    • Hybrid approaches combining structure and statistics work well
    • Balance between specialized architectures and statistical power
    • Example: AlphaFold’s combination of structural constraints with evolutionary information

The Role of Hidden Variables

The presence and nature of hidden variables significantly impacts model choice:

  • Few Hidden Variables: Enables direct modeling with specialized architectures
  • Many Unknown Hidden Variables: Benefits from models that can learn representations from data

Conclusions

This analysis, while admittedly preliminary, provides a framework for understanding when to apply specialized versus general-purpose models in scientific domains. The choice appears guided by three key factors:

  1. The type of structure present (physical, human-created, or mixed)
  2. The structure-to-noise ratio
  3. The presence and nature of hidden variables

In domains with physical structure and few hidden variables, specialized architectures can effectively leverage domain knowledge. However, in domains with human-created structure or many hidden variables, general-purpose models may be more appropriate. This explains why our group remains optimistic about applying LLMs to chemistry - the complexity and hidden variables in chemical experiments might make them particularly suitable for statistical pattern recognition through large language models.

References

Abramson, Josh, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, et al. 2024. “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” Nature 630 (8016): 493–500. https://doi.org/10.1038/s41586-024-07487-w.
Batatia, Ilyes, Philipp Benner, Yuan Chiang, Alin M. Elena, Dávid P. Kovács, Janosh Riebesell, Xavier R. Advincula, et al. 2024. “A Foundation Model for Atomistic Materials Chemistry.” https://arxiv.org/abs/2401.00096.
Sella, Guy, and Aaron E. Hirsh. 2005. “The Application of Statistical Physics to Evolutionary Biology.” Proceedings of the National Academy of Sciences 102 (27): 9541–46. https://doi.org/10.1073/pnas.0501865102.
White, Andrew D. 2023. “The Future of Chemistry Is Language.” Nature Reviews Chemistry 7 (7): 457–58. https://doi.org/10.1038/s41570-023-00502-0.