# Workshop on ML for materials science

Some notes to accompany the ML workshop.

## Motivation

To design new materials, we need to know their properties. There are two main routes to get the properties of a material:

1. Perform an experiment to measure them
2. Perform a simulation to “measure” them in silico

In many cases, performing an experiment is time-consuming and hence expensive. High-fidelity simulations can also be very costly.

Fidelity expresses how exactly a surrogate represents the truth. In the context of ML you might also see the term multi-fidelity, which means that the approach combines multiple approximations with different levels of fidelity, e.g. density-functional theory and coupled cluster theory.

Therefore, there is a need for methods that can help us to predict the properties of materials with high fidelity and low cost. In this lecture, we will see that supervised machine learning (ML) is a powerful tool to achieve this goal.

Interestingly, this tool can be used in many different ways.

### Where does ML fit in the design process?

Machine learning can be used in multiple ways to make high-fidelity predictions of materials less expensive.

Note that reducing the cost has been a challenge for chemists and materials scientists for a long time. Dirac famously said “The fundamental laws necessary for the mathematical treatment of a large part of physics and the whole of chemistry are thus completely known, and the difficulty lies only in the fact that application of these laws leads to equations that are too complex to be solved. […] approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation.”

1. Replace expensive evaluation of the potential energy surface $$U(\mathbf{X}, \{\mathbf{Z}\})$$: Quantum chemistry as a field is concerned with the prediction of the potential energy surface $$U(\mathbf{X}, \{\mathbf{Z}\})$$ of a system of atoms of types $$\mathbf{Z}$$ at positions $$\mathbf{X}$$. Quantum chemists have developed different approximations to this problem. However, since all of these approximations are functions that map positions of atoms (and atom types, and in some cases electron densities/coordinates) to energies, we can learn those functions with ML.

Note that once we have done that, we generally still need to perform simulations to extract the properties of interest (e.g. as ensemble averages).

There are many good review articles about this. For example, see this one by Unke et al. as well as the ones by Deringer et al. and Behler in the same issue of Chemical Reviews.

2. Directly predict the properties of interest: Instead of computing the properties of interest using molecular simulations, we can build models that learn the $$f(\mathrm{structure}) \to \mathrm{property}$$ mapping directly. The basis for this mapping might be experimental data or high-fidelity computational data.

There are also many review articles about this approach. I also wrote one, focusing on porous materials.

Note that in the context of using ML for molecular simulations, it can also be used to address sampling problems. We will not cover this in detail in this lecture. For a good introduction, see the seminal paper by Noe and a piece about it by Tuckerman.

## Supervised ML workflow

For the main part of this lecture, we will assume that we use models that consume so-called tabular data, i.e. data that is stored in a table (feature matrix $$\mathbf{X}$$ and target/label vector/matrix $$\mathbf{Y}$$), where each row corresponds to a material and each of the $$p$$ columns corresponds to a so-called feature. We will later see that this is not the only way to use ML for materials science, but it is the most common one. We will also explore in more detail how we obtain the features.

We will use some data $$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$$ to train a model $$f(\mathbf{x}) \to y$$ that can predict the target $$y$$ for a new structure described with the feature vector $$\mathbf{x}^*$$.
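
As a rough illustration of this setup, the following sketch builds a small synthetic table, holds out some rows, and fits a model $$f$$. Everything in it is an assumption made for illustration: the data are random numbers rather than materials, the "ground truth" weights are made up, and ordinary least squares stands in for whichever model one would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular data: N "materials" (rows), p features (columns)
N, p = 200, 3
X = rng.normal(size=(N, p))
true_w = np.array([1.5, -2.0, 0.5])          # made-up ground truth
y = X @ true_w + 0.1 * rng.normal(size=N)    # targets with a little noise

# Hold out rows that the model never sees during training
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# "Train" f by ordinary least squares (a column of ones models the intercept)
X_train_b = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(X_train_b, y_train, rcond=None)


def f(x):
    """Predict the target for one feature vector or a matrix of feature vectors."""
    x = np.atleast_2d(x)
    return x @ coef[:-1] + coef[-1]


# Predict the target for a new feature vector x*
x_star = np.array([0.2, -1.0, 0.7])
y_star = f(x_star)

# Mean absolute error on the held-out rows
test_mae = np.mean(np.abs(f(X_test) - y_test))
```

The held-out error `test_mae` is the quantity to look at when judging the model, since the training rows were used to determine `coef`.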

## Feeding structures into models

### Incorporating symmetries/invariances/equivariances

#### Learning a very simple force field

To understand what it takes to feed structures into ML models, let us try to build a very simple force field. To make things simple and fast, we will just attempt to predict the energies of different conformers of the same molecule.

We will create some data using RDKit and then use scikit-learn to train a model.

##### Generating data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pymatviz.parity import density_scatter_with_hist
from rdkit import Chem
from rdkit.Chem import AllChem, PyMol
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotx
plt.style.use(['science', 'nature', matplotx.styles.dufte])

def gen_conformers(mol, numConfs=10_000, maxAttempts=1000,
                   pruneRmsThresh=0.2, useExpTorsionAnglePrefs=True,
                   useBasicKnowledge=True, enforceChirality=True):
    """Use RDKit to generate conformers for a molecule."""
    ids = AllChem.EmbedMultipleConfs(
        mol, numConfs=numConfs, maxAttempts=maxAttempts,
        pruneRmsThresh=pruneRmsThresh,
        useExpTorsionAnglePrefs=useExpTorsionAnglePrefs,
        useBasicKnowledge=useBasicKnowledge,
        enforceChirality=enforceChirality, numThreads=0)
    return list(ids)

def calc_energy(mol, conformer_id, iterations=0):
    """Calculate the energy of a conformer using the Merck Molecular Force Field."""
    ff = AllChem.MMFFGetMoleculeForceField(mol, AllChem.MMFFGetMoleculeProperties(mol), confId=conformer_id)
    ff.Initialize()
    ff.CalcEnergy()
    results = {}
    if iterations > 0:
        # optionally relax the conformer before evaluating its energy
        results["converged"] = ff.Minimize(maxIts=iterations)
    results["energy_abs"] = ff.CalcEnergy()
    return results

# create a molecule

# generate conformers and visualize some of them using PyMol
conformer_ids = gen_conformers(mol)
v = PyMol.MolViewer()
v.DeleteAll()
for cid in conformer_ids[:50]:
    v.ShowMol(mol, confId=cid, name='Conf-%d' % cid, showOnly=False)
v.server.do('set grid_mode, on')
v.server.do('ray')
v.GetPNG()

For those conformers, we can now retrieve the positions and energies and save them in a pandas dataframe.

# make column names
coordinate_names = sum([[f'x_{n}',f'y_{n}', f'z_{n}'] for n in range(mol.GetNumAtoms())], [])

# make a dataframe
data = []
for conformer_id in conformer_ids:
    energy = calc_energy(mol, conformer_id)['energy_abs']
    positions = mol.GetConformer(conformer_id).GetPositions().flatten()
    position_dict = dict(zip(coordinate_names, positions))
    position_dict['energy'] = energy
    data.append(position_dict)
# shuffle the rows
data = pd.DataFrame(data).sample(len(data))
data
x_0 y_0 z_0 x_1 y_1 z_1 x_2 y_2 z_2 x_3 ... x_36 y_36 z_36 x_37 y_37 z_37 x_38 y_38 z_38 energy
1046 -2.212281 -1.474999 -2.390409 -1.562491 -0.355279 -1.552936 -2.574018 0.279176 -0.658268 -3.217086 ... 4.383762 -0.980267 0.056970 5.522230 -1.019915 1.414831 -0.100292 1.419222 1.300613 48.129356
1791 -2.725521 -1.570268 -0.008617 -1.420433 -0.745241 0.167732 -1.761553 0.659563 0.457654 -2.569109 ... 4.089157 0.909448 1.853696 2.290849 1.084068 1.508593 0.817653 -1.854129 -2.626004 70.779133
2690 2.106368 -1.556137 -0.275108 1.674815 -0.651654 0.795833 1.889560 0.782223 0.461278 3.278361 ... -2.118368 2.632123 -0.509996 -3.777892 3.014443 -1.202697 -1.944397 -2.422734 0.016038 63.048163
577 -1.702710 -2.151021 0.018615 -1.161150 -0.845488 0.474544 -2.158003 -0.017822 1.223527 -3.410603 ... 2.625792 2.504632 1.879925 4.344447 2.221012 1.528942 0.789643 0.258301 -2.784349 56.945381
2321 -2.410615 0.036347 -1.407882 -1.514375 0.715345 -0.483198 -2.071248 1.193091 0.824212 -2.629851 ... 4.279797 1.822461 1.492311 4.810844 0.133699 1.765070 0.220271 1.130136 -2.573808 69.120898
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1903 -1.874600 -1.729838 0.296233 -1.242327 -0.423624 0.662343 -2.360888 0.556635 0.935360 -3.274811 ... 5.123665 0.348017 0.274119 4.824946 1.267063 -1.287177 0.989804 -0.061443 -2.574835 52.258451
2963 -2.085652 -1.522719 1.316025 -1.858266 -1.020026 -0.052872 -3.130666 -0.933779 -0.872329 -4.203947 ... 6.026970 1.391672 -1.103528 5.360257 -0.211862 -0.469421 0.778024 -1.285147 -1.653611 56.778165
1399 2.734858 -1.857506 0.436812 1.674471 -0.985306 -0.285730 1.895531 0.401886 0.230331 3.346859 ... -4.408550 0.444544 1.670268 -4.406963 2.201135 1.983818 -1.392504 -3.008175 -0.388963 55.894580
1826 2.249926 -1.031273 0.945958 1.432377 0.040120 0.330105 2.067005 1.406148 0.266247 3.294075 ... -3.437225 2.283544 -1.291475 -4.960940 1.550765 -0.582463 -0.201556 -1.938533 -0.920068 63.887687
1926 -2.295382 -0.994476 0.655058 -1.329595 -0.011970 0.096757 -2.009801 0.916414 -0.871239 -3.080991 ... 3.449466 2.518873 -0.752006 4.822737 2.016511 0.328768 1.423537 -2.471398 -1.341771 60.188082

3208 rows × 118 columns

Given this data, we can build a model. We will use a gradient boosting regressor from scikit-learn. We will also split the data into a training and a test set. In later sections, we will see why this is important. But for now, let us just appreciate that a test set (conformers we did not train on) will give us a measure of how well our model will perform on new, unseen conformers.

positions = data[coordinate_names] # X
energies = data['energy'] # y

# split into training and test set
train_points = 3000
train_positions = positions[:train_points]
test_positions = positions[train_points:]
train_energies = energies[:train_points]
test_energies = energies[train_points:]

# train a model
from sklearn.ensemble import HistGradientBoostingRegressor

model = HistGradientBoostingRegressor()
model.fit(train_positions, train_energies)

Once we have trained a model, we can use it to predict the energies of new conformers. Let’s first see how well it does on the data it was trained on.

train_predictions = model.predict(train_positions)

density_scatter_with_hist(train_energies.values, train_predictions, xlabel='True energy', ylabel='Predicted energy')

This looks pretty good. But how well does it do on new conformers? Let’s see.

test_predictions = model.predict(test_positions)

density_scatter_with_hist(test_energies.values, test_predictions, xlabel='True energy', ylabel='Predicted energy')

From physics we know that, in the absence of an external field, the energy of a molecule does not depend on where in space it is. That is, if we translate a molecule along $$[1, 1, 1]$$, the energy should not change.

# translate the molecule along [1, 1, 1]
translated_positions = train_positions + 1
translated_predictions = model.predict(translated_positions)
density_scatter_with_hist(train_energies.values, translated_predictions)

This is not what we expect. Our model shows completely unphysical behavior: it predicts a different energy for the same conformers at different positions in space.

To fix this, and related problems, we need to use a more elaborate approach to building a model.

#### Making predictions invariant/equivariant to transformations

Invariance and equivariance are terms that have become very relevant in ML. It is always important to state with respect to which operation something is invariant or equivariant; if people do not mention this, they often refer to the symmetry operations of the Euclidean group, which comprises all translations, rotations, and reflections. Invariant means that the property of interest does not change under those operations. Equivariant means that it transforms in the same way as the input. The energy, for example, is invariant, whereas the forces are equivariant: if we rotate a molecule, the force vectors rotate with it.

##### Which symmetries would we like to respect?

Before we can talk about how to build a model that respects symmetries, we need to know what symmetries we would like to respect.

In the case of molecules, we would like to respect the following symmetries:

• translation: that is, if we move a molecule along a vector, the energy should not change (see above)
• rotation: that is, if we rotate a molecule, the energy should not change
• permutation of atoms: that is, the order in which we feed the atoms into the model should not matter
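
The translation and rotation symmetries in this list can be checked numerically. The sketch below uses a toy harmonic pair potential, which is an assumption chosen only because its forces can be written down by hand; it is not the MMFF force field used earlier. The helper names `energy_and_forces` and `random_rotation` are hypothetical. The check shows that the energy is invariant, while the forces rotate together with the molecule (equivariance).

```python
import numpy as np


def energy_and_forces(X, r0=1.0):
    """Toy harmonic pair potential; returns the energy and the force on every atom."""
    diff = X[:, None, :] - X[None, :, :]   # diff[i, j] = x_i - x_j
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(X), k=1)    # count every pair once
    energy = 0.5 * np.sum((dist[i, j] - r0) ** 2)
    # F_i = -dE/dx_i = -sum_j (r_ij - r0) * (x_i - x_j) / r_ij
    safe_dist = dist + np.eye(len(X))      # avoid 0/0 on the diagonal
    coeff = (dist - r0) / safe_dist
    np.fill_diagonal(coeff, 0.0)
    forces = -np.sum(coeff[:, :, None] * diff, axis=1)
    return energy, forces


def random_rotation(rng):
    """Random proper rotation matrix from the QR decomposition of a random matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:               # turn a reflection into a rotation
        Q[:, 0] *= -1
    return Q


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                # six "atoms" at random positions
R = random_rotation(rng)
t = np.array([1.0, 1.0, 1.0])

E, F = energy_and_forces(X)
E_new, F_new = energy_and_forces(X @ R.T + t)   # rotate, then translate

energy_is_invariant = np.isclose(E, E_new)
forces_are_equivariant = np.allclose(F_new, F @ R.T)
```

Both flags come out true here because the toy potential depends on the positions only through interatomic distances.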

For crystals, we additionally need to respect periodicity. That is, for intensive properties, there should be no difference between using a unit cell or a super cell of that unit cell as input for a model.

Broadly speaking, there are three different ways to build models that respect symmetries.

1. Data augmentation: This is the most straightforward approach. We generate new data points by applying the symmetry operations to the existing data points, for example by adding rotated copies of the existing conformers with unchanged energy labels. This is very simple to implement, but it can be very expensive because the training set grows with every rotation we add. This approach is often used in computer vision pipelines, in which you might want to detect a cat in an image independent of its orientation and therefore add rotated copies of the existing images.
2. Features that are invariant/equivariant: This approach is more sophisticated. We can build features that are invariant/equivariant to the symmetries we want to respect, for example features that are invariant to rotation. In the case of force fields, such features are bond lengths and angles. This approach is widely used in ML for chemistry and materials science.
3. Models that are invariant/equivariant: Alternatively, one can build special models that can consume point clouds as inputs and are equivariant to the symmetries we want to respect. We will not discuss this in detail, but you can find starting points in this perspective by Tess Smidt.
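
To make the second route more concrete, here is a minimal sketch of a hand-made invariant feature vector: the sorted list of all interatomic distances. This is an illustrative assumption, not a descriptor used in these notes; it ignores atom types, and practical descriptors are considerably richer. The function name `sorted_distance_features` is hypothetical.

```python
import numpy as np


def sorted_distance_features(X):
    """Sorted interatomic distances: unchanged by translation, rotation, and atom reordering."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(X), k=1)   # every pair once
    return np.sort(dist[i, j])


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))               # five "atoms" at random positions

# Build a random rotation, a translation along [1, 1, 1], and a random atom order
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1
t = np.array([1.0, 1.0, 1.0])
perm = rng.permutation(len(X))
X_transformed = (X @ Q.T + t)[perm]

features = sorted_distance_features(X)
features_transformed = sorted_distance_features(X_transformed)

# The raw coordinates change, the features do not
coordinates_changed = not np.allclose(X.flatten(), X_transformed.flatten())
features_unchanged = np.allclose(features, features_transformed)
```

A model trained on such features cannot show the unphysical translation dependence seen above, because translated inputs map to identical feature vectors.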

## Training a model

### How to know if a model is good?

Before we can proceed to building models, we need to establish a way to measure how good a model is.

Interestingly, this is not as trivial as it may sound. To realize this, it is useful to formally write down what we mean by a good model.

#### Empirical risk minimization

Let’s assume we have some input space $$\mathcal{X}$$ and some output space $$\mathcal{Y}$$. We can think of $$\mathcal{X}$$ as the space of all possible inputs and $$\mathcal{Y}$$ as the space of all possible outputs. For example, $$\mathcal{X}$$ could be the space of all possible molecules and $$\mathcal{Y}$$ could be the space of all possible energies. We want to learn a function $$f: \mathcal{X} \rightarrow \mathcal{Y}$$ that maps inputs to outputs. We can think of $$f$$ as a model that we want to train.

To build this model, we have samples of the joint distribution $$p(x, y)$$, where $$x$$ is an input and $$y$$ is the corresponding output. We can think of this as a set of data points $$\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$$.

If we now define a loss function $$L$$ we can compute the risk, which is the expected value of the loss function:

$R(f)={\mathbf {E}}[L(f(x),y)]=\int L(f(x),y)\,dP(x,y).$

Our goal is to find the model $$f^{*}$$ in a hypothesis class $$\mathcal{H}$$ that minimizes the risk:

${\displaystyle f^{*}={\underset {f\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,{R(f)}.}$

In practice we cannot compute this. The reason is that we do not have access to the joint distribution $$p(x, y)$$, but only to a finite set of samples $$\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$$.
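
The standard workaround, which gives this section its name, is to replace the expectation by an average over the available samples, the empirical risk $$\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i)$$, and to minimize that instead. The sketch below evaluates the empirical risk for three candidate models. The data-generating process, the squared loss, and the candidate set are assumptions chosen for illustration.

```python
import numpy as np


def squared_loss(prediction, target):
    return (prediction - target) ** 2


def empirical_risk(f, xs, ys, loss=squared_loss):
    """Average loss of model f over a finite sample."""
    return np.mean(loss(f(xs), ys))


# Made-up samples from a joint distribution p(x, y)
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=100)
ys = 2.0 * xs + 0.1 * rng.normal(size=100)

# A tiny hypothesis class: three lines through the origin
candidates = {
    "slope 1": lambda x: 1.0 * x,
    "slope 2": lambda x: 2.0 * x,
    "slope 3": lambda x: 3.0 * x,
}

risks = {name: empirical_risk(f, xs, ys) for name, f in candidates.items()}
best = min(risks, key=risks.get)   # empirical risk minimizer within this class
```

Here `best` is the candidate with the lowest average loss on the sample. Whether it also has low risk on new data is exactly the question that a held-out test set is meant to answer.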

### Linear regression

import jax
import jax.numpy as jnp
import numpy as np

def linear_regression(x, w, b):
    return jnp.dot(x, w) + b

def loss(w, b):
    # x (feature matrix) and y (targets) are taken from the enclosing scope
    prediction = linear_regression(x, w, b)
    return jnp.mean((prediction - y) ** 2)

def init_params(num_feat):
    return np.random.normal(size=(num_feat,)), 0.0

# gradient of the loss with respect to w and b
loss_grad = jax.grad(loss, argnums=(0, 1))
learning_rate = 1e-6
num_epochs = 1000
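
The snippet above stops before the actual training loop. A minimal sketch of how such a loop could look is given below. To keep it free of extra dependencies, it swaps the automatic differentiation of `jax.grad` for hand-derived NumPy gradients of the mean squared error. The synthetic data, the "true" parameters, and the larger learning rate of 0.1 are assumptions made so that the toy problem converges within the 1000 epochs.

```python
import numpy as np

# Made-up data: 100 samples, 3 features, noise-free linear targets
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
y = x @ true_w + true_b


def linear_regression(x, w, b):
    return x @ w + b


def loss(w, b):
    return np.mean((linear_regression(x, w, b) - y) ** 2)


def loss_grad(w, b):
    """Gradients of the mean squared error with respect to w and b."""
    residual = linear_regression(x, w, b) - y
    grad_w = 2 * x.T @ residual / len(y)
    grad_b = 2 * np.mean(residual)
    return grad_w, grad_b


w, b = rng.normal(size=3), 0.0
learning_rate = 0.1     # assumption: chosen for this toy data
num_epochs = 1000

initial_loss = loss(w, b)
for _ in range(num_epochs):
    grad_w, grad_b = loss_grad(w, b)
    w = w - learning_rate * grad_w   # step against the gradient
    b = b - learning_rate * grad_b
final_loss = loss(w, b)
```

With `jax`, the body of the loop would look the same, with `loss_grad` obtained from `jax.grad(loss, argnums=(0, 1))` instead of the hand-written function.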

## Feature selection

### Curse of dimensionality

For understanding the curse of dimensionality, it is useful to consider a very simple ML model, the $$k$$-nearest neighbors model. In this model, we have a set of training points $$\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$$, where $$x_i$$ is a vector of features and $$y_i$$ is the corresponding label. To make a prediction, we compute the distance between the input and all training points and return the mode of the labels of the $$k$$ closest training points.
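
The prediction rule just described fits in a few lines. The following is a sketch for classification with Euclidean distances; the function name `knn_predict` and the six toy training points are made up for illustration.

```python
import numpy as np


def knn_predict(x_query, X_train, y_train, k=3):
    """Return the most common label among the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # mode; ties go to the smallest label


# Two well-separated toy clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(np.array([0.05, 0.10]), X_train, y_train)
pred_b = knn_predict(np.array([1.00, 0.95]), X_train, y_train)
```

For a regression task, one would typically return the mean of the neighbors' targets instead of the mode. Either way, the whole prediction hinges on the distances being informative.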

Clearly, in this algorithm it is important to find the nearest neighbor. In general, this is important in many algorithms, for instance also in kernel-based learning.

Let’s now ask ourselves how much of the space we need to search to find the nearest neighbors.

For this, let’s start considering a unit cube $$[0,1]^d$$ and $$n$$ data points $$x_i$$ sampled uniformly from this cube.

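For uniformly distributed points, the smallest hypercube that contains $$k$$ out of the $$n$$ points must cover a fraction of about $$k/n$$ of the unit volume. Since its volume is $$l^d$$, it has approximately the following edge length $$l$$: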

$l^d = \frac{k}{n} \quad \Rightarrow \quad l = \left(\frac{k}{n}\right)^{1/d}$

If we plot this for different values of $$d$$ we get the following plot:

import matplotlib.pyplot as plt
import numpy as np

def length(d, k=5, n=10_000):
    return (k/n)**(1/d)

d = np.arange(1, 1000)

plt.plot(d, length(d))
plt.xlabel('number of dimensions')
plt.ylabel('length of hypercube that contains k neighbors')

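Clearly, for large $$d$$ the length approaches 1. This means that the cube containing the $$k$$ nearest neighbors spans almost the entire space: the "nearest" neighbors are no longer close by, all points are almost equally far apart, and comparing distances no longer makes much sense.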

We can also check this by performing a simulation: we generate random $$d$$-dimensional points and compute the distances between them. We can then plot the distribution of distances.

from scipy.spatial import distance_matrix
dimensions = [2, 5, 10, 100, 10_000]
num_points = 1000

fig, axes = plt.subplots(1, len(dimensions), sharey='all')

def get_distances(d, num_points):
    points = np.random.uniform(size=(num_points, d))
    distances = distance_matrix(points, points)
    # keep each pair once and drop the zero self-distances on the diagonal,
    # which would otherwise inflate the coefficient of variation
    return distances[np.triu_indices(num_points, k=1)]

for d, ax in zip(dimensions, axes):
    distances = get_distances(d, num_points)
    ax.hist(distances, bins=20)
    ax.set_title(f'd={d} \n cv={distances.std()/distances.mean():.2f}')

Clearly, for large $$d$$ the distances are almost the same (the histograms are much more peaked). We can also see this in terms of the coefficient of variation (cv), which is the standard deviation divided by the mean. For large $$d$$ the cv is very small, which means that the distances are very similar.