```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pymatviz.parity import density_scatter_with_hist
from rdkit import Chem
from rdkit.Chem import AllChem, PyMol
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotx
'science', 'nature', matplotx.styles.dufte])
plt.style.use([
def gen_conformers(mol, numConfs=10_000, maxAttempts=1000,
=0.2, useExpTorsionAnglePrefs=True,
pruneRmsThresh=True, enforceChirality=True):
useBasicKnowledge"""Use RDkit to generate conformers for a molecule."""
= AllChem.EmbedMultipleConfs(mol, numConfs=numConfs, maxAttempts=maxAttempts, pruneRmsThresh=pruneRmsThresh, useExpTorsionAnglePrefs=useExpTorsionAnglePrefs, useBasicKnowledge=useBasicKnowledge, enforceChirality=enforceChirality, numThreads=0)
ids return list(ids)
def calc_energy(mol, conformer_id, iterations=0):
"""Calculate the energy of a conformer using the Merck Molecular Force Field."""
= AllChem.MMFFGetMoleculeForceField(mol, AllChem.MMFFGetMoleculeProperties(mol), confId=conformer_id)
ff
ff.Initialize()
ff.CalcEnergy()= {}
results if iterations > 0:
"converged"] = ff.Minimize(maxIts=iterations)
results["energy_abs"] = ff.CalcEnergy()
results[return results
# create a molecule
= Chem.AddHs(Chem.MolFromSmiles('CC(CCC)CC(C)(CCCC)O'))
mol
# visualize some conformers using PyMol
= gen_conformers(mol)
conformer_ids = PyMol.MolViewer()
v
v.DeleteAll()for cid in conformer_ids[:50]:
=cid,name='Conf-%d'%cid,showOnly=False)
v.ShowMol(mol,confId'set grid_mode, on')
v.server.do('ray')
v.server.do( v.GetPNG()
```

# Workshop on ML for materials science

*Some notes to accompany the ML workshop.*

## Motivation

To design new materials, we need to know their properties. There are two main routes to get the properties of a material:

- Perform an experiment to measure them
- Perform a simulation to “measure” them in silico

In many cases, performing an experiment is time-consuming and, hence, expensive. Also high-fidelity simulations can be very costly.

Therefore, there is a need for methods that can help us to predict the properties of materials with high fidelity and low cost. In this lecture, we will see that *supervised machine learning* (ML) is a powerful tool to achieve this goal.

Interestingly, this tool can be used in many different ways.

### Where does ML fit in the design process?

Machine learning can be used in multiple ways to make high-fidelity predictions of materials less expensive.

*Replace expensive evaluation of the potential energy surface \(U(\mathbf{X}, \{\mathbf{Z}\})\)*: Quantum chemistry as a field is concerned with the prediction of the potential energy surface \(U(\mathbf{X}, \{\mathbf{Z}\})\) of a system of atoms of types \(\mathbf{Z}\) at positions \(\mathbf{X}\). Quantum chemists have developed different approximations to this problem. However, since they are all kinds of functions that map positions of atoms (and atom types, and in some cases electron densities/coordinates) to energies, we can learn those functions with ML.Note that once we have done that, we generally still need to perform simulations to extract the properties of interest (e.g. as ensemble averages).

There are many good review articles about this. For example, see this one by Unke et al. as well as the ones by Deringer et al. and Behler in the same issue of Chemical Reviews.

*Directly predict the properties of interest*Instead of computing the properties of interest using a molecular simulations, we can build models that learn the \(f(\mathrm{structure}) \to \mathrm{property}\) mapping directly. The basis for this mapping might be experimental data or high-fidelity computational data.Also about this approach, there are many review articles. I also wrote one, focussing on porous materials.

Note that in the context of using ML for molecular simulations, it can also be used to address sampling problems. We will not cover this in detail in this lecture. For a good introduction, see the seminal paper by Noe and a piece about it by Tuckerman.

## Supervised ML workflow

For the main part of this lecture, we will assume that we use models that consume so-called tabular data, i.e. data that is stored in a table (feature matrix \(\mathbf{X}\) and target/label vector/matrix \(\mathbf{Y}\)), where each row corresponds to a material and each of the \(p\) columns corresponds to a so-called feature. We wil later see that this is not the only way to use ML for materials science, but it is the most common one. We will also explore in more detail how we obtain the features.

We will use some data \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\) to train a model \(f(\mathbf{x}) \to y\) that can predict the target \(y\) for a new structure described with the feature vector \(\mathbf{x}^*\).

## Feeding structures into models

### Incorporating symmetries/invariances/equivariances

#### Learning a very simple force field

To understand what it takes to feed structures into ML models, let us try to build a very simple force field. To make things simple and fast, we will just attempt to predict the energies of different conformers of the same molecule.

We will create some data using RDkit and then use scikit-learn to train a model.

##### Generating data

For those conformers, we can now retrieve the positions and energies and save them in a pandas dataframe.

```
# make column names
= sum([[f'x_{n}',f'y_{n}', f'z_{n}'] for n in range(mol.GetNumAtoms())], [])
coordinate_names
# make a dataframe
= []
data for conformer_id in conformer_ids:
= calc_energy(mol, conformer_id)['energy_abs']
energy = mol.GetConformer(conformer_id).GetPositions().flatten()
positions = dict(zip(coordinate_names, positions))
position_dict 'energy'] = energy
position_dict[
data.append(position_dict)= pd.DataFrame(data).sample(len(data))
data data
```

x_0 | y_0 | z_0 | x_1 | y_1 | z_1 | x_2 | y_2 | z_2 | x_3 | ... | x_36 | y_36 | z_36 | x_37 | y_37 | z_37 | x_38 | y_38 | z_38 | energy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

2350 | 2.082377 | -2.384621 | -0.886825 | 1.878611 | -1.063902 | -0.149281 | 3.062111 | -0.226795 | -0.498702 | 3.164710 | ... | -4.783486 | 2.805272 | 0.453616 | -3.121490 | 3.344990 | 0.081068 | -1.268761 | -3.134410 | -0.056116 | 50.527359 |

1031 | 2.077103 | -1.521307 | -1.978108 | 1.380593 | -0.800183 | -0.726914 | 2.472656 | -0.385603 | 0.144759 | 2.231073 | ... | -4.911553 | -0.809955 | 1.899992 | -3.547584 | 0.108867 | 2.562697 | 0.556878 | 2.723737 | -0.847688 | 61.708833 |

2043 | -1.562612 | -2.510985 | -1.091513 | -1.538347 | -1.446391 | 0.001791 | -2.874616 | -0.790099 | -0.074631 | -3.214859 | ... | 3.185514 | 2.119784 | -2.037756 | 3.116718 | 3.713313 | -1.258657 | 1.010212 | -1.434941 | -2.027757 | 54.934918 |

488 | -2.622666 | -0.984364 | -1.736397 | -1.552771 | -0.325951 | -0.876648 | -2.211194 | 0.177802 | 0.351123 | -3.280326 | ... | 4.928752 | 1.771672 | -1.560194 | 3.312213 | 1.208019 | -2.103907 | 0.947618 | -1.027059 | 2.062733 | 46.566148 |

3182 | -2.622062 | -1.702202 | -0.649370 | -1.699629 | -0.442711 | -0.448595 | -2.041035 | 0.022269 | 0.917023 | -3.557981 | ... | 4.663731 | -0.690552 | 1.900821 | 5.054191 | -1.298227 | 0.261869 | 1.695958 | -1.012533 | -2.171839 | 58.263424 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

1695 | 1.545078 | -1.676186 | -0.579824 | 1.733232 | -0.758289 | 0.569323 | 2.313415 | 0.545339 | 0.010601 | 3.660710 | ... | -4.968294 | -0.564927 | -1.690845 | -5.102694 | 1.251015 | -1.373149 | -1.049935 | 0.839526 | 2.870953 | 65.543487 |

895 | -2.078717 | 2.181547 | 1.377408 | -1.922400 | 0.699183 | 1.110325 | -2.860042 | 0.256573 | 0.041219 | -2.749035 | ... | 3.568744 | -2.309582 | 0.450033 | 3.535776 | -1.985371 | -1.363970 | 1.251191 | 2.215607 | 1.256163 | 53.297442 |

433 | 2.872947 | -1.275055 | 1.324504 | 1.803448 | -1.107335 | 0.204128 | 2.282703 | 0.009835 | -0.633029 | 2.427814 | ... | -4.479378 | 0.508470 | 1.416176 | -4.317413 | 2.044153 | 0.469898 | -1.669397 | -2.149311 | 1.292007 | 54.034508 |

1870 | 2.664259 | 2.034236 | 0.575131 | 1.761238 | 0.833087 | 0.858491 | 2.505428 | -0.422292 | 0.967442 | 3.344749 | ... | -5.091052 | -1.406824 | 0.155182 | -6.283711 | -0.587390 | -0.984725 | -1.040008 | 0.992988 | 1.719511 | 66.927399 |

1366 | -2.664309 | -0.816584 | -1.457318 | -1.533768 | -0.292689 | -0.538467 | -2.304004 | 0.230210 | 0.679246 | -3.268371 | ... | 3.204639 | 2.086104 | -1.194536 | 2.268599 | 2.114207 | 0.313189 | 0.522256 | 0.077753 | 2.072437 | 56.919313 |

3214 rows × 118 columns

Given this data, we can build a model. We will use a gradient boosting regressor from scikit-learn. We will also split the data into a training and a test set. In later sections, we will see why this is important. But for now, let us us just appreciate that a test set—conformers we did not train on—will give us a measure of how well our model will perform on new, unseen, conformers.

```
= data[coordinate_names] # X
positions = data['energy'] # y
energies
# split into training and test set
= 3000
train_points = positions[:train_points]
train_positions = positions[train_points:]
test_positions = energies[:train_points]
train_energies = energies[train_points:]
test_energies
# train a model
from sklearn.ensemble import HistGradientBoostingRegressor
= HistGradientBoostingRegressor()
model model.fit(train_positions, train_energies)
```

HistGradientBoostingRegressor()

**In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.**

On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

HistGradientBoostingRegressor()

Once we have trained a model, we can use it to predict the energies of new conformers. Let’s first see how well it does on the data it was trained on.

```
= model.predict(train_positions)
train_predictions
='True energy', ylabel='Predicted energy') density_scatter_with_hist(train_energies.values, train_predictions, xlabel
```

`<Axes: xlabel='True energy', ylabel='Predicted energy'>`

This looks pretty good. But how well does it do on new conformers? Let’s see.

```
= model.predict(test_positions)
test_predictions
='True energy', ylabel='Predicted energy') density_scatter_with_hist(test_energies.values, test_predictions, xlabel
```

`<Axes: xlabel='True energy', ylabel='Predicted energy'>`

From physics we know that (without external field) the energy of a molecule does not depend on where in space it is. That is, if we translate a molecule along \([1, 1, 1]\), the energy should not change.

```
# translate the molecule along [1, 1, 1]
= train_positions + 1
translated_positions = model.predict(translated_positions)
translated_predictions density_scatter_with_hist(train_energies.values, translated_predictions)
```

`<Axes: xlabel='Actual', ylabel='Predicted'>`

This is not what we expect. Our model shows completly unphysical behavior and predicts a different energy for the same conformers in different positions in space.

To fix this, and related problems, we need to use a more elaborate approach to building a model.

#### Mmaking predictions invariant/equivariant to transformations

##### What are symmetries we would like to respect?

Before we can talk about how to build a model that respects symmetries, we need to know what symmetries we would like to respect.

In the case of molecules, we would like to respect the following symmetries:

- translation: that is, if we move a molecule along a vector, the energy should not change (see above)
- rotation: that is, if we rotate a molecule, the energy should not change
- permutation of atoms: that is the order with which we put the atoms in the model does not matter

For crystals, we additionally need to respect periodicity. That is, for intensive properties, there should be no difference between using a unit cell or a super cell of that unit cell as input for a model.

Broadly speaking, there are three different ways to build models that respect symmetries.

*Data augmentation*: This is the most straightforward approach. We can generate new data points by applying the symmetries to the existing data points. For example, we can generate new conformers by rotating the existing conformers. This approach is very simple to implement, but it can be very expensive. For example, if we want to generate new conformers by rotating the existing conformers, we need to generate a new conformer for every rotation. This approach is often used for computer vision pipelines in which you might want to detect a cat in an image independent of the orientation. In this case, you can generate new images by rotating the existing images.*Features that are invariant/equivariant*: This approach is more sophisticated. We can build features that are invariant/equivariant to the symmetries we want to respect. For example, we can build features that are invariant to rotation. In the case of force field such features are bond lengths and angles. This is approach is widely used in ML for chemistry and materials science.*Models that are invariant/equivariant*: Alternatively, one can build special models that can consume point clouds as inputs and are equivariant to the symmetries we want to respect. We will not discuss this in detail, but you can find starting points in this perspective by Tess Smidt.

##### Invariant/equivariant features

###### Symmetry functions

###### Fingerprints

###### Correlation functions

###### Symmetry functions

###### Cheaper computations

## Training a model

### How to know if a model is good?

Before we can proceed to building models, we need to estabilsh a way to measure how good a model is.

Interestingly, this is not as trivial as it may sound. To realize this, it is useful to formally write down what we mean by a good model.

#### Empirical risk minimization

Let’s assume we have some input space \(\mathcal{X}\) and some output space \(\mathcal{Y}\). We can think of \(\mathcal{X}\) as the space of all possible inputs and \(\mathcal{Y}\) as the space of all possible outputs. For example, \(\mathcal{X}\) could be the space of all possible molecules and \(\mathcal{Y}\) could be the space of all possible energies. We want to learn a function \(f: \mathcal{X} \rightarrow \mathcal{Y}\) that maps inputs to outputs. We can think of \(f\) as a model that we want to train.

To build this models we have samples of the joint distribution \(p(x, y)\), where \(x\) is an input and \(y\) is the corresponding output. We can think of this as a set of data points \(\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\).

If we now define a loss function \(L\) we can compute the risk, which is the expected value of the loss function:

\[ R(h)={\mathbf {E}}[L(f(x),y)]=\int L(f(x),y)\,dP(x,y). \]

our goal is to find a model \(f\) that minimizes the risk:

\[ {\displaystyle h^{*}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,{R(h)}.} \]

In practice we cannot compute this. The reason is that we do not have access to the joint distribution \(p(x, y)\), but only to a finite set of samples \(\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\).

### Linear regression

```
import jax.numpy as jnp
def linear_regression(x, w, b):
return jnp.dot(x, w) + b
```

```
def loss(w, b):
= linear_regression(x, w, b)
prediction return jnp.mean((prediction - y) ** 2)
```

```
def init_params(num_feat):
return np.random.normal(size=(num_feat,)), 0.0
```

`= jax.grad(loss, argnums=(0, 1)) loss_grad `

```
= 1e-6
learning_rate = 1000 num_epochs
```

## Bias-variance trade-off

## Hyperparameters

## Kernel trick

## Feature importance

### Permutation feature importance

## Feature selection

### Curse of dimensionality

For understanding the curse of dimensionality, it is useful to consider a very simple ML model, the \(k\)-nearest neighbors model. In this model, we have a set of training points \(\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\), where \(x_i\) is a vector of features and \(y_i\) is the corresponding label. To make a prediction, we compute the distance between the input and all training points and return the mode of the labels of the \(k\) closest training points.

Clearly, in this algorithm it is important to find the nearest neighbor. In general, this is important in many algorithms, for instance also in kernel-based learning.

Let’s now ask ourself what part of the space we need to find the nearest neighbors.

For this, let’s start considering a unit cube \([0,1]^d\) and \(n\) data points \(x_i\) sampled uniformly from this cube.

The smallest hypercube that contains \(k\) out of the \(n\) points has the following edge length

\[ l^d = \frac{k}{n} \quad \Rightarrow \quad l = \left(\frac{k}{n}\right)^{1/d} \]

If we plot this for different values of \(d\) we get the following plot:

```
import matplotlib.pyplot as plt
import numpy as np
def length(d, k=5, n=10_000):
return (k/n)**(1/d)
= np.arange(1, 1000)
d
plt.plot(d, length(d))'numbr of dimensions')
plt.xlabel('length of hypercube that contains k neighbors') plt.ylabel(
```

`Text(0, 0.5, 'length of hypercube that contains k neighbors')`

Clearly, for large \(d\) the length approaches 1—which means that all points are now almost equally far apart and comparing distances no longer makes much sense.

We can also check this by performing a simulation: Generating random \(d\) dimensional points and computing the distance between them. We can then plot the distribution of distances.

```
from scipy.spatial import distance_matrix
= [2, 5, 10, 100, 10_000]
dimensions = 1000
num_points
= plt.subplots(1, len(dimensions), sharey='all')
fig, axes
def get_distances(d, num_points):
= np.random.uniform(size=(num_points, d))
points = distance_matrix(points, points)
distances return np.array(distances).flatten()
for d, ax in zip(dimensions, axes):
= get_distances(d, num_points)
distances =20)
ax.hist(distances, binsf'd={d} \n cv={distances.std()/distances.mean():.2f}') ax.set_title(
```

Clearly, for large \(d\) the distances are almost the same (the histograms are much more peaked). We can also see this in terms of the coefficient of variation (cv), which is the standard deviation divided by the mean. For large \(d\) the cv is very small, which means that the distances are very similar.