
Metropolis Algorithm


Today, I want to talk about the Metropolis algorithm. But before that, we need to discuss the Monte Carlo method. In a previous post, I introduced the Monte Carlo method and explained that it is excellent for solving multidimensional integrals. The magic lies in drawing random samples from the problem instead of integrating the functions directly. We discussed both direct and Markov-chain approaches. Now, we’ll explore how to apply importance sampling through a Markov chain using the Metropolis algorithm.

Disclaimer: Apologies in advance for the lengthy text you are about to read (I hope you find it engaging!). There are no interactive simulations this time, but give it a read anyway!

Much of this post is inspired by the book Introduction to Computational Materials Science by Richard Lesar. I highly recommend this book to anyone interested in computational physics, materials science, or statistical mechanics. It provides clear explanations that make complex topics accessible and engaging.

First, let’s talk about statistical mechanics. Statistical mechanics links the random motions of large numbers of particles to macroscopic properties, using probability to explain things like temperature, pressure, and entropy. In other words, it helps us understand how the macroscopic properties of a system (e.g., temperature) are related to its microscopic variables (e.g., the velocities of the particles).

One of the central goals of statistical mechanics is to calculate averages of physical quantities, such as energy, pressure, or magnetization, by considering all the possible microscopic states a system can have. These averages are crucial because they allow us to relate the unpredictable, random behavior of individual particles to the stable, measurable properties we observe in the macroscopic world. In essence, statistical mechanics provides a framework to bridge the gap between the microscopic laws of physics and the macroscopic phenomena seen in everyday experiments by rigorously defining how ensemble averages correspond to observable quantities.

A macrostate describes the thermodynamic state of a system, defined by its set of macroscopic properties.

The microstate of a system is the instantaneous value of its internal variables. For example, if a system has $N$ atoms, the microstate at a specific time $t_o$ is the set of atomic coordinates and momenta of all the atoms at $t_o$. The microstates will change with time as the system evolves.

Statistical mechanics tells us how to connect averages over the microstates to the thermodynamics of the macrostates. It turns out that there are two ways we can think about making that connection: by time averages or by ensemble averages.

Time Averages

Suppose we have a system with $N$ atoms, where $N$ is very large. If the motion of these atoms follows classical mechanics, then at any given time $t$, the atoms have a specific set of positions and momenta. The time-dependent properties of the system are completely characterized by $6N$ quantities, known as the mechanical degrees of freedom: the $3N$ coordinates of the positions ($\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N$) and the $3N$ coordinates of the momenta ($\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_N$). Together, these specify the system’s instantaneous microstate.

These positions and momenta serve as coordinates in a $6N$-dimensional space called phase space. At any moment, the system is represented by a single point in phase space, defined by the $3N$ positions and $3N$ momenta at that instant. As time progresses, the system moves through phase space as the atoms change their positions and momenta. The path traced out by this “motion” in phase space is known as the trajectory. We could try to explicitly track the changes in these variables over time. However, knowing the instantaneous state of the system does not provide much useful information; it is simply too complex to consider, even for relatively small values of $N$.

The goal is to connect the microstates to the macrostates by means of quantities averaged over the microstates.

The most intuitive idea is that the value we observe for a macrostate in an experiment should be equal to the average of that quantity over time. For example, the observed total energy $E_{\text{obs}}$ would be the time average of $E(t)$ over a time period $t_o$:

$$E_{\text{obs}} = \langle E \rangle = \frac{1}{t_o} \int_0^{t_o} E(t)\,dt$$

where $E$ is a continuous function of time, and $E(t)$ is the value of $E$ at the point in phase space at time $t$. What is a sufficient length of time for a time average to correspond to a macroscale observable? It must be long enough for the system to visit enough points in the accessible regions of phase space.
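In a simulation, $E$ is recorded at discrete time steps, so the integral above becomes an average over samples. A minimal sketch with a made-up fluctuating signal standing in for $E(t)$ (the baseline, amplitude, and period are arbitrary choices for illustration):

```python
import numpy as np

# Made-up oscillating "energy" standing in for E(t); baseline 1.0,
# amplitude 0.5, and period 1.0 are purely illustrative.
t = np.linspace(0.0, 100.0, 10_001)        # discrete times in [0, t_o]
E_t = 1.0 + 0.5 * np.sin(2 * np.pi * t)    # samples of E(t)

# Discrete version of E_obs = (1/t_o) * integral_0^{t_o} E(t) dt
E_obs = E_t.mean()                          # ~ 1.0 when averaged over many periods
```

Averaged over many full periods, the oscillating part cancels and only the baseline survives.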

Ensemble Averages

There is another way to calculate average quantities. When we performed the time average, we evaluated $E$ at each configuration visited in phase space along the trajectory. Suppose that along that trajectory, the system visited (or came arbitrarily close to) certain configurations more than once. We could imagine keeping track of how many times the system occupied a specific configuration. The average could then be determined by multiplying the value of, for example, $E_\alpha$ at each distinct configuration $\alpha$ by the number of times $n_\alpha$ the system was in that configuration. To find the average, we just need to divide by the total number of configurations, $N_{\text{config}}$:

$$\langle E \rangle = \frac{1}{N_{\text{config}}} \sum_{\alpha=1}^{N_{\text{config}}} n_\alpha E_\alpha = \sum_{\alpha=1}^{N_{\text{config}}} \rho_\alpha E_\alpha$$

In the second part of the last equation, we introduced the probability density, $\rho_\alpha = n_\alpha / N_{\text{config}}$, which is the fraction of all possible states that are in configuration $\alpha$.

An important property of the probability density is that it is normalized, i.e.:

$$\sum_{\alpha=1}^{N_{\text{config}}} \rho_\alpha = 1$$

Instead of following a single system over a long period of time, imagine that a very large number of identical systems were created and allowed to evolve independently, an ensemble of systems. While these systems are identical, they will be in different states. The probability density of being in a given state can be found by counting the number of times that state is represented in the ensemble of systems and dividing by the number of systems. Averages are then found using the last equations. This way of looking at averaging is called an ensemble average.
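Numerically, this counting picture is straightforward. A small sketch with made-up configuration energies and visit counts (all numbers are illustrative):

```python
import numpy as np

E_alpha = np.array([1.0, 2.0, 3.0])   # energy of each distinct configuration
n_alpha = np.array([700, 250, 50])    # made-up visit counts along a trajectory
N_config = n_alpha.sum()              # total number of configurations recorded

rho_alpha = n_alpha / N_config        # probability density of each configuration
assert np.isclose(rho_alpha.sum(), 1.0)  # normalization: the rho_alpha sum to 1

E_avg = np.sum(rho_alpha * E_alpha)   # <E> = sum over alpha of rho_alpha * E_alpha
```

With these counts, low-energy configurations dominate the average, exactly as the weighted sum prescribes.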

Partition Function

Consider the probability of being in a state, $\rho_\alpha$. At this point, we do not have a way to specify exactly which factors influence that probability, such as how it might depend on thermodynamic variables like temperature. However, we can introduce a function that plays this role, in the hope that we can specify its explicit form later. Let us call this function $f_\alpha$ for state $\alpha$. Keep in mind that this function can depend on many variables. Because $f_\alpha$ determines how thermodynamic quantities weight the probability of being in a particular state relative to others, it is known as a weighting function. Thus, we assume the probability has the form $\rho_\alpha \propto f_\alpha$.

However, we want the probabilities to be normalized (it is much easier to work with normalized probabilities), so we write:

$$\rho_\alpha = \frac{f_\alpha}{\sum_\alpha f_\alpha} = \frac{1}{Z} f_\alpha$$

where the sum in the denominator is over all states. The normalization factor

$$Z = \sum_\alpha f_\alpha$$

is called the partition function, and it is fundamentally important in statistical mechanics, as it provides a key connection to thermodynamics. The average of a quantity $B$ is given by:

$$\langle B \rangle = \frac{1}{Z} \sum_\alpha f_\alpha B_\alpha$$

The properties of an ensemble of systems depend on the constraints imposed on them, and each set of constraints corresponds to a different ensemble. Each ensemble can be characterized by a different weighting function $f_\alpha$, and thus by a different partition function $Z$.

Ergodicity

We have assumed that the two ways of taking averages, by averaging either over time or over an ensemble of systems, are equivalent. Systems for which this equivalence holds are said to be ergodic.

Canonical Ensemble

The canonical ensemble is characterized by having a fixed number of particles in a fixed volume at constant temperature ($NVT$). All other thermodynamic quantities will fluctuate (in equilibrium) around their average values.

As we mentioned before, the weighting function describes how thermodynamics influences the probability and, for the canonical ensemble, is given by:

$$f_\alpha = \exp\left(-\frac{E_\alpha}{k_B T}\right)$$

where $E_\alpha$ is the energy of state $\alpha$ and $k_B$ is the Boltzmann constant. The canonical partition function is given by:

$$Z = \sum_\alpha \exp\left(-\frac{E_\alpha}{k_B T}\right)$$

Evaluating $Z$ requires a sum over all states of the system. In most cases there are far too many possible states, and we cannot evaluate $Z$ directly. Thus, while we know the weighting function $f_\alpha$ for each state, we do not know the actual probability density $\rho_\alpha$ for each state.
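For a toy system small enough to enumerate every state, $Z$ can of course be summed directly, which makes the machinery concrete. A sketch with three made-up energy levels (energies measured in units of $k_B T$):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0])    # made-up state energies, in units of k_B*T
f = np.exp(-E)                    # Boltzmann weighting function f_alpha
Z = f.sum()                       # partition function: Z = sum of f_alpha
rho = f / Z                       # normalized probabilities rho_alpha = f_alpha / Z
assert np.isclose(rho.sum(), 1.0)

E_avg = np.sum(rho * E)           # <E> = (1/Z) * sum of f_alpha * E_alpha
```

For a real system with, say, $10^{23}$ particles, this enumeration is hopeless, which is exactly why the Metropolis algorithm is needed.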

Metropolis Algorithm

We now have all the tools needed to discuss the Metropolis Algorithm. In 1953, Metropolis et al. introduced a groundbreaking method to explore configuration space without explicitly knowing the probability of each state. This technique, now known as the Metropolis Algorithm, constructs a representative set of configurations by relying on their relative probabilities rather than their absolute values. Through this process, we generate a sequence of states distributed according to the correct statistical weights, allowing us to compute average quantities with remarkable accuracy.

As mentioned earlier, it is impractical to compute the partition function $Z$, which means we cannot determine the exact probability of any particular state $\alpha$. However, we do know that, although there are many possible configurations, only a relatively small subset are truly important, as they carry most of the probability. Instead of trying to calculate the exact probability for every state, we can focus on exploring the space of states in a way that emphasizes these most relevant ones.

To do this, we start from a given state $i$ (with probability $\rho_i$), consider a possible move to a new state $i+1$ (with probability $\rho_{i+1}$), and compute the ratio of probabilities between these two states. This ratio will guide us in deciding which configurations to sample more frequently, allowing us to efficiently explore the set of important states without needing to know the full partition function:

$$\rho_i = \frac{1}{Z} \exp\left(-\frac{E_i}{k_B T}\right)$$

$$\rho_{i+1} = \frac{1}{Z} \exp\left(-\frac{E_{i+1}}{k_B T}\right)$$

$$\frac{\rho_{i+1}}{\rho_i} = \exp\left(-\frac{E_{i+1} - E_i}{k_B T}\right)$$

Note that the probability ratio between the states does not depend on evaluating the partition function!
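We can check this cancellation numerically: plug in any value at all for $Z$ (here an arbitrary 42.0) and the ratio comes out the same. The energies are made up for illustration:

```python
import math

kBT = 1.0                  # k_B*T in arbitrary units
E_i, E_next = 2.0, 2.5     # made-up energies of states i and i+1
Z = 42.0                   # any normalization whatsoever -- it cancels in the ratio

rho_i = math.exp(-E_i / kBT) / Z
rho_next = math.exp(-E_next / kBT) / Z
ratio = rho_next / rho_i

# Same result without ever knowing Z:
assert math.isclose(ratio, math.exp(-(E_next - E_i) / kBT))
```

Only the energy difference $E_{i+1} - E_i$ matters, which is cheap to compute for a single trial move.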

The Metropolis algorithm is based on this idea: while we cannot know the actual probability of a state (since we cannot evaluate $Z$), we can create a list of configurations through configuration space that has the correct probability distribution. This list is the trajectory through configuration space we talked about before. Their approach is to start the system at a configuration, make a trial move to a new configuration, and then test, based on the probability of the new configuration relative to the starting configuration, whether the new configuration should be added to the trajectory. The decision of whether to add the state $i+1$ to the trajectory is based on the ratio of probabilities, $\rho_{i+1}/\rho_i = \exp\left(-(E_{i+1} - E_i)/k_B T\right)$. Let’s analyze this relation a bit:

If $\rho_{i+1}/\rho_i \geq 1$

If $\rho_{i+1}/\rho_i \geq 1$, this means that state $i+1$ is at least as probable as state $i$. From an energy perspective, this requires that $E_{i+1} - E_i \leq 0$, or in other words, $E_{i+1} \leq E_i$. This is a safe move because we are adding to our trajectory a state with equal or higher probability.

If $\rho_{i+1}/\rho_i < 1$

However, if $\rho_{i+1}/\rho_i < 1$, the new state $i+1$ is actually less probable than the current state $i$; that is, it typically corresponds to a higher-energy configuration, which the system would be less likely to occupy at equilibrium. But the Metropolis algorithm cleverly allows such moves, introducing an element of controlled randomness that enables the system to properly sample configurations even in regions of lower probability.

This is crucial, because if we only ever accepted moves to more probable (lower-energy) states, our sampling would quickly become stuck in local minima, failing to explore the true landscape of configuration space. To address this, the algorithm accepts moves to less probable states with a probability equal to the ratio $\rho_{i+1}/\rho_i = \exp\left(-\frac{E_{i+1} - E_i}{k_B T}\right)$. Here, $T$ is the temperature, which acts as a tuning parameter: at higher temperatures, even unlikely (higher-energy) moves are more likely to be accepted, helping the system escape unfavorable local minima. At lower temperatures, the algorithm favors lower-energy moves, mimicking physical cooling and focusing exploration on more probable states.
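The role of temperature is easy to see numerically: for a fixed uphill move $\Delta E > 0$, the acceptance probability $\exp(-\Delta E / k_B T)$ sweeps from nearly zero to nearly one as $T$ grows. The values below are illustrative:

```python
import math

dE = 1.0                            # a fixed uphill move, Delta E > 0 (arbitrary units)
for kBT in (0.1, 1.0, 10.0):
    p_accept = math.exp(-dE / kBT)  # probability of accepting this uphill move
    print(f"kBT = {kBT:4.1f}  ->  acceptance probability = {p_accept:.5f}")
```

At $k_B T = 0.1$ the move is essentially always rejected; at $k_B T = 10$ it is almost always accepted.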

Algorithm

Here is the Metropolis Algorithm step by step:

  1. Suppose the system starts at configuration $i$. Compute the energy $E_i$ for that state.

  2. A trial move is made to configuration $i+1$ with energy $E_{i+1}$.

  3. Compute the energy difference $\Delta E = E_{i+1} - E_i$.

  4. Decide whether to accept or reject the trial move:

    a. If $\Delta E \leq 0$, then the relative probability $\rho_{i+1}/\rho_i \geq 1$, so the trial move is accepted and added to the trajectory.

    b. If $\Delta E > 0$, then the move will be accepted with probability $\exp\left(-\Delta E / k_B T\right)$. That decision is made by first generating a random number $r$ between 0 and 1. If $r \leq \exp\left(-\Delta E / k_B T\right)$, the move is accepted; otherwise, the move is rejected.

  5. If the trial move is accepted, then the next configuration in the trajectory is the new state $i+1$.

  6. If, however, the trial move is rejected, the next configuration remains the same as configuration $i$. Each trial generates a configuration for the trajectory, but it may be identical to the previous one. This point is crucial, because otherwise an incorrect distribution will result (just as when we computed $\pi$ using Markov-chain sampling, where it was necessary to count trials even if they landed outside the square).

  7. Repeat.
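The steps above can be sketched in a few lines of Python. As an illustration (not from the book), the system here is a single coordinate $x$ in a made-up harmonic potential $E(x) = x^2/2$, sampled at $k_B T = 1$; for this choice the exact canonical average is $\langle E \rangle = k_B T / 2 = 0.5$:

```python
import math
import random

random.seed(0)

def energy(x):
    # Made-up harmonic potential, E(x) = x^2 / 2 (arbitrary units)
    return 0.5 * x * x

def metropolis(n_steps, kBT=1.0, step_size=1.0):
    x = 0.0                        # step 1: start at some configuration
    E = energy(x)
    trajectory = []
    for _ in range(n_steps):
        x_trial = x + random.uniform(-step_size, step_size)  # step 2: trial move
        E_trial = energy(x_trial)
        dE = E_trial - E                                     # step 3: energy difference
        # step 4: accept if dE <= 0, otherwise with probability exp(-dE / kBT)
        if dE <= 0 or random.random() <= math.exp(-dE / kBT):
            x, E = x_trial, E_trial                          # step 5: move accepted
        trajectory.append(x)       # step 6: a rejected move repeats the old state
    return trajectory              # step 7: repeat is the loop itself

traj = metropolis(100_000)
E_avg = sum(energy(x) for x in traj) / len(traj)   # should approach kBT/2 = 0.5
```

Note that the rejected moves are appended to the trajectory as repeats of the current state, exactly as step 6 requires; dropping them would bias the distribution.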

Conclusion

Considering a system with many possible configurations, the goal over the course of a Monte Carlo calculation is to sample enough of the probable configuration space to obtain good averages. The Metropolis algorithm decides whether a move to a new configuration is accepted or rejected in a way that generates the correct canonical-ensemble probability for each configuration. However, while it determines whether to accept or reject trial moves, it does not prescribe how to select or propose those new configurations in the first place. We will talk about this in another post!

In summary, the Metropolis algorithm is a powerful and general approach for sampling complex systems where direct calculation of probabilities is not feasible. Its effectiveness stems from relying on relative probabilities and from the controlled acceptance of less likely states. This ensures thorough exploration of configuration space and enables accurate estimation of statistical properties, making the Metropolis algorithm invaluable in computational physics, chemistry, and related fields where sampling in high-dimensional spaces is required.