From Physics to Probability: Hamiltonian Mechanics for Generative Modeling and MCMC


Phase space of a nonlinear pendulum. Photo by the author.

Hamiltonian mechanics is a way to describe how physical systems, like planets or pendulums, move over time, focusing on energy rather than forces alone. By reframing complex dynamics through the lens of energy, this 19th-century physics framework now powers cutting-edge generative AI. It uses generalized coordinates \( q \) (like position) and their conjugate momenta \( p \) (related to momentum), which together form a phase space that captures the system’s state. This approach is particularly useful for complex systems with many parts, making it easier to find patterns and conservation laws.


Mathematical Reformulation: From Second-Order to First-Order ⚙️

Newton’s \( F = m\ddot{q} \) requires solving second-order differential equations, which become unwieldy for constrained systems or when identifying conserved quantities.

The Core Idea

Hamiltonian mechanics splits \( \ddot{q} = F(q)/m \) into two first-order equations by introducing the conjugate momentum \( p \):

\[
\dot{q} = \frac{\partial H}{\partial p} \quad \text{(Position)}, \qquad \dot{p} = -\frac{\partial H}{\partial q} \quad \text{(Momentum)}
\]

It decomposes acceleration into complementary momentum/position flows. This phase space perspective reveals hidden geometric structure.

Lagrangian Prelude: Action Principles

The Lagrangian \( \mathcal{L}(q, \dot{q}) = K - U \) leads to the Euler-Lagrange equation via variational calculus:

\[
\frac{d}{dt}\left( \frac{\partial \mathcal{L}}{\partial \dot{q}} \right) - \frac{\partial \mathcal{L}}{\partial q} = 0
\]

Kinetic Energy Symbol

Note that the \( K \) in \( \mathcal{L}(q, \dot{q}) = K - U \) is also commonly written as \( T \).

But these remain second-order. The critical leap comes through the Legendre transformation \( (\dot{q} \rightarrow p) \). The Hamiltonian is derived from the Lagrangian by defining the conjugate momentum as \( p_i = \frac{\partial \mathcal{L}}{\partial \dot{q}_i} \); the Hamiltonian can then be written as:

\[
H(q,p) = \sum_i p_i \dot{q}_i - \mathcal{L}(q, \dot{q})
\]

For separable systems, we can write \( H(q,p) \) more intuitively as:

\[
H(q,p) = K(p) + U(q)
\]

This flips the script: instead of \( \dotq \)-centric dynamics, we get symplectic phase flow.

Why This Matters

The Hamiltonian becomes the system’s total energy \( H = K + U \) for many physical systems. It also provides a framework in which time evolution is a canonical transformation – a symmetry preserving the fundamental Poisson bracket structure \( \{q_i, p_j\} = \delta_{ij} \).
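
As a quick numerical aside (a sketch of my own, not from the TorchEBM post), we can check the canonical bracket \( \{q, p\} = 1 \) with autograd:

```python
import torch

# Sketch: numerically verify {q, p} = 1 for one degree of freedom,
# using autograd for the partial derivatives.
def poisson_bracket(f, g, q, p):
    """{f, g} = (df/dq)(dg/dp) - (df/dp)(dg/dq)."""
    fq, fp = torch.autograd.grad(f(q, p), (q, p))
    gq, gp = torch.autograd.grad(g(q, p), (q, p))
    return fq * gp - fp * gq

q = torch.tensor(0.3, requires_grad=True)
p = torch.tensor(1.7, requires_grad=True)
f = lambda q, p: q + 0.0 * p  # f(q, p) = q (the 0*p term keeps autograd happy)
g = lambda q, p: p + 0.0 * q  # g(q, p) = p
print(poisson_bracket(f, g, q, p))  # tensor(1.)  ->  {q, p} = 1
```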

For more about canonical and non-canonical transformations and the Poisson bracket – including an example of a transformation that is not canonical because it does not preserve the Poisson bracket structure – check out the TorchEBM post on Hamiltonian mechanics, with detailed math and examples.

Newton vs. Lagrange vs. Hamilton: A Philosophical Showdown

| Aspect | Newtonian | Lagrangian | Hamiltonian |
|---|---|---|---|
| State Variables | Position \( x \) and velocity \( \dot{x} \) | Generalized coordinates \( q \) and velocities \( \dot{q} \) | Generalized coordinates \( q \) and conjugate momenta \( p \) |
| Formulation | Second-order differential equations (\( F = ma \)) | Principle of least action (\( \delta \int L \, dt = 0 \)), with \( L = K - U \) | First-order differential equations from the Hamiltonian (phase flow), with \( H = K + U \) |
| Identifying Symmetries | Manual identification or specific methods | Noether’s theorem | Canonical transformations and Poisson brackets |
| Machine Learning Connection | Physics-informed neural networks, simulations | Optimal control, reinforcement learning | Hamiltonian Monte Carlo (HMC) sampling, energy-based models |
| Energy Conservation | Not inherent (must be derived) | Built-in through conservation laws | Central (the Hamiltonian is the energy) |
| Generalized Coordinates | Possible, but often cumbersome | Natural fit | Natural fit |
| Time Reversibility | Yes | Yes | Yes, especially in symplectic formulations |

Hamilton’s Equations: The Geometry of Phase Space ⚙️

The phase space is a mathematical space where we can represent the set of possible states of a physical system. For a system with \( n \) degrees of freedom, the phase space is a \( 2n \)-dimensional space, often visualized as a map where each point \( (q, p) \) represents a unique state. The evolution of the system is described by the motion of a point in this space, governed by Hamilton’s equations.

This formulation offers several advantages. It makes it straightforward to identify conserved quantities and symmetries through canonical transformations and Poisson brackets, which provides deeper insights into the system’s behavior. For instance, Liouville’s theorem states that the volume in phase space occupied by an ensemble of systems remains constant over time, expressed as:

\[
\frac{\partial \rho}{\partial t} + \{\rho, H\} = 0
\]

or equivalently:

\[
\frac{\partial \rho}{\partial t} + \sum_i \left(\frac{\partial \rho}{\partial q_i}\frac{\partial H}{\partial p_i} - \frac{\partial \rho}{\partial p_i}\frac{\partial H}{\partial q_i}\right) = 0
\]

where \( \rho(q, p, t) \) is the phase-space density function. This lets us describe phase-space flows and how they preserve volume under symplectic transformations. The connection to symplectic geometry yields mathematical properties that are directly relevant to many numerical methods; for instance, it is what allows Hamiltonian Monte Carlo to perform well in high dimensions, by defining MCMC proposals that increase the chances of accepting a sample (particle).

Symplecticity: The Sacred Invariant

Hamiltonian flows preserve the symplectic 2-form \( \omega = \sum_i dq_i \wedge dp_i \).

Symplectic 2-form \( \omega \)
The symplectic 2-form, denoted by \( \omega = \sum_i dq_i \wedge dp_i \), is a mathematical object used in symplectic geometry. It measures the area of parallelograms formed by vectors in the tangent space of a phase space.

  • \( dq_i \) and \( dp_i \): Infinitesimal changes in position and momentum coordinates.
  • \( \wedge \): The wedge product, which combines differential forms in an antisymmetric way meaning that \( dq_i \wedge dp_i = -dp_i \wedge dq_i \).
  • \( \sum_i \): Sum over all degrees of freedom.

Imagine a phase space where each point represents a state of a physical system. The symplectic form assigns a value to each pair of vectors, effectively measuring the area of the parallelogram they span. This area is preserved under Hamiltonian flows.
Key Properties

  • Closed: its exterior derivative vanishes, \( d\omega = 0 \). This property ensures that the form is preserved under the continuous transformations generated by Hamiltonian flow.
  • Non-degenerate: if \( \omega(X, Y) = 0 \) for all \( Y \), then \( X = 0 \). This ensures that every non-zero vector has a unique “partner” vector with which it pairs non-trivially under \( \omega \).

Example
For a simple harmonic oscillator with one degree of freedom, \( \omega = dq \wedge dp \). This measures the area of parallelograms in the phase space spanned by vectors representing changes in position and momentum.
A Very Simple PyTorch Example:
While PyTorch doesn’t directly handle differential forms, you can conceptually represent the symplectic form using tensors:
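Here is a minimal sketch of that idea (my own toy code, not TorchEBM’s API):

```python
import torch

# On a 2D phase space (q, p), represent omega = dq ^ dp by its matrix J
# and evaluate omega(u, v) = u^T J v for two tangent vectors u and v.
J = torch.tensor([[0.0, 1.0],
                  [-1.0, 0.0]])  # matrix of the symplectic form dq ^ dp

def omega(u, v):
    """Signed area of the parallelogram spanned by tangent vectors u and v."""
    return u @ J @ v

u = torch.tensor([1.0, 0.0])  # a small step in q
v = torch.tensor([0.0, 1.0])  # a small step in p

print(omega(u, v))  # tensor(1.)
print(omega(v, u))  # tensor(-1.)  ->  dq ^ dp = -(dp ^ dq)
```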

This code illustrates the antisymmetric nature of the wedge product.

Numerically, this means good integrators must respect:

\[
\left( \frac{\partial (q(t + \epsilon),\, p(t + \epsilon))}{\partial (q(t),\, p(t))} \right)^{T} J \, \frac{\partial (q(t + \epsilon),\, p(t + \epsilon))}{\partial (q(t),\, p(t))} = J, \quad \text{where } J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}
\]

Breaking Down the Formula

  • Geometric numerical integration: Solves differential equations while preserving geometric properties of the system.
  • Symplecticity: A geometric property inherent to Hamiltonian systems. It ensures that the area of geometric structures (e.g., parallelograms) in phase space \( (q, p) \) remains constant over time. This is encoded in the symplectic form \( \omega = \sum_i dq_i \wedge dp_i \).
  • A numerical method is symplectic: if it preserves \( \omega \). The Jacobian matrix of the transformation from \( (q(t), p(t)) \) to \( (q(t + \epsilon), p(t + \epsilon)) \) must satisfy the condition above.
  • Jacobian matrix \( \frac{\partial (q(t + \epsilon),\, p(t + \epsilon))}{\partial (q(t),\, p(t))} \): quantifies how small changes in the initial state \( (q(t), p(t)) \) propagate to the next state \( (q(t + \epsilon), p(t + \epsilon)) \).
  • \( q(t) \) and \( p(t) \): position and momentum at time \( t \).
  • \( q(t + \epsilon) \) and \( p(t + \epsilon) \): updated position and momentum after one time step \( \epsilon \).
  • \( \frac{\partial}{\partial (q(t),\, p(t))} \): partial derivatives with respect to the initial state.

How are We Going to Solve it?

Numerical solvers for differential equations inevitably introduce errors that affect solution accuracy. These errors manifest as deviations from the true trajectory in phase space, particularly noticeable in energy-conserving systems like the harmonic oscillator. The errors fall into two main categories: local truncation error, arising from the approximation of continuous derivatives with discrete steps (proportional to \( \mathcal{O}(\epsilon^{n+1}) \), where \( \epsilon \) is the step size and \( n \) depends on the method); and global accumulation error, which compounds over integration time.

Forward Euler Method Fails at This!

Key Issue: Energy Drift from Non-Symplectic Updates

The forward Euler method (FEM) violates the geometric structure of Hamiltonian systems, leading to energy drift in long-term simulations. Let’s dissect why.

For a detailed exploration of how methods like Forward Euler perform in Hamiltonian systems and why they don’t preserve symplecticity—including mathematical breakdowns and practical examples—check out this post on Hamiltonian mechanics from the TorchEBM library documentation.
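
To make the energy drift concrete, here is a minimal sketch for a unit-mass harmonic oscillator, \( H(q,p) = \frac{1}{2}(q^2 + p^2) \), stepped with forward Euler:

```python
import torch

q, p, eps = torch.tensor(1.0), torch.tensor(0.0), 0.1
energy = lambda q, p: 0.5 * (q**2 + p**2)
e0 = energy(q, p)

for _ in range(1000):
    # Forward Euler: both updates use the *old* state -> not symplectic.
    q, p = q + eps * p, p - eps * q

print(f"energy: {e0:.2f} -> {energy(q, p):.2f}")
# Each step multiplies the energy by exactly (1 + eps^2), so it grows
# without bound: the outward spiral seen in phase-space plots.
```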

To overcome this, we turn to symplectic integrators—methods that respect the underlying geometry of Hamiltonian systems, leading us naturally to the Leapfrog Verlet method, a powerful symplectic alternative. 🚀

Symplectic Numerical Integrators 💻

Leapfrog Verlet

For a separable Hamiltonian \( H(q,p) = K(p) + U(q) \), where the corresponding probability distribution is given by:

\[
P(q,p) = \frac{1}{Z} e^{-U(q)} e^{-K(p)},
\]

the Leapfrog Verlet integrator proceeds as follows:

\[
\begin{aligned}
p_i\left(t + \frac{\epsilon}{2}\right) &= p_i(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial q_i}(q(t)) \\
q_i(t + \epsilon) &= q_i(t) + \epsilon \frac{\partial K}{\partial p_i}\left(p\left(t + \frac{\epsilon}{2}\right)\right) \\
p_i(t + \epsilon) &= p_i\left(t + \frac{\epsilon}{2}\right) - \frac{\epsilon}{2} \frac{\partial U}{\partial q_i}(q(t + \epsilon))
\end{aligned}
\]

This Störmer-Verlet scheme preserves symplecticity exactly, with local error \( \mathcal{O}(\epsilon^3) \) and global error \( \mathcal{O}(\epsilon^2) \). You can read more about numerical methods and analysis in Python here.
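
In code, one leapfrog step is a kick-drift-kick pattern. Here is a minimal sketch, assuming the common Gaussian kinetic energy \( K(p) = \frac{p^2}{2m} \) so that \( \partial K / \partial p = p/m \):

```python
import torch

def leapfrog_step(q, p, grad_U, eps, mass=1.0):
    """One Störmer-Verlet step for a separable H(q, p) = U(q) + p^2 / (2 * mass).

    Sketch: grad_U(q) must return dU/dq (e.g. via autograd).
    """
    p = p - 0.5 * eps * grad_U(q)  # half-step momentum kick
    q = q + eps * p / mass         # full-step position drift
    p = p - 0.5 * eps * grad_U(q)  # half-step momentum kick
    return q, p

# Example: harmonic oscillator, U(q) = 0.5 * q^2 so grad_U(q) = q.
q, p = torch.tensor(1.0), torch.tensor(0.0)
for _ in range(1000):
    q, p = leapfrog_step(q, p, grad_U=lambda q: q, eps=0.1)

print(0.5 * (q**2 + p**2))  # oscillates near the true energy 0.5 -- no drift
```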

How Exactly?

Want to know exactly how the Leapfrog Verlet method ensures symplecticity with detailed equations and proofs? The TorchEBM library documentation on Leapfrog Verlet breaks it down step-by-step.

Why Symplecticity Matters

They’re the reversible neural nets of physics simulations!

Symplectic integrators like Leapfrog Verlet are critical for long-term stability in Hamiltonian systems.

  • Phase space preservation: The volume in \( (q, p) \)-space is conserved exactly, avoiding artificial energy drift.
  • Approximate energy conservation: While the energy \( H(q,p) \) is not perfectly conserved (due to the \( \mathcal{O}(\epsilon^2) \) error), it oscillates near the true value over exponentially long timescales.
  • Practical relevance: This makes symplectic integrators indispensable in molecular dynamics and Hamiltonian Monte Carlo (HMC), where accurate sampling relies on stable trajectories.

Comparison of numerical integration methods for a simple harmonic oscillator in phase space. Color gradients indicate error magnitude, with brighter colors showing larger divergence from the exact solution (white). Euler’s method (a) exhibits energy growth, Modified Euler’s method (b) shows improved stability, while Leapfrog maintains excellent energy conservation at small step size (c) but develops geometric distortion at larger step size (d). Photo by the author.

Euler’s method (first-order) systematically injects energy into the system, causing the characteristic outward spiral seen in the plots. Modified Euler’s method (second-order) significantly reduces this energy drift. Most importantly, symplectic integrators like the Leapfrog method preserve the geometric structure of Hamiltonian systems even with relatively large step sizes by maintaining phase space volume conservation. This structural preservation is why Leapfrog remains the preferred method for long-time simulations in molecular dynamics and astronomy, where energy conservation is critical despite the visible polygon-like discretization artifacts at large step sizes.

Non-symplectic methods (e.g., Euler-Maruyama) often fail catastrophically in these settings.

| Integrator | Symplectic? | Order | Type |
|---|---|---|---|
| Euler Method | No | 1 | Explicit |
| Symplectic Euler | Yes | 1 | Explicit |
| Leapfrog (Verlet) | Yes | 2 | Explicit |
| Runge-Kutta 4 | No | 4 | Explicit |
| Forest-Ruth Integrator | Yes | 4 | Explicit |
| Yoshida 6th-order | Yes | 6 | Explicit |
| Heun’s Method (RK2) | No | 2 | Explicit |
| Third-order Runge-Kutta | No | 3 | Explicit |
| Implicit Midpoint Rule | Yes | 2 | Implicit (solving equations) |
| Fourth-order Adams-Bashforth | No | 4 | Multi-step (explicit) |
| Backward Euler Method | No | 1 | Implicit (solving equations) |

For more details on things like local and global errors or what these integrators are best suited for, there’s a handy write-up over at Hamiltonian mechanics: Why Symplecticity Matters that covers it all.

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) method that leverages Hamiltonian dynamics to efficiently sample from complex probability distributions, particularly in Bayesian statistics and Machine Learning.

From Phase Space to Probability Space

HMC interprets the target distribution \( P(z) \) as a Boltzmann distribution:

\[
P(z) = \frac{1}{Z} e^{-E(z)/T}
\]

Substituting into this formulation, the Hamiltonian gives us a joint density:

\[
P(q,p) = \frac{1}{Z} e^{-U(q)} e^{-K(p)}, \quad \text{where } U(q) = -\log\left[ p(q)\, p(D \mid q) \right]
\]

where \( p(q) \) is the prior, \( p(D \mid q) \) is the likelihood of the observed data \( D \), and the temperature \( T = 1 \) has been absorbed. We can estimate the posterior over \( q \) using the potential energy \( U(q) \) alone, since \( P(q,p) \) factorizes into two independent distributions.

We augment the position variables with an artificial momentum \( p \sim \mathcal{N}(0, M) \) and then simulate Hamiltonian dynamics to propose a new \( q' \). The negative log-density \( U(q) \) acts as the “potential energy” of the target distribution \( P(q) \), creating valleys at high-probability regions.

For more on HMC, check out this explanation or this tutorial.

Physical Systems: \( H(q,p) = U(q) + K(p) \) represents total energy

Sampling Systems: \( H(q,p) = -\log P(q) + \frac{1}{2}p^T M^{-1} p \) defines exploration dynamics

The kinetic energy, most commonly the Gaussian form \( K(p) = \frac{1}{2}p^T M^{-1} p \), injects momentum to traverse these landscapes. Crucially, the mass matrix \( M \) plays the role of a preconditioner: a diagonal \( M \) adapts to parameter scales, while a dense \( M \) can align with the correlation structure. \( M \) is symmetric, positive definite, and typically diagonal.

What is Positive Definite?

Positive Definite: For any non-zero vector \( x \), the expression \( x^T M x \) is always positive. This ensures stability and efficiency.

Illustration of different quadratic forms in two variables that shows how different covariance matrices
influence the shape of these forms. The plots depict:

a) Positive Definite Form: A bowl-shaped surface where all eigenvalues are positive, indicating a minimum.

b) Negative Definite Form: An inverted bowl where all eigenvalues are negative, indicating a maximum.

c) Indefinite Form: A saddle-shaped surface with both positive and negative eigenvalues, indicating neither a maximum nor a minimum.

Each subplot includes the matrix \( M \) and the corresponding quadratic form \( Q(x) = x^T M x \). Photo by the author.

\[
x^T M x > 0
\]
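
Numerically, positive definiteness is easy to check. A small sketch with a hypothetical \( 2 \times 2 \) mass matrix:

```python
import torch

# Sketch: verify positive definiteness of a hypothetical mass matrix M.
M = torch.tensor([[2.0, 0.5],
                  [0.5, 1.0]])  # symmetric by construction

print(torch.linalg.eigvalsh(M))  # all eigenvalues > 0  <=>  positive definite

x = torch.randn(2)
print(x @ M @ x)  # x^T M x > 0 for any non-zero x
```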

Kinetic Energy Choices

  • Gaussian (Standard HMC): \( K(p) = \frac{1}{2}p^T M^{-1} p \)
    Yields Euclidean trajectories, efficient for moderate dimensions.
  • Relativistic (Relativistic HMC): \( K(p) = \sqrt{p^T M^{-1} p + c^2} \)
    Limits the maximum velocity, preventing divergences in ill-conditioned spaces.
  • Adaptive (Surrogate Gradients): Learn \( K(p) \) via neural networks to match the target geometry.

Key Intuition

The Hamiltonian \( H(q,p) = U(q) + \frac{1}{2}p^T M^{-1} p \) creates an energy landscape where momentum carries the sampler through high-probability regions, avoiding random-walk behavior.

The HMC Algorithm

The algorithm involves:

  1. Initialization: Start with an initial position \( q_0 \) and sample a momentum \( p_0 \sim \mathcal{N}(0, M) \).
  2. Leapfrog Integration: Use the leapfrog method to approximate Hamiltonian dynamics. For a step size \( \epsilon \) and \( L \) steps, update:
    • Half-step momentum: \( p(t + \frac{\epsilon}{2}) = p(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial q}(q(t)) \)
    • Full-step position: \( q(t + \epsilon) = q(t) + \epsilon \frac{\partial K}{\partial p}(p(t + \frac{\epsilon}{2})) \), where \( K(p) = \frac{1}{2} p^T M^{-1} p \), so \( \frac{\partial K}{\partial p} = M^{-1} p \)
    • Full-step momentum: \( p(t + \epsilon) = p(t + \frac{\epsilon}{2}) - \frac{\epsilon}{2} \frac{\partial U}{\partial q}(q(t + \epsilon)) \)

    This is repeated \( L \) times to obtain the proposal \( (q', p') \).

  3. Metropolis-Hastings Acceptance: Accept the proposed \( q' \) with probability \( \min(1, e^{H(q_0, p_0) - H(q', p')}) \), where \( H(q,p) = U(q) + K(p) \).

This process generates a Markov chain with stationary distribution \( P(q) \), leveraging Hamiltonian dynamics to take larger, more efficient steps compared to random-walk methods.
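
Putting the three steps together, here is a compact sketch of the full algorithm (my own minimal version assuming an identity mass matrix \( M = I \), not the TorchEBM implementation):

```python
import torch

def hmc(U, q0, eps=0.1, L=20, n_samples=2000):
    """Minimal HMC sketch: U(q) = -log P(q) up to a constant; M = I assumed."""
    grad_U = torch.func.grad(U)  # PyTorch 2.x functional autograd
    q, samples = q0.clone(), []
    for _ in range(n_samples):
        p = torch.randn_like(q)                        # 1. momentum p ~ N(0, I)
        q_new, p_new = q.clone(), p.clone()
        p_new = p_new - 0.5 * eps * grad_U(q_new)      # 2. leapfrog: half kick
        for step in range(L):
            q_new = q_new + eps * p_new                #    full drift
            if step < L - 1:
                p_new = p_new - eps * grad_U(q_new)    #    full kick
        p_new = p_new - 0.5 * eps * grad_U(q_new)      #    final half kick
        # 3. Metropolis-Hastings accept/reject on H = U + K
        H_old = U(q) + 0.5 * (p @ p)
        H_new = U(q_new) + 0.5 * (p_new @ p_new)
        if torch.rand(()) < torch.exp(H_old - H_new):
            q = q_new
        samples.append(q.clone())
    return torch.stack(samples)

# Example: standard 2D Gaussian target, U(q) = 0.5 * ||q||^2.
samples = hmc(lambda q: 0.5 * (q @ q), torch.zeros(2))
print(samples.mean(0), samples.var(0))  # mean ≈ 0, variance ≈ 1
```

On a standard Gaussian target the empirical mean and variance match the true values, a quick sanity check that the dynamics plus the accept/reject step leave \( P(q) \) invariant.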

Why Better Than Random Walk?

HMC navigates high-dimensional spaces along energy contours – like following mountain paths instead of wandering randomly!

A recap of Hamilton’s equations:

\[
\begin{cases}
\dot{q} = \nabla_p K(p) = M^{-1} p & \text{(Guided exploration)} \\
\dot{p} = -\nabla_q U(q) = \nabla_q \log P(q) & \text{(Bayesian updating)}
\end{cases}
\]

This coupled system drives \( (q,p) \) along level sets of \( H \) – iso-probability contours of the joint distribution \( P(q,p) \) – with momentum rotating rather than resetting at each step as in Random Walk Metropolis. Think of following mountain paths instead of wandering randomly! The key parameters – integration time \( \tau = L\epsilon \) and step size \( \epsilon \) – balance exploration against computational cost:

  • Short \( \tau \): Local exploration, higher acceptance
  • Long \( \tau \): Global moves, risk of U-turns (periodic orbits)

Key Parameters and Tuning

Tuning \( M \) to match the covariance of \( P(q) \) (e.g., via warmup adaptation) and setting \( \tau \sim \mathcal{O}(1/\lambda_{\text{max}}) \), where \( \lambda_{\text{max}} \) is the largest eigenvalue of \( \nabla^2 U \), often yields optimal mixing.

TorchEBM Library 📚

Oh, by the way, I’ve been messing around with this stuff in Python and threw together a library called TorchEBM. It’s got some tools for energy-based models, score matching, diffusion- and flow-based models, and the HMC bits I’ve been playing with. Nothing fancy, just a researcher’s sandbox for testing ideas like these. If you’re into coding this kind of thing, poke around on TorchEBM GitHub and let me know what you think – PRs welcome! It’s been fun tinkering with it while writing this post.

Connection with Energy-Based Models

Energy-based models (EBMs) are a class of generative models that define a probability distribution over data points using an energy function. The probability of a data point is proportional to \( e^{-E(x)} \), where \( E(x) \) is the energy function. This formulation is directly analogous to the Boltzmann distribution in statistical physics, where the probability is related to the energy of a state. In Hamiltonian mechanics, the Hamiltonian function \( H(q, p) \) represents the total energy of the system, and the probability distribution in phase space is given by \( e^{-H(q,p)/T} \), where \( T \) is the temperature.

In EBMs, Hamiltonian Monte Carlo (HMC) is often used to sample from the model’s distribution. HMC leverages Hamiltonian dynamics to propose new states, which are then accepted or rejected based on the Metropolis-Hastings criterion. This method is particularly effective for high-dimensional problems, as it reduces the correlation between samples and allows for more efficient exploration of the state space. For instance, in image generation tasks, HMC can sample from the distribution defined by the energy function, facilitating the generation of high-quality images.

EBMs define probability through Hamiltonians:

\[
p(x) = \frac{1}{Z}e^{-E(x)} \quad \leftrightarrow \quad H(q,p) = E(q) + K(p)
\]
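
To make the correspondence concrete, here is a sketch with a hypothetical toy energy network (not TorchEBM’s actual API); the point is that HMC only ever needs \( E(x) \) and its gradient, never the normalizer \( Z \):

```python
import torch
import torch.nn as nn

# Hypothetical toy energy network -- any scalar-valued E(x) works.
energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def U(x: torch.Tensor) -> torch.Tensor:
    """Potential for HMC: U(x) = E(x), since p(x) ∝ exp(-E(x))."""
    return energy_net(x).squeeze(-1)

# The normalizer Z is intractable, but HMC never needs it -- only grad U,
# which autograd provides:
x = torch.randn(2, requires_grad=True)
grad_U = torch.autograd.grad(U(x), x)[0]
print(grad_U)  # -grad log p(x), the force that drives the sampler
```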

Potential Research Directions 🔮

Symplecticity in Machine Learning Models

Overview of the Hamiltonian Neural Networks architecture. Image from the HNN paper.

Incorporate the symplectic structure of Hamiltonian mechanics into machine learning models to preserve properties like energy conservation, which is crucial for long-term predictions. One direction is generalizing Hamiltonian Neural Networks (HNNs), as discussed in Hamiltonian Neural Networks, to more complex systems, or developing new architectures that preserve symplecticity.

HMC for Complex Distributions

Use HMC to sample from complex, high-dimensional, and multimodal distributions, such as those encountered in deep learning. Combining HMC with other techniques, like parallel tempering, could handle distributions with multiple modes more effectively.

Combining Hamiltonian Mechanics with Other ML Techniques

Integrate Hamiltonian mechanics with reinforcement learning to guide exploration in continuous state and action spaces. Using it to model the environment could improve exploration strategies, as seen in potential applications in robotics. Additionally, using Hamiltonian mechanics to define approximate posteriors in variational inference could lead to more flexible and accurate approximations.

Hamiltonian GANs

Employ the Hamiltonian formalism as an inductive bias for generating physically plausible videos with neural networks.

Wanna Team Up on This? 🤓

If some of you brilliant folks, wherever you’re doing your high-level wizardry, are into research collaboration, I’d love to chat generative models over coffee ☕️ (virtual or IRL in London). If you’re into pushing these ideas further, hit me up! Follow me on Twitter/BlueSky or GitHub – I’m usually rambling about this stuff there. I’m also on LinkedIn and Medium/TDS if you’re curious. To find out more about my research interests, check out my personal website.


Conclusion

Hamiltonian mechanics reframes physical systems through energy, using phase space to reveal symmetries and conservation laws via first-order equations. Symplectic integrators like Leapfrog Verlet preserve this structure, ensuring stability in simulations—crucial for applications like molecular dynamics and Hamiltonian Monte Carlo (HMC). HMC leverages these dynamics to sample complex distributions efficiently, bridging classical physics with modern machine learning.

