# Definition
In a Boltzmann machine, each unit updates its state stochastically according to the Boltzmann distribution as the network minimizes its energy. This stochasticity is the key distinction from the Hopfield network: where the latter always descends to the nearest local minimum, the Boltzmann machine uses probabilistic transitions to explore the energy landscape more effectively. The Boltzmann machine models the total energy of the system as a weighted sum of pairwise interactions between units.
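As a minimal sketch of this energy model (the function name and example weights are hypothetical), the energy of a state $x \in \{-1, +1\}^n$ under a symmetric weight matrix can be computed as:

```python
import numpy as np

def energy(x, W):
    """Total energy E(x) = -sum_{i<j} w_ij * x_i * x_j for a state x in {-1, +1}^n.

    W is assumed symmetric with zero diagonal (no self-connections); the
    factor 1/2 corrects for counting each pair (i, j) twice in x @ W @ x.
    """
    return -0.5 * x @ W @ x

# Units that agree along a positive weight yield lower energy.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
aligned = np.array([1.0, 1.0])    # x_0 == x_1
opposed = np.array([1.0, -1.0])   # x_0 != x_1
print(energy(aligned, W), energy(opposed, W))  # -1.0 1.0
```

The lower energy of the aligned state is exactly the "weighted sum of interactions" described above: a positive $w_{ij}$ rewards agreement between units $i$ and $j$.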
# Stochastic Update Rule
Consider a single unit with the rest of the network held fixed. We then compute the total energy for each of the unit's two possible states (on and off).
$$E_{\text{on}} = \underbrace{-\sum\limits_{j\neq i} w_{ij}x_{j}}_{E_{x_{i}=1}} + E_{\text{rest}}\qquad E_{\text{off}} = \underbrace{\sum\limits_{j\neq i} w_{ij}x_{j}}_{E_{x_{i}=-1}} + E_{\text{rest}}$$

![[Pasted image 20260208212506.webp|700]]

From this, the probability of the unit $x_{i}$ being on is derived from the [[Boltzmann Distribution]], and it takes the form of the [[Logistic Function|Sigmoid Function]] applied to the weighted sum of inputs.

$$p_{\text{on}} = \frac{1}{Z} e^{-E_{\text{on}}} = \frac{e^{-E_{\text{on}}}}{e^{-E_{\text{on}}} + e^{-E_{\text{off}}}} = \frac{e^{-E_{x_{i}=1}}}{e^{-E_{x_{i}=1}} + e^{-E_{x_{i}=-1}}} = \frac{1}{1 + e^{-\Delta E}} = \sigma(\Delta E) = \sigma\left( 2\sum\limits_{j\neq i} w_{ij}x_{j} \right)$$
where $\Delta E = E_{\text{off}} - E_{\text{on}} = 2\sum\limits_{j\neq i} w_{ij}x_{j}$ and the temperature is set to 1 for simplicity.

# Contrastive Hebbian Rule
The objective of training is to maximize the likelihood of the training data under the model.

$$\ln p(X) = \ln\left[ \prod_{n=1}^{N} p(x^{(n)}) \right] = \sum\limits_{n=1}^{N}\ln p(x^{(n)}) = \sum\limits_{n=1}^{N}\ln\left( \frac{1}{Z} e^{-E(x^{(n)})} \right) = -\sum\limits_{n=1}^{N}E(x^{(n)}) -N\ln Z$$

To maximize the log-likelihood of the training data, we need to simultaneously minimize the energy (first term) and the normalizing constant (second term). Let's take the partial derivative of the log-likelihood with respect to a weight $w_{ij}$.

$$\frac{\partial \ln p(X)}{\partial w_{ij}} = -\sum\limits_{n=1}^{N}\frac{\partial E(x^{(n)})}{\partial w_{ij}} -N \frac{\partial \ln Z}{\partial w_{ij}}$$

The first (energy) term simplifies to the product of the unit pair.

$$\frac{\partial E(x^{(n)})}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\left[ -\sum\limits_{i<j}^{}w_{ij}x_{i}x_{j} \right] = -x_{i}^{(n)}x_{j}^{(n)}$$

The second term simplifies to the expectation of the product of the unit pair over all possible states.

$$\begin{aligned} \frac{\partial \ln Z}{\partial w_{ij}} &= \frac{1}{Z}\frac{\partial Z}{\partial w_{ij}} = \frac{1}{Z}\frac{\partial}{\partial w_{ij}}\left[\sum\limits_{s\in S}e^{-E(s)}\right]\\ &= \frac{1}{Z}\sum\limits_{s\in S}\left[\frac{\partial}{\partial w_{ij}}e^{-E(s)}\right] = \frac{1}{Z}\sum\limits_{s\in S}\left[e^{-E(s)}\frac{\partial}{\partial w_{ij}}[-E(s)]\right]\\ &= \frac{1}{Z}\sum\limits_{s\in S}e^{-E(s)}x_{i}^{s}x_{j}^{s} = \sum\limits_{s\in S}\frac{e^{-E(s)}}{Z}x_{i}^{s}x_{j}^{s}\\ &= \sum\limits_{s\in S}p(s)x_{i}^{s}x_{j}^{s} \end{aligned}$$
where $S$ is the set of all possible states of the system. Putting everything together:

$$\begin{aligned} \frac{1}{N}\frac{\partial \ln p(X)}{\partial w_{ij}} &= \frac{1}{N}\sum\limits_{n=1}^{N}x_{i}^{(n)}x_{j}^{(n)} -\sum\limits_{s\in S}p(s)x_{i}^{s}x_{j}^{s}\\ &= \underbrace{\mathbb{E}_{\text{data}}[x_{i} x_{j}]}_{\text{Hebbian term}} - \underbrace{\mathbb{E}_{\text{model}}[x_{i}x_{j}]}_{\text{Anti-Hebbian term}} \end{aligned}$$
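In code, this gradient is just a difference of two correlation matrices, one estimated from data states and one from model samples. A minimal sketch (the function name is hypothetical; each row of the input arrays is one state in $\{-1, +1\}^n$):

```python
import numpy as np

def contrastive_gradient(data_states, model_states):
    """E_data[x_i x_j] - E_model[x_i x_j], estimated from samples.

    Both arguments are (num_samples, n) arrays of +/-1 states; the result
    is an (n, n) matrix whose (i, j) entry drives the update of w_ij.
    The diagonal is zeroed since the network has no self-connections.
    """
    corr_data = data_states.T @ data_states / len(data_states)
    corr_model = model_states.T @ model_states / len(model_states)
    grad = corr_data - corr_model
    np.fill_diagonal(grad, 0.0)
    return grad
```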
Thus the weight update rule is defined as

$$w_{ij} \leftarrow w_{ij} + \eta(\mathbb{E}_{\text{data}}[x_{i} x_{j}] - \mathbb{E}_{\text{model}}[x_{i}x_{j}])$$

# Training Algorithm
![[Pasted image 20260209024935.webp|700]]
1. Positive Phase (Clamped)
	1. Clamp the visible units to a training example.
	2. Run the stochastic update rule on each hidden unit until the network reaches equilibrium ([[Gibbs Sampling]]).
	3. Repeat this process for all $N_{p}$ training examples and calculate the average product for each pair: $\mathbb{E}_{\text{data}}[x_{i} x_{j}] = \frac{1}{N_{p}}\sum\limits_{n=1}^{N_{p}}x_{i} x_{j}$, where $N_{p}$ is the number of training examples.
2. Negative Phase (Unclamped)
	1. Start with random states for all units.
	2. Run the stochastic update rule iteratively on both visible and hidden units until the network reaches equilibrium.
	3. Repeat this process $N_{n}$ times and calculate the average product for each pair ([[Monte Carlo Integration]]): $\mathbb{E}_{\text{model}}[x_{i}x_{j}] \approx \frac{1}{N_{n}}\sum\limits_{n=1}^{N_{n}} x_{i}x_{j}$, where $N_{n}$ is the number of negative samples.
3. Update Weights
	1. Adjust the weights using the two estimates: $w_{ij} \leftarrow w_{ij} + \eta\left( \frac{1}{N_{p}}\sum\limits_{n=1}^{N_{p}}x_{i} x_{j} - \frac{1}{N_{n}}\sum\limits_{n=1}^{N_{n}} x_{i}x_{j} \right)$, where $\eta$ is the learning rate.
	2. Repeat the entire procedure until the weights converge.
# Hidden Units
During training, hidden units form strong connections to the visible units that co-occur in the training data, minimizing the energy of the system. Through this mechanism, hidden units extract features of the training data and act as latent variables.
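The training algorithm above can be sketched as a toy implementation. This is a sketch under simplifying assumptions, not a reference implementation: units take values in $\{-1, +1\}$, the temperature is 1, a fixed number of Gibbs sweeps stands in for an equilibrium test, the negative-sample count is tied to the batch size, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, W, clamped=()):
    """One pass of the stochastic update rule over every unclamped unit:
    p(x_i = 1) = sigmoid(2 * sum_j w_ij x_j). W has a zero diagonal,
    so including j = i in the dot product is harmless."""
    for i in range(len(x)):
        if i in clamped:
            continue
        delta_e = 2.0 * W[i] @ x          # E_off - E_on for unit i
        p_on = 1.0 / (1.0 + np.exp(-delta_e))
        x[i] = 1.0 if rng.random() < p_on else -1.0
    return x

def train_step(W, data, n_hidden, eta=0.05, sweeps=5):
    """One contrastive Hebbian update: positive phase, negative phase,
    then W += eta * (E_data[x_i x_j] - E_model[x_i x_j])."""
    n = W.shape[0]
    # Positive phase: clamp visible units to each example, sample hidden units.
    pos = np.zeros_like(W)
    for v in data:
        x = np.concatenate([v, rng.choice([-1.0, 1.0], n_hidden)])
        for _ in range(sweeps):
            gibbs_sweep(x, W, clamped=range(len(v)))
        pos += np.outer(x, x)
    pos /= len(data)
    # Negative phase: free-run the whole network from random states.
    neg = np.zeros_like(W)
    n_neg = len(data)
    for _ in range(n_neg):
        x = rng.choice([-1.0, 1.0], n)
        for _ in range(sweeps):
            gibbs_sweep(x, W)
        neg += np.outer(x, x)
    neg /= n_neg
    grad = pos - neg
    np.fill_diagonal(grad, 0.0)   # no self-connections
    W += eta * grad
    return W
```

Because both phase statistics are outer products of states, the gradient is symmetric and the weight matrix stays symmetric with a zero diagonal across updates, matching the pairwise energy model above.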