Definition
Noise contrastive estimation (NCE) transforms the problem of density estimation into binary classification between data samples and noise samples. Given a sample of data points $x_1, \ldots, x_{T_d}$ that follow an unknown probability distribution $p(\cdot; \theta^*)$ parametrized by $\theta^*$, NCE is used to find an estimator $\hat{\theta}$ that best approximates the true parameter $\theta^*$. Although maximum likelihood estimation (MLE) has good properties, it requires the parametric family $p(\cdot; \theta)$ to be normalized during the calculation. NCE finds the estimator by maximizing an objective function (as in MLE) but without the need to normalize during the calculation, because it treats the normalization constant as just another parameter to be estimated. This is a desirable property, since the normalization constant may be difficult or expensive to calculate.
The idea is to pollute the sample with noise (data points that come from a known distribution) and to perform nonlinear logistic regression to discriminate between data and noise.
Consider a data sample $X = (x_1, \ldots, x_{T_d})$, a noise sample $Y = (y_1, \ldots, y_{T_n})$, and the union of the two samples $U = X \cup Y = (u_1, \ldots, u_{T_d + T_n})$. A binary class label $C_t$ is assigned to each $u_t$, where $C_t = 1$ if $u_t \in X$ and $C_t = 0$ if $u_t \in Y$.
By the definition of the class labels, $p(u \mid C = 1; \theta) = p(u; \theta)$ and $p(u \mid C = 0) = q(u)$, where $q$ denotes the known noise distribution. By the Bayes theorem,
$$h(u; \theta) := P(C = 1 \mid u; \theta) = \frac{p(u; \theta)}{p(u; \theta) + \nu\, q(u)} \quad \text{and} \quad P(C = 0 \mid u; \theta) = 1 - h(u; \theta) = \frac{\nu\, q(u)}{p(u; \theta) + \nu\, q(u)},$$
where $\nu = T_n / T_d$ is the noise-to-sample ratio. The log-likelihood function of the binary classification problem is derived as
$$\ell(\theta) = \sum_{t=1}^{T_d + T_n} \Big[ C_t \ln h(u_t; \theta) + (1 - C_t) \ln\big(1 - h(u_t; \theta)\big) \Big].$$
The NCE estimator is obtained by maximizing the objective function (log-likelihood) with respect to $\theta$:
$$\hat{\theta} = \arg\max_{\theta} \ell(\theta).$$
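As a concrete illustration, the following is a minimal Python sketch of this construction, assuming a 1-D Gaussian data distribution, a wider Gaussian noise distribution, and an unnormalized Gaussian model whose parameter vector includes a stand-in $c$ for the negative log normalization constant; the names (`log_p_unnorm`, `neg_log_likelihood`) are illustrative, not from any library.

```python
# NCE on a toy 1-D problem: classify data vs. noise, estimating the
# normalization constant as an extra parameter c (standing in for -log Z).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

T_d = 1000                                   # number of data samples
nu = 1.0                                     # noise-to-sample ratio T_n / T_d
T_n = int(nu * T_d)                          # number of noise samples

x = rng.normal(loc=2.0, scale=1.5, size=T_d)       # data from the unknown p
y = rng.normal(loc=0.0, scale=3.0, size=T_n)       # noise from the known q
u = np.concatenate([x, y])                         # union U of both samples
C = np.concatenate([np.ones(T_d), np.zeros(T_n)])  # binary class labels

def log_q(u):
    # log-density of the known noise distribution
    return norm.logpdf(u, loc=0.0, scale=3.0)

def log_p_unnorm(u, theta):
    # unnormalized Gaussian model; theta = (mu, log_sigma, c), where the
    # offset c is estimated like any other parameter instead of computing Z
    mu, log_sigma, c = theta
    return -0.5 * ((u - mu) / np.exp(log_sigma)) ** 2 + c

def neg_log_likelihood(theta):
    # h(u; theta) = sigmoid(G) with G = log p(u; theta) - log(nu * q(u))
    G = log_p_unnorm(u, theta) - (np.log(nu) + log_q(u))
    log_h = -np.logaddexp(0.0, -G)           # log h, computed stably
    log_1mh = -np.logaddexp(0.0, G)          # log(1 - h)
    return -np.sum(C * log_h + (1 - C) * log_1mh)

theta_hat = minimize(neg_log_likelihood, x0=np.zeros(3)).x
print("estimated mu, sigma:", theta_hat[0], np.exp(theta_hat[1]))
```

On data like the above, the maximizer should roughly recover the mean and standard deviation of the data-generating Gaussian, and the fitted offset approximates the negative log of its normalization constant.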
Examples
Word Embedding (Skip-Gram Model)
In the context of word embeddings, particularly the skip-gram model, the unknown distribution becomes $p(c \mid w; \theta)$, the probability of a context word $c$ given an input word $w$. Here $\theta$ represents the parameters of the word embedding model, namely the input and context vectors $v_w$ and $v_c$.
The noise distribution is $q(c) = \frac{\#(c)^{3/4}}{\sum_{c'} \#(c')^{3/4}}$, where $\#(c)$ is the count of word $c$ in the corpus; that is, the unigram distribution of contexts raised to the power $3/4$ to balance frequent and rare words.
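The sketch below, assuming a toy tokenized corpus, shows how this noise distribution can be built and sampled from; the variable names are illustrative.

```python
# Build q(c) proportional to count(c)^(3/4) and draw noise words from it.
import numpy as np
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"]  # toy corpus

counts = Counter(corpus)
vocab = sorted(counts)                            # fixed ordering of word types
freqs = np.array([counts[w] for w in vocab], dtype=float)

q = freqs ** 0.75                                 # dampen frequent words
q /= q.sum()                                      # normalize to a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(len(vocab), size=5, p=q)   # indices of 5 noise words
print([vocab[i] for i in negatives])
```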
For each word-context pair $(w, c)$ from the true data, we sample $k$ negative examples $c_1, \ldots, c_k$ from $q$. The NCE objective function is derived as
$$\ell(\theta) = \sum_{(w, c) \in D} \Big[ \ln \sigma(v_c \cdot v_w) + \sum_{i=1}^{k} \ln \sigma(-v_{c_i} \cdot v_w) \Big],$$
where $D$ is the set of observed word-context pairs, $k$ is the number of noise samples per data sample, and $\sigma$ is the logistic sigmoid function.
In practice, $p(c \mid w; \theta)$ is often modeled as $e^{v_c \cdot v_w}$, without explicit normalization. By maximizing this objective, we obtain word embeddings (vectors $v_w$ and $v_c$) that can discriminate between true context words and random noise words, effectively capturing semantic relationships in the vector space.
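The following is a minimal sketch of one stochastic gradient step on this objective, assuming input and context embedding matrices `V` and `U_out` and negatives already drawn from the noise distribution (here drawn uniformly as a stand-in); the names `V`, `U_out`, and `sgd_step` are illustrative.

```python
# One negative-sampling update for a single (word, context) pair.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k, lr = 100, 16, 5, 0.1

V = rng.normal(scale=0.1, size=(vocab_size, dim))      # input vectors v_w
U_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors v_c

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, c, negs):
    # minimize -[log sigma(v_c . v_w) + sum_i log sigma(-v_{c_i} . v_w)]
    grad_w = np.zeros(dim)
    g = sigmoid(V[w] @ U_out[c]) - 1.0       # gradient of -log sigma(s)
    grad_w += g * U_out[c]
    U_out[c] -= lr * g * V[w]                # pull true context toward v_w
    for ci in negs:
        g = sigmoid(V[w] @ U_out[ci])        # gradient of -log sigma(-s)
        grad_w += g * U_out[ci]
        U_out[ci] -= lr * g * V[w]           # push noise words away from v_w
    V[w] -= lr * grad_w

negs = rng.integers(0, vocab_size, size=k)   # stand-in for samples from q
sgd_step(w=3, c=7, negs=negs)
```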