Definition
Noise contrastive estimation (NCE) transforms the problem of density estimation into binary classification between data samples and noise samples. Given a sample of data points $x_1, \ldots, x_{T_d}$ that follow an unknown probability distribution $p(\cdot; \theta^*)$ parametrized by $\theta^*$, NCE is used to find an estimator $\hat{\theta}$ that best approximates the true parameter $\theta^*$. Although maximum likelihood estimation (MLE) has good properties, it requires the parametric family $p(\cdot; \theta)$ to be normalized during the calculation. NCE finds the estimator by maximizing an objective function (as in MLE) but without the need to normalize during the calculation, because it treats the normalization constant as just another parameter to be estimated. This is a desirable property, since the normalization constant may be difficult or expensive to calculate.
The idea is to pollute the sample with noise (data points that come from a known distribution) and to perform nonlinear logistic regression to discriminate between data and noise.
Consider a data sample $X = (x_1, \ldots, x_{T_d})$, a noise sample $Y = (y_1, \ldots, y_{T_n})$, and the union of the two samples $U = X \cup Y = (u_1, \ldots, u_{T_d + T_n})$. A binary class label $C_t$ is assigned to each $u_t$, where $C_t = 1$ if $u_t \in X$ and $C_t = 0$ if $u_t \in Y$.
By the definition of the class labels, $p(u \mid C = 1; \theta) = p(u; \theta)$ and $p(u \mid C = 0) = q(u)$, where $q$ denotes the known noise distribution. By the Bayes theorem,
$$h(u; \theta) := P(C = 1 \mid u; \theta) = \frac{p(u; \theta)}{p(u; \theta) + \nu\, q(u)} \quad \text{and} \quad P(C = 0 \mid u; \theta) = 1 - h(u; \theta) = \frac{\nu\, q(u)}{p(u; \theta) + \nu\, q(u)},$$
where $\nu = T_n / T_d$ is the noise-to-sample ratio. The log-likelihood function of the binary classification problem is derived as
$$\ell(\theta) = \sum_{t=1}^{T_d + T_n} \Big[ C_t \ln h(u_t; \theta) + (1 - C_t) \ln\big(1 - h(u_t; \theta)\big) \Big].$$
The NCE estimator is obtained by maximizing the objective function (log-likelihood) with respect to $\theta$:
$$\hat{\theta} = \arg\max_{\theta} \ell(\theta).$$
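As a concrete illustration, the following is a minimal Python sketch of this construction, assuming a 1-D Gaussian data distribution, a wider Gaussian noise distribution, and an unnormalized Gaussian model whose parameter vector includes a stand-in $c$ for the negative log normalization constant; the names (`log_p_unnorm`, `neg_log_likelihood`) are illustrative, not from any library.

```python
# NCE on a toy 1-D problem: classify data vs. noise, estimating the
# normalization constant as an extra parameter c (standing in for -log Z).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

T_d = 1000                                   # number of data samples
nu = 1.0                                     # noise-to-sample ratio T_n / T_d
T_n = int(nu * T_d)                          # number of noise samples

x = rng.normal(loc=2.0, scale=1.5, size=T_d)       # data from the unknown p
y = rng.normal(loc=0.0, scale=3.0, size=T_n)       # noise from the known q
u = np.concatenate([x, y])                         # union U of both samples
C = np.concatenate([np.ones(T_d), np.zeros(T_n)])  # binary class labels

def log_q(u):
    # log-density of the known noise distribution
    return norm.logpdf(u, loc=0.0, scale=3.0)

def log_p_unnorm(u, theta):
    # unnormalized Gaussian model; theta = (mu, log_sigma, c), where the
    # offset c is estimated like any other parameter instead of computing Z
    mu, log_sigma, c = theta
    return -0.5 * ((u - mu) / np.exp(log_sigma)) ** 2 + c

def neg_log_likelihood(theta):
    # h(u; theta) = sigmoid(G) with G = log p(u; theta) - log(nu * q(u))
    G = log_p_unnorm(u, theta) - (np.log(nu) + log_q(u))
    log_h = -np.logaddexp(0.0, -G)           # log h, computed stably
    log_1mh = -np.logaddexp(0.0, G)          # log(1 - h)
    return -np.sum(C * log_h + (1 - C) * log_1mh)

theta_hat = minimize(neg_log_likelihood, x0=np.zeros(3)).x
print("estimated mu, sigma:", theta_hat[0], np.exp(theta_hat[1]))
```

On data like the above, the maximizer should roughly recover the mean and standard deviation of the data-generating Gaussian, and the fitted offset approximates the negative log of its normalization constant.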
Examples
Word Embedding (Skip-Gram Model)
In the context of word embeddings, particularly the skip-gram model, the unknown distribution becomes $p(c \mid w; \theta)$, the probability of a context word $c$ given an input word $w$. Here $\theta$ represents the parameters of the word embedding model, namely the input and context vectors $v_w$ and $v_c$.
The noise distribution is $q(c) = \frac{\#(c)^{3/4}}{\sum_{c'} \#(c')^{3/4}}$, where $\#(c)$ is the count of word $c$ in the corpus; that is, the unigram distribution of contexts raised to the power $3/4$ to balance frequent and rare words.
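The sketch below, assuming a toy tokenized corpus, shows how this noise distribution can be built and sampled from; the variable names are illustrative.

```python
# Build q(c) proportional to count(c)^(3/4) and draw noise words from it.
import numpy as np
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"]  # toy corpus

counts = Counter(corpus)
vocab = sorted(counts)                            # fixed ordering of word types
freqs = np.array([counts[w] for w in vocab], dtype=float)

q = freqs ** 0.75                                 # dampen frequent words
q /= q.sum()                                      # normalize to a distribution

rng = np.random.default_rng(0)
negatives = rng.choice(len(vocab), size=5, p=q)   # indices of 5 noise words
print([vocab[i] for i in negatives])
```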
For each word-context pair $(w, c)$ from the true data, we sample $k$ negative examples $c_1, \ldots, c_k$ from $q$. The NCE objective function is derived as
$$\ell(\theta) = \sum_{(w, c) \in D} \Big[ \ln \sigma(v_c \cdot v_w) + \sum_{i=1}^{k} \ln \sigma(-v_{c_i} \cdot v_w) \Big],$$
where $D$ is the set of observed word-context pairs, $k$ is the number of noise samples per data sample, and $\sigma$ is the logistic sigmoid function.
In practice, $p(c \mid w; \theta)$ is often modeled as $e^{v_c \cdot v_w}$, without explicit normalization. By maximizing this objective, we obtain word embeddings (vectors $v_w$ and $v_c$) that can discriminate between true context words and random noise words, effectively capturing semantic relationships in the vector space.
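The following is a minimal sketch of one stochastic gradient step on this objective, assuming input and context embedding matrices `V` and `U_out` and negatives already drawn from the noise distribution (here drawn uniformly as a stand-in); the names `V`, `U_out`, and `sgd_step` are illustrative.

```python
# One negative-sampling update for a single (word, context) pair.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k, lr = 100, 16, 5, 0.1

V = rng.normal(scale=0.1, size=(vocab_size, dim))      # input vectors v_w
U_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors v_c

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, c, negs):
    # minimize -[log sigma(v_c . v_w) + sum_i log sigma(-v_{c_i} . v_w)]
    grad_w = np.zeros(dim)
    g = sigmoid(V[w] @ U_out[c]) - 1.0       # gradient of -log sigma(s)
    grad_w += g * U_out[c]
    U_out[c] -= lr * g * V[w]                # pull true context toward v_w
    for ci in negs:
        g = sigmoid(V[w] @ U_out[ci])        # gradient of -log sigma(-s)
        grad_w += g * U_out[ci]
        U_out[ci] -= lr * g * V[w]           # push noise words away from v_w
    V[w] -= lr * grad_w

negs = rng.integers(0, vocab_size, size=k)   # stand-in for samples from q
sgd_step(w=3, c=7, negs=negs)
```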