Word2Vec
Definition
The Word2Vec model learns vector representations of words that effectively capture the semantic relationships between them, using a large corpus of text.
Continuous Bag-Of-Words
Continuous bag-of-words (CBOW) predicts a target word given its context words. The model takes a window of context words around a target word, feeds them into the network, and tries to predict the target word. It is trained to maximize the probability of the target word given the context words.
Algorithm
- Generate one-hot encodings $x^{(c-m)}, \dots, x^{(c-1)}, x^{(c+1)}, \dots, x^{(c+m)}$ of the context words of window size $m$.
- Get embedded word vectors $v_i = W x^{(i)}$ using a linear transformation $W$, and take their average $\hat{v} = \frac{1}{2m}\sum_i v_i$, where the weight $W$ is shared across all context words.
- Generate a score vector $z = W' \hat{v}$ and turn the scores into probabilities $\hat{y} = \operatorname{softmax}(z)$ with the Softmax Function.
- Adjust the weights $W$ and $W'$ to match the result $\hat{y}$ to the one-hot encoding $y$ of the actual output word by minimizing the cross-entropy loss $H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log \hat{y}_j$, where $|V|$ is the dimension of the vectors and $y_j$ is the $j$-th element of the one-hot encoded output word vector (a code sketch of these steps follows).
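Below is a minimal numpy sketch of one CBOW forward pass under the notation above; the vocabulary size, window, and variable names (`W`, `W_out`) are illustrative and not from the original note.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                          # vocabulary size |V|, embedding dimension d
W = rng.normal(0, 0.1, (d, V))        # input embedding matrix, shared across context words
W_out = rng.normal(0, 0.1, (V, d))    # output (score) matrix W'

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_ids, target_id = [1, 2, 4, 5], 3          # window of context words around the target

X = np.stack([one_hot(i, V) for i in context_ids])  # 1. one-hot encode the context words
v_hat = (W @ X.T).mean(axis=1)                       # 2. embed with shared W and average
y_hat = softmax(W_out @ v_hat)                       # 3. score vector -> softmax probabilities
y = one_hot(target_id, V)
loss = -np.sum(y * np.log(y_hat))                    # 4. cross-entropy against the true word
print(loss)   # gradients of this loss would update W and W_out
```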
Skip-Gram
The Skip-gram model does the opposite of CBOW: it predicts the context words given a target word. The model takes a single target word as input, feeds it into the network, and tries to predict the surrounding context words. It is trained to maximize the probability of each context word given the target word.
Algorithm
- Generate the one-hot encoding $x$ of the input word.
- Get an embedded word vector $v = W x$ using a linear transformation $W$.
- Generate a score vector $z = W' v$ and turn the scores into probabilities $\hat{y} = \operatorname{softmax}(z)$ with the Softmax Function.
- Adjust the weights $W$ and $W'$ to match the result $\hat{y}$ to the $2m$ one-hot encodings of the actual output words by minimizing the loss $-\sum_{c=1}^{2m} \sum_{j=1}^{|V|} y^{(c)}_j \log \hat{y}_j$, where $|V|$ is the dimension of the vectors and $y^{(c)}_j$ is the $j$-th element of the one-hot encoded output $c$-th context word vector (a corresponding sketch follows).
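A corresponding sketch for the skip-gram direction, again with illustrative shapes and names: the single embedded target word is scored once, and the cross-entropy is summed over the context positions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
W = rng.normal(0, 0.1, (d, V))        # input embedding matrix W
W_out = rng.normal(0, 0.1, (V, d))    # output (score) matrix W'

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target_id, context_ids = 3, [1, 2, 4, 5]   # predict each context word from the target

v = W[:, target_id]                   # one-hot input just selects a column of W
y_hat = softmax(W_out @ v)            # one shared softmax for every context position
loss = -sum(np.log(y_hat[c]) for c in context_ids)   # cross-entropy summed over 2m contexts
print(loss)
```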
Noise Contrastive Estimation
Definition
Noise contrastive estimation (NCE) transforms the problem of density estimation into binary classification between data samples and noise samples. Given a sample of data points that follow an unknown probability distribution $p_{\theta^*}$ parametrized by $\theta^*$, NCE is used to find an estimator $\hat{\theta}$ that best approximates the true parameter $\theta^*$. Although MLE has good properties, it requires the parametric family to be normalized during the calculation. NCE finds the estimator by maximizing an objective function (as in MLE) but without needing the model to be normalized, since it treats the normalization constant as just another parameter to estimate. This is a desirable property, since the normalization constant may be difficult or expensive to calculate.
The idea is to pollute the data sample with noise, i.e. data points that come from a known distribution, and perform nonlinear logistic regression to discriminate between data and noise.
Consider a data sample $X = \{x_1, \dots, x_n\}$, a noise sample $Y = \{y_1, \dots, y_m\}$, and the union of the two samples $U = X \cup Y = \{u_1, \dots, u_{n+m}\}$. A binary class label $C_t$ is assigned to each $u_t$, where $C_t = 1$ if $u_t \in X$ and $C_t = 0$ if $u_t \in Y$.
By the definition of the labels, $p(u \mid C = 1) = p_\theta(u)$ and $p(u \mid C = 0) = q(u)$, where $q$ is the noise distribution. By the Bayes Theorem, $P(C = 1 \mid u) = \frac{p_\theta(u)}{p_\theta(u) + \nu q(u)}$ and $P(C = 0 \mid u) = \frac{\nu q(u)}{p_\theta(u) + \nu q(u)}$, where $\nu = m / n$ is the sample-noise ratio. The log-likelihood function of the binary classification problem is derived as $$\ell(\theta) = \sum_{t=1}^{n+m} \left[ C_t \log P(C_t = 1 \mid u_t; \theta) + (1 - C_t) \log P(C_t = 0 \mid u_t; \theta) \right].$$ The NCE estimator $\hat{\theta}$ is obtained by maximizing the objective function (log-likelihood) with respect to $\theta$.
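As a concrete illustration of this objective, here is a small numeric sketch in which the (hypothetical) model family is an unnormalized Gaussian whose log normalization constant is estimated as an extra parameter, and the noise is uniform; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.5, size=100)      # data sample, n = 100 (unknown Gaussian)
y = rng.uniform(-3, 5, size=200)        # noise sample, m = 200 (known uniform)
nu = len(y) / len(x)                    # sample-noise ratio m / n

def log_p_model(u, theta):
    # unnormalized Gaussian log-density; log_c plays the role of the log normalization constant
    mu, log_sigma, log_c = theta
    return -0.5 * ((u - mu) / np.exp(log_sigma)) ** 2 - log_c

def log_q_noise(u):
    return np.full_like(u, -np.log(8.0))        # uniform density on [-3, 5]

def nce_objective(theta):
    # P(C=1|u) = p_theta(u) / (p_theta(u) + nu*q(u)) = sigmoid(log p_theta(u) - log(nu*q(u)))
    logits_x = log_p_model(x, theta) - (np.log(nu) + log_q_noise(x))
    logits_y = log_p_model(y, theta) - (np.log(nu) + log_q_noise(y))
    return -np.log1p(np.exp(-logits_x)).sum() - np.log1p(np.exp(logits_y)).sum()

# Maximizing nce_objective over theta (e.g. by gradient ascent) gives the NCE estimator.
print(nce_objective(np.array([0.0, 0.0, 0.0])))
```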
Examples
Word Embedding (Skip-Gram Model)
In the context of word embeddings, particularly the Skip-gram model, the unknown distribution becomes $p_\theta(c \mid w)$, the probability of a context word $c$ given an input word $w$. The parameter $\theta$ represents the parameters of the word embedding model, i.e. the word vectors.
The noise distribution $q(w) \propto \operatorname{count}(w)^{3/4}$, where $\operatorname{count}(w)$ is the count of word $w$ in the corpus, is the unigram distribution of contexts raised to the power of $3/4$ to balance frequent and rare words.
For each word-context pair $(w, c)$ from the true data, we sample $k$ negative examples from $q$. The NCE objective function is derived as $$J(\theta) = \sum_{(w, c) \in D} \left[ \log P(C = 1 \mid c, w) + \sum_{i=1}^{k} \mathbb{E}_{\bar{c}_i \sim q} \log P(C = 0 \mid \bar{c}_i, w) \right],$$ where $D$ is the set of observed word-context pairs and $k$ is the number of noise samples per data sample.
In practice, $p_\theta(c \mid w)$ is often modeled as $\exp(u_c^\top v_w)$, without explicit normalization. By maximizing this objective, we obtain word embeddings (vectors $v_w$ and $u_c$) that can discriminate between true context words and random noise words, effectively capturing semantic relationships in the vector space.
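A sketch of this word-embedding variant, assuming the unnormalized score $\exp(u_c^\top v_w)$ and the unigram$^{3/4}$ noise distribution described above; the matrices `V_in` and `V_out` stand for the $v_w$ and $u_c$ vectors, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                      # vocab size, embedding dim, noise samples per pair
V_in = rng.normal(0, 0.1, (V, d))          # input (word) embeddings v_w
V_out = rng.normal(0, 0.1, (V, d))         # output (context) embeddings u_c

counts = rng.integers(1, 100, size=V).astype(float)
q = counts ** 0.75
q /= q.sum()                               # noise distribution: unigram^(3/4), normalized

def log_sigmoid(t):
    return -np.log1p(np.exp(-t))

def nce_pair_loss(w, c):
    noise = rng.choice(V, size=k, p=q)     # k negative context words drawn from q
    # NCE posterior with unnormalized score s(w, c) = u_c . v_w: sigmoid(s - log(k*q(c)))
    pos = log_sigmoid(V_out[c] @ V_in[w] - np.log(k * q[c]))
    neg = sum(log_sigmoid(-(V_out[c_] @ V_in[w] - np.log(k * q[c_]))) for c_ in noise)
    return -(pos + neg)                    # minimize the negative log-likelihood

print(nce_pair_loss(w=3, c=17))
```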
Negative Sampling
Definition
Negative sampling (NS) is a simplified version of NCE. It shares the core idea of transforming the problem of density estimation into binary classification between true data samples and noise samples.
In the context of word embeddings, particularly the Skip-gram model, NS aims to learn good word representations without the need to estimate a full probability distribution. The goal is to find parameters $\theta$ that best represent the relationship between words and their contexts. Given a word-context pair $(w, c)$ from the true data distribution, NS samples $k$ negative examples (noise) from a noise distribution $q$. The objective is to maximize the probability of the true pair while minimizing the probability of the noise pairs.
NS models the probability that a pair $(w, c)$ came from the true data with the Sigmoid Function: $$P(D = 1 \mid w, c) = \sigma(u_c^\top v_w) = \frac{1}{1 + e^{-u_c^\top v_w}}.$$
The NS objective function is derived as $$J(\theta) = \sum_{(w, c) \in D} \left[ \log \sigma(u_c^\top v_w) + \sum_{\bar{c} \in N} \log \sigma(-u_{\bar{c}}^\top v_w) \right],$$ where $N$ is the set of negative samples drawn from the noise distribution.
The NS objective function is similar to the Skip-gram objective but replaces the expensive Softmax Function with a simpler binary classification task between true and noise samples.
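For comparison with the NCE sketch earlier, a sketch of the NS loss for one (word, context) pair: the $\log k q(c)$ correction disappears and the raw score goes straight through the sigmoid. Names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5
V_in = rng.normal(0, 0.1, (V, d))          # word embeddings v_w
V_out = rng.normal(0, 0.1, (V, d))         # context embeddings u_c

def log_sigmoid(t):
    return -np.log1p(np.exp(-t))

def ns_pair_loss(w, c, noise_ids):
    pos = log_sigmoid(V_out[c] @ V_in[w])                               # true pair
    neg = sum(log_sigmoid(-(V_out[c_] @ V_in[w])) for c_ in noise_ids)  # noise pairs
    return -(pos + neg)

negatives = rng.integers(0, V, size=k)     # in practice drawn from the unigram^(3/4) distribution
print(ns_pair_loss(w=3, c=17, noise_ids=negatives))
```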
Global Vectors for Word Representation
Definition
The global vectors for word representation (GloVe) model learns word embedding vectors such that their Dot Product is proportional to the logarithm of the words' probability of co-occurrence.
$$w_i^\top \tilde{w}_j \approx \log X_{ij},$$ where $w_i$ and $\tilde{w}_j$ are word vectors, and $X_{ij}$ is the co-occurrence count between words $i$ and $j$.
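A sketch of the weighted least-squares objective GloVe builds on this relation; the bias terms and the weighting function $f$ (with $x_{\max} = 100$, $\alpha = 3/4$) follow the original GloVe paper, and the co-occurrence matrix here is random dummy data just to make the snippet run.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 200, 25
W = rng.normal(0, 0.1, (V, d))        # word vectors w_i
W_t = rng.normal(0, 0.1, (V, d))      # context word vectors w~_j
b, b_t = np.zeros(V), np.zeros(V)     # bias terms b_i, b~_j
X = rng.integers(0, 50, size=(V, V)).astype(float)   # dummy co-occurrence counts X_ij

def f(x, x_max=100.0, alpha=0.75):
    # down-weights rare pairs and caps the contribution of very frequent ones
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss():
    i, j = np.nonzero(X)                               # only pairs with X_ij > 0
    dot = np.sum(W[i] * W_t[j], axis=1)                # w_i . w~_j for each observed pair
    err = dot + b[i] + b_t[j] - np.log(X[i, j])
    return np.sum(f(X[i, j]) * err ** 2)

print(glove_loss())   # minimized with respect to W, W_t, b, b_t
```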
Attention
Definition
Attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence.
The attention function is formulated as $$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$ where Q (query) represents the current context, K (key) and V (value) represent the references, and $d_k$ is the dimension of the keys.
The attention value is a convex combination of the values, where each weight is proportional to the relevance between the query and the corresponding key.
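A minimal sketch of scaled dot-product attention as defined above (batch dimension omitted, shapes illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # each output row is a convex combination of the rows of V,
    # weighted by query-key relevance
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))    # 3 queries of dimension d_k = 8
K = rng.normal(size=(5, 8))    # 5 keys
V = rng.normal(size=(5, 16))   # 5 values of dimension 16
print(attention(Q, K, V).shape)   # (3, 16)
```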
Examples
Seq-to-Seq with RNN
Attention LSTM
Definition
Attention LSTM is a variant of the LSTM architecture incorporating the Attention mechanism. In a sequence-to-sequence setting, the model uses Attention in the decoding stage: the previous hidden state of the decoder LSTM cell is used as the query, and the hidden states of the encoder LSTM cells are used as the keys and values.
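A sketch of one such decoding step; dot-product scoring is assumed here for brevity (additive scoring is also common), and the LSTM cell update itself is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, h = 6, 32                                  # encoder length, hidden size
enc_states = rng.normal(size=(T, h))          # encoder hidden states: keys and values
dec_prev = rng.normal(size=h)                 # previous decoder hidden state: query

weights = softmax(enc_states @ dec_prev / np.sqrt(h))   # relevance of each encoder step
context = weights @ enc_states                # convex combination of encoder states
# context (together with the current input token) would be fed to the decoder LSTM cell
print(context.shape)                          # (32,)
```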
Transformer
Definition
The Transformer model uses self-attention, the Attention in which the Q, K, and V are derived from the same source, for sentences. The result vector of the self-attention reflects its context. Usually, self-attention is repeated multiple times to further contextualize.
Architecture
Self-Attention
The initial query (Q), key (K), and value (V) matrices are the result of linear transformations of the input sequence $X$: $Q = X W^Q$, $K = X W^K$, $V = X W^V$.
Multi-head Self-Attention
The Transformer uses multiple attention heads in parallel, like the channels in a CNN, allowing it to focus on different aspects of the input simultaneously. The output of multi-head attention is a concatenation of the outputs from the individual attention heads, followed by a linear transformation.
$$\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\operatorname{head}_1, \dots, \operatorname{head}_h) W^O, \quad \text{where } \operatorname{head}_i = \operatorname{Attention}(Q W_i^Q, K W_i^K, V W_i^V).$$
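A sketch of multi-head self-attention matching the formula above; the number of heads and the dimensions are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
T, d_model, n_heads = 5, 64, 8
d_k = d_model // n_heads
X = rng.normal(size=(T, d_model))                        # input sequence

# per-head projection matrices W_i^Q, W_i^K, W_i^V and the output projection W^O
W_q = rng.normal(0, 0.1, (n_heads, d_model, d_k))
W_k = rng.normal(0, 0.1, (n_heads, d_model, d_k))
W_v = rng.normal(0, 0.1, (n_heads, d_model, d_k))
W_o = rng.normal(0, 0.1, (n_heads * d_k, d_model))

heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]) for i in range(n_heads)]
out = np.concatenate(heads, axis=-1) @ W_o               # Concat(head_1..head_h) W^O
print(out.shape)                                         # (5, 64)
```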
Feed-Forward Layer
Each layer in the Transformer also contains a feed-forward layer applied to each position separately, i.e. there is no cross-token dependency. The linear transformations are the same across different positions within a layer, but differ from layer to layer.
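A sketch of this position-wise feed-forward layer; the two linear transformations with a ReLU in between follow the original Transformer formulation, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff = 5, 64, 256
X = rng.normal(size=(T, d_model))

W1, b1 = rng.normal(0, 0.1, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.1, (d_ff, d_model)), np.zeros(d_model)

# FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row (position) independently
ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2
print(ffn.shape)   # (5, 64)
```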
Positional Encoding
Since the Transformer doesn’t inherently capture sequence order, positional encodings are added to the input embeddings. These are typically sine and cosine functions of different frequencies:
$$\begin{aligned}
PE_{(\text{pos},2i)} &= \sin(\text{pos} / 10000^{2i/d_{\text{model}}})\\
PE_{(\text{pos},2i+1)} &= \cos(\text{pos} / 10000^{2i/d_{\text{model}}})
\end{aligned}$$
Masked Multi-Head Self-Attention
In the decoder, the self-attention layer is modified to prevent attending to later positions. This is achieved by masking future positions with negative infinity before the softmax step.
Encoder-Decoder Attention
The decoder has an additional attention layer that performs multi-head attention over the output of the encoder, where the query (Q) comes from the previous layer in the decoder, and the key (K) and value (V) come from the output of the encoder.
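To tie the two preceding pieces together, a sketch of the sinusoidal positional encodings and of the causal mask that implements masked self-attention (future positions set to negative infinity before the softmax); dimensions are illustrative.

```python
import numpy as np

def positional_encoding(T, d_model):
    pos = np.arange(T)[:, None]                    # positions 0..T-1
    i = np.arange(d_model // 2)[None, :]           # dimension index i
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe

def causal_mask(T):
    # 0 on and below the diagonal, -inf above: position t cannot attend to t+1, t+2, ...
    return np.triu(np.full((T, T), -np.inf), k=1)

T, d_model = 4, 8
print(positional_encoding(T, d_model).shape)       # (4, 8)
print(causal_mask(T))                              # added to the attention scores before softmax
```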
GPT
Definition
GPT is based on the Transformer architecture, specifically using only the decoder portion of the original Transformer model. It utilizes self-attention mechanisms to process input sequences.
GPT-1
Next Token Prediction
GPT-1 was pre-trained on a large corpus of unlabeled text, using a semi-supervised approach (generative pre-training followed by supervised fine-tuning) to predict the next token in a sequence. This pre-training allowed the model to learn general language patterns and representations (vector representations of words).
GPT-2
GPT-2 was trained on a much larger dataset. It performs tasks without task-specific fine-tuning, demonstrating strong zero-shot learning capabilities.
In the GPT-2 model, the Layer Normalization is moved to the input of each sub-block, similar to the pre-activation variant of ResNet.
GPT-3
GPT-3 scales the GPT-2 architecture up dramatically, with a larger dataset. The research shows the effectiveness of few-shot learning.
GPT-3 alternates between dense and locally banded sparse attention patterns across the layers of the Transformer.
BERT
Definition
The BERT model adds a special [CLS] token to the input and uses its embedding as the aggregate sequence representation. The model learns word embeddings by solving the masked token prediction and next sentence prediction problems.
Tasks
Masked Language Modeling
Predicting the masked (hidden) words in the input using the surrounding context.
Next Sentence Prediction
A binary classification problem, predicting if the two sentences in the input are consecutive or not.