Definition

The Word2Vec model learns vector representations of words that effectively capture the semantic relationships between them, using a large corpus of text.
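
As a concrete illustration, here is a minimal sketch that trains a Word2Vec model with the Gensim library and queries the learned vectors; the library choice and the toy corpus are assumptions for illustration, not part of the original text.

```python
# Minimal sketch using Gensim (assumed here; any Word2Vec implementation
# could be substituted). The toy corpus below is hypothetical.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; vector_size is the embedding dimension.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["cat"])               # the learned 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))  # words whose vectors lie closest to "cat"
```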
Continuous Bag-Of-Words

Continuous bag-of-words (CBOW) predicts a target word given its context words. The model takes a window of context words around a target word, feeds them into the network, and tries to predict the target word. It is trained to maximize the probability of the target word given the context words. For example, in the sentence "the cat sat on the mat" with a window size of 2, the context words {the, cat, on, the} are used to predict the target word "sat".
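
To make the windowing concrete, the sketch below builds the (context, target) training pairs that CBOW consumes; the tokenized sentence and window size are illustrative assumptions.

```python
# Sketch: build (context, target) pairs for CBOW from a toy token list.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
window = 2  # number of context words on each side of the target

pairs = []
for i, target in enumerate(tokens):
    # Collect up to `window` words on each side of the target, skipping the target itself.
    context = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
    pairs.append((context, target))

# e.g. (['the', 'cat', 'on', 'the'], 'sat') for the target at position 2
for context, target in pairs:
    print(context, "->", target)
```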
Algorithm
- Generate one-hot encodings of the context words for a window of size $m$ (the $2m$ words surrounding the target).
- Get embedded word vectors using a linear transformation with the weight matrix $W$, which is shared across all context words, and take their average $\hat{v}$.
- Generate a score vector with the output weight matrix $W'$ and turn the scores into probabilities $\hat{y}$ with the softmax function.
- Adjust the weights $W$ and $W'$ to match the result to the one-hot encoding $y$ of the actual output word by minimizing the cross-entropy loss $H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$, where $|V|$ is the dimension of the vectors and $y_j$ is the $j$-th element of the one-hot encoded output word vector; see the sketch below.
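
The following NumPy sketch mirrors these steps for a single training example. The vocabulary size, embedding dimension, word indices, and randomly initialized weights $W$ and $W'$ are illustrative assumptions; the training loop that actually adjusts the weights is omitted.

```python
import numpy as np

V, N, m = 10, 8, 2           # vocabulary size, embedding dimension, window size (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))      # input (embedding) weights
W_out = rng.normal(scale=0.1, size=(N, V))  # output weights W'

context_ids = [1, 3, 5, 7]   # indices of the 2m context words (hypothetical)
target_id = 4                # index of the target word (hypothetical)

# One-hot encode the context words and embed them with the shared weight W.
x = np.zeros((len(context_ids), V))
x[np.arange(len(context_ids)), context_ids] = 1.0
v_hat = (x @ W).mean(axis=0)  # average of the embedded context vectors

# Score vector and softmax probabilities.
z = v_hat @ W_out
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()

# Cross-entropy loss against the one-hot target; gradients of this loss
# would be used to adjust W and W' during training.
loss = -np.log(y_hat[target_id])
print(loss)
```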
Skip-Gram

The Skip-gram model does the opposite of CBOW: it predicts the context words given a target word. The model takes a single target word as input, feeds it into the network, and tries to predict the surrounding context words. It is trained to maximize the probability of each context word given the target word.
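
Mirroring the CBOW example above, the sketch below builds the (target, context) pairs that Skip-gram trains on; the sentence and window size are again illustrative assumptions. Each target word is paired with every word in its window.

```python
# Sketch: build (target, context) pairs for Skip-gram from a toy token list.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
window = 2  # number of context words on each side of the target

pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

# e.g. ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the') for position 2
for target, context in pairs:
    print(target, "->", context)
```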
Algorithm
- Generate the one-hot encoding of the input target word.
- Get an embedded word vector using a linear transformation with the weight matrix $W$.
- Generate a score vector with the output weight matrix $W'$ and turn the scores into probabilities $\hat{y}$ with the softmax function.
- Adjust the weights $W$ and $W'$ to match the result to the $2m$ one-hot encodings of the actual output context words by minimizing the loss $-\sum_{c=1}^{2m} \sum_{j=1}^{|V|} y_{c,j} \log(\hat{y}_j)$, where $|V|$ is the dimension of the vectors and $y_{c,j}$ is the $j$-th element of the one-hot encoded $c$-th output context word vector; see the sketch below.
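
The NumPy sketch below follows the same steps for a single target word. As before, the vocabulary size, embedding dimension, word indices, and random weights are illustrative assumptions; note that the predicted distribution $\hat{y}$ is shared across all context positions, and only the loss sums over them.

```python
import numpy as np

V, N = 10, 8                 # vocabulary size and embedding dimension (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))      # input (embedding) weights
W_out = rng.normal(scale=0.1, size=(N, V))  # output weights W'

target_id = 4                # index of the input target word (hypothetical)
context_ids = [1, 3, 5, 7]   # indices of the 2m context words (hypothetical)

# One-hot encode the target word and embed it.
x = np.zeros(V)
x[target_id] = 1.0
v = x @ W

# Score vector and softmax probabilities, shared by every context position.
z = v @ W_out
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()

# The loss sums the cross-entropy over all context words; gradients of this
# loss would be used to adjust W and W' during training.
loss = -sum(np.log(y_hat[c]) for c in context_ids)
print(loss)
```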