From 287a20c8dddc0accede9a036458055ff2d1bf1de Mon Sep 17 00:00:00 2001
From: m-misiura
Date: Fri, 28 Jul 2023 09:17:54 +0100
Subject: [PATCH] added index.html

---
 index.html | 2532 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 2532 insertions(+)
 create mode 100644 index.html

diff --git a/index.html b/index.html
new file mode 100644
index 0000000..6f9540b
--- /dev/null
+++ b/index.html
@@ -0,0 +1,2532 @@
+ Introduction to BERT
+
+ +
+

Introduction to BERT

+

NICD

+

2022-01-25

+
+ +
+

Content

+
  • Part 1: BERT Basics and Applications
  • Part 2: BERT Architecture
  • Part 3: Pre-Training Tasks
  • Part 4: BERT Variants
+ +
+
+

Part 1: BERT Basics and Applications

+ +
+
+

Introduction

+

Bidirectional Encoder Representations from Transformers (BERT) is one of the most influential natural language processing models to date

+

It can be used for a variety of tasks including:

+
  • token classification (e.g. named entity recognition or question answering)
  • text classification (e.g. sentiment analysis)
  • text-pair classification (e.g. sentence similarity)
+
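To make the task list above concrete, here is a minimal sketch using the Hugging Face transformers library; the default fine-tuned BERT-family checkpoints that the pipelines download and the example sentences are assumptions, not part of the slides.

```python
# A sketch of two of the task types above, using Hugging Face pipelines.
from transformers import pipeline

# Text classification (sentiment analysis)
classifier = pipeline("sentiment-analysis")
print(classifier("BERT embeddings made our search engine noticeably better."))

# Token classification (named entity recognition)
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Bert and Ernie live on Sesame Street."))
```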
+

What can't BERT do?

+

BERT is not suitable for the following tasks:

+
  • text generation
  • machine translation
  • text summarisation

These are tasks which require decoder or encoder-decoder architectures. BERT is an encoder-only architecture.

+
+
+

Why is BERT useful?

+

All downstream natural language processing tasks benefit from improved word embeddings

+

BERT-style architectures produce state-of-the-art word embeddings since they can be trained on large volumes of data and yield contextual word representations

+
+
+

BERT Illustrated

+

At a high level, BERT takes the following structure:

+ +
+
+
+

Part 2: BERT Architecture

+ +
+
+

Self-Attention

+

Self-attention is the primary mechanism used to enhance word embeddings with context. To understand it, we will explore a more complex sentence.

+ +
+
+

Bert or Ernie?

+ +

This sentence includes the word “he”. Question: Does “he” refer to Bert or Ernie? The answer should inform how the word embedding for “he” is enhanced.

+
+
+

Weighted-Average

+

Producing an enhanced word embedding for “he” with self-attention involves taking a weighted average of the word embeddings. The weights indicate how important the context words are to the enhancement. For example, we would expect the word “Bert” to have a larger weight than “Ernie” in the previous example.

+ +
+
+

Enhanced Word Embedding

+
  • \(n\) is the number of words in the sentence
  • \(i\) is the position of the word in the sentence
  • \(\mathbf{x}_i\in\mathbb{R}^{768}\) is an embedding for the word in position \(i\)
  • \(w_i\) is a weight for the word in position \(i\)

Enhanced word embedding:

+

\[ \mathbf{x}=\sum_{i=1}^n w_i\mathbf{x}_i \]

+
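As a small numerical sketch of this weighted average (the 768-dimensional embeddings and the weights below are made up; only the shapes and the formula come from the slide):

```python
import numpy as np

n, dim = 5, 768                               # 5 words, 768-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(n, dim))                 # x_1, ..., x_n as rows
w = np.array([0.05, 0.6, 0.05, 0.25, 0.05])   # weights w_i, summing to 1

x_enhanced = w @ X                            # x = sum_i w_i * x_i
print(x_enhanced.shape)                       # (768,)
```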
+
+

Values, Keys and Queries

+

Before we go into any more detail, we first have to introduce some matrices:

+
  • \(X\in\mathbb{R}^{n\times 768}\): rows are input word embeddings
  • \(V=X\Theta_V\in\mathbb{R}^{n\times d}\): rows are value word embeddings
  • \(K=X\Theta_K\in\mathbb{R}^{n\times d}\): rows are key word embeddings
  • \(Q=X\Theta_Q\in\mathbb{R}^{n\times d}\): rows are query word embeddings

where \(\Theta_V\), \(\Theta_K\) and \(\Theta_Q\) are \(768\times d\) parameter matrices. These are often referred to as projection matrices! Note: for now we assume \(d=768\).

+
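A shape-level sketch of these projections, with random matrices standing in for the learned parameters and \(d=768\) as assumed above:

```python
import numpy as np

n, d = 5, 768
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 768))          # rows are input word embeddings

Theta_V = rng.normal(size=(768, d))    # projection (parameter) matrices
Theta_K = rng.normal(size=(768, d))
Theta_Q = rng.normal(size=(768, d))

V, K, Q = X @ Theta_V, X @ Theta_K, X @ Theta_Q   # value, key and query embeddings
print(V.shape, K.shape, Q.shape)       # (5, 768) each
```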
+
+

Value Projection

+ +
+
+

Key Projection

+ +
+
+

Query Projection

+ +
+
+

Self-Attention

+ +
+
+

Self-Attention Equation

+

Let \(\mathbf{q}\) be a query embedding (row vector) for the word embedding \(\mathbf{x}\). The corresponding enhanced word embedding is:

+

\[ \text{softmax}\left(\frac{\mathbf{q}K^\top}{\sqrt{d}}\right)V \]

+
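In code, the equation for a single query reads as the following sketch (random key, value and query embeddings stand in for the projected matrices defined earlier):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

n, d = 5, 768
rng = np.random.default_rng(0)
K = rng.normal(size=(n, d))               # key embeddings (rows)
V = rng.normal(size=(n, d))               # value embeddings (rows)
q = rng.normal(size=(d,))                 # query embedding for one word

weights = softmax(q @ K.T / np.sqrt(d))   # attention weights over the n words
enhanced = weights @ V                    # enhanced embedding for that word
print(weights.sum(), enhanced.shape)      # ~1.0 (768,)
```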

Why \(\sqrt{d}\)?

+
+
+

The Square Root

+

If \(X=[X_1,\dots,X_d]\) and \(Y=[Y_1,\dots,Y_d]\) are constructed from independent random variables where \(\mathbb{E}[X_i]=\mathbb{E}[Y_i]=0\) and \(\mathbb{V}[X_i]=\mathbb{V}[Y_i]=1\) for all \(i\) then:

+

\[ \mathbb{V}[XY^\top]=\sum_{i=1}^d\mathbb{V}[X_iY_i]=\sum_{i=1}^d\mathbb{V}[X_i]\mathbb{V}[Y_i]=d \]

+

Consequently \(\mathbb{V}[XY^\top/\sqrt{d}]=1\). This is the trick used in the self-attention equation to stabilise the variance and gradients.

+
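A quick simulation of this argument (the standard normal distribution is just one choice satisfying the mean-zero, unit-variance assumptions):

```python
import numpy as np

d, trials = 768, 10_000
rng = np.random.default_rng(0)
X = rng.standard_normal(size=(trials, d))   # independent entries, mean 0, variance 1
Y = rng.standard_normal(size=(trials, d))

dots = np.sum(X * Y, axis=1)                # X Y^T for each trial
print(dots.var())                           # close to d = 768
print((dots / np.sqrt(d)).var())            # close to 1
```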
+
+

Feed-Forward Neural Network

+

BERT has 12 layers and each layer is a combination of self-attention and a feed-forward neural network.

+ +
+
+

Feed-Forward Neural Network

+ +

The feed-forward neural networks have a single hidden layer of dimension 3,072 (four times the embedding dimension)

+
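A sketch of one such feed-forward block in PyTorch; the GELU activation is the one used in BERT, and the bias terms included here are ignored in the parameter counts on the next slide:

```python
import torch
import torch.nn as nn

hidden_size, intermediate_size = 768, 3072      # four times the embedding dimension

feed_forward = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),  # hidden layer
    nn.GELU(),                                  # BERT uses the GELU activation
    nn.Linear(intermediate_size, hidden_size),  # output layer
)

x = torch.randn(1, 5, hidden_size)   # (batch, sequence length, embedding dimension)
print(feed_forward(x).shape)         # torch.Size([1, 5, 768])
```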
+
+

Number of Parameters per Layer

+
  • The self-attention component consists of three projection matrices, each \(768\times768\), for a total of \(3\times768\times768=1,769,472\) parameters.
  • The feed-forward neural network consists of the hidden-layer and output-layer weights, both \(768\times3,072\), for a total of \(2\times768\times3,072=4,718,592\) weights.
  • This means that the feed-forward neural network is roughly 2.67x larger than the self-attention component! (A quick check of these numbers follows below.)
+
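A quick check of these counts (biases ignored, as in the bullet points above):

```python
d_model, d_ff = 768, 3072

self_attention = 3 * d_model * d_model          # three projection matrices
feed_forward = 2 * d_model * d_ff               # hidden-layer and output-layer weights

print(f"{self_attention:,}")                    # 1,769,472
print(f"{feed_forward:,}")                      # 4,718,592
print(f"{feed_forward / self_attention:.2f}x")  # 2.67x
```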
+

Multi-Headed Attention

+

In our first example of self-attention we discussed how the word embedding for “he” could be enhanced with the word embedding for “Bert”, to which it refers. Self-attention that captures references is just one type of self-attention. For example:

+ +

is a self-attention mechanism that captures adjectives.

+
+
+

Multi-Headed Attention

+
  • In BERT each self-attention component in each layer has 12 heads. That is, each layer has 12 self-attention components that capture 12 different aspects of attention (e.g., adjectives, references).
  • However, with 12 self-attention components we have a \(12\times 768=9,216\) dimensional output rather than a \(768\) dimensional output. This gets reduced with another parameter matrix.
+
+

Multi-Headed Attention

+ +
+
+

Recombining Self-Attention

+ +
+
+

Number of Parameters

+
  • With 12 self-attention components in each layer we have increased the number of parameters by a factor of 12
  • Additionally we have added a recombination component with \(9,216\times768=7,077,888\) parameters
  • To help with this explosion of parameters BERT reduces the head dimension to \(d=64\)
  • This means no recombination matrix is required! (See the sketch after this list.)
+
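A shape-level sketch of the reduced-head arrangement, with random matrices standing in for learned parameters: 12 heads of dimension 64 concatenate back to 768, so in this simplified picture no recombination matrix is needed.

```python
import numpy as np

n, d_model, n_heads = 5, 768, 12
d_head = d_model // n_heads                       # 64
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))                 # input word embeddings

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):
    # each head has its own reduced query, key and value projections (768 x 64)
    Q = X @ rng.normal(size=(d_model, d_head))
    K = X @ rng.normal(size=(d_model, d_head))
    V = X @ rng.normal(size=(d_model, d_head))
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)

output = np.concatenate(heads, axis=-1)           # (5, 12 * 64) = (5, 768)
print(output.shape)
```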
+

Full-Size Attention Head

+ +
+
+

Reduced Attention Head

+ +
+
+

Other Details

+

Beyond the main architectural components that we have covered, there are other details that we have not discussed:

+
  1. Layer Normalisation
  2. Residual Connections
  3. Dropout

These are located throughout the model to improve training.

+
+
+
+

Part 3: Pre-Training Tasks

+ +
+
+

BERT

+
  • Published in 2018, Bidirectional Encoder Representations from Transformers (BERT) is considered to be the first deeply bidirectional, unsupervised (contextual) language representation model pre-trained using only a plain text corpus (BookCorpus and Wikipedia)
  • BERT’s novelty lies in the way it was pre-trained:
    • masked language model (MLM) – randomly mask some of the tokens from the input and predict the original vocabulary id of the masked token
    • next sentence prediction (NSP) – predict if the two sentences were following each other or not
+
+

Illustration of MLM and NSP

+

Consider the following passage of text:

+
+

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks

+
+
  • MLM: predict the crossed-out word (“simple”); a fill-mask sketch follows below
  • NSP: was sentence B found immediately after sentence A or taken from somewhere else?
+
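The MLM half of this illustration can be reproduced with a fill-mask pipeline; the bert-base-uncased checkpoint and the number of predictions shown are assumptions, not part of the slides.

```python
from transformers import pipeline

# Predict the masked ("crossed out") word with a pre-trained BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "BERT is conceptually [MASK] and empirically powerful."
for prediction in fill_mask(sentence, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```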
+

Pre-trained representations

+
  • Pre-trained representations in language models can either be context-free or contextual
  • Context-free models, such as word2vec and GloVe, generate a single word embedding representation for each word in the vocabulary
  • In contrast, contextual models generate a representation of each word based on the other words in the sentence
  • Contextual representations can be either:
    • unidirectional – context is conditional upon preceding words
    • bidirectional – context is conditional on both preceding and following words
+
+

Masked Language Modelling

+ +

Around 12% of tokens end up replaced by [MASK]: 15% of tokens are selected for prediction and 80% of those are masked out (0.15 × 0.8 = 0.12).

+
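A toy sketch of that corruption step, assuming the 15% selection rate and the 80/10/10 (mask / random / keep) split from the BERT paper; the token list and mini-vocabulary are made up:

```python
import random

random.seed(0)
tokens = "bert is conceptually simple and empirically powerful".split()
vocabulary = ["the", "model", "language", "simple", "powerful", "strong"]

corrupted, targets = [], {}
for position, token in enumerate(tokens):
    if random.random() < 0.15:                 # selected for prediction
        targets[position] = token
        roll = random.random()
        if roll < 0.8:
            corrupted.append("[MASK]")         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(random.choice(vocabulary))  # 10%: random token
        else:
            corrupted.append(token)            # 10%: keep the original token
    else:
        corrupted.append(token)

print(corrupted, targets)
```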
+
+

Other Predictions

+ +
+
+

Other Predictions

+ +
+
+

Full Example

+ +
+
+

Next Sentence Prediction

+ +
+
+

Training

+

In BERT, MLM and NSP are trained concurrently.

+

“The training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood”

+
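A sketch of what summing the two losses means in practice, with made-up logits and standard cross-entropy; this illustrates the idea rather than reproducing BERT's actual training code, and the NSP label convention is an assumption here:

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()
vocab_size, num_masked = 30522, 4        # 30,522 is BERT's WordPiece vocabulary size

# Made-up model outputs: logits over the vocabulary for each masked position,
# plus a single is-next / not-next logit pair for the sentence pair.
mlm_logits = torch.randn(num_masked, vocab_size)
mlm_labels = torch.randint(0, vocab_size, (num_masked,))
nsp_logits = torch.randn(1, 2)
nsp_labels = torch.tensor([1])           # assume 1 means "not the next sentence"

loss = cross_entropy(mlm_logits, mlm_labels) + cross_entropy(nsp_logits, nsp_labels)
print(loss)
```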
+
+
+

Part 4: BERT Variants

+ +
+
+

BERT and Derivatives

+

BERT has a number of descendant models, including:

+
  • ALBERT, which performs n-gram MLM and uses Sentence Order Prediction instead of NSP
  • RoBERTa and XLNet, which remove NSP
  • ELECTRA, which replaces MLM with detecting plausible, generated replacement tokens
  • DistilBERT, which uses knowledge distillation to reduce the size of BERT
+
+

BERT and Sentence Length

+

The self-attention equation involves calculating \(QK^\top\) which requires \(\mathcal{O}(n^2)\) computation and memory. This limits the lengths of sentences that can be contextually embedded with BERT. There are a number of ways to reduce these costs and they often exploit sparsity.

+ +
+
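To get a feel for the quadratic cost, the sketch below counts the entries of the \(n\times n\) attention-weight matrix (one head, one layer, float32) at BERT's 512-token limit and at the 4,096-token length mentioned on the next slide:

```python
for n in (512, 4096):
    entries = n * n                       # size of the Q K^T matrix
    megabytes = entries * 4 / 1024**2     # float32 entries, one head, one layer
    print(f"n = {n:5d}: {entries:>12,} entries, {megabytes:8.1f} MiB")
```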
+

BERT and Sentence Length

+

+

BERT-based architectures such as Longformer and BigBird exploit sparsity to reduce computational and memory cost. Whereas BERT can process sentences that are 512 tokens long, Longformer and BigBird can process sentences that are 4,096 tokens long.

+

+ +
+
+
\ No newline at end of file