diff --git a/404.html b/404.html index 6899e38..fdce8bb 100644 --- a/404.html +++ b/404.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

404: File not found

The requested file was not found.

Please click here to go to the home page, or have a look at the website modules below.

Modules

\ No newline at end of file + Dataflowr - Deep Learning DIY

404: File not found

The requested file was not found.

Please click here to go to the home page, or have a look at the website modules below.

Modules

\ No newline at end of file diff --git a/homework/1-mlp-from-scratch/index.html b/homework/1-mlp-from-scratch/index.html index 53545c8..f0c793f 100644 --- a/homework/1-mlp-from-scratch/index.html +++ b/homework/1-mlp-from-scratch/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Homework 1: MLP from scratch

Homework 1 is in the form of a Jupyter notebook. You must complete it and submit it on Moodle (for students enrolled in this course).

The Jupyter notebook

This homework will run fine on a regular CPU (no need for a GPU). If you want to run it locally (on your laptop), you can follow the procedure described in Module 0. Note that if you cloned the GitHub repository, the homework will be in the folder /notebooks/HW1.

\ No newline at end of file + Dataflowr - Deep Learning DIY

Homework 1: MLP from scratch

Homework 1 is in the form of a Jupyter notebook. You must complete it and submit it on Moodle (for students enrolled in this course).

The Jupyter notebook

This homework will run fine on a regular CPU (no need for a GPU). If you want to run it locally (on your laptop), you can follow the procedure described in Module 0. Note that if you cloned the GitHub repository, the homework will be in the folder /notebooks/HW1.

\ No newline at end of file diff --git a/homework/2-CAM-adversarial/index.html b/homework/2-CAM-adversarial/index.html index 34a248a..3eb00f9 100644 --- a/homework/2-CAM-adversarial/index.html +++ b/homework/2-CAM-adversarial/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Homework 2: Class Activation Map and adversarial examples

Can you see the cat below? No? Have a look at the code ;-)

Homework 2 is in the form of a Jupyter notebook. You must complete it and submit it on Moodle (for students enrolled in this course).

The Jupyter notebook

This homework will run fine on a regular CPU (no need for a GPU). If you want to run it locally (on your laptop), you can follow the procedure described in Module 0. Note that if you cloned the GitHub repository, the homework will be in the folder /notebooks/HW2.

\ No newline at end of file + Dataflowr - Deep Learning DIY

Homework 2: Class Activation Map and adversarial examples

Can you see the cat below? No? Have a look at the code ;-)

Homework 2 is in the form of a Jupyter notebook. You must complete it and submit it on Moodle (for students enrolled in this course).

The Jupyter notebook

This homework will run fine on a regular CPU (no need for a GPU). If you want to run it locally (on your laptop), you can follow the procedure described in Module 0. Note that if you cloned the GitHub repository, the homework will be in the folder /notebooks/HW2.

\ No newline at end of file diff --git a/homework/3-VAE/index.html b/homework/3-VAE/index.html index 61d3289..3d241c7 100644 --- a/homework/3-VAE/index.html +++ b/homework/3-VAE/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Homework 3: VAE for MNIST clustering and generation

Image source

Homework 3 is in the form of a Jupyter notebook. You must complete it and submit it on Moodle (for students enrolled in this course).

The Jupyter notebook

\ No newline at end of file + Dataflowr - Deep Learning DIY

Homework 3: VAE for MNIST clustering and generation

Image source

Homework 3 is in the form of a Jupyter notebook. You must complete it and submit it on Moodle (for students enrolled in this course).

The Jupyter notebook

\ No newline at end of file diff --git a/index.html b/index.html index a8f10a9..8bdf4dc 100644 --- a/index.html +++ b/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Deep Learning Do It Yourself!

This site collects resources to learn Deep Learning in the form of Modules available through the sidebar on the left. As a student, you can walk through the modules at your own pace and interact with others thanks to the associated Discord server. You don’t need any special hardware or software.

Practical deep learning course

The main goal of the course is to allow students to understand papers, blog posts and code available online, and to adapt them to their projects as soon as possible. In particular, we avoid the use of any high-level neural network API and focus on the PyTorch library in Python.

The course is divided into sessions (possibly containing several modules), each session requiring a significant amount of coding. At the end of this course, students were able to read very recent papers and reproduce (or even improve on) their experiments.

All the code used in this course is available on the GitHub repository dataflowr/notebooks. You will find the solutions to the practicals on this repo! You can fork the repo if you want to run the code locally: GitHub Docs about fork then follow the steps in Module 0. Most of the code will not require a GPU.

⚠ When a GPU is required, you can launch the code on Colab by following the corresponding link given in the module (see for example Module 1).

Pre-requisites:

🌻 Session 1 - Finetuning VGG

Start right away and train a deep neural network on a GPU with Module 1 - Introduction & General Overview

Be sure to build your own classifier with more dogs and cats in the practicals.

Things to remember

  • you do not need to understand everything to run a deep learning model! But the main goal of this course will be to come back to each of the steps done today and understand them...

  • to use the dataloader from PyTorch, you need to follow its API (e.g., for classification, store your dataset in folders)

  • using a pretrained model and modifying it to adapt it to a similar task is easy.

  • if you do not understand why we take this loss, that's fine, we'll cover that in Module 3.

  • even with a GPU, avoid unnecessary computations!

🌻 Session 2 - PyTorch tensors and Autodiff

Things to remember
  • PyTorch tensors = NumPy on GPU + gradients!

  • in deep learning, broadcasting is used everywhere. The rules are the same as for NumPy.

  • Automatic differentiation is not only the chain rule! The backpropagation algorithm (or dual numbers) is a clever way to implement automatic differentiation...

🌻 Session 3

Things to remember
  • Loss vs Accuracy. Know your loss for a classification task!

  • know your optimizer (Module 4)

  • know how to build a neural net with torch.nn.Module (Module 5)

  • know how to use convolution and pooling layers (kernel, stride, padding)

  • know how to use dropout

🌻 Session 4

Things to remember
  • know how to use dataloader

  • to deal with categorical variables in deep learning, use embeddings

  • in the case of word embeddings, starting from an unsupervised setting, we built a supervised task (namely, predicting central/context words in a window) and learned the representation thanks to negative sampling

  • know your batchnorm

  • architectures with skip connections allow deeper models

🌻 Session 5

🌻 Session 6

🌻 Session 7

🌻 Session 8

🌻 Session 9

For more updates: Twitter URL

and check the

GitHub repository: dataflowr/notebooks

Curators

Marc Lelarge, Andrei Bursuc with Jill-Jênn Vie

Course in a hurry

Super fast track to learn the basics of deep learning from scratch:

For contributors

Join the GitHub repo dataflowr and make a pull request. What are pull requests?

Thanks to Daniel Huynh, Eric Daoud, Simon Coste

Materials from this site are used for courses at ENS and X.

\ No newline at end of file + Dataflowr - Deep Learning DIY

Deep Learning Do It Yourself!

This site collects resources to learn Deep Learning in the form of Modules available through the sidebar on the left. As a student, you can walk through the modules at your own pace and interact with others thanks to the associated Discord server. You don’t need any special hardware or software.

Practical deep learning course

The main goal of the course is to allow students to understand papers, blog posts and code available online, and to adapt them to their projects as soon as possible. In particular, we avoid the use of any high-level neural network API and focus on the PyTorch library in Python.

The course is divided into sessions (possibly containing several modules), each session requiring a significant amount of coding. At the end of this course, students were able to read very recent papers and reproduce (or even improve on) their experiments.

All the code used in this course is available on the GitHub repository dataflowr/notebooks. You will find the solutions to the practicals on this repo! You can fork the repo if you want to run the code locally: GitHub Docs about fork then follow the steps in Module 0. Most of the code will not require a GPU.

⚠ When a GPU is required, you can launch the code on Colab by following the corresponding link given in the module (see for example Module 1).

Pre-requisites:

🌻 Session 1 - Finetuning VGG

Start right away and train a deep neural network on a GPU with Module 1 - Introduction & General Overview

Be sure to build your own classifier with more dogs and cats in the practicals.

Things to remember

  • you do not need to understand everything to run a deep learning model! But the main goal of this course will be to come back to each of the steps done today and understand them...

  • to use the dataloader from PyTorch, you need to follow its API (e.g., for classification, store your dataset in folders); see the short example after this list

  • using a pretrained model and modifying it to adapt it to a similar task is easy.

  • if you do not understand why we take this loss, that's fine, we'll cover that in Module 3.

  • even with a GPU, avoid unnecessary computations!
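
For instance, here is a minimal sketch of that folder-based API with torchvision's ImageFolder and a DataLoader; the paths data/train/cats and data/train/dogs are hypothetical and only illustrate the expected layout.

import torch
from torchvision import datasets, transforms

# ImageFolder infers the class labels from the sub-folder names (cats/, dogs/, ...)
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("data/train", transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

for images, labels in train_loader:
    print(images.shape, labels.shape)  # e.g. torch.Size([32, 3, 224, 224]) torch.Size([32])
    break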

🌻 Session 2 - PyTorch tensors and Autodiff

Things to remember
  • PyTorch tensors = NumPy on GPU + gradients!

  • in deep learning, broadcasting is used everywhere. The rules are the same as for NumPy (see the small example after this list).

  • Automatic differentiation is not only the chain rule! The backpropagation algorithm (or dual numbers) is a clever way to implement automatic differentiation...

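A small example of these shared broadcasting rules (nothing here is specific to the course notebooks):

import numpy as np
import torch

a = torch.arange(3).view(3, 1)   # shape (3, 1)
b = torch.arange(4).view(1, 4)   # shape (1, 4)
print((a + b).shape)             # torch.Size([3, 4]): both tensors are virtually expanded

# the same rule applies in NumPy
print((np.arange(3).reshape(3, 1) + np.arange(4).reshape(1, 4)).shape)  # (3, 4)
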
🌻 Session 3

Things to remember
  • Loss vs Accuracy. Know your loss for a classification task!

  • know your optimizer (Module 4)

  • know how to build a neural net with torch.nn.Module (Module 5)

  • know how to use convolution and pooling layers (kernel, stride, padding)

  • know how to use dropout

🌻 Session 4

Things to remember
  • know how to use dataloader

  • to deal with categorical variables in deep learning, use embeddings (see the sketch after this list)

  • in the case of word embeddings, starting from an unsupervised setting, we built a supervised task (namely, predicting central/context words in a window) and learned the representation thanks to negative sampling

  • know your batchnorm

  • architectures with skip connections allow deeper models

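As an illustration of the embedding idea, a minimal sketch with nn.Embedding (the sizes below are arbitrary, not those used in the session):

import torch
import torch.nn as nn

num_categories, emb_dim = 10, 4
embedding = nn.Embedding(num_categories, emb_dim)  # a learnable lookup table

categories = torch.tensor([0, 3, 3, 7])            # integer-coded categorical values
vectors = embedding(categories)                    # shape (4, 4), trained by backprop like any other layer
print(vectors.shape)
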
🌻 Session 5

🌻 Session 6

🌻 Session 7

🌻 Session 8

🌻 Session 9

For more updates: Twitter URL

and check the

GitHub repository: dataflowr/notebooks

Curators

Marc Lelarge, Andrei Bursuc with Jill-Jênn Vie

Course in a hurry

Super fast track to learn the basics of deep learning from scratch:

For contributors

Join the GitHub repo dataflowr and make a pull request. What are pull requests?

Thanks to Daniel Huynh, Eric Daoud, Simon Coste

Materials from this site are used for courses at ENS and X.

\ No newline at end of file diff --git a/modules/0-julia-setup/index.html b/modules/0-julia-setup/index.html index 919328e..707bd01 100644 --- a/modules/0-julia-setup/index.html +++ b/modules/0-julia-setup/index.html @@ -35,7 +35,7 @@

Star
diff --git a/modules/0-sotfware-installation/index.html b/modules/0-sotfware-installation/index.html index d971a92..0d5269a 100644 --- a/modules/0-sotfware-installation/index.html +++ b/modules/0-sotfware-installation/index.html @@ -37,7 +37,7 @@

tl;dr

diff --git a/modules/1-intro-general-overview/index.html b/modules/1-intro-general-overview/index.html index a1da440..f049961 100644 --- a/modules/1-intro-general-overview/index.html +++ b/modules/1-intro-general-overview/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 1 - Introduction & General Overview

Table of Contents

Introduction & General Overview


0:00 Intro
0:31 Goal of this lecture
2:08 What is deep learning?
7:06 Why deep learning now?
9:33 Deep learning pipeline
12:17 General overview
16:02 Organization of the course
18:24 A first example in Colab (setting)
19:35 Dogs vs cats (data wrangling)
25:50 Data processing (dataset and dataloader)
40:51 VGG model
45:55 Modifying the last layer
49:50 Choosing your loss and optimizer for training
57:40 Precomputing features
1:03:39 Qualitative analysis

Slides and Notebook

Practicals

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 1 - Introduction & General Overview

Table of Contents

Introduction & General Overview


0:00 Intro
0:31 Goal of this lecture
2:08 What is deep learning?
7:06 Why deep learning now?
9:33 Deep learning pipeline
12:17 General overview
16:02 Organization of the course
18:24 A first example in Colab (setting)
19:35 Dogs vs cats (data wrangling)
25:50 Data processing (dataset and dataloader)
40:51 VGG model
45:55 Modifying the last layer
49:50 Choosing your loss and optimizer for training
57:40 Precomputing features
1:03:39 Qualitative analysis

Slides and Notebook

Practicals

\ No newline at end of file diff --git a/modules/10-generative-adversarial-networks/index.html b/modules/10-generative-adversarial-networks/index.html index da53f10..6473a7f 100644 --- a/modules/10-generative-adversarial-networks/index.html +++ b/modules/10-generative-adversarial-networks/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 10 - Generative Adversarial Networks

Table of Contents

Generative Adversarial Networks


0:00 Recap
0:15 Presentation of GANs
1:49 GAN learning
4:13 Learning the discriminator
6:16 Learning the generator
7:25 A trick for learning the generator
10:00 GAN for 2d-point clouds
11:51 Training loop in PyTorch
15:08 Loss curves
16:12 Generation with GANs
17:15 Mode collapse
20:00 Conditional GAN
21:15 InfoGAN
22:54 Deep convolutional GAN
25:45 Practicals
28:38 Non convergence for GANs
33:00 Coding a conditional GAN
39:13 Coding an InfoGAN
43:35 Examples of failures

Slides

Practicals

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 10 - Generative Adversarial Networks

Table of Contents

Generative Adversarial Networks


0:00 Recap
0:15 Presentation of GANs
1:49 GAN learning
4:13 Learning the discriminator
6:16 Learning the generator
7:25 A trick for learning the generator
10:00 GAN for 2d-point clouds
11:51 Training loop in PyTorch
15:08 Loss curves
16:12 Generation with GANs
17:15 Mode collapse
20:00 Conditional GAN
21:15 InfoGAN
22:54 Deep convolutional GAN
25:45 Practicals
28:38 Non convergence for GANs
33:00 Coding a conditional GAN
39:13 Coding an InfoGAN
43:35 Examples of failures

Slides

Practicals

\ No newline at end of file diff --git a/modules/11a-recurrent-neural-networks-theory/index.html b/modules/11a-recurrent-neural-networks-theory/index.html index 947167f..4112d4e 100644 --- a/modules/11a-recurrent-neural-networks-theory/index.html +++ b/modules/11a-recurrent-neural-networks-theory/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 11a - Recurrent Neural Networks theory

Table of Contents

Theory of RNNs


0:00 Recap
0:52 Introduction to RNNs
1:17 1D convolutional networks for sequences
2:16 Various tasks for RNNs
5:15 Theory of RNN
7:59 Backprop for RNN
10:30 A binary classification problem for sequences
17:17 Elman network
21:02 Training RNN
22:51 Results for Elman network
24:22 Gating for RNN
28:10 Gated RNN in PyTorch
29:27 Results for gated RNN
30:12 LSTM and GRU
34:11 Equations for GRU
37:23 Equations for LSTM
40:31 LSTM in PyTorch
42:44 Results for LSTM
43:43 Empirical results for LSTM and GRU

Slides

References

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 11a - Recurrent Neural Networks theory

Table of Contents

Theory of RNNs


0:00 Recap
0:52 Introduction to RNNs
1:17 1D convolutional networks for sequences
2:16 Various tasks for RNNs
5:15 Theory of RNN
7:59 Backprop for RNN
10:30 A binary classification problem for sequences
17:17 Elman network
21:02 Training RNN
22:51 Results for Elman network
24:22 Gating for RNN
28:10 Gated RNN in PyTorch
29:27 Results for gated RNN
30:12 LSTM and GRU
34:11 Equations for GRU
37:23 Equations for LSTM
40:31 LSTM in PyTorch
42:44 Results for LSTM
43:43 Empirical results for LSTM and GRU

Slides

References

\ No newline at end of file diff --git a/modules/11b-recurrent-neural-networks-practice/index.html b/modules/11b-recurrent-neural-networks-practice/index.html index aa07478..864691f 100644 --- a/modules/11b-recurrent-neural-networks-practice/index.html +++ b/modules/11b-recurrent-neural-networks-practice/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 11b - Recurrent Neural Networks practice

Table of Contents

Theory of RNNs


0:00 Generating the dataset for binary classification of parentheses
4:56 Elman network
11:25 RNN with gating
14:06 LSTM
18:33 Be careful with errors given on the training set!

Notebook

Practicals

References

RNNs can generate bounded hierarchical languages with optimal memory (2020) John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, Christopher D. Manning arXiv:2010.07515

Self-Attention Networks Can Process Bounded Hierarchical Languages (2021) Shunyu Yao, Binghui Peng, Christos Papadimitriou, Karthik Narasimhan arXiv:2105.11115

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 11b - Recurrent Neural Networks practice

Table of Contents

Theory of RNNs


0:00 Generating the dataset for binary classification of parentheses
4:56 Elman network
11:25 RNN with gating
14:06 LSTM
18:33 Be careful with errors given on the training set!

Notebook

Practicals

References

RNNs can generate bounded hierarchical languages with optimal memory (2020) John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, Christopher D. Manning arXiv:2010.07515

Self-Attention Networks Can Process Bounded Hierarchical Languages (2021) Shunyu Yao, Binghui Peng, Christos Papadimitriou, Karthik Narasimhan arXiv:2105.11115

\ No newline at end of file diff --git a/modules/11c-batches-with-sequences/index.html b/modules/11c-batches-with-sequences/index.html index 6514f48..55c8b78 100644 --- a/modules/11c-batches-with-sequences/index.html +++ b/modules/11c-batches-with-sequences/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 11c - Batches with sequences in Pytorch

Table of Contents

Pytorch tutorial on batch for sequences


0:00 Presentation
2:15 Step 1: Construct Vocabulary
2:50 Step 2: Load indexed data (list of instances, where each instance is list of character indices)
3:45 Step 3: Make Model
4:50 Step 4: Pad instances with 0s till max length sequence
5:47 Step 5: Sort instances by sequence length in descending order
6:55 Step 6: Embed the instances
9:10 Step 7: Call pack_padded_sequence with embedded instances and sequence lengths
12:41 Step 8: Forward with LSTM
14:38 Step 9: Call unpack_padded_sequences if required / or just pick last hidden vector

Notebook

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 11c - Batches with sequences in Pytorch

Table of Contents

Pytorch tutorial on batch for sequences


0:00 Presentation
2:15 Step 1: Construct Vocabulary
2:50 Step 2: Load indexed data (list of instances, where each instance is list of character indices)
3:45 Step 3: Make Model
4:50 Step 4: Pad instances with 0s till max length sequence
5:47 Step 5: Sort instances by sequence length in descending order
6:55 Step 6: Embed the instances
9:10 Step 7: Call pack_padded_sequence with embedded instances and sequence lengths
12:41 Step 8: Forward with LSTM
14:38 Step 9: Call unpack_padded_sequences if required / or just pick last hidden vector
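
A minimal sketch of steps 4 to 9 above; the vocabulary size, dimensions and the already-indexed sequences are made up for the example and do not come from the notebook.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# three already-indexed sequences of different lengths (0 is the padding index)
seqs = [[5, 2, 7, 3], [4, 1], [6, 9, 8]]
lengths = torch.tensor([len(s) for s in seqs])

# Steps 4-5: pad to the max length and sort by decreasing length
padded = torch.zeros(len(seqs), lengths.max(), dtype=torch.long)
for i, s in enumerate(seqs):
    padded[i, :len(s)] = torch.tensor(s)
lengths, order = lengths.sort(descending=True)
padded = padded[order]

# Steps 6-8: embed, pack, run the LSTM
embed = nn.Embedding(10, 8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed = pack_padded_sequence(embed(padded), lengths, batch_first=True)
output, (h_n, c_n) = lstm(packed)

# Step 9: unpack if you need per-step outputs, or just keep the last hidden state h_n
unpacked, _ = pad_packed_sequence(output, batch_first=True)
print(unpacked.shape, h_n.shape)   # torch.Size([3, 4, 16]) torch.Size([1, 3, 16])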

Notebook

\ No newline at end of file diff --git a/modules/12-attention/index.html b/modules/12-attention/index.html index d5f16d7..183ea0a 100644 --- a/modules/12-attention/index.html +++ b/modules/12-attention/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 12 - Attention and Transformers

Table of Contents

Attention with RNNs

The first attention mechanism was proposed in Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (presented at ICLR 2015).

The task considered is English-to-French translation, and the attention mechanism is proposed to extend a seq2seq architecture by adding a context vector $c_i$ in the RNN decoder, so that the hidden states of the decoder are computed recursively as $s_i = f(s_{i-1}, y_{i-1}, c_i)$, where $y_{i-1}$ is the previously predicted token and predictions are made in a probabilistic manner as $y_i \sim g(y_{i-1}, s_i, c_i)$, where $s_i$ and $c_i$ are the current hidden state and context of the decoder.

Now the main novelty is the introduction of the context $c_i$, which is a weighted average of all the hidden states of the encoder: $c_i = \sum_{j=1}^T \alpha_{i,j} h_j$, where $T$ is the length of the input sequence, $h_1, \dots, h_T$ are the corresponding hidden states of the encoder and $\sum_j \alpha_{i,j} = 1$. Hence the context allows passing direct information from the 'relevant' part of the input to the decoder. The coefficients $(\alpha_{i,j})_{j=1}^T$ are computed from the current hidden state of the decoder $s_{i-1}$ and all the hidden states of the encoder $(h_1, \dots, h_T)$ as explained below (taken from the original paper):

PyTorch implementation

In Attention for seq2seq, you can play with a simple model and code the attention mechanism proposed in the paper. For the alignment network $a$ (used to define the coefficients $\alpha_{i,j} = \text{softmax}_j(a(s_{i-1}, h_j))$), we take an MLP with $\tanh$ activations.

You will learn about seq2seq and teacher forcing for RNNs, and build the attention mechanism. To simplify things, we do not deal with batches (see Batches with sequences in Pytorch for more on that). The solution for this practical is provided in Attention for seq2seq - solution.

Note that each $\alpha_{i,j}$ is a real number, so we can display the matrix of the $\alpha_{i,j}$'s, where $j$ ranges over the input tokens and $i$ over the output tokens, see below (taken from the paper):

(Self-)Attention in Transformers

We now describe the attention mechanism proposed in Attention Is All You Need by Vaswani et al. First, we recall the basic query/key/value notions from retrieval systems, illustrated by an example: searching for videos on YouTube. In this example, the query is the text in the search bar, the keys are the metadata associated with the videos, and the videos themselves are the values. Hence a score can be computed from the query and all the keys, and the matched video with the highest score is returned.

We can formalize this process as follows: if $Q_s$ is the current query and $K_t$ and $V_t$ are all the keys and values in the database, we return

$$Y_s = \sum_{t=1}^T \text{softmax}_t(\text{score}(Q_s, K_t))\, V_t,$$

where $\sum_{t=1}^T \text{softmax}_t(\text{score}(Q_s, K_t)) = 1$.

Note that this formalism allows us to recover the way contexts were computed above (where the score function was called the alignment network). Now, we change the score function and consider dot-product attention: $\text{score}(Q_s, K_t) = \frac{Q_s^T K_t}{\sqrt{d}}$. For this definition to make sense, both the query $Q_s$ and the key $K_t$ need to live in the same space, and $d$ is the dimension of this space.

Given $s$ inputs in $\mathbb{R}^{d_{\text{in}}}$ denoted by a matrix $X \in \mathbb{R}^{d_{\text{in}} \times s}$, and a database containing $t$ samples in $\mathbb{R}^{d'}$ denoted by a matrix $X' \in \mathbb{R}^{d' \times t}$, we define:

$$\begin{aligned} \text{the queries: } & Q = W_Q X, \text{ with } W_Q \in \mathbb{R}^{k \times d_{\text{in}}}, \\ \text{the keys: } & K = W_K X', \text{ with } W_K \in \mathbb{R}^{k \times d'}, \\ \text{the values: } & V = W_V X', \text{ with } W_V \in \mathbb{R}^{d_{\text{out}} \times d'}. \end{aligned}$$

Now self-attention is simply obtained with $X = X'$ (so that $d' = d_{\text{in}}$) and $d_{\text{in}} = d_{\text{out}} = d$. In summary, a self-attention layer can take as input any tensor of the form $X \in \mathbb{R}^{d \times T}$ (for any $T$), has parameters:

$$W_Q \in \mathbb{R}^{k \times d}, \quad W_K \in \mathbb{R}^{k \times d}, \quad W_V \in \mathbb{R}^{d \times d},$$

and produces $Y \in \mathbb{R}^{d \times T}$ (with the same $d$ and $T$ as the input). Here $d$ is the dimension of the input and $k$ is a hyperparameter of the self-attention layer:

$$Y_s = \sum_{t=1}^T \text{softmax}_t\left(\frac{X_s^T W_Q^T W_K X_t}{\sqrt{k}}\right) W_V X_t,$$

with the convention that $X_t \in \mathbb{R}^d$ (resp. $Y_s \in \mathbb{R}^d$) is the $t$-th column of $X$ (resp. the $s$-th column of $Y$). Note that the notation $\text{softmax}_t(\cdot)$ might be a bit confusing. Recall that $\text{softmax}$ always takes a vector as input and returns a (normalized) vector. In practice, we are most of the time dealing with batches, so that the $\text{softmax}$ function takes a matrix (or tensor) as input and we need to normalize along the right axis! Named tensor notation (see below) deals with this notational issue. I also find the interpretation given below helpful:

Mental model for self-attention: self-attention interpreted as taking expectation

$$y_s = \sum_{t=1}^T p(x_t \mid x_s)\, v(x_t) = \mathbb{E}[v(x) \mid x_s], \quad \text{with } p(x_t \mid x_s) = \frac{\exp(q(x_s) k(x_t))}{\sum_r \exp(q(x_s) k(x_r))},$$

where the mappings $q(\cdot)$, $k(\cdot)$ and $v(\cdot)$ represent the query, key and value.

Multi-head attention combines several such operations in parallel; $Y$ is then the concatenation of the results along the feature dimension, to which one more linear transformation is applied.

Transformer block

To finish the description of a transformer block, we need to define two last layers: Layer Norm and Feed Forward Network.

The Layer Norm used in the transformer block is particularly simple, as it acts on vectors and standardizes them as follows: for $x \in \mathbb{R}^d$, we define

$$\text{mean}(x) = \frac{1}{d}\sum_{i=1}^d x_i \in \mathbb{R}, \qquad \text{std}(x)^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \text{mean}(x))^2 \in \mathbb{R},$$

and then the Layer Norm has two parameters $\gamma, \beta \in \mathbb{R}^d$ and

$$LN(x) = \gamma \cdot \frac{x - \text{mean}(x)}{\text{std}(x)} + \beta,$$

where we use the natural broadcasting rule for subtracting the mean and dividing by the std, and $\cdot$ is component-wise multiplication.

A Feed Forward Network is an MLP acting on vectors: for $x \in \mathbb{R}^d$, we define

$$FFN(x) = \max(0, x W_1 + b_1) W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d \times h}$, $b_1 \in \mathbb{R}^h$, $W_2 \in \mathbb{R}^{h \times d}$, $b_2 \in \mathbb{R}^d$.

Each of these layers is applied on each of the inputs given to the transformer block as depicted below:

Note that this block is equivariant: if we permute the inputs, then the outputs are permuted with the same permutation. As a result, the order of the inputs is irrelevant to the transformer block; in particular, this order cannot be exploited by the block itself. The important notion of positional encoding allows us to take order into account: a deterministic, unique encoding of each time step is added to the corresponding input token.

Transformers using Named Tensor Notation

In Transformers using Named Tensor Notation, we derive the formal equations for the Transformer block using named tensor notation.

Hacking a simple Transformer block

Now is the time to have fun building a simple transformer block and to think like transformers (open in colab).

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 12 - Attention and Transformers

Table of Contents

Attention with RNNs

The first attention mechanism was proposed in Neural Machine Translation by Jointly Learning to Align and Translate by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (presented at ICLR 2015).

The task considered is English-to-French translation, and the attention mechanism is proposed to extend a seq2seq architecture by adding a context vector $c_i$ in the RNN decoder, so that the hidden states of the decoder are computed recursively as $s_i = f(s_{i-1}, y_{i-1}, c_i)$, where $y_{i-1}$ is the previously predicted token and predictions are made in a probabilistic manner as $y_i \sim g(y_{i-1}, s_i, c_i)$, where $s_i$ and $c_i$ are the current hidden state and context of the decoder.

Now the main novelty is the introduction of the context $c_i$, which is a weighted average of all the hidden states of the encoder: $c_i = \sum_{j=1}^T \alpha_{i,j} h_j$, where $T$ is the length of the input sequence, $h_1, \dots, h_T$ are the corresponding hidden states of the encoder and $\sum_j \alpha_{i,j} = 1$. Hence the context allows passing direct information from the 'relevant' part of the input to the decoder. The coefficients $(\alpha_{i,j})_{j=1}^T$ are computed from the current hidden state of the decoder $s_{i-1}$ and all the hidden states of the encoder $(h_1, \dots, h_T)$ as explained below (taken from the original paper):

PyTorch implementation

In Attention for seq2seq, you can play with a simple model and code the attention mechanism proposed in the paper. For the alignment network $a$ (used to define the coefficients $\alpha_{i,j} = \text{softmax}_j(a(s_{i-1}, h_j))$), we take an MLP with $\tanh$ activations.

You will learn about seq2seq and teacher forcing for RNNs, and build the attention mechanism. To simplify things, we do not deal with batches (see Batches with sequences in Pytorch for more on that). The solution for this practical is provided in Attention for seq2seq - solution.

Note that each $\alpha_{i,j}$ is a real number, so we can display the matrix of the $\alpha_{i,j}$'s, where $j$ ranges over the input tokens and $i$ over the output tokens, see below (taken from the paper):

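To make this concrete, here is a minimal sketch of the context computation with a tanh-MLP alignment network followed by a softmax, in the spirit of the practical; the class name, shapes and dimensions are illustrative assumptions, not the practical's actual code.

import torch
import torch.nn as nn

class Alignment(nn.Module):
    # a(s_{i-1}, h_j): a small MLP with tanh, returning one scalar score per encoder state
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
                                 nn.Linear(hidden_dim, 1))

    def forward(self, s_prev, H):
        # s_prev: (hidden_dim,) decoder state; H: (T, hidden_dim) encoder states
        scores = self.mlp(torch.cat([s_prev.expand_as(H), H], dim=-1)).squeeze(-1)  # (T,)
        alpha = torch.softmax(scores, dim=-1)   # the alpha_{i,j}'s, summing to 1
        context = alpha @ H                     # c_i = sum_j alpha_{i,j} h_j
        return context, alpha

hidden_dim, T = 32, 7
align = Alignment(hidden_dim)
c, alpha = align(torch.randn(hidden_dim), torch.randn(T, hidden_dim))
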
(Self-)Attention in Transformers

We now describe the attention mechanism proposed in Attention Is All You Need by Vaswani et al. First, we recall the basic query/key/value notions from retrieval systems, illustrated by an example: searching for videos on YouTube. In this example, the query is the text in the search bar, the keys are the metadata associated with the videos, and the videos themselves are the values. Hence a score can be computed from the query and all the keys, and the matched video with the highest score is returned.

We can formalize this process as follows: if $Q_s$ is the current query and $K_t$ and $V_t$ are all the keys and values in the database, we return

$$Y_s = \sum_{t=1}^T \text{softmax}_t(\text{score}(Q_s, K_t))\, V_t,$$

where $\sum_{t=1}^T \text{softmax}_t(\text{score}(Q_s, K_t)) = 1$.

Note that this formalism allows us to recover the way contexts were computed above (where the score function was called the alignment network). Now, we change the score function and consider dot-product attention: $\text{score}(Q_s, K_t) = \frac{Q_s^T K_t}{\sqrt{d}}$. For this definition to make sense, both the query $Q_s$ and the key $K_t$ need to live in the same space, and $d$ is the dimension of this space.

Given $s$ inputs in $\mathbb{R}^{d_{\text{in}}}$ denoted by a matrix $X \in \mathbb{R}^{d_{\text{in}} \times s}$, and a database containing $t$ samples in $\mathbb{R}^{d'}$ denoted by a matrix $X' \in \mathbb{R}^{d' \times t}$, we define:

$$\begin{aligned} \text{the queries: } & Q = W_Q X, \text{ with } W_Q \in \mathbb{R}^{k \times d_{\text{in}}}, \\ \text{the keys: } & K = W_K X', \text{ with } W_K \in \mathbb{R}^{k \times d'}, \\ \text{the values: } & V = W_V X', \text{ with } W_V \in \mathbb{R}^{d_{\text{out}} \times d'}. \end{aligned}$$

Now self-attention is simply obtained with $X = X'$ (so that $d' = d_{\text{in}}$) and $d_{\text{in}} = d_{\text{out}} = d$. In summary, a self-attention layer can take as input any tensor of the form $X \in \mathbb{R}^{d \times T}$ (for any $T$), has parameters:

$$W_Q \in \mathbb{R}^{k \times d}, \quad W_K \in \mathbb{R}^{k \times d}, \quad W_V \in \mathbb{R}^{d \times d},$$

and produces $Y \in \mathbb{R}^{d \times T}$ (with the same $d$ and $T$ as the input). Here $d$ is the dimension of the input and $k$ is a hyperparameter of the self-attention layer:

$$Y_s = \sum_{t=1}^T \text{softmax}_t\left(\frac{X_s^T W_Q^T W_K X_t}{\sqrt{k}}\right) W_V X_t,$$

with the convention that $X_t \in \mathbb{R}^d$ (resp. $Y_s \in \mathbb{R}^d$) is the $t$-th column of $X$ (resp. the $s$-th column of $Y$). Note that the notation $\text{softmax}_t(\cdot)$ might be a bit confusing. Recall that $\text{softmax}$ always takes a vector as input and returns a (normalized) vector. In practice, we are most of the time dealing with batches, so that the $\text{softmax}$ function takes a matrix (or tensor) as input and we need to normalize along the right axis! Named tensor notation (see below) deals with this notational issue. I also find the interpretation given below helpful:

Mental model for self-attention: self-attention interpreted as taking expectation

$$y_s = \sum_{t=1}^T p(x_t \mid x_s)\, v(x_t) = \mathbb{E}[v(x) \mid x_s], \quad \text{with } p(x_t \mid x_s) = \frac{\exp(q(x_s) k(x_t))}{\sum_r \exp(q(x_s) k(x_r))},$$

where the mappings $q(\cdot)$, $k(\cdot)$ and $v(\cdot)$ represent the query, key and value.

Multi-head attention combines several such operations in parallel; $Y$ is then the concatenation of the results along the feature dimension, to which one more linear transformation is applied.

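Before moving on to the transformer block, here is a minimal single-head sketch of the dot-product self-attention defined above, written with tokens as rows rather than columns (the transposed convention); the shapes and names are illustrative assumptions, not the notebook's code.

import math
import torch

def self_attention(X, W_Q, W_K, W_V):
    # X: (T, d) with tokens as rows; W_Q, W_K: (d, k); W_V: (d, d)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (T, k), (T, k), (T, d)
    scores = Q @ K.T / math.sqrt(K.shape[1])      # (T, T) dot-product attention scores
    A = torch.softmax(scores, dim=-1)             # normalize over the "t" axis
    return A @ V                                  # each Y_s is a weighted sum of the W_V X_t

T, d, k = 5, 16, 8
X = torch.randn(T, d)
Y = self_attention(X, torch.randn(d, k), torch.randn(d, k), torch.randn(d, d))
print(Y.shape)  # torch.Size([5, 16])
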
Transformer block

To finish the description of a transformer block, we need to define two last layers: Layer Norm and Feed Forward Network.

The Layer Norm used in the transformer block is particularly simple, as it acts on vectors and standardizes them as follows: for $x \in \mathbb{R}^d$, we define

$$\text{mean}(x) = \frac{1}{d}\sum_{i=1}^d x_i \in \mathbb{R}, \qquad \text{std}(x)^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \text{mean}(x))^2 \in \mathbb{R},$$

and then the Layer Norm has two parameters $\gamma, \beta \in \mathbb{R}^d$ and

$$LN(x) = \gamma \cdot \frac{x - \text{mean}(x)}{\text{std}(x)} + \beta,$$

where we use the natural broadcasting rule for subtracting the mean and dividing by the std, and $\cdot$ is component-wise multiplication.

A Feed Forward Network is an MLP acting on vectors: for $x \in \mathbb{R}^d$, we define

$$FFN(x) = \max(0, x W_1 + b_1) W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d \times h}$, $b_1 \in \mathbb{R}^h$, $W_2 \in \mathbb{R}^{h \times d}$, $b_2 \in \mathbb{R}^d$.

Each of these layers is applied on each of the inputs given to the transformer block as depicted below:
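
The figure is not reproduced here; as a complement, a minimal sketch of the two layers exactly as defined above (the shapes are illustrative):

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (..., d); standardize the last dimension, then scale and shift
    mean = x.mean(dim=-1, keepdim=True)
    std = x.var(dim=-1, unbiased=False, keepdim=True).add(eps).sqrt()  # eps only for numerical stability
    return gamma * (x - mean) / std + beta

def ffn(x, W1, b1, W2, b2):
    # x: (..., d), W1: (d, h), W2: (h, d): max(0, x W1 + b1) W2 + b2
    return torch.relu(x @ W1 + b1) @ W2 + b2

d, h = 16, 64
x = torch.randn(3, d)
out = ffn(layer_norm(x, torch.ones(d), torch.zeros(d)),
          torch.randn(d, h), torch.zeros(h), torch.randn(h, d), torch.zeros(d))
print(out.shape)  # torch.Size([3, 16])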

Note that this block is equivariant: if we permute the inputs, then the outputs are permuted with the same permutation. As a result, the order of the inputs is irrelevant to the transformer block; in particular, this order cannot be exploited by the block itself. The important notion of positional encoding allows us to take order into account: a deterministic, unique encoding of each time step is added to the corresponding input token.

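As one concrete instance of such an encoding, here is a sketch of the classic sinusoidal encoding from Attention Is All You Need; whether the module uses this particular encoding or a learned one is not specified here, and the dimensions below are arbitrary.

import torch

def sinusoidal_positional_encoding(T, d):
    # one deterministic vector per position, added to the token embeddings (assumes d even)
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)            # even feature indices
    angles = pos / torch.pow(10000.0, i / d)                  # (T, d/2)
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

tokens = torch.randn(10, 16)                                  # (T, d) token embeddings
tokens = tokens + sinusoidal_positional_encoding(10, 16)
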
LLM Visualization.

Have a look at Brendan Bycroft’s beautifully crafted interactive explanation of the transformer architecture:


Transformers using Named Tensor Notation

In Transformers using Named Tensor Notation, we derive the formal equations for the Transformer block using named tensor notation.

Hacking a simple Transformer block

Now is the time to have fun building a simple transformer block and to think like transformers (open in colab).

\ No newline at end of file diff --git a/modules/12-intro-julia/index.html b/modules/12-intro-julia/index.html index 549ac6b..3f09e5c 100644 --- a/modules/12-intro-julia/index.html +++ b/modules/12-intro-julia/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module - Introduction to Julia: Automatic differentiation with dual numbers

Table of Contents

Introduction to Julia: Automatic differentiation with dual numbers


0:00 Dual numbers in Julia
8:47 Using conversion and promotion
13:25 Automatic differentiation for polynomials
17:35 Using Babylonian algorithm for the square root
24:27 Checking the derivative by hand
25:37 Pkg ForwardDiff.jl

Notebook

Binder

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module - Introduction to Julia: Automatic differentiation with dual numbers

Table of Contents

Introduction to Julia: Automatic differentiation with dual numbers


0:00 Dual numbers in Julia
8:47 Using conversion and promotion
13:25 Automatic differentiation for polynomials
17:35 Using Babylonian algorithm for the square root
24:27 Checking the derivative by hand
25:37 Pkg ForwardDiff.jl

Notebook

Binder

\ No newline at end of file diff --git a/modules/13-siamese/index.html b/modules/13-siamese/index.html index d11c11c..e657536 100644 --- a/modules/13-siamese/index.html +++ b/modules/13-siamese/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 13 - Siamese Networks and Representation Learning

Table of Contents

Siamese Networks and Representation Learning


0:57 Siamese networks for face recognition
4:21 Siamese architecture
6:09 Contrastive loss
11:23 Training siamese networks
14:26 Triplet architecture
15:00 Triplet loss
17:16 Training with triplet loss
17:45 Pytorch code
20:20 Hard negative sampling
22:55 Applications
31:00 N-pair loss
32:06 Histogram loss
33:35 Prototypical networks
36:10 Take-away

Slides and Notebook

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 13 - Siamese Networks and Representation Learning

Table of Contents

Siamese Networks and Representation Learning


0:57 Siamese networks for face recognition
4:21 Siamese architecture
6:09 Contrastive loss
11:23 Training siamese networks
14:26 Triplet architecture
15:00 Triplet loss
17:16 Training with triplet loss
17:45 Pytorch code
20:20 Hard negative sampling
22:55 Applications
31:00 N-pair loss
32:06 Histogram loss
33:35 Prototypical networks
36:10 Take-away

Slides and Notebook

\ No newline at end of file diff --git a/modules/14a-depth/index.html b/modules/14a-depth/index.html index 7498796..7588777 100644 --- a/modules/14a-depth/index.html +++ b/modules/14a-depth/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 14a - The Benefits of Depth

Table of Contents

Benefits of Depth

Slides

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 14a - The Benefits of Depth

Table of Contents

Benefits of Depth

Slides

\ No newline at end of file diff --git a/modules/14b-depth/index.html b/modules/14b-depth/index.html index 2b4a518..c2c1b86 100644 --- a/modules/14b-depth/index.html +++ b/modules/14b-depth/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 14b - The Problems with Depth

Table of Contents

The Problems with Depth

Slides

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 14b - The Problems with Depth

Table of Contents

The Problems with Depth

Slides

\ No newline at end of file diff --git a/modules/15-dropout/index.html b/modules/15-dropout/index.html index bed8450..2223b75 100644 --- a/modules/15-dropout/index.html +++ b/modules/15-dropout/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 15 - Dropout

Table of Contents

Dropout

Slides and Notebook

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 15 - Dropout

Table of Contents

Dropout

Slides and Notebook

\ No newline at end of file diff --git a/modules/16-batchnorm/index.html b/modules/16-batchnorm/index.html index 8dfd96f..7ee5c61 100644 --- a/modules/16-batchnorm/index.html +++ b/modules/16-batchnorm/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 16 - Batchnorm

Table of Contents

Batchnorm

Slides and Notebook

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 16 - Batchnorm

Table of Contents

Batchnorm

Slides and Notebook

\ No newline at end of file diff --git a/modules/17-resnets/index.html b/modules/17-resnets/index.html index 1a8d7ec..6e53a1a 100644 --- a/modules/17-resnets/index.html +++ b/modules/17-resnets/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 17 - Resnets

Table of Contents

Resnets

Slides

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 17 - Resnets

Table of Contents

Resnets

Slides

\ No newline at end of file diff --git a/modules/18a-diffusion/index.html b/modules/18a-diffusion/index.html index d0adf02..6c3c6d6 100644 --- a/modules/18a-diffusion/index.html +++ b/modules/18a-diffusion/index.html @@ -57,4 +57,4 @@ pred_prev_sample = pred_prev_sample + variance - return pred_prev_sample

Summary: Denoising Diffusion Probabilistic Models

(J. Ho, A. Jain, P. Abbeel 2020)

Given a schedule $\beta_1 < \beta_2 < \dots < \beta_T$, the forward diffusion process is defined by $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$ and $q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1})$.

With $\alpha_t = 1 - \beta_t$ and $\overline{\alpha}_t = \prod_{i=1}^t \alpha_i$, we see that, with $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1-\overline{\alpha}_t}\, \epsilon.$$
The law $q(x_{t-1}|x_t, \epsilon)$ is explicit: $q(x_{t-1}|x_t, \epsilon) = \mathcal{N}(x_{t-1}; \mu(x_t, \epsilon, t), \gamma_t I)$ with
$$\mu(x_t, \epsilon, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon\right) \quad \text{and} \quad \gamma_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t.$$

Training: to approximate the reversed diffusion $q(x_{t-1}|x_t)$ by a neural network given by $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \beta_t I)$ and $p(x_T) \sim \mathcal{N}(0, I)$, we maximize the usual variational bound:
$$\mathbb{E}_{q(x_0)} \ln p_\theta(x_0) \geq L_T + \sum_{t=2}^T L_{t-1} + L_0, \quad \text{with } L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\|\mu_\theta(x_t, t) - \mu(x_t, \epsilon, t)\|^2\right].$$
With the change of variable
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(x_t, t)\right),$$
ignoring the prefactor and sampling $\tau$ instead of summing over all $t$, the loss is finally
$$\ell(\theta) = \mathbb{E}_\tau \mathbb{E}_\epsilon\left[\|\epsilon - \epsilon_\theta(\sqrt{\overline{\alpha}_\tau}\, x_0 + \sqrt{1-\overline{\alpha}_\tau}\, \epsilon, \tau)\|^2\right].$$
Sampling: to simulate the reversed diffusion with the learned $\epsilon_\theta(x_t, t)$, starting from $x_T \sim \mathcal{N}(0, I)$, iterate for $t = T, \dots, 1$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, \epsilon, \quad \text{with } \epsilon \sim \mathcal{N}(0, I).$$

Implementation

MNIST

The training of this notebook on colab takes approximately 20 minutes.

CIFAR10

The training of this notebook on colab takes approximately 20 minutes (so do not expect high-quality pictures!). Still, after finetuning on specific classes, we see that the model learns features of the class.

With a bit more training (100 epochs), you can get results like this:

Technical details

Note that the Denoising Diffusion Probabilistic Model is the same for MNIST and CIFAR10; we only change the UNet that learns to reverse the noise. For CIFAR10, we adapt the UNet provided in Module 9b. You can still use the DDPM code provided here with other architectures, including more complex ones with self-attention, like this UNet coded by lucidrains, which is the one used in the original paper.

In the paper, the authors used Exponential Moving Average (EMA) on the model parameters with a decay factor of 0.999. This is not implemented here to keep the code as simple as possible.

\ No newline at end of file + return pred_prev_sample

Summary: Denoising Diffusion Probabilistic Models

(J. Ho, A. Jain, P. Abbeel 2020)

Given a schedule $\beta_1 < \beta_2 < \dots < \beta_T$, the forward diffusion process is defined by $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$ and $q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1})$.

With $\alpha_t = 1 - \beta_t$ and $\overline{\alpha}_t = \prod_{i=1}^t \alpha_i$, we see that, with $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1-\overline{\alpha}_t}\, \epsilon.$$
The law $q(x_{t-1}|x_t, \epsilon)$ is explicit: $q(x_{t-1}|x_t, \epsilon) = \mathcal{N}(x_{t-1}; \mu(x_t, \epsilon, t), \gamma_t I)$ with
$$\mu(x_t, \epsilon, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon\right) \quad \text{and} \quad \gamma_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t.$$

Training: to approximate the reversed diffusion $q(x_{t-1}|x_t)$ by a neural network given by $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \beta_t I)$ and $p(x_T) \sim \mathcal{N}(0, I)$, we maximize the usual variational bound:
$$\mathbb{E}_{q(x_0)} \ln p_\theta(x_0) \geq L_T + \sum_{t=2}^T L_{t-1} + L_0, \quad \text{with } L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\|\mu_\theta(x_t, t) - \mu(x_t, \epsilon, t)\|^2\right].$$
With the change of variable
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(x_t, t)\right),$$
ignoring the prefactor and sampling $\tau$ instead of summing over all $t$, the loss is finally
$$\ell(\theta) = \mathbb{E}_\tau \mathbb{E}_\epsilon\left[\|\epsilon - \epsilon_\theta(\sqrt{\overline{\alpha}_\tau}\, x_0 + \sqrt{1-\overline{\alpha}_\tau}\, \epsilon, \tau)\|^2\right].$$
Sampling: to simulate the reversed diffusion with the learned $\epsilon_\theta(x_t, t)$, starting from $x_T \sim \mathcal{N}(0, I)$, iterate for $t = T, \dots, 1$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, \epsilon, \quad \text{with } \epsilon \sim \mathcal{N}(0, I).$$

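To make the summary concrete, here is a minimal sketch of the training loss and of the sampling loop above, written in plain PyTorch; eps_model stands for any noise-prediction network (the UNet in the notebooks), and the schedule values and names are illustrative assumptions, not the notebook's exact settings.

import torch

# linear beta schedule and derived quantities (illustrative values)
n_steps = 1000
betas = torch.linspace(1e-4, 0.02, n_steps)     # beta_1 < ... < beta_T
alphas = 1.0 - betas                             # alpha_t
alphas_bar = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t

def training_loss(eps_model, x0):
    # sample a step tau and a noise eps, build x_tau, regress eps (assumes x0 of shape (B, C, H, W))
    b = x0.shape[0]
    tau = torch.randint(0, n_steps, (b,))
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[tau].view(b, 1, 1, 1)
    x_tau = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return ((eps - eps_model(x_tau, tau)) ** 2).mean()

@torch.no_grad()
def sample(eps_model, shape):
    # start from pure noise and iterate the reversed diffusion for t = T, ..., 1
    x = torch.randn(shape)
    for t in reversed(range(n_steps)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        coef = (1 - alphas[t]) / (1 - alphas_bar[t]).sqrt()
        t_batch = torch.full((shape[0],), t)
        x = (x - coef * eps_model(x, t_batch)) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
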
Implementation

MNIST

The training of this notebook on colab takes approximately 20 minutes.

CIFAR10

The training of this notebook on colab takes approximately 20 minutes (so do not expect high-quality pictures!). Still, after finetuning on specific classes, we see that the model learns features of the class.

With a bit more training (100 epochs), you can get results like this:

Technical details

Note that the Denoising Diffusion Probabilistic Model is the same for MNIST and CIFAR10; we only change the UNet that learns to reverse the noise. For CIFAR10, we adapt the UNet provided in Module 9b. You can still use the DDPM code provided here with other architectures, including more complex ones with self-attention, like this UNet coded by lucidrains, which is the one used in the original paper.

In the paper, the authors used Exponential Moving Average (EMA) on the model parameters with a decay factor of 0.999. This is not implemented here to keep the code as simple as possible.

\ No newline at end of file diff --git a/modules/19-clip/index.html b/modules/19-clip/index.html index dd0004c..21479fe 100644 --- a/modules/19-clip/index.html +++ b/modules/19-clip/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 19 - Zero-shot classification with CLIP

Notebook

References

CLIP Learning Transferable Visual Models From Natural Language Supervision (ICML 2021) Alec Radford et al.

Visual Classification via Description from Large Language Models (ICLR 2023) Menon, Sachit and Vondrick, Carl

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 19 - Zero-shot classification with CLIP

Notebook

References

CLIP Learning Transferable Visual Models From Natural Language Supervision (ICML 2021) Alec Radford et al.

Visual Classification via Description from Large Language Models (ICLR 2023) Menon, Sachit and Vondrick, Carl

\ No newline at end of file diff --git a/modules/2a-pytorch-tensors/index.html b/modules/2a-pytorch-tensors/index.html index 1fd90df..0759200 100644 --- a/modules/2a-pytorch-tensors/index.html +++ b/modules/2a-pytorch-tensors/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 2a - Pytorch tensors

Table of Contents

Pytorch tensors


0:00 Recap
1:43 Introduction to tensors
4:32 Sizes
5:25 Bridge to numpy
11:10 Broadcasting
14:35 Inplace modification
16:30 Shared memory
18:40 Cuda
22:34 CIFAR dataset

Notebook

Quiz

To check your understanding of the material, you can do the quizzes

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 2a - Pytorch tensors

Table of Contents

Pytorch tensors


0:00 Recap
1:43 Introduction to tensors
4:32 Sizes
5:25 Bridge to numpy
11:10 Broadcasting
14:35 Inplace modification
16:30 Shared memory
18:40 Cuda
22:34 CIFAR dataset

Notebook

Quiz

To check your understanding of the material, you can do the quizzes

\ No newline at end of file diff --git a/modules/2b-automatic-differentiation/index.html b/modules/2b-automatic-differentiation/index.html index a5b61ed..e929690 100644 --- a/modules/2b-automatic-differentiation/index.html +++ b/modules/2b-automatic-differentiation/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 2b - Automatic differentiation

Table of Contents

Automatic differentiation


0:00 Recap
0:40 A simple example (more in the practicals)
3:44 Pytorch tensor: requires_grad field
6:44 Pytorch backward function
9:05 The chain rule on our example
16:00 Linear regression
18:00 Gradient descent with numpy...
27:30 ... with pytorch tensors
31:30 Using autograd
34:35 Using a neural network (linear layer)
39:50 Using a pytorch optimizer
44:00 algorithm: how automatic differentiation works

Slides and Notebook

Quiz

To check your understanding of automatic differentiation, you can do the quizzes

Practicals

Challenge

Adapt your code to solve the following challenge:

Some small modifications:

Bonus:

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 2b - Automatic differentiation

Table of Contents

Automatic differentiation


0:00 Recap
0:40 A simple example (more in the practicals)
3:44 Pytorch tensor: requires_grad field
6:44 Pytorch backward function
9:05 The chain rule on our example
16:00 Linear regression
18:00 Gradient descent with numpy...
27:30 ... with pytorch tensors
31:30 Using autograd
34:35 Using a neural network (linear layer)
39:50 Using a pytorch optimizer
44:00 algorithm: how automatic differentiation works

Slides and Notebook

Quiz

To check your understanding of automatic differentiation, you can do the quizzes

Practicals

Challenge

Adapt your code to solve the following challenge:

Some small modifications:

Bonus:

\ No newline at end of file diff --git a/modules/2c-jax/index.html b/modules/2c-jax/index.html index 39c3632..148abd0 100644 --- a/modules/2c-jax/index.html +++ b/modules/2c-jax/index.html @@ -9,4 +9,4 @@ result, = ctx.saved_tensors return grad_output * result # Use it by calling the apply method: -output = Exp.apply(input)

You can have a look at Module 2b to learn more about this approach as well as MLP from scratch.

Backprop the functional way

Here we will implement in numpy a different approach mimicking the functional approach of JAX see The Autodiff Cookbook.

Each function will take 2 arguments: one being the input x and the other being the parameters w. For each function, we build 2 vjp functions taking as argument a gradient $\mathbf{u}$ and corresponding to $J_{\mathbf{f}}(\mathbf{x})$ and $J_{\mathbf{f}}(\mathbf{w})$, so that these functions return $J_{\mathbf{f}}(\mathbf{x})^T \mathbf{u}$ and $J_{\mathbf{f}}(\mathbf{w})^T \mathbf{u}$ respectively. To summarize, for $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{f}(\mathbf{x}, \mathbf{w}) \in \mathbb{R}^m$,

$$\begin{aligned} {\bf vjp}_{\mathbf{x}}(\mathbf{u}) &= J_{\mathbf{f}}(\mathbf{x})^T \mathbf{u}, \text{ with } J_{\mathbf{f}}(\mathbf{x}) \in \mathbb{R}^{m \times n},\ \mathbf{u} \in \mathbb{R}^m, \\ {\bf vjp}_{\mathbf{w}}(\mathbf{u}) &= J_{\mathbf{f}}(\mathbf{w})^T \mathbf{u}, \text{ with } J_{\mathbf{f}}(\mathbf{w}) \in \mathbb{R}^{m \times d},\ \mathbf{u} \in \mathbb{R}^m. \end{aligned}$$

Then backpropagation is simply done by first computing the gradient of the loss and then composing the vjp functions in the right order.

Practice

\ No newline at end of file +output = Exp.apply(input)
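
For context, here is a self-contained version of the fragment visible in this hunk; it follows the classic Exp example from the PyTorch documentation on torch.autograd.Function, reconstructed here, so treat the forward part as an assumption rather than the page's exact code.

import torch

class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)   # keep exp(i) for the backward pass
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result     # d exp(i)/di = exp(i)

# Use it by calling the apply method:
input = torch.randn(3, requires_grad=True)
output = Exp.apply(input)
output.sum().backward()
print(torch.allclose(input.grad, input.exp()))  # True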

You can have a look at Module 2b to learn more about this approach as well as MLP from scratch.

Backprop the functional way

Here we will implement in numpy a different approach mimicking the functional approach of JAX see The Autodiff Cookbook.

Each function will take 2 arguments: one being the input x and the other being the parameters w. For each function, we build 2 vjp functions taking as argument a gradient $\mathbf{u}$ and corresponding to $J_{\mathbf{f}}(\mathbf{x})$ and $J_{\mathbf{f}}(\mathbf{w})$, so that these functions return $J_{\mathbf{f}}(\mathbf{x})^T \mathbf{u}$ and $J_{\mathbf{f}}(\mathbf{w})^T \mathbf{u}$ respectively. To summarize, for $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{f}(\mathbf{x}, \mathbf{w}) \in \mathbb{R}^m$,

$$\begin{aligned} {\bf vjp}_{\mathbf{x}}(\mathbf{u}) &= J_{\mathbf{f}}(\mathbf{x})^T \mathbf{u}, \text{ with } J_{\mathbf{f}}(\mathbf{x}) \in \mathbb{R}^{m \times n},\ \mathbf{u} \in \mathbb{R}^m, \\ {\bf vjp}_{\mathbf{w}}(\mathbf{u}) &= J_{\mathbf{f}}(\mathbf{w})^T \mathbf{u}, \text{ with } J_{\mathbf{f}}(\mathbf{w}) \in \mathbb{R}^{m \times d},\ \mathbf{u} \in \mathbb{R}^m. \end{aligned}$$

Then backpropagation is simply done by first computing the gradient of the loss and then composing the vjp functions in the right order.
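
A small numpy sketch of this composition for a single linear function $\mathbf{f}(\mathbf{x}, \mathbf{w}) = W\mathbf{x}$ with a squared-error loss; the function and variable names are illustrative, not those of the practical.

import numpy as np

def linear(x, w):
    # f(x, w) = w @ x, with x in R^n, w in R^{m x n}, output in R^m
    return w @ x

def linear_vjp_x(x, w, u):
    # J_f(x)^T u = w^T u
    return w.T @ u

def linear_vjp_w(x, w, u):
    # vjp w.r.t. the parameters, reshaped like w: u x^T
    return np.outer(u, x)

n, m = 4, 3
x, w, y = np.random.randn(n), np.random.randn(m, n), np.random.randn(m)

# forward, squared-error loss, then backprop by composing the vjps
out = linear(x, w)
grad_out = 2 * (out - y)                 # gradient of ||out - y||^2 w.r.t. out
grad_w = linear_vjp_w(x, w, grad_out)    # gradient w.r.t. the parameters
grad_x = linear_vjp_x(x, w, grad_out)    # gradient w.r.t. the input (passed to the previous layer)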

Practice

\ No newline at end of file diff --git a/modules/3-loss-functions-for-classification/index.html index 906d58d..599a8b0 100644 --- a/modules/3-loss-functions-for-classification/index.html +++ b/modules/3-loss-functions-for-classification/index.html @@ -10,4 +10,4 @@ C = 8 input = torch.randn(3,C,4,5) target = torch.empty(3,4,5, dtype=torch.long).random_(0,C) -assert loss1(m(input),target) == loss2(input,target)

Quiz

To check you know your loss, you can do the quizzes

\ No newline at end of file +assert loss1(m(input),target) == loss2(input,target)
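
The assert in this hunk checks the usual equivalence between LogSoftmax followed by NLLLoss and CrossEntropyLoss. Assuming that is what m, loss1 and loss2 stand for (their definitions are outside the hunk, so this is an assumption), a self-contained version reads:

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)        # assumed definition of m
loss1 = nn.NLLLoss()            # assumed definition of loss1
loss2 = nn.CrossEntropyLoss()   # assumed definition of loss2

C = 8
input = torch.randn(3, C, 4, 5)
target = torch.empty(3, 4, 5, dtype=torch.long).random_(0, C)
assert loss1(m(input), target) == loss2(input, target)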

Quiz

To check you know your loss, you can do the quizzes

\ No newline at end of file diff --git a/modules/4-optimization-for-deep-learning/index.html b/modules/4-optimization-for-deep-learning/index.html index 45e303e..f71d029 100644 --- a/modules/4-optimization-for-deep-learning/index.html +++ b/modules/4-optimization-for-deep-learning/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 4 - Optimization for deep learning

Table of Contents

Optimization for deep learning


0:00 Recap
0:31 Plan
1:14 Optimization in deep learning
3:44 Gradient descent variants
7:58 Setting for the jupyter notebook
9:49 Vanilla gradient descent
12:14 Momentum
15:38 Nesterov accelerated gradient descent
18:00 Adagrad
20:06 RMSProp
22:11 Adam
24:39 AMSGrad
27:09 Pytorch optimizers

Slides and Practicals

References

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 4 - Optimization for deep learning

Table of Contents

Optimization for deep learning


0:00 Recap
0:31 Plan
1:14 Optimization in deep learning
3:44 Gradient descent variants
7:58 Setting for the jupyter notebook
9:49 Vanilla gradient descent
12:14 Momentum
15:38 Nesterov accelerated gradient descent
18:00 Adagrad
20:06 RMSProp
22:11 Adam
24:39 AMSGrad
27:09 Pytorch optimizers
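As a minimal companion to these chapters, here is a sketch (ours, not the module's notebook) of the momentum update written by hand on a toy quadratic objective, next to the equivalent built-in torch.optim.SGD call; the objective and hyper-parameters are arbitrary illustrations.

import torch

# toy objective: f(w) = ||w - 3||^2, minimized at w = 3
w = torch.zeros(5, requires_grad=True)
buf = torch.zeros_like(w)               # momentum buffer
lr, momentum = 0.1, 0.9

for step in range(100):
    loss = ((w - 3) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        buf = momentum * buf + w.grad   # same convention as torch.optim.SGD
        w -= lr * buf
        w.grad.zero_()

# the same updates using the built-in optimizer
w2 = torch.zeros(5, requires_grad=True)
opt = torch.optim.SGD([w2], lr=lr, momentum=momentum)
for step in range(100):
    opt.zero_grad()
    ((w2 - 3) ** 2).sum().backward()
    opt.step()

print(w[:2], w2[:2])                    # both close to 3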

Slides and Practicals

References

\ No newline at end of file diff --git a/modules/5-stacking-layers/index.html b/modules/5-stacking-layers/index.html index 3497e17..a38c97b 100644 --- a/modules/5-stacking-layers/index.html +++ b/modules/5-stacking-layers/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 5 - Stacking layers

Table of Contents

Stacking layers


0:00 Recap
1:35 Plan of the lesson: define a NN model
2:24 MLP with pytorch Sequential
6:41 Using torch.nn.Module
10:08 Writing a pytorch module

Slides

Practicals

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 5 - Stacking layers

Table of Contents

Stacking layers


0:00 Recap
1:35 Plan of the lesson: define a NN model
2:24 MLP with pytorch Sequential
6:41 Using torch.nn.Module
10:08 Writing a pytorch module
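To fix ideas, here is a small sketch (ours, not the module's notebook) of the same two-layer MLP written in the two styles covered above, with nn.Sequential and as a custom nn.Module; the layer sizes are arbitrary.

import torch
import torch.nn as nn

# style 1: stacking layers with nn.Sequential
mlp_sequential = nn.Sequential(
    nn.Linear(784, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)

# style 2: writing our own nn.Module
class MLP(nn.Module):
    def __init__(self, d_in=784, d_hidden=100, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

x = torch.randn(32, 784)
print(mlp_sequential(x).shape, MLP()(x).shape)   # both torch.Size([32, 10])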

Slides

Practicals

\ No newline at end of file diff --git a/modules/6-convolutional-neural-network/index.html b/modules/6-convolutional-neural-network/index.html index 1caf3b7..37dab9e 100644 --- a/modules/6-convolutional-neural-network/index.html +++ b/modules/6-convolutional-neural-network/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 6 - Convolutional neural network

Table of Contents

Convolutional neural network


0:00 Recap
0:52 MNIST dataset
2:56 A simple binary classifier
6:21 Precision and recall
8:44 Filters and convolutions
19:40 Max pooling

Notebook

Practicals


28:24 Practicals: your first CNN

Post

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 6 - Convolutional neural network

Table of Contents

Convolutional neural network


0:00 Recap
0:52 MNIST dataset
2:56 A simple binary classifier
6:21 Precision and recall
8:44 Filters and convolutions
19:40 Max pooling
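As a minimal illustration of the filters, convolutions and max pooling listed above, here is a small CNN sketch (ours, with arbitrary layer sizes chosen for 28x28 grayscale inputs; it is not the official solution of the practicals).

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

x = torch.randn(8, 1, 28, 28)      # a batch of MNIST-like images
print(SmallCNN()(x).shape)         # torch.Size([8, 10])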

Notebook

Practicals


28:24 Practicals: your first CNN

Post

\ No newline at end of file diff --git a/modules/7-dataloading/index.html b/modules/7-dataloading/index.html index 27582e3..d4318d2 100644 --- a/modules/7-dataloading/index.html +++ b/modules/7-dataloading/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 7 - Dataloading

Table of Contents

Dataloading


0:00 Recap
1:09 Plan of the lesson
2:08 Dataloading
4:40 Example 1: torchvision.datasets.ImageFolder
9:45 Example 2: dataset from numpy arrays
14:47 Example 3: custom dataloader

Slides

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 7 - Dataloading

Table of Contents

Dataloading


0:00 Recap
1:09 Plan of the lesson
2:08 Dataloading
4:40 Example 1: torchvision.datasets.ImageFolder
9:45 Example 2: dataset from numpy arrays
14:47 Example 3: custom dataloader
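Here is a condensed sketch of the three examples listed above (the image path, array shapes and class names are placeholders, not the module's data).

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torchvision import datasets, transforms

# Example 1: a dataset from a folder organised as root/class_name/image.jpg
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
# image_data = datasets.ImageFolder("path/to/images", transform=transform)  # placeholder path

# Example 2: a dataset built from numpy arrays
features = np.random.randn(100, 5).astype(np.float32)
labels = np.random.randint(0, 2, size=100)
numpy_data = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))

# Example 3: a custom dataset only needs __len__ and __getitem__
class MyDataset(Dataset):
    def __init__(self, features, labels):
        self.features, self.labels = features, labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

loader = DataLoader(MyDataset(features, labels), batch_size=16, shuffle=True)
x, y = next(iter(loader))          # x: (16, 5) float tensor, y: (16,) integer tensor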

Slides

\ No newline at end of file diff --git a/modules/8a-embedding-layers/index.html b/modules/8a-embedding-layers/index.html index 619e9fb..d104688 100644 --- a/modules/8a-embedding-layers/index.html +++ b/modules/8a-embedding-layers/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 8a - Embedding layers

Table of Contents

Embedding layers


17:46 Dealing with symbolic data
18:31 One-hot encoding
22:46 Embeddings
27:40 Pytorch sparse layer

Slides

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 8a - Embedding layers

Table of Contents

Embedding layers


17:46 Dealing with symbolic data
18:31 One-hot encoding
22:46 Embeddings
27:40 Pytorch sparse layer
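The sketch below (ours, with an arbitrary vocabulary size) shows the link between one-hot encodings and the PyTorch embedding layer mentioned above: an embedding lookup is a one-hot vector multiplied by the embedding matrix, without ever materialising the one-hot vectors.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

# symbolic data: a batch of 3 sequences of 5 token indices
tokens = torch.randint(0, vocab_size, (3, 5))
vectors = embedding(tokens)                      # shape (3, 5, 4)

# the same result with an explicit one-hot encoding
one_hot = F.one_hot(tokens, vocab_size).float()  # shape (3, 5, 10)
assert torch.allclose(one_hot @ embedding.weight, vectors)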

Slides

\ No newline at end of file diff --git a/modules/8b-collaborative-filtering/index.html b/modules/8b-collaborative-filtering/index.html index ea87c22..9f35c37 100644 --- a/modules/8b-collaborative-filtering/index.html +++ b/modules/8b-collaborative-filtering/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 8b - Collaborative filtering

Table of Contents

Collaborative filtering


0:00 Collaborative filtering
6:50 Movielens dataset: data wrangling with pandas
11:36 Test/train split with sklearn
13:51 The dot model neural network
19:03 Checking your model
21:19 Coding the training loop
21:49 Checking your training loop
23:27 FactorizationModel: a deep learning framework
27:36 Checking your FactorizationModel
30:55 Sorting the movies
33:00 PCA of movies embeddings
36:40 SPOTLIGHT lib

Notebook

Practicals


13:51 Start with your implementation of the dot model
\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 8b - Collaborative filtering

Table of Contents

Collaborative filtering


0:00 Collaborative filtering
6:50 Movielens dataset: data wrangling with pandas
11:36 Test/train split with sklearn
13:51 The dot model neural network
19:03 Checking your model
21:19 Coding the training loop
21:49 Checking your training loop
23:27 FactorizationModel: a deep learning framework
27:36 Checking your FactorizationModel
30:55 Sorting the movies
33:00 PCA of movies embeddings
36:40 SPOTLIGHT lib

Notebook

Practicals


13:51 Start with your implementation of the dot model
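As a starting point for this practical, here is a minimal sketch of a dot-product model with biases (the embedding size, names and dataset sizes are arbitrary choices, not the reference solution).

import torch
import torch.nn as nn

class DotModel(nn.Module):
    """Rating = dot product of a user and an item embedding, plus biases."""
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)

    def forward(self, user_ids, item_ids):
        dot = (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)
        return dot + self.user_bias(user_ids).squeeze(1) + self.item_bias(item_ids).squeeze(1)

model = DotModel(n_users=1000, n_items=1700)
users = torch.randint(0, 1000, (64,))
items = torch.randint(0, 1700, (64,))
print(model(users, items).shape)    # torch.Size([64]) predicted ratings

Training then amounts to minimising an MSE loss between these predictions and the observed ratings on the training split.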
\ No newline at end of file diff --git a/modules/8c-word2vec/index.html b/modules/8c-word2vec/index.html index dd54a2b..e1101fb 100644 --- a/modules/8c-word2vec/index.html +++ b/modules/8c-word2vec/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 8c - Word2vec

Table of Contents

Practicals

References

word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method by Yoav Goldberg and Omer Levy

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 8c - Word2vec

Table of Contents

Practicals

References

word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method by Yoav Goldberg and Omer Levy

\ No newline at end of file diff --git a/modules/9a-autoencoders/index.html b/modules/9a-autoencoders/index.html index 11e3b25..a420bea 100644 --- a/modules/9a-autoencoders/index.html +++ b/modules/9a-autoencoders/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 9a - Autoencoders

Table of Contents

Autoencoders


0:00 Recap and unsupervised learning
2:19 Plan
3:09 Theory of autoencoders
7:29 Practice of autoencoders in PyTorch
11:19 Representation learning with autoencoders
15:55 Practicals
16:49 A simple autoencoder
20:10 Stacked autoencoders
22:16 Interpolation
22:29 Denoising autoencoder

Slides

Practicals

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 9a - Autoencoders

Table of Contents

Autoencoders


0:00 Recap and unsupervised learning
2:19 Plan
3:09 Theory of autoencoders
7:29 Practice of autoencoders in PyTorch
11:19 Representation learning with autoencoders
15:55 Practicals
16:49 A simple autoencoder
20:10 Stacked autoencoders
22:16 Interpolation
22:29 Denoising autoencoder
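Here is a minimal sketch of the simple and denoising autoencoders listed above (layer sizes, noise level and names are ours, not the practicals' solution).

import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_in), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                           # a batch of flattened images in [0, 1]
recon_loss = F.mse_loss(model(x), x)              # plain autoencoder: reconstruct the input

# denoising autoencoder: corrupt the input but reconstruct the clean target
noisy = (x + 0.3 * torch.randn_like(x)).clamp(0, 1)
denoise_loss = F.mse_loss(model(noisy), x)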

Slides

Practicals

\ No newline at end of file diff --git a/modules/9b-unet/index.html b/modules/9b-unet/index.html index 262aafa..969260e 100644 --- a/modules/9b-unet/index.html +++ b/modules/9b-unet/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 9b - UNets

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 9b - UNets

\ No newline at end of file diff --git a/modules/9c-flows/index.html b/modules/9c-flows/index.html index fee7a00..60bdd2d 100644 --- a/modules/9c-flows/index.html +++ b/modules/9c-flows/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module 9c - Flows

Table of Contents

Normalizing flows

The image below is taken from this very good blog post on normalizing flows: blogpost

Here we only describe flow-based generative models, you can have look at VAE and GAN.

A flow-based generative model is constructed by a sequence of invertible transformations. The main advantage of flows is that the model explicitly learns the data distribution \(p(\mathbf{x})\) and therefore the loss function is simply the negative log-likelihood.

Given a sample \(\mathbf{x}\) and a prior \(p(\mathbf{z})\), we compute \(f(\mathbf{x}) = \mathbf{z}\) with an invertible function \(f\) that will be learned. Given \(f\) and the prior \(p(\mathbf{z})\), we can compute the evidence \(p(\mathbf{x})\) thanks to the change of variable formula:

\[\begin{aligned} \mathbf{z} &\sim p(\mathbf{z}), \mathbf{z} = f(\mathbf{x}), \\ p(\mathbf{x}) &= p(\mathbf{z}) \left\vert \det \dfrac{d \mathbf{z}}{d \mathbf{x}} \right\vert = p(f(\mathbf{x})) \left\vert \det \dfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right\vert \end{aligned}\]

where \(\dfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\) is the Jacobian matrix of \(f\). Recall that given a function mapping a \(n\)-dimensional input vector \(\mathbf{x}\) to a \(m\)-dimensional output vector, \(f: \mathbb{R}^n \mapsto \mathbb{R}^m\), the matrix of all first-order partial derivatives of this function is called the Jacobian matrix, \(J_f\) where one entry on the i-th row and j-th column is \((J_f(\mathbf{x}))_{ij} = \frac{\partial f_i(\mathbf{x})}{\partial x_j}\):

\[\begin{aligned} {J_f(\mathbf{x})} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\[6pt] \vdots & \ddots & \vdots \\[6pt] \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \\[6pt] \end{bmatrix} \end{aligned}\]

Below, we will parametrize \(f\) with a neural network and learn \(f\) by maximizing \(\ln p(\mathbf{x})\). More precisely, given a dataset \((\mathbf{x}_1,\dots,\mathbf{x}_n)\) and a model provided by a prior \(p(\mathbf{z})\) and a neural network \(f\), we optimize the weights of \(f\) by minimizing:

\[\begin{aligned} -\sum_{i}\ln p(\mathbf{x_i}) = \sum_i -\ln p(f(\mathbf{x}_i)) -\ln\left\vert \det \dfrac{\partial f(\mathbf{x}_i)}{\partial \mathbf{x}} \right\vert. \end{aligned}\]

We need to ensure that \(f\) is always invertible and that the determinant is simple to compute.

Density estimation using Real NVP

Real NVP (introduced by Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio in 2016) uses function \(f\) obtained by stacking affine coupling layers which for an input \(\mathbf{x}\in \mathbb{R}^D\) produce the output \(\mathbf{y}\in\mathbb{R}^D\) defined by (with \( d < D \) ):

\[\begin{aligned} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d}\\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp\left(s(\mathbf{x}_{1:d})\right) +t(\mathbf{x}_{1:d}) , \end{aligned}\]

where \(s\) (scale) and \(t\) (translation) are neural networks mapping \(\mathbb{R}^d\) to \(\mathbb{R}^{D-d}\) and \(\odot\) is the element-wise product.

For any functions \(s\) and \(t\), the affine coupling layer is invertible:

\[\begin{aligned} \begin{cases} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp({s(\mathbf{x}_{1:d})}) + t(\mathbf{x}_{1:d}) \end{cases} \Leftrightarrow \begin{cases} \mathbf{x}_{1:d} &= \mathbf{y}_{1:d} \\ \mathbf{x}_{d+1:D} &= (\mathbf{y}_{d+1:D} - t(\mathbf{y}_{1:d})) \odot \exp(-s(\mathbf{y}_{1:d})) \end{cases} \end{aligned}\]

The Jacobian of an affine coupling layer is a lower triangular matrix:

\[\begin{aligned} J(\mathbf{x}) = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}= \begin{bmatrix} \mathbb{I}_d & \mathbf{0}_{d\times(D-d)} \\[5pt] \frac{\partial \mathbf{y}_{d+1:D}}{\partial \mathbf{x}_{1:d}} & \text{diag}(\exp(s(\mathbf{x}_{1:d}))) \end{bmatrix} \end{aligned}\]

Hence the determinant is simply the product of terms on the diagonal:

\[\begin{aligned} \left\vert\det(J(\mathbf{x}))\right\vert = \prod_{j=1}^{D-d}\exp(s(\mathbf{x}_{1:d}))_j = \exp\left(\sum_{j=1}^{D-d} s(\mathbf{x}_{1:d})_j\right) \end{aligned}\]

Note that, we do not need to compute the Jacobian of \(s\) or \(t\) and to compute \(f^{-1}\), we do not need to compute the inverse of \(s\) or \(t\) (which might not exist!). In other words, we can take arbitrary complex functions for \(s\) and \(t\).

In one affine coupling layer, some dimensions (channels) remain unchanged. To make sure all the inputs have a chance to be altered, the model reverses the ordering in each layer so that different components are left unchanged. Following such an alternating pattern, the set of units which remain identical in one transformation layer are always modified in the next.

This can be implemented with binary masks. First, we can extend the scale and neural networks to mappings form \(\mathbb{R}^D\) to \(\mathbb{R}^D\). Then taking a mask \(\mathbf{b} = (1,\dots,1,0,\dots,0)\) with \(d\) ones, so that we have for the affine layer:

\[\begin{aligned} \mathbf{y} = \mathbf{x} \odot \exp\big((1-\mathbf{b}) \odot s(\mathbf{b} \odot \mathbf{x})\big) + (1-\mathbf{b}) \odot t(\mathbf{b} \odot \mathbf{x}). \end{aligned}\]

Note that we have

\[\begin{aligned} \ln \left\vert\det(J(\mathbf{x}))\right\vert = \sum_{j=1}^{D} \Big((1-\mathbf{b})\odot s(\mathbf{b} \odot \mathbf{x})\Big)_j, \end{aligned}\]

and to invert the affine layer:

\[\begin{aligned} \mathbf{x} = \left( \mathbf{y} -(1-\mathbf{b}) \odot t(\mathbf{b} \odot \mathbf{y})\right)\odot \exp\left( -(1-\mathbf{b}) \odot s(\mathbf{b} \odot \mathbf{y})\right). \end{aligned}\]

Now we alternates the binary mask \(\mathbf{b}\) from one coupling layer to the other.

Note, that the formula given in the paper is slightly different:

\[\mathbf{y} = \mathbf{b} \odot \mathbf{x} + (1 - \mathbf{b}) \odot \Big(\mathbf{x} \odot \exp\big(s(\mathbf{b} \odot \mathbf{x})\big) + t(\mathbf{b} \odot \mathbf{x})\Big),\]

but the 2 formulas give the same result!

Implementation of Real NVP

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module 9c - Flows

Table of Contents

Normalizing flows

The image below is taken from this very good blog post on normalizing flows: blogpost

Here we only describe flow-based generative models; you can have a look at VAE and GAN.

A flow-based generative model is constructed by a sequence of invertible transformations. The main advantage of flows is that the model explicitly learns the data distribution \(p(\mathbf{x})\) and therefore the loss function is simply the negative log-likelihood.

Given a sample \(\mathbf{x}\) and a prior \(p(\mathbf{z})\), we compute \(f(\mathbf{x}) = \mathbf{z}\) with an invertible function \(f\) that will be learned. Given \(f\) and the prior \(p(\mathbf{z})\), we can compute the evidence \(p(\mathbf{x})\) thanks to the change of variable formula:

\[\begin{aligned} \mathbf{z} &\sim p(\mathbf{z}), \mathbf{z} = f(\mathbf{x}), \\ p(\mathbf{x}) &= p(\mathbf{z}) \left\vert \det \dfrac{d \mathbf{z}}{d \mathbf{x}} \right\vert = p(f(\mathbf{x})) \left\vert \det \dfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right\vert \end{aligned}\]

where \(\dfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\) is the Jacobian matrix of \(f\). Recall that given a function mapping an \(n\)-dimensional input vector \(\mathbf{x}\) to an \(m\)-dimensional output vector, \(f: \mathbb{R}^n \mapsto \mathbb{R}^m\), the matrix of all first-order partial derivatives of this function is called the Jacobian matrix \(J_f\), whose entry in the i-th row and j-th column is \((J_f(\mathbf{x}))_{ij} = \frac{\partial f_i(\mathbf{x})}{\partial x_j}\):

\[\begin{aligned} {J_f(\mathbf{x})} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\[6pt] \vdots & \ddots & \vdots \\[6pt] \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \\[6pt] \end{bmatrix} \end{aligned}\]

Below, we will parametrize \(f\) with a neural network and learn \(f\) by maximizing \(\ln p(\mathbf{x})\). More precisely, given a dataset \((\mathbf{x}_1,\dots,\mathbf{x}_n)\) and a model provided by a prior \(p(\mathbf{z})\) and a neural network \(f\), we optimize the weights of \(f\) by minimizing:

\[\begin{aligned} -\sum_{i}\ln p(\mathbf{x_i}) = \sum_i -\ln p(f(\mathbf{x}_i)) -\ln\left\vert \det \dfrac{\partial f(\mathbf{x}_i)}{\partial \mathbf{x}} \right\vert. \end{aligned}\]

We need to ensure that \(f\) is always invertible and that the determinant is simple to compute.

Density estimation using Real NVP

Real NVP (introduced by Laurent Dinh, Jascha Sohl-Dickstein and Samy Bengio in 2016) uses a function \(f\) obtained by stacking affine coupling layers, which for an input \(\mathbf{x}\in \mathbb{R}^D\) produce the output \(\mathbf{y}\in\mathbb{R}^D\) defined by (with \( d < D \)):

\[\begin{aligned} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d}\\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp\left(s(\mathbf{x}_{1:d})\right) +t(\mathbf{x}_{1:d}) , \end{aligned}\]

where \(s\) (scale) and \(t\) (translation) are neural networks mapping \(\mathbb{R}^d\) to \(\mathbb{R}^{D-d}\) and \(\odot\) is the element-wise product.

For any functions \(s\) and \(t\), the affine coupling layer is invertible:

\[\begin{aligned} \begin{cases} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp({s(\mathbf{x}_{1:d})}) + t(\mathbf{x}_{1:d}) \end{cases} \Leftrightarrow \begin{cases} \mathbf{x}_{1:d} &= \mathbf{y}_{1:d} \\ \mathbf{x}_{d+1:D} &= (\mathbf{y}_{d+1:D} - t(\mathbf{y}_{1:d})) \odot \exp(-s(\mathbf{y}_{1:d})) \end{cases} \end{aligned}\]

The Jacobian of an affine coupling layer is a lower triangular matrix:

\[\begin{aligned} J(\mathbf{x}) = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}= \begin{bmatrix} \mathbb{I}_d & \mathbf{0}_{d\times(D-d)} \\[5pt] \frac{\partial \mathbf{y}_{d+1:D}}{\partial \mathbf{x}_{1:d}} & \text{diag}(\exp(s(\mathbf{x}_{1:d}))) \end{bmatrix} \end{aligned}\]

Hence the determinant is simply the product of terms on the diagonal:

\[\begin{aligned} \left\vert\det(J(\mathbf{x}))\right\vert = \prod_{j=1}^{D-d}\exp(s(\mathbf{x}_{1:d}))_j = \exp\left(\sum_{j=1}^{D-d} s(\mathbf{x}_{1:d})_j\right) \end{aligned}\]

Note that we do not need to compute the Jacobian of \(s\) or \(t\), and that to compute \(f^{-1}\) we do not need to invert \(s\) or \(t\) (whose inverses might not even exist!). In other words, we can take arbitrarily complex functions for \(s\) and \(t\).

In one affine coupling layer, some dimensions (channels) remain unchanged. To make sure all the inputs have a chance to be altered, the model reverses the ordering in each layer so that different components are left unchanged. Following such an alternating pattern, the units which remain identical in one transformation layer are always modified in the next.

This can be implemented with binary masks. First, we extend the scale and translation networks to mappings from \(\mathbb{R}^D\) to \(\mathbb{R}^D\). Then we take a mask \(\mathbf{b} = (1,\dots,1,0,\dots,0)\) with \(d\) ones, so that the affine layer becomes:

\[\begin{aligned} \mathbf{y} = \mathbf{x} \odot \exp\big((1-\mathbf{b}) \odot s(\mathbf{b} \odot \mathbf{x})\big) + (1-\mathbf{b}) \odot t(\mathbf{b} \odot \mathbf{x}). \end{aligned}\]

Note that we have

\[\begin{aligned} \ln \left\vert\det(J(\mathbf{x}))\right\vert = \sum_{j=1}^{D} \Big((1-\mathbf{b})\odot s(\mathbf{b} \odot \mathbf{x})\Big)_j, \end{aligned}\]

and to invert the affine layer:

\[\begin{aligned} \mathbf{x} = \left( \mathbf{y} -(1-\mathbf{b}) \odot t(\mathbf{b} \odot \mathbf{y})\right)\odot \exp\left( -(1-\mathbf{b}) \odot s(\mathbf{b} \odot \mathbf{y})\right). \end{aligned}\]

We then alternate the binary mask \(\mathbf{b}\) from one coupling layer to the next.

Note that the formula given in the paper is slightly different:

\[\mathbf{y} = \mathbf{b} \odot \mathbf{x} + (1 - \mathbf{b}) \odot \Big(\mathbf{x} \odot \exp\big(s(\mathbf{b} \odot \mathbf{x})\big) + t(\mathbf{b} \odot \mathbf{x})\Big),\]

but the 2 formulas give the same result!

Implementation of Real NVP
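As a starting point, here is a minimal PyTorch sketch of one masked affine coupling layer implementing the three formulas above (forward map, log-determinant and inverse); the hidden sizes, the Tanh used to keep the scale \(s\) bounded, and all names are our own choices rather than the reference implementation.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y = x * exp((1-b) * s(b*x)) + (1-b) * t(b*x), with a fixed binary mask b."""
    def __init__(self, dim, hidden=64, mask_parity=0):
        super().__init__()
        # alternate mask_parity from one coupling layer to the next
        self.register_buffer("b", (torch.arange(dim) % 2 == mask_parity).float())
        self.s = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim), nn.Tanh())
        self.t = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        bx = self.b * x
        s, t = self.s(bx), self.t(bx)
        y = x * torch.exp((1 - self.b) * s) + (1 - self.b) * t
        log_det = ((1 - self.b) * s).sum(dim=1)    # log |det J(x)|
        return y, log_det

    def inverse(self, y):
        by = self.b * y                            # masked components are unchanged, so b*y = b*x
        s, t = self.s(by), self.t(by)
        return (y - (1 - self.b) * t) * torch.exp(-(1 - self.b) * s)

layer = AffineCoupling(dim=2)
x = torch.randn(5, 2)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)

Stacking such layers with alternating masks and summing their log-determinants gives the term \(\ln\left\vert \det \partial f(\mathbf{x})/\partial \mathbf{x}\right\vert\) needed in the negative log-likelihood above, with for instance a standard Gaussian prior for \(p(\mathbf{z})\).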

\ No newline at end of file diff --git a/modules/extras/Convolutions_first/index.html b/modules/extras/Convolutions_first/index.html index 5f5e260..e01f4f6 100644 --- a/modules/extras/Convolutions_first/index.html +++ b/modules/extras/Convolutions_first/index.html @@ -91,4 +91,4 @@ end plot(target, (-1.,1.)...,label="target") ylims!((-10,10)) -plot!(pred, (-1.,1.)...,label="pred")

training_plot

We see that we get a pretty good approximation of our target polynomial. Below is a gif showing the convergence of our network towards the target:

gif

By stacking convolutions with kernel of size 3, we obtained a network with a receptive field of size 9.

Thanks for reading!

Follow on twitter!

\ No newline at end of file +plot!(pred, (-1.,1.)...,label="pred")

training_plot

We see that we get a pretty good approximation of our target polynomial. Below is a gif showing the convergence of our network towards the target:

gif

By stacking convolutions with kernel of size 3, we obtained a network with a receptive field of size 9.

Thanks for reading!

Follow on twitter!

\ No newline at end of file diff --git a/modules/extras/GCN_inductivebias_spectral/index.html b/modules/extras/GCN_inductivebias_spectral/index.html index d037d5b..a604129 100644 --- a/modules/extras/GCN_inductivebias_spectral/index.html +++ b/modules/extras/GCN_inductivebias_spectral/index.html @@ -194,7 +194,7 @@

Th
diff --git a/modules/extras/attention/transformer_vizu.gif b/modules/extras/attention/transformer_vizu.gif new file mode 100644 index 0000000..79140f7 Binary files /dev/null and b/modules/extras/attention/transformer_vizu.gif differ diff --git a/modules/extras/graph_invariant/index.html b/modules/extras/graph_invariant/index.html index d43fa1c..e003373 100644 --- a/modules/extras/graph_invariant/index.html +++ b/modules/extras/graph_invariant/index.html @@ -1 +1 @@ - Exploiting Graph Invariants in Deep Learning

Exploiting Graph Invariants in Deep Learning


0:48 Skip the french part!
\ No newline at end of file + Exploiting Graph Invariants in Deep Learning

Exploiting Graph Invariants in Deep Learning


0:48 Skip the french part!
\ No newline at end of file diff --git a/modules/extras/invariant_equivariant/index.html b/modules/extras/invariant_equivariant/index.html index 104de26..8cdbd4e 100644 --- a/modules/extras/invariant_equivariant/index.html +++ b/modules/extras/invariant_equivariant/index.html @@ -1 +1 @@ - Invariant and Equivariant layers

Invariant and equivariant layers with applications to GNN, PointNet and Transformers

author: Marc Lelarge, course: dataflowr

date: April 23, 2021

Invariant and equivariant functions

As shown in the module on GNN, invariant and equivariant functions are crucial for GNN. For example, the message passing GNN (MGNN) layer is defined by:

hi+1=f(hi,{{hj}}ji), h^{\ell+1}_i = f(h^\ell_i , \{\{ h^\ell_j\}\}_{j\sim i}),

where iji\sim j means that nodes ii and jj are neighbors and the function ff should not depend on the order of the elements in the multiset {{hj}}ji\{\{ h^\ell_j\}\}_{j\sim i}. This layer is applied in parallel to all nodes (with the same function ff) producing a mapping from h=(h1,hn){\bf h}^\ell = (h^\ell_1\dots, h^\ell_n) to F(h)=h+1F({\bf h}^\ell) = {\bf h}^{\ell+1} with F:RnRnF:\mathbb{R}^n \to \mathbb{R}^n where nn is the number of nodes in the graph (and only real hidden states are considered for simplicity). It is easy to see that FF is an equivariant function, i.e. permuting its input will permute its output.

Another example of invariant and equivariant functions is given by the attention layer Attention(Q,K,V)=Z\text{Attention}(Q,K,V) = Z defined for QQ a tensor of row queries, KK the keys and VV the values, Q,K,VRn×dQ,K,V\in \mathbb{R}^{n\times d} by

Zj=i=1nsoftmaxi(QjKiT)Vi. Z_j = \sum_{i=1}^n \text{softmax}_i(Q_jK_i^T) V_i.

The queries are obtained from a tensor XRn×cX\in \mathbb{R}^{n\times c} by Q=XWQTQ= XW_Q^T and the keys and values are obtained from a tensor XRn×cX' \in \mathbb{R}^{n\times c'} by K=XWKTK = X' W_K^T and V=XWVTV = X' W_V^T. We see that when the queries are fixed, the attention layer is invariant in the pair (keys, values):

Zj=i=1nsoftmaxi(QjKσ(i)T)Vσ(i), Z_j = \sum_{i=1}^n \text{softmax}_{i}(Q_j K_{\sigma(i)}^T) V_{\sigma(i)},

hence Attention(X,X)\text{Attention}(X,X') is invariant in XX'. Similarly, when the pair (keys, values) is fixed, the attention layer is equivariant in the queries:

Zσ(j)=i=1nsoftmaxi(Qσ(j)KiT)Vi, Z_{\sigma(j)} = \sum_{i=1}^n \text{softmax}_{i}(Q_{\sigma(j)}K_{i}^T) V_{i},

hence Attention(X,X)\text{Attention}(X,X') is equivariant in XX. If X=XX'=X, we get the self-attention layer so that SelfAttention(X)=Attention(X,X)\text{SelfAttention}(X) = \text{Attention}(X,X) is equivariant in XX.

In this post, we will characterize invariant and equivariant functions following the ideas given in the paper Deep Sets.

Representation of invariant and equivariant functions

We start with some definitions.

For a vector x=(x1,,xn)Rn{\bf x} = (x_1,\dots, x_n)\in \mathbb{R}^n and a permutation σSn\sigma \in \mathcal{S}_n, we define

σx=(xσ1(1),,xσ1(n)) \sigma \star {\bf x} = (x_{\sigma^{-1}(1)},\dots, x_{\sigma^{-1}(n)})

Definitions:

  • A function f:RnRf:\mathbb{R}^n\to \mathbb{R} is invariant if for all x{\bf x} and all σSn\sigma \in \mathcal{S}_n, we have f(σx)=f(x)f(\sigma \star {\bf x}) = f({\bf x}).

  • A function f:RnRnf:\mathbb{R}^n\to \mathbb{R}^n is equivariant if for all x{\bf x} and all σSn\sigma \in \mathcal{S}_n, we have f(σx)=σf(x)f(\sigma \star {\bf x}) = \sigma \star f({\bf x}).

We can now state our main result:

Theorem

  • invariant case: let f:[0,1]nRf:[0,1]^n \to \mathbb R be a continuous function. ff is invariant if and only if there are continuous functions ϕ:[0,1]Rn\phi: [0,1] \to \mathbb R^n and ρ:RnR\rho: \mathbb R^n\to \mathbb R such that

f(x)=ρ(i=1nϕ(xi)) f({\bf x}) = \rho\left( \sum_{i=1}^n \phi(x_i)\right)
  • equivariant case: let f:[0,1]nRnf:[0,1]^n \to \mathbb R^n be a continuous function. ff is equivariant if and only if there are continuous functions ϕ:[0,1]Rn\phi: [0,1] \to \mathbb R^n and ρ:[0,1]×RnR\rho: [0,1]\times \mathbb R^n\to \mathbb R such that

fj(x)=ρ(xj,i=1nϕ(xi)) f_j({\bf x}) = \rho\left( x_j, \sum_{i=1}^n \phi(x_i)\right)

We give some remarks before providing the proof below. For the sake of simplicity, we consider here a fixed number of points nn on the unit interval [0,1][0,1]. For results with a varying number of points, see On the Limitations of Representing Functions on Sets and for points in higher dimension [0,1]d[0,1]^d with d>1d>1, see On Universal Equivariant Set Networks and Expressive Power of Invariant and Equivariant Graph Neural Networks.

Our proof will make the mapping ϕ\phi explicit and it will not depend on the function ff. The mapping ϕ\phi can be seen as an embedding of the points in [0,1][0,1] in a space of high-dimension. Indeed this embedding space has to be of dimension at least the number of points nn in order to ensure universality. This is an important remark as in learning scenario, the size of the embedding is typically fixed and hence will limit the expressiveness of the algorithm.

Coming back to the GNN layer (1), our result on the invariant case tells us that we can always rewrite it as:

hi+1=ρ(hi,jiϕ(hj)), h^{\ell+1}_i =\rho\left( h_i^{\ell}, \sum_{j\sim i} \phi(h^\ell_j)\right),

and the dimension of the embedding ϕ(h)\phi(h) needs to be of the same order as the maximum degree in the graph. Note that (8) is not of the form of (7) as the sum inside the ρ\rho function is taken only on neighbors. Indeed, we know that message passing GNN are not universal (see Expressive Power of Invariant and Equivariant Graph Neural Networks).

As a last remark, note that the original PointNet architecture ff is of the form fi(x)=ρ(xi)f_i({\bf x}) = \rho(x_i) which is not universal equivariant. Indeed, it is impossible to approximate the equivariant function gi(x)=ixig_i({\bf x}) = \sum_i x_i as shown below (we denote e1=(1,0,,0){\bf e}_1=(1,0,\dots,0)):

f(0)g(0)2=nρ(0)2f(e1)g(e1)2=(ρ(1)1)2+(n1)(ρ(0)1)2(n1)(ρ(0)1)2, \|f(0) - g(0)\|^2 = n \rho(0)^2\\ \|f({\bf e}_1) -g({\bf e}_1)\|^2 = (\rho(1)-1)^2 + (n-1)(\rho(0)-1)^2\geq (n-1)(\rho(0)-1)^2,

and these quantities cannot be small together. Hence PointNet is not universal equivariant but as shown in On Universal Equivariant Set Networks, modifying PointNet by adding the term i=1nϕ(xi) \sum_{i=1}^n \phi(x_i) inside the ρ\rho function as in (7) makes it universal equivariant. We refer to Are Transformers universal approximators of sequence-to-sequence functions? for similar results about transformers based on self-attention.

Proof of the Theorem

We first show that the equivariant case is not more difficult than the invariant case. Assume that we proved the invariant case. Consider a permutation σSn\sigma\in \mathcal S_n such that σ(1)=1\sigma(1)=1 so that f(σx)=σf(x)f(\sigma \star {\bf x}) = \sigma \star f({\bf x}) gives for the first component:

f1(x1,xσ(2),,xσ(n))=f1(x1,x2,,xn). f_1(x_1,x_{\sigma(2)},\dots, x_{\sigma(n)}) = f_1(x_1,x_2,\dots, x_n).

For any x1x_1, the mapping (x2,,xn)f1(x1,x2,,xn)(x_2,\dots, x_n) \mapsto f_1(x_1, x_2,\dots, x_n) is invariant. Hence by (6), we have

f1(x1,x2,,xn)=ρ(x1,i1ϕ(xi)) f_1(x_1,x_2,\dots, x_n) = \rho\left(x_1, \sum_{i\neq 1}\phi(x_i) \right)

Now consider a permutation such that σ(1)=k,σ(k)=1\sigma(1)=k, \sigma(k)=1 and σ(i)=i\sigma(i)=i for i1,ki\neq 1,k, then we have

fk(x1,x2,,xn)=f1(xk,x2,x1,xn), f_k(x_1,x_2,\dots, x_n) = f_1(x_k,x_2\dots, x_1,\dots x_n),

hence fk(x1,x2,,xn)=ρ(xk,ikϕ(xi))f_k(x_1,x_2,\dots, x_n)=\rho\left(x_k, \sum_{i\neq k}\phi(x_i) \right) and (7) follows.

Hence, we only need to prove (6) and follow the proof given in Deep Sets. We start with a crucial result stating that a set of nn real points is characterized by the first nn moments of its empirical measure. Let see what it means for n=2n=2: we can recover the values of x1x_1 and x2x_2 from the quantities p1=x1+x2p_1=x_1+x_2 and p2=x12+x22p_2=x_1^2+x_2^2. To see that this is correct, note that

p12=x12+2x1x2+x22=p2+2x1x2, p_1^2 = x_1^2+2x_1x_2+x_2^2 = p_2+2x_1x_2,

so that x1x2=p12p22x_1x_2 = \frac{p_1^2-p_2}{2}. As a result, we have

(xx1)(xx2)=x2p1x+p12p22, (x-x_1)(x-x_2) = x^2-p_1x+\frac{p_1^2-p_2}{2},

and clearly x1x_1 and x2x_2 can be recovered as the roots of this polynomial whose coefficients are functions of p1p_1 and p2p_2. The result below extends this argument for a general nn:

Proposition

Let Φ:[0,1]nRn\Phi:[0,1]_{\leq}^n \to \mathbb{R}^{n}, where [0,1]n={x[0,1]n, x1x2xn}[0,1]_{\leq}^n = \{ {\bf x}\in [0,1]^n,\: x_1\leq x_2\leq \dots\leq x_n\}, be defined by

Φ(x1,x2,,xn)=(ix1,ixi2,,ixin) \Phi(x_1,x_2,\dots, x_n) = \left( \sum_i x_1, \sum_i x_i^2,\dots, \sum_i x_i^n\right)

is injective and has a continuous inverse mapping.

The proof follows from Newton's identities. For knk\leq n, we denote by pk=i=1nxikp_k = \sum_{i=1}^n x_i^k the power sums and by eke_k the elementary symmetric polynomials (note that all polynomials are function of the x1,,xnx_1,\dots, x_n):

e0=1e1=ixie2=i<jxixj e_0 = 1\\ e_1 = \sum_i x_i\\ e_2 = \sum_{i < j} x_i x_j\\ \dots

From Newton's identities, we have for knk\leq n,

kek=i=1k(1)i1ekipi, k e_k = \sum_{i=1}^k (-1)^{i-1}e_{k-i}p_i,

so that, we can express the elementary symmetric polynomials from the power sums:

e1=p12e2=e1p1p2=p12p23e3=e2p2e1p2+p3=12p1332p1p2+p3 e_1 = p_1\\ 2e_2 = e_1p_1-p_2=p_1^2-p_2\\ 3e_3 = e_2p_2-e_1p_2+p_3 = \frac{1}{2}p_1^3-\frac{3}{2}p_1p_2+p3\\ \dots

Note that Φ(x1,x2,,xn)=(p1,,pn)\Phi(x_1,x_2,\dots, x_n) = (p_1,\dots, p_n) and since

i=1n(xxi)=xne1xn1+e2xn2+(1)nen, \prod_{i=1}^n (x-x_i) = x^n -e_1x^{n-1}+e_2x^{n-2}\dots + (-1)^n e_n,

if Φ(x)=Φ(y)\Phi({\bf x}) = \Phi({\bf y}) then i=1n(xxi)=i=1n(xyi)\prod_{i=1}^n (x-x_i)=\prod_{i=1}^n (x-y_i) so that {{x1,,xn}}={{y1,,yn}}\{\{x_1,\dots, x_n\}\} = \{\{y_1,\dots, y_n\}\} and x=y[0,1]n{\bf x}={\bf y} \in [0,1]^n_{\leq}, showing that Φ\Phi is injective.

Hence we proved that Φ:[0,1]nIm(Φ)\Phi:[0,1]^n_{\leq} \to \text{Im}(\Phi) where Im(Φ)\text{Im}(\Phi) is the image of Φ\Phi, is a bijection. We need now to prove that Φ1\Phi^{-1} is continuous and we'll prove it directly. Let ykyIm(Φ){\bf y}_k \to {\bf y} \in\text{Im}(\Phi), we need to show that Φ1(yk)Φ1(y)\Phi^{-1}({\bf y}_k) \to \Phi^{-1}({\bf y}). Now if Φ1(yk)↛Φ1(y)\Phi^{-1}({\bf y}_k) \not\to \Phi^{-1}({\bf y}), since [0,1]M[0,1]^M_{\leq} is compact, this means that there exists a convergent subsequence of Φ1(yk)\Phi^{-1}({\bf y}_{k}) with Φ1(ymk)xΦ1(y)\Phi^{-1}({\bf y}_{m_k}) \to {\bf x}\neq \Phi^{-1}({\bf y}) . But by continuity of Φ\Phi, we have ymkΦ(x)=y{\bf y}_{m_k} \to \Phi({\bf x}) = {\bf y}, so that we get a contradiction and hence proved the continuity of Φ1\Phi^{-1}, finishing the proof of the proposition.

We are now ready to prove (6). Let ϕ:[0,1]Rn\phi:[0,1] \to \mathbb R^n be defined by ϕ(x)=(x,x2,,xn)\phi(x) = (x,x^2,\dots, x^n) and ρ=fΦ1\rho = f\circ \Phi^{-1}. Note that ρ:Im(Φ)R\rho: \text{Im}(\Phi) \to \mathbb R and iϕ(xi)=Φ(x)\sum_{i}\phi(x_i) = \Phi({\bf x}_{\leq}), where x{\bf x}_{\leq} is the vector x{\bf x} with components sorted in non-decreasing order. Hence as soon as f is invariant, we have f(x)=f(x)f({\bf x}) = f({\bf x}_{\leq}) so that (6) is valid. We only need to extend the function ρ\rho from the domain Im(Φ)\text{Im}(\Phi) to Rn\mathbb R^n in a continuous way. This can be done by considering the projection π\pi on the compact Im(Φ)\text{Im}(\Phi) and define ρ(x)=fΦ1(π(x))\rho({\bf x}) = f\circ \Phi^{-1}(\pi({\bf x})).

Follow on twitter!

Thanks for reading!

\ No newline at end of file + Invariant and Equivariant layers

Invariant and equivariant layers with applications to GNN, PointNet and Transformers

author: Marc Lelarge, course: dataflowr

date: April 23, 2021

Invariant and equivariant functions

As shown in the module on GNN, invariant and equivariant functions are crucial for GNN. For example, the message passing GNN (MGNN) layer is defined by:

\[h^{\ell+1}_i = f(h^\ell_i , \{\{ h^\ell_j\}\}_{j\sim i}),\]

where \(i\sim j\) means that nodes \(i\) and \(j\) are neighbors, and the function \(f\) should not depend on the order of the elements in the multiset \(\{\{ h^\ell_j\}\}_{j\sim i}\). This layer is applied in parallel to all nodes (with the same function \(f\)), producing a mapping from \({\bf h}^\ell = (h^\ell_1,\dots, h^\ell_n)\) to \(F({\bf h}^\ell) = {\bf h}^{\ell+1}\) with \(F:\mathbb{R}^n \to \mathbb{R}^n\), where \(n\) is the number of nodes in the graph (only real hidden states are considered for simplicity). It is easy to see that \(F\) is an equivariant function, i.e. permuting its input will permute its output.

Another example of invariant and equivariant functions is given by the attention layer \(\text{Attention}(Q,K,V) = Z\), defined for \(Q\) a tensor of row queries, \(K\) the keys and \(V\) the values, with \(Q,K,V\in \mathbb{R}^{n\times d}\), by

\[Z_j = \sum_{i=1}^n \text{softmax}_i(Q_jK_i^T) V_i.\]

The queries are obtained from a tensor \(X\in \mathbb{R}^{n\times c}\) by \(Q= XW_Q^T\), and the keys and values are obtained from a tensor \(X' \in \mathbb{R}^{n\times c'}\) by \(K = X' W_K^T\) and \(V = X' W_V^T\). We see that when the queries are fixed, the attention layer is invariant in the pair (keys, values):

\[Z_j = \sum_{i=1}^n \text{softmax}_{i}(Q_j K_{\sigma(i)}^T) V_{\sigma(i)},\]

hence \(\text{Attention}(X,X')\) is invariant in \(X'\). Similarly, when the pair (keys, values) is fixed, the attention layer is equivariant in the queries:

\[Z_{\sigma(j)} = \sum_{i=1}^n \text{softmax}_{i}(Q_{\sigma(j)}K_{i}^T) V_{i},\]

hence \(\text{Attention}(X,X')\) is equivariant in \(X\). If \(X'=X\), we get the self-attention layer, so that \(\text{SelfAttention}(X) = \text{Attention}(X,X)\) is equivariant in \(X\).

In this post, we will characterize invariant and equivariant functions following the ideas given in the paper Deep Sets.

Representation of invariant and equivariant functions

We start with some definitions.

For a vector \({\bf x} = (x_1,\dots, x_n)\in \mathbb{R}^n\) and a permutation \(\sigma \in \mathcal{S}_n\), we define

\[\sigma \star {\bf x} = (x_{\sigma^{-1}(1)},\dots, x_{\sigma^{-1}(n)})\]

Definitions:

  • A function \(f:\mathbb{R}^n\to \mathbb{R}\) is invariant if for all \({\bf x}\) and all \(\sigma \in \mathcal{S}_n\), we have \(f(\sigma \star {\bf x}) = f({\bf x})\).

  • A function \(f:\mathbb{R}^n\to \mathbb{R}^n\) is equivariant if for all \({\bf x}\) and all \(\sigma \in \mathcal{S}_n\), we have \(f(\sigma \star {\bf x}) = \sigma \star f({\bf x})\).

We can now state our main result:

Theorem

  • invariant case: let \(f:[0,1]^n \to \mathbb R\) be a continuous function. \(f\) is invariant if and only if there are continuous functions \(\phi: [0,1] \to \mathbb R^n\) and \(\rho: \mathbb R^n\to \mathbb R\) such that

\[f({\bf x}) = \rho\left( \sum_{i=1}^n \phi(x_i)\right)\]
  • equivariant case: let \(f:[0,1]^n \to \mathbb R^n\) be a continuous function. \(f\) is equivariant if and only if there are continuous functions \(\phi: [0,1] \to \mathbb R^n\) and \(\rho: [0,1]\times \mathbb R^n\to \mathbb R\) such that

\[f_j({\bf x}) = \rho\left( x_j, \sum_{i=1}^n \phi(x_i)\right)\]

We give some remarks before providing the proof below. For the sake of simplicity, we consider here a fixed number of points \(n\) on the unit interval \([0,1]\). For results with a varying number of points, see On the Limitations of Representing Functions on Sets, and for points in higher dimension \([0,1]^d\) with \(d>1\), see On Universal Equivariant Set Networks and Expressive Power of Invariant and Equivariant Graph Neural Networks.

Our proof will make the mapping \(\phi\) explicit, and it will not depend on the function \(f\). The mapping \(\phi\) can be seen as an embedding of the points of \([0,1]\) in a high-dimensional space. Indeed, this embedding space has to be of dimension at least the number of points \(n\) in order to ensure universality. This is an important remark: in a learning scenario, the size of the embedding is typically fixed and hence will limit the expressiveness of the algorithm.

Coming back to the GNN layer (1), our result on the invariant case tells us that we can always rewrite it as:

\[h^{\ell+1}_i =\rho\left( h_i^{\ell}, \sum_{j\sim i} \phi(h^\ell_j)\right),\]

and the dimension of the embedding \(\phi(h)\) needs to be of the same order as the maximum degree in the graph. Note that (8) is not of the form of (7), as the sum inside the \(\rho\) function is taken only over neighbors. Indeed, we know that message passing GNNs are not universal (see Expressive Power of Invariant and Equivariant Graph Neural Networks).

As a last remark, note that the original PointNet architecture \(f\) is of the form \(f_i({\bf x}) = \rho(x_i)\), which is not universal equivariant. Indeed, it is impossible to approximate the equivariant function \(g_i({\bf x}) = \sum_i x_i\), as shown below (we denote \({\bf e}_1=(1,0,\dots,0)\)):

\[\|f(0) - g(0)\|^2 = n \rho(0)^2\\ \|f({\bf e}_1) -g({\bf e}_1)\|^2 = (\rho(1)-1)^2 + (n-1)(\rho(0)-1)^2\geq (n-1)(\rho(0)-1)^2,\]

and these quantities cannot both be small. Hence PointNet is not universal equivariant, but as shown in On Universal Equivariant Set Networks, modifying PointNet by adding the term \(\sum_{i=1}^n \phi(x_i)\) inside the \(\rho\) function as in (7) makes it universal equivariant. We refer to Are Transformers universal approximators of sequence-to-sequence functions? for similar results about transformers based on self-attention.
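Before turning to the proof, here is a small PyTorch sketch (ours, not taken from Deep Sets) of networks having exactly the structure given by the theorem: a shared embedding \(\phi\), a sum over the elements, and a readout \(\rho\). The invariant version outputs one value per set and the equivariant version one value per element, and the asserts check the two symmetry properties numerically.

import torch
import torch.nn as nn

class InvariantNet(nn.Module):
    """f(x) = rho(sum_i phi(x_i)): permutation invariant by construction."""
    def __init__(self, d_embed=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, d_embed), nn.ReLU(), nn.Linear(d_embed, d_embed))
        self.rho = nn.Sequential(nn.Linear(d_embed, d_embed), nn.ReLU(), nn.Linear(d_embed, 1))

    def forward(self, x):                       # x: (batch, n), a set of n scalars
        e = self.phi(x.unsqueeze(-1))           # (batch, n, d_embed)
        return self.rho(e.sum(dim=1))           # (batch, 1)

class EquivariantNet(nn.Module):
    """f_j(x) = rho(x_j, sum_i phi(x_i)): permutation equivariant by construction."""
    def __init__(self, d_embed=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, d_embed), nn.ReLU(), nn.Linear(d_embed, d_embed))
        self.rho = nn.Sequential(nn.Linear(d_embed + 1, d_embed), nn.ReLU(), nn.Linear(d_embed, 1))

    def forward(self, x):                       # x: (batch, n)
        pooled = self.phi(x.unsqueeze(-1)).sum(dim=1, keepdim=True)   # (batch, 1, d_embed)
        pooled = pooled.expand(-1, x.shape[1], -1)
        return self.rho(torch.cat([x.unsqueeze(-1), pooled], dim=-1)).squeeze(-1)  # (batch, n)

x, perm = torch.rand(4, 10), torch.randperm(10)
f, g = InvariantNet(), EquivariantNet()
assert torch.allclose(f(x), f(x[:, perm]), atol=1e-5)            # invariance
assert torch.allclose(g(x)[:, perm], g(x[:, perm]), atol=1e-5)   # equivariance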

Proof of the Theorem

We first show that the equivariant case is not more difficult than the invariant case. Assume that we have proved the invariant case. Consider a permutation \(\sigma\in \mathcal S_n\) such that \(\sigma(1)=1\), so that \(f(\sigma \star {\bf x}) = \sigma \star f({\bf x})\) gives for the first component:

\[f_1(x_1,x_{\sigma(2)},\dots, x_{\sigma(n)}) = f_1(x_1,x_2,\dots, x_n).\]

For any \(x_1\), the mapping \((x_2,\dots, x_n) \mapsto f_1(x_1, x_2,\dots, x_n)\) is invariant. Hence by (6), we have

\[f_1(x_1,x_2,\dots, x_n) = \rho\left(x_1, \sum_{i\neq 1}\phi(x_i) \right)\]

Now consider a permutation such that \(\sigma(1)=k\), \(\sigma(k)=1\) and \(\sigma(i)=i\) for \(i\neq 1,k\); then we have

\[f_k(x_1,x_2,\dots, x_n) = f_1(x_k,x_2,\dots, x_1,\dots, x_n),\]

hence \(f_k(x_1,x_2,\dots, x_n)=\rho\left(x_k, \sum_{i\neq k}\phi(x_i) \right)\) and (7) follows.

Hence, we only need to prove (6), and we follow the proof given in Deep Sets. We start with a crucial result stating that a set of \(n\) real points is characterized by the first \(n\) moments of its empirical measure. Let us see what this means for \(n=2\): we can recover the values of \(x_1\) and \(x_2\) from the quantities \(p_1=x_1+x_2\) and \(p_2=x_1^2+x_2^2\). To see that this is correct, note that

\[p_1^2 = x_1^2+2x_1x_2+x_2^2 = p_2+2x_1x_2,\]

so that \(x_1x_2 = \frac{p_1^2-p_2}{2}\). As a result, we have

\[(x-x_1)(x-x_2) = x^2-p_1x+\frac{p_1^2-p_2}{2},\]

and clearly \(x_1\) and \(x_2\) can be recovered as the roots of this polynomial, whose coefficients are functions of \(p_1\) and \(p_2\). The result below extends this argument to a general \(n\):

Proposition

The mapping \(\Phi:[0,1]_{\leq}^n \to \mathbb{R}^{n}\), where \([0,1]_{\leq}^n = \{ {\bf x}\in [0,1]^n,\: x_1\leq x_2\leq \dots\leq x_n\}\), defined by

\[\Phi(x_1,x_2,\dots, x_n) = \left( \sum_i x_i, \sum_i x_i^2,\dots, \sum_i x_i^n\right)\]

is injective and has a continuous inverse mapping.

The proof follows from Newton's identities. For \(k\leq n\), we denote by \(p_k = \sum_{i=1}^n x_i^k\) the power sums and by \(e_k\) the elementary symmetric polynomials (note that all these polynomials are functions of \(x_1,\dots, x_n\)):

\[e_0 = 1\\ e_1 = \sum_i x_i\\ e_2 = \sum_{i < j} x_i x_j\\ \dots\]

From Newton's identities, we have for \(k\leq n\),

\[k e_k = \sum_{i=1}^k (-1)^{i-1}e_{k-i}p_i,\]

so that we can express the elementary symmetric polynomials in terms of the power sums:

\[e_1 = p_1\\ 2e_2 = e_1p_1-p_2=p_1^2-p_2\\ 3e_3 = e_2p_1-e_1p_2+p_3 = \frac{1}{2}p_1^3-\frac{3}{2}p_1p_2+p_3\\ \dots\]

Note that \(\Phi(x_1,x_2,\dots, x_n) = (p_1,\dots, p_n)\), and since

\[\prod_{i=1}^n (x-x_i) = x^n -e_1x^{n-1}+e_2x^{n-2}-\dots + (-1)^n e_n,\]

if \(\Phi({\bf x}) = \Phi({\bf y})\) then \(\prod_{i=1}^n (x-x_i)=\prod_{i=1}^n (x-y_i)\), so that \(\{\{x_1,\dots, x_n\}\} = \{\{y_1,\dots, y_n\}\}\) and \({\bf x}={\bf y} \in [0,1]^n_{\leq}\), showing that \(\Phi\) is injective.

Hence we proved that \(\Phi:[0,1]^n_{\leq} \to \text{Im}(\Phi)\), where \(\text{Im}(\Phi)\) is the image of \(\Phi\), is a bijection. We now need to prove that \(\Phi^{-1}\) is continuous, and we prove it directly. Let \({\bf y}_k \to {\bf y} \in\text{Im}(\Phi)\); we need to show that \(\Phi^{-1}({\bf y}_k) \to \Phi^{-1}({\bf y})\). Now if \(\Phi^{-1}({\bf y}_k) \not\to \Phi^{-1}({\bf y})\), since \([0,1]^n_{\leq}\) is compact, there exists a convergent subsequence of \(\Phi^{-1}({\bf y}_{k})\) with \(\Phi^{-1}({\bf y}_{m_k}) \to {\bf x}\neq \Phi^{-1}({\bf y})\). But by continuity of \(\Phi\), we have \({\bf y}_{m_k} \to \Phi({\bf x}) = {\bf y}\), so we get a contradiction, which proves the continuity of \(\Phi^{-1}\) and finishes the proof of the proposition.

We are now ready to prove (6). Let \(\phi:[0,1] \to \mathbb R^n\) be defined by \(\phi(x) = (x,x^2,\dots, x^n)\) and \(\rho = f\circ \Phi^{-1}\). Note that \(\rho: \text{Im}(\Phi) \to \mathbb R\) and \(\sum_{i}\phi(x_i) = \Phi({\bf x}_{\leq})\), where \({\bf x}_{\leq}\) is the vector \({\bf x}\) with components sorted in non-decreasing order. Hence, as soon as \(f\) is invariant, we have \(f({\bf x}) = f({\bf x}_{\leq})\), so that (6) is valid. We only need to extend the function \(\rho\) from the domain \(\text{Im}(\Phi)\) to \(\mathbb R^n\) in a continuous way. This can be done by considering the projection \(\pi\) onto the compact set \(\text{Im}(\Phi)\) and defining \(\rho({\bf x}) = f\circ \Phi^{-1}(\pi({\bf x}))\).

Follow on twitter!

Thanks for reading!

\ No newline at end of file diff --git a/modules/extras/jupyterlab/index.html b/modules/extras/jupyterlab/index.html index 01b9fc0..7897a0f 100644 --- a/modules/extras/jupyterlab/index.html +++ b/modules/extras/jupyterlab/index.html @@ -67,7 +67,7 @@

Edit this page on - Last modified: November 12, 2023. Website built with Franklin.jl and the Julia programming language. + Last modified: December 13, 2023. Website built with Franklin.jl and the Julia programming language. diff --git a/modules/graph0/index.html b/modules/graph0/index.html index a23d527..9acd7c6 100644 --- a/modules/graph0/index.html +++ b/modules/graph0/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY
\ No newline at end of file + Dataflowr - Deep Learning DIY
\ No newline at end of file diff --git a/modules/graph1/index.html b/modules/graph1/index.html index 45113c9..0f66cbc 100644 --- a/modules/graph1/index.html +++ b/modules/graph1/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module - Deep Learning on graphs (1)

Table of Contents

Node embedding


0:00 Introduction
2:12 Language model
5:04 Skip-gram model
8:44 Hierarchical softmax
11:19 DeepWalk
14:26 Negative sampling
19:10 node2vec
22:28 results on les Misérables
25:10 results for multi-label classification

Slides

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module - Deep Learning on graphs (1)

Table of Contents

Node embedding


0:00 Introduction
2:12 Language model
5:04 Skip-gram model
8:44 Hierarchical softmax
11:19 DeepWalk
14:26 Negative sampling
19:10 node2vec
22:28 results on les Misérables
25:10 results for multi-label classification

Slides

\ No newline at end of file diff --git a/modules/graph2/index.html b/modules/graph2/index.html index 0277c17..2a476e0 100644 --- a/modules/graph2/index.html +++ b/modules/graph2/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module - Deep Learning on graphs (2)

Table of Contents

Signal processing on graphs


0:00 Introduction
1:40 Signal processing on graphs
3:04 Recap on Fourier analysis
5:04 Spectral graph theory
13:44 Graph Fourier analysis
16:38 Filtering
18:33 Filtering on graphs
22:01 Learning a localized kernel
25:03 Chebyshev polynomials
30:28 Convolutional neural networks on graphs

Slides

Notebook

Posts

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module - Deep Learning on graphs (2)

Table of Contents

Signal processing on graphs


0:00 Introduction
1:40 Signal processing on graphs
3:04 Recap on Fourier analysis
5:04 Spectral graph theory
13:44 Graph Fourier analysis
16:38 Filtering
18:33 Filtering on graphs
22:01 Learning a localized kernel
25:03 Chebyshev polynomials
30:28 Convolutional neural networks on graphs

Slides

Notebook

Posts

\ No newline at end of file diff --git a/modules/graph3/index.html b/modules/graph3/index.html index e94cc1a..97dea2d 100644 --- a/modules/graph3/index.html +++ b/modules/graph3/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module - Deep Learning on graphs (3)

Table of Contents

Graph embedding


0:00 Introduction
1:30 Graph embedding
2:43 How to represent graphs?
3:58 Why graph symmetries matter?
8:25 Invariant and equivariant functions
12:30 Message passing GNN
16:02 The many flavors of MGNN
20:00 Separating power
22:51 2-Weisfeiler-Lehman test
26:59 How powerful are MGNN
28:27 Empirical results
29:10 Graphs as higher order tensors
31:45 Invariant and equivariant linear operator
35:47 Invariant linear GNN
38:18 Folklore GNN

Slides

Post

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module - Deep Learning on graphs (3)

Table of Contents

Graph embedding


0:00 Introduction
1:30 Graph embedding
2:43 How to represent graphs?
3:58 Why graph symmetries matter?
8:25 Invariant and equivariant functions
12:30 Message passing GNN
16:02 The many flavors of MGNN
20:00 Separating power
22:51 2-Weisfeiler-Lehman test
26:59 How powerful are MGNN
28:27 Empirical results
29:10 Graphs as higher order tensors
31:45 Invariant and equivariant linear operator
35:47 Invariant linear GNN
38:18 Folklore GNN

Slides

Post

\ No newline at end of file diff --git a/modules/privacy-preserving-ML/index.html b/modules/privacy-preserving-ML/index.html index a2b3480..db00f00 100644 --- a/modules/privacy-preserving-ML/index.html +++ b/modules/privacy-preserving-ML/index.html @@ -1 +1 @@ - Dataflowr - Deep Learning DIY

Module - Privacy Preserving Machine Learning

by Daniel Huynh

Table of Contents

Privacy Preserving Machine Learning


0:00 Presentation
2:50 Context and cloud data threats
5:15 Confidential Computing (CC)
7:12 Intel SGX
8:40 Enclave
12:19 Azure Attestation Service
13:25 Use cases
14:50 Abstraction layers for enclaves
15:57 Open enclave SDK
16:27 Lightweight OS + Demo (Graphene SGX)
23:44 Multi-party machine learning
26:50 Q&A
33:26 Homomorphic Encryption (HE)
37:20 CKKS encoder
41:29 Homomorphic Encryption high-level view
42:24 Homomorphic Encryption in practice
45:17 Demo with TenSEAL
50:25 Demo Homomorphic Random Forests
1:01:38 to go beyond
1:02:28 Secure Multi-Party Computing (MPC)
1:07:58 Conclusion

Slides and code

to go beyond

\ No newline at end of file + Dataflowr - Deep Learning DIY

Module - Privacy Preserving Machine Learning

by Daniel Huynh

Table of Contents

Privacy Preserving Machine Learning


0:00 Presentation
2:50 Context and cloud data threats
5:15 Confidential Computing (CC)
7:12 Intel SGX
8:40 Enclave
12:19 Azure Attestation Service
13:25 Use cases
14:50 Abstraction layers for enclaves
15:57 Open enclave SDK
16:27 Lightweight OS + Demo (Graphene SGX)
23:44 Multi-party machine learning
26:50 Q&A
33:26 Homomorphic Encryption (HE)
37:20 CKKS encoder
41:29 Homomorphic Encryption high-level view
42:24 Homomorphic Encryption in practice
45:17 Demo with TenSEAL
50:25 Demo Homomorphic Random Forests
1:01:38 to go beyond
1:02:28 Secure Multi-Party Computing (MPC)
1:07:58 Conclusion

Slides and code

to go beyond

\ No newline at end of file diff --git a/notebooks_md/01_intro/index.html b/notebooks_md/01_intro/index.html index 33bf371..c327dd2 100644 --- a/notebooks_md/01_intro/index.html +++ b/notebooks_md/01_intro/index.html @@ -344,7 +344,7 @@

Conclusion

diff --git a/notebooks_md/02a_basics/index.html b/notebooks_md/02a_basics/index.html index 12ef7d4..56e9c4d 100644 --- a/notebooks_md/02a_basics/index.html +++ b/notebooks_md/02a_basics/index.html @@ -243,7 +243,7 @@

Edit this page on - Last modified: November 12, 2023. Website built with Franklin.jl and the Julia programming language. + Last modified: December 13, 2023. Website built with Franklin.jl and the Julia programming language. diff --git a/sitemap.xml b/sitemap.xml index eb3cb74..2af0726 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -3,283 +3,283 @@ https://dataflowr.github.io/website/modules/18a-diffusion/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/5-stacking-layers/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/14a-depth/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/graph0/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/graph1/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/notebooks_md/02a_basics/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/extras/jupyterlab/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/15-dropout/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/homework/1-mlp-from-scratch/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/homework/3-VAE/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/2b-automatic-differentiation/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/extras/graph_invariant/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/13-siamese/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/graph2/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/extras/invariant_equivariant/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/3-loss-functions-for-classification/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/extras/Convolutions_first/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/homework/2-CAM-adversarial/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/graph3/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/extras/GCN_inductivebias_spectral/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/12-intro-julia/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/notebooks_md/01_intro/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/8c-word2vec/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/9c-flows/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/1-intro-general-overview/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/privacy-preserving-ML/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/0-sotfware-installation/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/0-julia-setup/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/8b-collaborative-filtering/index.html - 2023-11-12 + 2023-12-13 monthly 
0.5 https://dataflowr.github.io/website/modules/6-convolutional-neural-network/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/11a-recurrent-neural-networks-theory/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/4-optimization-for-deep-learning/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/10-generative-adversarial-networks/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/17-resnets/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/14b-depth/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/9b-unet/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/9a-autoencoders/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/2c-jax/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/11c-batches-with-sequences/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/16-batchnorm/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/8a-embedding-layers/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/7-dataloading/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/19-clip/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/12-attention/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/2a-pytorch-tensors/index.html - 2023-11-12 + 2023-12-13 monthly 0.5 https://dataflowr.github.io/website/modules/11b-recurrent-neural-networks-practice/index.html - 2023-11-12 + 2023-12-13 monthly 0.5