Organizing
JUSTSUJAY committed Aug 29, 2024
1 parent 5c708b0 commit d285904
Showing 69 changed files with 207 additions and 153 deletions.
20 changes: 10 additions & 10 deletions 01_Tokenization.ipynb → Notebooks/01_Tokenization.ipynb
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src=\"https://github.com/JUSTSUJAY/NLP_One_Shot/assets/1/lang-pic.jpg\" width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -26,8 +26,8 @@
"\n",
"NLP series:\n",
"\n",
"1. [NLP1 - Tokenization](https://www.kaggle.com/samuelcortinhas/nlp1-tokenization) (this one)\n",
"2. [NLP2 - Pre-processing](https://www.kaggle.com/samuelcortinhas/nlp2-pre-processing)\n",
"1. [NLP1 - Tokenization]() (this one)\n",
"2. [NLP2 - Pre-processing](https://github.com/JUSTSUJAY/NLP_One_Shot/Notebooks/02_Pre_Processing.ipynb)\n",
"3. [NLP3 - Bag-of-Words and Similarity](https://www.kaggle.com/samuelcortinhas/nlp3-bag-of-words-and-similarity)\n",
"4. [NLP4 - TF-IDF and Document Search](https://www.kaggle.com/samuelcortinhas/nlp4-tf-idf-and-document-search)\n",
"5. [NLP5 - Text Classification with Naive Bayes](https://www.kaggle.com/samuelcortinhas/nlp5-text-classification-with-naive-bayes)\n",
@@ -63,7 +63,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">2.1 What is NLP?</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/Z5qHwxFt/hello-banner.png' width=600>\n",
"<img src='..\\assets\\1\\hello-banner.png' width=600>\n",
"</center>\n",
"<br>\n",
"<br>\n",
@@ -89,7 +89,7 @@
"source": [
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">2.2 Common tasks</p>\n",
"\n",
"<img src='https://i.postimg.cc/c4F21HFW/namedentity.png' width=250 style=\"float: right;\">\n",
"<img src='..\\assets\\1\\namedentity.png' width=250 style=\"float: right;\">\n",
"\n",
"Some of the **easier** NLP tasks include:\n",
"* Spell checking\n",
@@ -99,7 +99,7 @@
"<br>\n",
"<br>\n",
"\n",
"<img src='https://i.postimg.cc/HWhbp1bK/wall-5.jpg' width=250 style=\"float: right;\">\n",
"<img src='..\\assets\\1\\wall-5.jpg' width=250 style=\"float: right;\">\n",
"\n",
"**Medium** difficulty NLP tasks include:\n",
"* Sentiment analysis\n",
@@ -109,7 +109,7 @@
"<br>\n",
"<br>\n",
"\n",
"<img src='https://i.postimg.cc/g0hYQ53f/chatbot2.png' width=250 style=\"float: right;\">\n",
"<img src='..\\assets\\1\\chatbot2.png' width=250 style=\"float: right;\">\n",
"\n",
"**Hard** NLP tasks include:\n",
"* Text summarisation\n",
@@ -138,7 +138,7 @@
"There are so many **layers** to how humans interact (which we are still trying to fully understand ourselves) that there should be no surprise at how incredibly difficult it is to **teach computers** to do the same. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/7607XmBV/yaint-meme.png' width=300>\n",
"<img src='..\\assets\\1\\yaint-meme.png' width=300>\n",
"</center>"
]
},
@@ -194,7 +194,7 @@
"**Tokenization** is the process of breaking down a corpus into **tokens**. The procedure might look like **segmenting** a piece of text into sentences and then further segmenting these sentences into individual **words, numbers and punctuation**, which would be tokens. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/ydcbhtkj/tokenization2.jpg' width=600>\n",
"<img src='..\\assets\\1\\tokenization2.jpg' width=600>\n",
"</center>\n",
"\n",
"Each token should be chosen to be as **small as possible** while still carrying carrying **meaning on its own**. For example, `\"£10\"` can be split into the two tokens `\"£\"` and `\"10\"` as each one possess its own meaning. \n",
@@ -219,7 +219,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">3.2 Tokenization using spaCy</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/ryZxP111/spacy.png' width=250>\n",
"<img src='..\\assets\\1\\spacy.png' width=250>\n",
"</center>\n",
"<br>\n",
"\n",
12 changes: 6 additions & 6 deletions 02_Pre_Processing.ipynb → Notebooks/02_Pre_Processing.ipynb
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -65,7 +65,7 @@
"This is the act of converting every token to be uniformly **lower case** or **upper case**. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/JhQRD6Jk/case-folding.jpg' width=600>\n",
"<img src='..\\assets\\2\\case-folding.jpg' width=600>\n",
"</center>\n",
" \n",
"This can be beneficial because it will **reduce the number of unique tokens** in a corpus, i,e. the size of the **vocabulary**, hence make the processing of these tokens more memory and computational effecient. The downside however is **information loss**. \n",
@@ -243,7 +243,7 @@
"Stop words are words that **appear commonly** but **carry little information**. Examples include, `\"a\"`, `\"the\"`, `\"of\"`, `\"an\"`, `\"this\"`,`\"that\"`. Similar to case folding, removing stop words can **improve efficiency** but comes at the cost of **losing contextual information**. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/B6XY2bkG/stop-word-removal.jpg' width=600>\n",
"<img src='..\\assets\\2\\stop-word-removal.jpg' width=600>\n",
"</center>\n",
"\n",
"The choice of whether to use stop word removal will depend on the task being performed. For some tasks like **topic modelling** (identifying topics in text), contextual information is not as **important** compared to a task like **sentiment analysis** where the stop word `\"not\"` can change the sentiment completely. \n",
@@ -425,7 +425,7 @@
"While this is similar to stemming, it also takes into account things like **tenses** and **synonyms**. For example, the words `\"did\"`, `\"done\"` and `\"doing\"` would be converted to the base form `\"do\"`.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/0NwRqt5S/lemmatization.jpg' width=600>\n",
"<img src='..\\assets\\2\\lemmatization.jpg' width=600>\n",
"</center>\n",
" \n",
"It also takes into account whether a word is a **noun**, **verb** or **adjective** on deciding whether to lemmatize. For example, it might not modify some adjectives so not to change their meaning. (`\"energetic\"` is different to `\"energy\"`).\n",
@@ -532,7 +532,7 @@
"Part-of-speech tagging is the method of **classifying how a word is used in a sentence**, for example, **noun, verb, adjective**.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/Jh5wnsJ7/pos-tagging.jpg' width=600>\n",
"<img src='..\\assets\\2\\pos-tagging.jpg' width=600>\n",
"</center>\n",
"\n",
"This is very helpful because it can help us understand the **intent or action** of an ambiguous word. For example, when we say `\"Hand me a hammer.\"`, the word `\"hand\"` is a **verb** (doing word) as opposed to `\"The hammer is in my hand.\"` where it is a **noun** (thing) and has a different meaning. "
@@ -662,7 +662,7 @@
"<br>\n",
"<br>\n",
"<center>\n",
"<img src='https://i.postimg.cc/VNbYfqBx/ner.png' width=600>\n",
"<img src='..\\assets\\2\\ner.png' width=600>\n",
"</center>\n",
"<br>\n",
"<br>\n",
12 changes: 6 additions & 6 deletions 03_BOW_Similarity.ipynb → Notebooks/03_BOW_Similarity.ipynb
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -67,7 +67,7 @@
"The **simplest** approach to overcome this is by using a **bag-of-words**, which simply **counts how many times each word appears** in a document. It's called a **bag** because the **order of the words is ignored** - we only care about whether a word appeared or not. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/pLvhy7zs/basicbow.png' width=600>\n",
"<img src='..\\assets\\3\\basicbow.png' width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -95,7 +95,7 @@
"For now, we'll focus on the **binary** version of a bag-of-words. This just indicates **whether a word appeared or not**, ignoring word order and word frequency.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/LsBRs9js/binarybow.jpg' width=600>\n",
"<img src='..\\assets\\3\\binarybow.jpg' width=600>\n",
"</center>\n",
"\n",
"Each **row** in a **binary bag-of-words matrix** corresponds to a **single document** in the corpus. Each **column** corresponds to a **token** in the vocabulary. Note that the order of the tokens isn't important but it does need to be **fixed beforehand** when building the vocabulary. \n",
@@ -126,7 +126,7 @@
"We have gone from thinking of documents as a sequence of words to **points in a multi-dimensional vector space**. Importantly, the dimension of this space if **fixed**, i.e. each vector has the same length.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/L5nfB0Vr/unitcube.png' width=400>\n",
"<img src='..\\assets\\3\\unitcube.png' width=400>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -162,7 +162,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src='https://i.postimg.cc/DzSnKShk/cosine-similarity-vectors-original.jpg' width=800>\n",
"<img src='..\\assets\\3\\cosine-similarity-vectors-original.jpg' width=800>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -216,7 +216,7 @@
"A 2-gram (aka **bigram**) would have 2 tokens per chunk, a 3-gram (aka **trigram**) would have 3 tokens per chuck, etc. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/ZncZPgp4/8ARA1.png' width=600>\n",
"<img src='..\\assets\\3\\8ARA1.png' width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -63,7 +63,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">2.1 Shortcomings of Bag-of-Words</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/1zw5pqP1/wordcloud.jpg' width=500>\n",
"<img src='..\\assets\\4\\wordcloud.jpg' width=500>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -123,7 +123,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">3.1 Term Frequency</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/xCL5Dt2p/wren.jpg' width=600>\n",
"<img src='..\\assets\\4\\wren.jpg' width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -225,7 +225,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">4.1 TF-IDF with sklearn</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/9Fc6bN5f/moon.webp' width=450>\n",
"<img src='..\\assets\\4\\moon.jpg' width=450>\n",
"</center>\n",
"<br>\n",
"\n",

Large diffs are not rendered by default.

Large diffs are not rendered by default.

38 changes: 22 additions & 16 deletions 07_Word_Embeddings.ipynb → Notebooks/07_Word_Embeddings.ipynb

Large diffs are not rendered by default.

26 changes: 16 additions & 10 deletions 08_RNNs_LMs.ipynb → Notebooks/08_RNNs_LMs.ipynb

Large diffs are not rendered by default.

@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -66,7 +66,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/T3XjC3Mw/rnn-setups.jpg\" width=600>\n",
"<img src=\"..\\assets\\9\\rnn-setups.jpg\" width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -88,7 +88,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/8Pk5M39D/seq2seq.png\" width=600>\n",
"<img src=\"..\\assets\\9\\seq2seq.png\" width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -113,7 +113,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/vZjMc8pg/beamsearch.png\" width=600>\n",
"<img src=\"..\\assets\\9\\beamsearch.png\" width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -122,10 +122,12 @@
"In practice, we **normalize** the log probabilities by the number of words so beam search doesn't have a bias for shorter translations. The formula then is\n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"\\text{score}(w_1,...,w_T) = \\frac{1}{T} \\sum_{i=1}^{T} \\log \\mathbb{P}(w_i | w_1, ..., w_{i-1}, x)\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"This allows us to explore **more combinations** and return better translations than greedy search. But this comes at the cost of being more computationally expensive. The larger k is, the more combinations we consider but the longer it takes at **inference time**, so there is a **trade-off**. \n",
@@ -135,21 +137,22 @@
"BLEU stands for **Bilingual Evaluation Understudy** and is a way to evaluate how good a translation is. Interestingly it comes from a [paper](https://aclanthology.org/P02-1040.pdf) from 2002, way before computers became good at translation. \n",
"\n",
"In essence, it measures **similarity** by counting **n-gram overlap** between the prediction and a reference translation. The formula is given by \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"\\text{BLEU} = \\min\\left(1, \\exp \\left(1-\\frac{\\text{reference_len}}{\\text{cadidate_len}}\\right)\\right)\\left(\\prod_{i=1}^{4} \\text{precision}_i \\right)^{\\frac{1}{4}}\n",
"\\text{BLEU} = \\min\\left(1, \\exp \\left(1-\\frac{\\text{reference\\_len}}{\\text{candidate\\_len}}\\right)\\right)\\left(\\prod_{i=1}^{4} \\text{precision}_i \\right)^{\\frac{1}{4}}\n",
"\n",
"$$\n",
"<br>\n",
"\n",
"<br>\n",
"where\n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\large\n",
"\\text{precision}_i = \\frac{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} \\min(n^{i}_{\\text{cand_match}},n^{i}_{\\text{ref}})}{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} n^{i}_{\\text{cand}}}\n",
"\\text{precision}_i = \\frac{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} \\min\\left(n^{i}_{\\text{cand\\_match}}, n^{i}_{\\text{ref}}\\right)}{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} n^{i}_{\\text{cand}}}\n",
"\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"The first term in the BLEU formula is a **brevity penalty**, which penalizes candidate translations that are short relative to the reference translation. The second term measures the **precision** of 1-gram, 2-grams, 3-grams and 4-grams by counting how many times they **overlaps** in the candidate and reference translations. The min is there to prevent predictions filled with the same word from getting high scores. \n",
Expand All @@ -162,7 +165,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/wMDbVyY1/bottleneck.png\" width=500>\n",
"<img src=\"..\\assets\\9\\bottleneck.png\" width=500>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -191,7 +194,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/2jHChSf8/attention.jpg\" width=450>\n",
"<img src=\"..\\assets\\9\\attention.jpg\" width=450>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -202,10 +205,12 @@
"The attention weights are used to calcualte a **context vector** via a weighted sum of the weights with the hidden states. That is,\n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"c_j = \\sum_{i} \\alpha_{i,j} h_{i}\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"At the $j$-th output time step, we **concatenate** the $j$-th context vector with the **embedding** of the $j$-th input token to the decoder. Recall that this decoder input will be determined by the previous time step. The concatenated vector is then fed through the decoder **without any other changes**. \n",
@@ -217,10 +222,12 @@
"The function is calcualted by multiplying the input vectors with matrices and **adding** these together. This is passed through a tanh activation function and then reduced to a scalar. That is, \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"f(h_i, s_j) = v^{T}\\tanh(W h_i+U s_j)\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"This is simply a **feed-forward neural network**, which gets trained via backpropagation. So by adding more **learnable parameters** we allow the model to learn which hidden encoder states are more important given the current decoder hidden state. \n",
Expand All @@ -230,19 +237,23 @@
"An alternative to the additive attention scoring function is **Multiplicative (aka Luong) Attention**. This is simply **multiplies** the input vectors together via the dot product. \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"f(h_i, s_j) = s_j^{T} h_i\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"Some implementations use an **additional matrix** of learnable weights in the multiplication. \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"f(h_i, s_j) = s_j^{T} W h_i\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"This is usually prefered to additive attention because it is much **faster** to implement and reaches a **similar performance**. \n",