Organizing
JUSTSUJAY committed Aug 29, 2024
1 parent 5c708b0 commit d285904
Showing 69 changed files with 207 additions and 153 deletions.
20 changes: 10 additions & 10 deletions 01_Tokenization.ipynb → Notebooks/01_Tokenization.ipynb
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src=\"https://github.com/JUSTSUJAY/NLP_One_Shot/assets/1/lang-pic.jpg\" width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -26,8 +26,8 @@
"\n",
"NLP series:\n",
"\n",
"1. [NLP1 - Tokenization](https://www.kaggle.com/samuelcortinhas/nlp1-tokenization) (this one)\n",
"2. [NLP2 - Pre-processing](https://www.kaggle.com/samuelcortinhas/nlp2-pre-processing)\n",
"1. [NLP1 - Tokenization]() (this one)\n",
"2. [NLP2 - Pre-processing](https://github.com/JUSTSUJAY/NLP_One_Shot/Notebooks/02_Pre_Processing.ipynb)\n",
"3. [NLP3 - Bag-of-Words and Similarity](https://www.kaggle.com/samuelcortinhas/nlp3-bag-of-words-and-similarity)\n",
"4. [NLP4 - TF-IDF and Document Search](https://www.kaggle.com/samuelcortinhas/nlp4-tf-idf-and-document-search)\n",
"5. [NLP5 - Text Classification with Naive Bayes](https://www.kaggle.com/samuelcortinhas/nlp5-text-classification-with-naive-bayes)\n",
@@ -63,7 +63,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">2.1 What is NLP?</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/Z5qHwxFt/hello-banner.png' width=600>\n",
"<img src='..\\assets\\1\\hello-banner.png' width=600>\n",
"</center>\n",
"<br>\n",
"<br>\n",
@@ -89,7 +89,7 @@
"source": [
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">2.2 Common tasks</p>\n",
"\n",
"<img src='https://i.postimg.cc/c4F21HFW/namedentity.png' width=250 style=\"float: right;\">\n",
"<img src='..\\assets\\1\\namedentity.png' width=250 style=\"float: right;\">\n",
"\n",
"Some of the **easier** NLP tasks include:\n",
"* Spell checking\n",
@@ -99,7 +99,7 @@
"<br>\n",
"<br>\n",
"\n",
"<img src='https://i.postimg.cc/HWhbp1bK/wall-5.jpg' width=250 style=\"float: right;\">\n",
"<img src='..\\assets\\1\\wall-5.jpg' width=250 style=\"float: right;\">\n",
"\n",
"**Medium** difficulty NLP tasks include:\n",
"* Sentiment analysis\n",
@@ -109,7 +109,7 @@
"<br>\n",
"<br>\n",
"\n",
"<img src='https://i.postimg.cc/g0hYQ53f/chatbot2.png' width=250 style=\"float: right;\">\n",
"<img src='..\\assets\\1\\chatbot2.png' width=250 style=\"float: right;\">\n",
"\n",
"**Hard** NLP tasks include:\n",
"* Text summarisation\n",
@@ -138,7 +138,7 @@
"There are so many **layers** to how humans interact (which we are still trying to fully understand ourselves) that there should be no surprise at how incredibly difficult it is to **teach computers** to do the same. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/7607XmBV/yaint-meme.png' width=300>\n",
"<img src='..\\assets\\1\\yaint-meme.png' width=300>\n",
"</center>"
]
},
@@ -194,7 +194,7 @@
"**Tokenization** is the process of breaking down a corpus into **tokens**. The procedure might look like **segmenting** a piece of text into sentences and then further segmenting these sentences into individual **words, numbers and punctuation**, which would be tokens. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/ydcbhtkj/tokenization2.jpg' width=600>\n",
"<img src='..\\assets\\1\\tokenization2.jpg' width=600>\n",
"</center>\n",
"\n",
"Each token should be chosen to be as **small as possible** while still carrying carrying **meaning on its own**. For example, `\"£10\"` can be split into the two tokens `\"£\"` and `\"10\"` as each one possess its own meaning. \n",
@@ -219,7 +219,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">3.2 Tokenization using spaCy</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/ryZxP111/spacy.png' width=250>\n",
"<img src='..\\assets\\1\\spacy.png' width=250>\n",
"</center>\n",
"<br>\n",
"\n",
12 changes: 6 additions & 6 deletions 02_Pre_Processing.ipynb → Notebooks/02_Pre_Processing.ipynb
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -65,7 +65,7 @@
"This is the act of converting every token to be uniformly **lower case** or **upper case**. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/JhQRD6Jk/case-folding.jpg' width=600>\n",
"<img src='..\\assets\\2\\case-folding.jpg' width=600>\n",
"</center>\n",
" \n",
"This can be beneficial because it will **reduce the number of unique tokens** in a corpus, i,e. the size of the **vocabulary**, hence make the processing of these tokens more memory and computational effecient. The downside however is **information loss**. \n",
@@ -243,7 +243,7 @@
"Stop words are words that **appear commonly** but **carry little information**. Examples include, `\"a\"`, `\"the\"`, `\"of\"`, `\"an\"`, `\"this\"`,`\"that\"`. Similar to case folding, removing stop words can **improve efficiency** but comes at the cost of **losing contextual information**. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/B6XY2bkG/stop-word-removal.jpg' width=600>\n",
"<img src='..\\assets\\2\\stop-word-removal.jpg' width=600>\n",
"</center>\n",
"\n",
"The choice of whether to use stop word removal will depend on the task being performed. For some tasks like **topic modelling** (identifying topics in text), contextual information is not as **important** compared to a task like **sentiment analysis** where the stop word `\"not\"` can change the sentiment completely. \n",
@@ -425,7 +425,7 @@
"While this is similar to stemming, it also takes into account things like **tenses** and **synonyms**. For example, the words `\"did\"`, `\"done\"` and `\"doing\"` would be converted to the base form `\"do\"`.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/0NwRqt5S/lemmatization.jpg' width=600>\n",
"<img src='..\\assets\\2\\lemmatization.jpg' width=600>\n",
"</center>\n",
" \n",
"It also takes into account whether a word is a **noun**, **verb** or **adjective** on deciding whether to lemmatize. For example, it might not modify some adjectives so not to change their meaning. (`\"energetic\"` is different to `\"energy\"`).\n",
@@ -532,7 +532,7 @@
"Part-of-speech tagging is the method of **classifying how a word is used in a sentence**, for example, **noun, verb, adjective**.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/Jh5wnsJ7/pos-tagging.jpg' width=600>\n",
"<img src='..\\assets\\2\\pos-tagging.jpg' width=600>\n",
"</center>\n",
"\n",
"This is very helpful because it can help us understand the **intent or action** of an ambiguous word. For example, when we say `\"Hand me a hammer.\"`, the word `\"hand\"` is a **verb** (doing word) as opposed to `\"The hammer is in my hand.\"` where it is a **noun** (thing) and has a different meaning. "
@@ -662,7 +662,7 @@
"<br>\n",
"<br>\n",
"<center>\n",
"<img src='https://i.postimg.cc/VNbYfqBx/ner.png' width=600>\n",
"<img src='..\\assets\\2\\ner.png' width=600>\n",
"</center>\n",
"<br>\n",
"<br>\n",
12 changes: 6 additions & 6 deletions 03_BOW_Similarity.ipynb → Notebooks/03_BOW_Similarity.ipynb
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -67,7 +67,7 @@
"The **simplest** approach to overcome this is by using a **bag-of-words**, which simply **counts how many times each word appears** in a document. It's called a **bag** because the **order of the words is ignored** - we only care about whether a word appeared or not. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/pLvhy7zs/basicbow.png' width=600>\n",
"<img src='..\\assets\\3\\basicbow.png' width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -95,7 +95,7 @@
"For now, we'll focus on the **binary** version of a bag-of-words. This just indicates **whether a word appeared or not**, ignoring word order and word frequency.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/LsBRs9js/binarybow.jpg' width=600>\n",
"<img src='..\\assets\\3\\binarybow.jpg' width=600>\n",
"</center>\n",
"\n",
"Each **row** in a **binary bag-of-words matrix** corresponds to a **single document** in the corpus. Each **column** corresponds to a **token** in the vocabulary. Note that the order of the tokens isn't important but it does need to be **fixed beforehand** when building the vocabulary. \n",
@@ -126,7 +126,7 @@
"We have gone from thinking of documents as a sequence of words to **points in a multi-dimensional vector space**. Importantly, the dimension of this space if **fixed**, i.e. each vector has the same length.\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/L5nfB0Vr/unitcube.png' width=400>\n",
"<img src='..\\assets\\3\\unitcube.png' width=400>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -162,7 +162,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src='https://i.postimg.cc/DzSnKShk/cosine-similarity-vectors-original.jpg' width=800>\n",
"<img src='..\\assets\\3\\cosine-similarity-vectors-original.jpg' width=800>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -216,7 +216,7 @@
"A 2-gram (aka **bigram**) would have 2 tokens per chunk, a 3-gram (aka **trigram**) would have 3 tokens per chuck, etc. \n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/ZncZPgp4/8ARA1.png' width=600>\n",
"<img src='..\\assets\\3\\8ARA1.png' width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -63,7 +63,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">2.1 Shortcomings of Bag-of-Words</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/1zw5pqP1/wordcloud.jpg' width=500>\n",
"<img src='..\\assets\\4\\wordcloud.jpg' width=500>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -123,7 +123,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">3.1 Term Frequency</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/xCL5Dt2p/wren.jpg' width=600>\n",
"<img src='..\\assets\\4\\wren.jpg' width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -225,7 +225,7 @@
"## <p style=\"font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;\">4.1 TF-IDF with sklearn</p>\n",
"\n",
"<center>\n",
"<img src='https://i.postimg.cc/9Fc6bN5f/moon.webp' width=450>\n",
"<img src='..\\assets\\4\\moon.jpg' width=450>\n",
"</center>\n",
"<br>\n",
"\n",

Large diffs are not rendered by default.

Large diffs are not rendered by default.

38 changes: 22 additions & 16 deletions 07_Word_Embeddings.ipynb → Notebooks/07_Word_Embeddings.ipynb

Large diffs are not rendered by default.

26 changes: 16 additions & 10 deletions 08_RNNs_LMs.ipynb → Notebooks/08_RNNs_LMs.ipynb

Large diffs are not rendered by default.

@@ -15,7 +15,7 @@
},
"source": [
"<center>\n",
"<img src='https://i.postimg.cc/63rDLhtJ/lang-pic.jpg' width=600>\n",
"<img src='..\\assets\\1\\lang-pic.jpg' width=600>\n",
"</center>\n",
" \n",
"# 1. Introduction\n",
@@ -66,7 +66,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/T3XjC3Mw/rnn-setups.jpg\" width=600>\n",
"<img src=\"..\\assets\\9\\rnn-setups.jpg\" width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -88,7 +88,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/8Pk5M39D/seq2seq.png\" width=600>\n",
"<img src=\"..\\assets\\9\\seq2seq.png\" width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -113,7 +113,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/vZjMc8pg/beamsearch.png\" width=600>\n",
"<img src=\"..\\assets\\9\\beamsearch.png\" width=600>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -122,10 +122,12 @@
"In practice, we **normalize** the log probabilities by the number of words so beam search doesn't have a bias for shorter translations. The formula then is\n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"\\text{score}(w_1,...,w_T) = \\frac{1}{T} \\sum_{i=1}^{T} \\log \\mathbb{P}(w_i | w_1, ..., w_{i-1}, x)\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"This allows us to explore **more combinations** and return better translations than greedy search. But this comes at the cost of being more computationally expensive. The larger k is, the more combinations we consider but the longer it takes at **inference time**, so there is a **trade-off**. \n",
@@ -135,21 +137,22 @@
"BLEU stands for **Bilingual Evaluation Understudy** and is a way to evaluate how good a translation is. Interestingly it comes from a [paper](https://aclanthology.org/P02-1040.pdf) from 2002, way before computers became good at translation. \n",
"\n",
"In essence, it measures **similarity** by counting **n-gram overlap** between the prediction and a reference translation. The formula is given by \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"\\text{BLEU} = \\min\\left(1, \\exp \\left(1-\\frac{\\text{reference_len}}{\\text{cadidate_len}}\\right)\\right)\\left(\\prod_{i=1}^{4} \\text{precision}_i \\right)^{\\frac{1}{4}}\n",
"\\text{BLEU} = \\min\\left(1, \\exp \\left(1-\\frac{\\text{reference\\_len}}{\\text{candidate\\_len}}\\right)\\right)\\left(\\prod_{i=1}^{4} \\text{precision}_i \\right)^{\\frac{1}{4}}\n",
"\n",
"$$\n",
"<br>\n",
"\n",
"<br>\n",
"where\n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\large\n",
"\\text{precision}_i = \\frac{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} \\min(n^{i}_{\\text{cand_match}},n^{i}_{\\text{ref}})}{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} n^{i}_{\\text{cand}}}\n",
"\\text{precision}_i = \\frac{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} \\min\\left(n^{i}_{\\text{cand\\_match}}, n^{i}_{\\text{ref}}\\right)}{\\sum_{C \\in \\text{Candidates}} \\sum_{i \\in C} n^{i}_{\\text{cand}}}\n",
"\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"The first term in the BLEU formula is a **brevity penalty**, which penalizes candidate translations that are short relative to the reference translation. The second term measures the **precision** of 1-gram, 2-grams, 3-grams and 4-grams by counting how many times they **overlaps** in the candidate and reference translations. The min is there to prevent predictions filled with the same word from getting high scores. \n",
Expand All @@ -162,7 +165,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/wMDbVyY1/bottleneck.png\" width=500>\n",
"<img src=\"..\\assets\\9\\bottleneck.png\" width=500>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -191,7 +194,7 @@
"\n",
"<br>\n",
"<center>\n",
"<img src=\"https://i.postimg.cc/2jHChSf8/attention.jpg\" width=450>\n",
"<img src=\"..\\assets\\9\\attention.jpg\" width=450>\n",
"</center>\n",
"<br>\n",
"\n",
@@ -202,10 +205,12 @@
"The attention weights are used to calcualte a **context vector** via a weighted sum of the weights with the hidden states. That is,\n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"c_j = \\sum_{i} \\alpha_{i,j} h_{i}\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"At the $j$-th output time step, we **concatenate** the $j$-th context vector with the **embedding** of the $j$-th input token to the decoder. Recall that this decoder input will be determined by the previous time step. The concatenated vector is then fed through the decoder **without any other changes**. \n",
@@ -217,10 +222,12 @@
"The function is calcualted by multiplying the input vectors with matrices and **adding** these together. This is passed through a tanh activation function and then reduced to a scalar. That is, \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"f(h_i, s_j) = v^{T}\\tanh(W h_i+U s_j)\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"This is simply a **feed-forward neural network**, which gets trained via backpropagation. So by adding more **learnable parameters** we allow the model to learn which hidden encoder states are more important given the current decoder hidden state. \n",
Expand All @@ -230,19 +237,23 @@
"An alternative to the additive attention scoring function is **Multiplicative (aka Luong) Attention**. This is simply **multiplies** the input vectors together via the dot product. \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"f(h_i, s_j) = s_j^{T} h_i\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"Some implementations use an **additional matrix** of learnable weights in the multiplication. \n",
"\n",
"<br>\n",
"\n",
"$$\n",
"\\Large\n",
"f(h_i, s_j) = s_j^{T} W h_i\n",
"$$\n",
"\n",
"<br>\n",
"\n",
"This is usually prefered to additive attention because it is much **faster** to implement and reaches a **similar performance**. \n",