Welcome to the "Natural Language Processing (NLP) Deep Dive on Data and Tokenizer" project. The code name is "d3tz." It is a hackable, step-by-step Jupyter Notebook for investigating the data, separating it into a "training set" and a "validation set," and tokenizing it. Furthermore, "d3tz" will reveal the data biases, both intentional and adventitious, and how they affect the NLP model's accuracy.
The big bonus is a deep understanding of the input-data that will enable you to use the hyper-parameters effectively. For example, should you increase or decrease the "learning (fit) rate," the "dropout" rate, the "momentum" rate, or the discriminative learning-rate "percentage"?
I classify "d3tz" as a "sandbox" or "toy" project. In other words, it is a fun, experimental project focusing on solving one problem.
When researching or learning about Artificial Neural Networks (ANNs), we too often focus on the "train" cycle, base architecture selection, hyper-parameters, and the loss-error metric. Those factors are fundamental to NLP. However, the input-data also affects the NLP model's accuracy substantially.
If you are new to ANNs and do not yet understand the full cycle, from development to deployment, please read the "Norwegian Blue Parrot, The k2fa AI" article on LinkedIn.
As an AI scientist, I choose to be an expert in input-data: data analysis, data visualization, labeling, cleaning, augmentation, separating the data into "training" and "validation" sets, and, foremost, revealing and clarifying intentional and adventitious biases.
I often spend weeks, or months, on the input-data before writing the first line of Python code for the "training" cycle. The reason is that after thoroughly reviewing and visualizing the input-data, I can predict the ANN's accuracy against the business objectives beforehand.
After a handful of successful enterprise AI projects and dozens of AI projects for courses and Kaggle competitions, I have gravitated toward input-data biases. For an enterprise AI project, "success or failure" depends not on how well we trained the ANN model, but on whether the project achieved the business objective.
For example, I did a national bank project, and the ANN model achieved a 96.38% accuracy, a 3.62% loss. By the contract agreement, the ANN model surpassed the set goal. However, after the bank deployed the project, it was deemed a failure because, in large part, the real-world customers' input-data was substantially different from the given input-data.
Looking back, I should have been more forceful in verbalizing the intentional biases that I had observed in the input-data. At that time, my employer said that I had successfully delivered an ANN model that exceeded the contractual client requirement, so "don't rock the boat."
The "d3tz" notebook is a deep dive into NLP, but the lessons learned apply equally to image classification, image segmentation, or any other ANN project.
-
For this journey, the companion's name is "Henna." It is because she looks like a "hen" with a "na" following it.
-
Typically a dog name is chosen, e.g., "Lefty," "Zeta," or "Gamma," but be warned, don't name it after your cat because a "cat" will not follow any commands.
-
Naturally, as with other sandbox projects, you should hack the notebook and change the name to your preference.
-
As a good little programmer, Henna starts by creating an object or class.
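As a rough illustration of that first step, here is a minimal sketch of a class and the object created from it. The class name "Companion," the method "say_hello," and the attribute names are hypothetical placeholders, not the notebook's actual code.

```python
class Companion(object):
    """Hypothetical container that later notebook cells could extend with methods."""

    def __init__(self, name="Henna"):
        # The companion's name; hack it to your preference.
        self.name = name

    def say_hello(self):
        # Return (rather than print) the greeting so it is easy to check.
        return "Hello from " + self.name


# Create the object that the rest of the journey would use.
henna = Companion(name="Henna")
greeting = henna.say_hello()
```

Renaming the companion is then a one-line change where the object is created.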
-
Henna uses the "river" coding style for all "sandbox" projects.
-
The style spells out the full library name, then the sub-module or sub-class, followed by the function name. Jupyter Notebook has auto-complete, so Henna can use the long names without misspelling them. The bonus is that when Henna types a dot (".") after an object, the list of possible methods and variables is displayed.
-
Henna does NOT import into the global namespace (as in "from numpy import *") or use a shortened name (as in "import matplotlib.pyplot as plt"). Instead, she uses the full "river" name, as in "numpy.random.randint()."
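A small sketch of what the "river" style could look like in practice follows. To stay self-contained it uses only the Python standard library; the same pattern applies to a call such as "numpy.random.randint()." The file name, seed, and sample sentence are made-up values for illustration.

```python
# "River" style: import the library itself, then spell out the full
# dotted path (library, sub-module or class, function) at each call.
import collections
import os.path
import random

# Avoided alternatives, shown only as comments:
#   from os.path import *          (wildcard import into the global namespace)
#   import collections as col      (shortened alias)

# Hypothetical data location, for illustration only.
data_path = os.path.join("data", "reviews.csv")
file_name = os.path.basename(data_path)

# A seeded generator keeps the sampled index repeatable between runs.
random_generator = random.Random(42)
sample_row_index = random_generator.randint(0, 99)

# Count word frequencies in a made-up sentence.
word_count = collections.Counter("the hen and the egg".split())
```

Every call site shows where the function comes from, which is the property that makes the code easier to read and translate.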
-
In addition, she shies away from using Python language-specific syntax shorthand, such as the "assigned if."
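The sketch below assumes that the "assigned if" refers to Python's conditional expression (value = a if condition else b); that reading is an assumption, not something the notebook states. The function name and the threshold are hypothetical.

```python
def label_review_length(word_count, long_threshold=200):
    """Label a text as "long" or "short" using an explicit if/else block."""
    # Shorthand form (assumed meaning of the "assigned if"):
    #   label = "long" if word_count > long_threshold else "short"
    # Explicit form, which maps directly onto Swift or JavaScript:
    if word_count > long_threshold:
        label = "long"
    else:
        label = "short"
    return label


long_label = label_review_length(250)
short_label = label_review_length(10)
```

Both forms behave the same; the explicit block is simply longer and closer to what other languages offer.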
-
The primary reason for using the "river" coding style coupled with a descriptive naming convention is that the code is easier to read, hack, and translate to Swift or JavaScript.
-
Henna is on an exploration journey. Therefore, she will consider code compaction and code optimization later, when she refactors the code into a Python project using the Atom IDE.