# Text Classification using Pre-trained BERT Vectors

The program performs text classification over the following 10 language classes using pre-trained BERT vectors (a feature-extraction sketch follows the list):

  1. Arabic
  2. Cantonese
  3. Japanese
  4. Korean
  5. Mandarin
  6. Polish
  7. Russian
  8. Spanish
  9. Thai
  10. Vietnamese
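
A minimal sketch of extracting fixed-size BERT vectors for the input texts. The use of Hugging Face `transformers`, the `bert-base-multilingual-cased` checkpoint, and [CLS]-token pooling are all assumptions for illustration; the report specifies the exact pipeline used.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint: a multilingual BERT is a natural fit for 10 languages.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def bert_vectors(texts):
    """Return one [CLS] embedding per input text as a NumPy array."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's hidden state as the sentence-level feature vector.
    return outputs.last_hidden_state[:, 0, :].numpy()

X = bert_vectors(["an example utterance", "another example"])
print(X.shape)  # (2, 768) for bert-base models
```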

The program was implemented in Python using BERT. Refer to the report for further implementation details and instructions on running the code: View Report

## Results

  1. Logistic Regression Model Predictions: Among all languages, the highest precision, recall, and F1-score are for Thai, and the lowest are for Mandarin. Misclassification is highest for Mandarin and Cantonese and lowest for Thai.

  2. Neural Network Model Predictions: Using the MLP Classifier, the highest precision, recall, and F1-score are again for Thai, and the lowest are for Mandarin. Misclassification is highest for Mandarin and lowest for Thai. Both models are sketched below.
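
A minimal sketch of training and evaluating the two classifiers on the BERT vectors, assuming `X` holds the vectors from the extraction sketch above and `y` the language labels; the split ratio and model parameters are illustrative, not the exact settings used. `classification_report` prints the per-language precision, recall, and F1-scores summarized above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X: BERT feature vectors (see extraction sketch above); y: language labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: multinomial logistic regression on the BERT vectors.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, log_reg.predict(X_test)))

# Neural network: scikit-learn's MLP classifier on the same features.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, mlp.predict(X_test)))
```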


## Improvements

The logistic regression model can be improved by hyperparameter tuning via grid search. The neural network model can be improved by applying hyperparameter optimization to parameters such as hidden_layer_sizes, activation, solver, alpha, learning_rate, and max_iter. Training both models on more data, again using BERT vectors, should yield further improvements.
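
A minimal sketch of the suggested grid search, using scikit-learn's `GridSearchCV` over the MLPClassifier parameters named above; the grid values are illustrative assumptions. The same pattern applies to the logistic regression model (e.g. with a grid over its `C` parameter).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Candidate values for each parameter named in the text (illustrative only).
param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "sgd"],
    "alpha": [1e-4, 1e-3],
    "learning_rate": ["constant", "adaptive"],
    "max_iter": [300, 500],
}

# Macro-averaged F1 treats all 10 language classes equally.
search = GridSearchCV(MLPClassifier(), param_grid, cv=3, scoring="f1_macro", n_jobs=-1)
search.fit(X_train, y_train)  # X_train/y_train: BERT vectors and labels as above
print(search.best_params_)
```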