The program performs text classification over the following 10 language classes using BERT vectors (a feature-extraction sketch follows the list):
- Arabic
- Cantonese
- Japanese
- Korean
- Mandarin
- Polish
- Russian
- Spanish
- Thai
- Vietnamese
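
As a minimal sketch of the feature-extraction step, assuming the Hugging Face `transformers` library and the `bert-base-multilingual-cased` checkpoint (a multilingual model is assumed since the classes span ten languages; the report documents the actual setup):

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint, chosen here only because the task is multilingual;
# the report describes the model actually used.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def bert_vector(text: str) -> torch.Tensor:
    """Return a fixed-size BERT vector for one text sample."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token embedding as the sentence-level vector.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

vec = bert_vector("¿Dónde está la biblioteca?")
print(vec.shape)  # torch.Size([768])
```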
The program was implemented in Python using BERT. Refer to the report for further implementation details and instructions on running the code:
View Report
- Logistic Regression Model Predictions: Among all languages, the highest precision, recall, and F1-score are for Thai, and the lowest are for Mandarin. Misclassification is highest for Mandarin and Cantonese and lowest for Thai.
- Neural Network Model Predictions: Using the MLPClassifier, the highest precision, recall, and F1-score are again for Thai, and the lowest are for Mandarin. Misclassification is highest for Mandarin and lowest for Thai. (A training-and-evaluation sketch follows this list.)
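
For context, a minimal sketch of how both classifiers could be trained and evaluated with scikit-learn. The random arrays below are placeholders standing in for real BERT vectors and language labels; in the actual pipeline these would come from the feature-extraction step above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholder data: 768-dim "BERT vectors" and ten language classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))
y = rng.integers(0, 10, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression baseline.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, log_reg.predict(X_test)))

# Neural network baseline (MLPClassifier).
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=0)
mlp.fit(X_train, y_train)
print(classification_report(y_test, mlp.predict(X_test)))
```

The per-language precision, recall, and F1 comparisons above come directly from reports like these.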
The logistic regression model can be improved through hyperparameter tuning via grid search. The neural network model can be improved by applying hyperparameter optimization tools to parameters such as hidden_layer_sizes, activation, solver, alpha, learning_rate, and max_iter (a grid-search sketch follows). Training the models with BERT vectors and more data should yield further improvements.
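
A minimal sketch of the grid-search idea, using scikit-learn's GridSearchCV over the parameters named above. The grid values and the placeholder data are illustrative assumptions, not the tuned settings from the report:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Placeholder data as before; real BERT vectors and labels would be used.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 10, size=200)

# Illustrative grid over the parameters mentioned above; useful ranges
# depend on the data and compute budget.
param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3],
    "learning_rate": ["constant", "adaptive"],
}
search = GridSearchCV(
    MLPClassifier(solver="adam", max_iter=300, random_state=0),
    param_grid, cv=3, scoring="f1_macro", n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

The same pattern applies to the logistic regression model, e.g. by searching over its regularization strength C.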