Arabic Sequence Labeling: Part-of-Speech Tagging NLP Task

This project implements Arabic part-of-speech tagging and was developed as part of the "Natural Language Processing" course of my master's degree. It uses the Arabic PUD dataset from Universal Dependencies and implements:

  1. A deep learning model (BiLSTM) for sequence labeling classification
  2. A pre-deep-learning model (KNN) for multi-class classification


Arabic PUD Dataset

During preprocessing, the following steps are applied:

  1. Remove tanween and tashkeel
  2. Remove sentences that contain non-Arabic words (i.e., English characters)

The distribution of tags in the dataset is visualized as a bar chart: the most common tag is NOUN, associated with the majority of words (5,553), while the least common tag is X. Each tag symbolizes a part of speech; refer to the image below for a description of each tag.
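A minimal sketch of these two cleaning steps (the diacritic Unicode range and the helper names are illustrative assumptions, not the project's exact code):

```python
import re

# Arabic diacritics (tanween and tashkeel) occupy the Unicode range U+064B-U+0652
DIACRITICS = re.compile(r'[\u064B-\u0652]')
# Latin characters signal a non-Arabic (e.g. English) word
LATIN_CHARS = re.compile(r'[A-Za-z]')

def remove_diacritics(token):
    """Strip tanween and tashkeel marks from a token."""
    return DIACRITICS.sub('', token)

def is_arabic_sentence(tokens):
    """Keep only sentences with no Latin (English) characters."""
    return not any(LATIN_CHARS.search(tok) for tok in tokens)

def preprocess(sentences):
    """sentences: list of [(token, tag), ...] pairs parsed from the CoNLL-U file."""
    cleaned = []
    for sent in sentences:
        tokens = [tok for tok, _ in sent]
        if is_arabic_sentence(tokens):
            cleaned.append([(remove_diacritics(tok), tag) for tok, tag in sent])
    return cleaned
```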

Arabic Word Embedding

Word embedding provides a dense representation of words and their relative meanings.
The word embedding technique used in this project is the N-gram Word2Vec skip-gram model from the AraVec project, trained on Twitter data with a vector size of 300.
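A minimal sketch of loading the pre-trained vectors with gensim and building an embedding matrix over the Keras tokenizer's vocabulary (the model file name is an assumption and depends on which AraVec release is downloaded):

```python
import numpy as np
from gensim.models import Word2Vec

EMBED_DIM = 300

# Hypothetical file name for the AraVec Twitter skip-gram model (vector size 300)
aravec = Word2Vec.load('full_grams_sg_300_twitter.mdl')

def build_embedding_matrix(word_index):
    """Map every word in the tokenizer vocabulary to its AraVec vector;
    out-of-vocabulary words keep an all-zero row."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
    for word, idx in word_index.items():
        if word in aravec.wv:
            matrix[idx] = aravec.wv[word]
    return matrix
```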

Structure of the BiLSTM sequence labeling classification model
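A minimal sketch of a plausible layer stack built from the layers listed under Requirements (the hidden size, optimizer, and loss are assumptions, not necessarily the project's exact configuration):

```python
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, Bidirectional, LSTM, TimeDistributed, Dense

def build_bilstm(vocab_size, embedding_matrix, max_len, n_tags, embed_dim=300):
    """Embedding initialized from AraVec, a bidirectional LSTM over the sentence,
    and a per-time-step softmax over the POS tag set."""
    model = Sequential([
        InputLayer(input_shape=(max_len,)),
        Embedding(vocab_size, embed_dim,
                  weights=[embedding_matrix], trainable=False),
        Bidirectional(LSTM(128, return_sequences=True)),
        TimeDistributed(Dense(n_tags, activation='softmax')),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```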

Results

The dataset is split into 70% for training and 30% for testing.

BiLSTM sequence labeling classification model
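A minimal sketch of how the model above might be trained and scored on the 70/30 split (batch size, epoch count, and variable names are assumptions):

```python
# Assumed variables: model = build_bilstm(...); X_train/X_test are padded
# word-index sequences and y_train/y_test the corresponding padded tag indices.
model.fit(X_train, y_train, validation_split=0.1, batch_size=32, epochs=10)
loss, accuracy = model.evaluate(X_test, y_test)
print(f'BiLSTM test accuracy: {accuracy:.3f}')
```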

KNN multi-class classification model
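A minimal sketch of how the KNN baseline might be set up, treating each word as an independent sample represented by its AraVec vector (the helper name and the n_neighbors value are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

def evaluate_knn(token_vectors, tag_labels, n_neighbors=5):
    """token_vectors: one embedding per word; tag_labels: encoded POS tag per word."""
    X_train, X_test, y_train, y_test = train_test_split(
        token_vectors, tag_labels, test_size=0.3, random_state=42)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    print(classification_report(y_test, knn.predict(X_test)))
    return knn
```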


Requirements

Preprocessing and visualization

  • conllu
  • matplotlib.pyplot
  • pandas
  • re
  • seaborn
  • numpy
  • tensorflow (Tokenizer, pad_sequences)
  • sklearn (preprocessing.LabelEncoder, model_selection.train_test_split)

Word Embedding

  • gensim

Classification model

  • tensorflow
  • keras.models.Sequential
  • keras.layers (Dense, Embedding, Bidirectional, LSTM, TimeDistributed, InputLayer)
  • sklearn.neighbors.KNeighborsClassifier

Model Evaluation

  • sklearn.metrics

References and Resources

  • Reading and parsing the dataset: link
  • Processing input data: link
  • AraVec word embedding model: link
  • Keras Embedding layer: link1, link2, link3