Arabic Sequence Labeling: Part-of-Speech Tagging NLP Task

This project implements Arabic part-of-speech tagging and was developed as part of the "Natural Language Processing" course of my master's degree. It uses the Arabic PUD dataset from Universal Dependencies and implements:

  1. A deep learning model (BiLSTM) for sequence labeling classification
  2. A pre-deep-learning model (KNN) for multi-class classification


Arabic PUD Dataset

During preprocessing, the following steps are applied:

  1. Remove tanween and tashkeel
  2. Remove sentences that contain non-Arabic words (i.e., English characters)

The distribution of tags in the dataset is visualized as a bar chart: the most common tag is NOUN, associated with the majority of words (5,553), while the least common tag is X. Each tag symbolizes a part of speech; refer to the image below for a description of each tag.
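A minimal sketch of these two cleaning steps (the diacritic Unicode range and the helper names are illustrative assumptions, not the project's exact code):

```python
import re

# Arabic diacritics (tanween and tashkeel) occupy the Unicode range U+064B-U+0652
DIACRITICS = re.compile(r'[\u064B-\u0652]')
# Latin characters signal a non-Arabic (e.g. English) word
LATIN_CHARS = re.compile(r'[A-Za-z]')

def remove_diacritics(token):
    """Strip tanween and tashkeel marks from a token."""
    return DIACRITICS.sub('', token)

def is_arabic_sentence(tokens):
    """Keep only sentences with no Latin (English) characters."""
    return not any(LATIN_CHARS.search(tok) for tok in tokens)

def preprocess(sentences):
    """sentences: list of [(token, tag), ...] pairs parsed from the CoNLL-U file."""
    cleaned = []
    for sent in sentences:
        tokens = [tok for tok, _ in sent]
        if is_arabic_sentence(tokens):
            cleaned.append([(remove_diacritics(tok), tag) for tok, tag in sent])
    return cleaned
```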

Arabic Word Embedding

Word embedding provides a dense representation of words and their relative meanings.
The word embedding technique used in this project is the N-gram Word2Vec skip-gram model from the AraVec project, trained on Twitter data with a vector size of 300.
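A minimal sketch of loading the pre-trained vectors with gensim and building an embedding matrix over the Keras tokenizer's vocabulary (the model file name is an assumption and depends on which AraVec release is downloaded):

```python
import numpy as np
from gensim.models import Word2Vec

EMBED_DIM = 300

# Hypothetical file name for the AraVec Twitter skip-gram model (vector size 300)
aravec = Word2Vec.load('full_grams_sg_300_twitter.mdl')

def build_embedding_matrix(word_index):
    """Map every word in the tokenizer vocabulary to its AraVec vector;
    out-of-vocabulary words keep an all-zero row."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
    for word, idx in word_index.items():
        if word in aravec.wv:
            matrix[idx] = aravec.wv[word]
    return matrix
```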

Structure of the BiLSTM sequence labeling classification model
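A minimal sketch of a plausible layer stack built from the layers listed under Requirements (the hidden size, optimizer, and loss are assumptions, not necessarily the project's exact configuration):

```python
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, Bidirectional, LSTM, TimeDistributed, Dense

def build_bilstm(vocab_size, embedding_matrix, max_len, n_tags, embed_dim=300):
    """Embedding initialized from AraVec, a bidirectional LSTM over the sentence,
    and a per-time-step softmax over the POS tag set."""
    model = Sequential([
        InputLayer(input_shape=(max_len,)),
        Embedding(vocab_size, embed_dim,
                  weights=[embedding_matrix], trainable=False),
        Bidirectional(LSTM(128, return_sequences=True)),
        TimeDistributed(Dense(n_tags, activation='softmax')),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```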

Results

The dataset is split into 70% for training and 30% for testing.

BiLSTM sequence labeling classification model
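A minimal sketch of how the model above might be trained and scored on the 70/30 split (batch size, epoch count, and variable names are assumptions):

```python
# Assumed variables: model = build_bilstm(...); X_train/X_test are padded
# word-index sequences and y_train/y_test the corresponding padded tag indices.
model.fit(X_train, y_train, validation_split=0.1, batch_size=32, epochs=10)
loss, accuracy = model.evaluate(X_test, y_test)
print(f'BiLSTM test accuracy: {accuracy:.3f}')
```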

KNN multi-class classification model
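A minimal sketch of how the KNN baseline might be set up, treating each word as an independent sample represented by its AraVec vector (the helper name and the n_neighbors value are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

def evaluate_knn(token_vectors, tag_labels, n_neighbors=5):
    """token_vectors: one embedding per word; tag_labels: encoded POS tag per word."""
    X_train, X_test, y_train, y_test = train_test_split(
        token_vectors, tag_labels, test_size=0.3, random_state=42)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    print(classification_report(y_test, knn.predict(X_test)))
    return knn
```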


Requirements

Preprocessing and visualization

  • conllu
  • matplotlib.pyplot
  • pandas
  • re
  • seaborn
  • numpy
  • tensorflow (Tokenizer, pad_sequences)
  • sklearn (preprocessing.LabelEncoder, model_selection.train_test_split)

Word Embedding

  • gensim

Classification model

  • tensorflow
  • keras.models.Sequential
  • keras.layers (Dense, Embedding, Bidirectional, LSTM, TimeDistributed, InputLayer)
  • sklearn.neighbors.KNeighborsClassifier

Model Evaluation

  • sklearn.metrics

References and Resources

  • Reading and parsing the dataset: link
  • Processing input data: link
  • AraVec word embedding model: link
  • Keras Embedding layer: link1, link2, link3