NLP-GitHub-Bug-Prediction-GoogleBERT

Applying NLP techniques to Embold's GitHub bug dataset and deploying the DistilBERT model (a distilled version of Google's BERT) for classification.

Overview

Developed a classification model in Python using NLP techniques and a pretrained DistilBERT model (a distilled version of Google's BERT). The dataset comes from MachineHack's Bug Prediction Dataset and was preprocessed using regex and the NLTK Python library, along the lines of the sketch below.
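
A minimal preprocessing sketch, assuming the usual regex and NLTK steps (lowercasing, stripping URLs and markup, removing stopwords, lemmatizing); the exact cleaning applied in the project notebooks may differ.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Illustrative cleaning: lowercase, strip URLs/markup/non-letters, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)           # HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)          # keep letters only
    tokens = [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)
```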

Features

  • Feature extraction from raw text using TF-IDF and CountVectorizer (see the sketch after this list).
  • Word embeddings that represent words as dense vectors, trained with Word2Vec via Gensim.
  • Data visualization using the t-SNE algorithm and Plotly Express sunburst/treemap charts.
  • Accuracy score optimized as the evaluation metric so the model generalizes well to unseen data.
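
A minimal feature-extraction sketch covering the first two items, assuming scikit-learn's CountVectorizer/TfidfVectorizer and Gensim's Word2Vec on the cleaned issue texts; vocabulary sizes and embedding dimensions here are illustrative, not the project's actual settings.

```python
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative cleaned documents; in the project these come from preprocessed titles/bodies.
docs = ["fix crash when opening settings", "add dark mode support"]

# Bag-of-words and TF-IDF features for classical baselines.
bow = CountVectorizer(max_features=20000).fit_transform(docs)
tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1, 2)).fit_transform(docs)

# Word2Vec embeddings trained on the tokenized corpus (Gensim 4.x uses `vector_size`).
w2v = Word2Vec(sentences=[d.split() for d in docs], vector_size=100, window=5, min_count=1)
fix_vector = w2v.wv["fix"]  # 100-dimensional embedding for the token "fix"
```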

Dataset Description:

Train.json - 150,000 rows x 3 columns (includes the label column as the target variable)
Test.json - 30,000 rows x 2 columns
Train_extra.json - 300,000 rows x 3 columns (includes the label column as the target variable)

For evaluation, a code quality score is obtained via the Embold Code Analysis platform.
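
A minimal loading sketch, assuming the three MachineHack files sit in the working directory and are plain JSON readable by pandas; the paths and the exact JSON layout are assumptions.

```python
import pandas as pd

# Hypothetical paths; adjust to wherever the MachineHack files are stored.
train = pd.read_json("Train.json")              # 150,000 rows: title, body, label
test = pd.read_json("Test.json")                # 30,000 rows: title, body
train_extra = pd.read_json("Train_extra.json")  # 300,000 extra labelled rows

print(train.shape, train.columns.tolist())
```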

Attribute Description:

Title - the title of the GitHub issue (bug, feature, or question)
Body - the body of the GitHub issue (bug, feature, or question)
Label - the target class, encoded as follows:

  • Bug - 0
  • Feature - 1
  • Question - 2
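
A minimal fine-tuning sketch for this three-class problem, assuming the Hugging Face transformers implementation of DistilBERT (`distilbert-base-uncased`) and that title and body are joined into a single input; the project's actual training loop, sequence length, and hyperparameters may differ.

```python
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3  # 0 = Bug, 1 = Feature, 2 = Question
)

# Illustrative example: title and body joined into one sequence (an assumption).
texts = ["App crashes on startup. Stack trace attached below."]
labels = torch.tensor([0])  # Bug

enc = tokenizer(texts, truncation=True, padding=True, max_length=256, return_tensors="pt")
out = model(**enc, labels=labels)
out.loss.backward()  # one illustrative gradient step; wrap in an optimizer loop in practice
print(out.logits.softmax(dim=-1))  # class probabilities for Bug / Feature / Question
```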

Performance metrics used: confusion matrix and AUC-ROC curve
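
A minimal evaluation sketch with scikit-learn, assuming `y_true` holds the true labels and `y_prob` the model's predicted class probabilities; the one-vs-rest scheme for multiclass ROC-AUC is an assumption about how the curve was computed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Illustrative arrays; replace with the model's validation outputs.
y_true = np.array([0, 1, 2, 0, 1])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.6, 0.3, 0.1],
                   [0.3, 0.5, 0.2]])
y_pred = y_prob.argmax(axis=1)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC-AUC (one-vs-rest):", roc_auc_score(y_true, y_prob, multi_class="ovr"))
```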

Images

AUC-ROC curve

[Figure: AUC-ROC curve]

Confusion Matrix

[Figure: confusion matrix]

Visualization

[Figure: sunburst chart]

Figure: Visualizing data using Plotly Express sunburst treemaps
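
A minimal sketch of how such a sunburst could be produced with Plotly Express, assuming a dataframe with a `label` column and a derived text-length bucket; these grouping columns are illustrative, not necessarily the ones used in the project.

```python
import pandas as pd
import plotly.express as px

# Illustrative dataframe; in the project this would come from Train.json.
df = pd.DataFrame({
    "label": ["Bug", "Bug", "Feature", "Question", "Feature"],
    "length_bucket": ["short", "long", "short", "short", "long"],
})

# Count issues per (label, length bucket) and draw a two-level sunburst.
counts = df.groupby(["label", "length_bucket"], as_index=False).size()
fig = px.sunburst(counts, path=["label", "length_bucket"], values="size")
fig.show()
```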

Project By: