NLP-GitHub-Bug-Prediction-GoogleBERT

Applies NLP techniques to Embold's GitHub dataset and deploys the pretrained DistilBERT model (a distilled version of Google's BERT) for classification.

Overview

Developed a model using NLP (Python) with the pretrained DistilBERT model. The dataset is MachineHack's Bug Prediction Dataset, preprocessed using regex and the NLTK Python library.
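The regex part of the preprocessing could look roughly like the sketch below. The patterns, the `clean_text` name, and the tiny stop-word set are assumptions made for illustration; the project's actual rules live in the preprocess folder and use NLTK for stop words.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the project itself relies on NLTK.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "to", "of", "and"}

URL_RE = re.compile(r"https?://\S+")       # links pasted into issues
CODE_RE = re.compile(r"`[^`]*`")           # inline code spans
NON_ALPHA_RE = re.compile(r"[^a-z\s]")     # digits, punctuation, symbols

def clean_text(text):
    """Lowercase, strip URLs, inline code and non-letters, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = CODE_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = clean_text("App crashes on startup, see https://example.com/log `npm start`")
print(tokens)
```

Compiling the patterns once keeps the function cheap to call on every row of a large dataset.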

Features

Feature extraction from raw text using TF-IDF and CountVectorizer.
Word embeddings that represent words as vectors, using Word2Vec (Gensim).
Data visualization using the t-SNE algorithm and Plotly Express sunburst treemaps.
Optimizing for accuracy as the evaluation metric, with the aim of generalizing well to unseen data.
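To make the first feature concrete, here is a minimal pure-Python sketch of what TF-IDF weighting computes, using the textbook formula tf * log(N / df). It is not the project's code: library vectorizers such as scikit-learn's TfidfVectorizer apply a smoothed IDF and normalize each row, so their numbers differ. The `tfidf` function name and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts, i.e. what a count vectorizer produces
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["app", "crashes", "startup"],
    ["add", "dark", "mode"],
    ["app", "dark", "theme"],
]
weights = tfidf(docs)
# "crashes" occurs in one document, "app" in two, so "crashes" gets the larger weight.
print(weights[0])
```

Terms that appear in every document receive a weight of zero under this formula, which is the behavior that lets TF-IDF down-weight uninformative words.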

Dataset Description:

Train.json - 150000 rows x 3 columns (includes the label column as the target variable)
Test.json - 30000 rows x 2 columns
Train_extra.json - 300000 rows x 3 columns (includes the label column as the target variable)

A code quality score from the Embold Code Analysis platform is also used for evaluation.

Attribute Description:

Title - the title of the GitHub bug, feature, question
Body - the body of the GitHub bug, feature, question
Label - the class of the issue, encoded as an integer:

Bug - 0
Feature - 1
Question - 2
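The label encoding above can be kept in one place as a small lookup, which helps when turning model outputs back into readable class names. The names `LABEL_NAMES`, `LABEL_IDS`, and `decode` are hypothetical helpers, not identifiers from the repository.

```python
# Integer label -> class name, exactly as listed in the attribute description.
LABEL_NAMES = {0: "Bug", 1: "Feature", 2: "Question"}

# Reverse lookup, useful when encoding string labels for training.
LABEL_IDS = {name: idx for idx, name in LABEL_NAMES.items()}

def decode(predictions):
    """Map a sequence of predicted integer labels to class names."""
    return [LABEL_NAMES[p] for p in predictions]

print(decode([0, 2, 1]))
```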

Performance metrics used: confusion matrix, AUC-ROC curve
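As a rough illustration of the confusion matrix for the three classes, the sketch below builds one by hand and derives accuracy from its diagonal. The labels and predictions are made-up toy values, not results from this project; rows are true classes and columns are predicted classes, the same convention scikit-learn's `confusion_matrix` uses.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Count (true, predicted) pairs; rows are true classes, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. predicted correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

# Toy values only: 0 = Bug, 1 = Feature, 2 = Question.
y_true = [0, 0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 0, 1]

cm = confusion_matrix(y_true, y_pred)
for row in cm:
    print(row)
print(accuracy(cm))
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy number hides.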

Images

AUC-ROC curve

Confusion Matrix

Visualization

Figure: Visualizing data using Plotly Express sunburst treemaps

valkyron/NLP-GitHub-Bug-Prediction-GoogleBERT
