Applying NLP techniques to Embold's GitHub dataset and deploying a DistilBERT model (Hugging Face's distilled version of Google's BERT) for classification.
Developed an NLP classification model in Python on top of a pretrained DistilBERT. The data comes from MachineHack's Bug Prediction Dataset and was preprocessed with regular expressions and the NLTK Python library.
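As a rough illustration of the regex part of that preprocessing, here is a minimal sketch using only the standard library. The patterns and the `clean_text` and `combine_fields` helpers are hypothetical, since the project's actual cleaning rules are not listed here; NLTK steps such as stop-word removal are left out.

```python
import re

# Hypothetical cleaning rules; the project's real patterns are not documented here.
URL_RE = re.compile(r"https?://\S+")      # links pasted into issues
CODE_RE = re.compile(r"`[^`]*`")          # inline code spans
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # digits, punctuation, symbols
WHITESPACE_RE = re.compile(r"\s+")        # runs of spaces and newlines


def clean_text(text: str) -> str:
    """Lowercase the text, drop URLs and inline code, keep letters only."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = CODE_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def combine_fields(title: str, body: str) -> str:
    """Join an issue's title and body into one cleaned string."""
    return clean_text(f"{title} {body}")


cleaned = clean_text("App crashes on start! See https://example.com/issue/42 and `foo()`")
```

Compiling the patterns once at module level keeps the per-row cost low, which matters when the same function is applied to hundreds of thousands of issues.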
- extracting features from raw text using TF-IDF and CountVectorizer.
- representing words as vectors with Word2Vec embeddings (Gensim).
- visualizing data using the t-SNE algorithm and Plotly Express sunburst treemaps.
- optimizing for accuracy so that the model generalizes well to unseen data.
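The feature-extraction step can be sketched with scikit-learn, which provides both `CountVectorizer` and `TfidfVectorizer`. This is an illustrative example on three made-up sentences with default vectorizer settings, not the project's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up, already-cleaned issue texts (not taken from the dataset).
docs = [
    "app crashes on start",
    "add dark mode support",
    "how to configure the app",
]

# Bag-of-words: each column holds the raw count of one vocabulary term.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: counts are reweighted so terms shared by many documents weigh less,
# and each row is L2-normalized (the scikit-learn default).
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Column indices of two terms in the TF-IDF matrix.
app_col = tfidf_vec.vocabulary_["app"]
crashes_col = tfidf_vec.vocabulary_["crashes"]

print(counts.shape, tfidf.shape)
print(tfidf[0, app_col], tfidf[0, crashes_col])
```

In the first document, "app" receives a lower TF-IDF weight than "crashes" because "app" also appears in the third document.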
Dataset files:
- Train.json - 150000 rows x 3 columns (includes the Label column as the target variable)
- Test.json - 30000 rows x 2 columns
- Train_extra.json - 300000 rows x 3 columns (includes the Label column as the target variable)
Evaluation: a code quality score is obtained using the Embold Code Analysis platform.
Title - the title of the GitHub bug, feature, or question
Body - the body of the GitHub bug, feature, or question
Label - the class of the issue, encoded as:
- Bug - 0
- Feature - 1
- Question - 2
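The label encoding above can be expressed as a small lookup table. The sketch below decodes a few made-up records and counts the classes; the lowercase JSON keys and the array layout are assumptions, since the exact file format is not described here.

```python
import json
from collections import Counter

# Label encoding as listed above.
LABEL_NAMES = {0: "Bug", 1: "Feature", 2: "Question"}

# Made-up records; key names and the array layout are assumptions.
raw = """[
  {"title": "App crashes on start", "body": "Stack trace attached", "label": 0},
  {"title": "Add dark mode", "body": "Would be nice to have", "label": 1},
  {"title": "How do I set the port?", "body": "Docs are unclear", "label": 2},
  {"title": "Null pointer in parser", "body": "Fails on empty input", "label": 0}
]"""

records = json.loads(raw)

# Count how many issues fall into each named class.
class_counts = Counter(LABEL_NAMES[r["label"]] for r in records)
print(class_counts)
```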
Performance metrics used: confusion matrix, AUC-ROC curve
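For three classes, these metrics could be computed with scikit-learn roughly as follows. The labels and probabilities are invented for illustration, and the one-vs-rest AUC averaging is an assumption, since the averaging scheme used in the project is not stated.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented ground truth (0 = Bug, 1 = Feature, 2 = Question).
y_true = np.array([0, 1, 2, 0, 1, 2])

# Invented class probabilities, one row per issue; each row sums to 1.
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.4, 0.5, 0.1],  # a Bug that the classifier scores as Feature
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])
y_pred = y_prob.argmax(axis=1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Accuracy is the share of predictions on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Macro-averaged one-vs-rest AUC (assumed averaging scheme).
auc = roc_auc_score(y_true, y_prob, multi_class="ovr")

print(cm)
print(accuracy, auc)
```

The confusion matrix shows which classes get mixed up (here one Bug predicted as Feature), while AUC scores how well the predicted probabilities rank each class against the others.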
Figure: Visualizing data using Plotly Express sunburst treemaps