Text Mining in R

Sentiment Analysis Text Mining using R

This project explores text mining for classification: documents are converted to a bag-of-words representation (#bagofwords) and machine learning models are then trained on it.

Few popular hashtags -

#R #MachineLearning #NLP

#patternlearning #BagofWords #textanalytics

Motivation

Nowadays, the daily increase in data available online leads to a growing need for that data to be organized and regularized. Textual data is all around us, from web pages, e-books and media articles to emails and user comments. There are many cases where automatic text classification would reduce processing time (for example, detection of spam pages, personal email sorting, product tagging or document filtering). Any organization (e.g. in academia, marketing or government) that deals with a lot of unstructured text could handle that data much more easily if it were standardized by categories/tags. The dataset is a collection of newsgroup documents. The 4-newsgroup collection can be used for experiments in text applications of machine learning techniques, such as text classification and text clustering.

About the Project

What is Text Mining?

Text classification, or text categorization, is the activity of labelling natural language texts with relevant predefined categories. The idea is to organize text into different classes automatically. It can drastically simplify and speed up searching through documents or texts.

Steps involved in this project

3 major steps in the Text-Mining-in-R code:

  1. When training and building a model, keep in mind that the first model is never the best one, so the best practice is the “trial and error” method. To make that process simpler, create a function for training and, on each attempt, save the results and accuracies.

  2. I sorted the EDA process into two categories: general pre-processing steps that were common across all vectorizers and models, and optional pre-processing steps that I toggled to measure model performance with and without them.

  3. Accuracy was chosen as the measure of comparison between models: the greater the accuracy, the better the model performs on test data.
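
The "save results and accuracies on each attempt" idea from step 1 can be sketched in base R as follows. This is only an illustration: `record_attempt`, the option strings and the toy label vectors are hypothetical and do not come from the project code.

```r
# Hypothetical helper: score one attempt and append it to a results log.
record_attempt <- function(log, model_name, options, predicted, actual) {
  accuracy <- mean(predicted == actual)  # share of correct predictions
  rbind(log, data.frame(model = model_name, options = options,
                        accuracy = accuracy, stringsAsFactors = FALSE))
}

# Toy labels standing in for real test-set classes and model predictions.
actual <- c(1, 2, 3, 4, 1, 2, 3, 4)
pred_a <- c(1, 2, 3, 4, 1, 3, 3, 4)  # 7 of 8 correct
pred_b <- c(1, 2, 2, 4, 1, 3, 3, 1)  # 5 of 8 correct

results <- NULL
results <- record_attempt(results, "svm", "unigrams, no stemming", pred_a, actual)
results <- record_attempt(results, "svm", "unigrams + stemming", pred_b, actual)

# Sort so the best attempt so far is on top.
results <- results[order(-results$accuracy), ]
print(results)
```

Keeping every attempt in one data frame makes it easy to compare pre-processing options side by side instead of relying on memory of earlier runs.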

Explanation

  • First, I created a bag-of-words file. The file clean_data.R contains all the methods needed to preprocess the text and generate the bag of words. We use the Corpus library to handle preprocessing and to generate the bag of words.

  • The following general pre-processing steps were carried out since any document being input to a model would be required to be in a certain format:

  1. Converting to lowercase
  2. Removal of stop words
  3. Removing alphanumeric characters
  4. Removal of punctuations
  5. Vectorization: TfVectorizer was used. Its model accuracy was compared with that of models using TfIDFVectorizer. In all cases TfVectorizer gave better results, so it was chosen as the default vectorizer.
  • The following optional pre-processing steps were added to see how model performance changed with and without them:

  1. Stemming
  2. Lemmatization
  3. Using Unigrams/Bigrams
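
The general pre-processing steps above can be sketched in base R as follows. This is a minimal illustration, not the contents of clean_data.R: the `tokenize` helper, the tiny stop-word list and the three toy documents are all hypothetical, and step 3 is interpreted here as stripping digits, which is an assumption. The last lines also compute a TF-IDF weighting so the two vectorizations can be compared.

```r
# Tiny hypothetical stop-word list; a real project would use a fuller one.
stop_words <- c("the", "a", "is", "of", "and", "to", "in")

# Lowercase, strip punctuation and digits, split on whitespace, drop stop words.
tokenize <- function(doc) {
  doc <- tolower(doc)
  doc <- gsub("[[:punct:]]", " ", doc)
  doc <- gsub("[[:digit:]]", " ", doc)  # assumption: "alphanumeric" means digits
  tokens <- strsplit(trimws(doc), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

# Toy documents standing in for newsgroup posts.
docs <- c(d1 = "The engine of the car is LOUD!",
          d2 = "A car and a bike.",
          d3 = "Hockey: the game is in 2 hours")

tokens <- lapply(docs, tokenize)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency (bag-of-words) matrix: one row per document, one column per term.
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# TF-IDF variant: down-weight terms that occur in many documents.
idf   <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(tf)
```

Either matrix can be passed to a classifier as the feature table; the README reports that plain term frequency performed better in this project.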

Confusion Matrix for the Support Vector Machine, using the Bag of Words generated by clean_data.R

> confusionMatrix(table(predsvm,data.test$folder_class))
Confusion Matrix and Statistics

       
predsvm  1  2  3  4
      1 31  0  0  0
      2  0 29  6  0
      3  0  3 28  0
      4  0  0  0 23

Overall Statistics
                                          
               Accuracy : 0.925           
                 95% CI : (0.8624, 0.9651)
    No Information Rate : 0.2833          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8994          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
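
As a cross-check, the headline figures in this output can be recomputed from the confusion matrix alone. The sketch below uses only base R and copies the counts from the table above; the variable names are illustrative, and the per-class value shown is recall (correct predictions divided by the actual class size).

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual).
classes <- as.character(1:4)
cm <- matrix(c(31,  0,  0,  0,
                0, 29,  6,  0,
                0,  3, 28,  0,
                0,  0,  0, 23),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = classes, actual = classes))

accuracy <- sum(diag(cm)) / sum(cm)     # correct predictions / all predictions
nir      <- max(colSums(cm)) / sum(cm)  # always guessing the largest actual class
recall   <- diag(cm) / colSums(cm)      # per-class share of documents recovered

print(round(c(accuracy = accuracy, no_information_rate = nir), 4))
print(round(recall, 4))
```

The accuracy and no-information rate computed this way agree with the 0.925 and 0.2833 reported by `confusionMatrix`.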

-The most interesting observation is that the more specific a newsgroup's topic is, the more accurately the Naïve Bayes classifier can determine which newsgroup a document belongs to. The converse also holds: the less specific the newsgroup, the lower the accuracy rate.

-We can see this in the accuracy figures: every newsgroup that is not a misc group has an accuracy rate of at least 50%. The lowest-ranked newsgroups in terms of accuracy are all misc groups, including a 0.25% accuracy rate for talk.politics.misc.

-One reason is that posts written in misc newsgroups are rarely related to the actual root topic of the newsgroup. The misc section caters to topics of discussion other than those of the “root newsgroup”. As a result, it is much easier for the classifier to confuse a document from a misc newsgroup with another newsgroup, and much harder for it to even consider the root newsgroup, since topics regarding the root newsgroup are posted there instead.

-For example, a post about guns in talk.religion.misc can easily be classified as talk.politics.guns because it uses words similar to those found in talk.politics.guns posts. Likewise, posts about politics are less likely to appear in talk.politics.misc because authors are more likely to post in talk.politics.* or talk.politics.guns (where the * wildcard stands for the relevant section for the type of politics being discussed).

Libraries Used

R Studio

Installation

  • Install randomForest from the R console: install.packages("randomForest")
  • Install caret from the R console: install.packages("caret")
  • Install mlr from the R console: install.packages("mlr")
  • Install MASS from the R console: install.packages("MASS")
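
To avoid reinstalling packages that are already present, one option is to check first which of them are missing. The sketch below is an illustration only: `missing_packages` is a hypothetical helper, and it prints the install command rather than running it.

```r
# Packages named in the installation list above.
required <- c("randomForest", "caret", "mlr", "MASS")

# Return the subset of packages that cannot be loaded on this machine.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  # Show the command instead of installing, so nothing is downloaded unasked.
  message("Run: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  message("All required packages are already installed.")
}
```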

How to run?

R Studio

Project Reports

report

Useful Links

  1. Why Term Frequency is better than TF-IDF for text classification
  2. Naïve Bayes Classification for 20 News Group Dataset
  3. Analyzing word and document frequency: tf-idf
  4. Natural Language Processing
  5. K Nearest Neighbor in R
  6. MLR Package

Related Work

Sentiment Analysis

Text Mining Analyzer - A Detailed Report on the Analysis

Contributing

  • Clone this repository:
git clone https://github.com/iamsivab/Text-Mining-in-R.git

Need help?

Facebook Instagram LinkedIn

📧 Feel free to contact me @ [email protected]

GMAIL Twitter Follow

License

MIT © Sivasubramanian
