Short Message Service (SMS) is a text communication platform that allows mobile phone users to exchange short text messages (usually less than 160 7-bit characters) at a low cost. The amount of unsolicited commercial advertisements (spams) over SMS will be bigger since the cost of SMS will decrease.
SMS spam is not familiar compared to mail spam.
The dataset for this project is a large text file established by a database of 5574 real text messages from UCI Machine Learning repository gathered in 2012 which contains:
- 425 SMS spam messages from the Grumbletext website (a UK forum in which cell phone users make public claims about SMS spam.
- 3375 SMS non-spam messages from the NUS SMS Corpus (NSC).
- 450 SMS non-spam messages from Caroline Tag's PhD Thesis.
- 1002 SMS non-spam and 322 spam messages from SMS Spam Corpus v.0.1 Big.
Downloaded from SMS Spam Collection Data Set from UCI Machine Learning Repository.
Each line of dataset is formatted as follow:
- label of the message
- text message string
The established email filtering algorithms are not efficient for SMS since:
- short length
- informal language
- full of abbreviations
The extraction and analysis of the data is done with Python.
The machine learning algorithms tested are:
- Naive Bayes Algorithm
- Support Vector Machine (SVM) Algorithm
- k-Nearest Neighbor Algorithm
- ...
For each text message:
- count the overall number of characters.
- count the number of dollars.
- count the number of numeric characters.
- remove space ( ), comma (,), dot (.), or any special characters.
- split into tokens of alphabetic characters.
- count the number of words in uppercase.
- convert all tokens into lowercase and find the word root.
- add a token at the begining with the number of characters.
- add a token at the 2nd place with the number of dollars.
- add a token at the 3rd place with the number of numeric characters.
- add a token at the 4th place with the number of words in uppercase.
There are some problems with very special characters (utf-8), need to fix that!
Need to divide sms into two datasets (ham set and spam set). Then need to find the most popular words (10) and the worst popular words (10) in these two sets. Then use an association algorithm with these 40 words to find out if they're important. Then add them under @attribute field in the arff file for Weka.