- Jupyter Notebook with Python 3
- All the Python packages used in the project, e.g.:
- numpy
- pandas
- matplotlib
- scikit-learn
- xgboost
    - an active learning library
- ...
- The goal is to predict the willingness of Facebook users to provide their private data to social experiments for a certain amount of money
- The model is trained on the public data of users, labeled with their willingness (yes/no)
- Given only a user's public data, the model predicts their willingness
- The project is divided into three parts:
- Crawling
- Preprocessing
- Training
- Before we get to the code: this project is itself part of a social experiment, so we used questionnaires to collect the accounts and willingness of FB users
- ~1000 responses were collected
- 70% positive
- The crawled data are all the public data of each user
- We also take the total number of 'Likes' and the total number of posts into account
- In total of 60 attributes
- Drop the empty or too scarce columns
- Group similar attributes into one
- Rank by 'Education'
- Group by 'Blood', 'Gender' and 'Religion'
- Group by 'Hometown' and 'Current City'
- 37 attributes are extracted from the original 60 ones.
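The preprocessing steps above can be sketched with pandas. The column names, thresholds, and toy rows here are hypothetical stand-ins for the real crawled data, which has ~60 attributes:

```python
import pandas as pd

# Hypothetical sample of crawled profile data (the real data has ~60 columns).
df = pd.DataFrame({
    "Education":    ["High School", "Bachelor", None, "Master"],
    "Blood":        ["A", None, None, None],          # too scarce -> dropped
    "Hometown":     ["Taipei", "Tainan", "Taipei", None],
    "Current City": ["Taipei", "Taipei", None, "Kaohsiung"],
    "Likes":        [120, 85, 300, 40],
    "Posts":        [30, 12, 55, 8],
})

# Drop columns that are empty or too scarce (assumed threshold: >70% missing).
sparse = df.columns[df.isna().mean() > 0.7]
df = df.drop(columns=sparse)

# Rank ordinal attributes such as 'Education'.
edu_rank = {"High School": 1, "Bachelor": 2, "Master": 3}
df["Education"] = df["Education"].map(edu_rank)

# Group related attributes, e.g. whether the user still lives in their hometown.
df["SameCity"] = (df["Hometown"] == df["Current City"]).astype(int)

# Aggregate activity counts into a single engagement feature.
df["Engagement"] = df["Likes"] + df["Posts"]
```

Applying such transforms column by column is how 60 raw attributes could be condensed to the 37 used for training.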
- Use upsampling to balance the data
- Use 5-fold cross-validation as our method to train the data
- We have tried many different models, e.g. SVM, random forest, ...
- The highest accuracy is about 81%
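The upsampling and 5-fold cross-validation pipeline might look like the sketch below. It uses synthetic data in place of the real survey responses (assumed shape: ~1000 rows, 37 features, ~70% positive), and the model choices mirror the ones named above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.utils import resample

# Synthetic stand-in for the preprocessed data: ~1000 rows, 37 features,
# roughly 70% positive, matching the questionnaire results.
X, y = make_classification(n_samples=1000, n_features=37,
                           weights=[0.3, 0.7], random_state=0)

# Upsample the minority class so both labels are equally represented.
X_min, X_maj = X[y == 0], X[y == 1]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.hstack([np.ones(len(X_maj)), np.zeros(len(X_up))])

# Compare candidate models with 5-fold cross-validation.
for name, model in [("SVM", SVC()),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    scores = cross_val_score(model, X_bal, y_bal, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The reported 81% figure would come from the best-scoring model on the real data, not from this synthetic example.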
- Use linear SVM as our learner and apply active learning (AL)
- We build our own initial strategy and query strategy to choose data points
- And combine them in pairs
- Use upsampling to balance the data
- Use 5-fold cross-validation as our method to train the data
- AL reaches a peak accuracy of about 77%
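A minimal active learning loop with a linear SVM could look like this. The seed-set size, query budget, and uncertainty-sampling query strategy here are illustrative assumptions, not the project's own initial/query strategies, and synthetic data stands in for the real features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the real project uses the preprocessed FB features.
X, y = make_classification(n_samples=600, n_features=37, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# Initial strategy (assumed): a small random seed set with both classes present.
rng = np.random.default_rng(0)
labeled = []
for cls in (0, 1):
    idx = np.where(y_pool == cls)[0]
    labeled += list(rng.choice(idx, size=5, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

clf = SVC(kernel="linear")
for _ in range(30):                       # query budget (assumed)
    clf.fit(X_pool[labeled], y_pool[labeled])
    # Query strategy (assumed): uncertainty sampling -- query the pool
    # point closest to the SVM decision boundary.
    margins = np.abs(clf.decision_function(X_pool[unlabeled]))
    pick = unlabeled[int(np.argmin(margins))]
    labeled.append(pick)
    unlabeled.remove(pick)

print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```

Different pairings of initial and query strategies slot into the two marked spots, which is how the combinations above would be compared.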
- Can be replicated in other cultures, since social experiments are culture-dependent