This repository contains Sentiment Analyzer for the project Big Mood. Using the currently trending topics on twitter, we display the general mood of a location. This is done by retrieving tweets for trending hashtags and performing sentiment analysis using VADER sentiment analysis tool.
The steps given below will expose the trendsentiment endpoint which gets trending hashtags from BigMood-API. The sentiment for the trends is calculated and updated JSON schema is published to kafka and inserted to MongoDB. Logs can be viewed on Kibana. In addition to the steps below, we have setup CI-CD using TravisCI.
Step 1:
Deploy the following services on GCP. Note the credentials for each and use them in Step 2.
- MongoDB
- RedisCache
- Kafka
- Logstash
Step 2:
In the config folder, add a file conf.py containing the following configurations :
UN = "XXX" # MongoDB username for trends
PW = "XXX" # MongoDB password for trends
HOST = "XXX.XX.XXX.XXX" # MongoDB hostname for trends
MONGO = {
"URI" : "mongodb://"+UN+":"+PW+"@"+HOST
}
DUMP_INTERVAL = {
'dump_interval': XXX # Cache dump interval in seconds
}
logstash_host = "XX.XX.XXX.XXX" # Logstash hostname
logstash_port = XXXX # Logstash port
app_port = XXXX # App's port number
kafka_host = "XX.XXX.XXX.XXX:XXXX" # Kafka hostname and port
kafka_topic = "XXXX" # Kafka topic
consumer_sleep_time = XX # Kafka consumer sleep time in seconds
twitter_keys = [{ # Twitter keys
'streamConsumerKey': "XXXX",
'streamConsumerSecret': "XXXX",
'streamAccessTokenKey': "XXXX",
'streamAccessTokenSecret': "XXXX"
},
{
'streamConsumerKey': "XXXX",
'streamConsumerSecret': "XXXX",
'streamAccessTokenKey': "XXXX",
'streamAccessTokenSecret': "XXXX"
}]
Step 3:
Build the docker container (replace path with the project folder name) :
docker build PATH .
Run the docker container (replace image with the image name built above) :
docker run IMAGE
A lexicon and rule based sentiment analysis tool called VADER is used to calculate sentiment scores. It has been known to be successful specifically when dealing with social media text and is fast to be used with streaming data. It also performs well with emojis, exclamation marks, slangs etc.
The SentimentAnalyzer
class has functions that call the location service to get latitude and longitude for a city and uses the TweetRetriver
get tweets for a city. In case of a country, tweets are retrieved using the twitter streaming api with the help of bounding boxes. The tweets are passed to the polarity_scores()
method and polarity indices of the tweets are obtained. The compund score metric has been used to represent the overall sentiment of the tweet. This score varies from -1 (extremely negative) to +1 (extremely positive).
The reason for choosing this library over others was that it not just tells whether the text is positive or negative but also tells how positive or negative the text is.
Retrieving tweets on a per-city basis is covered by the TweetRetriever
class, which uses the birdy
library to abstract the Twitter API. Initialize a TweetRetriever
with the development environment name to be used for the 30-day search endpoint associated with your Twitter API keys stored in the configuration file.
Once a TweetRetriever
object is successfully initialized, the instance method get_tweets(trend, latitude, longitude, radius='25mi', endpoint='7day')
will return Tweets using the birdy
library. Because we do not have access to Twitter's premium APIs, we often find that the free 7-day search endpoint returns more tweets than the sandbox 30-day search endpoint. No matter which endpoint is chosen, tweets are queried within the given radius of a (latitude, longitude)
coordinate pair. The 7-day endpoint explicitly excludes retweets from the search query.
Because querying tweets by city requires only a simple GET request to the Twitter API, requests can be made simply and without the additional complexity of a tweet stream.
Retrieving tweets on a per-country basis is covered by the TweetStream
class, which uses the tweepy
library to abstract the Twitter API. Initialize a TweetStream
with the appropriate credentials to have access to Twitter's status/filter
endpoint.
Once a TweetStream
object is successfully intialized, the instance method get_tweets(bounding_boxes, trends, num_tweets)
will return tweets using tweepy
's streaming functionality. Each bounding box should be specified as a pair of longitude and latitude pairs, with the southwest corner of the bounding box coming first. We have provided a list of bounding boxes in the bounding_box.json file. The trends
are sent by the node.js server as a call to our POST endpoint.
Get the details of the latest tweets.
URL : /trendsentiment
Method : GET
Auth required : YES
Permissions required : None
Code : 200 OK
Content examples
For a User with ID 1234 on the local database where that User has saved an email address and name information.
{"trends":
[
{
"city":"Portsmouth",
"country":"United Kingdom",
"countryCode":"GB",
"locationType":"City",
"timestamp":"2019-04-27 20:00:57",
"trends":
[
{
"name":"Endgame",
"rank":1,
"sentiment":0.25058,
"tweetVolume":1288426,
"url":"http://twitter.com/search?q=Endgame"
},
{
"name":"London",
"rank":2,
"sentiment":0.272458,
"tweetVolume":247982,
"url":"http://twitter.com/search?q=London"
},
{
"name":"#ThankYouAvengers",
"rank":3,
"sentiment":0.3578857142857142,
"tweetVolume":211037,
"url":"http://twitter.com/search?q=%23ThankYouAvengers"
},
{
"name":"Wembley",
"rank":4,
"sentiment":0.23150999999999997,
"tweetVolume":40480,
"url":"http://twitter.com/search?q=Wembley"
},
{
"name":"Days Gone",
"rank":5,
"sentiment":0.19040769230769233,
"tweetVolume":34142,
"url":"http://twitter.com/search?q=%22Days+Gone%22"
}
]}
]}
Implementing the Elastic Stack (Elasticsearch, Logstash, and Kibana) for distributed logging and analysis has proven incredibly useful. All three services run on a single AWS or GCP instance. To set up these services, we first follow the tutorial at https://logz.io/learn/complete-guide-elk-stack/#installing-elk. Note that RAM becomes a limitation when running all three services in parallel. Approximately 4GB RAM is required. The smallest instance size we were able to successfully run on without encountering memory issues on GCP was 1 vCPU (3.75 GB memory). We used Ubuntu 16.04 LTS, and the instance has 20GB persistent disk.
BigMood-SentimentAnalyzer is written in Python and writes logs directly to Logstash using a TCP connection to port 9123. BigMood-API is written in Node.js and writes logs directly to Logstash using HTTP requests to port 9214. This is accomplished by various 3rd-party Logstash client libraries available in each language's package ecosystem. We tag each incoming message based on its architecture component, and then route it to Logstash with the following Logstash configuration file. Note that by adding port-based tags, we are able to differentiate between our services when viewing logs in Kibana.
sudo vim /etc/logstash/conf.d/BigMood.conf
tcp {
port => 9123
codec => json
tags => ["BigMood-SentimentAnalyzer"]
}
http {
port => 9124
codec => json
tags => ["BigMood-API"]
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
}
}
Remember that once the configuration file has been changed, the service must be restarted (sudo service logstash restart)
for changes to take effect.
Finally, to make the Kibana UI accessible to our 5-person group without exposing it to the entire internet, we no not expose the Kibana server's web interface port to the world. Instead, we install haproxy
with sudo apt install haproxy
, and expose a tiny HTTP server on port 80 that verifies login credentials and then forwards all port 80 traffic to the Kibana UI's port.
The haproxy
configuration file that facilitates this is as follows:
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
# Default SSL material locations
ca-base /etc/ssl/certs
crt-base /etc/ssl/private
# Default ciphers to use on SSL-enabled listening sockets.
# For more information, see ciphers(1SSL). This list is from:
# https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
ssl-default-bind-options no-sslv3
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 1m
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
userlist UsersFor_Kibana
user kibana insecure-password MY-SUPER-SECRET-PASSWORD
frontend localnodes
bind *:80
mode http
default_backend nodes
backend nodes
acl AuthOkay_Kibana http_auth(UsersFor_Kibana)
http-request auth realm Kibana if !AuthOkay_Kibana
mode http
balance roundrobin
option forwardfor
http-request set-header X-Forwarded-Port %[dst_port]
http-request add-header X-Forwarded-Proto https if { ssl_fc }
option httpchk HEAD / HTTP/1.1\r\nHost:localhost
server server1 127.0.0.1:5601 check
```