Demo scrapes financial news and uses NLP to gauge sentiment

ashatidealiq/marketpulse

Project description

This project demos a sentiment analysis model for financial news headlines. It is a simple prototype that demonstrates data scraping, feature engineering (in the open-source Hopsworks feature store), and natural language processing with inference served through a (very) basic front end.

The model is fine-tuned on two base datasets: Financial PhraseBank and Zeroshot Twitter. Each week, new features are collected by scraping Yahoo News and labeling the sentiment of the new articles. The base model is then fine-tuned again on the incremented dataset, and if the newly fine-tuned model performs better than the prior version, the new one is deployed. The deployed model backs an app we built that lets users enter a search key and a maximum number of articles to analyze. The app makes a request to the API to fetch articles related to the search term from the past 7 days. The scraping script in the API is a JavaScript port of the Python script used to collect new features every week. The API is deployed on Google Cloud, the front-end app is deployed on Firebase (see News Sentiment Analyzer), and the model is hosted on Huggingface.co.

Architecture diagram

Base dataset description:

Our base data consists of two datasets:

  1. Financial PhraseBank
  2. Zeroshot Twitter

Both datasets include a text as well as a sentiment label: negative, neutral, and positive for Financial PhraseBank, and bearish, bullish, and neutral for Zeroshot Twitter. The Financial PhraseBank dataset contains 4 subsets based on how strongly the annotators agreed on the labeling: 50%, 66%, 75%, and 100%. We used the 75% agreement subset to maximize the size of our final dataset. Some simple preprocessing harmonizes the two label sets so the datasets can be combined. Ultimately the labels used are simply negative, positive, and neutral (mapped to 0, 1, and 2 in the feature store on Hopsworks).
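
As an illustration, the label harmonization boils down to a simple mapping. This is only a sketch with made-up rows and column names; the real preprocessing lives in preprocessing_pipeline.ipynb.

import pandas as pd

# Map both label vocabularies onto the shared scheme
# (negative -> 0, positive -> 1, neutral -> 2). Column names are assumptions.
LABEL_MAP = {"negative": 0, "bearish": 0, "positive": 1, "bullish": 1, "neutral": 2}

phrasebank = pd.DataFrame({"text": ["Operating profit fell sharply."], "label": ["negative"]})
zeroshot = pd.DataFrame({"text": ["$TSLA looking bullish into earnings"], "label": ["bullish"]})

combined = pd.concat([phrasebank, zeroshot], ignore_index=True)
combined["label"] = combined["label"].str.lower().map(LABEL_MAP)
print(combined)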

Base model description:

https://huggingface.co/bert-base-cased

The BERT model is a base model pre-trained on raw text only, with no human labeling. It was pre-trained with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). The MLM objective masks 15% of the words in the input sentence and predicts the masked words. The NSP objective concatenates two masked sentences as input during pre-training and predicts whether the second sentence followed the first in the original text.

The BERT model can be fine-tuned on a downstream task such as sequence classification, token classification, or question answering. The model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions. This model was first introduced in this paper https://arxiv.org/pdf/1810.04805.pdf
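
As a quick illustration of the MLM objective (not part of the project code), the standard transformers fill-mask pipeline can be pointed at bert-base-cased:

from transformers import pipeline

# bert-base-cased predicts the token hidden behind [MASK].
unmasker = pipeline("fill-mask", model="bert-base-cased")
for pred in unmasker("Shares of the company [MASK] after the earnings report."):
    print(pred["token_str"], round(pred["score"], 3))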

Project files and folders:

sentiment_analysis_backend

This folder contains all the files for the API used by the app, including the JavaScript version of yahoo_finance_news_scraper. The backend follows Node.js conventions. The API has one endpoint, /sentiment-analysis, used by the front end to get the sentiment of a specific stock. It takes two query parameters: searchKey, the search term used to find headlines related to the stock, and maxArticlesPerSearch, the maximum number of headlines used to analyze the sentiment of the search term. The endpoint returns a JSON object with the following structure:

{
    "result": [
        {
            "headline": "string",
            "posted": "Date | null",
            "text": "string",
            "href": "string"
        }
    ]
}
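
For illustration, a request to the endpoint could look like the sketch below (Python rather than the front end's JavaScript; the localhost URL and port are assumptions for a local run):

import requests

# Host and port are assumptions for a locally running backend.
resp = requests.get(
    "http://localhost:3000/sentiment-analysis",
    params={"searchKey": "AAPL", "maxArticlesPerSearch": 10},
)
for article in resp.json()["result"]:
    print(article["posted"], article["headline"], article["href"])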

sentiment_analysis_frontend

This folder contains all the files for the front end, which uses the API endpoint above as well as the Huggingface inference API for the model. The front end is written in JavaScript/React.
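
A rough sketch of what the front end does against the Huggingface inference API (shown in Python rather than JavaScript; the model repo id is a placeholder):

import requests

# Placeholder repo id for the project's fine-tuned model on Huggingface.
API_URL = "https://api-inference.huggingface.co/models/<your-hf-username>/<your-model>"
headers = {"Authorization": "Bearer <your huggingface API key>"}

resp = requests.post(API_URL, headers=headers,
                     json={"inputs": "Apple shares surge after record earnings"})
print(resp.json())  # list of {label, score} predictions per input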

deploy_weekly_training.sh

This is a shell script used to retrain the model weekly after new features have been collected. Using Modal lets us automate triggering the training pipeline, ensuring that the model is regularly updated with fresh data. (Beware: if you try to deploy this straight to your Azure environment you will probably hit CORS headaches and won't understand why.)
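
The exact Modal setup is in the script itself; a minimal sketch of a weekly-scheduled Modal function (the app name and schedule are assumptions, and older Modal clients use modal.Stub instead of modal.App) might look like:

import modal

app = modal.App("weekly-sentiment-training")  # app name is an assumption

@app.function(schedule=modal.Period(days=7))
def weekly_retrain():
    # In the real pipeline this step runs the fine-tuning in training_pipeline.py.
    print("Fetch features from Hopsworks, fine-tune, and deploy if the new model is better.")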

feature_pipeline_weekly.py

This pipeline script collects new features weekly using the yahoo_finance_news_scraper.py module. The scraper collects 50 headlines from the past 7 days for each of the search keys AAPL, AMZN, GOOGL, MSFT, and TSLA. We then get their sentiment using the distilRoberta-financial-sentiment model shared on HuggingFace. To upload the features to Hopsworks we embed the text of each feature using OpenAI's text embedding model text-embedding-ada-002. Once the new features are collected they are split into training and test sets and uploaded to the respective training and test feature groups on Hopsworks. This ensures that the model has a steady stream of new data to learn from.
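
A condensed sketch of the labeling, embedding, and upload steps (the sentiment-model repo id, feature group name, and column names are assumptions; the real logic is in feature_pipeline_weekly.py):

import hopsworks
import pandas as pd
from openai import OpenAI
from transformers import pipeline

headlines = ["Apple beats earnings expectations", "Tesla recalls vehicles over software bug"]

# Repo id below is an assumption for the distilRoberta financial-sentiment model.
classifier = pipeline("text-classification",
                      model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
labels = [classifier(h)[0]["label"] for h in headlines]

# Embed each headline with OpenAI (requires OPENAI_API_KEY in the environment).
client = OpenAI()
embeddings = [e.embedding for e in
              client.embeddings.create(model="text-embedding-ada-002", input=headlines).data]

df = pd.DataFrame({"text": headlines, "label": labels, "embedding": embeddings})

# Feature group name and version are assumptions.
project = hopsworks.login()
fs = project.get_feature_store()
fs.get_feature_group("news_sentiment_train", version=1).insert(df)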

feature_pipeline.ipynb

Py notebook used to upload the initial features from the base dataset. The code is very similar to the weekly feature pipeline script but does not collect features using the scraping script. It is also here that we created the train and test feature groups on Hopsworks. Run this once to create and initialize the feature groups in Hopsworks.
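
Creating the feature groups on Hopsworks roughly follows the pattern below (names, version, and primary key are assumptions; see the notebook for the actual definitions):

import hopsworks
import pandas as pd

project = hopsworks.login()
fs = project.get_feature_store()

# Toy frame standing in for the preprocessed base dataset; real columns may differ.
base_df = pd.DataFrame({"text": ["Operating profit fell sharply."], "label": [0]})

for name in ("news_sentiment_train", "news_sentiment_test"):
    fg = fs.get_or_create_feature_group(
        name=name,
        version=1,
        primary_key=["text"],
        description="Financial news headlines with sentiment labels",
    )
    fg.insert(base_df)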

hyperparameter_search.ipynb

Py notebook to find the optimal combination of hyperparameters for training, using Optuna as the search backend. The hyperparameters with the best loss are:

training_args = TrainingArguments(
    output_dir="bert_sentiment_trainer", 
    evaluation_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=4,
    num_train_epochs=8,
    learning_rate=2.754984679344267e-05,
    save_total_limit=3,
    seed=42,
    lr_scheduler_type='constant_with_warmup',
    warmup_steps=50,
    max_steps=3000,
    save_strategy="steps",
    save_steps=250,
    fp16=False,
    eval_steps=250,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
)
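
The search itself roughly follows the Trainer's built-in hyperparameter_search with the Optuna backend; the search space below is only an illustration, and train_ds/eval_ds stand for the tokenized datasets prepared earlier in the notebook:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # Fresh bert-base-cased classifier per trial (3 labels: negative/positive/neutral).
    return AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

def optuna_hp_space(trial):
    # Illustrative ranges; the notebook's actual search space may differ.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hp_search", evaluation_strategy="steps"),
    train_dataset=train_ds,   # tokenized training split from earlier cells
    eval_dataset=eval_ds,     # tokenized evaluation split from earlier cells
)

best = trainer.hyperparameter_search(direction="minimize", backend="optuna",
                                     hp_space=optuna_hp_space, n_trials=20)
print(best.hyperparameters)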

preprocessing_pipeline.ipynb

Py notebook used to preprocess the base data and populate CSV files (for later use in feature_pipeline.ipynb), as well as to test the text-embedding tokenizer and verify it works as intended.

requirements.txt

The requirements.txt file contains the modules needed to run the Python scripts and notebooks.

training_pipeline.py

This script is run by the shell script deploy_weekly_training.sh. It collects the training and test features from Hopsworks and fine-tunes the bert-base-cased model from Huggingface using the optimal hyperparameters found in the hyperparameter notebook above. If the newly fine-tuned model (trained on the incremented data) performs better than the previous version, the new model is uploaded to Huggingface. This script is the main driver of the model training process.
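
A minimal sketch of the "deploy only if better" step at the end of the script, assuming both models are evaluated on the same test split, compute_metrics reports accuracy, and the repo id is a placeholder:

# new_trainer / baseline_trainer are Trainer instances for the new and previous model,
# evaluated on the same test set with a compute_metrics that reports accuracy.
new_metrics = new_trainer.evaluate()
baseline_metrics = baseline_trainer.evaluate()

if new_metrics["eval_accuracy"] > baseline_metrics["eval_accuracy"]:
    # Push the improved model (and its tokenizer) to Huggingface; repo id is a placeholder.
    new_trainer.model.push_to_hub("<your-hf-username>/<your-model>")
    tokenizer.push_to_hub("<your-hf-username>/<your-model>")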

training_pipeline_notebook.ipynb

This notebook is similar to the training_pipeline.py script but is in a Python notebook format. In this notebook, we fine-tuned and uploaded the first version of our model to Huggingface.

yahoo_finance_news_scraper.py

This module contains the functions necessary to do the scraping on Yahoo News to find the headlines that are related to a specific search term. The script ran into errors when using Python versions higher than 3.10, so we recommend using version 3.9 or 3.8.

How to run the pipelines

Run the backend locally

You will need to install Node (we used version 18). To run the backend locally you need to uncomment one line and comment out another in sentiment_analysis_backend\app.js: uncomment line 15 and comment out line 14. This is needed so that requests from the front end are allowed. Then change directory to the sentiment_analysis_backend folder and run the following commands:

npm install
npm run dev

Run the frontend

Requires Node v18. You also have to create a file in the sentiment_analysis_frontend folder called credentials.json that contains the following:

{
    "huggingface": "<your huggingface API key>"
}

To run the frontend locally you need to uncomment one line and comment out another in sentiment_analysis_frontend\src\pages\Index.jsx: uncomment line 130 and comment out line 129. This is needed so the front end talks to the API running locally on your computer. Then change directory to the sentiment_analysis_frontend folder and run the following commands:

npm install
npm run dev

Pipelines

All of the pipeline scripts and notebooks can be run directly. Make sure you have the necessary Python modules installed by running:

pip install -r requirements.txt
