Prototype for news monitoring service
Explore the docs »

View Demo · Report Bug · Request Feature
The main task of news monitoring is to process an incoming stream of news and identify events that are of interest to users. In the banking sector, this can be useful for predicting defaults of major borrowers, such as large companies. Here, the task is to build a model that detects news about an event corresponding to a delay in putting a facility into operation. The selected texts can then be passed to simpler models that search for mentions of the bank's borrowers.
For this project, a pre-selected set of training and testing data was used. More details about the data analysis can be found below.
The training dataset consists of 1.6k samples, 19 percent of which are target texts. The remaining texts were selected so that it is not immediately obvious whether or not they are target texts.
The testing dataset is a set of 10k samples collected from various news sources over the course of one week.
To obtain sentence embeddings, I used the cointegrated/rubert-tiny2 model, which was trained to produce high-quality sentence embeddings. I then reduced the dimensionality of the embeddings with UMAP and clustered them with HDBSCAN. To tune the hyperparameters and score the resulting clusters, I used Bayesian optimization with Hyperopt; a sketch of this pipeline is shown below.
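A minimal sketch of the pipeline, assuming the `sentence-transformers`, `umap-learn`, `hdbscan` and `hyperopt` packages are installed. The input file, column name, search ranges and the use of HDBSCAN's relative validity as the clustering score are illustrative assumptions, not the exact settings of this project.

```python
import hdbscan
import pandas as pd
import umap
from hyperopt import Trials, fmin, hp, tpe
from sentence_transformers import SentenceTransformer

# Placeholder input: a CSV with a "text" column (path and column name are assumptions).
texts = pd.read_csv("data/train.csv")["text"].tolist()

# 1. Sentence embeddings from cointegrated/rubert-tiny2.
encoder = SentenceTransformer("cointegrated/rubert-tiny2")
embeddings = encoder.encode(texts, show_progress_bar=True)


def objective(params):
    """UMAP -> HDBSCAN with the given hyperparameters; return a score for Hyperopt to minimize."""
    reduced = umap.UMAP(
        n_neighbors=int(params["n_neighbors"]),
        n_components=int(params["n_components"]),
        metric="cosine",
        random_state=42,
    ).fit_transform(embeddings)
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=int(params["min_cluster_size"]),
        metric="euclidean",
        gen_min_span_tree=True,  # required for relative_validity_
    ).fit(reduced)
    # Higher relative validity means better clusters, so negate it for minimization.
    return -clusterer.relative_validity_


search_space = {
    "n_neighbors": hp.quniform("n_neighbors", 5, 50, 1),
    "n_components": hp.quniform("n_components", 2, 15, 1),
    "min_cluster_size": hp.quniform("min_cluster_size", 5, 100, 1),
}

best = fmin(objective, search_space, algo=tpe.suggest, max_evals=50, trials=Trials())
print("Best hyperparameters:", best)
```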
Texts longer than 512 tokens are summarized so that they fit into the classifier.
I decided to use an abstractive model, even though extractive models are faster in this case. The choice is justified by the fact that the texts in the test set are quite long and may cover several different topics, so an extractive model might not pull out exactly what we need from such a text.
I chose the mbart_ru_sum_gazeta model because it was trained to summarize news in Russian and is therefore adapted to the domain of our data. In addition, the model author's paper on the training dataset (arXiv:2006.11063) shows that the distribution of the number of tokens per sentence in the test set and in the model's output is suitable for our task.
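A minimal sketch of this step, assuming the summarizer is loaded from the Hugging Face Hub as `IlyaGusev/mbart_ru_sum_gazeta` and that the summarizer's tokenizer is used as a proxy for the classifier's 512-token limit; the generation parameters are illustrative.

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

MODEL_NAME = "IlyaGusev/mbart_ru_sum_gazeta"  # assumed Hub path for mbart_ru_sum_gazeta
tokenizer = MBartTokenizer.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)


def shorten_if_needed(text: str, max_tokens: int = 512) -> str:
    """Return the text unchanged if it fits into the classifier, otherwise its abstractive summary."""
    # The summarizer's tokenizer is used here as an approximation of the classifier's tokenizer.
    if len(tokenizer.tokenize(text)) <= max_tokens:
        return text
    inputs = tokenizer(text, max_length=600, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], no_repeat_ngram_size=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```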
| Data | Accuracy | F1-score |
|---|---|---|
| val_dataset | 0.95 | 0.87 |
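For reference, metrics like these can be computed with scikit-learn; the labels below are a toy placeholder, not the project's actual validation data.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy placeholder labels; in the project these come from val_dataset and the classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
```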
The project has the following structure:
- `news_monitoring/eda` : clustering scripts
- `news_monitoring/models` : `.py` scripts with the summarization and classification models
- `news_monitoring/preprocessing` : `.py` scripts with text preprocessing
- `news_monitoring/preprocessing/news_monitoring.ipynb` : inference notebook
- Topic modeling
- News summarization
- News classifier
- News deduplication
- App for news scraping
Telegram: @my_name_is_nikita_hey
Mail: [email protected]
Distributed under the MIT License. See LICENSE.txt for more information.