- main.py: Entry point. Creates the document_processing objects that process the patents and extract the relevant features, then loads the pre-trained Random Forest model to identify the energy-related patents, and finally persists those energy patents (see the sketch after this list).
- document_processing.py: Class that processes the documents. It reads the files from blob storage, decompresses them, and extracts the text content. It then builds a Spark DataFrame from the text, extracts the relevant patent columns, and finally saves the processed patents to a temporary directory.
- train_data.ipynb: Notebook that processes the raw training data and creates the training dataset.
- train_model.py: Python file that reads the training dataset and trains the Random Forest classifier.
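As a rough illustration of the main.py flow, the sketch below strings the pieces together. The class name `DocumentProcessing`, the method names, the file paths, and the column names are assumptions made for this example, not the actual code; the saved model is assumed to be a scikit-learn pipeline that accepts raw text.

```python
# Illustrative sketch of the main.py flow (names and paths are assumptions):
# process the patents, load the pre-trained Random Forest, classify, persist.
import pickle

# In Databricks the class is made available with `%run ./document_processing`;
# a plain import is shown here only to keep the sketch self-contained.
from document_processing import DocumentProcessing  # hypothetical class name

configs = {"blob_key": "<storage-connection-string>"}  # see the deployment steps below

processor = DocumentProcessing(configs)      # hypothetical constructor
patents_df = processor.process()             # Spark DataFrame with the extracted columns

with open("/dbfs/tmp/rf_model.pkl", "rb") as f:   # hypothetical model path
    model = pickle.load(f)                        # assumed: TF-IDF + Random Forest pipeline

patents_pd = patents_df.toPandas()
patents_pd["is_energy"] = model.predict(patents_pd["text"])    # hypothetical text column, 0/1 label
energy_patents = patents_pd[patents_pd["is_energy"] == 1]
energy_patents.to_parquet("/dbfs/tmp/energy_patents.parquet")  # hypothetical output path
```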
To deploy the code, follow these steps:
- Replace the value of the 'blob_key' key with the storage connection string. This key is defined inside the "configs" dict, on line 13 of main.py (see the snippet after this list).
- Replace the value of the "conn_string" variable in train_data.ipynb with the connection string.
- Upload all the project files to Databricks
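For reference, the "configs" dict in main.py is expected to look roughly like the snippet below; the exact shape is an assumption, and only 'blob_key' needs to be changed.

```python
# Around line 13 of main.py (illustrative shape; only 'blob_key' must be replaced):
configs = {
    "blob_key": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net",
    # ... any other existing keys stay unchanged
}
```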
After the code deployment, it is necessary to execute the following commands, each in a separate cell, in these files:

document_processing.py:
- %pip install azure-storage-blob
- %pip install nltk

main.py:
- %pip install azure-storage-blob
- %run ./document_processing

These commands are already included as comments in the files.
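If the files use the standard Databricks notebook source format, the commented commands look roughly like the fragment below; this layout is an assumption about how main.py is organized, shown only to indicate where the %pip and %run cells go.

```python
# Databricks notebook source
# MAGIC %pip install azure-storage-blob

# COMMAND ----------

# MAGIC %run ./document_processing

# COMMAND ----------

# ... rest of main.py: create the processing objects, load the model, classify ...
```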
The model dependencies must be available in Databricks. If they are not, perform the following steps (a minimal training sketch follows this list):
- Run the train_data.ipynb notebook
- Execute the following command in a cell of train_model.py: %pip install scikit-learn (the sklearn alias on PyPI is deprecated)
- Run the file train_model.py
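A minimal version of this training step could look like the sketch below. The dataset path, column names, and hyperparameters are assumptions for illustration; the actual train_model.py may differ.

```python
# Illustrative sketch of a train_model.py-style training step
# (paths, column names, and hyperparameters are assumptions).
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Training dataset produced by train_data.ipynb (assumed path and columns).
train_df = pd.read_parquet("/dbfs/tmp/train_dataset.parquet")

# Vectorize the patent text and train the Random Forest in one pipeline,
# so the persisted model can later be applied directly to raw text in main.py.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(train_df["text"], train_df["is_energy"])

# Persist the trained model where main.py can load it (assumed path).
with open("/dbfs/tmp/rf_model.pkl", "wb") as f:
    pickle.dump(model, f)
```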
Once the model has been trained, the main.py file can be executed.