For developing and automating machine learning models with MLOps practices, Linux environment is important. We have a variety of options.
- AWS EC2
- Compute Engine ( Google )
- Github CodeSpace
Docker is preinstalled on Github CodeSpace. So, we only need to install Anaconda. In the root directory,
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
bash Anaconda3-2023.09-0-Linux-x86_64.sh
conda install jupyter
Check conda path in the environment, we can use this command in the terminal.
which conda
To define path, first open ~/.bashrc
file with nano
. Inside ~/.bashrc
,
export PATH="<YourAnacondaPath>/bin:$PATH"
Save it and run the followig in the project repo to make sure of changes.
source ~/.bashrc
We can test if we are using the correct interpreter with the following command
which python
We trained a machine learning model on the New York Taxi Dataset.
In this module, we only practiced maturity 0 model without MLOps asepcts.
We used pick up location ID and drop off location ID and trip distance as features. Our target variable is the trip duration which we got by
substracting pick up datetime from drop off datetime. We get total seconds with total_seconds()
method to the timedelta object and we divide it by 60 to get total minutes.
We can experiment mean, median, mode and standard deviations of trip durations. We can also experiment with the following code. The purpose of the experiments is to detect outliers in the target variable.
df.duration.describe(percentiles=[list])
((df.duration>=1) & (df.duration<=60)).mean()
Most of the trips are between 1 minutes and 60 minutes so we removed outliers by only filtering only those trips.
Regression models such as linear regression, lasso, ridge are trained and validated with rmse
scores.