This project lets you learn how to implement a linear regression to predict the price of a car based on its mileage.
- Run `python train.py` to train the model. With `-f <path>` you can change the data file, with `-l <number>` you can modify the learning rate, with `-e <number>` you can change the number of epochs, and with `-p` you can display the learning steps as a graph.
- Run `python estimate.py` to obtain the price of your car according to its mileage.
- Run `python evaluate.py` to show the performance evaluation of the trained model.
```mermaid
stateDiagram
    state Learn {
        extract
        normalise
        train
        [*] --> extract: csv file
        extract --> normalise: dataframe
        normalise --> train: dataframe
    }
    train --> Evaluate: β<sub>0</sub> and β<sub>1</sub>
    train --> predict: β<sub>0</sub> and β<sub>1</sub>
    state Predict {
        [*] --> predict: value
        predict
        predict --> [*]: predicted value
    }
    state Evaluate {
        R2
        AR2: Adjusted R2
        MAE
        MSE
        RMSE
        MAPE
    }
```
This diagram presents the steps required to predict a value from the data.
```mermaid
stateDiagram
    direction LR
    state extract {
        direction TB
        E_step1: Convert csv to pd.DataFrame
        E_step2: Cast data to float
        E_step3: Find required columns
        E_step1 --> E_step2
        E_step2 --> E_step3
    }
    [*] --> extract
    extract --> [*]
```
```mermaid
stateDiagram
    direction LR
    state normalise {
        direction TB
        N_step1: Apply Z-score normalisation
    }
    [*] --> normalise
    normalise --> [*]
```
Z-score normalisation, also known as standardisation, is a method used to scale the values in a dataset so that they have a mean of 0 and a standard deviation of 1. For X, a list of values, we use:

$$X_{\text{norm}} = \frac{X - \mu_X}{\sigma_X}$$

where:

- $\mu^{}_{X}$ is the mean of X.
- $\sigma^{}_{X}$ is the standard deviation of X.
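As a minimal sketch (the function name is illustrative, not from the project), z-score normalisation can be written in plain Python:

```python
def z_score_normalise(values):
    """Scale values so they have mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    # Population standard deviation
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values], mean, std

# Example: mileages centred on 50000
normalised, mu, sigma = z_score_normalise([10000.0, 50000.0, 90000.0])
```

Keeping `mean` and `std` around matters: they are needed later to denormalise the thetas.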
```mermaid
stateDiagram
    direction LR
    state train {
        direction TB
        T_step1: Apply Gradient descent
        T_step2: Denormalise thetas
        T_step1 --> T_step2
    }
    [*] --> train
    train --> [*]
```
Gradient descent is an optimisation method commonly used to adjust the coefficients of a linear regression model in order to minimise a cost function. We seek to minimise the cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( a(x^{(i)}) - y^{(i)} \right)^2$$

with:

- $a$ the prediction function.
- $y^{(i)}$ the observed value.
- $m$ the number of observations.

To find the best $\theta_{0}$ and $\theta_{1}$, we repeatedly apply the updates

$$\theta_0 := \theta_0 - lr \cdot \frac{\partial J}{\partial \theta_0}$$

and

$$\theta_1 := \theta_1 - lr \cdot \frac{\partial J}{\partial \theta_1}$$

where $lr$ is the learning rate and the two partial derivatives are given below. In this case $a(x) = \theta_0 + \theta_1 \cdot x$.
```python
def compute_partial_derivative_0(data, theta0, theta1):
    # data: iterable of (x, y) pairs
    return sum((theta0 + theta1 * x) - y for x, y in data) / len(data)

def compute_partial_derivative_1(data, theta0, theta1):
    return sum(((theta0 + theta1 * x) - y) * x for x, y in data) / len(data)

def gradientDescent(data, learning_rate, epoch):
    theta0, theta1 = 0, 0  # init thetas
    for i in range(epoch):  # loop {epoch} times
        _d0 = compute_partial_derivative_0(data, theta0, theta1)
        _d1 = compute_partial_derivative_1(data, theta0, theta1)
        theta0 -= learning_rate * _d0
        theta1 -= learning_rate * _d1
    return theta0, theta1  # return thetas
```
partial_derivative_0:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( estimatePrice(x^{(i)}) - y^{(i)} \right)$$

partial_derivative_1:

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( estimatePrice(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$

Where:

- $lr$ is the learning rate.
- $m$ is the total number of observations.
- $estimatePrice()$ is the function $y^{}_{estimated} = \theta^{}_{0} + \theta^{}_{1} \cdot x$.
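As a self-contained sketch (illustrative names, synthetic data), the descent can be exercised end-to-end; on already-scaled points lying on the line $y = 1 + 2x$, it converges to $\theta_0 \approx 1$ and $\theta_1 \approx 2$:

```python
def estimate_price(x, theta0, theta1):
    # y_estimated = θ0 + θ1 * x
    return theta0 + theta1 * x

def gradient_descent(data, learning_rate, epochs):
    theta0, theta1 = 0.0, 0.0
    m = len(data)
    for _ in range(epochs):
        # Partial derivatives of the cost with respect to θ0 and θ1
        d0 = sum(estimate_price(x, theta0, theta1) - y for x, y in data) / m
        d1 = sum((estimate_price(x, theta0, theta1) - y) * x for x, y in data) / m
        theta0 -= learning_rate * d0
        theta1 -= learning_rate * d1
    return theta0, theta1

# Synthetic, already-normalised points on the line y = 1 + 2x
data = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)]
theta0, theta1 = gradient_descent(data, 0.1, 1000)
```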
As thetas are calculated using standardised data, we have to denormalise them to make them match the original data.
- Normalised data:

$$x_{\text{norm}} = \frac{x - \mu_x}{\sigma_x} \qquad y_{\text{norm}} = \frac{y - \mu_y}{\sigma_y}$$

- Linear regression on normalised data:

$$y_{\text{norm}} = \theta_0 + \theta_1 \cdot x_{\text{norm}}$$
To find $y$ as a function of $x$, substitute $y_{\text{norm}} = \frac{y - \mu_y}{\sigma_y}$ into the regression equation:

$$\frac{y - \mu_y}{\sigma_y} = \theta_0 + \theta_1 \cdot x_{\text{norm}}$$

For $x$, substitute $x_{\text{norm}} = \frac{x - \mu_x}{\sigma_x}$:

$$\frac{y - \mu_y}{\sigma_y} = \theta_0 + \theta_1 \cdot \frac{x - \mu_x}{\sigma_x}$$

Let's develop the equation:

$$y - \mu_y = \sigma_y \theta_0 + \frac{\sigma_y \theta_1}{\sigma_x} (x - \mu_x)$$

Let's rearrange the equation to isolate the constant terms and those as a function of $x$:

$$y = \sigma_y \theta_0 + \mu_y - \frac{\sigma_y \theta_1}{\sigma_x} \mu_x + \frac{\sigma_y \theta_1}{\sigma_x} x$$

Let's group the constant terms together:

$$y = \left( \sigma_y \theta_0 + \mu_y - \frac{\sigma_y \theta_1}{\sigma_x} \mu_x \right) + \frac{\sigma_y \theta_1}{\sigma_x} x$$

By comparing this equation with the standard form $y = \theta_0^{\text{denorm}} + \theta_1^{\text{denorm}} \cdot x$, this gives us the denormalisation equations for the coefficients of the linear regression:

$$\theta_1^{\text{denorm}} = \frac{\sigma_y}{\sigma_x} \theta_1 \qquad \theta_0^{\text{denorm}} = \sigma_y \theta_0 + \mu_y - \theta_1^{\text{denorm}} \mu_x$$
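These equations can be checked numerically: predicting through the normalised model and through the denormalised coefficients must give the same price for any mileage. A sketch with made-up statistics (`denormalise_thetas` is an illustrative name):

```python
def denormalise_thetas(theta0, theta1, mu_x, sigma_x, mu_y, sigma_y):
    """Map thetas fitted on z-scored data back to the original scale."""
    theta1_denorm = theta1 * sigma_y / sigma_x
    theta0_denorm = sigma_y * theta0 + mu_y - theta1_denorm * mu_x
    return theta0_denorm, theta1_denorm

# Made-up dataset statistics and normalised-space thetas
mu_x, sigma_x, mu_y, sigma_y = 50000.0, 20000.0, 6000.0, 1500.0
theta0, theta1 = 0.5, -0.8
theta0_d, theta1_d = denormalise_thetas(theta0, theta1, mu_x, sigma_x, mu_y, sigma_y)

# Same prediction via both paths for an arbitrary mileage
x = 72000.0
y_via_norm = (theta0 + theta1 * (x - mu_x) / sigma_x) * sigma_y + mu_y
y_via_denorm = theta0_d + theta1_d * x
```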
```mermaid
stateDiagram
    direction LR
    state predict {
        direction TB
        P_step1: Apply the linear equation
    }
    [*] --> predict
    predict --> [*]
```
Linear equation:

$$y_{\text{predicted}} = \theta_0 + \theta_1 \cdot x$$
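For illustration, the prediction step is a one-liner (the coefficients here are hypothetical, not the project's trained values):

```python
def estimate_price(mileage, theta0, theta1):
    # y_predicted = θ0 + θ1 * x
    return theta0 + theta1 * mileage

# Hypothetical coefficients: base price 8000, minus 0.02 per km
price = estimate_price(100000, 8000.0, -0.02)
```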
```mermaid
stateDiagram
    direction LR
    state Evaluate {
        direction LR
        R2
        AR2: Adjusted R2
        MAE
        MSE
        RMSE
        MAPE
    }
    [*] --> R2
    R2 --> [*]
    [*] --> AR2
    AR2 --> [*]
    [*] --> MAE
    MAE --> [*]
    [*] --> MSE
    MSE --> [*]
    [*] --> RMSE
    RMSE --> [*]
    [*] --> MAPE
    MAPE --> [*]
```
Performance measures are essential for assessing the quality of a linear regression model.
- Definition : R² measures the proportion of the total variance in the data that is explained by the regression model.
- Interpretation : An R² value close to 1 indicates that the model explains the variability of the data well. For example, an R² of 0.8 means that 80% of the variance in the data is explained by the model.
- Formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the predicted value and $\bar{y}$ is the mean of the observed values.
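A minimal implementation of this formula (the function name is illustrative):

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives 1, and always predicting the mean gives 0.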
- Definition : MAE measures the average of the absolute errors between the observed and predicted values.
- Interpretation : A lower MAE indicates a more accurate model. It gives an idea of the average error that can be expected from the model's predictions.
- Formula:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
- Definition : MSE measures the mean square error between observed and predicted values.
- Interpretation : As the errors are squared, larger errors are penalised more severely. A lower MSE indicates a better model.
- Formula:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- Definition : RMSE is the square root of the mean square error.
- Interpretation : RMSE gives an idea of the magnitude of the typical error. Like MSE, it penalises large errors more severely. A lower RMSE indicates a better model.
- Formula:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
- Definition : MAPE measures the average absolute error as a percentage of the observed values.
- Interpretation : MAPE is useful for understanding relative error as a percentage, which can be more intuitive than absolute errors.
- Formula:

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100$$
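The four error metrics above can be computed in one pass; a sketch (`regression_errors` is an illustrative name):

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE, MAPE) for paired observed/predicted values."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mape = sum(abs(e / y) for e, y in zip(errors, y_true)) / n * 100
    return mae, mse, rmse, mape

mae, mse, rmse, mape = regression_errors([100.0, 200.0], [110.0, 190.0])
```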
- Definition : The R² adjustment takes into account the number of predictors in the model and penalises models that are too complex.
- Interpretation : It is particularly useful when comparing models with a different number of independent variables. A higher value indicates a better fitted model.
- Formula:

$$R^2_{adjusted} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)$$

where $n$ is the number of observations and $p$ is the number of predictors.
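Transcribing the formula directly (illustrative name; for this project $p = 1$, since mileage is the only predictor):

```python
def adjusted_r_squared(r2, n, p):
    """Penalise R² for the number of predictors p, given n observations."""
    return 1.0 - ((1.0 - r2) * (n - 1)) / (n - p - 1)

# With R² = 0.8, 11 observations and 1 predictor
adj = adjusted_r_squared(0.8, 11, 1)
```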