Logistic regression on COVID-19 data using BFGS algorithm
This repository contains a Python project I developed for the Optimization exam of the AI Master's course I am attending. The aim of the project is to provide an implementation of the BFGS algorithm and to apply it to a specific case study: a logistic regression on the official Italian COVID-19 data (provided by Protezione Civile and available on GitHub at this link). I worked on the data referring to the whole country, contained in the folder `dati-andamento-nazionale`: I re-uploaded that folder to this repository for ease of use.
I modelled the early evolution of COVID-19 as a logistic function with 6 parameters:
where σ(x,θ) represents the number of positive cases on the x-th day after the 24th of February.
I exploited the BFGS algorithm to find the optimal parameters θ*, i.e. those that minimize the objective (loss) function:
where y are the data points.
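As an illustration, here is a minimal sketch of how such an objective could be written in Python, assuming a least-squares loss between the model σ(x,θ) and the observations y (the exact form of the loss used in the repository may differ):

```python
import numpy as np

def make_loss(sigma, x, y):
    """Build a least-squares objective for a parametric model sigma(x, theta).

    sigma : callable implementing the 6-parameter logistic model
    x, y  : arrays of day indices and (rescaled) positive-case counts
    """
    def loss(theta):
        residuals = sigma(x, theta) - y
        return np.mean(residuals ** 2)   # mean squared error (assumed form)
    return loss
```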
To implement the algorithm, I took inspiration from an article on aria42's blog; however, the code included in this repository is entirely written by me.
NOTE: To compute the optimal parameters I mapped the original data to the range [0,1] for numerical reasons: the number of positive cases quickly reaches the order of magnitude of 10⁴ and goes up to 10⁵, which makes the computations significantly more demanding, mainly because of the exponential.
Thus, the data points actually used in the computations were obtained through:

As a consequence, the values of σ(x,θ) and loss(x,θ) reported in the following refer to the rescaled data.
On the other hand, after fitting the data and obtaining the optimal parameters, I mapped the resulting predictions back through the inverse transformation, so that the model's predictions could be compared with the real data.
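For concreteness, here is a minimal sketch of such a rescaling and its inverse, assuming a simple min-max mapping to [0,1] (the actual transformation used in the repository may differ):

```python
import numpy as np

def rescale(y):
    """Map the raw positive-case counts to [0, 1] (min-max scaling, assumed)."""
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min()), y.min(), y.max()

def rescale_back(y_scaled, y_min, y_max):
    """Inverse transformation: map rescaled predictions back to case counts."""
    return y_scaled * (y_max - y_min) + y_min
```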
The algorithm is implemented as `BFGS_algorithm(obj_fun, theta0, max_iter=2e04, epsilon=0)`, where `obj_fun` is the objective function to be minimized (in our case `loss(x,θ)`), `theta0` is the initial guess for the parameters (a `numpy` array), `max_iter` is the maximum number of iterations performed, and `epsilon` is the threshold below which the loss function is considered to have converged.
The function takes advantage of the `scipy.optimize` library, using the `BFGS` class and the `line_search` function, as discussed below.
Firstly, a `BFGS` object is created by invoking its constructor: this object efficiently stores the approximate (inverse) Hessian matrix of the function and performs computations with it (such as the dot product with a vector), thus keeping track of the updates.
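As a quick illustration of this API (a sketch, not the repository's code), `scipy.optimize.BFGS` exposes `initialize`, `dot` and `update`:

```python
import numpy as np
from scipy.optimize import BFGS

H = BFGS()
H.initialize(n=6, approx_type='inv_hess')   # track the inverse Hessian of a 6-parameter problem

g = np.ones(6)               # some gradient vector
d = H.dot(g)                 # product of the current inverse-Hessian approximation with g

delta_theta = np.full(6, 0.1)      # θₙ₊₁ - θₙ (illustrative values)
delta_grad  = np.full(6, 0.05)     # gₙ₊₁ - gₙ (illustrative values)
H.update(delta_theta, delta_grad)  # BFGS update of the stored approximation
```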
Then, the main iterative loop does the work (a sketch of the full loop follows this list):
- at each iteration `n`, the current guess θₙ is stored along with the gradient gₙ computed at that point;
- the search direction `d` is computed as the dot product between Hₙ⁻¹ and gₙ, where the former is the current approximation of the inverse Hessian matrix;
- the step size α is computed through `line_search`, which is `scipy.optimize`'s line-search routine (it finds a step size satisfying the strong Wolfe conditions): essentially, it computes a non-negative scalar α that approximately minimizes the expression f(θₙ - αd);
- with these quantities, the new parameter array θₙ₊₁ is computed along with the gradient gₙ₊₁ at its position; the new approximation of the Hessian is obtained via the method `BFGS.update`, which takes as inputs the differences θₙ₊₁ - θₙ and gₙ₊₁ - gₙ;
- the loop is stopped whenever the value of `loss(x,θ)` drops below `epsilon` or when `max_iter` is reached.
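Below is a minimal sketch of the loop just described, assuming a finite-difference gradient via `scipy.optimize.approx_fprime` (the repository's actual implementation may compute gradients and handle edge cases differently):

```python
import numpy as np
from scipy.optimize import BFGS, approx_fprime, line_search

def bfgs_loop_sketch(obj_fun, theta0, max_iter=20000, epsilon=0.0):
    """Illustrative sketch of the loop described above (not the repository's code)."""
    grad = lambda th: approx_fprime(th, obj_fun, 1e-8)   # finite-difference gradient (assumption)

    H = BFGS()                                  # inverse-Hessian approximation
    H.initialize(len(theta0), 'inv_hess')

    theta = np.asarray(theta0, dtype=float)
    theta_history = [theta.copy()]
    cost_history = [obj_fun(theta)]

    niter = 0
    for niter in range(1, int(max_iter) + 1):
        g = grad(theta)
        d = H.dot(g)                            # search direction (the step is taken along -d)
        alpha = line_search(obj_fun, grad, theta, -d)[0]   # step size from the Wolfe line search
        if alpha is None:                       # line search failed: fall back to a tiny step
            alpha = 1e-4
        theta_new = theta - alpha * d
        g_new = grad(theta_new)
        H.update(theta_new - theta, g_new - g)  # BFGS update from the two difference vectors
        theta, g = theta_new, g_new
        theta_history.append(theta.copy())
        cost_history.append(obj_fun(theta))
        if cost_history[-1] < epsilon:          # convergence check on the loss value
            return theta_history, cost_history, niter, True
    return theta_history, cost_history, niter, False
```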
The function returns a tuple `theta_history, cost_history, niter, success` storing, respectively: the history of the parameters, the history of the loss function values, the number of iterations performed, and information about convergence (i.e. whether the loss at θ* is below `epsilon` or not).
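A hypothetical call, assuming the signature above and a `loss` function that closes over the (rescaled) data (e.g. the `make_loss` sketch earlier):

```python
import numpy as np

theta0 = np.zeros(6)   # illustrative initial guess for the 6 parameters
theta_history, cost_history, niter, success = BFGS_algorithm(
    loss, theta0, max_iter=20000, epsilon=0
)
theta_star = theta_history[-1]   # optimal parameters found by the algorithm
```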
These are the results I obtained running the algorithm first with `max_iter = 20000` and then with `max_iter = 30000`, with `epsilon = 0` in both cases (i.e. I ended up performing no convergence check: empirically, I observed the loss never dropping below ~1.747·10⁻⁵).
The Logistic fit graphs show the data points as orange dots and the logistic regression as a blue dashed line: at a glance, the curve fits the data quite well.
As the Cost history graphs immediately show, there is no meaningful improvement in the value of the loss function after ~15000 iterations.
Results:

```
Number of iterations: 20000
Theta: [ 0.06900987 0.05576653 5.52850234 0.17732944 1.34652566 -0.01234024]
Loss: 104709.8770002242
Loss (rescaled data): 1.7472241874665985e-05
Converged: False, max_iter reached.
```

Results:

```
Number of iterations: 30000
Theta: [ 0.06879371 0.05563397 5.53465311 0.17751231 1.34667285 -0.01218171]
Loss: 104697.39794599885
Loss (rescaled data): 1.747015957775154e-05
Converged: False, max_iter reached.
```