Linear regression is a widely used method in machine learning for tasks that require a continuous output. Given a data matrix X (rows represent samples, columns represent features) and a target vector y, we try to find a line that fits the data reasonably well. Consider the following point distribution:
We aim to find a hyperplane (in the 1d case: a line) that minimizes the sum of squared vertical distances (residuals) between the points and the plane. Such an optimal plane can be found with the ordinary least squares (OLS) method, which is well known in statistics and even admits a closed-form analytical solution. For more information, have a look at the Wikipedia article about Linear Regression. The visual solution to the above point distribution is:
In this repository, we implement the analytical solution of OLS in C++ for a given data matrix. Note that for large-scale data matrices a numerical solution (e.g. gradient descent or Newton's method) might be more reasonable. If your matrix is sparse, there are also more advanced and more suitable methods for the computation.
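To make the closed-form idea concrete, here is a minimal sketch of OLS via the normal equations `(X^T X) beta = X^T y`, solved with Gaussian elimination. It is not taken from main.cpp: the names `Matrix`, `Vector`, `solveLinearSystem` and `fitOLS`, as well as the singularity threshold, are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical type aliases, chosen for this sketch only.
using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Solve the square system A * x = b by Gaussian elimination with partial
// pivoting. A and b are taken by value because they are modified.
Vector solveLinearSystem(Matrix A, Vector b) {
    const std::size_t n = A.size();
    assert(b.size() == n);
    for (std::size_t col = 0; col < n; ++col) {
        // Use the row with the largest absolute entry in this column as pivot.
        std::size_t pivot = col;
        for (std::size_t row = col + 1; row < n; ++row) {
            if (std::fabs(A[row][col]) > std::fabs(A[pivot][col])) pivot = row;
        }
        // Crude absolute threshold (an assumption of this sketch).
        if (std::fabs(A[pivot][col]) < 1e-12) {
            throw std::runtime_error("X^T X is (numerically) singular");
        }
        std::swap(A[col], A[pivot]);
        std::swap(b[col], b[pivot]);
        // Eliminate the entries below the pivot.
        for (std::size_t row = col + 1; row < n; ++row) {
            const double factor = A[row][col] / A[col][col];
            for (std::size_t k = col; k < n; ++k) A[row][k] -= factor * A[col][k];
            b[row] -= factor * b[col];
        }
    }
    // Back substitution on the upper triangular system.
    Vector x(n, 0.0);
    for (std::size_t i = n; i-- > 0;) {
        double sum = b[i];
        for (std::size_t k = i + 1; k < n; ++k) sum -= A[i][k] * x[k];
        x[i] = sum / A[i][i];
    }
    return x;
}

// Closed-form OLS: returns beta with (X^T X) beta = X^T y.
// If fitIntercept is true, a column of ones is prepended to X,
// so beta[0] is the intercept and beta[1..] are the slopes.
Vector fitOLS(const Matrix& X, const Vector& y, bool fitIntercept) {
    const std::size_t n = X.size();
    assert(n > 0 && y.size() == n);
    const std::size_t offset = fitIntercept ? 1 : 0;
    const std::size_t d = X[0].size() + offset;
    Matrix XtX(d, Vector(d, 0.0));
    Vector Xty(d, 0.0);
    Vector row(d, 1.0);  // row[0] stays 1.0 when an intercept is fitted
    for (std::size_t i = 0; i < n; ++i) {
        assert(X[i].size() + offset == d);
        for (std::size_t j = 0; j < X[i].size(); ++j) row[j + offset] = X[i][j];
        // Accumulate X^T X and X^T y one sample at a time.
        for (std::size_t a = 0; a < d; ++a) {
            for (std::size_t c = 0; c < d; ++c) XtX[a][c] += row[a] * row[c];
            Xty[a] += row[a] * y[i];
        }
    }
    return solveLinearSystem(XtX, Xty);
}
```

For example, calling `fitOLS` on the points (0, 1), (1, 3), (2, 5), (3, 7) with `fitIntercept = true` recovers an intercept of 1 and a slope of 2, up to floating-point rounding. Forming `X^T X` explicitly is simple but can be numerically fragile for ill-conditioned data; a QR-based solver would be a more robust alternative.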
- images: Contains PNG files showing the outcome of the algorithm
- generateData.py: Code for generating the data matrix X and target y (here you can customize the distribution of data points)
- X.txt: Data matrix X (here with one feature, i.e. the 1d case)
- y.txt: Target vector y
- main.cpp: C++ code for computing the parameters for the best fitting line
- output.txt: Automatically generated text file that contains the parameters of the best fitting line
- createImages.py: Takes the output.txt file as well as X.txt and y.txt and saves the point distribution as well as the best fitting line into one image (right now this works only for the 1d case and if the intercept is fitted)
max:LinearRegression Max$ g++ main.cpp -o main
max:LinearRegression Max$ python generateData.py
max:LinearRegression Max$ ./main
Number of rows: 100
Number of columns: 1
Fit intercept? 1
max:LinearRegression Max$ python createImages.py
The code was implemented and tested solely on a MacBook Pro (i5, 8 GB RAM).
- C++: iostream, iomanip, fstream, sstream
- Python: NumPy, Matplotlib
The g++ compiler with the C++11 standard is used.
The main code also works for the case of multiple features (multiple linear regression). We just have to adapt the loadData function so that it can load a comma-separated file into the data matrix.
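As a rough illustration of such an adaptation, the sketch below reads comma-separated numeric rows into a matrix. It does not reproduce the existing loadData function, whose signature is not shown here: `parseCsvMatrix` and `loadCsvMatrix` are hypothetical names, and the file format (one sample per line, no header, no quoted fields) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type alias, chosen for this sketch only.
using Matrix = std::vector<std::vector<double>>;

// Parse comma-separated numeric rows from any input stream.
// Assumes one sample per line, no header row and no quoted fields.
Matrix parseCsvMatrix(std::istream& in) {
    Matrix X;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;  // skip blank lines

        std::vector<double> row;
        std::stringstream lineStream(line);
        std::string cell;
        while (std::getline(lineStream, cell, ',')) {
            // std::stod throws std::invalid_argument on non-numeric text.
            row.push_back(std::stod(cell));
        }
        // Every row must have as many columns as the first one.
        if (!X.empty() && row.size() != X[0].size()) {
            throw std::runtime_error("inconsistent number of columns");
        }
        X.push_back(row);
    }
    return X;
}

// Convenience wrapper that opens a file and parses its contents.
Matrix loadCsvMatrix(const std::string& path) {
    std::ifstream file(path);
    if (!file) throw std::runtime_error("cannot open " + path);
    return parseCsvMatrix(file);
}
```

Splitting the parsing from the file handling keeps the parser usable with any stream, for instance a `std::istringstream` in a quick test. With this format, the number of columns follows from the file itself rather than from user input.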
Max Kapsecker, 2018: Note that there is always room for optimization; the code can certainly be improved with regard to time and space efficiency. Feel free to report any mistakes to [email protected]