
In this project, basic machine learning concepts were applied to the Desharnais dataset to build a software effort estimation model. The estimation was performed with a linear regression model and compared against a non-parametric regression algorithm based on K-Nearest Neighbours (KNN).


toniesteves/sw-effort-predictive-analysis


This is a PROMISE Software Engineering Repository data set made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering.

If you publish material based on PROMISE data sets, please follow the acknowledgment guidelines posted on the PROMISE repository web page: http://promise.site.uottawa.ca/SERepository

Abstract

Making decisions under a high level of uncertainty is a critical problem in the area of software engineering. Predicting software quality requires highly accurate tools and high-level experience. Alternatively, AI-based predictive models can be reasonably accurate tools that help predict software effort based on historical data from software development metrics. In this study, we built a software effort estimation model using a linear regression model, and compared it with a non-parametric regression algorithm based on K-Nearest Neighbours (KNN). Our results show the possibility of using AI methods for the software engineering effort prediction problem, with a coefficient of determination of 76%.

import math
from scipy.io import arff
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# Nicer formatting for the notebooks
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (15,5)
df_desharnais = pd.read_csv('../Datasets/02.desharnais.csv',  header=0)
df_desharnais.head()
| | id | Project | TeamExp | ManagerExp | YearEnd | Length | Effort | Transactions | Entities | PointsNonAdjust | Adjustment | PointsAjust | Language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 1 | 4 | 85 | 12 | 5152 | 253 | 52 | 305 | 34 | 302 | 1 |
| 1 | 2 | 2 | 0 | 0 | 86 | 4 | 5635 | 197 | 124 | 321 | 33 | 315 | 1 |
| 2 | 3 | 3 | 4 | 4 | 85 | 1 | 805 | 40 | 60 | 100 | 18 | 83 | 1 |
| 3 | 4 | 4 | 0 | 0 | 86 | 5 | 3829 | 200 | 119 | 319 | 30 | 303 | 1 |
| 4 | 5 | 5 | 0 | 0 | 86 | 4 | 2149 | 140 | 94 | 234 | 24 | 208 | 1 |
df_desharnais.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 13 columns):
id                 81 non-null int64
Project            81 non-null int64
TeamExp            81 non-null int64
ManagerExp         81 non-null int64
YearEnd            81 non-null int64
Length             81 non-null int64
Effort             81 non-null int64
Transactions       81 non-null int64
Entities           81 non-null int64
PointsNonAdjust    81 non-null int64
Adjustment         81 non-null int64
PointsAjust        81 non-null int64
Language           81 non-null int64
dtypes: int64(13)
memory usage: 8.3 KB
df_desharnais.describe()
| | id | Project | TeamExp | ManagerExp | YearEnd | Length | Effort | Transactions | Entities | PointsNonAdjust | Adjustment | PointsAjust | Language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 | 81.000000 |
| mean | 41.000000 | 41.000000 | 2.185185 | 2.530864 | 85.740741 | 11.666667 | 5046.308642 | 182.123457 | 122.333333 | 304.456790 | 27.629630 | 289.234568 | 1.555556 |
| std | 23.526581 | 23.526581 | 1.415195 | 1.643825 | 1.222475 | 7.424621 | 4418.767228 | 144.035098 | 84.882124 | 180.210159 | 10.591795 | 185.761088 | 0.707107 |
| min | 1.000000 | 1.000000 | -1.000000 | -1.000000 | 82.000000 | 1.000000 | 546.000000 | 9.000000 | 7.000000 | 73.000000 | 5.000000 | 62.000000 | 1.000000 |
| 25% | 21.000000 | 21.000000 | 1.000000 | 1.000000 | 85.000000 | 6.000000 | 2352.000000 | 88.000000 | 57.000000 | 176.000000 | 20.000000 | 152.000000 | 1.000000 |
| 50% | 41.000000 | 41.000000 | 2.000000 | 3.000000 | 86.000000 | 10.000000 | 3647.000000 | 140.000000 | 99.000000 | 266.000000 | 28.000000 | 255.000000 | 1.000000 |
| 75% | 61.000000 | 61.000000 | 4.000000 | 4.000000 | 87.000000 | 14.000000 | 5922.000000 | 224.000000 | 169.000000 | 384.000000 | 35.000000 | 351.000000 | 2.000000 |
| max | 81.000000 | 81.000000 | 4.000000 | 7.000000 | 88.000000 | 39.000000 | 23940.000000 | 886.000000 | 387.000000 | 1127.000000 | 52.000000 | 1116.000000 | 3.000000 |
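
Note the minimum of -1 for TeamExp and ManagerExp: in the Desharnais dataset this value is usually read as a marker for missing experience information rather than a real measurement. Effort is also strongly right-skewed (mean ≈ 5046 vs. median 3647, max 23940).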

Applying Pearson’s correlation

In this section, the correlations between the attributes of the Desharnais dataset and software effort are analyzed, and the applicability of regression analysis is examined. The correlation between two variables is a measure of how well the variables are related. The most common measure of correlation in statistics is Pearson's correlation (also called Pearson's r, or the Pearson Product-Moment Correlation, PPMC), which captures the linear relationship between two variables and is commonly used in linear regression.

The Pearson correlation coefficient ranges from -1 to 1. A result of -1 means a perfect negative correlation between the two variables, while a result of 1 means a perfect positive correlation. Results between 0.5 and 1.0 indicate a strong positive correlation.
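
Before computing the full correlation matrix, the imported pearsonr function can be used to check a single pair of variables. A minimal sketch, not part of the original notebook; the r value should match the Length/Effort entry in the matrix below:

# Pearson's r (and the associated p-value) for one attribute pair
r, p_value = pearsonr(df_desharnais['Length'], df_desharnais['Effort'])
print(f'r = {r:.6f}, p = {p_value:.2e}')   # expect r ≈ 0.693280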

df_desharnais.corr()
| | id | Project | TeamExp | ManagerExp | YearEnd | Length | Effort | Transactions | Entities | PointsNonAdjust | Adjustment | PointsAjust | Language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | 1.000000 | 1.000000 | -0.006007 | 0.214294 | 0.096486 | 0.255187 | 0.126153 | 0.265891 | 0.028787 | 0.226076 | -0.207774 | 0.202608 | 0.391475 |
| Project | 1.000000 | 1.000000 | -0.006007 | 0.214294 | 0.096486 | 0.255187 | 0.126153 | 0.265891 | 0.028787 | 0.226076 | -0.207774 | 0.202608 | 0.391475 |
| TeamExp | -0.006007 | -0.006007 | 1.000000 | 0.424687 | -0.210335 | 0.143948 | 0.119529 | 0.103768 | 0.256608 | 0.203805 | 0.235629 | 0.222884 | -0.079112 |
| ManagerExp | 0.214294 | 0.214294 | 0.424687 | 1.000000 | -0.011519 | 0.211324 | 0.158303 | 0.138146 | 0.206644 | 0.207748 | -0.066821 | 0.187399 | 0.205521 |
| YearEnd | 0.096486 | 0.096486 | -0.210335 | -0.011519 | 1.000000 | -0.095027 | -0.048367 | 0.034331 | 0.001686 | 0.028234 | -0.056743 | 0.012106 | 0.342233 |
| Length | 0.255187 | 0.255187 | 0.143948 | 0.211324 | -0.095027 | 1.000000 | 0.693280 | 0.620711 | 0.483504 | 0.723849 | 0.266086 | 0.714092 | -0.023810 |
| Effort | 0.126153 | 0.126153 | 0.119529 | 0.158303 | -0.048367 | 0.693280 | 1.000000 | 0.581881 | 0.510328 | 0.705449 | 0.463865 | 0.738271 | -0.261942 |
| Transactions | 0.265891 | 0.265891 | 0.103768 | 0.138146 | 0.034331 | 0.620711 | 0.581881 | 1.000000 | 0.185041 | 0.886419 | 0.341906 | 0.880923 | 0.136778 |
| Entities | 0.028787 | 0.028787 | 0.256608 | 0.206644 | 0.001686 | 0.483504 | 0.510328 | 0.185041 | 1.000000 | 0.618913 | 0.234747 | 0.598401 | -0.056439 |
| PointsNonAdjust | 0.226076 | 0.226076 | 0.203805 | 0.207748 | 0.028234 | 0.723849 | 0.705449 | 0.886419 | 0.618913 | 1.000000 | 0.383842 | 0.985945 | 0.082737 |
| Adjustment | -0.207774 | -0.207774 | 0.235629 | -0.066821 | -0.056743 | 0.266086 | 0.463865 | 0.341906 | 0.234747 | 0.383842 | 1.000000 | 0.513197 | -0.199167 |
| PointsAjust | 0.202608 | 0.202608 | 0.222884 | 0.187399 | 0.012106 | 0.714092 | 0.738271 | 0.880923 | 0.598401 | 0.985945 | 0.513197 | 1.000000 | 0.046672 |
| Language | 0.391475 | 0.391475 | -0.079112 | 0.205521 | 0.342233 | -0.023810 | -0.261942 | 0.136778 | -0.056439 | 0.082737 | -0.199167 | 0.046672 | 1.000000 |
colormap = plt.cm.viridis
plt.figure(figsize=(10,10))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.set(font_scale=1.05)
sns.heatmap(df_desharnais.drop(['id'], axis=1).astype(float).corr(),
            linewidths=0.1, vmax=1.0, square=True, cmap=colormap,
            linecolor='white', annot=True)

[Figure: Pearson correlation heatmap of the Desharnais features]

Split train/test data

features = ['TeamExp', 'ManagerExp', 'YearEnd', 'Length', 'Transactions', 'Entities',
            'PointsNonAdjust', 'Adjustment', 'PointsAjust']

# Keep only the features most strongly correlated with Effort (see the matrix above)
max_corr_features = ['Length', 'Transactions', 'Entities', 'PointsNonAdjust', 'PointsAjust']

X = df_desharnais[max_corr_features]
y = df_desharnais['Effort']

Model Construction

In this study the following algorithms were used: Linear Regression, K-Nearest Neighbors Regression, and Support Vector Regression. The regression models were trained on 67% of the instances and evaluated on the remaining 33%.

1) KNN Regression

K-Nearest Neighbors regression is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure. It has been used in statistical estimation and pattern recognition as a non-parametric technique, predicting unknown cases from the Euclidean distance between data points. Our choice of K-Nearest Neighbors regression was motivated by the absence of a detailed explanation of how the effort attribute value is calculated in the Desharnais dataset. We specified 3 neighbors for the k-neighbors queries and uniform weights, meaning all points in each neighborhood are weighted equally.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=30)

neigh = KNeighborsRegressor(n_neighbors=3, weights='uniform')
neigh.fit(X_train, y_train) 
print(neigh.score(X_test, y_test))
0.7379861869550943
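
To make the uniform-weights behavior concrete, a single KNN prediction can be reproduced by hand: find the three training points closest in Euclidean distance and average their Effort values. A minimal sketch, not part of the original notebook, assuming the split from the cell above:

# Reproduce one KNN prediction manually: uniform weights = plain mean of the 3 nearest targets
point = X_test.iloc[0].values
dists = np.linalg.norm(X_train.values - point, axis=1)  # Euclidean distance to every training point
nearest = np.argsort(dists)[:3]                         # positions of the 3 closest training points
print(y_train.iloc[nearest].mean())                     # should match neigh.predict(X_test.iloc[[0]]), barring distance ties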

2) Linear Regression

Regression analysis aims to verify the existence of a functional relationship between one variable and one or more other variables, obtaining an equation that explains the variation of the dependent variable Y through the variation of the levels of the independent variables. Training the linear regression model consists of fitting such an equation for the target variable Y.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=22)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
0.7680074954440712
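
Since the fitted model is just a linear equation over the selected features, its coefficients can be inspected directly. A quick illustrative check, not part of the original notebook:

# The fitted equation: Effort ≈ intercept + sum(coef_i * feature_i)
for name, coef in zip(max_corr_features, model.coef_):
    print(f'{name}: {coef:.2f}')
print(f'intercept: {model.intercept_:.2f}')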

3) Support Vector Machine
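
Unlike the two models above, the SVR configuration is selected by a grid search: GridSearchCV cross-validates every combination of kernel ('linear' or 'rbf') and regularization strength C from 1 to 10 on the training data, keeping the best-scoring combination.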

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=22)

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

svr = SVR()
# Note: despite its name, LinearSVC here is a GridSearchCV over SVR,
# not scikit-learn's LinearSVC class
LinearSVC = GridSearchCV(svr, parameters)
LinearSVC.fit(X_train, y_train)
print("Best params hash: {}".format(LinearSVC.best_params_))
print(LinearSVC.score(X_test, y_test))
Best params hash: {'C': 1, 'kernel': 'linear'}
0.735919788126071

Results

The figures show that the linear model's predictions (blue line) stay fairly close to the KNN model's predictions (green line), which estimate the numerical target from a similarity measure. The Linear Regression model presents the better performance: although the KNN model follows the data points fairly closely, Linear Regression achieves the smaller mean squared error. The lines of both models show a slight upward tendency, consistent with the positive correlation of these features with effort. Some features are also marked by the presence of outliers.
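
The mean-squared-error comparison can be checked numerically. A small sketch, not part of the original notebook, refitting all three models on the random_state=22 split currently in scope:

# Compare test-set MSE across the three models on the same split
from sklearn.metrics import mean_squared_error

for name, estimator in [('KNN (k=3)', neigh), ('Linear Regression', model), ('SVR (grid search)', LinearSVC)]:
    estimator.fit(X_train, y_train)
    print(name, mean_squared_error(y_test, estimator.predict(X_test)))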

plt.rcParams['legend.fontsize'] = 18
plt.rcParams['legend.loc'] = 'upper left'
plt.rcParams['axes.labelsize'] = 32

# Fit each model once on the current split (the original refit them inside the loop)
neigh.fit(X_train, y_train)
model.fit(X_train, y_train)
LinearSVC.fit(X_train, y_train)

for feature in max_corr_features:
    # One figure per feature, so each plt.show() renders its own plot
    plt.figure(figsize=(18,6))

    # Sort the test-set predictions by feature value so the lines draw left to right
    xs, ys = zip(*sorted(zip(X_test[feature], neigh.predict(X_test))))                          # KNN
    model_xs, model_ys = zip(*sorted(zip(X_test[feature], model.predict(X_test))))              # Linear Regression
    svc_model_xs, svc_model_ys = zip(*sorted(zip(X_test[feature], LinearSVC.predict(X_test))))  # SVR

    plt.scatter(X_test[feature], y_test, label='Real data', lw=2, alpha=0.7, c='k')
    plt.plot(model_xs, model_ys, lw=2, label='Linear Regression Model', c='cornflowerblue')
    plt.plot(xs, ys, lw=2, label='K Nearest Neighbors (k=3)', c='yellowgreen')
    plt.plot(svc_model_xs, svc_model_ys, lw=2, label='Support Vector Machine (Kernel=Linear)', c='gold')

    plt.xlabel(feature)
    plt.ylabel('Effort')
    plt.legend()
    plt.show()

[Figures: Effort vs. Length, Transactions, Entities, PointsNonAdjust, and PointsAjust on the test set, showing the real data points and the prediction lines of the three models]
