TODO 28Jun2020
0. Dataset loading and getting all the data in an understandable format.
Plotting it. - done
1. Mutate time series - better outliers. - done
2. Extracting Features from time series.
Stationarity check
Stationarity transformations - differencing and power transformations.
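A minimal sketch of the two transformations named above, using only the standard library. The function names `difference` and `power_transform` are hypothetical, and the toy series is illustrative data, not the project dataset. It covers the transformations only, not the stationarity check itself.

```python
import math

def difference(series, lag=1):
    """Return the lag-differenced series: y[t] = x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def power_transform(series, lam):
    """Box-Cox style power transform for strictly positive values.

    lam == 0 gives the natural log; otherwise (x**lam - 1) / lam.
    """
    if any(x <= 0 for x in series):
        raise ValueError("power transform needs strictly positive values")
    if lam == 0:
        return [math.log(x) for x in series]
    return [(x ** lam - 1) / lam for x in series]

# Toy series with exponential growth (illustrative data only).
series = [2.0 ** t for t in range(1, 9)]
logged = power_transform(series, lam=0)  # exponential growth becomes a straight line
flat = difference(logged)                # the straight line becomes a constant level
```

Here the log removes the growing variance and one round of differencing removes the trend, so `flat` is constant at log(2).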
3. PCA on features - reduce dimensions to a subspace of k dimensions - keep the directions with the largest variance
PCA -
Taking the whole dataset ignoring the class labels
Compute the d-dimensional mean vector
Computing the scatter matrix (alternatively, the covariance matrix)
Computing eigenvectors and corresponding eigenvalues
Ranking and choosing k eigenvectors
Transforming the samples onto the new subspace
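The PCA steps listed above can be sketched with NumPy roughly as follows. This assumes NumPy is available; the function name `pca` and the toy data are hypothetical, chosen only to illustrate the steps.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n_samples x d) onto the top-k principal axes."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                       # d-dimensional mean vector
    centered = X - mean
    scatter = centered.T @ centered             # d x d scatter matrix
    eigvals, eigvecs = np.linalg.eigh(scatter)  # eigh: scatter is symmetric
    order = np.argsort(eigvals)[::-1][:k]       # rank by eigenvalue, keep top k
    components = eigvecs[:, order]              # d x k projection matrix
    return centered @ components, components, eigvals[order]

# Toy data: points along the line y = 2x plus small noise (illustrative only).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
projected, components, top_eigvals = pca(X, k=1)
```

The scatter matrix is the covariance matrix multiplied by (n - 1), so both give the same eigenvectors and the same ranking. Class labels are never used.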
4. Try LDA if classes are known
5. STL decomposition of timeseries -
Analyse residuals.
If |residual| > threshold then outlier
Find a dynamic way to calculate threshold.
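One possible way to make the threshold dynamic is to recompute it from a rolling window of recent residuals. This is an assumption, not a settled choice for this project. The sketch takes the residuals as already computed by an STL decomposition done elsewhere; the function name, window size and k are all hypothetical.

```python
from statistics import mean, stdev

def rolling_threshold_outliers(residuals, window=20, k=3.0):
    """Flag residual i when it lies more than k rolling std devs from the rolling mean.

    The mean and std dev come from the `window` residuals before i,
    so the threshold adapts as the noise level changes.
    """
    flagged = []
    for i in range(window, len(residuals)):
        history = residuals[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(residuals[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# Illustrative residuals: small alternating noise with one spike at index 30.
residuals = [0.1 if i % 2 else -0.1 for i in range(40)]
residuals[30] = 2.0
flagged = rolling_threshold_outliers(residuals)
```

Note that a flagged spike stays in the window and inflates the threshold for the next `window` points, so excluding flagged values from the history may be worth trying.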
6. Model time series using AR
7. Model time series using MA
8. Model time series using ARMA and ARIMA and SARIMA, ARFIMA, Box-Jenkins
AR
MA
ARIMA
SARIMA
Forecast, calculate actual - pred then check against threshold
STL
check residuals against threshold
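A rough sketch of the forecast-then-compare step for the AR-type models above, using a hand-rolled AR(1) least-squares fit as a stand-in for a proper library fit. The names `fit_ar1` and `forecast_outliers`, the in-sample one-step predictions, and the simulated data are all assumptions for illustration.

```python
import random
from statistics import mean, stdev

def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    mp, mc = mean(prev), mean(curr)
    cov = sum((p - mp) * (q - mc) for p, q in zip(prev, curr))
    var = sum((p - mp) ** 2 for p in prev)
    phi = cov / var
    return mc - phi * mp, phi

def forecast_outliers(series, k=3.0):
    """Flag index t when |actual - one-step prediction| exceeds k residual std devs."""
    c, phi = fit_ar1(series)
    errors = [series[t] - (c + phi * series[t - 1]) for t in range(1, len(series))]
    limit = k * stdev(errors)
    return [t for t, e in enumerate(errors, start=1) if abs(e) > limit]

# Simulated AR(1) data with one injected spike at index 30 (illustrative only).
rng = random.Random(0)
series = [0.0]
for _ in range(59):
    series.append(0.7 * series[-1] + rng.gauss(0, 0.1))
series[30] += 5.0
flagged = forecast_outliers(series)
```

The spike is part of the data the model is fitted on, so it biases phi and the residual std dev. Fitting on a clean window and forecasting forward would avoid that.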
Threshold detection:
data point more than ± 3 sd from the mean (z-score method)
Median Absolute Deviation
Interquartile range
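A small sketch of the two robust rules listed above, Median Absolute Deviation and interquartile range, using only the standard library. The function names are hypothetical. The 0.6745 scaling with a 3.5 cutoff and the 1.5 x IQR fences are the conventional defaults, assumed here rather than taken from these notes.

```python
from statistics import median, quantiles

def mad_outliers(values, k=3.5):
    """Flag indices whose modified z-score 0.6745 * |x - median| / MAD exceeds k."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        return []  # at least half the points are identical; score undefined
    return [i for i, x in enumerate(values) if 0.6745 * abs(x - med) / mad > k]

def iqr_outliers(values, k=1.5):
    """Flag indices outside [Q1 - k * IQR, Q3 + k * IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [i for i, x in enumerate(values) if x < q1 - k * iqr or x > q3 + k * iqr]

# Illustrative data: one obvious outlier at index 7.
values = [10, 11, 9, 10, 12, 10, 11, 50, 9, 10]
print(mad_outliers(values))  # [7]
print(iqr_outliers(values))  # [7]
```

Both rules rely on medians and quartiles, so the outlier barely moves the threshold it is tested against.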
Scikit learn
OneClassSVM, Isolation forest and LOF
Other methods
Clustering - DBSCAN
Trees - Isolation binary decision trees and random forests
Primary -
Z-value test for outlier analysis
zi = (xi - mu) / sigma
zi signifies the number of std devs the point xi is away from the mean
This provides a good proxy for the outlier score of that point.
An implicit assumption is that the data is modeled from a normal distribution, and therefore the
Z-value is a random variable drawn from a standard normal distribution with zero mean
and unit variance.
In cases where the mean and standard deviation of the distribution
can be accurately estimated, a good “rule-of-thumb” is to use |Zi| ≥ 3 as a proxy for the
anomaly, i.e. a point at least 3 sd from the mean.
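The Z-value test described above can be sketched as follows with the standard library. The function names and toy data are hypothetical. Using the population std dev (`pstdev`) is an assumption; the sample std dev would also be reasonable.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (x - mu) / sigma for every point."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Flag indices where |z| >= threshold (the 3 sd rule of thumb)."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) >= threshold]

# Illustrative data: 19 identical points and one far-off point at index 19.
values = [10.0] * 19 + [30.0]
print(z_outliers(values))  # [19]
```

Since the outlier itself inflates mu and sigma, the largest |z| a single point can reach with the population std dev is sqrt(n - 1). With fewer than ten points the 3 sd rule can never fire.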