Formatting and fixed import warnings #5

Open · wants to merge 7 commits into base: main
Changes from all commits
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ ForecastPFN is more accurate and faster compared to state-of-the-art forecasting

The codebase has these parts:
- `./src/` contains all code to replicate the ForecastPFN synthetic data generation and training procedure
- `./benchmark/` contains all the code to replicate the benchmark of ForecastPFN against the the other baselines.
- `./benchmark/` contains all the code to replicate the benchmark of ForecastPFN against the other baselines.

# Table of contents
1. [Installation](#installation-)
@@ -80,7 +80,7 @@ The arguments that are passed are:
See how our model performs:
![alt text](img/fpfn_performance.png?raw=true)

The above figure shows analysis of performance vs. train budget, aggregated across datasets and prediction lengths. We plot the number of total MSE wins (left) where a higher value is better and mean MSE rank (right) where a lower values is better. Error bars show one standard deviation across training runs. ForecastPFN and Meta-N-BEATS are disadvantaged in these comparisons given that they see no training data for these series, only the length 36 input.
The above figure shows an analysis of performance vs. train budget, aggregated across datasets and prediction lengths. We plot the number of total MSE wins (left) where a higher value is better and mean MSE rank (right) where a lower value is better. Error bars show one standard deviation across training runs. ForecastPFN and Meta-N-BEATS are disadvantaged in these comparisons given that they see no training data for these series, only the length 36 input.

# Synthetic Data Generation <a name="SyntheticDataGeneration"></a>
ForecastPFN is completely trained on synthetic data.
Binary file removed benchmark/.DS_Store
Binary file not shown.
10 changes: 5 additions & 5 deletions benchmark/README.md
@@ -1,4 +1,4 @@
This directory is for evaluation of ForecastPFN. We have evaluated ForecastPFN on seven real-world datasets which have been used in the literature. The datasets are in the `../academic_data` folder. The datasets include Illness, Exchange, ECL, ETTh1 and ETTh2, Weather and Traffic.
This directory is for the evaluation of ForecastPFN. We have evaluated ForecastPFN on seven real-world datasets that have been used in the literature. The datasets are in the `../academic_data` folder. The datasets include Illness, Exchange, ECL, ETTh1 and ETTh2, Weather and Traffic.

The evaluation has been done against multiple baselines which include Arima, Prophet, Informer, Fedformer-w, Autoformer, Transformer and Metalearn, as well as more simple baselines Mean, Last, and NaiveSeasonal.

@@ -24,12 +24,12 @@ The arguments that are passed are:
- `root_path` : This denotes the parent directory which contains the required dataset.
- `data_path` : This denotes the name of the file which contains the data. Look into the academic_data folder for information regarding other dataset files.
- `model` : This is one of (ForecastPFN, Metalearn, Arima, Autoformer, Informer, Transformer, FEDformer-w, Prophet)
- `seq_len` : The length of input sequence to be used. In our default setting, we have this set to 96 for exchange and 36 for all other datasets.
- `seq_len` : The length of the input sequence to be used. In our default setting, we have this set to 96 for exchange and 36 for all other datasets.
- `label_len` : In our default setting, we have this set to 48 for exchange and 18 for all other datasets.
- `pred_len` : This is the length of prediction to be made. We have evaluated our model with various prediction lengths.
- `train_budget` : This denotes the number of training examples that are available to the models which they can use for training. ForecastPFN and Metalearn use 0 examples since they are zero-shot.
- `pred_len` : This is the length of the prediction to be made. We have evaluated our model with various prediction lengths.
- `train_budget` : This denotes the number of training examples that are available to the models that they can use for training. ForecastPFN and Metalearn use 0 examples since they are zero-shot.
- `itr` : Number of times evaluation should be repeated. This affects the transformer-based models since they are non-deterministic.

All experiments that have been run for this paper can be found in `run.sh`.

Repliaction of the paper tables and plots can be found in the jupyter notebook `./analyze_results.ipynb`.
Replication of the paper tables and plots can be found in the jupyter notebook `./analyze_results.ipynb`.
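
The argument list above maps directly onto a benchmark invocation. The sketch below is a hypothetical way to launch one run from Python; the entry point name (`run.py`) and the exact flag spellings are assumptions, and `run.sh` contains the commands actually used for the paper.

```python
# Hypothetical launcher for a single benchmark run, mirroring the arguments
# documented above. Entry point and flag names are assumed; see run.sh for
# the commands actually used.
import subprocess

subprocess.run(
    [
        "python", "run.py",
        "--root_path", "../academic_data/illness/",  # parent directory of the dataset (assumed)
        "--data_path", "national_illness.csv",       # dataset file (assumed)
        "--model", "ForecastPFN",
        "--seq_len", "36",       # input length: 36 for all datasets except exchange (96)
        "--label_len", "18",
        "--pred_len", "24",
        "--train_budget", "0",   # ForecastPFN and Metalearn are zero-shot
        "--itr", "1",
    ],
    check=True,
)
```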
101 changes: 66 additions & 35 deletions benchmark/data_provider/UnivariateTimeseriesSampler_WithStamps.py
@@ -1,17 +1,17 @@
import numpy as np
import pandas as pd
import datetime


class UnivariateTimeseriesSampler_WithStamps:
def __init__(self,
timeseries: np.ndarray,
time_stamps: np.ndarray,
insample_size: int,
outsample_size: int,
window_sampling_limit: int,
batch_size: int,
time_features,
):
def __init__(
self,
timeseries: np.ndarray,
time_stamps: np.ndarray,
insample_size: int,
outsample_size: int,
window_sampling_limit: int,
batch_size: int,
time_features,
):
self.timeseries = [ts for ts in timeseries]
self.time_stamps = [ts for ts in time_stamps]
self.window_sampling_limit = window_sampling_limit
@@ -20,55 +20,86 @@ def __init__(self,
self.outsample_size = outsample_size
self.time_features = time_features
self.time_embedding_dim = self.time_features(self.time_stamps[0]).T.shape[0]


def __iter__(self):
while True:
insample = np.zeros((self.batch_size, self.insample_size))
insample_mask = np.zeros((self.batch_size, self.insample_size))
outsample = np.zeros((self.batch_size, self.outsample_size))
outsample_mask = np.zeros((self.batch_size, self.outsample_size))
sampled_ts_indices = np.random.randint(len(self.timeseries), size=self.batch_size)
sampled_ts_indices = np.random.randint(
len(self.timeseries), size=self.batch_size
)

insample_time_stamps = np.zeros(
(self.batch_size, self.insample_size, self.time_embedding_dim), dtype=object)
(self.batch_size, self.insample_size, self.time_embedding_dim),
dtype=object,
)
outsample_time_stamps = np.zeros(
(self.batch_size, self.outsample_size, self.time_embedding_dim), dtype=object)
(self.batch_size, self.outsample_size, self.time_embedding_dim),
dtype=object,
)
for i, sampled_index in enumerate(sampled_ts_indices):
sampled_timeseries = self.timeseries[sampled_index]
cut_point = np.random.randint(low=max(1, len(sampled_timeseries) - self.window_sampling_limit),
high=len(sampled_timeseries),
size=1)[0]
cut_point = np.random.randint(
low=max(1, len(sampled_timeseries) - self.window_sampling_limit),
high=len(sampled_timeseries),
size=1,
)[0]

insample_window = sampled_timeseries[max(0, cut_point - self.insample_size):cut_point]
insample[i, -len(insample_window):] = insample_window
insample_mask[i, -len(insample_window):] = 1.0
insample_window = sampled_timeseries[
max(0, cut_point - self.insample_size) : cut_point
]
insample[i, -len(insample_window) :] = insample_window
insample_mask[i, -len(insample_window) :] = 1.0
outsample_window = sampled_timeseries[
cut_point:min(len(sampled_timeseries), cut_point + self.outsample_size)]
outsample[i, :len(outsample_window)] = outsample_window
outsample_mask[i, :len(outsample_window)] = 1.0
cut_point : min(
len(sampled_timeseries), cut_point + self.outsample_size
)
]
outsample[i, : len(outsample_window)] = outsample_window
outsample_mask[i, : len(outsample_window)] = 1.0

sampled_timestamps = self.time_stamps[sampled_index]
insample_window_time_stamps = sampled_timestamps[max(0, cut_point - self.insample_size):cut_point]
insample_time_stamps[i, -len(insample_window_time_stamps):] = self.time_features(insample_window_time_stamps)
insample_window_time_stamps = sampled_timestamps[
max(0, cut_point - self.insample_size) : cut_point
]
insample_time_stamps[
i, -len(insample_window_time_stamps) :
] = self.time_features(insample_window_time_stamps)
outsample_window_timestamps = sampled_timestamps[
cut_point:min(len(sampled_timestamps), cut_point + self.outsample_size)]
outsample_time_stamps[i, :len(outsample_window_timestamps)] = self.time_features(outsample_window_timestamps)
yield insample, insample_mask, outsample, outsample_mask, insample_time_stamps, outsample_time_stamps
cut_point : min(
len(sampled_timestamps), cut_point + self.outsample_size
)
]
outsample_time_stamps[
i, : len(outsample_window_timestamps)
] = self.time_features(outsample_window_timestamps)
yield (
insample,
insample_mask,
outsample,
outsample_mask,
insample_time_stamps,
outsample_time_stamps,
)

def sequential_latest_insamples(self):
batch_size = len(self.timeseries)
insample = np.zeros((batch_size, self.insample_size))
insample_mask = np.zeros((batch_size, self.insample_size))
insample_time_stamps = np.zeros(
(batch_size, self.insample_size, self.time_embedding_dim), dtype=object)
(batch_size, self.insample_size, self.time_embedding_dim), dtype=object
)
for i, (ts, time_stamp) in enumerate(zip(self.timeseries, self.time_stamps)):
ts_last_window = ts[-self.insample_size:]
insample[i, -len(ts):] = ts_last_window
insample_mask[i, -len(ts):] = 1.0
ts_last_window = ts[-self.insample_size :]
insample[i, -len(ts) :] = ts_last_window
insample_mask[i, -len(ts) :] = 1.0

sampled_timestamps = time_stamp
insample_window_time_stamps = sampled_timestamps[-self.insample_size:]
insample_time_stamps[i, -len(insample_window_time_stamps):] = self.time_features(insample_window_time_stamps)
insample_window_time_stamps = sampled_timestamps[-self.insample_size :]
insample_time_stamps[
i, -len(insample_window_time_stamps) :
] = self.time_features(insample_window_time_stamps)

return insample, insample_mask, insample_time_stamps
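
For readers skimming the reformatted sampler, the following is a minimal usage sketch. The toy series, timestamps, day-of-week feature function, and import path are illustrative assumptions and are not part of the repository; only the constructor signature and the 6-tuple yielded by the iterator come from the code above.

```python
# Illustrative usage of UnivariateTimeseriesSampler_WithStamps with toy data.
import numpy as np
import pandas as pd

# Import path assumed from the file location benchmark/data_provider/.
from data_provider.UnivariateTimeseriesSampler_WithStamps import (
    UnivariateTimeseriesSampler_WithStamps,
)

# Two toy univariate series with daily timestamps (assumed data, not from the repo).
series = [np.arange(200, dtype=float), np.arange(300, dtype=float)]
stamps = [pd.date_range("2020-01-01", periods=len(ts), freq="D").values for ts in series]

def day_of_week_features(time_stamps):
    # A trivial time-feature extractor: one column holding the day of week.
    return pd.DatetimeIndex(time_stamps).dayofweek.values.reshape(-1, 1)

sampler = UnivariateTimeseriesSampler_WithStamps(
    timeseries=series,
    time_stamps=stamps,
    insample_size=36,
    outsample_size=12,
    window_sampling_limit=100,
    batch_size=4,
    time_features=day_of_week_features,
)

# The iterator yields batches forever; take one batch of windows, masks, and stamps.
insample, insample_mask, outsample, outsample_mask, in_ts, out_ts = next(iter(sampler))
print(insample.shape, outsample.shape)  # (4, 36) (4, 12)
```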
7 changes: 5 additions & 2 deletions benchmark/data_provider/data_factory.py
@@ -1,5 +1,7 @@
from data_provider.data_loader import Dataset_Custom
from torch.utils.data import DataLoader

from data_provider.data_loader import Dataset_Custom

# from metalearned.resources.electricity.dataset import ElectricityDataset, ElectricityMeta
# from metalearned.resources.m3.dataset import M3Dataset, M3Meta
# from metalearned.resources.m4.dataset import M4Dataset, M4Meta
@@ -74,5 +76,6 @@ def data_provider(args, flag):
batch_size=batch_size,
shuffle=shuffle_flag,
num_workers=args.num_workers,
drop_last=drop_last)
drop_last=drop_last,
)
return data_set, data_loader
100 changes: 64 additions & 36 deletions benchmark/data_provider/data_loader.py
@@ -1,23 +1,33 @@
import os
import numpy as np
import warnings

import pandas as pd
import os
import torch
from torch.utils.data import Dataset, DataLoader
from utils.timefeatures import time_features
from sklearn.preprocessing import StandardScaler
import warnings
from torch.utils.data import Dataset

from utils.timefeatures import time_features

warnings.filterwarnings('ignore')


class Dataset_Custom(Dataset):
def __init__(self, root_path, flag='train', size=None,
features='S', data_path='ETTh1.csv',
target='OT', scale=True, timeenc=0, freq='h',
scaler=StandardScaler(), train_budget=None):
def __init__(
self,
root_path,
flag='train',
size=None,
features='S',
data_path='ETTh1.csv',
target='OT',
scale=True,
timeenc=0,
freq='h',
scaler=StandardScaler(),
train_budget=None,
):
# size [seq_len, label_len, pred_len]
# info
if size == None:
if size is None:
self.seq_len = 24 * 4 * 4
self.label_len = 24 * 4
self.pred_len = 24 * 4
@@ -43,12 +53,11 @@ def __init__(self, root_path, flag='train', size=None,
self.__read_data__()

def __read_data__(self):
df_raw = pd.read_csv(os.path.join(self.root_path,
self.data_path))
df_raw = pd.read_csv(os.path.join(self.root_path, self.data_path))

'''
"""
df_raw.columns: ['date', ...(other features), target feature]
'''
"""
cols = list(df_raw.columns)
cols.remove(self.target)
cols.remove('date')
@@ -60,10 +69,13 @@ def __read_data__(self):

train_start = 0
if self.train_budget:
train_start = max(train_start, num_train -
self.seq_len - self.train_budget)
train_start = max(train_start, num_train - self.seq_len - self.train_budget)

border1s = [train_start, num_train - self.seq_len, len(df_raw) - num_test - self.seq_len]
border1s = [
train_start,
num_train - self.seq_len,
len(df_raw) - num_test - self.seq_len,
]
border2s = [num_train, num_train + num_vali, len(df_raw)]
border1 = border1s[self.set_type]
border2 = border2s[self.set_type]
@@ -75,7 +87,7 @@ def __read_data__(self):
df_data = df_raw[[self.target]]

if self.scale:
train_data = df_data[0:border2s[0]]
train_data = df_data[0 : border2s[0]]
self.scaler.fit(train_data.values)
data = self.scaler.transform(df_data.values)
else:
@@ -91,7 +103,9 @@ def __read_data__(self):
df_stamp['hour'] = df_stamp.date.apply(lambda row: row.hour, 1)
data_stamp = df_stamp.drop(['date'], 1).values
elif self.timeenc == 1:
data_stamp = time_features(pd.to_datetime(df_stamp['date'].values), freq=self.freq)
data_stamp = time_features(
pd.to_datetime(df_stamp['date'].values), freq=self.freq
)
data_stamp = data_stamp.transpose(1, 0)

self.data_x = data[border1:border2]
@@ -108,10 +122,10 @@ def __getitem__(self, index):
seq_y = self.data_y[r_begin:r_end]
seq_x_mark = self.data_stamp[s_begin:s_end]
seq_y_mark = self.data_stamp[r_begin:r_end]
seq_x_original = self.data_stamp_original['date'].values[s_begin:s_end]
seq_y_original = self.data_stamp_original['date'].values[r_begin:r_end]
# seq_x_original = self.data_stamp_original["date"].values[s_begin:s_end]
# seq_y_original = self.data_stamp_original["date"].values[r_begin:r_end]

return seq_x, seq_y, seq_x_mark, seq_y_mark#, seq_x_original, seq_y_original
return seq_x, seq_y, seq_x_mark, seq_y_mark # , seq_x_original, seq_y_original

def __len__(self):
return len(self.data_x) - self.seq_len - self.pred_len + 1
@@ -121,13 +135,24 @@ def inverse_transform(self, data):


class Dataset_Pred(Dataset):
def __init__(self, root_path, flag='pred', size=None,
features='S', data_path='ETTh1.csv',
target='OT', scale=True, inverse=False, timeenc=0, freq='15min', cols=None,
scaler=StandardScaler()):
def __init__(
self,
root_path,
flag='pred',
size=None,
features='S',
data_path='ETTh1.csv',
target='OT',
scale=True,
inverse=False,
timeenc=0,
freq='15min',
cols=None,
scaler=StandardScaler(),
):
# size [seq_len, label_len, pred_len]
# info
if size == None:
if size is None:
self.seq_len = 24 * 4 * 4
self.label_len = 24 * 4
self.pred_len = 24 * 4
@@ -151,11 +176,10 @@ def __init__(self, root_path, flag='pred', size=None,
self.__read_data__()

def __read_data__(self):
df_raw = pd.read_csv(os.path.join(self.root_path,
self.data_path))
'''
df_raw = pd.read_csv(os.path.join(self.root_path, self.data_path))
"""
df_raw.columns: ['date', ...(other features), target feature]
'''
"""
if self.cols:
cols = self.cols.copy()
cols.remove(self.target)
@@ -181,7 +205,9 @@ def __read_data__(self):

tmp_stamp = df_raw[['date']][border1:border2]
tmp_stamp['date'] = pd.to_datetime(tmp_stamp.date)
pred_dates = pd.date_range(tmp_stamp.date.values[-1], periods=self.pred_len + 1, freq=self.freq)
pred_dates = pd.date_range(
tmp_stamp.date.values[-1], periods=self.pred_len + 1, freq=self.freq
)

df_stamp = pd.DataFrame(columns=['date'])
df_stamp.date = list(tmp_stamp.date.values) + list(pred_dates[1:])
@@ -194,7 +220,9 @@ def __read_data__(self):
df_stamp['minute'] = df_stamp.minute.map(lambda x: x // 15)
data_stamp = df_stamp.drop(['date'], 1).values
elif self.timeenc == 1:
data_stamp = time_features(pd.to_datetime(df_stamp['date'].values), freq=self.freq)
data_stamp = time_features(
pd.to_datetime(df_stamp['date'].values), freq=self.freq
)
data_stamp = data_stamp.transpose(1, 0)

self.data_x = data[border1:border2]
@@ -212,9 +240,9 @@ def __getitem__(self, index):

seq_x = self.data_x[s_begin:s_end]
if self.inverse:
seq_y = self.data_x[r_begin:r_begin + self.label_len]
seq_y = self.data_x[r_begin : r_begin + self.label_len]
else:
seq_y = self.data_y[r_begin:r_begin + self.label_len]
seq_y = self.data_y[r_begin : r_begin + self.label_len]
seq_x_mark = self.data_stamp[s_begin:s_end]
seq_y_mark = self.data_stamp[r_begin:r_end]
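
To make the reformatted loader easier to follow, here is a minimal sketch of constructing `Dataset_Custom` directly and wrapping it in a `DataLoader`. The dataset directory, file name, and window sizes are assumptions for illustration; in the benchmark, `data_provider()` in `data_factory.py` builds these objects from the parsed command-line arguments.

```python
# Hypothetical direct use of Dataset_Custom (paths, file name, and sizes are assumed).
from torch.utils.data import DataLoader

from data_provider.data_loader import Dataset_Custom

dataset = Dataset_Custom(
    root_path="../academic_data/illness",   # assumed dataset directory
    flag="train",
    size=[36, 18, 24],                      # [seq_len, label_len, pred_len]
    features="S",
    data_path="national_illness.csv",       # assumed file; needs a 'date' column and the target column
    target="OT",
    scale=True,
    timeenc=1,
    freq="h",
    train_budget=50,                        # cap on available training examples, as in the benchmark README
)

loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
seq_x, seq_y, seq_x_mark, seq_y_mark = next(iter(loader))
print(seq_x.shape)  # (32, 36, 1) for the univariate setting features='S'
```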
