Add data and restructure file handling #2
base: master
Conversation
    def __len__(self):
        return len(self.file_names)                     # old
        return len(self.file_names) * self.n_targets   # new
we increase the length to account for multiple targets, so each stored row yields one sample per target
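A toy illustration of the effect (the class and numbers are made up; only the __len__ line mirrors the diff):

    class _Toy:
        def __init__(self, file_names, n_targets):
            self.file_names = file_names
            self.n_targets = n_targets

        def __len__(self):
            # one sample per (row file, target) pair
            return len(self.file_names) * self.n_targets

    ds = _Toy(file_names=[f"row_{i}.npy" for i in range(100)], n_targets=3)
    assert len(ds) == 300   # 100 stored rows, 3 targets each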
    def __getitem__(self, idx):
        file_path = os.path.join(self.folder_path, self.sorted_files[idx])
        actual_idx = idx // self.n_targets
we need to map back to the original row index: with k rows and 2 targets, idx runs up to k*2, so integer division by n_targets recovers the stored row
    def __getitem__(self, idx):
        file_path = os.path.join(self.folder_path, self.sorted_files[idx])
        actual_idx = idx // self.n_targets
        augment_idx = idx % self.n_targets
augment_idx determines which target is used
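A quick worked example of this mapping with made-up numbers (only the // and % arithmetic comes from the diff):

    n_rows, n_targets = 4, 2                        # hypothetical: 4 stored row files, 2 targets
    for idx in range(n_rows * n_targets):           # the expanded dataset length
        actual_idx = idx // n_targets               # which stored row to load
        augment_idx = idx % n_targets               # which target to attach
        print(idx, "->", (actual_idx, augment_idx))
    # idx 0..7 maps each of rows 0..3 to targets 0 and 1 exactly once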
    c, target_embedding = self.task_dict[keys[augment_idx]]
    target_embedding = torch.tensor(target_embedding, dtype=data.dtype)
    data = add_positional_info(target_embedding=target_embedding, feature_embeddings=data)
    return torch.cat((data[:c, :], data[(c+1):, :]), dim=0)
remove the target's own embedding row from the sequence before returning the features
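For reference, a minimal runnable sketch of that removal, assuming data has shape (n_cols, emb_dim) and c is the target's position in the sequence (the sizes below are made up):

    import torch

    n_cols, emb_dim, c = 10, 8, 6          # hypothetical sizes; c = target position
    data = torch.randn(n_cols, emb_dim)

    # Concatenating everything before and after row c drops the target's own
    # embedding, leaving the remaining n_cols - 1 rows untouched and in order.
    features_only = torch.cat((data[:c, :], data[(c + 1):, :]), dim=0)
    assert features_only.shape == (n_cols - 1, emb_dim)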
    dataset_name, target, _ = filename.split(";")                           # old
    return dataset_name, torch.tensor(float(target), dtype=torch.float32)   # old
    dataset_name, target_list, _ = filename.split(";")                      # new
    target_list = eval(target_list)                                         # new
the target list is stored inside the file name as a string, hence the eval to turn it back into a Python list
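A small sketch of that round trip, with a hypothetical file name following the dataset;targets;suffix convention from the PR description (the exact literal format may differ); ast.literal_eval is shown here as a safer stand-in for eval, since the string is a plain literal:

    import ast

    # Hypothetical name: "<dataset>;<target literal>;<row id>"
    filename = "adult;[0.3, 0.7];17"
    dataset_name, target_list, _ = filename.split(";")
    # The target list travels through the file name as a string, so it has to be
    # evaluated back into a Python object.
    target_list = ast.literal_eval(target_list)
    assert dataset_name == "adult" and target_list == [0.3, 0.7]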
    ds_dir = os.path.join(join_paths(d.current_dir, 'files', mode))
    datasets.append(TabularDataset(folder_path=ds_dir,
                                   task_description_path=os.path.join(d.current_dir, 'task_description', mode),
                                   n_targets=len(d.target_column) if not only_main else 1))
Passing 1 works because the main target is always first in the task list, so with n_targets = 1 the dataset only ever selects the first entry.
This PR

What has changed:
We now treat every column (targets included) as a feature, compute embeddings for each row, and store them as an npy file. The stored tensor has shape (n_cols, emb_dim). In the file name, we record each target's value and its position in the sequence: if n_cols is 10 and there are two targets at positions 6 and 9, the file name contains [6: 0.3, 9: 0.7], with 0.3 and 0.7 being the respective target values for that particular row. In addition, we store a dict containing the embedding of the task/target description for each target.

In the torch dataset, we then:
- report a length of n_rows * n_targets, one sample per (row, target) pair,
- map each idx back to the stored row (idx // n_targets) and the selected target (idx % n_targets),
- load the row's embedding file and look up the selected target's position and description embedding in the dict,
- add the positional info for that target and remove the target's own embedding from the sequence before returning the features.

A rough end-to-end sketch follows below.
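Identifiers like task_dict, add_positional_info, and the (n_cols, emb_dim) layout come from the snippets above; the class itself, the stubbed helper, and the assumed dict structure are illustrative only, not the actual implementation.

    import os
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    def add_positional_info(target_embedding, feature_embeddings):
        # Stub for the PR's helper: how the target-description embedding is mixed
        # into the feature embeddings is not shown in the diff, so we simply
        # broadcast-add it here as a placeholder.
        return feature_embeddings + target_embedding

    class ToyTabularDataset(Dataset):
        """Illustration of the indexing scheme, not the real TabularDataset."""

        def __init__(self, folder_path, task_dict, n_targets):
            self.folder_path = folder_path
            # assumed layout: {target_name: (position_in_sequence, description_embedding)}
            self.task_dict = task_dict
            self.n_targets = n_targets
            self.sorted_files = sorted(os.listdir(folder_path))
            self.file_names = self.sorted_files

        def __len__(self):
            # one sample per (stored row, target) pair
            return len(self.file_names) * self.n_targets

        def __getitem__(self, idx):
            actual_idx = idx // self.n_targets    # which stored row file
            augment_idx = idx % self.n_targets    # which target for that row
            file_path = os.path.join(self.folder_path, self.sorted_files[actual_idx])
            data = torch.from_numpy(np.load(file_path))   # (n_cols, emb_dim)

            keys = list(self.task_dict.keys())
            c, target_embedding = self.task_dict[keys[augment_idx]]
            target_embedding = torch.tensor(target_embedding, dtype=data.dtype)
            data = add_positional_info(target_embedding=target_embedding,
                                       feature_embeddings=data)
            # drop the target's own row so only feature embeddings are returned
            return torch.cat((data[:c, :], data[(c + 1):, :]), dim=0)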
Why this approach?
When training on multiple targets for a particular dataset, we previously stored one file per row per target, so with n_rows = 100 and n_targets = 3 we stored 300 files. With the new approach, we only need to store 100 files (plus the small task dict).