Add data and restructure file handling #2

Open · wants to merge 1 commit into base: master

Conversation

umertens (Collaborator):

This PR:

  1. adds some new datasets
  2. changes how files are stored

What has changed:
We now treat every column as a feature, compute embeddings for each row, and store them as an .npy file. The stored tensor has shape (n_cols, emb_dim). The file name encodes each target's value and its position in the sequence: if n_cols is 10 and there are two targets at positions 6 and 9, the file name contains [6: 0.3, 9: 0.7], where 0.3 and 0.7 are the respective target values for that particular row. In addition, we store a dict containing the embedding of the task/target description for each target.
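
A minimal sketch of this storage scheme (save_row, the trailing filename field, and the dict-style target encoding are illustrative assumptions; the PR text only pins down the (n_cols, emb_dim) tensor and that positions/values live in the file name):

import os
import numpy as np

def save_row(folder, dataset_name, row_embeddings, targets):
    # row_embeddings: array of shape (n_cols, emb_dim); every column is embedded,
    # including the target columns themselves.
    # targets: position -> value, e.g. {6: 0.3, 9: 0.7}, written into the file name
    # as the middle of three ';'-separated fields (matching the split(";") further down).
    file_name = f"{dataset_name};{targets};0.npy"
    np.save(os.path.join(folder, file_name), row_embeddings)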

In the torch dataset, we then (see the sketch after this list):

  • load the per-row tensor of shape (n_cols, emb_dim)
  • load the dict
  • prepend the target embedding, giving (n_cols + 1, emb_dim)
  • remove the target's embedding again, returning to (n_cols, emb_dim)
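
A compact sketch of these four steps with synthetic tensors standing in for the .npy file; in the PR itself the prepend happens inside add_positional_info, which I assume here behaves like a plain prepend, as the shape change in the list suggests:

import torch

n_cols, emb_dim = 10, 4
row = torch.randn(n_cols, emb_dim)                    # stands in for np.load of the row file
task_dict = {"target_a": (6, torch.randn(emb_dim))}   # stored dict: name -> (position c, description embedding)

c, target_emb = task_dict["target_a"]
row = torch.cat((target_emb.unsqueeze(0), row), dim=0)        # prepend -> (n_cols + 1, emb_dim)
features = torch.cat((row[:c, :], row[(c + 1):, :]), dim=0)   # remove the target embedding again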

Why this approach?

Previously, when training on multiple targets for a particular dataset, we stored one file per row per target, so with n_rows = 100 and n_targets = 3 we stored 300 files. With the new approach, we only need to store 100 files (plus the small dict).


  def __len__(self):
-     return len(self.file_names)
+     return len(self.file_names) * self.n_targets
umertens (Collaborator Author):

we increase the len so that each stored row yields one example per target
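
For illustration, with the numbers from the PR description:

n_files, n_targets = 100, 3
assert n_files * n_targets == 300   # one logical example per (row, target) pair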


  def __getitem__(self, idx):
      file_path = os.path.join(self.folder_path, self.sorted_files[idx])
+     actual_idx = idx // self.n_targets
umertens (Collaborator Author):

we need to map back to the original row index: with k rows and 2 targets, idx runs up to k*2, so idx // n_targets recovers the row


  def __getitem__(self, idx):
      file_path = os.path.join(self.folder_path, self.sorted_files[idx])
      actual_idx = idx // self.n_targets
+     augment_idx = idx % self.n_targets
umertens (Collaborator Author):

augment_idx determines which target is used
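
A quick illustration of the two index computations (k = 3 rows, 2 targets; numbers illustrative):

n_targets = 2
for idx in range(3 * n_targets):          # k = 3 rows -> len = 6
    actual_idx = idx // n_targets         # stored row:      0, 0, 1, 1, 2, 2
    augment_idx = idx % n_targets         # selected target: 0, 1, 0, 1, 0, 1
    print(idx, actual_idx, augment_idx)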

c, target_embedding = self.task_dict[keys[augment_idx]]
target_embedding = torch.tensor(target_embedding, dtype=data.dtype)
data = add_positional_info(target_embedding=target_embedding, feature_embeddings=data)
return torch.cat((data[:c, :], data[(c+1):, :]), dim=0)
umertens (Collaborator Author):

remove target embedding
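
A sanity check of the shapes in this step (values illustrative): after prepending the target embedding and dropping one row, every example comes out at the same (n_cols, emb_dim) shape, independent of which target was selected.

import torch

n_cols, emb_dim, c = 10, 4, 6
data = torch.randn(n_cols + 1, emb_dim)                   # after the target embedding was prepended
out = torch.cat((data[:c, :], data[(c + 1):, :]), dim=0)  # drop row c
assert out.shape == (n_cols, emb_dim)                     # back to the original per-row shape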

- dataset_name, target, _ = filename.split(";")
- return dataset_name, torch.tensor(float(target), dtype=torch.float32)
+ dataset_name, target_list, _ = filename.split(";")
+ target_list = eval(target_list)
umertens (Collaborator Author):

the list is encoded as a string inside the file name, hence the eval
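
Hedged side note: if the encoded string is a plain Python literal (e.g. a dict such as {6: 0.3, 9: 0.7}), ast.literal_eval would parse it without executing arbitrary code; a possible hardening, not what the PR currently does:

import ast

target_list = ast.literal_eval("{6: 0.3, 9: 0.7}")  # -> {6: 0.3, 9: 0.7}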

ds_dir = os.path.join(join_paths(d.current_dir, 'files', mode))
datasets.append(TabularDataset(folder_path=ds_dir,
                               task_description_path=os.path.join(d.current_dir, 'task_description', mode),
                               n_targets=len(d.target_column) if not only_main else 1))
umertens (Collaborator Author):

Using 1 works since the main target is always first in the list.
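
Illustration of that convention (column names are hypothetical):

target_column = ["main_target", "aux_target_1", "aux_target_2"]  # main target stored first
only_main = True
n_targets = len(target_column) if not only_main else 1           # only_main -> just index 0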
