Add data and restructure file handling #2

Open · wants to merge 1 commit into base: master

Conversation

umertens (Collaborator):

This PR:

  1. adds some new datasets
  2. changes how files are stored

What has changed:
We now treat every column as a feature, compute embeddings for each row, and store them as an .npy file. The stored tensor has shape (n_cols, emb_dim). The file name encodes each target's value and its position in the sequence: if n_cols is 10 and there are two targets at positions 6 and 9, the file name contains [6: 0.3, 9: 0.7], where 0.3 and 0.7 are the respective target values for that particular row. In addition, we store a dict containing the embedding of the task/target description for each target.
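
A minimal sketch of this storage scheme (save_row, the trailing filename field, and the dict-style target encoding are illustrative assumptions; the PR text only pins down the (n_cols, emb_dim) tensor and that positions/values live in the file name):

import os
import numpy as np

def save_row(folder, dataset_name, row_embeddings, targets):
    # row_embeddings: array of shape (n_cols, emb_dim); every column is embedded,
    # including the target columns themselves.
    # targets: position -> value, e.g. {6: 0.3, 9: 0.7}, written into the file name
    # as the middle of three ';'-separated fields (matching the split(";") further down).
    file_name = f"{dataset_name};{targets};0.npy"
    np.save(os.path.join(folder, file_name), row_embeddings)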

In the torch dataset, we then (see the sketch after this list):

  • load the per-row tensor of shape (n_cols, emb_dim)
  • load the dict
  • prepend the target embedding, giving (n_cols + 1, emb_dim)
  • remove the target's embedding again, returning to (n_cols, emb_dim)
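
A compact sketch of these four steps with synthetic tensors standing in for the .npy file; in the PR itself the prepend happens inside add_positional_info, which I assume here behaves like a plain prepend, as the shape change in the list suggests:

import torch

n_cols, emb_dim = 10, 4
row = torch.randn(n_cols, emb_dim)                    # stands in for np.load of the row file
task_dict = {"target_a": (6, torch.randn(emb_dim))}   # stored dict: name -> (position c, description embedding)

c, target_emb = task_dict["target_a"]
row = torch.cat((target_emb.unsqueeze(0), row), dim=0)        # prepend -> (n_cols + 1, emb_dim)
features = torch.cat((row[:c, :], row[(c + 1):, :]), dim=0)   # remove the target embedding again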

Why this approach?

Previously, when training on multiple targets for a particular dataset, we stored one file per row per target, so with n_rows = 100 and n_targets = 3 we stored 300 files. With the new approach, we only need to store 100 files (plus the small dict).


  def __len__(self):
-     return len(self.file_names)
+     return len(self.file_names) * self.n_targets
umertens (Collaborator Author):

we increase the len so that each stored row yields one example per target
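
For illustration, with the numbers from the PR description:

n_files, n_targets = 100, 3
assert n_files * n_targets == 300   # one logical example per (row, target) pair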


  def __getitem__(self, idx):
      file_path = os.path.join(self.folder_path, self.sorted_files[idx])
+     actual_idx = idx // self.n_targets
umertens (Collaborator Author):

we need to map back to the original row index: with k rows and 2 targets, idx runs up to k*2, so idx // n_targets recovers the row


  def __getitem__(self, idx):
      file_path = os.path.join(self.folder_path, self.sorted_files[idx])
      actual_idx = idx // self.n_targets
+     augment_idx = idx % self.n_targets
umertens (Collaborator Author):

augment_idx determines which target is used
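
A quick illustration of the two index computations (k = 3 rows, 2 targets; numbers illustrative):

n_targets = 2
for idx in range(3 * n_targets):          # k = 3 rows -> len = 6
    actual_idx = idx // n_targets         # stored row:      0, 0, 1, 1, 2, 2
    augment_idx = idx % n_targets         # selected target: 0, 1, 0, 1, 0, 1
    print(idx, actual_idx, augment_idx)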

c, target_embedding = self.task_dict[keys[augment_idx]]
target_embedding = torch.tensor(target_embedding, dtype=data.dtype)
data = add_positional_info(target_embedding=target_embedding, feature_embeddings=data)
return torch.cat((data[:c, :], data[(c+1):, :]), dim=0)
umertens (Collaborator Author):

remove target embedding
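
A sanity check of the shapes in this step (values illustrative): after prepending the target embedding and dropping one row, every example comes out at the same (n_cols, emb_dim) shape, independent of which target was selected.

import torch

n_cols, emb_dim, c = 10, 4, 6
data = torch.randn(n_cols + 1, emb_dim)                   # after the target embedding was prepended
out = torch.cat((data[:c, :], data[(c + 1):, :]), dim=0)  # drop row c
assert out.shape == (n_cols, emb_dim)                     # back to the original per-row shape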

- dataset_name, target, _ = filename.split(";")
- return dataset_name, torch.tensor(float(target), dtype=torch.float32)
+ dataset_name, target_list, _ = filename.split(";")
+ target_list = eval(target_list)
umertens (Collaborator Author):

the list is encoded as a string inside the file name, hence the eval
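
Hedged side note: if the encoded string is a plain Python literal (e.g. a dict such as {6: 0.3, 9: 0.7}), ast.literal_eval would parse it without executing arbitrary code; a possible hardening, not what the PR currently does:

import ast

target_list = ast.literal_eval("{6: 0.3, 9: 0.7}")  # -> {6: 0.3, 9: 0.7}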

ds_dir = os.path.join(join_paths(d.current_dir, 'files', mode))
datasets.append(TabularDataset(folder_path=ds_dir,
                               task_description_path=os.path.join(d.current_dir, 'task_description', mode),
                               n_targets=len(d.target_column) if not only_main else 1))
umertens (Collaborator Author):

Using 1 works since the main target is always first in the list.
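
Illustration of that convention (column names are hypothetical):

target_column = ["main_target", "aux_target_1", "aux_target_2"]  # main target stored first
only_main = True
n_targets = len(target_column) if not only_main else 1           # only_main -> just index 0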
