From e1bf8f9de984cd8ed992bb947efff07180d6ae24 Mon Sep 17 00:00:00 2001
From: Joya Chen
Date: Fri, 28 Jun 2024 17:56:27 +0800
Subject: [PATCH] Create README.md

---
 data/livechat/README.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
 create mode 100644 data/livechat/README.md

diff --git a/data/livechat/README.md b/data/livechat/README.md
new file mode 100644
index 0000000..a7de91a
--- /dev/null
+++ b/data/livechat/README.md
@@ -0,0 +1,35 @@
+## Distributed Streaming Dialogue Data Generation
+
+This section describes how to generate streaming dialogue data from the Ego4D GoalStep dataset.
+
+### Download Ego4D GoalStep Annotation JSON
+
+Download the Ego4D annotations first. Refer to [Ego4D](https://ego4d-data.org/docs/start-here/) for details.
+
+After that, you can use a symbolic link so that the Ego4D annotations are laid out as follows:
+
+```
+datasets/ego4d/v2/annotations/
+├── ...
+├── goalstep_train.json
+├── goalstep_val.json
+├── ...
+```
+
+### Run Streaming Data Generation Script
+
+```
+python -m data.ego4d.livechat.ego4d_goalstep_livechat_generation --num_gpus 8 --anno_root
+```
+
+- Please run the script from the ```videollm-online/``` root folder.
+
+- If you are on a cluster, you can set ```--num_nodes ... --slurm_partition ...``` to use multiple nodes. The more nodes and GPUs you use, the faster the preprocessing.
+
+- The results will be saved to ```datasets/ego4d/v2/annotations/livechat/```.
+
+### Filtering Out Data That Exposes Ground Truth
+
+We find that the generated dialogue sometimes exposes timestamp information from the ground-truth annotations. The pattern is "second" or "..s" appearing in generated assistant responses. Furthermore, we remove dialogues shorter than 1 minute or longer than 60 minutes.
+
+See [filter.py](filter.py) for details.
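
The filtering rule described in the README's last section can be sketched as follows. This is a minimal illustration, not the contents of `filter.py`: the record layout (`duration` in seconds, a `conversation` list of turns with `role` and `content`), the regular expression, and the helper names `leaks_timestamp` and `keep_dialogue` are all assumptions made for the example. It also assumes that a dialogue with any leaking assistant response is dropped as a whole.

```python
import re

# Assumption: a timestamp leak is the word "second(s)" or a number directly
# followed by "s" (e.g. "12s", "3.5 s"). The real patterns live in filter.py.
LEAK_PATTERN = re.compile(r"\bseconds?\b|\b\d+(?:\.\d+)?\s?s\b", re.IGNORECASE)

MIN_DURATION = 60.0       # 1 minute, in seconds
MAX_DURATION = 60 * 60.0  # 60 minutes, in seconds


def leaks_timestamp(text: str) -> bool:
    """Return True if the text matches the assumed timestamp-leak pattern."""
    return LEAK_PATTERN.search(text) is not None


def keep_dialogue(dialogue: dict) -> bool:
    """Keep a dialogue only if its length is within bounds and no assistant turn leaks.

    `dialogue` uses a hypothetical layout:
    {"duration": float, "conversation": [{"role": str, "content": str}, ...]}
    """
    duration = dialogue["duration"]
    if duration < MIN_DURATION or duration > MAX_DURATION:
        return False
    return not any(
        turn["role"] == "assistant" and leaks_timestamp(turn["content"])
        for turn in dialogue["conversation"]
    )
```

Applied to a list of generated dialogues, `[d for d in dialogues if keep_dialogue(d)]` would yield the filtered set. Note that matching the bare word "second" also flags ordinal uses such as "the second egg"; whether that trade-off is acceptable depends on the data.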