
Commit

Merge branch 'dev_release' of https://github.com/Luodian/Otter into dev_release
Luodian committed Nov 8, 2023
2 parents 70a95bb + 11105f1 commit 66b9d5a
Showing 2 changed files with 12 additions and 15 deletions.
4 changes: 3 additions & 1 deletion docs/mimicit_format.md
@@ -1,5 +1,7 @@
# Breaking Down the MIMIC-IT Format

❗❗❗ We changed the previous `images.json` to `images.parquet`. Both contain the same `key:base64` pairs, but the latter consumes far less CPU memory and loads faster via `pandas.DataFrame`, which lets us train on larger datasets more conveniently.

We mainly use one integrated dataset format, which we have referred to as the MIMIC-IT format ever since.

The MIMIC-IT format is configured through the following data yaml file. Within this yaml file, you assign the path of the instruction json file, the path of the image parquet file, and the number of samples you want to use. Samples within each group are drawn uniformly, and `number_samples / total_numbers` decides the sampling ratio of each dataset.
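For instance, the per-dataset sampling ratio can be computed as follows (a sketch: the total sample counts and the loader logic below are hypothetical illustrations, not Otter's actual code):

```python
import yaml  # PyYAML

# A fragment in the Demo_Data.yaml style (paths as shown in the docs).
data_yaml = """
IMAGE_TEXT:
  LADD:
    mimicit_path: azure_storage/json/LA/LADD_instructions.json
    images_path: azure_storage/Parquets/LA.parquet
    num_samples: -1
  M3IT_CAPTIONING:
    mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
    images_path: azure_storage/Parquets/coco.parquet
    num_samples: 20000
"""
config = yaml.safe_load(data_yaml)

# Hypothetical total sample counts per dataset, for illustration only.
totals = {"LADD": 80_000, "M3IT_CAPTIONING": 100_000}

ratios = {}
for name, spec in config["IMAGE_TEXT"].items():
    n = spec.get("num_samples", -1)  # a missing num_samples defaults to -1 (use all)
    used = totals[name] if n == -1 else min(n, totals[name])
    ratios[name] = used / totals[name]
# ratios == {"LADD": 1.0, "M3IT_CAPTIONING": 0.2}
```

So with these made-up totals, every LADD sample is used, while COCO captioning is downsampled to one fifth.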
@@ -86,4 +88,4 @@ parquet_file_path = os.path.join(
parquet_root_path, os.path.basename(json_file_path).split(".")[0].replace("_image", "") + ".parquet"
)
df.to_parquet(parquet_file_path, engine="pyarrow")
```
23 changes: 9 additions & 14 deletions shared_scripts/Demo_Data.yaml
@@ -1,21 +1,16 @@
-IMAGE_TEXT:
-  LADD:
-    mimicit_path: azure_storage/json/LA/LADD_instructions.json
-    images_path: azure_storage/Parquets/LA.parquet
-    num_samples: -1
-  # LACONV:
-  #   mimicit_path: azure_storage/json/LA/LACONV_instructions.json
-  #   images_path: azure_storage/json/LA/LA.json
-  #   train_config_path: azure_storage/json/LA/LACONV_train.json
-  #   num_samples: 50
+IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
+  LADD: # LLaVA Detailed Description, dataset name can be assigned at any name you want
+    mimicit_path: azure_storage/json/LA/LADD_instructions.json # Path of the instruction json file
+    images_path: azure_storage/Parquets/LA.parquet # Path of the image parquet file
+    num_samples: -1 # Number of samples you want to use, -1 means use all samples, if not set, default is -1.
-  M3IT_CAPTIONING:
-    mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
-    images_path: azure_storage/Parquets/coco.parquet
-    num_samples: 20000
   LACR_T2T:
     mimicit_path: azure_storage/json/LA/LACR_T2T_instructions.json
     images_path: azure_storage/Parquets/LA.parquet
     num_samples: -1
+  M3IT_CAPTIONING:
+    mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
+    images_path: azure_storage/Parquets/coco.parquet
+    num_samples: 20000
   # M3IT_VQA:
   #   mimicit_path: azure_storage/json/M3IT/vqa/vqav2/vqav2_instructions.json
   #   images_path: azure_storage/json/M3IT/vqa/vqav2/vqav2.json
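The allowed group names and required keys noted in the yaml comments can be checked with a small validator (a hypothetical sketch, not part of the Otter codebase):

```python
ALLOWED_GROUPS = {"IMAGE_TEXT", "TEXT_ONLY", "IMAGE_TEXT_IN_CONTEXT"}
REQUIRED_KEYS = {"mimicit_path", "images_path"}  # num_samples is optional (defaults to -1)


def validate_data_config(config: dict) -> list[str]:
    """Return a list of problems found in a Demo_Data.yaml-style dict."""
    problems = []
    for group, datasets in config.items():
        if group not in ALLOWED_GROUPS:
            problems.append(f"unknown group name: {group}")
        for name, spec in (datasets or {}).items():
            missing = REQUIRED_KEYS - set(spec)
            if missing:
                problems.append(f"{group}/{name} missing keys: {sorted(missing)}")
    return problems


config = {
    "IMAGE_TEXT": {
        "LADD": {
            "mimicit_path": "azure_storage/json/LA/LADD_instructions.json",
            "images_path": "azure_storage/Parquets/LA.parquet",
            "num_samples": -1,
        }
    },
    "VIDEO_TEXT": {"X": {"mimicit_path": "a.json"}},  # bad group name, missing images_path
}
problems = validate_data_config(config)
# → ["unknown group name: VIDEO_TEXT", "VIDEO_TEXT/X missing keys: ['images_path']"]
```

Running a check like this before training surfaces typos in group names or missing paths early, instead of failing mid-run.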
