MiniGPT-4/dataset/README_MINIGPTv2_FINETUNE.md

## Download the COCO captions, RefCOCO, RefCOCO+. RefCOCOg, visual genome, textcaps, LLaVA, gqa, AOK-VQA, OK-VQA, OCR-VQA, filtered Flickr-30k, multi-task conversation, and Unnatural instruction datasets

After downloading all of them, organize the data as follows in `./playground/data`,

```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2


### COCO captions
- [train2017](http://images.cocodataset.org/zips/train2017.zip)

### RefCOCO, RefCOCO+, RefCOCOg

### Visual genome
- [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
### TextCaps

### LLaVA

### TextVQA
- [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
### GQA
- [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
- [Annotations](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/testdev_balanced_questions.json)

### OKVQA

### AOK-VQA

### OCR-VQA
- [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**

### filtered Flickr-30k

### Multi-task conversation

### Unnatural instruction


### Pre-training datasets download:
We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).

It requires ~2.3T to store LAION and CC3M+CC12M+SBU datasets

Image source | Filtered synthetic caption by ViT-L
--- | :---:
CC3M+CC12M+SBU | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>
LAION115M |  <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>

This will download two json files 
```
ccs_synthetic_filtered_large.json
laion_synthetic_filtered_large.json
```

## prepare the data step-by-step


### setup the dataset folder and move the annotation file to the data storage folder
```
export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
mkdir ${MINIGPT4_DATASET}/cc_sbu
mkdir ${MINIGPT4_DATASET}/laion
mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
```

### Convert the scripts to data storate folder
```
cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
cp convert_laion.py ${MINIGPT4_DATASET}/laion
cp download_laion.sh ${MINIGPT4_DATASET}/laion
```


### Convert the laion and cc_sbu annotation file format to be img2dataset format
```
cd ${MINIGPT4_DATASET}/cc_sbu
python convert_cc_sbu.py

cd ${MINIGPT4_DATASET}/laion
python convert_laion.py
```

### Download the datasets with img2dataset
```
cd ${MINIGPT4_DATASET}/cc_sbu
sh download_cc_sbu.sh
cd ${MINIGPT4_DATASET}/laion
sh download_laion.sh
```


The final dataset structure

```
.
├── ${MINIGPT4_DATASET}
│   ├── cc_sbu
│       ├── convert_cc_sbu.py
│       ├── download_cc_sbu.sh
│       ├── ccs_synthetic_filtered_large.json
│       ├── ccs_synthetic_filtered_large.tsv
│       └── cc_sbu_dataset
│           ├── 00000.tar
│           ├── 00000.parquet
│           ...
│   ├── laion
│       ├── convert_laion.py
│       ├── download_laion.sh
│       ├── laion_synthetic_filtered_large.json
│       ├── laion_synthetic_filtered_large.tsv
│       └── laion_dataset
│           ├── 00000.tar
│           ├── 00000.parquet
│           ...
...   
```


## Set up the dataset configuration files

Then, set up the LAION dataset loading path in 
[here](../minigpt4/configs/datasets/laion/defaults.yaml#L5) at Line 5 as 
${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar

and the Conceptual Captoin and SBU datasets loading path in 
[here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) at Line 5 as 
${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar
prepare dataset readme 2023-10-23 06:43:07 +00:00			`## Download the COCO captions, RefCOCO, RefCOCO+. RefCOCOg, visual genome, textcaps, LLaVA, gqa, AOK-VQA, OK-VQA, OCR-VQA, filtered Flickr-30k, multi-task conversation, and Unnatural instruction datasets`

Update README_MINIGPTv2_FINETUNE.md 2023-10-23 18:11:35 +00:00			After downloading all of them, organize the data as follows in `./playground/data`,

			```
			`├── coco`
			`│ └── train2017`
			`├── gqa`
			`│ └── images`
			`├── ocr_vqa`
			`│ └── images`
			`├── textvqa`
			`│ └── train_images`
			`└── vg`
			`├── VG_100K`
			`└── VG_100K_2`
prepare dataset readme 2023-10-23 06:43:07 +00:00

Update README_MINIGPTv2_FINETUNE.md 2023-10-23 18:11:35 +00:00			`### COCO captions`
			`- [train2017](http://images.cocodataset.org/zips/train2017.zip)`

prepare dataset readme 2023-10-23 06:43:07 +00:00			`### RefCOCO, RefCOCO+, RefCOCOg`

			`### Visual genome`
Update README_MINIGPTv2_FINETUNE.md 2023-10-23 18:11:35 +00:00			`- [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)`
			`### TextCaps`
prepare dataset readme 2023-10-23 06:43:07 +00:00
			`### LLaVA`

Update README_MINIGPTv2_FINETUNE.md 2023-10-23 18:11:35 +00:00			`### TextVQA`
			`- [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)`
			`### GQA`
			`- [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)`
			`- [Annotations](https://storage.googleapis.com/sfr-vision-language-research/LAVIS/datasets/gqa/testdev_balanced_questions.json)`
prepare dataset readme 2023-10-23 06:43:07 +00:00
			`### OKVQA`

			`### AOK-VQA`

			`### OCR-VQA`
Update README_MINIGPTv2_FINETUNE.md 2023-10-23 18:11:35 +00:00			- [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), we save all files as `.jpg`
prepare dataset readme 2023-10-23 06:43:07 +00:00
			`### filtered Flickr-30k`

			`### Multi-task conversation`

			`### Unnatural instruction`













			`### Pre-training datasets download:`
			`We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).`

			`It requires ~2.3T to store LAION and CC3M+CC12M+SBU datasets`

			`Image source \| Filtered synthetic caption by ViT-L`
			`--- \| :---:`
			`CC3M+CC12M+SBU \| <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>`
			`LAION115M \| <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>`

			`This will download two json files`
			```
			`ccs_synthetic_filtered_large.json`
			`laion_synthetic_filtered_large.json`
			```

			`## prepare the data step-by-step`


			`### setup the dataset folder and move the annotation file to the data storage folder`
			```
			`export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/`
			`mkdir ${MINIGPT4_DATASET}/cc_sbu`
			`mkdir ${MINIGPT4_DATASET}/laion`
			`mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu`
			`mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion`
			```

			`### Convert the scripts to data storate folder`
			```
			`cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu`
			`cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu`
			`cp convert_laion.py ${MINIGPT4_DATASET}/laion`
			`cp download_laion.sh ${MINIGPT4_DATASET}/laion`
			```


			`### Convert the laion and cc_sbu annotation file format to be img2dataset format`
			```
			`cd ${MINIGPT4_DATASET}/cc_sbu`
			`python convert_cc_sbu.py`

			`cd ${MINIGPT4_DATASET}/laion`
			`python convert_laion.py`
			```

			`### Download the datasets with img2dataset`
			```
			`cd ${MINIGPT4_DATASET}/cc_sbu`
			`sh download_cc_sbu.sh`
			`cd ${MINIGPT4_DATASET}/laion`
			`sh download_laion.sh`
			```


			`The final dataset structure`

			```
			`.`
			`├── ${MINIGPT4_DATASET}`
			`│ ├── cc_sbu`
			`│ ├── convert_cc_sbu.py`
			`│ ├── download_cc_sbu.sh`
			`│ ├── ccs_synthetic_filtered_large.json`
			`│ ├── ccs_synthetic_filtered_large.tsv`
			`│ └── cc_sbu_dataset`
			`│ ├── 00000.tar`
			`│ ├── 00000.parquet`
			`│ ...`
			`│ ├── laion`
			`│ ├── convert_laion.py`
			`│ ├── download_laion.sh`
			`│ ├── laion_synthetic_filtered_large.json`
			`│ ├── laion_synthetic_filtered_large.tsv`
			`│ └── laion_dataset`
			`│ ├── 00000.tar`
			`│ ├── 00000.parquet`
			`│ ...`
			`...`
			```


			`## Set up the dataset configuration files`

			`Then, set up the LAION dataset loading path in`
			`[here](../minigpt4/configs/datasets/laion/defaults.yaml#L5) at Line 5 as`
			`${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar`

			`and the Conceptual Captoin and SBU datasets loading path in`
			`[here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) at Line 5 as`
			`${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar`