MiniGPT-4/dataset/convert_laion.py
SamimAB a8eb69ecd1 Using ijson to avoid loading full json in memory.
Using ijson to load item by item, so it is possible to load dataset
using dataset/convert_cc_sbu.py and dataset/convert_laion.py on machines
with low RAM.
2023-09-20 23:32:58 +05:30


import ijson

# specify input and output file paths
input_file = 'laion_synthetic_filtered_large.json'
output_file = 'laion_synthetic_filtered_large.tsv'

# the header is unknown until the first JSON object has been read
headers = None

# stream the JSON input and write the TSV output at the same time;
# ijson expects a binary file object, so the input is opened with 'rb'
with open(input_file, 'rb') as in_file, open(output_file, 'w', encoding='utf-8') as out_file:
    # iterate over the elements of the top-level JSON array one at a time
    objects = ijson.items(in_file, 'item')
    for obj in objects:
        # take the column names from the first object and write the header row
        if headers is None:
            headers = list(obj.keys())
            out_file.write('\t'.join(headers) + '\n')
        # write one TSV row per object, with values in header order
        row = '\t'.join(str(obj[key]) for key in headers)
        out_file.write(row + '\n')
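
The script above writes each value with a plain str() call, so a value that itself contains a tab or a line break would shift columns or split a row in the output. The following is a minimal sketch of one way to guard against that, not part of the original commit. The names clean_field and write_tsv are hypothetical helpers, and the 'url' and 'caption' keys in the sample are made-up placeholders rather than the real dataset schema.

```python
import io


def clean_field(value):
    """Render a value as a single TSV cell (hypothetical helper)."""
    # Tabs and line breaks would be read back as column or row separators,
    # so collapse each run of whitespace into one space and trim the ends.
    return ' '.join(str(value).split())


def write_tsv(objects, out_file):
    """Write an iterable of dicts to out_file as TSV and return the row count."""
    headers = None
    count = 0
    for obj in objects:
        if headers is None:
            # Column names come from the first object, as in the script above.
            headers = list(obj.keys())
            out_file.write('\t'.join(headers) + '\n')
        # obj.get tolerates objects that lack one of the header keys.
        cells = (clean_field(obj.get(key, '')) for key in headers)
        out_file.write('\t'.join(cells) + '\n')
        count += 1
    return count


# Small in-memory demonstration with made-up records.
sample = [
    {'url': 'http://example.com/a.jpg', 'caption': 'a cat\ton a mat'},
    {'url': 'http://example.com/b.jpg', 'caption': 'two\nlines'},
]
buffer = io.StringIO()
rows_written = write_tsv(sample, buffer)
print(buffer.getvalue())
```

Because write_tsv accepts any iterable of dicts, the generator returned by ijson.items(in_file, 'item') could be passed in place of sample, which would keep the item-by-item memory behaviour of the script above.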