Add one or more options that let the user specify a strategy for splitting the dataset across multiple files.
It would be great, for example, to have:
info:
  output_name: test
  output_format: parquet
  rows: 2_000_000
  files: 5
Each file would then contain approximately 2,000,000 / 5 = 400,000 rows.
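To make the arithmetic concrete, here is a minimal sketch of how the row count could be divided when `rows` is not an exact multiple of `files`. The function name `rows_per_file` is hypothetical and not part of the tool; it only illustrates one possible way to distribute the remainder.

```python
from __future__ import annotations


def rows_per_file(total_rows: int, files: int) -> list[int]:
    """Split total_rows across `files` files as evenly as possible.

    Hypothetical helper for illustration only, not an existing API.
    """
    if files <= 0:
        raise ValueError("files must be a positive integer")
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    base, remainder = divmod(total_rows, files)
    # The first `remainder` files take one extra row so the counts
    # always sum back to total_rows.
    return [base + 1 if i < remainder else base for i in range(files)]


print(rows_per_file(2_000_000, 5))  # [400000, 400000, 400000, 400000, 400000]
print(rows_per_file(10, 3))         # [4, 3, 3]
```

With this approach no file differs from another by more than one row, which is why the description above says "approximately".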
We could have parameters like:
- files: the number of output files, as described above
- target_size: start a new file once the current file exceeds a given size threshold (for example, to test the optimal HDFS block size)
- ...
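A `target_size` strategy could work roughly as sketched below. This is only an illustration under stated assumptions: the function name `split_by_target_size` is hypothetical, and each row's size in bytes is assumed to be known up front, whereas the real size of a Parquet file depends on encoding and compression and could only be estimated this way.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def split_by_target_size(
    row_sizes: Iterable[int], target_size: int
) -> Iterator[list[int]]:
    """Group row indices into files, closing a file once it reaches target_size bytes.

    Hypothetical helper: row_sizes are assumed per-row byte estimates.
    """
    if target_size <= 0:
        raise ValueError("target_size must be a positive number of bytes")
    current: list[int] = []
    current_bytes = 0
    for index, size in enumerate(row_sizes):
        current.append(index)
        current_bytes += size
        # Roll over to a new file as soon as the threshold is reached.
        if current_bytes >= target_size:
            yield current
            current, current_bytes = [], 0
    # Emit whatever is left as a final, smaller file.
    if current:
        yield current


# Five rows of 40 bytes each with a 100-byte threshold.
print(list(split_by_target_size([40, 40, 40, 40, 40], 100)))  # [[0, 1, 2], [3, 4]]
```

In this sketch a file is closed after the row that pushes it past the threshold, so files can slightly exceed `target_size`; closing before that row instead would keep files strictly below it.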