Version 0.14.14
I have the following code to merge 2 sorted bucket datasets:
AvroSortedBucketIO.Read right = AvroSortedBucketIO.read(new TupleTag<>("right"), x.class).from("directoryX");
AvroSortedBucketIO.Read left = AvroSortedBucketIO.read(new TupleTag<>("left"), y.class).from("directoryY");
PCollection<KV<String, CoGbkResult>> merged = pipeline.apply("Merge and Transform", SortedBucketIO.read(String.class).of(left, right));
merged.apply(ParDo.of(new MergeTransform());
One of the dataset is approx 2TB
Dataflow failed out of memory
Please advise. Thanks