Parallel Table.append #428
Comments
@bigluck Thanks for raising this. This is on my list to look into! Parallelization of this is always hard since it is hard to exactly know how big the Parquet file will be. Efficient encoding and compression can vary a lot. However, just using one core does not make any sense. |
Copy/paste my comments from Slack. PyIceberg currently uses
https://arrow.apache.org/cookbook/r/reading-and-writing-data---multiple-files.html#write-partitioned-data---parquet |
There aren't many mentions of using Arrow to write a single file using multiple threads. The only thing I found was I don't think this is exposed to writing a single file ( |
Looks like we have to use a higher level API to force parallelism, i.e. |
Looks like Maybe we can use the |
Support encoding a single parquet file using multiple threads Looks like the bottleneck might be encoding instead of IO |
I wonder if |
This is for reading, so each core reads a separate row group.
How would we control the size of a file (or the number of records is a second best). I wonder if we generate a fair amount of data, like @bigluck is doing:

import multiprocessing
import os
import time
import uuid
from functools import partial
from typing import Any

import pyarrow as pa
from faker import Faker
from pyiceberg.catalog import load_catalog


def generate_fake_rows(batch_size, i) -> list[dict[str, Any]]:
    fake = Faker()
    return [
        {
            **fake.profile(),
            'password': fake.sha256(),
            'device_id': fake.uuid4(),
            'device_ip': fake.ipv4(),
            'device_user_agent': fake.user_agent(),
            'phone': fake.phone_number(),
            'salary': fake.random_int(15_000, 250_000),
            'job_description': fake.sentence(nb_words=15),
            'cc_expire': fake.credit_card_expire(),
            'cc_number': fake.credit_card_number(),
            'cc_security_code': fake.credit_card_security_code(),
        }
        for _ in range(batch_size)
    ]


def generate_fake_table(num_records: int, batch_size: int) -> pa.Table:
    num_batches = num_records // batch_size
    generate_fake_rows_partial = partial(generate_fake_rows, batch_size)
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool:
        results = pool.map(generate_fake_rows_partial, range(num_batches))
    return pa.Table.from_pylist([row for batch in results for row in batch])

How many Arrow buffers do we get, and if we can smartly join these buffers (we already have a bin-packing algorithm in the code). We could use the |
I took the above code and did some investigation. I'll summarize. I used the above functions to generate 1 million records and save them in a feather file. We can use the Given this, it seems plausible to chunk the table into batches and use the Note, |
It seems like there's an upper bound to the size of the RecordBatch produced by I guess this is what you mean by bin-packing. We can bin-pack these batches into 512 MB parquet files. |
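To make the bin-packing idea concrete, here is a minimal sketch of a greedy, order-preserving packer. It is not the implementation in PyIceberg's utils/bin_packing.py; the function name `bin_pack` and its parameters are hypothetical and only illustrate grouping items up to a target weight.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def bin_pack(items: Iterable[T], weight: Callable[[T], int], target: int) -> List[List[T]]:
    """Greedy packing (hypothetical helper): open a new bin when the next item would exceed `target`."""
    bins: List[List[T]] = []
    current: List[T] = []
    current_weight = 0
    for item in items:
        w = weight(item)
        # Close the current bin only if it already holds something;
        # an oversized single item still gets a bin of its own.
        if current and current_weight + w > target:
            bins.append(current)
            current, current_weight = [], 0
        current.append(item)
        current_weight += w
    if current:
        bins.append(current)
    return bins


# Sizes stand in for RecordBatch byte counts.
sizes = [300, 200, 100, 400, 50]
packed = bin_pack(sizes, weight=lambda s: s, target=512)
print(packed)  # [[300, 200], [100, 400], [50]]
```

With Arrow, the same helper could presumably be fed `table.to_batches()` with `weight=lambda batch: batch.nbytes`, keeping in mind that this measures the in-memory size, not the size of the resulting Parquet file.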
Also, @bigluck, while running the code to generate the data using faker, I opened |
Thanks @kevinjqliu BTW it seems to use all the cores on my M2 Max: |
Do we know if this is for data generation, or also when writing? In the end, it would be good to be able to split the data into multiple files. The MacBooks have huge IO, so it might be that the CPU is the bottleneck and not the IO |
It was for data generation only. I can't seem to reproduce the parallelism issue for |
@Fokko can you point me to that? I couldn't find it |
@kevinjqliu It is under utils/bin_packing.py. |
thanks! I found it, had to fuzzy search in vscode :) Here's an example of bin-packing an Arrow table. I didn't fully understand |
Integrating this with the write path, I have 2 approaches
I like the second option, but we need to coordinate with how we implement partitioned writes. |
#444 something like this. wrote out 3 files
and reading it back returns the same number of records |
Oh interesting, the input is 1M records, 685.46 MB in memory. We bin-pack the Arrow representation into 256MB chunks ( We'd want to take into account the size of the parquet file when written to disk, not the size of the Arrow representation in memory. |
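The arithmetic behind the file count can be sketched as follows. The `on_disk_ratio` parameter is a hypothetical knob for the point made above: if the packer used an estimate of the Parquet size rather than the Arrow size, the number of files would change. The ratio of 0.3 used below is a made-up value for illustration, not a measurement.

```python
import math

MB = 1024 * 1024


def planned_file_count(in_memory_bytes: float, target_file_bytes: float, on_disk_ratio: float = 1.0) -> int:
    """Estimate how many files a size-based packer would produce (hypothetical helper)."""
    estimated_bytes = in_memory_bytes * on_disk_ratio
    return max(1, math.ceil(estimated_bytes / target_file_bytes))


# Packing by in-memory Arrow size: 685.46 MB into 256 MB chunks.
by_arrow_size = planned_file_count(685.46 * MB, 256 * MB)
print(by_arrow_size)  # 3

# Same data if Parquet were assumed to take 30% of the Arrow size (made-up ratio).
by_estimated_disk_size = planned_file_count(685.46 * MB, 256 * MB, on_disk_ratio=0.3)
print(by_estimated_disk_size)  # 1
```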
Here's how it's done in Java.
Default setting is 512 MB |
Ciao @kevinjqliu, thanks! I've tested it on the same
This is the final table parquet file on s3: |
hm. Looks like something weird is going on if the resulting parquet file is 1.6 GB. Each parquet file size should be at most 512 MB, if not less. See the bin packing logic. Here's something we can run for diagnostics,
You might have to change the For Write operations (append/overwrite), parallelism only kicks in during the actual writing. In order to take advantage of the parallelism, you'd have to set the |
Hey @kevinjqliu , we're currently debugging the issue on Slack, but I thought it would be helpful to report our findings here as well. In my tests, the pyarrow table is generated using the following code:
I've also cached the table on disk to save time, and it's read using the following code:
Although I know that using a record batch would be the right way to read the file, I'm explicitly using After importing, the
Let me know if you have any questions, and thanks for your time! |
@kevinjqliu, your latest changes are mind-blowing (#444 (comment) for reference) I have tested your last changes on
I have been experimenting with different settings to improve the write performance further, but without success. Overall, I am very impressed with how it works now! Well done! |
As a way to benchmark multithreaded writes to multiple parquet files, I've noticed that Duckdb's COPY command has the Using the
Result,
And setting FILE_SIZE_BYTES to 256MB,
I'm not sure if there's a way to specify the number of threads Duckdb can use. But with |
@kevinjqliu nice, duckdb should use https://duckdb.org/docs/sql/configuration.html |
Thanks! It
I also ran
just in case |
Here's the script I used to run the Wrote 14 files in around 16.5 seconds And on |
Fixed in #444 |
Apache Iceberg version
main (development)
Please describe the bug 🐞
While doing some tests with the latest RC (v0.6.0rc5), I generated a ~6.7 GB Arrow table and appended it to a new table.
In terms of performance, I got similar results (writing to S3) on these two types of EC2 machines:
- c5ad.8xlarge: 32 cores, 64 GB RAM, 10 Gbps NIC -> wrote 1 parquet file of 2 GB in 31 s
- c5ad.16xlarge: 64 cores, 128 GB RAM, 20 Gbps NIC -> wrote 1 parquet file of 1.6 GB in 28 s
Using htop, I noticed that the code was using only one thread during the append operation, which means that it's not parallelizing the write operation.