Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration tests performance degradation #604

Open
kevinjqliu opened this issue Apr 13, 2024 · 8 comments
Open

Integration tests performance degradation #604

kevinjqliu opened this issue Apr 13, 2024 · 8 comments

Comments

@kevinjqliu
Copy link
Contributor

Apache Iceberg version

main (development)

Please describe the bug 🐞

Integration tests feel significantly slower than before.

make test-integration

Running on the latest main branch took 266 seconds, which is more than 4 minutes

/usr/bin/time make test-integration

=============================================================== 765 passed, 4 skipped, 2415 deselected, 24 warnings in 217.18s (0:03:37) ================================================================
      265.95 real        17.03 user         7.11 sys

Compare this to pyiceberg-0.6.0rc6 tag, integration tests only took 100 seconds.

git checkout pyiceberg-0.6.0rc6
/usr/bin/time make test-integration

============================================================== 508 passed, 4 skipped, 2305 deselected in 48.61s ==============================================================
       99.89 real         9.43 user         4.16 sys

Here are the top 10 slowest tests:

PYTEST_ARGS="--durations=10" make test-integration

============================================================================ slowest 10 durations ============================================================================
5.93s call     tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
3.52s call     tests/integration/test_reads.py::test_ray_nan[catalog_hive]
3.12s setup    tests/integration/test_add_files.py::test_add_files_to_unpartitioned_table[1]
2.66s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-int]
2.62s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[int]
2.46s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_appended_null_partitioned[2-binary]
2.35s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-float]
2.26s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_v1_v2_append_null[bool]
2.21s call     tests/integration/test_writes/test_partitioned_writes.py::test_query_filter_null_partitioned[1-long]
2.21s call     tests/integration/test_reads.py::test_table_properties_dict[catalog_hive]
@kevinjqliu
Copy link
Contributor Author

Some potential optimizations:

  • parallelize pytest execution, this requires that each test can be independently run
  • un-paratermized some tests, I noticed that tests like test_query_filter_only_nulls are parameterized by many args making the number of tests equal to the args multiplied together.

@Gowthami03B
Copy link
Contributor

Interesting, @kevinjqliu are you already working on this?

@kevinjqliu
Copy link
Contributor Author

Yeah, I started looking at ways to optimize all the tests. I'm blocked on making tests run in isolation. There's a wip PR #598

@kevinjqliu
Copy link
Contributor Author

Here's what I'm blocked on specifically.

Parallelize this test test_query_filter_appended_null,

PYTEST_ARGS="-n auto -k test_query_filter_appended_null" /usr/bin/time make test-integration

The problem is the table default.arrow_table_v1_with_null used has scope="session", autouse=True, which because of the session scope, can't run in parallel

@Fokko
Copy link
Contributor

Fokko commented Apr 19, 2024

I would love to check where the time is being spent.

Concerning parallelization, I'm a bit hesitant since I think this will also apply more pressure on the rest catalog. I'm doubtful if it will really speed up the tests, but the numbers will tell.

@kevinjqliu
Copy link
Contributor Author

kevinjqliu commented Apr 21, 2024

Making all the tests run in parallel requires lots and lots of changes.

For the short term, the best way forward is to look at individual tests. pytest's --duration shows parameterized tests as separate tests. I wrote a script to roll up the parameterized tests

Here are the top 10 tests by duration, group by test name for parameterized tests

tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_appended_null_partitioned, 22 tests, took 47.68
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_null_partitioned, 22 tests, took 39.73
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_only_nulls_partitioned, 22 tests, took 29.37
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_v1_v2_append_null, 11 tests, took 22.38
tests/integration/test_writes/test_partitioned_writes.py:test_query_filter_without_data_partitioned, 22 tests, took 17.06
tests/integration/test_add_files.py:test_add_files_to_unpartitioned_table, 3 tests, took 13.00
tests/integration/test_partitioning_key.py:test_partition_key, 28 tests, took 6.25
tests/integration/test_rest_schema.py:test_disallowed_updates, 337 tests, took 5.88
tests/integration/test_reads.py:test_ray_nan, 2 tests, took 3.88
tests/integration/test_writes/test_writes.py:test_query_filter_appended_null, 24 tests, took 3.66

Top 5 tests are all from test_partitioned_writes.py, totaling 156.22 seconds (2.6 minutes).
These tests are heavily parameterized

@pytest.mark.parametrize(
    "part_col", ['int', 'bool', 'string', "string_long", "long", "float", "double", "date", "timestamptz", "timestamp", "binary"]
)

Copy link

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 19, 2024
Copy link

github-actions bot commented Nov 2, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 2, 2024
@kevinjqliu kevinjqliu removed the stale label Nov 2, 2024
@kevinjqliu kevinjqliu reopened this Nov 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants