feat: High performance pandas integration. #24

Merged on Jan 4, 2023 (149 commits)
Changes from 124 commits
Commits
12660f3
Testing tweak.
amunra Oct 26, 2022
bdbd283
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra Oct 26, 2022
9e93993
Updated to c-questdb-client 2.1.1
amunra Oct 26, 2022
e56027c
Some progress..
amunra Oct 28, 2022
ed1658e
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra Nov 1, 2022
d692098
Fixed broken check.
amunra Nov 1, 2022
a92a858
symbols validation.
amunra Nov 1, 2022
bebbc8a
Added repl command to proj script.
amunra Nov 1, 2022
ab05dec
Progress with pandas method input validation.
amunra Nov 1, 2022
dc65b2c
Moved Python IntVec to C int_vec.
amunra Nov 1, 2022
a46490a
More code to avoid python lists.
amunra Nov 2, 2022
737f970
CI fixup.
amunra Nov 2, 2022
d574486
CI fixup 2.
amunra Nov 2, 2022
b1c17b0
CI fixup 3.
amunra Nov 2, 2022
fc99054
CI fixup 4.
amunra Nov 2, 2022
3654938
CI fixup 5.
amunra Nov 2, 2022
7102ddd
Types and buffers.
amunra Nov 4, 2022
5d6d4be
Improved column index handling, actually getting buffers from numpy a…
amunra Nov 4, 2022
95a66bc
Introducing a small rust lib to convert python strings to UTF-8 witho…
amunra Nov 4, 2022
e43e564
Implemented conversions for UCS1/2/4.
amunra Nov 4, 2022
25bc49b
Renamed rust lib and wired up the build and linkage bits in setup.py
amunra Nov 4, 2022
b199db6
Auxilliary rust lib rename.
amunra Nov 5, 2022
7b5200e
Reworked API for better buffer reuse and for individual UCS1, 2, 4 fu…
amunra Nov 5, 2022
e945702
cbindgen to generate C .h and Cython .pxd headers.
amunra Nov 5, 2022
161f6ae
cbindgen fixup & including lib headers from setup.py
amunra Nov 7, 2022
5b5d58b
Wrote Cython function to invoke 'pystr-to-utf8' lib. Additional refac…
amunra Nov 7, 2022
12598c4
Transitioned all string buffer conversions to new Rust code lib.
amunra Nov 8, 2022
3544312
Made pystr_to_utf8 addresses stable.
amunra Nov 8, 2022
9fce68b
Rust str to utf8 lib fixes (but still broken - ongoing)
amunra Nov 8, 2022
9155357
Minor unicode test improvement. Transcoding works now.
amunra Nov 8, 2022
f953ad0
Rust PyStr lib tests and a few bugfixes.
amunra Nov 8, 2022
ab1566c
Updated pystr lib readme.
amunra Nov 8, 2022
d79955f
More pystr-to_utf8 tests and improvements.
amunra Nov 9, 2022
e8980d5
Added UCS-4 tests.
amunra Nov 9, 2022
644bdba
More unicode testing.
amunra Nov 10, 2022
b071edf
Fixed include for cython generation compatability.
amunra Nov 10, 2022
091e4c8
Table name columns, symbols and timestamps now work!
amunra Nov 10, 2022
29895f6
Handling null column values in strings.
amunra Nov 11, 2022
ef6735b
Added arrow C data interface type definitions.
amunra Nov 11, 2022
ead851d
Code reorg.
amunra Nov 11, 2022
d14eea9
Consolidated approach writeup and code reorg into .pxi files.
amunra Nov 14, 2022
94378a0
Undid removal of ingress.c from gitignore.
amunra Nov 14, 2022
939c50d
More writeup with final types.
amunra Nov 15, 2022
2355b3d
Categories added to write-up.
amunra Nov 15, 2022
1c14cb7
Consolidated Pandas logic into single .pxi file.
amunra Nov 15, 2022
4f7ef81
File renaming.
amunra Nov 15, 2022
de4b471
Reorganised existing logic into a sorted array of col_t types. Some p…
amunra Nov 16, 2022
ef2424d
Documented float16, added col_source_t.
amunra Nov 16, 2022
993dd5f
Beginning to resolve columns.
amunra Nov 17, 2022
8387209
More array extraction logic.
amunra Nov 18, 2022
5e5158f
Updating of types, updating tech doc for timezone timestamps.
amunra Nov 21, 2022
2a26d8f
Fixed up most cython build issues. Mostly enum usage issues.
amunra Nov 21, 2022
e9890a6
Code builds again finally.
amunra Nov 21, 2022
7e75260
Dead code removal.
amunra Nov 21, 2022
0e67480
Types to dispatch codes to functions.
amunra Nov 21, 2022
41d3e6b
Some test fixup
amunra Nov 22, 2022
7a1c0b0
Yay, segfault!
amunra Nov 22, 2022
3af4899
Fixed a few segfaults, got some more.
amunra Nov 22, 2022
9dc17a3
Fixed segfaults.
amunra Nov 22, 2022
9665d9f
Added missing dispatch codes and lots of TODOs.
amunra Nov 22, 2022
194ad25
Got rid of a lot of INCREF/DECREF silliness.
amunra Nov 23, 2022
9ea4a0e
Ohh look. Tests pass again.
amunra Nov 23, 2022
7c03446
Fixed another segfault.
amunra Nov 23, 2022
80748ad
Another bug bites the dust.
amunra Nov 23, 2022
01d8732
Implemented symbols='auto' and i32 column support.
amunra Nov 23, 2022
579e649
Swapped out error prone 'bint / except False' declarations with 'void…
amunra Nov 23, 2022
df823b7
More string trouble.
amunra Nov 24, 2022
cb97a20
Normality restored.
amunra Nov 24, 2022
5c1aaff
py obj to symbols.
amunra Nov 24, 2022
dc5795f
Done timestamp at and columns. Found out that timezone timestamps are…
amunra Nov 24, 2022
b7cfd50
Added some testing notes.
amunra Nov 24, 2022
12c6eec
TODO fixup.
amunra Nov 24, 2022
8593b75
Added support for datetimes with timezones (only nanosecond based) vi…
amunra Nov 25, 2022
7f6503d
Bool column support from Python objects.
amunra Nov 25, 2022
25c91be
Added arrow-based boolean pandas datatype column support.
amunra Nov 25, 2022
8b31010
Support for arrow integer columns.
amunra Nov 25, 2022
6cad41b
Progress handling strings.
amunra Nov 28, 2022
1499cff
Support for objects with integers.
amunra Nov 28, 2022
0e9e287
Float object support.
amunra Nov 28, 2022
a1e7d07
arrow f32 and f64
amunra Nov 28, 2022
898b157
str column pyarrow.
amunra Nov 28, 2022
88e618a
LTO, basic perf tests, removed debug logging, fixed a bug in string c…
amunra Nov 30, 2022
0158fad
Tests for categories.
amunra Nov 30, 2022
fd01ef3
Releasing and reacquiring GIL to avoid starving other threads.
amunra Nov 30, 2022
dfdd302
Fully releasing GIL whenever possible. This was fiddly to get working.
amunra Nov 30, 2022
5041d26
Refactoring out benchmarks, refactoring Py str to UTF8 rust impl.
amunra Dec 1, 2022
e4135d0
8% perf improvements in Python string to UTF-8 conversions.
amunra Dec 1, 2022
754534c
Multithreading benchmark.
amunra Dec 1, 2022
5915a00
Implemented column (arrow and pybuffer) cleanup.
amunra Dec 1, 2022
9b75a12
Formatting.
amunra Dec 1, 2022
7f8dab4
Tested all-nulls column is altogether skipped.
amunra Dec 1, 2022
21f39f8
Refactoring and sorting columns in C.
amunra Dec 1, 2022
f28903e
Updated c-questdb-client submodule: Latest perf improvements.
amunra Dec 2, 2022
8b6652a
Fixed broken build.
amunra Dec 2, 2022
cba2aaf
Single logic to infer object column types.
amunra Dec 2, 2022
e07bb42
Tests fixup.
amunra Dec 2, 2022
bd4bca6
More tests.
amunra Dec 2, 2022
7c734b5
Fixed a bug passing None in datetime columns.
amunra Dec 2, 2022
480343d
Tests for degenerate pandas dataframes.
amunra Dec 2, 2022
b1f2ebf
Informative message for row of nulls.
amunra Dec 2, 2022
9811007
Mandating pyarrow dependency for pandas functionality.
amunra Dec 3, 2022
c03c4ed
There's a chance this will fix CI.
amunra Dec 5, 2022
b1a4dc7
Second attempt to fix up the CI.
amunra Dec 5, 2022
8b3e45d
Third attempt to fix up the CI.
amunra Dec 5, 2022
15330b0
Reduced stack size in case of errors to aid legibility.
amunra Dec 5, 2022
cee6d4b
Fourth attempt to fix up the CI.
amunra Dec 5, 2022
5668b57
Fifth attempt to fix up the CI.
amunra Dec 5, 2022
77a612c
Sixth attempt to fix up the CI.
amunra Dec 5, 2022
57ae0b8
Progress on API docs.
amunra Dec 6, 2022
a2763f9
Found and fixed a memory leak.
amunra Dec 7, 2022
960cd74
More fuzzing.
amunra Dec 7, 2022
5998a1c
Added support from taking the table name from the df.index.name, rena…
amunra Dec 8, 2022
be0407c
General fixes and testcases for handling timestamps.
amunra Dec 12, 2022
24ee3cd
Should fix tests in CI.
amunra Dec 12, 2022
e8d8daa
Extra testing of 'TimestampXXX.now()' and hopefully fixing CI.
amunra Dec 12, 2022
5f8e8ee
CI fixup attempt.
amunra Dec 12, 2022
6b77da9
Fixing broken 32-bit binaries.
amunra Dec 12, 2022
aaf7e95
Slimmed down 'col_t' type.
amunra Dec 13, 2022
b9b2081
Implemented (but not yet tested) pandas auto-flush logic. Also releas…
amunra Dec 13, 2022
dcccabd
Tweak to pandas auto-flush logic.
amunra Dec 13, 2022
98a5496
Basic pandas end-to-end test.
amunra Dec 13, 2022
02d49dd
Tests (and bugfixes) for panda's auto-flush.
amunra Dec 13, 2022
1801f10
Pandas API docs.
amunra Dec 14, 2022
45aa14b
Renamed '.pandas()' to '.dataframe()'.
amunra Dec 14, 2022
ada3ac8
Int object int64 bounds check tests.
amunra Dec 15, 2022
89c50b4
Test strided numpy array with zero-copy into pandas.
amunra Dec 15, 2022
64f14fa
Serializing subset of dataframe rows.
amunra Dec 15, 2022
def3887
Improved error messaging.
amunra Dec 15, 2022
81d6cb8
Testing chunked arrow arrays.
amunra Dec 15, 2022
83a937a
Removed completed TODOs
amunra Dec 15, 2022
712ec1d
Hopefully fixing CI.
amunra Dec 15, 2022
88b043e
Dataframe API doc fixup.
amunra Dec 15, 2022
58de10c
Fixing the CI
amunra Dec 15, 2022
b461557
Parquet rountrip test.
amunra Dec 16, 2022
9625447
Added missing libs in dev_requirements.txt
amunra Dec 28, 2022
02da96b
CI fixup (hopefully)
amunra Dec 28, 2022
e26f5fe
CI fixup (hopefully, again)
amunra Dec 28, 2022
6dd6cf6
CI fixup (once more, with feeling)
amunra Dec 28, 2022
67cedd9
More examples.
amunra Dec 29, 2022
0c7b6ef
Parquet data example.
amunra Dec 30, 2022
cd97af2
Updated parquet example, added to docs.
amunra Jan 2, 2023
ab69e9c
Updated examples manifest to hint at more examples for Pandas datafra…
amunra Jan 2, 2023
3af8c85
Disabled bytecode file gen for install_rust.py
amunra Jan 2, 2023
25d4e2b
Updated CHANGELOG.rst
amunra Jan 3, 2023
39dc427
Minor error reporting bugfix.
amunra Jan 4, 2023
7818149
Improved docs.
amunra Jan 4, 2023
46999e7
Updated c-questdb-client dependency.
amunra Jan 4, 2023
32f3394
Exception type tidy-up.
amunra Jan 4, 2023
38eb382
Fixed typos spotted during the code review.
amunra Jan 4, 2023
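
Several commits above mention converting Python strings to UTF-8 with separate UCS1, UCS2 and UCS4 code paths. As background, CPython stores each `str` with 1, 2 or 4 bytes per code point, chosen by the widest character in the string (PEP 393). The sketch below is not the PR's Rust code; it is a pure-Python illustration, using a hypothetical helper named `ucs_kind`, of which width a string would get and how large its UTF-8 encoding is.

```python
def ucs_kind(s: str) -> int:
    """Return the bytes per code point CPython would use for `s` (PEP 393).

    Hypothetical helper for illustration only; not part of the questdb package.
    """
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 1  # UCS1: every code point fits in one byte (Latin-1 range)
    if widest < 0x10000:
        return 2  # UCS2: every code point fits in the Basic Multilingual Plane
    return 4      # UCS4: at least one code point lies outside the BMP


# Sample strings written with escapes so the snippet itself stays ASCII-only.
samples = ["questdb", "caf\u00e9", "\u65e5\u672c\u8a9e", "\U0001F980"]
for text in samples:
    utf8 = text.encode("utf-8")
    print(f"{text!a}: UCS{ucs_kind(text)}, "
          f"{len(text)} code points, {len(utf8)} UTF-8 bytes")
```

A transcoder that knows the width up front can pick a specialised loop per kind, which is presumably why the commits expose individual UCS1, UCS2 and UCS4 functions.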
9 changes: 8 additions & 1 deletion .gitignore
@@ -1,7 +1,14 @@
src/questdb/ingress.html
src/questdb/ingress.c
src/questdb/*.html
rustup-init.exe

# Linux Perf profiles
perf.data*
perf/*.svg

# Atheris Crash/OOM and other files
fuzz-artifact/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
6 changes: 5 additions & 1 deletion .vscode/settings.json
@@ -1,3 +1,7 @@
 {
-    "esbonio.sphinx.confDir": ""
+    "esbonio.sphinx.confDir": "",
+    "cmake.configureOnOpen": false,
+    "files.associations": {
+        "ingress_helper.h": "c"
+    }
 }
4 changes: 0 additions & 4 deletions TODO.rst
@@ -6,8 +6,6 @@ TODO
 Build Tooling
 =============

-* **[HIGH]** Transition to Azure, move Linux arm to ARM pipeline without QEMU.
-
 * **[MEDIUM]** Automate Apple Silicon as part of CI.

 * **[LOW]** Release to PyPI from CI.
@@ -20,8 +18,6 @@ Docs
 are in the C client). This is to ensure they don't "bit rot" as the code
 changes.

-* **[MEDIUM]** Document on a per-version basis.
-
 Development
 ===========

16 changes: 8 additions & 8 deletions ci/cibuildwheel.yaml
@@ -68,7 +68,7 @@ stages:
 - bash: |
     set -o errexit
     python3 -m pip install --upgrade pip
-    pip3 install cibuildwheel==2.11.1
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
@@ -83,7 +83,7 @@ stages:
 - bash: |
     set -o errexit
     python3 -m pip install --upgrade pip
-    pip3 install cibuildwheel==2.11.1
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
@@ -100,7 +100,7 @@ stages:
 - bash: |
     set -o errexit
     python3 -m pip install --upgrade pip
-    pip3 install cibuildwheel==2.11.1
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
@@ -117,7 +117,7 @@ stages:
 - bash: |
     set -o errexit
     python3 -m pip install --upgrade pip
-    pip3 install cibuildwheel==2.11.1
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
@@ -134,7 +134,7 @@ stages:
 - bash: |
     set -o errexit
     python3 -m pip install --upgrade pip
-    pip3 install cibuildwheel==2.11.1
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
@@ -151,7 +151,7 @@ stages:
 - bash: |
     set -o errexit
     python3 -m pip install --upgrade pip
-    python3 -m pip install cibuildwheel==2.11.1
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
@@ -165,8 +165,8 @@ stages:
 - task: UsePythonVersion@0
 - bash: |
     set -o errexit
-    python -m pip install --upgrade pip
-    pip install cibuildwheel==2.11.1
+    python3 -m pip install --upgrade pip
+    python3 -m pip install cibuildwheel==2.11.2
   displayName: Install dependencies
 - bash: cibuildwheel --output-dir wheelhouse .
   displayName: Build wheels
71 changes: 71 additions & 0 deletions ci/pip_install_deps.py
@@ -0,0 +1,71 @@
import sys
import subprocess
import shlex
import textwrap
import platform


class UnsupportedDependency(Exception):
    pass


def pip_install(package):
    args = [
        sys.executable,
        '-m', 'pip', 'install',
        '--upgrade',
        '--only-binary', ':all:',
        package]
    args_s = ' '.join(shlex.quote(arg) for arg in args)
    sys.stderr.write(args_s + '\n')
    res = subprocess.run(
        args,
        stderr=subprocess.STDOUT,
        stdout=subprocess.PIPE)
    if res.returncode == 0:
        return
    output = res.stdout.decode('utf-8')
    if 'Could not find a version that satisfies the requirement' in output:
        raise UnsupportedDependency(output)
    else:
        sys.stderr.write(output + '\n')
        sys.exit(res.returncode)


def try_pip_install(package):
    try:
        pip_install(package)
    except UnsupportedDependency as e:
        msg = textwrap.indent(str(e), ' ' * 8)
        sys.stderr.write(f' Ignored unsatisfiable dependency:\n{msg}\n')


def ensure_timezone():
    try:
        import zoneinfo
        if platform.system() == 'Windows':
            pip_install('tzdata')  # for zoneinfo
    except ImportError:
        pip_install('pytz')


def main():
    ensure_timezone()
    try_pip_install('pandas')
    try_pip_install('numpy')
    try_pip_install('pyarrow')

    on_linux_is_glibc = (
        (not platform.system() == 'Linux') or
        (platform.libc_ver()[0] == 'glibc'))
    is_64bits = sys.maxsize > 2**32
    is_cpython = platform.python_implementation() == 'CPython'
    if on_linux_is_glibc and is_64bits and is_cpython:
        # Ensure that we've managed to install the expected dependencies.
        import pandas
        import numpy
        import pyarrow


if __name__ == "__main__":
    main()
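
The final check in `main()` insists on importable `pandas`, `numpy` and `pyarrow` only on platforms where binary wheels are expected: non-Linux or glibc-based Linux, 64-bit, and CPython. A minimal sketch of that predicate as a pure function follows, so the rule can be exercised without installing anything. The name `expects_binary_deps` and its parameters are illustrative and not part of the PR.

```python
import platform
import sys


def expects_binary_deps(system: str, libc: str, maxsize: int,
                        implementation: str) -> bool:
    """Mirror the gating condition in ci/pip_install_deps.py.

    Hypothetical helper: the script computes this inline inside main().
    """
    on_linux_is_glibc = system != 'Linux' or libc == 'glibc'
    is_64bits = maxsize > 2**32
    is_cpython = implementation == 'CPython'
    return on_linux_is_glibc and is_64bits and is_cpython


# Evaluate the rule for the interpreter running this snippet.
print(expects_binary_deps(
    platform.system(),
    platform.libc_ver()[0],
    sys.maxsize,
    platform.python_implementation()))
```

On platforms where the predicate is false, a missing wheel is merely logged by `try_pip_install`, which matches the script's "ignored unsatisfiable dependency" branch.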
8 changes: 7 additions & 1 deletion ci/run_tests_pipeline.yaml
@@ -28,11 +28,17 @@ stages:
submodules: true
- task: UsePythonVersion@0
- script: python3 --version
- script: python3 -m pip install cython
- script: |
python3 -m pip install cython
python3 ci/pip_install_deps.py
displayName: Installing Python dependencies
- script: python3 proj.py build
displayName: "Build"
- script: python3 proj.py test 1
displayName: "Test"
env:
JAVA_HOME: $(JAVA_HOME_11_X64)

# TODO: Add tests with and tests without installing `pyarrow` as it's
# an optional dependency (that can't always be installed).
# The tests without to test the fallback logic.
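
The TODO above describes running the suite both with and without `pyarrow`. How the project's tests would do this is not shown in this diff; the following is only a sketch of the usual optional-import pattern such tests tend to rely on, with a hypothetical helper named `optional_import`.

```python
import importlib


def optional_import(name: str):
    """Return the imported module, or None if it is not installed.

    Hypothetical helper for illustration; not taken from the PR.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


pyarrow = optional_import('pyarrow')
if pyarrow is None:
    print('pyarrow missing: the fallback path would be exercised')
else:
    print('pyarrow available: the arrow-based path would be exercised')
```

A CI matrix could then run the same tests in two environments, one with the package installed and one without, and let this check decide which branch is covered.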
4 changes: 3 additions & 1 deletion dev_requirements.txt
@@ -1,8 +1,10 @@
 setuptools>=45.2.0
 Cython>=0.29.32
 wheel>=0.34.2
-cibuildwheel>=2.11.1
+cibuildwheel>=2.11.2
 Sphinx>=5.0.2
 sphinx-rtd-theme>=1.0.0
 twine>=4.0.1
 bump2version>=1.0.1
+pandas>=1.3.5
+numpy>=1.21.6
28 changes: 28 additions & 0 deletions perf/README.md
@@ -0,0 +1,28 @@
# Profiling with Linux Perf

https://juanjose.garciaripoll.com/blog/profiling-code-with-linux-perf/index.html

```bash
$ TEST_QUESTDB_PATCH_PATH=1 perf record -g --call-graph dwarf python3 test/benchmark.py -v TestBencharkPandas.test_string_encoding_1m
test_string_encoding_1m (__main__.TestBencharkPandas.test_string_encoding_1m) ... Time: 4.682273147998785, size: 4593750000
ok

----------------------------------------------------------------------
Ran 1 test in 10.166s

OK
[ perf record: Woken up 1341 times to write data ]
Warning:
Processed 54445 events and lost 91 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 405.575 MB perf.data (50622 samples) ]
```

# Rendering results

```bash
$ perf script | python3 perf/gprof2dot.py --format=perf | dot -Tsvg > perf/profile_graph.svg
$ (cd perf && python3 -m http.server)
```
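
The benchmark output in the README reports a `Time` and a `size` for the run. Assuming `size` is a byte count and `Time` is in seconds (neither unit is stated in the output), the two figures can be turned into a throughput number. The helper name `throughput_mb_per_s` is made up for this sketch.

```python
def throughput_mb_per_s(size_bytes: int, seconds: float) -> float:
    """Convert a byte count and elapsed time into MB/s (1 MB = 10**6 bytes)."""
    if seconds <= 0:
        raise ValueError('elapsed time must be positive')
    return size_bytes / seconds / 1e6


# Figures copied from the sample output above.
# Treating `size` as bytes and `Time` as seconds is an assumption.
rate = throughput_mb_per_s(4593750000, 4.682273147998785)
print(f'{rate:.0f} MB/s')
```

Note that `perf record` adds overhead of its own, so a rate computed from a profiled run is likely lower than an unprofiled one.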