Statically linked with libsqlite3.a with LTO enabled #16

Open
wants to merge 28 commits into
base: master

@NobodyXu NobodyXu commented Jul 27, 2021

This PR enables binaries that use rusqlite to be statically linked with a libsqlite3.a compiled with LTO, using linker-plugin-lto.

To compile these binaries (excluding basic_async.rs), just run make -j $(nproc).
It will compile sqlite3.c using CFLAGS='-O2 -flto'.
The generated binaries are smaller, but LTO only provides minor performance improvements (see the benchmarks in the comments below). It seems that I had not enabled LTO on the Rust side (see comments below).

To compile basic_async.rs, run cargo build --release --bin basic_async --features async-sql.

This PR might be related to #14

Signed-off-by: Jiahao XU [email protected]

NobodyXu added 17 commits July 26, 2021 17:38
Signed-off-by: Jiahao XU <[email protected]>
in `cargo build`

Signed-off-by: Jiahao XU <[email protected]>
Signed-off-by: Jiahao XU <[email protected]>
that automatically compile sqlite3

Signed-off-by: Jiahao XU <[email protected]>
Signed-off-by: Jiahao XU <[email protected]>
Since sqlx enables the `bundled` feature of `libsqlite3-sys`, which is also
used by `rusqlite`, and that prevents my LTO-compiled libsqlite3.a from being
used, I added a feature to disable it.

Signed-off-by: Jiahao XU <[email protected]>
SQLITE3_INCLUDE_DIR is necessary since libsqlite3-sys does not use sqlite3.pc
found using `SQLITE3_LIB_DIR` for locating the header.

Signed-off-by: Jiahao XU <[email protected]>
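
As a minimal sketch of how the environment variables mentioned in these commits could be passed to a build, the helper below prepares a `cargo build --release` invocation. The function name and the directory argument are hypothetical, and setting `SQLITE3_STATIC` is an assumption about how static linking would be requested from `libsqlite3-sys`; the Makefile in this PR may do it differently.

```rust
use std::path::Path;
use std::process::Command;

/// Prepare (but do not run) a `cargo build --release` command that points
/// `libsqlite3-sys` at a pre-built SQLite in `sqlite_dir`.
/// `sqlite_dir` is a hypothetical directory holding both libsqlite3.a and sqlite3.h.
fn cargo_build_with_sqlite(sqlite_dir: &Path) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(&["build", "--release"])
        // Where the linker should look for libsqlite3.a.
        .env("SQLITE3_LIB_DIR", sqlite_dir)
        // Where the header lives; set explicitly, as the commit message explains.
        .env("SQLITE3_INCLUDE_DIR", sqlite_dir)
        // Assumption: request static rather than dynamic linking.
        .env("SQLITE3_STATIC", "1");
    cmd
}
```

Calling `cargo_build_with_sqlite(Path::new("path/to/sqlite")).status()` would then run the build; this only has an effect when the `bundled` feature is not enabled.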
@NobodyXu
Author

NobodyXu commented Jul 27, 2021

This benchmark result is outdated

Tue Jul 27 14:30:28 AEST 2021 [RUST] basic (100_000_000) inserts

real    2m46.459s
user    2m45.606s
sys     0m0.840s
Tue Jul 27 14:33:15 AEST 2021 [RUST] basic_batched (100_000_000) inserts

real    0m24.468s
user    0m23.676s
sys     0m0.790s
Tue Jul 27 14:33:39 AEST 2021 [RUST] basic_batched_wp (100_000_000) inserts

real    2m16.578s
user    2m15.077s
sys     0m1.500s
Tue Jul 27 14:35:56 AEST 2021 [RUST] basic_prep (100_000_000) inserts

real    0m57.312s
user    0m56.442s
sys     0m0.860s
Tue Jul 27 14:36:53 AEST 2021 [RUST] threaded_batched (100_000_000) inserts

real    0m24.522s
user    0m32.507s
sys     0m5.327s
Tue Jul 27 14:37:18 AEST 2021 [RUST] threaded_busy (100_000_000) inserts

real    0m1.614s
user    0m15.816s
sys     0m0.929s
Tue Jul 27 14:37:20 AEST 2021 [RUST] threaded_str_batched (100_000_000) inserts

real    2m13.930s
user    2m28.820s
sys     0m2.238s

It seems that linking sqlite3 with LTO does provide some minor benefit.

@NobodyXu
Author

I looked at the assembly and found that not all sqlite3_* functions are inlined.

I think this might have something to do with the default profile.release in cargo:

lto = false
panic = 'unwind'

Since lto is disabled by default, this might explain why there isn't much improvement at all.

Also, panic = "unwind" has an effect on performance.

@NobodyXu
Author

The latest commit enables lto, sets panic to unwind, and sets codegen-units to 1.

However, I currently don't have any time to benchmark this new commit.

@NobodyXu
Author

NobodyXu commented Jul 27, 2021

Here's the up-to-date benchmark:

Tue Jul 27 20:49:47 AEST 2021 [RUST] basic (100_000_000) inserts

real    2m49.296s
user    2m48.535s
sys     0m0.750s
Tue Jul 27 20:52:37 AEST 2021 [RUST] basic_batched (100_000_000) inserts

real    0m22.040s
user    0m21.170s
sys     0m0.870s
Tue Jul 27 20:52:59 AEST 2021 [RUST] basic_batched_wp (100_000_000) inserts

real    2m14.802s
user    2m13.190s
sys     0m1.610s
Tue Jul 27 20:55:14 AEST 2021 [RUST] basic_prep (100_000_000) inserts

real    0m52.067s
user    0m51.294s
sys     0m0.770s
Tue Jul 27 20:56:06 AEST 2021 [RUST] busy (100_000_000) inserts

real    0m6.143s
user    0m5.922s
sys     0m0.220s
Tue Jul 27 20:56:12 AEST 2021 [RUST] threaded_batched (100_000_000) inserts

real    0m21.247s
user    0m27.526s
sys     0m5.566s
Tue Jul 27 20:56:33 AEST 2021 [RUST] threaded_busy (100_000_000) inserts

real    0m1.219s
user    0m11.483s
sys     0m0.656s
Tue Jul 27 20:56:35 AEST 2021 [RUST] threaded_str_batched (100_000_000) inserts

real    2m18.352s
user    2m30.856s
sys     0m2.062s
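
To put this run side by side with the outdated one earlier in the thread, here is a small sketch that parses the `real` times and computes the relative change per benchmark. The helper names are illustrative; the numbers are copied from the two logs, and `busy` is left out because it only appears in the newer run.

```rust
/// Parse a `time`-style duration such as "2m46.459s" into seconds.
fn parse_real(s: &str) -> Option<f64> {
    let s = s.trim().strip_suffix('s')?;
    let (minutes, seconds) = s.split_once('m')?;
    let minutes: f64 = minutes.parse().ok()?;
    let seconds: f64 = seconds.parse().ok()?;
    Some(minutes * 60.0 + seconds)
}

/// Relative change from `before` to `after` in percent (negative means faster).
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// (benchmark, `real` in the outdated run, `real` in the up-to-date run)
const RUNS: &[(&str, &str, &str)] = &[
    ("basic", "2m46.459s", "2m49.296s"),
    ("basic_batched", "0m24.468s", "0m22.040s"),
    ("basic_batched_wp", "2m16.578s", "2m14.802s"),
    ("basic_prep", "0m57.312s", "0m52.067s"),
    ("threaded_batched", "0m24.522s", "0m21.247s"),
    ("threaded_busy", "0m1.614s", "0m1.219s"),
    ("threaded_str_batched", "2m13.930s", "2m18.352s"),
];

/// Percent change in wall-clock time for every benchmark in `RUNS`.
fn report() -> Vec<(&'static str, f64)> {
    RUNS.iter()
        .filter_map(|&(name, before, after)| {
            Some((name, percent_change(parse_real(before)?, parse_real(after)?)))
        })
        .collect()
}
```

With these numbers, `basic_batched` comes out roughly 10% faster and `basic` roughly 2% slower. Each figure is a single run, so differences of a few percent may well be noise.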

However, after investigation, I still found many sqlite3_* symbols in the generated binary, which means the functions mostly aren't inlined.
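
One way to repeat this check is to list the binary's symbols with `nm` and keep the code symbols whose names start with `sqlite3_`. The sketch below assumes `nm` is on the `PATH`; the function names and the binary path passed to them are hypothetical.

```rust
use std::process::Command;

/// Run `nm` on `binary` and return its standard output as text.
fn nm_output(binary: &str) -> std::io::Result<String> {
    let out = Command::new("nm").arg(binary).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

/// Return the names of text-section symbols (type `T` or `t`) in `nm` output
/// whose names start with `sqlite3_`.
fn sqlite3_text_symbols(nm_output: &str) -> Vec<&str> {
    nm_output
        .lines()
        .filter_map(|line| {
            // A line looks like "<address> <type> <name>"; undefined symbols
            // have no address, so read the fields from the end.
            let mut fields = line.split_whitespace().rev();
            let name = fields.next()?;
            let kind = fields.next()?;
            (matches!(kind, "T" | "t") && name.starts_with("sqlite3_")).then(|| name)
        })
        .collect()
}
```

A remaining symbol only shows that an out-of-line copy of the function exists. Individual call sites may still have been inlined, so the disassembly of the callers is the more reliable evidence.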

@NobodyXu
Author

NobodyXu commented Jul 27, 2021

The strange thing is that when I looked at the disassembly using objdump -d, I found no call to any sqlite3_* function in the main function.

I think I was mistaken, as that main is provided by libstd as a wrapper around the actual main written in Rust.

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

After enabling linker-plugin-lto for the whole workspace, it seems that cross-language LTO finally works.
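
For reference, a sketch of what a workspace-wide build with linker-plugin LTO could look like. The flags follow the combination documented in the rustc book; using clang as the linker driver and lld as the linker is an assumption here, and the function name is hypothetical. The exact setup in this PR may differ.

```rust
use std::process::Command;

/// rustc flags for linker-plugin LTO.
/// Assumption: clang drives the link and lld is the linker.
const LTO_RUSTFLAGS: &str = "-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld";

/// Prepare (but do not run) a release build of every workspace member with
/// linker-plugin LTO enabled through `RUSTFLAGS`.
fn cargo_build_cross_lto() -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(&["build", "--release", "--workspace"])
        .env("RUSTFLAGS", LTO_RUSTFLAGS);
    cmd
}
```

For the linker to optimize across the language boundary, the C side generally has to be compiled to LLVM bitcode as well (for example with clang and `-flto`), using an LLVM version compatible with the one rustc was built against.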

Here's the benchmark:

Wed Aug  4 11:33:50 IST 2021 [PYTHON] running basic (10_000_000) inserts

real    2m40.541s
user    2m39.687s
sys     0m0.850s
Wed Aug  4 11:36:31 IST 2021 [PYTHON] running basic_batched (10_000_000) inserts

real    0m22.865s
user    0m22.014s
sys     0m0.850s
Wed Aug  4 11:36:54 IST 2021 [PYTHON] running basic_batched_wp (10_000_000) inserts

real    2m19.619s
user    2m18.259s
sys     0m1.360s
Wed Aug  4 11:39:13 IST 2021 [PYTHON] running basic_prep (10_000_000) inserts

real    0m53.305s
user    0m52.571s
sys     0m0.730s
Wed Aug  4 11:40:07 IST 2021 [PYTHON] running busy (10_000_000) inserts

real    0m5.994s
user    0m5.622s
sys     0m0.350s
Wed Aug  4 11:40:13 IST 2021 [PYTHON] running threaded_batched (10_000_000) inserts

real    0m21.945s
user    0m29.315s
sys     0m5.388s
Wed Aug  4 11:40:35 IST 2021 [PYTHON] running threaded_str_batched (10_000_000) inserts

real    2m15.994s
user    2m30.199s
sys     0m2.240s

There isn't much improvement, so I will use flamegraph to profile the binaries.

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

Here are the flamegraphs for the rust binaries.

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

basic flamegraph

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

basic_batched flamegraph

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

basic_batched_wp flamegraph

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

basic_prep flamegraph

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

threaded_batched flamegraph

@NobodyXu
Author

NobodyXu commented Aug 4, 2021

threaded_str_batched flamegraph
