chore: prefetch keys during transaction scheduling #4355
// Prefetch buckets that might hold the key with high probability.
__builtin_prefetch(&target, 0, 1);
__builtin_prefetch(&probe, 0, 1);
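For readers without the full diff, here is a minimal sketch of the pattern these two lines follow: hint at every bucket a batch of keys may touch, then do the lookups. `Bucket`, `Table`, `TargetIndex`, `ProbeIndex` and `CountHits` are hypothetical names invented for illustration and are not the dashtable's actual types or layout; `__builtin_prefetch` is the GCC/Clang builtin, so this assumes one of those compilers.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical bucket: one 64-byte cache line of stored hashes (0 = empty slot).
struct alignas(64) Bucket {
  uint64_t hashes[8] = {0};
};

// Hypothetical table in which a key lives either in its home ("target")
// bucket or in the neighboring ("probe") bucket.
struct Table {
  std::vector<Bucket> buckets;
  explicit Table(size_t n) : buckets(n) {}

  size_t TargetIndex(uint64_t hash) const { return hash % buckets.size(); }
  size_t ProbeIndex(uint64_t hash) const { return (TargetIndex(hash) + 1) % buckets.size(); }

  // Store a non-zero hash in the first free slot of target, then probe.
  bool Insert(uint64_t hash) {
    for (size_t idx : {TargetIndex(hash), ProbeIndex(hash)})
      for (uint64_t& slot : buckets[idx].hashes)
        if (slot == 0) { slot = hash; return true; }
    return false;
  }

  // Read prefetch (rw = 0) with low temporal locality (locality = 1),
  // the same arguments as in the diff above.
  void Prefetch(uint64_t hash) const {
    __builtin_prefetch(&buckets[TargetIndex(hash)], 0, 1);
    __builtin_prefetch(&buckets[ProbeIndex(hash)], 0, 1);
  }

  bool Contains(uint64_t hash) const {
    for (size_t idx : {TargetIndex(hash), ProbeIndex(hash)})
      for (uint64_t h : buckets[idx].hashes)
        if (h == hash) return true;
    return false;
  }
};

// Phase 1 issues prefetch hints for the whole batch; phase 2 does the lookups,
// by which time the cache lines are hopefully already on their way.
size_t CountHits(const Table& t, const std::vector<uint64_t>& hashes) {
  for (uint64_t h : hashes) t.Prefetch(h);
  size_t hits = 0;
  for (uint64_t h : hashes) hits += t.Contains(h);
  return hits;
}
```

Dropping the second `__builtin_prefetch` call in `Prefetch` gives the variant asked about in the review, where only the target bucket is hinted.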
Under high load this can put extra pressure on memory bandwidth. Have you tested without `__builtin_prefetch(&probe, 0, 1);`?
But we are going to look up these keys anyway.
No, I have not. I will check.
Could you also check these changes on a 64-core or, if possible, 128-core/thread CPU, where memory bandwidth pressure can be higher?
Regarding the bandwidth concerns: DDR5 allows 64 GB/s per DIMM: https://en.wikipedia.org/wiki/DDR5_SDRAM#:~:text=DDR5%20also%20has%20higher%20frequencies,s)%20of%20bandwidth%20per%20DIMM. 64-CPU servers are likely to have at least 32-64 DIMMs.
Here we are talking about 1-50M key ops per second, and each prefetch loads 64 bytes. That is roughly 300 MB/s at 5M ops/s and about 3.2 GB/s at 50M ops/s. I do not think these numbers are even close to being a concern.
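As a quick check of the arithmetic in this comment, the sketch below multiplies an op rate by the 64-byte cache line size. `PrefetchBytesPerSec` and its `lines_per_op` parameter are made up for illustration; the second parameter only models the fact that the diff prefetches two buckets per key. Under these assumptions, one line per key gives 320 MB/s at 5M ops/s and 3.2 GB/s at 50M ops/s, and two lines per key doubles that. All of these stay well below the 64 GB/s per-DIMM figure cited above.

```cpp
#include <cstdint>

// Size of one cache line, i.e. the amount of data a single prefetch pulls in.
constexpr uint64_t kCacheLineBytes = 64;

// Bytes per second fetched by prefetching, given a key-op rate and the number
// of cache lines prefetched per key (hypothetical helper, for illustration).
constexpr uint64_t PrefetchBytesPerSec(uint64_t ops_per_sec, uint64_t lines_per_op) {
  return ops_per_sec * lines_per_op * kCacheLineBytes;
}

// 5M ops/s, one line per key: 320 MB/s.
static_assert(PrefetchBytesPerSec(5'000'000, 1) == 320'000'000ULL);
// 50M ops/s, one line per key: 3.2 GB/s.
static_assert(PrefetchBytesPerSec(50'000'000, 1) == 3'200'000'000ULL);
// 50M ops/s, target and probe buckets both prefetched: 6.4 GB/s.
static_assert(PrefetchBytesPerSec(50'000'000, 2) == 6'400'000'000ULL);
```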
Let me clarify my concern. DDR5 performance is great (as far as I know, the maximum number of channels is 12), but the CPU cache is not that large, and it is less effective for databases than for other application types. Loading data that we only need with some probability evicts other data from the cache, which we then have to load again.
What I mean is: if we get the same result without prefetching the next bucket, we don't need to load it.
I actually have not managed to reproduce the improvement at all. The difference in performance is within the statistical noise of benchmarking. I am putting this on hold.
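One way to make "within the statistical noise" concrete is to repeat each configuration several times and compare the difference in means with the run-to-run spread. The sketch below computes Welch's t statistic for two sets of runs. It is only an illustration: `Summarize` and `WelchT` are hypothetical helpers, and treating |t| below about 2 as noise is a rule of thumb, not something this PR used.

```cpp
#include <cmath>
#include <vector>

struct Stats {
  double mean;
  double stddev;  // sample standard deviation (divides by n - 1)
};

// Mean and sample standard deviation of repeated measurements (needs n >= 2).
Stats Summarize(const std::vector<double>& xs) {
  double sum = 0;
  for (double x : xs) sum += x;
  const double mean = sum / xs.size();
  double sq = 0;
  for (double x : xs) sq += (x - mean) * (x - mean);
  return {mean, std::sqrt(sq / (xs.size() - 1))};
}

// Welch's t statistic: difference of the means divided by its standard error.
// Positive values mean the `after` runs have the higher mean.
double WelchT(const std::vector<double>& before, const std::vector<double>& after) {
  const Stats b = Summarize(before);
  const Stats a = Summarize(after);
  const double se = std::sqrt(b.stddev * b.stddev / before.size() +
                              a.stddev * a.stddev / after.size());
  return (a.mean - b.mean) / se;
}
```

For example, feeding it the aggregate QPS of five runs per build gives a single number to compare against the chosen threshold, instead of eyeballing two individual runs.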
To demonstrate the improvement, I ran read-only traffic against a datastore already prefilled with "debug populate 10000000 key 1000". The traffic has a 100% miss rate in order to zoom in on the flow handled by this PR, which is looking up a key in the dashtable.

For the same reason, I used pipelining: it reduces the impact of networking CPU on the server side and makes the workload more memory-intensive.

This improvement:
1. Reduces the running time by 12% (or, equivalently, increases the average QPS by 13%).
2. Reduces p99 latency by 12%, from 1700 to 1500 usec.

Credit for the idea:
https://valkey.io/blog/unlock-one-million-rps/

Detailed runs:

Before this change:
```
~/projects/dragonfly/build-opt$ ./dfly_bench -p 6380 --qps=9500 --ratio=0:1 -h 10.142.15.215 -n 4000000 --key_prefix=k --proactor_threads=8 -c 10
Running 8 threads, sending 4000000 requests per each connection, or 320000000 requests overall
At a rate of 9500 rps per connection, i.e. request every 105us
Overall scheduled RPS: 760000
5s: 1.1% done, RPS(now/agg): 710271/710271, errs: 0, hitrate: 0.0%, clients: 80
done_min: 0.96%, done_max: 1.19%, p99_lat(us): 1786, max_pending: 11
10s: 2.2% done, RPS(now/agg): 703190/706730, errs: 0, hitrate: 0.0%, clients: 80
done_min: 1.90%, done_max: 2.38%, p99_lat(us): 1788, max_pending: 11
90s: 20.0% done, RPS(now/agg): 703373/711583, errs: 0, hitrate: 0.0%, clients: 80
done_min: 17.68%, done_max: 21.38%, p99_lat(us): 1778, max_pending: 11
345s: 76.3% done, RPS(now/agg): 734230/707276, errs: 0, hitrate: 0.0%, clients: 80
done_min: 68.83%, done_max: 81.94%, p99_lat(us): 1779, max_pending: 11
350s: 77.3% done, RPS(now/agg): 696489/707122, errs: 0, hitrate: 0.0%, clients: 80
done_min: 69.84%, done_max: 83.13%, p99_lat(us): 1778, max_pending: 11
450s: 97.3% done, RPS(now/agg): 400617/691734, errs: 0, hitrate: 0.0%, clients: 45
done_min: 89.85%, done_max: 100.00%, p99_lat(us): 1779, max_pending: 11
455s: 97.7% done, RPS(now/agg): 250114/686881, errs: 0, hitrate: 0.0%, clients: 24
done_min: 90.80%, done_max: 100.00%, p99_lat(us): 1780, max_pending: 11
460s: 98.0% done, RPS(now/agg): 179637/681368, errs: 0, hitrate: 0.0%, clients: 24
done_min: 91.76%, done_max: 100.00%, p99_lat(us): 1781, max_pending: 11
465s: 98.3% done, RPS(now/agg): 210018/676299, errs: 0, hitrate: 0.0%, clients: 24
done_min: 92.76%, done_max: 100.00%, p99_lat(us): 1781, max_pending: 11
470s: 98.6% done, RPS(now/agg): 184117/671063, errs: 0, hitrate: 0.0%, clients: 24
done_min: 93.72%, done_max: 100.00%, p99_lat(us): 1782, max_pending: 11
475s: 98.8% done, RPS(now/agg): 156475/665647, errs: 0, hitrate: 0.0%, clients: 19
done_min: 94.68%, done_max: 100.00%, p99_lat(us): 1783, max_pending: 11
480s: 99.0% done, RPS(now/agg): 148995/660265, errs: 0, hitrate: 0.0%, clients: 19
done_min: 95.65%, done_max: 100.00%, p99_lat(us): 1783, max_pending: 11
485s: 99.3% done, RPS(now/agg): 148889/654992, errs: 0, hitrate: 0.0%, clients: 19
done_min: 96.60%, done_max: 100.00%, p99_lat(us): 1784, max_pending: 11
490s: 99.5% done, RPS(now/agg): 148289/649822, errs: 0, hitrate: 0.0%, clients: 19
done_min: 97.55%, done_max: 100.00%, p99_lat(us): 1784, max_pending: 11
495s: 99.7% done, RPS(now/agg): 147537/644749, errs: 0, hitrate: 0.0%, clients: 19
done_min: 98.52%, done_max: 100.00%, p99_lat(us): 1785, max_pending: 11
500s: 100.0% done, RPS(now/agg): 145938/639761, errs: 0, hitrate: 0.0%, clients: 11
done_min: 99.51%, done_max: 100.00%, p99_lat(us): 1785, max_pending: 11
Total time: 8m21.153171955s. Overall number of requests: 320000000, QPS: 638722
Latency summary, all times are in usec:
Count: 320000000 Average: 903.4699 StdDev: 2207414.49
Min: 53.0000 Median: 900.5397 Max: 13940.0000
------------------------------------------------------
[ 50, 60 ) 98 0.000% 0.000%
[ 60, 70 ) 1368 0.000% 0.000%
[ 70, 80 ) 6217 0.002% 0.002%
[ 80, 90 ) 17120 0.005% 0.008%
[ 90, 100 ) 36010 0.011% 0.019%
[ 100, 120 ) 168280 0.053% 0.072%
[ 120, 140 ) 429397 0.134% 0.206%
[ 140, 160 ) 868176 0.271% 0.477%
[ 160, 180 ) 1513899 0.473% 0.950%
[ 180, 200 ) 2299055 0.718% 1.669%
[ 200, 250 ) 8282542 2.588% 4.257% #
[ 250, 300 ) 10372276 3.241% 7.498% #
[ 300, 350 ) 11892829 3.717% 11.215% #
[ 350, 400 ) 12378963 3.868% 15.083% #
[ 400, 450 ) 11577678 3.618% 18.701% #
[ 450, 500 ) 10591660 3.310% 22.011% #
[ 500, 600 ) 20705038 6.470% 28.481% #
[ 600, 700 ) 22463042 7.020% 35.501% #
[ 700, 800 ) 23769529 7.428% 42.929% #
[ 800, 900 ) 22512946 7.035% 49.964% #
[ 900, 1000 ) 21098245 6.593% 56.558% #
[ 1000, 1200 ) 48858666 15.268% 71.826% ###
[ 1200, 1400 ) 49938490 15.606% 87.432% ###
[ 1400, 1600 ) 28313693 8.848% 96.280% ##
[ 1600, 1800 ) 9371830 2.929% 99.208% #
[ 1800, 2000 ) 1656441 0.518% 99.726%
[ 2000, 2500 ) 392161 0.123% 99.849%
[ 2500, 3000 ) 128840 0.040% 99.889%
[ 3000, 3500 ) 121288 0.038% 99.927%
[ 3500, 4000 ) 91733 0.029% 99.955%
[ 4000, 4500 ) 60773 0.019% 99.974%
[ 4500, 5000 ) 36645 0.011% 99.986%
[ 5000, 6000 ) 30751 0.010% 99.996%
[ 6000, 7000 ) 7415 0.002% 99.998%
[ 7000, 8000 ) 1478 0.000% 99.998%
[ 8000, 9000 ) 1072 0.000% 99.999%
[ 9000, 10000 ) 1199 0.000% 99.999%
[ 10000, 12000 ) 1897 0.001% 100.000%
[ 12000, 14000 ) 1260 0.000% 100.000%
```

With this change:
```
Running 8 threads, sending 4000000 requests per each connection, or 320000000 requests overall
At a rate of 9500 rps per connection, i.e. request every 105us
Overall scheduled RPS: 760000
5s: 1.2% done, RPS(now/agg): 757514/757514, errs: 0, hitrate: 0.0%, clients: 80
done_min: 1.16%, done_max: 1.19%, p99_lat(us): 1527, max_pending: 11
10s: 2.4% done, RPS(now/agg): 753364/755439, errs: 0, hitrate: 0.0%, clients: 80
done_min: 2.27%, done_max: 2.38%, p99_lat(us): 1560, max_pending: 11
15s: 3.5% done, RPS(now/agg): 753031/754636, errs: 0, hitrate: 0.0%, clients: 80
330s: 77.6% done, RPS(now/agg): 753779/752887, errs: 0, hitrate: 0.0%, clients: 80
done_min: 74.12%, done_max: 78.38%, p99_lat(us): 1578, max_pending: 11
done_min: 96.63%, done_max: 100.00%, p99_lat(us): 1579, max_pending: 11
435s: 99.7% done, RPS(now/agg): 137773/733153, errs: 0, hitrate: 0.0%, clients: 15
done_min: 97.77%, done_max: 100.00%, p99_lat(us): 1579, max_pending: 11
440s: 99.9% done, RPS(now/agg): 134162/726347, errs: 0, hitrate: 0.0%, clients: 15
done_min: 98.88%, done_max: 100.00%, p99_lat(us): 1579, max_pending: 11
Total time: 7m23.464824086s. Overall number of requests: 320000000, QPS: 722347
Latency summary, all times are in usec:
Count: 320000000 Average: 826.7950 StdDev: 2009589.73
Min: 51.0000 Median: 857.0704 Max: 23549.0000
------------------------------------------------------
[ 50, 60 ) 95 0.000% 0.000%
[ 60, 70 ) 524 0.000% 0.000%
[ 70, 80 ) 1715 0.001% 0.001%
[ 80, 90 ) 5620 0.002% 0.002%
[ 90, 100 ) 14380 0.004% 0.007%
[ 100, 120 ) 88375 0.028% 0.035%
[ 120, 140 ) 270640 0.085% 0.119%
[ 140, 160 ) 610742 0.191% 0.310%
[ 160, 180 ) 1182863 0.370% 0.680%
[ 180, 200 ) 2054392 0.642% 1.322%
[ 200, 250 ) 8804939 2.752% 4.073% #
[ 250, 300 ) 12475349 3.899% 7.972% #
[ 300, 350 ) 15107581 4.721% 12.693% #
[ 350, 400 ) 16456965 5.143% 17.836% #
[ 400, 450 ) 15996109 4.999% 22.834% #
[ 450, 500 ) 14600129 4.563% 27.397% #
[ 500, 600 ) 25648291 8.015% 35.412% ##
[ 600, 700 ) 20320301 6.350% 41.762% #
[ 700, 800 ) 16566820 5.177% 46.939% #
[ 800, 900 ) 17161547 5.363% 52.302% #
[ 900, 1000 ) 24021013 7.507% 59.809% ##
[ 1000, 1200 ) 69190350 21.622% 81.431% ####
[ 1200, 1400 ) 45721447 14.288% 95.719% ###
[ 1400, 1600 ) 11667667 3.646% 99.365% #
[ 1600, 1800 ) 1247125 0.390% 99.755%
[ 1800, 2000 ) 160430 0.050% 99.805%
[ 2000, 2500 ) 133001 0.042% 99.846%
[ 2500, 3000 ) 129180 0.040% 99.887%
[ 3000, 3500 ) 131104 0.041% 99.928%
[ 3500, 4000 ) 99134 0.031% 99.959%
[ 4000, 4500 ) 60951 0.019% 99.978%
[ 4500, 5000 ) 36908 0.012% 99.989%
[ 5000, 6000 ) 25643 0.008% 99.997%
[ 6000, 7000 ) 3980 0.001% 99.999%
[ 7000, 8000 ) 2088 0.001% 99.999%
[ 8000, 9000 ) 829 0.000% 99.999%
[ 9000, 10000 ) 251 0.000% 100.000%
[ 10000, 12000 ) 147 0.000% 100.000%
[ 12000, 14000 ) 1129 0.000% 100.000%
[ 14000, 16000 ) 80 0.000% 100.000%
[ 18000, 20000 ) 9 0.000% 100.000%
[ 20000, 25000 ) 157 0.000% 100.000%
```

Signed-off-by: Roman Gershman <[email protected]>