feat: stats and keyspace notifications about lazy expiration #1547

proost · 2025-01-12T08:59:55Z

resolve: #1507

codecov · 2025-01-12T15:20:25Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.96%. Comparing base (8af35a1) to head (828550f).
Report is 15 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1547      +/-   ##
============================================
+ Coverage     70.85%   70.96%   +0.11%     
============================================
  Files           120      120              
  Lines         64991    65066      +75     
============================================
+ Hits          46050    46175     +125     
+ Misses        18941    18891      -50

Files with missing lines	Coverage Δ
src/db.c	`89.51% <100.00%> (-0.05%)`	⬇️
src/server.c	`87.61% <100.00%> (+0.04%)`	⬆️
src/server.h	`100.00% <ø> (ø)`

... and 21 files with indirect coverage changes

Signed-off-by: proost <[email protected]>

This PR replaces dict with hashtable in the ZSET datatype. Instead of mapping key to score as dict did, the hashtable maps key to a node in the skiplist, which contains the score. This takes advantage of hashtable performance improvements and saves 15 bytes per set item - 24 bytes overhead before, 9 bytes after. Closes valkey-io#1096 --------- Signed-off-by: Rain Valentine <[email protected]> Signed-off-by: Viktor Söderqvist <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: proost <[email protected]>

Skip logreqres on tests for the HELLO command Signed-off-by: Rueian <[email protected]> Signed-off-by: proost <[email protected]>

Resolves issue with valkey-cli not auto exiting from subscribed mode on reaching zero pub/sub subscription (previously filed on Redis) redis/redis#12592 --------- Signed-off-by: Nikhil Manglore <[email protected]> Signed-off-by: proost <[email protected]>

- remove old ignores - fix a "new" typo 🎁 Signed-off-by: Viktor Szépe <[email protected]> Signed-off-by: proost <[email protected]>

This PR introduces improvements to the hashtable iterator, implementing the prefetching technique described in the blog post [Unlock One Million RPS - Part 2](https://valkey.io/blog/unlock-one-million-rps-part2/). The changes lay the groundwork for further enhancements in use cases involving iterators. Future PRs will build upon this foundation to improve performance and functionality in various iterator-dependent operations.

In the pursuit of maximizing iterator performance, I conducted a comprehensive series of experiments. My tests encompassed a wide range of approaches, including processing multiple bucket indices in parallel, prefetching the next bucket upon completion of the current one, and several other timing and quantity variations. Surprisingly, after rigorous testing and performance analysis, the simplest implementation presented in this PR consistently outperformed all other more complex strategies.

## Implementation

Each time we start iterating over a bucket, we prefetch data for future iterations:

- We prefetch the entries of the next bucket (if it exists).
- We prefetch the structure (but not the entries) of the bucket after the next.

This prefetching is done when we pick up a new bucket, increasing the chance that the data will be in cache by the time we need it.

## Performance

The data below was collected by running the KEYS command on a 64-core Graviton 3 Amazon EC2 instance with 50 million keys of 100 bytes each. The duration of the `keys *` command was taken from the output of the `info all` command.

```
+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 11.112279        | 4,499,529                   |
| Yes                | 3.141916         | 15,913,862                  |
+--------------------+------------------+-----------------------------+
Improvement: Comparing the iterator without prefetching to the one with prefetching, we can see a speed improvement of 11.112279 / 3.141916 ≈ 3.54 times faster.
```

### Save command improvement

#### Setup:

- 64-core Graviton 3 Amazon EC2 instance.
- 50 million keys of 100 bytes each.
- Running valkey server over RAM file system.
- CRC checksum and compression off.

#### Results

```
+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 28               | 1,785,700                   |
| Yes                | 19.6             | 2,550,000                   |
+--------------------+------------------+-----------------------------+
Improvement:
- Reduced SAVE time by 30% (8.4 seconds faster)
- Increased key processing rate by 42.8% (764,300 more keys/second)
```

Signed-off-by: NadavGigi <[email protected]> Signed-off-by: proost <[email protected]>
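The two prefetch rules from the Implementation section above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `toy_bucket`, `prefetch_entries` and `toy_sum_all` are hypothetical names, the bucket layout is simplified, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>

#define ENTRIES_PER_BUCKET 7

/* Hypothetical, simplified bucket; not Valkey's real hashtable layout. */
typedef struct {
    size_t used;                       /* number of filled slots */
    long *entries[ENTRIES_PER_BUCKET]; /* pointers to entry payloads */
} toy_bucket;

/* Hint the CPU to load the payloads a bucket points to. */
static void prefetch_entries(const toy_bucket *b) {
    for (size_t i = 0; i < b->used; i++) __builtin_prefetch(b->entries[i]);
}

/* Visit every entry. Each time a new bucket is picked up:
 *  - prefetch the entries of the next bucket (if it exists), and
 *  - prefetch only the structure of the bucket after the next. */
long toy_sum_all(const toy_bucket *buckets, size_t nbuckets) {
    long sum = 0;
    for (size_t i = 0; i < nbuckets; i++) {
        if (i + 1 < nbuckets) prefetch_entries(&buckets[i + 1]);
        if (i + 2 < nbuckets) __builtin_prefetch(&buckets[i + 2]);
        for (size_t j = 0; j < buckets[i].used; j++) {
            sum += *buckets[i].entries[j];
        }
    }
    return sum;
}
```

Prefetching the structure two buckets ahead means that, one step later, reading that bucket's slot pointers in order to prefetch its entries is less likely to stall.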

This PR is to cleanup the `SERVER_TEST` compiler flag from cmake compile definitions, as it is no longer required in the new unit test framework, see valkey-io#428. Signed-off-by: Karthick Ariyaratnam <[email protected]> Signed-off-by: proost <[email protected]>

We need to add a hash tag in cluster mode. Fixes valkey-io#1531. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

The fix that Redis gave us for the CVE-2024-46981 was freeing lctx.lua, and I didn't merge it correctly. We made some changes so that we are able to async free the lua context, so we need to free the passed in context. This was applied correctly on the two released versions (8.0 and 7.2) just not on unstable. Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: proost <[email protected]>

Fixes valkey-io#1538 Signed-off-by: Harkrishn Patro <[email protected]> Signed-off-by: proost <[email protected]>

…lkey-io#1535) In valkey-io#1441, we found an assert, and decided to remove this assert and instead just free the newly created node and close the link, since if we cannot get the IP from the link it probably means the connection was closed.

```
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
17847:M 19 Dec 2024 00:15:58.021 # === ASSERTION FAILED ===
17847:M 19 Dec 2024 00:15:58.021 # ==> cluster_legacy.c:3252 'nodeIp2String(node->ip, link, hdr->myip) == C_OK' is not true
------ STACK TRACE ------
17847 valkey-server *
src/valkey-server 127.0.0.1:27131 [cluster](clusterProcessPacket+0x1304) [0x4e5634]
src/valkey-server 127.0.0.1:27131 [cluster](clusterReadHandler+0x11e) [0x4e59de]
/__w/valkey/valkey/src/valkey-tls.so(+0x2f1e) [0x7f083983ff1e]
src/valkey-server 127.0.0.1:27131 [cluster](aeMain+0x8a) [0x41afea]
src/valkey-server 127.0.0.1:27131 [cluster](main+0x4d7) [0x40f547]
/lib64/libc.so.6(+0x40c8) [0x7f083985a0c8]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f083985a18b]
src/valkey-server 127.0.0.1:27131 [cluster](_start+0x25) [0x410ef5]
```

But that change also introduced another assert. The reason is that the new node has not been added to the cluster nodes dict.

```
17128:M 08 Jan 2025 10:51:44.061 # === ASSERTION FAILED ===
17128:M 08 Jan 2025 10:51:44.061 # ==> cluster_legacy.c:1693 'dictDelete(server.cluster->nodes, nodename) == DICT_OK' is not true
------ STACK TRACE ------
17128 valkey-server *
src/valkey-server 127.0.0.1:28627 [cluster][0x4ebdc4]
src/valkey-server 127.0.0.1:28627 [cluster][0x4e81d2]
src/valkey-server 127.0.0.1:28627 [cluster](clusterReadHandler+0x268)[0x4e8618]
/__w/valkey/valkey/src/valkey-tls.so(+0xb278)[0x7f109480b278]
src/valkey-server 127.0.0.1:28627 [cluster](aeMain+0x89)[0x592b09]
src/valkey-server 127.0.0.1:28627 [cluster](main+0x4b3)[0x453e23]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f10958bf7e5]
src/valkey-server 127.0.0.1:28627 [cluster](_start+0x2e)[0x454a5e]
```

This closes valkey-io#1527.
Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

… is 0 (valkey-io#1541) When latency-monitor-threshold is set to 0, the latency monitor is disabled. In VM_LatencyAddSample, we wrote the condition incorrectly, causing us to record latency even when the latency monitor was turned off. This bug has existed since the very first version of the code; see e3b1d6d, which was merged in 2019. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>
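As a rough illustration of the condition described above, the sketch below contrasts a check that honours "0 means disabled" with one that does not. Both function names are hypothetical and this is not the actual `VM_LatencyAddSample` code; it only shows the shape of the bug.

```c
#include <stdbool.h>

/* Hypothetical helper: a threshold of 0 means the latency monitor is off,
 * so nothing should be recorded regardless of the measured latency. */
bool should_record_latency(long long threshold_ms, long long latency_ms) {
    return threshold_ms > 0 && latency_ms >= threshold_ms;
}

/* The kind of check the commit message calls incorrect: it ignores the
 * "0 means disabled" convention, so every sample passes when threshold is 0. */
bool should_record_latency_buggy(long long threshold_ms, long long latency_ms) {
    return latency_ms >= threshold_ms;
}
```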

…flicts (valkey-io#1018) When multiple primary nodes fail simultaneously, the cluster cannot recover within the default effective time (data_age limit). The main reason is that the voting has no ranking among the multiple replica nodes, which causes too many epoch conflicts. Therefore, we introduced a ranking based on the failed primary shard-id. This adds a new failed_primary_rank variable, which is the rank of this (myself) instance in the list of all failed primaries. It is used during failover: the failover election packets are sent in order based on the rank, which effectively avoids the voting conflicts. If a single primary is down, the behavior is the same as before. If multiple primaries are down, their replica election initiation time will be delayed by 500ms according to the ranking. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>
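A minimal sketch of the ranking idea, under stated assumptions: the rank is taken to be the number of failed primaries whose shard id sorts before ours, and the extra delay is taken to be `rank * 500` ms. The function names and the exact formula are illustrative guesses, not the code from valkey-io#1018.

```c
#include <stddef.h>
#include <string.h>

#define TOY_RANK_DELAY_MS 500 /* per-rank delay mentioned in the description */

/* Assumed ranking rule: count failed primaries whose shard id sorts
 * before ours. Rank 0 means we may start our election first. */
int toy_failed_primary_rank(const char *my_shard_id,
                            const char *const *failed_shard_ids, size_t n) {
    int rank = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(failed_shard_ids[i], my_shard_id) < 0) rank++;
    }
    return rank;
}

/* Assumed delay rule: each rank step postpones the election by 500 ms. */
long toy_election_extra_delay_ms(int rank) {
    return (long)rank * TOY_RANK_DELAY_MS;
}
```

With a single failed primary the rank is 0 and no extra delay is added, which matches the statement that the single-failure behavior is unchanged.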

… the FAIL (valkey-io#1191) Imagine we have a cluster, for example a three-shard cluster. If shard 1 does a CLUSTER RESET HARD, its node name changes, and the other nodes then mark it as NOADDR since the node name received in PONG has changed. In the eyes of the other nodes, there is one working primary node left but with no address. In this case, the address reported in MOVED will be invalid and will confuse the clients. At the same time, the replica will not fail over since its primary is not in the FAIL state. This leaves a cluster that appears OK to everyone, but with no coverage for shard 1. Obviously we should do something like CLUSTER FORGET to remove the node and fix the cluster before using it. But the point here is that we can mark the NOADDR node as FAIL to advance the cluster state. If a node is NOADDR, it does not have a valid address, so we won't reconnect to it, we won't send it PING, and we won't gossip about it, so it seems reasonable to mark it as FAIL. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

When the cluster changes, we need to persist the cluster configuration, and these file IO operations may cause latency. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

The commands used in the valkey-cli tests are not important for reply schema validation. Skip them to avoid the problem of tests hanging. This has failed lately in the daily job:

```
[TIMEOUT]: clients state report follows.
sock55fedcc19be0 => (IN PROGRESS) valkey-cli pubsub mode with single standard channel subscription
Killing still running Valkey server 33357
```

These test cases use a special valkey-cli command, `:get pubsub`, which is internal to valkey-cli rather than a Valkey server command. This command hangs when compiled with logreqres enabled. An easy solution is to skip the tests in this setup. The test cases were introduced in valkey-io#1432. Signed-off-by: Viktor Söderqvist <[email protected]> Signed-off-by: proost <[email protected]>

After valkey-io#1545 disabled some tests for reply schema validation, we now have another issue that ECHO is not covered.

```
WARNING! The following commands were not hit at all:
  echo
ERROR! at least one command was not hit by the tests
```

This patch adds a test case for ECHO in the unit/other test suite. I haven't checked if there are more commands that aren't covered. Signed-off-by: Viktor Söderqvist <[email protected]> Signed-off-by: proost <[email protected]>

This PR replaces dict with the new hashtable data structure in the HASH datatype. There is a new struct for hashtable items which contains a pointer to value sds string and the embedded key sds string. These values were previously stored in dictEntry. This structure is kept opaque so we can easily add small value embedding or other optimizations in the future. closes valkey-io#1095 --------- Signed-off-by: Rain Valentine <[email protected]> Signed-off-by: proost <[email protected]>

In some cases unix groups could have whitespace and/or `\` in them. One example is my workstation. It's a MacOS in an Active Directory domain. So my user has group `LD\Domain Users`. Running `make test` on `unstable` and `8.0` branches fails with: I'm not sure if we need to fix this in 8.0. But it seems that it should be fixed in unstable. Signed-off-by: secwall <[email protected]> Signed-off-by: proost <[email protected]>

Introduced at https://github.com/valkey-io/valkey/pull/1165/files Signed-off-by: naglera <[email protected]> Signed-off-by: proost <[email protected]>

`sds` is a typedef of `char *`. `const sds` means `char * const`, i.e. a const-pointer to non-const content. More often, you would want `const char *`, i.e. a pointer to const-content. Until now, it's not possible to express that. This PR adds `const_sds` which is a pointer to const-content sds. To get a const-pointer to const-content sds, you can use `const const_sds`. In this PR, some uses of `const sds` are replaced by `const_sds`. We can use it more later. Fixes valkey-io#1542 --------- Signed-off-by: Viktor Söderqvist <[email protected]> Signed-off-by: proost <[email protected]>
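The distinction can be illustrated with a small sketch. The `const_sds` typedef below is assumed to look like `const char *`, as the description implies. Plain C strings stand in for real sds strings (which carry a header in front of the pointer), and the two functions are hypothetical examples.

```c
#include <stddef.h>

typedef char *sds;             /* as described: sds is a typedef of char * */
typedef const char *const_sds; /* assumed shape: pointer to const content */

/* Read-only access: with const_sds the compiler rejects writes through s. */
size_t count_char(const_sds s, char c) {
    size_t n = 0;
    for (; *s; s++) {
        if (*s == c) n++;
    }
    return n;
}

/* `const sds` only makes the pointer itself const (char * const),
 * so the content can still be modified, as this function does. */
void upcase_first(const sds s) {
    if (s[0] >= 'a' && s[0] <= 'z') s[0] = (char)(s[0] - 'a' + 'A');
}
```

Inside `count_char`, a statement such as `s[0] = 'x';` would fail to compile, while `upcase_first` compiles even though its parameter is declared `const`.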

Add `paused_actions` and `paused_timeout_milliseconds` for INFO Clients to inform users about if clients are paused. --------- Signed-off-by: zhaozhao.zz <[email protected]> Signed-off-by: proost <[email protected]>

Adds filter options to CLIENT LIST:

* USER <username> Return clients authenticated by <username>.
* ADDR <ip:port> Return clients connected from the specified address.
* LADDR <ip:port> Return clients connected to the specified local address.
* SKIPME (YES|NO) Exclude the current client from the list (default: no).
* MAXAGE <maxage> Only list connections older than the specified age.

Modifies the ID filter to CLIENT KILL to allow multiple IDs:

* ID <client-id> [<client-id>...] Kill connections by client ids.

This makes CLIENT LIST and CLIENT KILL accept the same options. For backward compatibility, the default value for SKIPME is NO for CLIENT LIST and YES for CLIENT KILL.

The MAXAGE filter comes from CLIENT KILL, where it *keeps* clients with the given max age and kills the older ones. This logic becomes weird for CLIENT LIST, but is kept for similarity with CLIENT KILL, for the use case of first testing manually using CLIENT LIST, and then running CLIENT KILL with the same filters.

The `ID client-id [client-id ...]` filter no longer needs to be the last filter. The parsing logic determines whether an argument is an ID based on whether it can be parsed as an integer.

Partly addresses: valkey-io#668

Signed-off-by: Sarthak Aggarwal <[email protected]> Signed-off-by: proost <[email protected]>
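The rule that an argument counts as an ID only if it parses as an integer could be sketched like this. It uses standard `strtoll` rather than Valkey's own string-to-integer helper, so edge cases such as leading whitespace or a `+` sign may differ from the real parser. The function names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if arg is entirely a base-10 integer that fits in long long. */
bool toy_is_integer(const char *arg, long long *out) {
    char *end;
    errno = 0;
    long long v = strtoll(arg, &end, 10);
    if (end == arg || *end != '\0' || errno != 0) return false;
    *out = v;
    return true;
}

/* Collect the IDs that follow the ID keyword, starting at argv[start].
 * Stops at the first argument that is not an integer, which is then
 * treated as the next filter keyword. Returns the number of IDs read. */
int toy_collect_ids(int argc, const char **argv, int start,
                    long long *ids, int max_ids) {
    int n = 0;
    for (int i = start; i < argc && n < max_ids; i++) {
        long long id;
        if (!toy_is_integer(argv[i], &id)) break;
        ids[n++] = id;
    }
    return n;
}
```

Because the ID list ends at the first non-integer argument, a filter such as `USER` can follow it, which is why ID no longer has to come last.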

Just like the spell-check workflow, we should allow it to be triggered on push events, so that forked repos can notice formatting problems well before submitting a PR. Signed-off-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

…d other commands(valkey-io#1517) Some commands that use unix-time, such as `EXPIREAT` and `SET EXAT`, should include the deleted keys in the `expired_keys` statistics if the specified time has already expired, and notifications should be sent in the manner of expired. --------- Signed-off-by: Ray Cao <[email protected]>
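The behavior described for commands such as `EXPIREAT` with a timestamp already in the past can be sketched as below. `toy_stats` and `toy_expire_at` are hypothetical and only model the accounting: the key is treated as expired, the counter behind `expired_keys` is incremented, and an `expired` event is emitted instead of a plain `expire` event.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the server stats and notification channel. */
typedef struct {
    long long expired_keys; /* mirrors the `expired_keys` statistic */
    const char *last_event; /* last keyspace event name, for illustration */
} toy_stats;

/* Returns true if the key was removed immediately because the requested
 * absolute expiry time is not in the future. */
bool toy_expire_at(toy_stats *st, long long when_ms, long long now_ms) {
    if (when_ms <= now_ms) {
        st->expired_keys++;          /* count it as an expired key */
        st->last_event = "expired";  /* notify in the manner of expiration */
        return true;
    }
    st->last_event = "expire";       /* a TTL was set; key still exists */
    return false;
}
```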

…1312) This commit creates a new compilation unit for the scripting engine code by extracting the existing code from the functions unit. We're doing this refactor to prepare the code for running the `EVAL` command using different scripting engines. This PR has a module API change: we changed the type of error messages returned by the callback `ValkeyModuleScriptingEngineCreateFunctionsLibraryFunc` to be a `ValkeyModuleString` (aka `robj`); This PR also fixes valkey-io#1470. --------- Signed-off-by: Ricardo Dias <[email protected]> Signed-off-by: proost <[email protected]>

…1563) This issue affected only two message types (CLUSTERMSG_TYPE_PUBLISH and CLUSTERMSG_TYPE_PUBLISHSHARD) because they used a light message header, which caused the CLUSTER INFO stats to miss sent/received message information for those types. --------- Signed-off-by: Harkrishn Patro <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

When processing a cluster bus PING extension, there is a memory leak when adding a new key to the `nodes_black_list` dict. We now make sure to free the key `sds` if the dict did not take ownership of it. Signed-off-by: Pierre Turin <[email protected]> Signed-off-by: proost <[email protected]>

Update comments and log message in `cluster_legacy.c`. Follow-up from valkey-io#1441. Signed-off-by: Pierre Turin <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Binbin <[email protected]> Signed-off-by: proost <[email protected]>

…eat-keyspace-expiration-misses Signed-off-by: proost <[email protected]>

proost changed the title ~~feat: keyspace expiration misses~~ feat: stats and keyspace notifications about lazy expiration Jan 17, 2025

proost and others added 28 commits January 17, 2025 23:29

feat: add key expiration misses stat

9dced28

Signed-off-by: proost <[email protected]>

chore: rollback header file

6f58024

Signed-off-by: proost <[email protected]>

feat: add notify lazy expired

35a5622

Signed-off-by: proost <[email protected]>

Skip logreqres on tests for the HELLO command (valkey-io#1528)

980ab2a


Improve Typos configuration (valkey-io#1456)

1dd2cba


Fix new cli subscribed mode test in cluster mode (valkey-io#1533)

ad9601d


Update upload artifacts to v4 (valkey-io#1539)

6a58098


Add latency stats around cluster config file operations (valkey-io#1534)

e8e6f68


Fix valgrind test (valkey-io#1555)

c33a7b1


add paused_actions for INFO Clients (valkey-io#1519)

32c57b5


hpatro and others added 3 commits January 17, 2025 23:30

proost force-pushed the feat-keyspace-expiration-misses branch from e3452dd to e9a9894 Compare January 17, 2025 14:32

Merge branch 'unstable' of https://github.com/valkey-io/valkey into f…

e05be86

…eat-keyspace-expiration-misses Signed-off-by: proost <[email protected]>
