feat(ISV-6425): extend base consumer with retry mechanism #5
BorekZnovustvoritel merged 9 commits into main
Conversation
Force-pushed 21a4aee to 5859031
* covered with unit tests Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.5-opus-high (Cursor)
* create various integration tests Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.5-opus-high (Cursor)
Signed-off-by: Marek Szymutko <mszymutk@redhat.com>
Force-pushed 5859031 to b82afb2
@mavaras PTAL (I cannot assign you in UI, I guess you have to accept the invitation to the organization) 🙏
Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.5-Sonnet (Cursor)
Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.5-Sonnet (Cursor)
Signed-off-by: Marek Szymutko <mszymutk@redhat.com>
@ezopezo @bclindner please also take a look. I am very much willing to schedule a meeting to explain this. I definitely do not expect your review too fast; it's a lot of lines, which I am sorry for.
mavaras left a comment
For me this is a LGTM, but I suggest that others take an additional look at this 👍
sorry for the wait - hoping i didn't miss much, but with all the dry test code you have to write for these kinds of changes i'm sure i missed something or other. i might recommend a third reviewer on this, it's a tough one.
i think this all makes sense to me, but i could use some clarification on a couple things. i also see some questionable bits, particularly in the tests. see comments.
> ### Fault-tolerance
>
> Each consumer accepts configuration with retry topics. A retry topic is
kinda new to Kafka-- is this retry topic pattern a widely-used thing for Kafka as a platform, or is it our own thing?
first time seeing something like this in the wild so i'm just trying to get more info. the closest established generic pattern i know of is a dead letter channel, but that isn't really meant to indicate when/where to retry, just a place to put dropped messages. otherwise the best i can find for this is this medium article & some Java/Spring-related stuff on baeldung
It is basically a lightweight reimplementation of the functionality present in Spring Kafka library, see here. This solution was suggested by Tim Waugh
thanks for the docs here, this is a lot of code to review in one go & i definitely needed the primer
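For readers who, like the reviewer above, are new to the pattern: the retry-topic idea can be sketched roughly as a chain of topics a failed message hops through before landing in a dead letter queue. The sketch below is illustrative only; the topic names and helper functions are assumptions, not this library's actual API.

```python
import json

# Hypothetical retry chain: first failure goes to a short-delay retry
# topic, repeated failures escalate, and the last hop is the DLQ.
RETRY_CHAIN = ["orders-retry-5s", "orders-retry-30s", "orders-dlq"]


def next_retry_topic(current_topic: str) -> str:
    """Pick the next hop in the retry chain for a failed message."""
    if current_topic in RETRY_CHAIN:
        idx = RETRY_CHAIN.index(current_topic)
        # Stay in the DLQ once the chain is exhausted.
        return RETRY_CHAIN[min(idx + 1, len(RETRY_CHAIN) - 1)]
    # A failure on the main topic enters the first retry topic.
    return RETRY_CHAIN[0]


def handle_failure(topic: str, payload: dict) -> tuple[str, bytes]:
    """Return the (topic, value) to produce when processing raised."""
    return next_retry_topic(topic), json.dumps(payload).encode()
```

This is essentially a lightweight version of what Spring Kafka's retryable-topics feature automates, per the discussion above.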
```diff
 [testenv:integration]
-commands = pytest -vv -s ./tests/integration/ {posargs}
+commands = pytest -vv -s --log-cli-level=DEBUG ./tests/integration/ {posargs}
```
is this debug addition still necessary/worth keeping in?
It makes the integration tests transparent. We don't need this for functionality, but I think it makes sense that we want to be able to debug integration tests. I am open to suggestions on this one though.
tests/integration/README.md (Outdated)
```python
from retriable_kafka_client import ConsumeTopicConfig
from tests.integration.integration_utils import IntegrationTestScaffold, ScaffoldConfig
```
nitpick: we should change this to .integration_utils to make it consistent with how we're actually using it
```python
@dataclass
class RandomDelay:
```
...can't we just simplify this to

```python
def get_random_delay(min_delay: float, max_delay: float):
    return random.uniform(min_delay, max_delay)
```

instead of making a whole thing out of it with a class? we could also just call random.uniform directly since we only use this once.
I mean, we can. But this way we are able to reduce the number of arguments passed to the scaffolding's start_consumer method. Let me know if you really want me to change this.
```python
    ValueError: if simulated failure occurs
    """
    if self.delay:
        if isinstance(self.delay, RandomDelay):
```
could easily simplify this by removing RandomDelay as mentioned elsewhere in the review
With the current implementation, we are able to specify a static delay (float) or a random delay. If I were to ditch the RandomDelay class, then I think the best solution to pass the parameters to the MockTarget would be just two floats, `max_delay` and `min_delay`. If a static delay was wanted in some tests, these would have to be specified as the same value.
I don't know, is that cleaner? I am not opposed, I am just not sure if I see the benefit really.
Ahhh I missed what was going on here before - this is randomizing at runtime rather than instantiation time so we're randomizing individual message delays. That's my bad, this makes sense now.
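The point resolved above, that the range is fixed when the test is set up but each message resolves its own delay at processing time, can be sketched roughly like this. The `resolve` and `apply_delay` names are illustrative stand-ins, not the PR's actual code.

```python
import random
import time
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class RandomDelay:
    """A delay range; the concrete value is drawn per call, not per instance."""
    min_delay: float
    max_delay: float

    def resolve(self) -> float:
        # Runtime randomization: every message gets its own delay.
        return random.uniform(self.min_delay, self.max_delay)


def apply_delay(delay: Optional[Union[float, RandomDelay]]) -> float:
    """Sleep for a static or per-call random delay; return the seconds slept."""
    if delay is None:
        return 0.0
    seconds = delay.resolve() if isinstance(delay, RandomDelay) else delay
    time.sleep(seconds)
    return seconds
```

A plain `get_random_delay()` helper would fix the value at call time in the scaffold setup, which is what made the class version preferable here.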
tests/integration/conftest.py (Outdated)
```python
metadata = admin_client.list_topics(timeout=5)
if metadata.brokers:
    return
except KafkaException:
```
nitpick: doesn't KafkaException extend Exception? if so, we can probably just remove this block for it, right?
and if not, we could also just do this to clean it up:

```python
except (Exception, KafkaException):
```
You are right, it's unnecessary. I guess I left that behind by accident. I will update this.
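The cleaned-up readiness wait might look roughly like the sketch below: since KafkaException subclasses Exception, a single broad `except` covers both while polling. This standalone version takes the metadata call as a plain callable so the shape is testable; the function name and arguments are assumptions.

```python
import time


def wait_for_brokers(list_topics, attempts: int = 30, pause: float = 1.0) -> None:
    """Poll Kafka metadata until at least one broker appears, or give up."""
    for _ in range(attempts):
        try:
            metadata = list_topics(timeout=5)
            if metadata.brokers:
                return
        except Exception:
            # KafkaException is an Exception subclass, so no separate branch
            # is needed; any transient error just means "poll again".
            pass
        time.sleep(pause)
    raise RuntimeError("Kafka brokers did not become available")
```

With confluent-kafka this would typically be called as `wait_for_brokers(admin_client.list_topics)` in a session fixture.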
tests/unit/test_retry_utils.py (Outdated)
| """Test that RetryManager accepts valid configurations.""" | ||
| config = _make_config(topics) | ||
| manager = RetryManager(config) | ||
| assert manager is not None |
i don't see a code path where constructing a RetryManager would ever return None. is this assert a formality/pylint situation, or is there maybe a better way we can validate this?
Yes, that's nonsense. Just calling the RetryManager constructor and not getting any assertion error is enough here.
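The agreed-upon shape, where constructing without raising *is* the valid-config test and invalid configs are covered with `pytest.raises`, can be sketched like this. `_DummyManager` is a stand-in for `RetryManager`, since the real class isn't shown here.

```python
import pytest


class _DummyManager:
    """Stand-in for RetryManager: validates its config on construction."""

    def __init__(self, config: dict) -> None:
        if not config.get("topics"):
            raise ValueError("at least one topic is required")
        self.config = config


def test_valid_config_constructs() -> None:
    # Not raising is the entire assertion; no `is not None` needed.
    _DummyManager({"topics": ["a"]})


def test_empty_config_rejected() -> None:
    with pytest.raises(ValueError):
        _DummyManager({"topics": []})
```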
tests/unit/test_producer.py (Outdated)
```python
# Should have retried twice (2 failures + 1 success = 3 calls)
assert mock_kafka_producer.produce.call_count == 3
# Should have waited approximately: 0.01 + 0.011 = 0.021 seconds
assert elapsed_time >= 0.02  # Allow some timing flexibility
this is a strange one - if the testing hardware is slow, this will pass when it shouldn't, rather than failing if the hardware is slow as these testing race conditions usually go. should we bother checking for this given it only semi-reliably tests the existence of a wait?
You are right. I missed this when culling nonsense by Claude in the tests.
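A deterministic alternative to the wall-clock assertion culled above is to patch the sleep call and assert on the *requested* delays, so hardware speed cannot affect the outcome. The `retry_with_backoff` helper below is a hypothetical stand-in for the producer's retry loop, not the PR's actual code.

```python
import time
from unittest import mock


def retry_with_backoff(func, delays=(0.01, 0.011), sleep=None):
    """Call func, sleeping between failed attempts; return its result."""
    sleep = sleep or time.sleep
    for delay in delays:
        try:
            return func()
        except Exception:
            sleep(delay)
    # Final attempt after the backoff schedule is exhausted.
    return func()


def test_backoff_delays_requested() -> None:
    attempts = []

    def flaky():
        attempts.append(1)
        if len(attempts) < 3:
            raise RuntimeError("boom")
        return "ok"

    fake_sleep = mock.Mock()
    assert retry_with_backoff(flaky, sleep=fake_sleep) == "ok"
    # Assert the exact delays requested instead of measuring elapsed time.
    assert fake_sleep.call_args_list == [mock.call(0.01), mock.call(0.011)]
```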
```python
@pytest.mark.asyncio
async def test_producer_send_success_first_attempt(base_producer: BaseProducer) -> None:
    """Test successful message send on first attempt."""
def _run_send_method(
```
smart way to tackle this, i think! i guess you could also just run both at once here to avoid having to do parametrize, but i'm guessing that causes problems with things like patching
I am going to be honest, this is not my idea, this was created by Claude. But it makes sense to me so I kept it.
I think parametrization makes sense so that we can see which method failed in the test output. If we called both at once, some test would fail and we wouldn't know on first sight if it failed during the sync or async execution.
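The parametrization idea discussed above, one helper dispatching to either the sync or the async path, with the mode visible in the test id, can be sketched as follows. `FakeProducer` and its methods are illustrative; the real test uses the PR's `BaseProducer`.

```python
import asyncio

import pytest


class FakeProducer:
    """Minimal stand-in exposing both a sync and an async send path."""

    def send(self, msg: str) -> str:
        return f"sent:{msg}"

    async def send_async(self, msg: str) -> str:
        return f"sent:{msg}"


def _run_send_method(producer: FakeProducer, mode: str, msg: str) -> str:
    """Run one of the two send paths so both share a single test body."""
    if mode == "async":
        return asyncio.run(producer.send_async(msg))
    return producer.send(msg)


@pytest.mark.parametrize("mode", ["sync", "async"])
def test_send_first_attempt(mode: str) -> None:
    # A failure shows up as test_send_first_attempt[sync] or [async],
    # so it is immediately clear which execution path broke.
    assert _run_send_method(FakeProducer(), mode, "hello") == "sent:hello"
```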
Signed-off-by: Marek Szymutko <mszymutk@redhat.com>
bclindner left a comment
Changes look good, this all seems clear to me now. Thanks!
Signed-off-by: Marek Szymutko <mszymutk@redhat.com>
src/retriable_kafka_client/config.py (Outdated)
```diff
-topics: list of topic names to connect to
+topics: list of configuration for topics and their
+    retry policies
 cancel_future_wait_time: Maximal time to wait fot a task
```
Suggested change:

```diff
-cancel_future_wait_time: Maximal time to wait fot a task
+cancel_future_wait_time: Maximal time to wait for a task
```
README.md (Outdated)
> of them finish before the cluster forces rebalancing. The library tries to
> finish everything before the rebalance while stopping additional non-started
> tasks. The default timeout for each task to finish is 30 seconds, but can be
Just for the paper trail: We agreed in DMs on moving from eager rebalancing towards cooperative rebalancing. Are those lines still applicable even after changing strategy?
Well, it should be re-worded, good point. It does stop tasks and wait for others to finish, it just does so only for the affected partitions. Example: if a single consumer is in a group, it consumes all partitions. When another consumer joins the group, the cluster has to revoke some of the partitions from the first consumer to assign them to the second consumer. I will try to update this text a little.
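For reference, switching a confluent-kafka consumer to cooperative (incremental) rebalancing is a configuration change; with the cooperative-sticky strategy only the revoked partitions are interrupted while the rest keep processing. The group id, topic, and bootstrap address below are placeholders, not this project's actual values.

```python
# Hedged sketch of a consumer config using cooperative rebalancing;
# requires a librdkafka/confluent-kafka version that supports it.
consumer_config = {
    "bootstrap.servers": "localhost:9092",        # placeholder address
    "group.id": "retriable-consumer-group",       # placeholder group id
    # Incremental rebalancing: only revoked partitions stop processing.
    "partition.assignment.strategy": "cooperative-sticky",
    "enable.auto.commit": False,
}
```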
Also adding a digest of our personal discussion: I expressed my concerns about reordering of messages during an outage when they are moved to a retry topic. To my knowledge, messages are loaded into this topic and then popped from it, but while processing is still blocked, a message is added back at the beginning of the queue. When processing starts again, this might cause message reordering, which might cause inconsistencies in workflows. Nevertheless, we agreed that this implementation is sufficient, thorough UAT will be performed, logs will be analyzed afterwards, and further improvements might be addressed in further tasks.
Signed-off-by: Marek Szymutko <mszymutk@redhat.com>
Assisted-by: Claude-4.5-opus-high (Cursor), mainly in the test section
How to review
The majority of this PR is tests. The best place to look to verify the functionality is the integration tests. There are many unit tests, and they are not that interesting to be honest. But the integration tests do sum up the parts that are important in my opinion.
What changed