Interface for consuming changes #13

procrastinationfighter · 2021-11-12T13:29:51Z

Fixes: #5

This pull request adds an interface for consuming changes from a CDC table.
The consumer module provides a Consumer trait and a ConsumerFactory trait that define the interface used by the reader component. Data from the table is passed to a consumer in a CDCRow object. CDCSchema is used to create new CDCRow objects.

macher259 · 2021-11-12T13:43:58Z

scylla-cdc/src/customer.rs

+    mapping: HashMap<String, usize>,
+}
+
+const STREAM_ID_NAME: &'static str = "cdc$stream_id";


I think constants have static lifetime by default.

That's right, I can't decide which option looks better, so I will just leave it for now.

Update: Clippy recommended removing it.
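
For reference, a minimal sketch of the two spellings discussed in this thread. Both constants end up with the same type, because an elided reference lifetime in a `const` item defaults to `'static`. The name `STREAM_ID_NAME_EXPLICIT` and the helper `takes_static` are made up for the comparison and are not part of the PR.

```rust
// Explicit spelling, as in the original diff.
const STREAM_ID_NAME_EXPLICIT: &'static str = "cdc$stream_id";

// Elided spelling; the lifetime still defaults to `'static` in a `const` item.
const STREAM_ID_NAME: &str = "cdc$stream_id";

// Hypothetical helper that accepts only `'static` string slices,
// so it compiles for both constants.
fn takes_static(name: &'static str) -> usize {
    name.len()
}
```

Since both forms are equivalent, the shorter one is usually preferred.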

Ponewor · 2021-11-12T20:39:38Z

scylla-cdc/src/customer.rs

+    PreImage = 0,
+    RowUpdate = 1,
+    RowInsert = 2,
+    RowDelete = 3,
+    PartitionDelete = 4,
+    RowRangeDelInclLeft = 5,
+    RowRangeDelExclLeft = 6,
+    RowRangeDelInclRight = 7,
+    RowRangeDelExclRight = 8,
+    PostImage = 9,


I know this comment is redundant at this point after @piodul's suggestion to use num_enum but for the record:
you don't have to be explicit with integer values here, they are assigned this way by default.
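
A small sketch of the point about default values, assuming the enum is declared without explicit discriminants. The hand-written `TryFrom` impl is only a stand-in for what a derive macro would generate; it is not the code from the PR.

```rust
use std::convert::TryFrom;

// Without explicit values, discriminants start at 0 and increase by 1,
// which gives the same numbering as the diff above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i8)]
pub enum OperationType {
    PreImage,
    RowUpdate,
    RowInsert,
    RowDelete,
    PartitionDelete,
    RowRangeDelInclLeft,
    RowRangeDelExclLeft,
    RowRangeDelInclRight,
    RowRangeDelExclRight,
    PostImage,
}

// Hand-written conversion from the raw column value (illustrative only).
impl TryFrom<i8> for OperationType {
    type Error = i8;

    fn try_from(value: i8) -> Result<Self, Self::Error> {
        let all = [
            Self::PreImage,
            Self::RowUpdate,
            Self::RowInsert,
            Self::RowDelete,
            Self::PartitionDelete,
            Self::RowRangeDelInclLeft,
            Self::RowRangeDelExclLeft,
            Self::RowRangeDelInclRight,
            Self::RowRangeDelExclRight,
            Self::PostImage,
        ];
        // Return the variant whose discriminant equals `value`,
        // or hand the unknown value back as the error.
        all.iter().copied().find(|op| *op as i8 == value).ok_or(value)
    }
}
```

Keeping the explicit values is still a reasonable choice when the numbers are part of an external format, since it guards against accidental reordering.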

Ponewor · 2021-11-13T04:14:13Z

scylla-cdc/src/customer.rs

@@ -0,0 +1,113 @@
+use std::collections::HashMap;


You probably wanted to name the file consumer.rs, didn't you?

procrastinationfighter · 2021-11-13T21:12:48Z

I've applied some changes to the code. It compiles now; however, it has not been tested yet, so it's still marked as a draft. I will mark this request as ready for review as soon as I add the tests.

procrastinationfighter · 2021-11-13T21:14:59Z

scylla-cdc/src/consumer.rs

+            // I have no idea why, but I couldn't use match here - variables were not accessible inside the match block.
+            if i == schema.stream_id {
+                stream_id = column.unwrap().into_blob().unwrap();
+            } else if i == schema.time {
+                time = column.unwrap().as_uuid().unwrap();
+            } else if i == schema.batch_seq_no {
+                batch_seq_no = column.unwrap().as_int().unwrap();
+            } else if i == schema.end_of_batch {
+                end_of_batch = column.unwrap().as_boolean().unwrap()
+            } else if i == schema.operation {
+                operation = OperationType::try_from(column.unwrap().as_tinyint().unwrap()).unwrap();
+            } else if i == schema.ttl {
+                ttl = column.map(|ttl| ttl.as_bigint().unwrap());
+            } else {
+                data.push(column);
+            }


I don't know why, but when I tried to do something like this:
match i { schema.stream_id => (...) }
it just didn't work - as if the variable schema was out of scope.

I don't think you can put variables into match arms - only constants known at compile time are suitable.
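
To illustrate the reply above, here is a minimal sketch of what a match arm does accept. The constants `STREAM_ID_INDEX` and `TIME_INDEX` are hypothetical; in the PR the indices are computed at runtime and stored in the schema, which is exactly why they cannot be used as patterns.

```rust
// Hypothetical fixed positions, for illustration only.
const STREAM_ID_INDEX: usize = 0;
const TIME_INDEX: usize = 1;

fn classify(i: usize) -> &'static str {
    match i {
        // A path to a constant is a valid pattern.
        STREAM_ID_INDEX => "stream_id",
        TIME_INDEX => "time",
        // `schema.stream_id => ...` would be rejected here: a field access
        // is an expression, not a pattern. A bare lowercase identifier would
        // compile, but it would bind a new variable that matches every
        // value instead of comparing against an existing one.
        _ => "data",
    }
}
```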

macher259 · 2021-11-13T21:35:25Z

scylla-cdc/src/consumer.rs

+        let mut i = 0;
+
+        // .iter().enumerate() can't be used here, because it doesn't take ownership of taken value.
+        for column in row.columns.into_iter() {


I think you don't have to use into_iter here.
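
A short sketch of the ownership point, using `Vec<Option<String>>` as a stand-in for the row's columns (the real element type is `Option<CqlValue>`). The function `split_first` is hypothetical.

```rust
// Splits a row into its first column and the remaining ones, moving every
// value out of the vector instead of cloning it.
fn split_first(columns: Vec<Option<String>>) -> (Option<String>, Vec<Option<String>>) {
    let mut first = None;
    let mut rest = Vec::new();

    // A `for` loop calls `IntoIterator::into_iter` on its operand, so
    // `for column in columns` alone already yields owned values.
    // The explicit call is only needed in order to chain `enumerate`,
    // which also replaces the manual `let mut i = 0` counter.
    for (i, column) in columns.into_iter().enumerate() {
        if i == 0 {
            first = column;
        } else {
            rest.push(column);
        }
    }
    (first, rest)
}
```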

piodul · 2021-11-13T22:53:16Z

scylla-cdc/src/consumer.rs

+            // I have no idea why, but I couldn't use match here - variables were not accessible inside the match block.
+            if i == schema.stream_id {
+                stream_id = column.unwrap().into_blob().unwrap();
+            } else if i == schema.time {
+                time = column.unwrap().as_uuid().unwrap();
+            } else if i == schema.batch_seq_no {
+                batch_seq_no = column.unwrap().as_int().unwrap();
+            } else if i == schema.end_of_batch {
+                end_of_batch = column.unwrap().as_boolean().unwrap()
+            } else if i == schema.operation {
+                operation = OperationType::try_from(column.unwrap().as_tinyint().unwrap()).unwrap();
+            } else if i == schema.ttl {
+                ttl = column.map(|ttl| ttl.as_bigint().unwrap());
+            } else {
+                data.push(column);
+            }


I don't think you can put variables into match arms - only constants known at compile time are suitable.

piodul · 2021-11-13T23:10:47Z

scylla-cdc/src/consumer.rs

+        let mapping = schema.mapping.clone();
+        let deleted_mapping = schema.deleted_mapping.clone();
+        let deleted_el_mapping = schema.deleted_el_mapping.clone();


It's really sad that construction of a CDCRow requires cloning three hashmaps.

Let's change the CDCRow so that it keeps a reference to a CDCRowSchema:

pub struct CDCRow<'schema> {
    // ...
    schema: &'schema CDCRowSchema,
}

Then, in the getter methods you can refer to the mappings through the schema reference.
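
A rough sketch of how the borrowed schema could look, under simplifying assumptions: values are plain strings instead of `CqlValue`, only one mapping is kept, and that mapping is assumed to go from a column name to a position in `data`. The constructor and getter signatures are illustrative, not the ones from the PR.

```rust
use std::collections::HashMap;

// Reduced stand-in for the real schema type.
pub struct CDCRowSchema {
    // Assumed here to map a column name to its position in `CDCRow::data`.
    pub mapping: HashMap<String, usize>,
}

pub struct CDCRow<'schema> {
    data: Vec<Option<String>>,
    // Borrowed, so building a row does not clone any hashmap.
    schema: &'schema CDCRowSchema,
}

impl<'schema> CDCRow<'schema> {
    pub fn new(data: Vec<Option<String>>, schema: &'schema CDCRowSchema) -> Self {
        CDCRow { data, schema }
    }

    // Looks the column up through the schema reference.
    pub fn get_value(&self, name: &str) -> Option<&Option<String>> {
        self.schema.mapping.get(name).map(|&i| &self.data[i])
    }
}
```

Many rows can then share one schema, at the cost of a lifetime parameter on `CDCRow`.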

procrastinationfighter · 2021-11-18T19:17:38Z

I applied the suggested fixes and created some tests for the module. I also have an unfinished test that checks whether CDCRow is created properly (without connecting to the database, with a prepared Row and CDCRowSchema), but I didn't finish it, because I changed my opinion about its usefulness and writing it was tiring. If you think that such a test is a good idea, I will finish it.

piodul

Sorry for changing my mind on the interface so late... but after I saw how many unwraps there are in the tests, I think the return types of get_value, is_value_deleted and get_deleted_elements should be simplified, even if it means that some information about whether columns exist will be unavailable. I think that the users will usually know what their schema is, so they know which column names to expect and which cdc$deleted_ and cdc$deleted_elements_ special columns to expect. They will have to unwrap anyway (I don't think they will want to handle it in some other way), so we can do it for them. I imagine the following interface:

// We need this one `Option` here because it represents if the value is null or not
// and that is a case some users would definitely want to check for
pub fn get_value(&self, name: &str) -> &Option<CqlValue> {
    // Panic if the column does not exist
    // Otherwise return a reference to its value
}

pub fn is_value_deleted(&self, name: &str) -> bool {
    // Panic if the column does not exist
    // Otherwise return if the value was deleted
}

pub fn get_deleted_elements(&self, name: &str) -> &[CqlValue] {
    // Panic if the column does not exist
    // If the value is null, return an empty slice: &[]
    // Otherwise return the slice with deleted elements
}

If I'm wrong and somebody does need this information, we can add another method for them which checks if the column exists:

pub fn column_exists(&self, name: &str) -> bool {
	// ...
}

The panicking behavior should be documented in the docstring comments for those methods.
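
As an illustration of documenting the panic, here is a minimal sketch that uses the conventional `# Panics` section in a doc comment. The row type is heavily simplified (string values instead of `CqlValue`, mapping stored inline instead of in a schema), so treat the field layout and the constructor as assumptions.

```rust
use std::collections::HashMap;

// Simplified stand-in for the real row type.
pub struct CDCRow {
    mapping: HashMap<String, usize>,
    data: Vec<Option<String>>,
}

impl CDCRow {
    pub fn new(mapping: HashMap<String, usize>, data: Vec<Option<String>>) -> Self {
        CDCRow { mapping, data }
    }

    /// Returns the value of the column, or `None` if the value is null.
    ///
    /// # Panics
    ///
    /// Panics if the row has no column called `name`.
    pub fn get_value(&self, name: &str) -> &Option<String> {
        let index = self
            .mapping
            .get(name)
            .unwrap_or_else(|| panic!("no such column: {}", name));
        &self.data[*index]
    }

    /// Returns `true` if the row has a column called `name`.
    pub fn column_exists(&self, name: &str) -> bool {
        self.mapping.contains_key(name)
    }
}
```

Callers who are unsure about the schema can check `column_exists` first and avoid the panic.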

Apart from that and some other small comments I left, I think it LGTM.

piodul · 2021-11-19T06:20:40Z

scylla-cdc/src/consumer.rs

+}
+
+pub trait ConsumerFactory {
+    fn new_consumer() -> Box<dyn Consumer>;


This method should be non-static, i.e. take &self as the first parameter.
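
A sketch of the `&self` variant, with a simplified synchronous `Consumer` trait whose single method is invented for the example; the real trait may look different. Taking `&self` lets a factory carry configuration, and it also keeps the trait usable as `dyn ConsumerFactory`, which a method without a receiver would prevent unless it were restricted with `where Self: Sized`.

```rust
// Simplified stand-in for the PR's `Consumer` trait.
pub trait Consumer {
    fn consume(&mut self, row: &str) -> String;
}

pub trait ConsumerFactory {
    // Non-static: the factory instance is available when building consumers.
    fn new_consumer(&self) -> Box<dyn Consumer>;
}

// Hypothetical consumer that tags each row with a prefix.
struct PrefixConsumer {
    prefix: String,
}

impl Consumer for PrefixConsumer {
    fn consume(&mut self, row: &str) -> String {
        format!("{}{}", self.prefix, row)
    }
}

// Hypothetical factory that passes its configuration to every consumer.
pub struct PrefixConsumerFactory {
    pub prefix: String,
}

impl ConsumerFactory for PrefixConsumerFactory {
    fn new_consumer(&self) -> Box<dyn Consumer> {
        Box::new(PrefixConsumer {
            prefix: self.prefix.clone(),
        })
    }
}
```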

piodul · 2021-11-19T07:17:54Z

scylla-cdc/src/consumer.rs

+        session.query(query, &[]).await.unwrap();
+        session.await_schema_agreement().await.unwrap();
+
+        // Create test tables containing information about generations and streams.


This comment is incorrect - the code below just creates tables with CDC enabled; it does not create mock tables containing information about streams and generations (like the tables created in stream_generations.rs).

piodul · 2021-11-19T07:33:01Z

scylla-cdc/src/consumer.rs

+        // We must allow filtering in order to search by cdc$operation.
+        let result = session
+            .query(format!("SELECT * FROM {} WHERE \"cdc$operation\" = {} AND pk = {} AND ck = {} ALLOW FILTERING;",
+                           TEST_SINGLE_VALUE_CDC_TABLE, 1, 1, 2), ()) // 1 is row update


Instead of using 1 as cdc$operation, please use OperationType::RowUpdate as i8 to make the intent clearer.
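
Roughly what that could look like, as a sketch. The enum is an abbreviated copy, and `TEST_TABLE` plus the helper `select_row_updates` are placeholders rather than names from the PR's tests.

```rust
// Abbreviated copy of the enum; only the first variants are listed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i8)]
enum OperationType {
    PreImage = 0,
    RowUpdate = 1,
    RowInsert = 2,
}

// Placeholder table name for the sketch.
const TEST_TABLE: &str = "ks.single_value_scylla_cdc_log";

// Builds the query with the operation spelled out instead of a bare `1`.
fn select_row_updates(pk: i32, ck: i32) -> String {
    format!(
        "SELECT * FROM {} WHERE \"cdc$operation\" = {} AND pk = {} AND ck = {} ALLOW FILTERING;",
        TEST_TABLE,
        OperationType::RowUpdate as i8,
        pk,
        ck
    )
}
```

With the cast in place, the trailing `// 1 is row update` comment is no longer needed.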

piodul · 2021-11-19T07:39:22Z

scylla-cdc/src/consumer.rs

+        // Test against the default values in CDCRow::from_row
+        assert!(cdc_row.stream_id.len() > 0);
+        assert_ne!(cdc_row.time, uuid::Uuid::default());
+        assert_ne!(cdc_row.batch_seq_no, i32::MAX);


You can assert that the batch_seq_no is equal to 0 here.

piodul · 2021-11-19T07:41:41Z

scylla-cdc/src/consumer.rs

+    }
+
+    #[tokio::test]
+    async fn test_get_item() {


Is this test necessary? The same things (and more) are checked in the test_query.

piodul · 2021-11-19T08:26:52Z

scylla-cdc/src/consumer.rs

+        // The operation type is insert in this case.
+        assert_ne!(cdc_row.operation, OperationType::PreImage);


// The operation type is insert in this case.

You can express such comments as assertions, you know :) Now you are only asserting that the operation type is not PreImage.

piodul · 2021-11-19T09:38:33Z

scylla-cdc/src/consumer.rs

+    /// Returns None if there is no collection column with such name.
+    /// Returns Some(None) if there is such collection, but nothing was deleted from it.
+    /// Otherwise returns Some(Some(x)) where x is a reference to vector containing deleted values.
+    pub fn get_deleted_elements(&self, name: &str) -> Option<Option<&Vec<CqlValue>>> {


We can simplify a bit here, similarly to is_value_deleted. I don't think that the set of deleted elements can be empty, so we can consider null to be an empty Vec in order to remove one layer of Option. Additionally, you should change &Vec<CqlValue> to &[CqlValue] - always prefer slices to vec references.

Also, take a look at my summary comment of this review - I included more suggestions which will affect the final shape of this function.

piodul · 2021-11-19T09:51:05Z

@kbr- please review, too.

piodul · 2021-11-23T14:04:07Z

@kbr- ping

piodul · 2021-11-24T20:04:09Z

@kbr- ping^2

kbr- · 2021-11-25T11:50:35Z

Please fix the git log according to these instructions:
#8 (comment)

also the PR cover letter says:

This is definitely not finished, as you can see, there are many TODOs and it does not compile.

I'm guessing this is no longer true, so please update it.

kbr- · 2021-11-25T11:57:18Z

scylla-cdc/src/consumer.rs

+    pub(crate) end_of_batch: usize,
+    pub(crate) operation: usize,
+    pub(crate) ttl: usize,
+    pub(crate) mapping: HashMap<String, usize>,


Please describe in a comment the nature of this mapping (i.e. it maps what to what)

kbr- · 2021-11-25T12:01:45Z

scylla-cdc/src/consumer.rs

+}
+
+pub struct CDCRowSchema {
+    pub(crate) stream_id: usize,


Are these values indices of these columns in the CDC log table schema, or something else? A comment should mention this. (I didn't understand what these usizes mean until I read the implementation of new.)

kbr- · 2021-11-25T12:07:45Z

scylla-cdc/src/consumer.rs

+        let mut data: Vec<Option<CqlValue>> = Vec::with_capacity(data_count);
+
+        for (i, column) in row.columns.into_iter().enumerate() {
+            if i == schema.stream_id {


Perhaps using guards would be more elegant:
https://doc.rust-lang.org/rust-by-example/flow_control/match/guard.html
(just a suggestion, depends on your preference)
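
For comparison, a minimal sketch of the guard-based version. The `Schema` struct is a reduced, hypothetical stand-in with only three index fields, and `column_kind` returns a label instead of assigning to local variables as the PR code does.

```rust
// Reduced stand-in: each field holds the index of a column in the row.
struct Schema {
    stream_id: usize,
    time: usize,
    ttl: usize,
}

fn column_kind(i: usize, schema: &Schema) -> &'static str {
    match i {
        // Guards may contain arbitrary boolean expressions, so values that
        // are only known at runtime can be compared here.
        i if i == schema.stream_id => "stream_id",
        i if i == schema.time => "time",
        i if i == schema.ttl => "ttl",
        _ => "data",
    }
}
```

This is equivalent to the if/else chain; which one reads better is mostly a matter of taste.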

kbr- · 2021-11-25T12:16:53Z

scylla-cdc/src/consumer.rs

+mod tests {
+    // Because we are planning to extract a common setup to all tests,
+    // the setup for this module is based on generation fetcher's tests,
+    // which has already been merged to the main branch.


The last part of this comment is really unnecessary (about merging stuff into main).

piodul · 2021-11-29T19:33:00Z

scylla-cdc/src/consumer.rs

+        match val {
+            Some(vec) => vec,
+            None => &[],
+        }


Nit: this is probably equivalent to val.unwrap_or(&[])

I'm not sure about that. It does not compile, and the message is: expected struct Vec, found array of 0 elements. val is of type Option<&Vec<CqlValue>> and, if I understand the generics correctly, unwrap_or takes an argument of the same type as its return type: https://doc.rust-lang.org/std/option/enum.Option.html#method.unwrap_or

OK, then never mind. It looks like Rust implicitly converts &Vec<CqlValue> to &[CqlValue] in your code, but it can't do it in unwrap_or. I think that if you placed an as_ref somewhere higher in the code then unwrap_or would work too (so that we explicitly perform this conversion), but I won't insist on changing that.
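
One way to make the conversion explicit, sketched with `i32` elements instead of `CqlValue` and a hypothetical `deleted_elements` helper. It maps the `&Vec` to a slice before calling `unwrap_or`, so both the value and the default have type `&[i32]`; this is just one option and not necessarily what the PR ended up with.

```rust
use std::collections::HashMap;

fn deleted_elements<'a>(map: &'a HashMap<String, Vec<i32>>, name: &str) -> &'a [i32] {
    let val: Option<&Vec<i32>> = map.get(name);
    // `val.unwrap_or(&[])` is rejected because `unwrap_or` expects a
    // `&Vec<i32>` here and the argument is a reference to an empty array.
    // Converting to a slice first gives an `Option<&[i32]>`, for which
    // `&[]` is an acceptable default.
    val.map(Vec::as_slice).unwrap_or(&[])
}
```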

piodul · 2021-11-29T19:37:41Z

scylla-cdc/src/consumer.rs

+        }
+    }
+
+    pub fn column_exists(&self, name: &str) -> bool {


I think the newly added methods belong to CDCRowSchema rather than CDCRow, so please move them there.

I created these functions with the thought that they would serve the consumer, for example, for debugging. As far as I know, the consumer isn't going to see any CDCRowSchema, because this is an internal struct that is used to create new CDCRow objects. If these functions were moved to CDCRowSchema, they would be useless in my opinion.

Personally, I see nothing wrong with exposing the CDCRowSchema to the user (for example through CDCRow::get_schema() or something), and I think that those methods make more sense there, but maybe it will make the interface harder to use... I'm not sure, so I won't insist.

Create Consumer and ConsumerFactory traits. Implement CDCRow struct that represents data passed to the consumer. Implement CDCRowSchema that contains info about column order in the CDC table. Create tests for Consumer module.

macher259 reviewed Nov 12, 2021

procrastinationfighter changed the title ~~Create a draft version of the customer interface.~~ Interface for consuming changes Nov 12, 2021

piodul requested changes Nov 12, 2021

Ponewor reviewed Nov 12, 2021

Ponewor reviewed Nov 13, 2021

procrastinationfighter commented Nov 13, 2021

procrastinationfighter requested a review from piodul November 13, 2021 21:15

macher259 reviewed Nov 13, 2021

piodul requested changes Nov 13, 2021

procrastinationfighter marked this pull request as ready for review November 18, 2021 19:12

procrastinationfighter requested a review from piodul November 18, 2021 19:31

piodul requested changes Nov 19, 2021

kbr- reviewed Nov 25, 2021

procrastinationfighter force-pushed the consumer branch 2 times, most recently from d31ac22 to c07bec7 Compare November 26, 2021 11:42

procrastinationfighter requested a review from piodul November 26, 2021 12:14

kbr- approved these changes Nov 26, 2021

piodul reviewed Nov 29, 2021

procrastinationfighter force-pushed the consumer branch from c07bec7 to 5eea4fc Compare November 29, 2021 21:05

procrastinationfighter force-pushed the consumer branch from 5eea4fc to 0b152bd Compare November 29, 2021 21:23

piodul approved these changes Nov 30, 2021

piodul merged commit 0935aa2 into scylladb:main Nov 30, 2021

procrastinationfighter deleted the consumer branch December 14, 2021 22:29
