feat: add snapshot chunking mode overrides and sparse PK safeguards #66
Conversation
```go
// Calculate sparsity ratio to detect sparse primary key distributions
// Example: Snowflake IDs where minValue=1, maxValue=7234567890123456789, but only 1000 rows
sparsityRatio := float64(totalRange) / float64(rowCount)
```
In large distributed ID systems (e.g., snowflake/timestamp IDs), it’s common to switch from offset/range chunking to keyset pagination based on a “gap density” (sparsity) check.
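This gap-density check can be sketched as follows. The helper `chooseChunkingMode` and the `sparsityThreshold` cutoff are illustrative assumptions, not the PR's actual implementation:

```go
package main

import "fmt"

type SnapshotChunkingMode string

const (
	SnapshotChunkingModeRange  SnapshotChunkingMode = "range"
	SnapshotChunkingModeKeyset SnapshotChunkingMode = "keyset"
)

// sparsityThreshold is an illustrative cutoff: when the PK range is more
// than this many times larger than the row count, the distribution is
// treated as sparse (e.g. Snowflake/timestamp IDs).
const sparsityThreshold = 10.0

// chooseChunkingMode is a hypothetical helper that picks keyset pagination
// for sparse PK distributions and range chunking for dense ones.
func chooseChunkingMode(minPK, maxPK, rowCount int64) SnapshotChunkingMode {
	if rowCount == 0 {
		return SnapshotChunkingModeKeyset
	}
	totalRange := maxPK - minPK + 1
	sparsityRatio := float64(totalRange) / float64(rowCount)
	if sparsityRatio > sparsityThreshold {
		return SnapshotChunkingModeKeyset
	}
	return SnapshotChunkingModeRange
}

func main() {
	// Dense sequence: IDs 1..1000 over 1000 rows -> range chunking.
	fmt.Println(chooseChunkingMode(1, 1000, 1000))
	// Sparse Snowflake-style IDs -> keyset pagination.
	fmt.Println(chooseChunkingMode(1, 7234567890123456789, 1000))
}
```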
```go
type SnapshotChunkingMode string

const (
	SnapshotChunkingModeAuto   SnapshotChunkingMode = "auto"
	SnapshotChunkingModeRange  SnapshotChunkingMode = "range"
	SnapshotChunkingModeKeyset SnapshotChunkingMode = "keyset"
	SnapshotChunkingModeOffset SnapshotChunkingMode = "offset"
)
```
The user might not want the decision made automatically by the decision tree. I added this so that, where possible, users can set the mode explicitly and direct the snapshot toward the chunking strategy they want.
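The override could look something like this. The `TableConfig` struct and `resolveMode` helper are hypothetical illustrations of the idea, not the library's actual API:

```go
package main

import "fmt"

type SnapshotChunkingMode string

const (
	SnapshotChunkingModeAuto   SnapshotChunkingMode = "auto"
	SnapshotChunkingModeRange  SnapshotChunkingMode = "range"
	SnapshotChunkingModeKeyset SnapshotChunkingMode = "keyset"
	SnapshotChunkingModeOffset SnapshotChunkingMode = "offset"
)

// TableConfig is a hypothetical per-table configuration carrying the
// user's explicit chunking-mode choice.
type TableConfig struct {
	Name                 string
	Schema               string
	SnapshotChunkingMode SnapshotChunkingMode
}

// resolveMode falls back to the automatic decision tree only when the
// user left the mode unset or as "auto"; an explicit choice always wins.
func resolveMode(cfg TableConfig, autoDecision SnapshotChunkingMode) SnapshotChunkingMode {
	if cfg.SnapshotChunkingMode == "" || cfg.SnapshotChunkingMode == SnapshotChunkingModeAuto {
		return autoDecision
	}
	return cfg.SnapshotChunkingMode
}

func main() {
	cfg := TableConfig{Name: "orders", Schema: "public", SnapshotChunkingMode: SnapshotChunkingModeKeyset}
	fmt.Println(resolveMode(cfg, SnapshotChunkingModeRange)) // explicit override wins
}
```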
```go
// ChunkResult contains the result of processing a chunk
type ChunkResult struct {
	RowCount int64
	LastPK   *int64 // Last processed primary key value (for keyset pagination)
}
```
In keyset chunking, the starting cursor for the next chunk must be reliably determined from the previous one; that's why we needed this model.
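The cursor handoff can be sketched like this. `nextRangeStart` is a hypothetical helper, not the PR's code; it mirrors the `COALESCE(last_pk, range_start)` fallback from the diff:

```go
package main

import "fmt"

// ChunkResult mirrors the struct from the diff: LastPK carries the cursor
// that seeds the next sequential keyset chunk.
type ChunkResult struct {
	RowCount int64
	LastPK   *int64
}

// nextRangeStart is a hypothetical helper: it derives the next chunk's
// starting cursor from the previous chunk's result, falling back to the
// planned start when the previous chunk recorded no last PK.
func nextRangeStart(prev ChunkResult, plannedStart int64) int64 {
	if prev.LastPK != nil {
		return *prev.LastPK // resume from the last processed key
	}
	return plannedStart
}

func main() {
	last := int64(5000)
	fmt.Println(nextRangeStart(ChunkResult{RowCount: 100, LastPK: &last}, 0)) // 5000
	fmt.Println(nextRangeStart(ChunkResult{}, 42))                            // 42
}
```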
```diff
-"SELECT * FROM %s.%s WHERE %s >= %d AND %s <= %d ORDER BY %s LIMIT %d",
+`SELECT * FROM "%s"."%s" WHERE "%s" >= %d AND "%s" <= %d ORDER BY %s LIMIT %d`,
```
I have updated the SQL generation logic to wrap column names in double quotes ("column_name"). This ensures our queries are compliant with the PostgreSQL Lexical Structure for "Quoted Identifiers."
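Full quoted-identifier handling also doubles any embedded quote character. A minimal sketch of that rule (the `quoteIdent` helper is an assumption for illustration, not part of this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent sketches PostgreSQL identifier quoting: wrap the name in
// double quotes and double any embedded double-quote characters, per the
// "Quoted Identifiers" rules of the PostgreSQL lexical structure.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

func main() {
	// Mixed-case and reserved-word names survive quoting unchanged.
	fmt.Println(fmt.Sprintf(
		`SELECT * FROM %s.%s WHERE %s >= %d ORDER BY %s LIMIT %d`,
		quoteIdent("public"), quoteIdent("Order"),
		quoteIdent("user"), 1, quoteIdent("user"), 50,
	))
}
```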
```sql
heartbeat_at = '%s',
-- Update range_start from previous chunk's last_pk for sequential keyset chunks
range_start = CASE
    WHEN c.range_end IS NULL AND c.range_start < 0 AND c.chunk_index > 0
    THEN COALESCE((SELECT last_pk FROM prev_chunk_info), c.range_start)
    ELSE c.range_start
END
```
The range_end IS NULL condition is scoped to sequential keyset only; range, offset, and parallel keyset chunks are unaffected.
```go
// Use NTILE to divide rows into equal groups and get boundary values
// Use quoted identifiers to handle special characters in table/column names
query := fmt.Sprintf(`
	WITH chunk_boundaries AS (
		SELECT
			"%s" as pk_value,
			NTILE(%d) OVER (ORDER BY "%s") as chunk_num
		FROM "%s"."%s"
	)
	SELECT
		chunk_num - 1 as chunk_index,
		MIN(pk_value) as range_start,
		MAX(pk_value) as range_end
	FROM chunk_boundaries
	GROUP BY chunk_num
	ORDER BY chunk_num
`, pkColumn, numChunks, pkColumn, table.Schema, table.Name)
```
We spoke with @Abdulsametileri. There is a performance problem here at the moment: the NTILE window has to scan and sort the entire table just to compute chunk boundaries.
|
Let's use ctid partitioning instead (#67):

```go
{
	Name:   "yourTable",
	Schema: "yourSchema",
	SnapshotPartitionStrategy: publication.SnapshotPartitionStrategyCTIDBlock,
},
```
|
We added ctid partitioning and fixed this with it for now, so I am closing this. We may revisit it in the future if needed.