Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Tailing extraction mode #22

Closed
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 108 additions & 0 deletions proposals/draft/SIPXX - Tap tail extraction mode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
## Status

| header | header |
| ------ | ------ |
| State | Draft |
| Issue Link | https://github.com/MeltanoLabs/Singer-Working-Group/issues/7 |
| Discussion Thread(s) | (optional link) |
| Created | 2021-10-25

-----------------------

## Proposal

### TL;DR Overview

To support streaming use cases, tap's may want to support continuous "tailing" of sources.

There are no changes needed to the spec and it's possible to create and use a tap in an "always-on" fashion today. However, there's no explicit guidance or best practices defined for tap developers or end-users on leveraging this pattern.


### What specific change do you propose to make?

The underlying mechanic (polling micro-batches, reading a streaming endpoint, etc) rests with the tap developer and is still based on the feature set and data source implementation details. However, we could support a standardized boolean setting like `tail` that would cause taps to alter their behavior and begin to extract data from the source continuously. A tap would be expected to continue extracting data until its stopped or encounters errors.

That is, the `tail` setting might alter a taps behavior such that:

- It may enter an infinite loop using its existing extraction implementation and enter a `extract -> sleep -> extract -> ...` pattern.
- It may adjust and begin reading/emitting data continuously by connecting to a dedicated streaming endpoint.

Additionally, we should detail a standardized set of supporting options and their expected behaviors for use with `tail` mode.

1. To honor resumption after a hard interrupt, a `state_message_min_interval_ms` setting should be established, with the expectation that this forces a state message to flush at least every x interval, for instance, assuming 1 or more records have been sent since the last STATE message. Tap developers may also want to tie this closely to commit intervals with their source (i.e. commit Kafka offsets to match) and should still document how this relates to deliverability semantics and guarantees.
2. Optionally, for incremental streams where polling is being used a `new_record_polling_interval_ms` setting should be established, with the expectation that this would be the interval used when polling for data. (i.e. sleeping for this interval in a polling loop)
3. Optionally, to keep non-incremental streams whole while running a tap in `tail` mode, a `max_full_table_age_seconds` setting should be established, which would be expected to drive new extracts of streams pulled with `FULL_TABLE` after the set amount of time has elapsed. This may not be desirable in all scenarios.
4. `SIGTERM` signal use should be supported and expected to immediately trigger halting extraction and emitting a state message. After which the tap should exit cleanly. Again some users may also want to tie this to commit intervals with their data source - and tap developers should still provide guidance around data duplication and deliverability semantics.


To provide a consistent user experience for end-users, time unit based settings are suffixed with their expected unit of time (`_ms`).

### Which layer(s) of the Singer ecosystem does this proposal directly touch?

Select all that apply:

- [ ] Singer Specification - required capabilities and behaviors
- [x] Singer Specification - optional capabilities and behaviors
- [x] Singer best practices and other guidance
- [ ] Singer Working Group - practices and procedures
- [ ] Singer documentation (Other)

## Motivation
pandemicsyn marked this conversation as resolved.
Show resolved Hide resolved
> >

### What problem does it solve?

The ability to extract data continuously or more frequently is highly desirable. Data freshness is often the primary problem but in some situations, the data being extracted may also be short-lived at the source.


### Why is it needed?

In streaming or near-realtime scenarios obtaining data in batches with larger intervals impacts data freshness.

Obtaining data more frequently with micro-batches is possible, but when a complete client setup/tear down also occurs can be undesirable, as frequent client disconnects may cause additional overhead upstream with the data source.

Some data sources also have dedicated high performance or low overhead streaming endpoints that are preferred - for which batch patterns are sub-optimal or even unsupported.

## Other Considerations
> >
### Are there any downsides to this change?

There are no provisions for retry semantics and end users may need to make deployment changes to accommodate long running process should they wish to use this feature.

### Is the change backward compatible?

This is backward compatible in the sense there's no actual change to the spec, but existing taps would require updates to expose this functionality.

### Which users are affected by the change?

Tap developers are the users primarily and directly affected since they would need to accommodate this mode of operation, but there is a second order impact for Target developers as well.

Target developers may be affected in cases where the target implementation is reliant on "short lived" batch behavior. They may only be expecting to commit a single batch and so are not proactively persisting records or finalizing writes until a tap has completed. That could result in unbounded transactions times, transactions timeouts, forced rollbacks, etc. To prevent this Target developers should implement a forced checkpoint/finalization of batch processing either for each STATE message received or after a given elapsed time window has passed. Target implementations which already commit and finalize streams after each STATE message would not need to be altered.

### How are users affected by the change? (e.g. DB upgrade required?)

While completely optional, end users must accommodate managing a long-running process to leverage this.

### Prototype Implementations (optional)

N/A

### Future Plans

Retries. With the increased run times, it's much more likely a tap may begin to encounter transient errors (i.e. endpoint disconnects, temporary throttling). In the future, we may want to consider providing guidance around retry handling and define some basic standardized named settings (i.e. `tail_retry`, `tail_max_retries`, `tail_full_table_on_failure`).

Guidance around capturing `HUP` signal to support restarting/reconnecting active connections.

Guidance on how to handle SIGSTOP/SIGCONT use by users.

### Excluded Alternatives

...(if applicable)...

### Acknowledgements

- [@aaronsteers](https://github.com/aaronsteers)

## What defines this SIP as "done"?

...