This repository was archived by the owner on Sep 26, 2023. It is now read-only.

Add Snowpipe support #60

Open
chuwy opened this issue Aug 1, 2018 · 3 comments

Comments

@chuwy
Contributor

chuwy commented Aug 1, 2018

https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-intro.html

@dhuang

dhuang commented Feb 22, 2020

@chuwy has this been explored further?

With the transform-on-load idea in https://github.com/snowplow-incubator/snowplow-snowflake-loader/issues/101, which I've tested within a pipe, I believe Snowpipe could load the enriched files in S3 directly into Snowflake. From a latency perspective, I imagine Snowpipe could stay not too far behind Stream Enrich. Cost-wise, I'd assume it would be more efficient, especially when a warehouse is used exclusively by Snowplow, since a single concurrent query will likely never fully utilize a warehouse.
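
Roughly the kind of pipe I have in mind; the stage, table, and column positions here are only illustrative, not the real enriched TSV layout:

```sql
-- Hypothetical names and illustrative column positions.
CREATE PIPE IF NOT EXISTS atomic.snowplow_pipe
  AUTO_INGEST = TRUE              -- triggered by S3 event notifications
AS
  COPY INTO atomic.events (app_id, collector_tstamp, event_id, unstruct_event)
  FROM (
    SELECT
      $1,                         -- app_id
      $4::TIMESTAMP,              -- collector_tstamp
      $7,                         -- event_id
      PARSE_JSON($59):data:data   -- self-describing JSON payload, as in #101
    FROM @atomic.enriched_stage
  )
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '\t');
```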

Something external to Snowflake would still need to handle the initialization of tables/pipe and the addition of new columns when needed. Addressing new columns may not be trivial, as we can't ensure that the ALTER step happens before the pipe tries to load the data. The ability to pause/unpause the pipe may be helpful there, or some way to ensure the pipe fails when a column is missing and then retries later. On the operational side, Snowpipe's limited transparency is probably also a major concern. However, it does expose some status via `SYSTEM$PIPE_STATUS` and loaded-file history in `COPY_HISTORY`. It will also ensure the same files aren't loaded more than once.
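
For reference, the pause/resume and observability pieces could look something like this (pipe, table, and column names again hypothetical):

```sql
-- Pause ingestion while a mutation is applied, then resume.
ALTER PIPE atomic.snowplow_pipe SET PIPE_EXECUTION_PAUSED = TRUE;
ALTER TABLE atomic.events ADD COLUMN unstruct_event_com_acme_button_press_1 VARIANT;
ALTER PIPE atomic.snowplow_pipe SET PIPE_EXECUTION_PAUSED = FALSE;

-- Current pipe state (execution state, pending file count, etc.).
SELECT SYSTEM$PIPE_STATUS('atomic.snowplow_pipe');

-- Files loaded into the table over the last day, with their load status.
SELECT file_name, status, row_count
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'atomic.events',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));
```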

I'm sure there are plenty of considerations I haven't thought of, but I just wanted to inquire since I think there are some big upsides to this, both in simplifying the load workflow and potential cost/latency wins.

cc @jbeemster since you've given me some insight around concerns with Snowpipe in the past.

@chuwy
Contributor Author

chuwy commented Feb 22, 2020

The approach that we are currently considering is somewhat similar to Snowpipe, but portable across Snowplow loaders (Redshift and potentially other OLAPs). Long story short, we're planning to rewrite the transformation step as a streaming application that sinks data in real time, and to make the loader a long-running (rather than batch) application handling mutation (as you proposed) and COPY statements (if we don't take #101 into account).

@istreeter
Contributor

It's true you can use the transform in #101 and create a pipe that extracts unstructured events into their own column.

However... no matter what I tried, I could not get the same thing to work with contexts and derived contexts, which are arrays of entities, each with a different schema. I tried combinations of `LATERAL FLATTEN` and `ARRAY_AGG` to extract the contexts into their own columns, but I always got compilation errors like `COPY statement only supports simple SELECT from stage statements for import`.
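
For the record, this is roughly the shape of what I was attempting (stage, table, and column positions are made up for illustration), in case anyone wants to reproduce the error; flattening and re-aggregating the contexts array goes beyond the "simple SELECT from stage" that COPY accepts:

```sql
-- Snowflake rejects this transformation inside COPY / pipes.
COPY INTO atomic.events (event_id, contexts_com_acme_my_entity_1)
FROM (
  SELECT
    $7,                          -- event_id (illustrative position)
    ARRAY_AGG(ctx.value:data)    -- collect all entities of one schema into a column
  FROM @atomic.enriched_stage,
       LATERAL FLATTEN(input => PARSE_JSON($52):data) ctx   -- contexts (illustrative position)
  WHERE ctx.value:schema::VARCHAR LIKE 'iglu:com.acme/my_entity/%'
  GROUP BY $7
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '\t');
-- Fails with: COPY statement only supports simple SELECT from stage statements for import.
```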
