This repository was archived by the owner on Sep 26, 2023. It is now read-only.

Add Snowpipe support #60

Open
chuwy opened this issue Aug 1, 2018 · 3 comments

Comments

@chuwy
Contributor

chuwy commented Aug 1, 2018

https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-intro.html

@dhuang

dhuang commented Feb 22, 2020

@chuwy has this been explored further?

With the transform-on-load idea in https://github.com/snowplow-incubator/snowplow-snowflake-loader/issues/101, which I've tested within a pipe, I believe Snowpipe could load the enriched files in S3 directly into Snowflake. From a latency perspective, I imagine Snowpipe could stay not too far behind Stream Enrich. Cost-wise, I'd assume it would be more efficient, especially when a warehouse is used exclusively by Snowplow, since a single concurrent query will likely never fully utilize a warehouse.
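
Roughly the kind of pipe I have in mind; the stage, table, and column positions here are only illustrative, not the real enriched TSV layout:

```sql
-- Hypothetical names and illustrative column positions.
CREATE PIPE IF NOT EXISTS atomic.snowplow_pipe
  AUTO_INGEST = TRUE              -- triggered by S3 event notifications
AS
  COPY INTO atomic.events (app_id, collector_tstamp, event_id, unstruct_event)
  FROM (
    SELECT
      $1,                         -- app_id
      $4::TIMESTAMP,              -- collector_tstamp
      $7,                         -- event_id
      PARSE_JSON($59):data:data   -- self-describing JSON payload, as in #101
    FROM @atomic.enriched_stage
  )
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '\t');
```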

Something external to Snowflake would still need to handle the initialization of tables/pipe and the addition of new columns when needed. Addressing new columns may not be trivial, as we can't ensure that the ALTER step happens before the pipe tries to load the data. The ability to pause/unpause the pipe may be helpful there, or some way to ensure the pipe fails when a column is missing and then retries later. On the operational side, Snowpipe's limited transparency is probably also a major concern. However, it does expose some status via `SYSTEM$PIPE_STATUS` and loaded-file history in `COPY_HISTORY`. It will also ensure the same files aren't loaded more than once.
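
For reference, the pause/resume and observability pieces could look something like this (pipe, table, and column names again hypothetical):

```sql
-- Pause ingestion while a mutation is applied, then resume.
ALTER PIPE atomic.snowplow_pipe SET PIPE_EXECUTION_PAUSED = TRUE;
ALTER TABLE atomic.events ADD COLUMN unstruct_event_com_acme_button_press_1 VARIANT;
ALTER PIPE atomic.snowplow_pipe SET PIPE_EXECUTION_PAUSED = FALSE;

-- Current pipe state (execution state, pending file count, etc.).
SELECT SYSTEM$PIPE_STATUS('atomic.snowplow_pipe');

-- Files loaded into the table over the last day, with their load status.
SELECT file_name, status, row_count
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'atomic.events',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));
```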

I'm sure there are plenty of considerations I haven't thought of, but I just wanted to inquire since I think there are some big upsides to this, both in simplifying the load workflow and potential cost/latency wins.

cc @jbeemster since you've given me some insight around concerns with Snowpipe in the past.

@chuwy
Contributor Author

chuwy commented Feb 22, 2020

The approach that we are currently considering is somewhat similar to Snowpipe, but portable across Snowplow loaders (Redshift and potentially other OLAPs). Long story short, we're planning to rewrite the transformation step as a streaming application that sinks data in real time, and to make the loader a long-running (rather than batch) application handling mutation (as you proposed) and COPY statements (if we don't take #101 into account).

@istreeter
Contributor

It's true you can use the transform in #101 and create a pipe that extracts unstructured events into their own column.

However... no matter what I tried, I could not get the same thing to work with contexts and derived contexts, which are arrays of entities, each with a different schema. I tried combinations of `LATERAL FLATTEN` and `ARRAY_AGG` to extract the contexts into their own columns, but I always got compilation errors like `COPY statement only supports simple SELECT from stage statements for import`.
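
For the record, this is roughly the shape of what I was attempting (stage, table, and column positions are made up for illustration), in case anyone wants to reproduce the error; flattening and re-aggregating the contexts array goes beyond the "simple SELECT from stage" that COPY accepts:

```sql
-- Snowflake rejects this transformation inside COPY / pipes.
COPY INTO atomic.events (event_id, contexts_com_acme_my_entity_1)
FROM (
  SELECT
    $7,                          -- event_id (illustrative position)
    ARRAY_AGG(ctx.value:data)    -- collect all entities of one schema into a column
  FROM @atomic.enriched_stage,
       LATERAL FLATTEN(input => PARSE_JSON($52):data) ctx   -- contexts (illustrative position)
  WHERE ctx.value:schema::VARCHAR LIKE 'iglu:com.acme/my_entity/%'
  GROUP BY $7
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '\t');
-- Fails with: COPY statement only supports simple SELECT from stage statements for import.
```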
