From 2a15bad6dd86c6ebf39f646abb1392f12ccb1753 Mon Sep 17 00:00:00 2001 From: John Reid Date: Wed, 30 Jul 2025 17:05:44 +0100 Subject: [PATCH 1/9] Add Segment to Snowplow migration guide --- .../segment_migration_guide_markdown.md | 216 ++++++++++++++++++ 1 file changed, 216 insertions(+) create mode 100644 docs/resources/migration-guides/segment_migration_guide_markdown.md diff --git a/docs/resources/migration-guides/segment_migration_guide_markdown.md b/docs/resources/migration-guides/segment_migration_guide_markdown.md new file mode 100644 index 000000000..303317a99 --- /dev/null +++ b/docs/resources/migration-guides/segment_migration_guide_markdown.md @@ -0,0 +1,216 @@ +# A competitive migration guide: From Segment to Snowplow + +This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. + +## The strategic imperative: Why data teams migrate from Segment to Snowplow + +The move from Segment to Snowplow is usually driven by a desire for greater control, higher data fidelity, and a more predictable financial model. + +### Achieve data ownership and control in your cloud + +The key architectural difference is deployment. Segment is a SaaS platform where your data is processed on their servers. Snowplow runs as a set of services in your private cloud (AWS/GCP/Azure), giving you full ownership of your data at all stages. + +This provides several advantages: + +- **Enhanced security and compliance**: Keeping data within your own cloud simplifies security reviews and compliance audits (e.g., GDPR, CCPA, HIPAA), as no third-party vendor processes raw user data +- **Complete data control**: You can configure, scale, and monitor every component of the pipeline according to your specific needs +- **Elimination of vendor lock-in**: Because you own the infrastructure and the data format is open, you are not locked into a proprietary ecosystem + +### A new approach to governance: Foundational data quality + +Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is often a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. Furthermore, there is no separation between development and production environments, meaning no easy way to test changes before deploying them. + +Snowplow enforces data quality proactively with mandatory, machine-readable **schemas** for every event and entity. Events that fail validation are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. + +### Unlock advanced analytics with greater granularity + +Segment is primarily a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's event-entity model allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. 
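To make the contrast concrete at the call site, here is a minimal sketch of the same action tracked both ways, assuming Segment's `analytics.js` snippet and version 3 of the Snowplow JavaScript tag API are already loaded on the page; the `com.acme` schema URIs and the property names are illustrative placeholders rather than anything shipped with either product, and the exact argument shape varies between Snowplow tracker versions.

```javascript
// Segment: one flat properties object per track call
analytics.track('Product Added', {
  sku: 'SKU-123',
  name: 'Waterproof jacket',
  price: 120.0,
  currency: 'USD'
});

// Snowplow: a self-describing event plus reusable context entities
window.snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.acme/add_to_basket/jsonschema/1-0-0',
    data: { quantity: 1 }
  },
  context: [
    {
      // The product entity is defined by its own schema and can be
      // attached, unchanged, to any other event in the journey
      schema: 'iglu:com.acme/product/jsonschema/1-0-0',
      data: { sku: 'SKU-123', name: 'Waterproof jacket', price: 120.0, currency: 'USD' }
    }
  ]
});
```

The pipeline then layers further entities (device, session, page, and so on) onto the same event during enrichment, which is where the 100+ structured data points per event come from.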
+ +This rich, structured data is ideal for: + +- **Complex data modeling**: Snowplow provides open-source dbt packages to transform raw data into analysis-ready tables +- **AI and machine learning**: High-fidelity data is ideal for training ML models like recommendation engines or churn predictors +- **Deep user behavior analysis**: Rich entities enable multi-faceted exploration of user journeys without complex data wrangling + +### A predictable, infrastructure-based cost model + +Segment's pricing is based on Monthly Tracked Users (MTUs), which can become expensive and unpredictable as you scale. This model can penalize growth. + +Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP), which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. + +| Feature | Segment | Snowplow | +|---------|---------|----------| +| **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | +| **Data Ownership** | Data access in warehouse; vendor controls pipeline | True ownership of data and entire pipeline infrastructure | +| **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | +| **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | +| **Primary Use Case** | Data routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | +| **Pricing Model** | Based on Monthly Tracked Users (MTUs) and API calls | Based on your underlying cloud infrastructure costs | +| **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | + +## Deconstructing the data model: From flat events to rich context + +To appreciate the strategic value of migrating to Snowplow, it is essential to understand the fundamental differences in how each platform approaches the modeling of behavioral data. This is not just a technical distinction; it is a difference in approach that has consequences for data quality, flexibility, and analytical power. Segment operates on a simple, action-centric model, while Snowplow introduces a more sophisticated, context-centric paradigm that more accurately reflects the complexity of the real world. + +### The Segment model: A review of track, identify, and the property-centric approach + +Segment's data specification is designed for simplicity and ease of use. It is built around a handful of core API methods that capture the essential elements of user interaction. The most foundational of these is the `track` call, which is designed to answer the question, "What is the user doing?". Each `track` call records a single user action, known as an event, which has a human-readable name (e.g., `User Registered`) and an associated `properties` object. This object is a simple JSON containing key-value pairs that describe the action (e.g., `plan: 'pro'`, accountType: 'trial'`). + +The other key methods in the Segment spec support this action-centric view: + +- **`identify`**: Answers the question, "Who is the user?" 
It associates a `userId` with a set of `traits` (e.g., `email`, `name`), which describe the user themselves +- **`page` and `screen`**: Record when a user views a webpage or a mobile app screen, respectively +- **`group`**: Associates an individual user with a group, such as a company or organization +- **`alias`**: Used to merge the identities of a user across different systems or states (e.g., anonymous to logged-in) + +This model forces the world into a verb-centric framework. The event—the action—is the primary object of interest. All other information, whether it describes the product involved, the user performing the action, or the page on which it occurred, is relegated to being a "property" or a "trait" attached to that action. While this approach is intuitive, it lacks a formal, structured way to define and reuse the *nouns* of the business—the users, products, content, and campaigns—as first-class, independent components of the data model itself. This architectural choice leads to data being defined and repeated within the context of each individual action, rather than as a set of interconnected, reusable concepts. + +### The Snowplow approach: Understanding the event-entity distinction + +Snowplow introduces a more nuanced and powerful paradigm that separates the *event* (the action that occurred at a point in time) from the *entities* (the nouns that were involved in that action). In Snowplow, every tracked event can be decorated with an array of contextual entities. This is the core of the event-entity model. + +An **event** is an immutable record of something that happened. A **self-describing event** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. + +An **entity**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator: Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: + +- `view_product` +- `add_to_basket` +- `remove_from_basket` +- `purchase_product` +- `review_product` + +This approach reflects the real world more accurately. An "event" is a momentary action, while "entities" like users, products, and marketing campaigns are persistent objects that participate in many events over time. This separation provides immense power. It allows you to analyze the `product` entity across its entire lifecycle, from initial discovery to final purchase, by querying a single, consistent data structure. You are no longer forced to hunt for and coalesce disparate property fields (`viewed_product_sku`, `purchased_product_sku`, etc.) across different event tables. + +Furthermore, Snowplow comes with a rich set of out-of-the-box entities that can be enabled to automatically enrich every event with crucial context, such as the `webPage` entity, the `performanceTiming` entity for site speed, and the `user` entity. This moves the data model from being action-centric to being context-centric, providing a much richer and more interconnected view of behavior from the moment of collection. 
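As a rough sketch of how this looks in the browser, the snippet below enables two of those out-of-the-box entities at initialization and then reuses a single custom `product` entity across two different events; the collector URL, `appId`, and `com.acme` schemas are placeholders, and the exact initialization option names depend on your tracker version and bundled plugins.

```javascript
// Enable out-of-the-box entities once, when the tracker is initialized
window.snowplow('newTracker', 'sp', 'https://collector.acme.com', {
  appId: 'acme-web',
  contexts: {
    webPage: true,          // attach the web_page entity to every event
    performanceTiming: true // attach site-speed timings to every event
  }
});

// Define the product entity once...
const product = {
  schema: 'iglu:com.acme/product/jsonschema/1-0-0',
  data: { sku: 'SKU-123', name: 'Waterproof jacket', price: 120.0 }
};

// ...and attach the same entity to different events across the journey
window.snowplow('trackSelfDescribingEvent', {
  event: { schema: 'iglu:com.acme/view_product/jsonschema/1-0-0', data: {} },
  context: [product]
});
window.snowplow('trackSelfDescribingEvent', {
  event: { schema: 'iglu:com.acme/add_to_basket/jsonschema/1-0-0', data: { quantity: 1 } },
  context: [product]
});
```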
+ +### The language of your business: Building composable data structures with self-describing schemas (data contracts) + +The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. In the Segment world, developers often start by implementing events, and then data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find and optional. + +In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **Iglu**. Each schema is a machine-readable contract that specifies: + +- **Identity**: A unique URI comprising a vendor (`com.acme`), a name (`product`), a format (`jsonschema`), and a version (`1-0-0`) +- **Structure**: The exact properties the event or entity should contain (e.g., `sku`, `name`, `price`) +- **Validation Rules**: The data type for each property (`string`, `number`, `boolean`), as well as constraints like minimum/maximum length, regular expression patterns, or enumerated values + +The data payload itself contains a reference to the specific schema and version that defines it, which is why it's called a "self-describing JSON". This creates a powerful, unambiguous, and shared language for data across the entire organization. When a product manager designs a new feature, they collaborate with engineers and analysts to define the schemas for the new events and entities involved. This contract is then stored in Iglu. The engineers implement tracking based on this contract, and the analysts know exactly what data to expect in the warehouse because they can reference the same contract. This is a cultural shift that treats data as a deliberately designed product, not as a byproduct of application code. + +### Analytical implications: How the event-entity model unlocks deeper, contextual insights + +The architectural advantage of the event-entity model becomes apparent in the data warehouse. In a Segment implementation, each custom event type is loaded into its own table (e.g., `order_completed`, `product_viewed`). While this provides structure, it can lead to a large number of tables in the warehouse, a challenge sometimes referred to as "schema sprawl." A significant amount of analytical work involves discovering the correct tables and then `UNION`-ing them together to reconstruct a user's complete journey. + +Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifies this downstream work by using a "one big table" approach. All data is loaded into a single, wide `atomic.events` table. Self-describing events and their associated entities are not loaded into separate tables. Instead, they are stored as dedicated, structured columns within that one table—for example, as an `OBJECT` in Snowflake or a `REPEATED RECORD` in BigQuery. This model avoids the schema sprawl of the Segment approach. + +For an analyst, this means that to get a complete picture of an `add_to_cart` event and the product involved, they query a single, predictable table. The event and all its contextual entities are present in the same row. This structure can simplify data modeling in tools like dbt and accelerate time-to-insight, as the analytical work shifts from joining many disparate event tables to unnesting or accessing data within the structured columns of a single table. 
It is important to note that this loading behavior is different for Amazon Redshift, where each entity type does get loaded into its own separate table. + +| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail | +|-----------------|-----------------|---------------------|--------------------------------| +| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-describing event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties | +| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User entity and `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at` | +| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` and `web_page` entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity | +| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.) | + +## Architecting your migration: A phased framework + +A successful migration requires a well-defined strategy that manages risk and ensures data continuity. This section outlines a high-level project plan, including different strategic scenarios and a plan for handling historical data. + +### The three-phase migration roadmap + +A migration from Segment to Snowplow can be broken down into three phases: + +- **Phase 1: Assess and plan** + - Audit all existing Segment `track`, `identify`, `page`, and `group` calls + - Export the complete Segment Tracking Plan via API (if you still have an active account) or infer it from data in a data warehouse + - Translate the Segment plan into a Snowplow tracking plan, defining event schemas and identifying reusable entities - using the Snowplow CLI MCP Server + - Deploy the Snowplow pipeline components (Collector, Enrich, Loaders) and the Iglu Schema Registry in your cloud +- **Phase 2: Implement and validate** + - Add Snowplow trackers to your applications to run in parallel with existing Segment trackers (dual-tracking) + - Use tools like Snowplow Micro for local testing and validation before deployment + - Perform end-to-end data reconciliation in your data warehouse by comparing Segment and Snowplow data to ensure accuracy +- **Phase 3: Cutover and optimize** + - Update all downstream data consumers (BI dashboards, dbt models) to query the new Snowplow data tables + - Remove the Segment trackers and SDKs from application codebases + - Decommission the Segment sources and, eventually, the subscription + +### Migration scenario 1: The parallel-run approach + +The parallel-run approach is the recommended, lowest-risk strategy. It involves running both systems simultaneously (dual-tracking) to validate data integrity before cutting over. Existing Segment-powered workflows remain operational while you test and reconcile the new Snowplow data in the warehouse. 
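In practice, dual-tracking is easiest to manage when each business action is routed through a single helper, so the two SDKs stay in lockstep until cutover; the sketch below assumes both the Segment and Snowplow JavaScript snippets are loaded, and the schema URI is illustrative.

```javascript
// One helper per business action keeps Segment and Snowplow in sync
// during the parallel run, and gives you a single place to remove the
// Segment call at cutover.
function trackOrderCompleted(order) {
  // Existing Segment call stays untouched
  analytics.track('Order Completed', {
    order_id: order.id,
    revenue: order.revenue,
    currency: order.currency
  });

  // New Snowplow call runs alongside it
  window.snowplow('trackSelfDescribingEvent', {
    event: {
      schema: 'iglu:com.acme/order_completed/jsonschema/1-0-0',
      data: { order_id: order.id, revenue: order.revenue, currency: order.currency }
    }
  });
}
```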
This approach builds confidence and allows you to resolve discrepancies without impacting production systems. + +### Migration scenario 2: The full re-architecture + +A "rip-and-replace" approach is faster but riskier, involving a direct switch from Segment to Snowplow SDKs. This is best suited for: + +- New projects or applications with no legacy system +- Major application refactors where the switch can be part of a larger effort +- Teams with high risk tolerance and robust automated testing frameworks + +This strategy requires thorough pre-launch testing in a staging environment to prevent data loss. + +### A strategy for historical data + +You have two main options for handling historical data from Segment: + +- **Option A: Coexistence (Pragmatic)** Leave historical Segment data in its existing tables. For longitudinal analysis, write queries that `UNION` data from both Segment and Snowplow tables, using a transformation layer (e.g., in dbt) to create a compatible structure. This avoids a large backfill project +- **Option B: Unification (Backfill)** For a single, unified dataset, undertake a custom engineering project to transform and backfill historical data. This involves exporting Segment data, writing a script to reshape it into the Snowplow enriched event format, and loading it into the warehouse. This is a significant effort but provides a consistent historical dataset + +## The technical playbook: Executing your migration + +This section provides a detailed, hands-on playbook for the technical execution of the migration. A central theme of this playbook is the use of the Snowplow CLI and its integrated AI capabilities to accelerate the most challenging part of the migration: designing a new, high-quality tracking plan. + +### Step 1: Deconstruct your legacy: Export and analyze the Segment tracking plan + +Before building the new data foundation, you must create a complete blueprint of the existing structure. The first practical step is to export your Segment Tracking Plan into a machine-readable format that can serve as the raw material for your redesign. + +There are two primary methods for this export: + +1. **Manual CSV download**: The Segment UI provides an option to download your Tracking Plan as a CSV file. This is a quick way to get a human-readable inventory of your events and properties. However, it can be less ideal for programmatic analysis and may not capture the full structural detail of your plan +2. **Programmatic API export (recommended)**: The superior method is to use the Segment Public API. The API allows you to programmatically list all Tracking Plans in your workspace and retrieve the full definition of each plan, including its rules, in a structured JSON format. This JSON output is invaluable because it often includes the underlying JSON Schema that Segment uses to validate the `properties` of each event + +The result of this step is a definitive, version-controlled artifact (e.g., a `segment_plan.json` file) that represents the ground truth of your current tracking implementation. This file will be the primary input for the next step of the process. + +### Step 2: AI-assisted design: Build your Snowplow tracking plan with the CLI and MCP server + +Next, you'll need to translate that tracking plan into a Snowplow-appropriate format (Data Products and Data Structures). + +The Snowplow CLI is a command-line utility that includes a Model Context Protocol (MCP) server, so you can use an AI agent to generate idiomatic Snowplow tracking. 
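As an illustration of the target format, a Segment `Order Completed` event might translate into a self-describing JSON Schema along these lines; the vendor, property names, and constraints are assumptions to be replaced by whatever your own tracking plan defines.

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Fired when a customer completes an order",
  "self": {
    "vendor": "com.acme",
    "name": "order_completed",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "order_id": { "type": "string", "maxLength": 64 },
    "revenue": { "type": "number", "minimum": 0 },
    "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] }
  },
  "required": ["order_id", "revenue", "currency"],
  "additionalProperties": false
}
```

Publishing a contract like this to your Iglu registry is what makes the tracking plan enforceable by the pipeline rather than merely advisory.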
For more information on how to do this, read the [tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/). + +### Step 3: Re-instrument your codebase: A conceptual guide + +With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/using-the-cli/), our Code Generation tool, to automatically generate type-safe tracking code. + +#### Migrate client-side tracking: From analytics.js to the Snowplow Browser Tracker + +The Snowplow JavaScript/Browser tracker introduces a more modern and readable API. The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments. + +- A Segment call like `analytics.track('Event', {prop: 'value'})` becomes a Snowplow call like `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})` +- A Segment `identify` call is replaced by a combination of a `setUserId` call to set the primary user identifier and the attachment of a custom `user` entity to provide the user traits + +This object-based approach improves code readability, as the purpose of each value is explicit, and makes the tracking calls more extensible for the future. + +#### Migrate server-side and mobile tracking: An overview of Snowplow's polyglot trackers + +Snowplow provides a comprehensive suite of trackers for virtually every common back-end language and mobile platform, including Java, Python, .NET, Go, Ruby, iOS (Swift/Objective-C), and Android (Kotlin/Java). + +While the syntax is idiomatic to each language, the underlying paradigm remains the same across all trackers. The developer will: + +1. Initialize the tracker with the endpoint of their Snowplow collector +2. Use builder patterns or helper classes to construct self-describing events and entity objects, referencing the schema URIs from the Iglu registry. For example, the Java tracker uses a `SelfDescribing.builder()` to construct the event payload +3. Use a `track` method to send the fully constructed event to the collector + +The consistency of the event-entity model across all trackers ensures that data from every platform will arrive in the warehouse in a unified, coherent structure. + +### Step 4: Ensure a smooth transition: Validation, testing, and cutover + +The final step is to rigorously validate the new implementation and manage the cutover. A smooth transition is non-negotiable. + +#### Local validation with Snowplow Micro + +To empower developers and "shift-left" on data quality, customers should incorporate **Snowplow Micro**. Micro is a complete Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. 
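A common setup is to switch the tracker's collector endpoint by environment so that local builds send events to Micro rather than production; the sketch below assumes Micro is running locally (for example via Docker) on its default port 9090, and the production collector URL is a placeholder.

```javascript
// Send events to Snowplow Micro during local development
// (assumes Micro is running locally on its default port 9090)
const collectorUrl = process.env.NODE_ENV === 'production'
  ? 'https://collector.acme.com' // placeholder production collector
  : 'http://localhost:9090';     // local Snowplow Micro

window.snowplow('newTracker', 'sp', collectorUrl, { appId: 'acme-web' });
// Generated events, including any that fail schema validation,
// can then be inspected in the Micro UI before the code is merged.
```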
This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. + +#### End-to-end data reconciliation strategies + +During the parallel-run phase, it is essential to perform end-to-end data reconciliation in the data warehouse. This involves writing a suite of SQL queries to compare the data collected by the two systems. Analysts should compare high-level metrics like daily event counts and unique user counts, as well as the values of specific, critical properties. The goal is not to achieve 100% identical data—the data models are different, which is the point of the migration. The goal is to be able to confidently explain any variances and to prove that the new Snowplow pipeline is capturing all critical business logic correctly. + +#### Final cutover: Decommission Segment senders + +Once the data has been thoroughly reconciled and all downstream dependencies (e.g., BI dashboards, ML models, marketing automation workflows) have been successfully migrated to use the new, richer Snowplow data tables, the team can proceed with the final cutover. This involves a coordinated deployment to remove the Segment SDKs and all `analytics.track()` calls from the codebases. Following general data migration best practices, the old Segment sources should be left active for a short period as a final fallback before being fully decommissioned. \ No newline at end of file From 9d61ae6dbcafeecc4eda695fcad4efeb71381fde Mon Sep 17 00:00:00 2001 From: John Reid Date: Wed, 30 Jul 2025 17:10:30 +0100 Subject: [PATCH 2/9] Added docs links --- .../segment_migration_guide_markdown.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/resources/migration-guides/segment_migration_guide_markdown.md b/docs/resources/migration-guides/segment_migration_guide_markdown.md index 303317a99..d416b57a6 100644 --- a/docs/resources/migration-guides/segment_migration_guide_markdown.md +++ b/docs/resources/migration-guides/segment_migration_guide_markdown.md @@ -20,15 +20,15 @@ This provides several advantages: Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is often a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. Furthermore, there is no separation between development and production environments, meaning no easy way to test changes before deploying them. -Snowplow enforces data quality proactively with mandatory, machine-readable **schemas** for every event and entity. Events that fail validation are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. +Snowplow enforces data quality proactively with mandatory, machine-readable **[schemas](https://docs.snowplow.io/docs/fundamentals/schemas/)** for every [event](https://docs.snowplow.io/docs/fundamentals/events/) and [entity](https://docs.snowplow.io/docs/fundamentals/entities/). [Events that fail validation](https://docs.snowplow.io/docs/fundamentals/failed-events/) are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. 
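To make the validation step concrete, here is a hedged sketch of a call that would be rejected under a schema requiring an `order_id` and restricting `currency` to an enumerated list; the schema URI and its rules are illustrative.

```javascript
// With a schema that requires `order_id` and allows only 'USD'|'EUR'|'GBP'
// for `currency`, this event fails validation and is quarantined as a
// failed event instead of landing in the warehouse.
window.snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.acme/order_completed/jsonschema/1-0-0',
    data: { revenue: 99.99, currency: 'usd' } // missing order_id, invalid currency
  }
});
```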
### Unlock advanced analytics with greater granularity -Segment is primarily a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's event-entity model allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. +Segment is primarily a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's [event-entity model](https://docs.snowplow.io/docs/fundamentals/events/) allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. This rich, structured data is ideal for: -- **Complex data modeling**: Snowplow provides open-source dbt packages to transform raw data into analysis-ready tables +- **Complex data modeling**: Snowplow provides [open-source dbt packages](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/) to transform raw data into analysis-ready tables - **AI and machine learning**: High-fidelity data is ideal for training ML models like recommendation engines or churn predictors - **Deep user behavior analysis**: Rich entities enable multi-faceted exploration of user journeys without complex data wrangling @@ -69,9 +69,9 @@ This model forces the world into a verb-centric framework. The event—the actio Snowplow introduces a more nuanced and powerful paradigm that separates the *event* (the action that occurred at a point in time) from the *entities* (the nouns that were involved in that action). In Snowplow, every tracked event can be decorated with an array of contextual entities. This is the core of the event-entity model. -An **event** is an immutable record of something that happened. A **self-describing event** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. +An **[event](https://docs.snowplow.io/docs/fundamentals/events/)** is an immutable record of something that happened. A **[self-describing event](https://docs.snowplow.io/docs/fundamentals/events/#self-describing-events)** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. -An **entity**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator: Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: +An **[entity](https://docs.snowplow.io/docs/fundamentals/entities/)**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator: Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. 
This one entity can then be attached to a multitude of different events throughout the customer journey: - `view_product` - `add_to_basket` @@ -81,13 +81,13 @@ An **entity**, however, is a reusable, self-describing JSON object that provides This approach reflects the real world more accurately. An "event" is a momentary action, while "entities" like users, products, and marketing campaigns are persistent objects that participate in many events over time. This separation provides immense power. It allows you to analyze the `product` entity across its entire lifecycle, from initial discovery to final purchase, by querying a single, consistent data structure. You are no longer forced to hunt for and coalesce disparate property fields (`viewed_product_sku`, `purchased_product_sku`, etc.) across different event tables. -Furthermore, Snowplow comes with a rich set of out-of-the-box entities that can be enabled to automatically enrich every event with crucial context, such as the `webPage` entity, the `performanceTiming` entity for site speed, and the `user` entity. This moves the data model from being action-centric to being context-centric, providing a much richer and more interconnected view of behavior from the moment of collection. +Furthermore, Snowplow comes with a rich set of [out-of-the-box entities](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/#out-of-the-box-entity-tracking) that can be enabled to automatically enrich every event with crucial context, such as the `webPage` entity, the `performanceTiming` entity for site speed, and the `user` entity. This moves the data model from being action-centric to being context-centric, providing a much richer and more interconnected view of behavior from the moment of collection. ### The language of your business: Building composable data structures with self-describing schemas (data contracts) The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. In the Segment world, developers often start by implementing events, and then data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find and optional. -In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **Iglu**. Each schema is a machine-readable contract that specifies: +In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **[Iglu](https://docs.snowplow.io/docs/fundamentals/schemas/#iglu)**. Each schema is a machine-readable contract that specifies: - **Identity**: A unique URI comprising a vendor (`com.acme`), a name (`product`), a format (`jsonschema`), and a version (`1-0-0`) - **Structure**: The exact properties the event or entity should contain (e.g., `sku`, `name`, `price`) @@ -99,7 +99,7 @@ The data payload itself contains a reference to the specific schema and version The architectural advantage of the event-entity model becomes apparent in the data warehouse. In a Segment implementation, each custom event type is loaded into its own table (e.g., `order_completed`, `product_viewed`). 
While this provides structure, it can lead to a large number of tables in the warehouse, a challenge sometimes referred to as "schema sprawl." A significant amount of analytical work involves discovering the correct tables and then `UNION`-ing them together to reconstruct a user's complete journey. -Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifies this downstream work by using a "one big table" approach. All data is loaded into a single, wide `atomic.events` table. Self-describing events and their associated entities are not loaded into separate tables. Instead, they are stored as dedicated, structured columns within that one table—for example, as an `OBJECT` in Snowflake or a `REPEATED RECORD` in BigQuery. This model avoids the schema sprawl of the Segment approach. +Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifies this downstream work by using a "one big table" approach. All data is loaded into a single, wide [`atomic.events` table](https://docs.snowplow.io/docs/fundamentals/canonical-event/). Self-describing events and their associated entities are not loaded into separate tables. Instead, they are stored as dedicated, structured columns within that one table—for example, as an `OBJECT` in Snowflake or a `REPEATED RECORD` in BigQuery. This model avoids the schema sprawl of the Segment approach. For an analyst, this means that to get a complete picture of an `add_to_cart` event and the product involved, they query a single, predictable table. The event and all its contextual entities are present in the same row. This structure can simplify data modeling in tools like dbt and accelerate time-to-insight, as the analytical work shifts from joining many disparate event tables to unnesting or accessing data within the structured columns of a single table. It is important to note that this loading behavior is different for Amazon Redshift, where each entity type does get loaded into its own separate table. 
@@ -122,13 +122,13 @@ A migration from Segment to Snowplow can be broken down into three phases: - Audit all existing Segment `track`, `identify`, `page`, and `group` calls - Export the complete Segment Tracking Plan via API (if you still have an active account) or infer it from data in a data warehouse - Translate the Segment plan into a Snowplow tracking plan, defining event schemas and identifying reusable entities - using the Snowplow CLI MCP Server - - Deploy the Snowplow pipeline components (Collector, Enrich, Loaders) and the Iglu Schema Registry in your cloud + - Deploy the Snowplow pipeline components ([Collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/), [Enrich](https://docs.snowplow.io/docs/pipeline-components-and-applications/enrichment-components/), [Loaders](https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/)) and the [Iglu Schema Registry](https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/) in your cloud - **Phase 2: Implement and validate** - - Add Snowplow trackers to your applications to run in parallel with existing Segment trackers (dual-tracking) - - Use tools like Snowplow Micro for local testing and validation before deployment + - Add [Snowplow trackers](https://docs.snowplow.io/docs/collecting-data/) to your applications to run in parallel with existing Segment trackers (dual-tracking) + - Use tools like [Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/) for local testing and validation before deployment - Perform end-to-end data reconciliation in your data warehouse by comparing Segment and Snowplow data to ensure accuracy - **Phase 3: Cutover and optimize** - - Update all downstream data consumers (BI dashboards, dbt models) to query the new Snowplow data tables + - Update all downstream data consumers (BI dashboards, [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/)) to query the new Snowplow data tables - Remove the Segment trackers and SDKs from application codebases - Decommission the Segment sources and, eventually, the subscription @@ -172,15 +172,15 @@ The result of this step is a definitive, version-controlled artifact (e.g., a `s Next, you'll need to translate that tracking plan into a Snowplow-appropriate format (Data Products and Data Structures). -The Snowplow CLI is a command-line utility that includes a Model Context Protocol (MCP) server, so you can use an AI agent to generate idiomatic Snowplow tracking. For more information on how to do this, read the [tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/). +The [Snowplow CLI](https://docs.snowplow.io/docs/data-product-studio/snowplow-cli/) is a command-line utility that includes a Model Context Protocol (MCP) server, so you can use an AI agent to generate idiomatic Snowplow tracking. For more information on how to do this, read the [tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/). ### Step 3: Re-instrument your codebase: A conceptual guide -With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/using-the-cli/), our Code Generation tool, to automatically generate type-safe tracking code. 
+With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/), our Code Generation tool, to automatically generate type-safe tracking code. #### Migrate client-side tracking: From analytics.js to the Snowplow Browser Tracker -The Snowplow JavaScript/Browser tracker introduces a more modern and readable API. The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments. +The [Snowplow JavaScript/Browser tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) introduces a more modern and readable API. The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments. - A Segment call like `analytics.track('Event', {prop: 'value'})` becomes a Snowplow call like `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})` - A Segment `identify` call is replaced by a combination of a `setUserId` call to set the primary user identifier and the attachment of a custom `user` entity to provide the user traits @@ -189,11 +189,11 @@ This object-based approach improves code readability, as the purpose of each val #### Migrate server-side and mobile tracking: An overview of Snowplow's polyglot trackers -Snowplow provides a comprehensive suite of trackers for virtually every common back-end language and mobile platform, including Java, Python, .NET, Go, Ruby, iOS (Swift/Objective-C), and Android (Kotlin/Java). +Snowplow provides a comprehensive suite of [trackers](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/) for virtually every common back-end language and mobile platform, including [Java](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/java-tracker/), [Python](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/python-tracker/), [.NET](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/net-tracker/), [Go](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/go-tracker/), [Ruby](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/ruby-tracker/), [iOS](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/ios-tracker/) (Swift/Objective-C), and [Android](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/android-tracker/) (Kotlin/Java). While the syntax is idiomatic to each language, the underlying paradigm remains the same across all trackers. The developer will: -1. Initialize the tracker with the endpoint of their Snowplow collector +1. Initialize the tracker with the endpoint of their [Snowplow collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/) 2. Use builder patterns or helper classes to construct self-describing events and entity objects, referencing the schema URIs from the Iglu registry. 
For example, the Java tracker uses a `SelfDescribing.builder()` to construct the event payload 3. Use a `track` method to send the fully constructed event to the collector @@ -205,7 +205,7 @@ The final step is to rigorously validate the new implementation and manage the c #### Local validation with Snowplow Micro -To empower developers and "shift-left" on data quality, customers should incorporate **Snowplow Micro**. Micro is a complete Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. +To empower developers and "shift-left" on data quality, customers should incorporate **[Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/)**. Micro is a complete Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. #### End-to-end data reconciliation strategies From 64cdb5de1b6836d5358066fa414112187bcd0783 Mon Sep 17 00:00:00 2001 From: John Reid Date: Fri, 1 Aug 2025 14:27:57 +0100 Subject: [PATCH 3/9] Addressed additional feedback from Mike and Leo --- .../segment_migration_guide_markdown.md | 38 +++++++++---------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/resources/migration-guides/segment_migration_guide_markdown.md b/docs/resources/migration-guides/segment_migration_guide_markdown.md index d416b57a6..5dba26ca9 100644 --- a/docs/resources/migration-guides/segment_migration_guide_markdown.md +++ b/docs/resources/migration-guides/segment_migration_guide_markdown.md @@ -1,4 +1,4 @@ -# A competitive migration guide: From Segment to Snowplow +# A competitive migration guide: From Segment to Snowplow Analytics This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. @@ -18,41 +18,41 @@ This provides several advantages: ### A new approach to governance: Foundational data quality -Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is often a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. 
Furthermore, there is no separation between development and production environments, meaning no easy way to test changes before deploying them. +Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. Snowplow enforces data quality proactively with mandatory, machine-readable **[schemas](https://docs.snowplow.io/docs/fundamentals/schemas/)** for every [event](https://docs.snowplow.io/docs/fundamentals/events/) and [entity](https://docs.snowplow.io/docs/fundamentals/entities/). [Events that fail validation](https://docs.snowplow.io/docs/fundamentals/failed-events/) are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. ### Unlock advanced analytics with greater granularity -Segment is primarily a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's [event-entity model](https://docs.snowplow.io/docs/fundamentals/events/) allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. +Segment started out as a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular first-party behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's [event-entity model](https://docs.snowplow.io/docs/fundamentals/events/) allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. This rich, structured data is ideal for: -- **Complex data modeling**: Snowplow provides [open-source dbt packages](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/) to transform raw data into analysis-ready tables +- **Complex data modeling**: Snowplow provides source-available dbt packages to transform raw data into analysis-ready tables - **AI and machine learning**: High-fidelity data is ideal for training ML models like recommendation engines or churn predictors - **Deep user behavior analysis**: Rich entities enable multi-faceted exploration of user journeys without complex data wrangling ### A predictable, infrastructure-based cost model -Segment's pricing is based on Monthly Tracked Users (MTUs), which can become expensive and unpredictable as you scale. This model can penalize growth. +Segment's entry-level pricing is based on Monthly Tracked Users (MTUs), which can become expensive and unpredictable as you scale. This model can penalize growth. -Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP), which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. 
+Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP) plus a license fee depending on event volume which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. | Feature | Segment | Snowplow | |---------|---------|----------| | **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | -| **Data Ownership** | Data access in warehouse; vendor controls pipeline | True ownership of data and entire pipeline infrastructure | +| **Data Ownership** | Data access in warehouse; vendor controls pipeline | Customer owns data and controls pipeline infrastructure | | **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | | **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | -| **Primary Use Case** | Data routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | -| **Pricing Model** | Based on Monthly Tracked Users (MTUs) and API calls | Based on your underlying cloud infrastructure costs | +| **Primary Use Case** | Building a Customer Data Platform for routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | +| **Pricing Model** | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | | **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | ## Deconstructing the data model: From flat events to rich context To appreciate the strategic value of migrating to Snowplow, it is essential to understand the fundamental differences in how each platform approaches the modeling of behavioral data. This is not just a technical distinction; it is a difference in approach that has consequences for data quality, flexibility, and analytical power. Segment operates on a simple, action-centric model, while Snowplow introduces a more sophisticated, context-centric paradigm that more accurately reflects the complexity of the real world. -### The Segment model: A review of track, identify, and the property-centric approach +### The Segment model: A review of `track`, `identify`, and the property-centric approach Segment's data specification is designed for simplicity and ease of use. It is built around a handful of core API methods that capture the essential elements of user interaction. The most foundational of these is the `track` call, which is designed to answer the question, "What is the user doing?". Each `track` call records a single user action, known as an event, which has a human-readable name (e.g., `User Registered`) and an associated `properties` object. This object is a simple JSON containing key-value pairs that describe the action (e.g., `plan: 'pro'`, accountType: 'trial'`). @@ -63,7 +63,7 @@ The other key methods in the Segment spec support this action-centric view: - **`group`**: Associates an individual user with a group, such as a company or organization - **`alias`**: Used to merge the identities of a user across different systems or states (e.g., anonymous to logged-in) -This model forces the world into a verb-centric framework. 
The event—the action—is the primary object of interest. All other information, whether it describes the product involved, the user performing the action, or the page on which it occurred, is relegated to being a "property" or a "trait" attached to that action. While this approach is intuitive, it lacks a formal, structured way to define and reuse the *nouns* of the business—the users, products, content, and campaigns—as first-class, independent components of the data model itself. This architectural choice leads to data being defined and repeated within the context of each individual action, rather than as a set of interconnected, reusable concepts. +This model forces the world into a verb-centric framework. The event—the action—is the primary object of interest. All other information, whether it describes the product involved, the user performing the action, or the page on which it occurred, is relegated to being a "property" or a "trait" attached to that action. While this approach is intuitive, it lacks a formal, structured way to define and reuse the *nouns* of the business—the users, products, content, and campaigns—as first-class, independent components of the data model itself. This architectural choice leads to data being defined and repeated within the context of each individual action, rather than as a set of interconnected, reusable concepts. It often requires a consolidation period down the line as downstream users struggle with data quality issues. ### The Snowplow approach: Understanding the event-entity distinction @@ -71,7 +71,7 @@ Snowplow introduces a more nuanced and powerful paradigm that separates the *eve An **[event](https://docs.snowplow.io/docs/fundamentals/events/)** is an immutable record of something that happened. A **[self-describing event](https://docs.snowplow.io/docs/fundamentals/events/#self-describing-events)** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. -An **[entity](https://docs.snowplow.io/docs/fundamentals/entities/)**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator: Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: +An **[entity](https://docs.snowplow.io/docs/fundamentals/entities/)**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator. Consider a retail example. Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: - `view_product` - `add_to_basket` @@ -85,7 +85,7 @@ Furthermore, Snowplow comes with a rich set of [out-of-the-box entities](https:/ ### The language of your business: Building composable data structures with self-describing schemas (data contracts) -The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. 
In the Segment world, developers often start by implementing events, and then data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find and optional.
+The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. In the Segment world, developers often start by implementing events, and the data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find or tied to enterprise-level onboarding.

In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **[Iglu](https://docs.snowplow.io/docs/fundamentals/schemas/#iglu)**. Each schema is a machine-readable contract that specifies:

@@ -105,10 +105,10 @@ For an analyst, this means that to get a complete picture of an `add_to_cart` ev

| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail |
|-----------------|-----------------|---------------------|--------------------------------|
-| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-describing event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties |
-| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User entity and `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at` |
-| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` and `web_page` entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity |
-| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.) |
+| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-Describing Event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties. |
+| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User Entity & `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at`. |
+| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` & `web_page` Entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity. |
+| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` Entity** | A single, reusable `product` entity schema is defined with a `sku` property.
This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.). | ## Architecting your migration: A phased framework @@ -178,7 +178,7 @@ The [Snowplow CLI](https://docs.snowplow.io/docs/data-product-studio/snowplow-cl With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/), our Code Generation tool, to automatically generate type-safe tracking code. -#### Migrate client-side tracking: From analytics.js to the Snowplow Browser Tracker +#### Migrate client-side tracking: From `analytics.js` to the Snowplow Browser Tracker The [Snowplow JavaScript/Browser tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) introduces a more modern and readable API. The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments. @@ -205,7 +205,7 @@ The final step is to rigorously validate the new implementation and manage the c #### Local validation with Snowplow Micro -To empower developers and "shift-left" on data quality, customers should incorporate **[Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/)**. Micro is a complete Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. +To empower developers and "shift-left" on data quality, customers should incorporate **[Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/)**. Micro is a partial Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. 
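To make the local validation loop concrete, here is a minimal sketch of pointing the tag-based JavaScript tracker at a locally running Micro instance. It is illustrative rather than part of the official setup: the Docker invocation, the default port `9090`, the tracker namespace, and the `appId` value are assumptions based on Micro's documented defaults, so check the Snowplow Micro documentation for the options that match your environment.

```javascript
// Start Snowplow Micro locally first (assumed default image and port):
//   docker run -p 9090:9090 snowplow/snowplow-micro:latest

// Point the tag-based JavaScript tracker at the local Micro endpoint
// instead of the production collector.
snowplow('newTracker', 'spLocal', 'http://localhost:9090', {
  appId: 'my-app-local-test', // hypothetical app identifier
  platform: 'web'
});

// Any event tracked now is validated by Micro against the same Iglu
// schemas the production pipeline would use.
snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.acme/add_to_cart/jsonschema/1-0-0', // hypothetical schema
    data: { sku: 'SP-001', quantity: 1 }
  }
});
```

Micro then reports valid and failed events separately, so a developer can confirm schema conformance before the tracking code is merged.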
#### End-to-end data reconciliation strategies From eaa96a7603fafa5f44ed04cb2a2b9d210f32372a Mon Sep 17 00:00:00 2001 From: John Reid Date: Mon, 4 Aug 2025 09:20:07 +0100 Subject: [PATCH 4/9] Added YAML markup at top of page --- .../migration-guides/segment_migration_guide_markdown.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/resources/migration-guides/segment_migration_guide_markdown.md b/docs/resources/migration-guides/segment_migration_guide_markdown.md index 5dba26ca9..bbfbfb117 100644 --- a/docs/resources/migration-guides/segment_migration_guide_markdown.md +++ b/docs/resources/migration-guides/segment_migration_guide_markdown.md @@ -1,3 +1,9 @@ +--- +title: "Segment to Snowplow Migration Guide" +date: "2025-08-04" +sidebar_position: 0 +--- + # A competitive migration guide: From Segment to Snowplow Analytics This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. From 96cd1d5448f98d1a128a217bae444fb0c2c028a9 Mon Sep 17 00:00:00 2001 From: Miranda Wilson Date: Tue, 5 Aug 2025 11:41:45 +0100 Subject: [PATCH 5/9] Add folder structure --- docs/resources/migration-guides/index.md | 6 ++++ .../index.md} | 36 +++++++++---------- 2 files changed, 23 insertions(+), 19 deletions(-) create mode 100644 docs/resources/migration-guides/index.md rename docs/resources/migration-guides/{segment_migration_guide_markdown.md => segment/index.md} (87%) diff --git a/docs/resources/migration-guides/index.md b/docs/resources/migration-guides/index.md new file mode 100644 index 000000000..f8c0e2202 --- /dev/null +++ b/docs/resources/migration-guides/index.md @@ -0,0 +1,6 @@ +--- +title: "Migration guides" +sidebar_position: 1 +--- + +This section contains advice for migrating to Snowplow from other solutions. diff --git a/docs/resources/migration-guides/segment_migration_guide_markdown.md b/docs/resources/migration-guides/segment/index.md similarity index 87% rename from docs/resources/migration-guides/segment_migration_guide_markdown.md rename to docs/resources/migration-guides/segment/index.md index bbfbfb117..4fd81ecf5 100644 --- a/docs/resources/migration-guides/segment_migration_guide_markdown.md +++ b/docs/resources/migration-guides/segment/index.md @@ -1,11 +1,9 @@ --- -title: "Segment to Snowplow Migration Guide" +title: "Segment to Snowplow" date: "2025-08-04" sidebar_position: 0 --- -# A competitive migration guide: From Segment to Snowplow Analytics - This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. ## The strategic imperative: Why data teams migrate from Segment to Snowplow @@ -44,15 +42,15 @@ Segment's entry-level pricing is based on Monthly Tracked Users (MTUs), which ca Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP) plus a license fee depending on event volume which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. 
-| Feature | Segment | Snowplow | -|---------|---------|----------| -| **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | -| **Data Ownership** | Data access in warehouse; vendor controls pipeline | Customer owns data and controls pipeline infrastructure | -| **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | -| **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | -| **Primary Use Case** | Building a Customer Data Platform for routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | -| **Pricing Model** | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | -| **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | +| Feature | Segment | Snowplow | +| ------------------------ | ------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- | +| **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | +| **Data Ownership** | Data access in warehouse; vendor controls pipeline | Customer owns data and controls pipeline infrastructure | +| **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | +| **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | +| **Primary Use Case** | Building a Customer Data Platform for routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | +| **Pricing Model** | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | +| **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | ## Deconstructing the data model: From flat events to rich context @@ -109,12 +107,12 @@ Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifi For an analyst, this means that to get a complete picture of an `add_to_cart` event and the product involved, they query a single, predictable table. The event and all its contextual entities are present in the same row. This structure can simplify data modeling in tools like dbt and accelerate time-to-insight, as the analytical work shifts from joining many disparate event tables to unnesting or accessing data within the structured columns of a single table. It is important to note that this loading behavior is different for Amazon Redshift, where each entity type does get loaded into its own separate table. -| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail | -|-----------------|-----------------|---------------------|--------------------------------| -| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-Describing Event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties. 
| -| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User Entity & `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at`. | -| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` & `web_page` Entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity. | -| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` Entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.). | +| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail | +| ----------------------- | ------------------------------------------------------------- | --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-Describing Event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties. | +| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User Entity & `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at`. | +| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` & `web_page` Entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity. | +| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` Entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.). | ## Architecting your migration: A phased framework @@ -219,4 +217,4 @@ During the parallel-run phase, it is essential to perform end-to-end data reconc #### Final cutover: Decommission Segment senders -Once the data has been thoroughly reconciled and all downstream dependencies (e.g., BI dashboards, ML models, marketing automation workflows) have been successfully migrated to use the new, richer Snowplow data tables, the team can proceed with the final cutover. This involves a coordinated deployment to remove the Segment SDKs and all `analytics.track()` calls from the codebases. Following general data migration best practices, the old Segment sources should be left active for a short period as a final fallback before being fully decommissioned. 
\ No newline at end of file +Once the data has been thoroughly reconciled and all downstream dependencies (e.g., BI dashboards, ML models, marketing automation workflows) have been successfully migrated to use the new, richer Snowplow data tables, the team can proceed with the final cutover. This involves a coordinated deployment to remove the Segment SDKs and all `analytics.track()` calls from the codebases. Following general data migration best practices, the old Segment sources should be left active for a short period as a final fallback before being fully decommissioned. From a1a96f818e7f9eb4e73320d728dfe23105a62b87 Mon Sep 17 00:00:00 2001 From: Miranda Wilson Date: Tue, 5 Aug 2025 16:01:48 +0100 Subject: [PATCH 6/9] Simplify text --- docs/resources/migration-guides/index.md | 16 + .../migration-guides/segment/index.md | 282 +++++++----------- 2 files changed, 131 insertions(+), 167 deletions(-) diff --git a/docs/resources/migration-guides/index.md b/docs/resources/migration-guides/index.md index f8c0e2202..ad28a85fe 100644 --- a/docs/resources/migration-guides/index.md +++ b/docs/resources/migration-guides/index.md @@ -4,3 +4,19 @@ sidebar_position: 1 --- This section contains advice for migrating to Snowplow from other solutions. + +In general, there are two possible migration strategies: parallel-run or full re-architecture. + +## Parallel-run + +A parallel-run approach is the recommended, lowest-risk strategy. It involves running both systems simultaneously (dual-tracking) before switching over to Snowplow entirely. This allows you to test and validate your new Snowplow data in your warehouse, without affecting any existing workflows or production systems. + +## Full re-architecture + +A "rip-and-replace" approach is faster but riskier, involving a direct switch from your existing system to Snowplow. This is best suited for: + +* Major application refactors where the switch can be part of a larger effort +* Teams with high risk tolerance and robust automated testing frameworks +* New projects or applications with minimal legacy systems + +A full re-architecture strategy requires thorough testing in a staging environment to prevent data loss. diff --git a/docs/resources/migration-guides/segment/index.md b/docs/resources/migration-guides/segment/index.md index 4fd81ecf5..75ba47865 100644 --- a/docs/resources/migration-guides/segment/index.md +++ b/docs/resources/migration-guides/segment/index.md @@ -4,217 +4,165 @@ date: "2025-08-04" sidebar_position: 0 --- -This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. +This guide covers migrating from Segment to Snowplow for technical implementers. The migration moves from a managed Customer Data Platform (CDP) to a composable behavioral data platform running in your cloud environment. -## The strategic imperative: Why data teams migrate from Segment to Snowplow +## High-level differences -The move from Segment to Snowplow is usually driven by a desire for greater control, higher data fidelity, and a more predictable financial model. +There are a number of differences between Segment and Snowplow: as a data platform; in how they're priced; and in how the data is defined, conceptualized, and stored. -### Achieve data ownership and control in your cloud +### Platform comparison -The key architectural difference is deployment. 
Segment is a SaaS platform where your data is processed on their servers. Snowplow runs as a set of services in your private cloud (AWS/GCP/Azure), giving you full ownership of your data at all stages. +| Feature | Segment | Snowplow | +| -------------------- | ---------------------------------------------------- | ---------------------------------------------------------------------- | +| Deployment model | SaaS-only: data is processed on Segment servers | Private cloud (BDP Enterprise) and SaaS (BDP Cloud) are both available | +| Data ownership | Data access in warehouse, vendor controls pipeline | You own your data and control your pipeline infrastructure | +| Governance model | Reactive: post-hoc validation with Protocols | Proactive: schema validation for every event | +| Data structure | Flat events with properties, user traits and context | Events enriched by multiple, reusable entities | +| Pricing model | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | +| Real-time capability | Limited low-latency support | Real-time streaming pipeline supports sub-second use cases | -This provides several advantages: +### Data warehouse structure -- **Enhanced security and compliance**: Keeping data within your own cloud simplifies security reviews and compliance audits (e.g., GDPR, CCPA, HIPAA), as no third-party vendor processes raw user data -- **Complete data control**: You can configure, scale, and monitor every component of the pipeline according to your specific needs -- **Elimination of vendor lock-in**: Because you own the infrastructure and the data format is open, you are not locked into a proprietary ecosystem +Segment loads each custom event type into separate tables, for example, `order_completed`, or `product_viewed` tables. -### A new approach to governance: Foundational data quality +Snowplow uses a single [`atomic.events`](https://docs.snowplow.io/docs/fundamentals/canonical-event/) table in warehouses like Snowflake and BigQuery. Events and entities are stored as structured columns within that table. -Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. +## What do events look like? -Snowplow enforces data quality proactively with mandatory, machine-readable **[schemas](https://docs.snowplow.io/docs/fundamentals/schemas/)** for every [event](https://docs.snowplow.io/docs/fundamentals/events/) and [entity](https://docs.snowplow.io/docs/fundamentals/entities/). [Events that fail validation](https://docs.snowplow.io/docs/fundamentals/failed-events/) are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. +Segment uses flat JSON `properties` objects in `track` events. Snowplow uses a nested [event-entity model](/docs/fundamentals/events/index.md) where events can be enriched with multiple contextual entities. -### Unlock advanced analytics with greater granularity +### Segment -Segment started out as a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular first-party behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. 
Snowplow's [event-entity model](https://docs.snowplow.io/docs/fundamentals/events/) allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. +Segment's core method for tracking user behavior is `track`. A `track` call contains a name that describes the action taken, and a `properties` object that contains contextual information about the action. -This rich, structured data is ideal for: +The other Segment tracking methods are: +* `page` and `screen`: records page views and screen views +* `identify`: describes the user, and associates a `userId` with user `traits` +* `group`: associates the user with a group +* `alias`: merges user identities, for identity resolution across applications -- **Complex data modeling**: Snowplow provides source-available dbt packages to transform raw data into analysis-ready tables -- **AI and machine learning**: High-fidelity data is ideal for training ML models like recommendation engines or churn predictors -- **Deep user behavior analysis**: Rich entities enable multi-faceted exploration of user journeys without complex data wrangling +Data about the user's action is tracked separately from data about the user. You'll stitch them together during data modeling in the warehouse. -### A predictable, infrastructure-based cost model +Here's an example showing how you could track a user registration event in Segment: -Segment's entry-level pricing is based on Monthly Tracked Users (MTUs), which can become expensive and unpredictable as you scale. This model can penalize growth. +```javascript +analytics.track("User Registered", { + plan: "Pro Annual", + accountType: "Facebook" +}); +``` -Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP) plus a license fee depending on event volume which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. +The tracked events can be optionally validated against Protocols. These are defined by you, and will detect violations against your tracking plan. You can choose to filter out events that don't pass validation. 
-| Feature | Segment | Snowplow | -| ------------------------ | ------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- | -| **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | -| **Data Ownership** | Data access in warehouse; vendor controls pipeline | Customer owns data and controls pipeline infrastructure | -| **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | -| **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | -| **Primary Use Case** | Building a Customer Data Platform for routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | -| **Pricing Model** | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | -| **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | +### Snowplow -## Deconstructing the data model: From flat events to rich context +Snowplow events generally contain more data than Segment events. Each [event](/docs/fundamentals/events/index.md) contains data about the action, as well as about the user and any other relevant contextual information. -To appreciate the strategic value of migrating to Snowplow, it is essential to understand the fundamental differences in how each platform approaches the modeling of behavioral data. This is not just a technical distinction; it is a difference in approach that has consequences for data quality, flexibility, and analytical power. Segment operates on a simple, action-centric model, while Snowplow introduces a more sophisticated, context-centric paradigm that more accurately reflects the complexity of the real world. +Snowplow SDKs also provide methods for tracking page views and screen views, along with many other kinds of events, such as button clicks, form submissions, page pings (activity), add to cart, and so on. The equivalent to Segment's `track` is tracking custom events with `track_self_describing_event`. -### The Segment model: A review of `track`, `identify`, and the property-centric approach +All Snowplow events, whether designed by you or built-in, are defined by [JSON schemas](/docs/fundamentals/schemas/index.md). The events are always validated as they're processed through the Snowplow pipeline, and events that fail validation are separated out for assessment. -Segment's data specification is designed for simplicity and ease of use. It is built around a handful of core API methods that capture the essential elements of user interaction. The most foundational of these is the `track` call, which is designed to answer the question, "What is the user doing?". Each `track` call records a single user action, known as an event, which has a human-readable name (e.g., `User Registered`) and an associated `properties` object. This object is a simple JSON containing key-value pairs that describe the action (e.g., `plan: 'pro'`, accountType: 'trial'`). 
+To track a user registration event in Snowplow, you could define a custom event to track like this: -The other key methods in the Segment spec support this action-centric view: +```javascript +snowplow('trackSelfDescribingEvent', { + event: { + schema: 'iglu:com.acme_company/user_registration/jsonschema/1-0-0', + data: { + plan: "Pro Annual", + accountType: "Facebook", + } + } +}); +``` -- **`identify`**: Answers the question, "Who is the user?" It associates a `userId` with a set of `traits` (e.g., `email`, `name`), which describe the user themselves -- **`page` and `screen`**: Record when a user views a webpage or a mobile app screen, respectively -- **`group`**: Associates an individual user with a group, such as a company or organization -- **`alias`**: Used to merge the identities of a user across different systems or states (e.g., anonymous to logged-in) +This call doesn't appear to contain more information than the Segment `track` call, because only the event definition is shown. Here's the same example with additional data tracked about the user, as a `user` entity: -This model forces the world into a verb-centric framework. The event—the action—is the primary object of interest. All other information, whether it describes the product involved, the user performing the action, or the page on which it occurred, is relegated to being a "property" or a "trait" attached to that action. While this approach is intuitive, it lacks a formal, structured way to define and reuse the *nouns* of the business—the users, products, content, and campaigns—as first-class, independent components of the data model itself. This architectural choice leads to data being defined and repeated within the context of each individual action, rather than as a set of interconnected, reusable concepts. It often requires a consolidation period down the line as downstream users struggle with data quality issues. +```javascript +snowplow('trackSelfDescribingEvent', { + event: { + schema: 'iglu:com.acme_company/user_registration/jsonschema/1-0-0', + data: { + plan: "Pro Annual", + accountType: "Facebook", + } + }, + context: [{ + schema: "iglu:com.acme_company/user/jsonschema/1-0-0", + data: { + userId: "12345", + } + }] +}); +``` -### The Snowplow approach: Understanding the event-entity distinction +Instead of tracking `identify` separately, this user entity can be reused and added to all Snowplow events where it could be useful, for example `log_out`, `change_profile_image`, `view_product`, etc. -Snowplow introduces a more nuanced and powerful paradigm that separates the *event* (the action that occurred at a point in time) from the *entities* (the nouns that were involved in that action). In Snowplow, every tracked event can be decorated with an array of contextual entities. This is the core of the event-entity model. +The Snowplow tracking SDKs in fact add multiple entities to all tracked events by default, including information about the specific page or screen, the user's session, and the device or browser. Many other built-in entities can be configured. As shown here, you can define custom entities to add to any Snowplow event. -An **[event](https://docs.snowplow.io/docs/fundamentals/events/)** is an immutable record of something that happened. A **[self-describing event](https://docs.snowplow.io/docs/fundamentals/events/#self-describing-events)** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. 
+### Concepts comparison -An **[entity](https://docs.snowplow.io/docs/fundamentals/entities/)**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator. Consider a retail example. Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: +This table compares some core Segment concepts with their Snowplow equivalents. -- `view_product` -- `add_to_basket` -- `remove_from_basket` -- `purchase_product` -- `review_product` +| Segment concept | Segment example | Snowplow equivalent | +| ------------------- | ------------------------------------------------------------- | ---------------------------------------------------------- | +| Core action | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | Self-describing event with custom `order_completed` schema | +| User identification | `identify('user123', {plan: 'pro', created_at: '...'})` | User entity and `setUserId` call | +| Page context | `page('Pricing', {category: 'Products'})` | `trackPageView` with `web_page` entity | +| Reusable properties | `properties.product_sku` in multiple `track` calls | Dedicated `product` entity attached to relevant events | -This approach reflects the real world more accurately. An "event" is a momentary action, while "entities" like users, products, and marketing campaigns are persistent objects that participate in many events over time. This separation provides immense power. It allows you to analyze the `product` entity across its entire lifecycle, from initial discovery to final purchase, by querying a single, consistent data structure. You are no longer forced to hunt for and coalesce disparate property fields (`viewed_product_sku`, `purchased_product_sku`, etc.) across different event tables. +## Migration phases -Furthermore, Snowplow comes with a rich set of [out-of-the-box entities](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/#out-of-the-box-entity-tracking) that can be enabled to automatically enrich every event with crucial context, such as the `webPage` entity, the `performanceTiming` entity for site speed, and the `user` entity. This moves the data model from being action-centric to being context-centric, providing a much richer and more interconnected view of behavior from the moment of collection. +We recommend using a parallel-run migration approach. This process can be divided into three phases: +1. Assess and plan +2. Implement and validate +3. Cutover and optimize -### The language of your business: Building composable data structures with self-describing schemas (data contracts) +### Assess and plan -The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. In the Segment world, developers often start by implementing events, and then the data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find or tied to enterprise-level onboarding. 
+- Audit existing Segment tracking calls +- Export Segment tracking plan via API, or infer from warehouse data +- Translate the Segment tracking plan into a Snowplow tracking plan based on event schemas and entities +- Deploy Snowplow infrastructure -In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **[Iglu](https://docs.snowplow.io/docs/fundamentals/schemas/#iglu)**. Each schema is a machine-readable contract that specifies: +Export your Segment tracking plan using one of these methods: +* Programmatic API export (recommended) using Segment Public API for full JSON structure +* Manual CSV download from Segment UI -- **Identity**: A unique URI comprising a vendor (`com.acme`), a name (`product`), a format (`jsonschema`), and a version (`1-0-0`) -- **Structure**: The exact properties the event or entity should contain (e.g., `sku`, `name`, `price`) -- **Validation Rules**: The data type for each property (`string`, `number`, `boolean`), as well as constraints like minimum/maximum length, regular expression patterns, or enumerated values +Use the [Snowplow CLI](/docs/data-product-studio/snowplow-cli/index.md) with its Model Context Protocol (MCP) server to translate your Segment plan. See the [Snowplow CLI MCP tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/) for more information about using the Snowplow CLI with MCP server. -The data payload itself contains a reference to the specific schema and version that defines it, which is why it's called a "self-describing JSON". This creates a powerful, unambiguous, and shared language for data across the entire organization. When a product manager designs a new feature, they collaborate with engineers and analysts to define the schemas for the new events and entities involved. This contract is then stored in Iglu. The engineers implement tracking based on this contract, and the analysts know exactly what data to expect in the warehouse because they can reference the same contract. This is a cultural shift that treats data as a deliberately designed product, not as a byproduct of application code. +### Implement and validate -### Analytical implications: How the event-entity model unlocks deeper, contextual insights +- Add [Snowplow tracking](/docs/sources/index.md) to run in parallel with Segment tracking +- Test with [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) for local validation +- Compare between Segment and Snowplow data in your warehouse +- Decide what to do about historical data -The architectural advantage of the event-entity model becomes apparent in the data warehouse. In a Segment implementation, each custom event type is loaded into its own table (e.g., `order_completed`, `product_viewed`). While this provides structure, it can lead to a large number of tables in the warehouse, a challenge sometimes referred to as "schema sprawl." A significant amount of analytical work involves discovering the correct tables and then `UNION`-ing them together to reconstruct a user's complete journey. 
+Assuming you're tracking on web, the [Snowplow JavaScript tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) uses object-based calls instead of ordered parameters: +- Segment: `analytics.track('Event', {prop: 'value'})` +- Snowplow: `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})` -Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifies this downstream work by using a "one big table" approach. All data is loaded into a single, wide [`atomic.events` table](https://docs.snowplow.io/docs/fundamentals/canonical-event/). Self-describing events and their associated entities are not loaded into separate tables. Instead, they are stored as dedicated, structured columns within that one table—for example, as an `OBJECT` in Snowflake or a `REPEATED RECORD` in BigQuery. This model avoids the schema sprawl of the Segment approach. +All Snowplow trackers follow the same pattern: -For an analyst, this means that to get a complete picture of an `add_to_cart` event and the product involved, they query a single, predictable table. The event and all its contextual entities are present in the same row. This structure can simplify data modeling in tools like dbt and accelerate time-to-insight, as the analytical work shifts from joining many disparate event tables to unnesting or accessing data within the structured columns of a single table. It is important to note that this loading behavior is different for Amazon Redshift, where each entity type does get loaded into its own separate table. +1. Initialize tracker with a Collector endpoint +3. Use `track` methods to send events -| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail | -| ----------------------- | ------------------------------------------------------------- | --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-Describing Event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties. | -| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User Entity & `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at`. | -| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` & `web_page` Entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity. | -| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` Entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.). | +Use [Snowtype](/docs/data-product-studio/snowtype/index.md) to generate type-safe tracking code based on your tracking plan schemas. 
-## Architecting your migration: A phased framework +Use [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) to run a local Snowplow pipeline in Docker. Point your application to the local instance to validate events in real-time before deployment. -A successful migration requires a well-defined strategy that manages risk and ensures data continuity. This section outlines a high-level project plan, including different strategic scenarios and a plan for handling historical data. +For historical data, you have a choice of approaches: +- Coexistence: leave historical Segment data in existing tables. Use transformation layer to `UNION` Segment and Snowplow data for longitudinal analysis. +- Unification: transform and backfill historical Segment data into Snowplow format. Requires custom engineering project but provides unified historical dataset. -### The three-phase migration roadmap +During parallel tracking, compare data in your warehouse using SQL queries. Focus on: -A migration from Segment to Snowplow can be broken down into three phases: +- Daily event counts +- Unique user counts +- Critical property values -- **Phase 1: Assess and plan** - - Audit all existing Segment `track`, `identify`, `page`, and `group` calls - - Export the complete Segment Tracking Plan via API (if you still have an active account) or infer it from data in a data warehouse - - Translate the Segment plan into a Snowplow tracking plan, defining event schemas and identifying reusable entities - using the Snowplow CLI MCP Server - - Deploy the Snowplow pipeline components ([Collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/), [Enrich](https://docs.snowplow.io/docs/pipeline-components-and-applications/enrichment-components/), [Loaders](https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/)) and the [Iglu Schema Registry](https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/) in your cloud -- **Phase 2: Implement and validate** - - Add [Snowplow trackers](https://docs.snowplow.io/docs/collecting-data/) to your applications to run in parallel with existing Segment trackers (dual-tracking) - - Use tools like [Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/) for local testing and validation before deployment - - Perform end-to-end data reconciliation in your data warehouse by comparing Segment and Snowplow data to ensure accuracy -- **Phase 3: Cutover and optimize** - - Update all downstream data consumers (BI dashboards, [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/)) to query the new Snowplow data tables - - Remove the Segment trackers and SDKs from application codebases - - Decommission the Segment sources and, eventually, the subscription +### Cutover and finalize -### Migration scenario 1: The parallel-run approach - -The parallel-run approach is the recommended, lowest-risk strategy. It involves running both systems simultaneously (dual-tracking) to validate data integrity before cutting over. Existing Segment-powered workflows remain operational while you test and reconcile the new Snowplow data in the warehouse. This approach builds confidence and allows you to resolve discrepancies without impacting production systems. - -### Migration scenario 2: The full re-architecture - -A "rip-and-replace" approach is faster but riskier, involving a direct switch from Segment to Snowplow SDKs. 
This is best suited for: - -- New projects or applications with no legacy system -- Major application refactors where the switch can be part of a larger effort -- Teams with high risk tolerance and robust automated testing frameworks - -This strategy requires thorough pre-launch testing in a staging environment to prevent data loss. - -### A strategy for historical data - -You have two main options for handling historical data from Segment: - -- **Option A: Coexistence (Pragmatic)** Leave historical Segment data in its existing tables. For longitudinal analysis, write queries that `UNION` data from both Segment and Snowplow tables, using a transformation layer (e.g., in dbt) to create a compatible structure. This avoids a large backfill project -- **Option B: Unification (Backfill)** For a single, unified dataset, undertake a custom engineering project to transform and backfill historical data. This involves exporting Segment data, writing a script to reshape it into the Snowplow enriched event format, and loading it into the warehouse. This is a significant effort but provides a consistent historical dataset - -## The technical playbook: Executing your migration - -This section provides a detailed, hands-on playbook for the technical execution of the migration. A central theme of this playbook is the use of the Snowplow CLI and its integrated AI capabilities to accelerate the most challenging part of the migration: designing a new, high-quality tracking plan. - -### Step 1: Deconstruct your legacy: Export and analyze the Segment tracking plan - -Before building the new data foundation, you must create a complete blueprint of the existing structure. The first practical step is to export your Segment Tracking Plan into a machine-readable format that can serve as the raw material for your redesign. - -There are two primary methods for this export: - -1. **Manual CSV download**: The Segment UI provides an option to download your Tracking Plan as a CSV file. This is a quick way to get a human-readable inventory of your events and properties. However, it can be less ideal for programmatic analysis and may not capture the full structural detail of your plan -2. **Programmatic API export (recommended)**: The superior method is to use the Segment Public API. The API allows you to programmatically list all Tracking Plans in your workspace and retrieve the full definition of each plan, including its rules, in a structured JSON format. This JSON output is invaluable because it often includes the underlying JSON Schema that Segment uses to validate the `properties` of each event - -The result of this step is a definitive, version-controlled artifact (e.g., a `segment_plan.json` file) that represents the ground truth of your current tracking implementation. This file will be the primary input for the next step of the process. - -### Step 2: AI-assisted design: Build your Snowplow tracking plan with the CLI and MCP server - -Next, you'll need to translate that tracking plan into a Snowplow-appropriate format (Data Products and Data Structures). - -The [Snowplow CLI](https://docs.snowplow.io/docs/data-product-studio/snowplow-cli/) is a command-line utility that includes a Model Context Protocol (MCP) server, so you can use an AI agent to generate idiomatic Snowplow tracking. For more information on how to do this, read the [tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/). 
- -### Step 3: Re-instrument your codebase: A conceptual guide - -With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/), our Code Generation tool, to automatically generate type-safe tracking code. - -#### Migrate client-side tracking: From `analytics.js` to the Snowplow Browser Tracker - -The [Snowplow JavaScript/Browser tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) introduces a more modern and readable API. The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments. - -- A Segment call like `analytics.track('Event', {prop: 'value'})` becomes a Snowplow call like `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})` -- A Segment `identify` call is replaced by a combination of a `setUserId` call to set the primary user identifier and the attachment of a custom `user` entity to provide the user traits - -This object-based approach improves code readability, as the purpose of each value is explicit, and makes the tracking calls more extensible for the future. - -#### Migrate server-side and mobile tracking: An overview of Snowplow's polyglot trackers - -Snowplow provides a comprehensive suite of [trackers](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/) for virtually every common back-end language and mobile platform, including [Java](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/java-tracker/), [Python](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/python-tracker/), [.NET](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/net-tracker/), [Go](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/go-tracker/), [Ruby](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/ruby-tracker/), [iOS](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/ios-tracker/) (Swift/Objective-C), and [Android](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/android-tracker/) (Kotlin/Java). - -While the syntax is idiomatic to each language, the underlying paradigm remains the same across all trackers. The developer will: - -1. Initialize the tracker with the endpoint of their [Snowplow collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/) -2. Use builder patterns or helper classes to construct self-describing events and entity objects, referencing the schema URIs from the Iglu registry. For example, the Java tracker uses a `SelfDescribing.builder()` to construct the event payload -3. Use a `track` method to send the fully constructed event to the collector - -The consistency of the event-entity model across all trackers ensures that data from every platform will arrive in the warehouse in a unified, coherent structure. 
- -### Step 4: Ensure a smooth transition: Validation, testing, and cutover - -The final step is to rigorously validate the new implementation and manage the cutover. A smooth transition is non-negotiable. - -#### Local validation with Snowplow Micro - -To empower developers and "shift-left" on data quality, customers should incorporate **[Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/)**. Micro is a partial Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. - -#### End-to-end data reconciliation strategies - -During the parallel-run phase, it is essential to perform end-to-end data reconciliation in the data warehouse. This involves writing a suite of SQL queries to compare the data collected by the two systems. Analysts should compare high-level metrics like daily event counts and unique user counts, as well as the values of specific, critical properties. The goal is not to achieve 100% identical data—the data models are different, which is the point of the migration. The goal is to be able to confidently explain any variances and to prove that the new Snowplow pipeline is capturing all critical business logic correctly. - -#### Final cutover: Decommission Segment senders - -Once the data has been thoroughly reconciled and all downstream dependencies (e.g., BI dashboards, ML models, marketing automation workflows) have been successfully migrated to use the new, richer Snowplow data tables, the team can proceed with the final cutover. This involves a coordinated deployment to remove the Segment SDKs and all `analytics.track()` calls from the codebases. Following general data migration best practices, the old Segment sources should be left active for a short period as a final fallback before being fully decommissioned. +- Update downstream consumers (BI dashboards, [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/)) to use Snowplow data +- Remove Segment trackers from application code +- Decommission Segment sources From 9da0f27d87beba764847db6c01c98a29fb44b945 Mon Sep 17 00:00:00 2001 From: Miranda Wilson Date: Wed, 6 Aug 2025 10:04:04 +0100 Subject: [PATCH 7/9] Add original text back with additions --- .../migration-guides/segment/index.md | 296 +++++++++++------- 1 file changed, 181 insertions(+), 115 deletions(-) diff --git a/docs/resources/migration-guides/segment/index.md b/docs/resources/migration-guides/segment/index.md index 75ba47865..f761a9094 100644 --- a/docs/resources/migration-guides/segment/index.md +++ b/docs/resources/migration-guides/segment/index.md @@ -4,165 +4,231 @@ date: "2025-08-04" sidebar_position: 0 --- -This guide covers migrating from Segment to Snowplow for technical implementers. The migration moves from a managed Customer Data Platform (CDP) to a composable behavioral data platform running in your cloud environment. 
-## High-level differences +This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. -There are a number of differences between Segment and Snowplow: as a data platform; in how they're priced; and in how the data is defined, conceptualized, and stored. +## The strategic imperative: Why data teams migrate from Segment to Snowplow -### Platform comparison +The move from Segment to Snowplow is usually driven by a desire for greater control, higher data fidelity, and a more predictable financial model. -| Feature | Segment | Snowplow | -| -------------------- | ---------------------------------------------------- | ---------------------------------------------------------------------- | -| Deployment model | SaaS-only: data is processed on Segment servers | Private cloud (BDP Enterprise) and SaaS (BDP Cloud) are both available | -| Data ownership | Data access in warehouse, vendor controls pipeline | You own your data and control your pipeline infrastructure | -| Governance model | Reactive: post-hoc validation with Protocols | Proactive: schema validation for every event | -| Data structure | Flat events with properties, user traits and context | Events enriched by multiple, reusable entities | -| Pricing model | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | -| Real-time capability | Limited low-latency support | Real-time streaming pipeline supports sub-second use cases | +### Achieve data ownership and control in your cloud -### Data warehouse structure +The key architectural difference is deployment. Segment is a SaaS platform where your data is processed on their servers. Snowplow runs as a set of services in your private cloud (AWS/GCP/Azure), giving you full ownership of your data at all stages. -Segment loads each custom event type into separate tables, for example, `order_completed`, or `product_viewed` tables. +This provides several advantages: -Snowplow uses a single [`atomic.events`](https://docs.snowplow.io/docs/fundamentals/canonical-event/) table in warehouses like Snowflake and BigQuery. Events and entities are stored as structured columns within that table. +- **Enhanced security and compliance**: Keeping data within your own cloud simplifies security reviews and compliance audits (e.g., GDPR, CCPA, HIPAA), as no third-party vendor processes raw user data +- **Complete data control**: You can configure, scale, and monitor every component of the pipeline according to your specific needs +- **Elimination of vendor lock-in**: Because you own the infrastructure and the data format is open, you are not locked into a proprietary ecosystem -## What do events look like? +### A new approach to governance: Foundational data quality -Segment uses flat JSON `properties` objects in `track` events. Snowplow uses a nested [event-entity model](/docs/fundamentals/events/index.md) where events can be enriched with multiple contextual entities. +Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. 
-### Segment +Snowplow enforces data quality proactively with mandatory, machine-readable **[schemas](https://docs.snowplow.io/docs/fundamentals/schemas/)** for every [event](https://docs.snowplow.io/docs/fundamentals/events/) and [entity](https://docs.snowplow.io/docs/fundamentals/entities/). [Events that fail validation](https://docs.snowplow.io/docs/fundamentals/failed-events/) are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. -Segment's core method for tracking user behavior is `track`. A `track` call contains a name that describes the action taken, and a `properties` object that contains contextual information about the action. +### Unlock advanced analytics with greater granularity -The other Segment tracking methods are: -* `page` and `screen`: records page views and screen views -* `identify`: describes the user, and associates a `userId` with user `traits` -* `group`: associates the user with a group -* `alias`: merges user identities, for identity resolution across applications +Segment started out as a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular first-party behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's [event-entity model](https://docs.snowplow.io/docs/fundamentals/events/) allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. -Data about the user's action is tracked separately from data about the user. You'll stitch them together during data modeling in the warehouse. +This rich, structured data is ideal for: -Here's an example showing how you could track a user registration event in Segment: +- **Complex data modeling**: Snowplow provides source-available dbt packages to transform raw data into analysis-ready tables +- **AI and machine learning**: High-fidelity data is ideal for training ML models like recommendation engines or churn predictors +- **Deep user behavior analysis**: Rich entities enable multi-faceted exploration of user journeys without complex data wrangling -```javascript -analytics.track("User Registered", { - plan: "Pro Annual", - accountType: "Facebook" -}); -``` +### A predictable, infrastructure-based cost model -The tracked events can be optionally validated against Protocols. These are defined by you, and will detect violations against your tracking plan. You can choose to filter out events that don't pass validation. +Segment's entry-level pricing is based on Monthly Tracked Users (MTUs), which can become expensive and unpredictable as you scale. This model can penalize growth. -### Snowplow +Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP) plus a license fee depending on event volume which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. -Snowplow events generally contain more data than Segment events. Each [event](/docs/fundamentals/events/index.md) contains data about the action, as well as about the user and any other relevant contextual information. 
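To illustrate what one of these schemas contains, here is a minimal sketch for a hypothetical `user_registration` event, shown as a JavaScript object for readability. In practice each schema is a standalone JSON file stored in your Iglu registry, and the vendor and property names here are illustrative only.

```javascript
const userRegistrationSchema = {
  $schema: 'http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#',
  description: 'A user completed the registration flow',
  self: {
    vendor: 'com.acme',
    name: 'user_registration',
    format: 'jsonschema',
    version: '1-0-0'
  },
  type: 'object',
  properties: {
    plan: { type: 'string', maxLength: 64 },
    accountType: { type: 'string', enum: ['trial', 'paid'] }
  },
  required: ['plan'],
  additionalProperties: false
};
```

Events that reference this schema but break its rules, such as a missing `plan` or an unexpected property, are routed to failed events rather than silently landing in the warehouse.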
+### Flexible integration with downstream tools -Snowplow SDKs also provide methods for tracking page views and screen views, along with many other kinds of events, such as button clicks, form submissions, page pings (activity), add to cart, and so on. The equivalent to Segment's `track` is tracking custom events with `track_self_describing_event`. +While Segment excels at routing data to third-party marketing and analytics tools, Snowplow provides flexible options for connecting your behavioral data to downstream systems. [Event forwarding](https://docs.snowplow.io/docs/destinations/forwarding-events/) enables real-time streaming of enriched events to various destinations, supporting both analytical and operational use cases. For reverse ETL workflows that send processed data back to operational systems, Snowplow has partnered with Census to provide best-in-class functionality for activating your warehouse data in marketing and sales tools. -All Snowplow events, whether designed by you or built-in, are defined by [JSON schemas](/docs/fundamentals/schemas/index.md). The events are always validated as they're processed through the Snowplow pipeline, and events that fail validation are separated out for assessment. +| Feature | Segment | Snowplow | +| --------------------------- | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- | +| **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | +| **Data Ownership** | Data access in warehouse; vendor controls pipeline | Customer owns data and controls pipeline infrastructure | +| **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | +| **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | +| **Primary Use Case** | Building a Customer Data Platform for routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | +| **Pricing Model** | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | +| **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | +| **Downstream Integrations** | Native connections to 300+ marketing and analytics tools | Event forwarding to custom destinations plus reverse ETL via Census partnership | -To track a user registration event in Snowplow, you could define a custom event to track like this: +## Deconstructing the data model: From flat events to rich context -```javascript -snowplow('trackSelfDescribingEvent', { - event: { - schema: 'iglu:com.acme_company/user_registration/jsonschema/1-0-0', - data: { - plan: "Pro Annual", - accountType: "Facebook", - } - } -}); -``` +To appreciate the strategic value of migrating to Snowplow, it is essential to understand the fundamental differences in how each platform approaches the modeling of behavioral data. This is not just a technical distinction; it is a difference in approach that has consequences for data quality, flexibility, and analytical power. Segment operates on a simple, action-centric model, while Snowplow introduces a more sophisticated, context-centric paradigm that more accurately reflects the complexity of the real world. 
-This call doesn't appear to contain more information than the Segment `track` call, because only the event definition is shown. Here's the same example with additional data tracked about the user, as a `user` entity: +### The Segment model: A review of `track`, `identify`, and the property-centric approach -```javascript -snowplow('trackSelfDescribingEvent', { - event: { - schema: 'iglu:com.acme_company/user_registration/jsonschema/1-0-0', - data: { - plan: "Pro Annual", - accountType: "Facebook", - } - }, - context: [{ - schema: "iglu:com.acme_company/user/jsonschema/1-0-0", - data: { - userId: "12345", - } - }] -}); -``` +Segment's data specification is designed for simplicity and ease of use. It is built around a handful of core API methods that capture the essential elements of user interaction. The most foundational of these is the `track` call, which is designed to answer the question, "What is the user doing?". Each `track` call records a single user action, known as an event, which has a human-readable name (e.g., `User Registered`) and an associated `properties` object. This object is a simple JSON containing key-value pairs that describe the action (e.g., `plan: 'pro'`, accountType: 'trial'`). -Instead of tracking `identify` separately, this user entity can be reused and added to all Snowplow events where it could be useful, for example `log_out`, `change_profile_image`, `view_product`, etc. +The other key methods in the Segment spec support this action-centric view: -The Snowplow tracking SDKs in fact add multiple entities to all tracked events by default, including information about the specific page or screen, the user's session, and the device or browser. Many other built-in entities can be configured. As shown here, you can define custom entities to add to any Snowplow event. +- **`identify`**: Answers the question, "Who is the user?" It associates a `userId` with a set of `traits` (e.g., `email`, `name`), which describe the user themselves +- **`page` and `screen`**: Record when a user views a webpage or a mobile app screen, respectively +- **`group`**: Associates an individual user with a group, such as a company or organization +- **`alias`**: Used to merge the identities of a user across different systems or states (e.g., anonymous to logged-in) -### Concepts comparison +This model forces the world into a verb-centric framework. The event—the action—is the primary object of interest. All other information, whether it describes the product involved, the user performing the action, or the page on which it occurred, is relegated to being a "property" or a "trait" attached to that action. While this approach is intuitive, it lacks a formal, structured way to define and reuse the *nouns* of the business—the users, products, content, and campaigns—as first-class, independent components of the data model itself. This architectural choice leads to data being defined and repeated within the context of each individual action, rather than as a set of interconnected, reusable concepts. It often requires a consolidation period down the line as downstream users struggle with data quality issues. -This table compares some core Segment concepts with their Snowplow equivalents. 
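For reference, the action-centric pattern described above typically looks like this in `analytics.js`, with the user described once via `identify` and the surrounding context repeated as flat properties on each `track` call (event and property names are illustrative):

```javascript
// The user is described separately from the actions they perform
analytics.identify('user-123', {
  email: 'jane@example.com',
  plan: 'pro'
});

// Context about the product is repeated on every action that involves it
analytics.track('Product Viewed', {
  product_sku: 'ABC123',
  product_name: 'Widget'
});

analytics.track('Product Added', {
  product_sku: 'ABC123',
  product_name: 'Widget',
  cart_id: 'cart-789'
});
```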
+### The Snowplow approach: Understanding the event-entity distinction -| Segment concept | Segment example | Snowplow equivalent | -| ------------------- | ------------------------------------------------------------- | ---------------------------------------------------------- | -| Core action | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | Self-describing event with custom `order_completed` schema | -| User identification | `identify('user123', {plan: 'pro', created_at: '...'})` | User entity and `setUserId` call | -| Page context | `page('Pricing', {category: 'Products'})` | `trackPageView` with `web_page` entity | -| Reusable properties | `properties.product_sku` in multiple `track` calls | Dedicated `product` entity attached to relevant events | +Snowplow introduces a more nuanced and powerful paradigm that separates the *event* (the action that occurred at a point in time) from the *entities* (the nouns that were involved in that action). In Snowplow, every tracked event can be decorated with an array of contextual entities. This is the core of the event-entity model. -## Migration phases +An **[event](https://docs.snowplow.io/docs/fundamentals/events/)** is an immutable record of something that happened. A **[self-describing event](https://docs.snowplow.io/docs/fundamentals/events/#self-describing-events)** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. -We recommend using a parallel-run migration approach. This process can be divided into three phases: -1. Assess and plan -2. Implement and validate -3. Cutover and optimize +An **[entity](https://docs.snowplow.io/docs/fundamentals/entities/)**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator. Consider a retail example. Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: -### Assess and plan +- `view_product` +- `add_to_basket` +- `remove_from_basket` +- `purchase_product` +- `review_product` -- Audit existing Segment tracking calls -- Export Segment tracking plan via API, or infer from warehouse data -- Translate the Segment tracking plan into a Snowplow tracking plan based on event schemas and entities -- Deploy Snowplow infrastructure +This approach reflects the real world more accurately. An "event" is a momentary action, while "entities" like users, products, and marketing campaigns are persistent objects that participate in many events over time. This separation provides immense power. It allows you to analyze the `product` entity across its entire lifecycle, from initial discovery to final purchase, by querying a single, consistent data structure. You are no longer forced to hunt for and coalesce disparate property fields (`viewed_product_sku`, `purchased_product_sku`, etc.) across different event tables. 
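As a sketch of that reuse in practice (the schema URIs and properties are hypothetical `com.acme` examples), the same entity payload can be attached to any event in the journey:

```javascript
// One reusable entity describing the product involved in an interaction
const productEntity = {
  schema: 'iglu:com.acme/product/jsonschema/1-0-0',
  data: { sku: 'ABC123', name: 'Widget', price: 99.99 }
};

// The identical entity is attached to different events over the lifecycle
snowplow('trackSelfDescribingEvent', {
  event: { schema: 'iglu:com.acme/view_product/jsonschema/1-0-0', data: {} },
  context: [productEntity]
});

snowplow('trackSelfDescribingEvent', {
  event: {
    schema: 'iglu:com.acme/add_to_basket/jsonschema/1-0-0',
    data: { quantity: 1 }
  },
  context: [productEntity]
});
```

Because both events carry the same `product` structure, downstream queries can follow the product across the journey without coalescing differently named property fields.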
-Export your Segment tracking plan using one of these methods: -* Programmatic API export (recommended) using Segment Public API for full JSON structure -* Manual CSV download from Segment UI +Furthermore, Snowplow comes with a rich set of [out-of-the-box entities](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/#out-of-the-box-entity-tracking) that can be enabled to automatically enrich every event with crucial context, such as the `webPage` entity, the `performanceTiming` entity for site speed, and the `user` entity. This moves the data model from being action-centric to being context-centric, providing a much richer and more interconnected view of behavior from the moment of collection. -Use the [Snowplow CLI](/docs/data-product-studio/snowplow-cli/index.md) with its Model Context Protocol (MCP) server to translate your Segment plan. See the [Snowplow CLI MCP tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/) for more information about using the Snowplow CLI with MCP server. +### The language of your business: Building composable data structures with self-describing schemas (data contracts) -### Implement and validate +The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. In the Segment world, developers often start by implementing events, and then the data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find or tied to enterprise-level onboarding. -- Add [Snowplow tracking](/docs/sources/index.md) to run in parallel with Segment tracking -- Test with [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) for local validation -- Compare between Segment and Snowplow data in your warehouse -- Decide what to do about historical data +In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **[Iglu](https://docs.snowplow.io/docs/fundamentals/schemas/#iglu)**. Each schema is a machine-readable contract that specifies: -Assuming you're tracking on web, the [Snowplow JavaScript tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) uses object-based calls instead of ordered parameters: -- Segment: `analytics.track('Event', {prop: 'value'})` -- Snowplow: `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})` +- **Identity**: A unique URI comprising a vendor (`com.acme`), a name (`product`), a format (`jsonschema`), and a version (`1-0-0`) +- **Structure**: The exact properties the event or entity should contain (e.g., `sku`, `name`, `price`) +- **Validation Rules**: The data type for each property (`string`, `number`, `boolean`), as well as constraints like minimum/maximum length, regular expression patterns, or enumerated values -All Snowplow trackers follow the same pattern: +The data payload itself contains a reference to the specific schema and version that defines it, which is why it's called a "self-describing JSON". This creates a powerful, unambiguous, and shared language for data across the entire organization. 
When a product manager designs a new feature, they collaborate with engineers and analysts to define the schemas for the new events and entities involved. This contract is then stored in Iglu. The engineers implement tracking based on this contract, and the analysts know exactly what data to expect in the warehouse because they can reference the same contract. This is a cultural shift that treats data as a deliberately designed product, not as a byproduct of application code. -1. Initialize tracker with a Collector endpoint -3. Use `track` methods to send events +### Analytical implications: How the event-entity model unlocks deeper, contextual insights -Use [Snowtype](/docs/data-product-studio/snowtype/index.md) to generate type-safe tracking code based on your tracking plan schemas. +The architectural advantage of the event-entity model becomes apparent in the data warehouse. In a Segment implementation, each custom event type is loaded into its own table (e.g., `order_completed`, `product_viewed`). While this provides structure, it can lead to a large number of tables in the warehouse, a challenge sometimes referred to as "schema sprawl." A significant amount of analytical work involves discovering the correct tables and then `UNION`-ing them together to reconstruct a user's complete journey. -Use [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) to run a local Snowplow pipeline in Docker. Point your application to the local instance to validate events in real-time before deployment. +Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifies this downstream work by using a "one big table" approach. All data is loaded into a single, wide [`atomic.events` table](https://docs.snowplow.io/docs/fundamentals/canonical-event/). Self-describing events and their associated entities are not loaded into separate tables. Instead, they are stored as dedicated, structured columns within that one table—for example, as an `OBJECT` in Snowflake or a `REPEATED RECORD` in BigQuery. This model avoids the schema sprawl of the Segment approach. -For historical data, you have a choice of approaches: -- Coexistence: leave historical Segment data in existing tables. Use transformation layer to `UNION` Segment and Snowplow data for longitudinal analysis. -- Unification: transform and backfill historical Segment data into Snowplow format. Requires custom engineering project but provides unified historical dataset. +For an analyst, this means that to get a complete picture of an `add_to_cart` event and the product involved, they query a single, predictable table. The event and all its contextual entities are present in the same row. This structure can simplify data modeling in tools like dbt and accelerate time-to-insight, as the analytical work shifts from joining many disparate event tables to unnesting or accessing data within the structured columns of a single table. It is important to note that this loading behavior is different for Amazon Redshift, where each entity type does get loaded into its own separate table. -During parallel tracking, compare data in your warehouse using SQL queries. 
Focus on: +| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail | +| ----------------------- | ------------------------------------------------------------- | --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-Describing Event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties. | +| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User Entity & `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at`. | +| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` & `web_page` Entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity. | +| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` Entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.). | -- Daily event counts -- Unique user counts -- Critical property values +## Architecting your migration: A phased framework -### Cutover and finalize +A successful migration requires a well-defined strategy that manages risk and ensures data continuity. This section outlines a high-level project plan, including different strategic scenarios and a plan for handling historical data. 
-- Update downstream consumers (BI dashboards, [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/)) to use Snowplow data -- Remove Segment trackers from application code -- Decommission Segment sources +### The three-phase migration roadmap + +A migration from Segment to Snowplow can be broken down into three phases: + +- **Phase 1: Assess and plan** + - Audit all existing Segment `track`, `identify`, `page`, and `group` calls + - Export the complete Segment Tracking Plan via API (if you still have an active account) or infer it from data in a data warehouse + - Translate the Segment plan into a Snowplow tracking plan, defining event schemas and identifying reusable entities - using the Snowplow CLI MCP Server + - Deploy the Snowplow pipeline components ([Collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/), [Enrich](https://docs.snowplow.io/docs/pipeline-components-and-applications/enrichment-components/), [Loaders](https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/)) and the [Iglu Schema Registry](https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/) in your cloud +- **Phase 2: Implement and validate** + - Add [Snowplow trackers](https://docs.snowplow.io/docs/collecting-data/) to your applications to run in parallel with existing Segment trackers (dual-tracking) + - Use tools like [Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/) for local testing and validation before deployment + - Perform end-to-end data reconciliation in your data warehouse by comparing Segment and Snowplow data to ensure accuracy +- **Phase 3: Cutover and optimize** + - Update all downstream data consumers (BI dashboards, [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/)) to query the new Snowplow data tables + - Remove the Segment trackers and SDKs from application codebases + - Decommission the Segment sources and, eventually, the subscription + +### Migration scenario 1: The parallel-run approach + +The parallel-run approach is the recommended, lowest-risk strategy. It involves running both systems simultaneously (dual-tracking) to validate data integrity before cutting over. Existing Segment-powered workflows remain operational while you test and reconcile the new Snowplow data in the warehouse. This approach builds confidence and allows you to resolve discrepancies without impacting production systems. + +### Migration scenario 2: The full re-architecture + +A "rip-and-replace" approach is faster but riskier, involving a direct switch from Segment to Snowplow SDKs. This is best suited for: + +- New projects or applications with no legacy system +- Major application refactors where the switch can be part of a larger effort +- Teams with high risk tolerance and robust automated testing frameworks + +This strategy requires thorough pre-launch testing in a staging environment to prevent data loss. + +### A strategy for historical data + +You have two main options for handling historical data from Segment: + +- **Option A: Coexistence (Pragmatic)** Leave historical Segment data in its existing tables. For longitudinal analysis, write queries that `UNION` data from both Segment and Snowplow tables, using a transformation layer (e.g., in dbt) to create a compatible structure. 
This avoids a large backfill project +- **Option B: Unification (Backfill)** For a single, unified dataset, undertake a custom engineering project to transform and backfill historical data. This involves exporting Segment data, writing a script to reshape it into the Snowplow enriched event format, and loading it into the warehouse. This is a significant effort but provides a consistent historical dataset + +## The technical playbook: Executing your migration + +This section provides a detailed, hands-on playbook for the technical execution of the migration. A central theme of this playbook is the use of the Snowplow CLI and its integrated AI capabilities to accelerate the most challenging part of the migration: designing a new, high-quality tracking plan. + +### Step 1: Deconstruct your legacy: Export and analyze the Segment tracking plan + +Before building the new data foundation, you must create a complete blueprint of the existing structure. The first practical step is to export your Segment Tracking Plan into a machine-readable format that can serve as the raw material for your redesign. + +There are two primary methods for this export: + +1. **Manual CSV download**: The Segment UI provides an option to download your Tracking Plan as a CSV file. This is a quick way to get a human-readable inventory of your events and properties. However, it can be less ideal for programmatic analysis and may not capture the full structural detail of your plan +2. **Programmatic API export (recommended)**: The superior method is to use the Segment Public API. The API allows you to programmatically list all Tracking Plans in your workspace and retrieve the full definition of each plan, including its rules, in a structured JSON format. This JSON output is invaluable because it often includes the underlying JSON Schema that Segment uses to validate the `properties` of each event + +The result of this step is a definitive, version-controlled artifact (e.g., a `segment_plan.json` file) that represents the ground truth of your current tracking implementation. This file will be the primary input for the next step of the process. + +### Step 2: AI-assisted design: Build your Snowplow tracking plan with the CLI and MCP server + +Next, you'll need to translate that tracking plan into a Snowplow-appropriate format (Data Products and Data Structures). + +The [Snowplow CLI](https://docs.snowplow.io/docs/data-product-studio/snowplow-cli/) is a command-line utility that includes a Model Context Protocol (MCP) server, so you can use an AI agent to generate idiomatic Snowplow tracking. For more information on how to do this, read the [tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/). + +### Step 3: Re-instrument your codebase: A conceptual guide + +With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/), our Code Generation tool, to automatically generate type-safe tracking code. + +#### Migrate client-side tracking: From `analytics.js` to the Snowplow Browser Tracker + +The [Snowplow JavaScript/Browser tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) introduces a more modern and readable API. 
The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments. + +- A Segment call like `analytics.track('Event', {prop: 'value'})` becomes a Snowplow call like `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})` +- A Segment `identify` call is replaced by a combination of a `setUserId` call to set the primary user identifier and the attachment of a custom `user` entity to provide the user traits + +This object-based approach improves code readability, as the purpose of each value is explicit, and makes the tracking calls more extensible for the future. + +#### Migrate server-side and mobile tracking: An overview of Snowplow's polyglot trackers + +Snowplow provides a comprehensive suite of [trackers](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/) for virtually every common back-end language and mobile platform, including [Java](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/java-tracker/), [Python](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/python-tracker/), [.NET](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/net-tracker/), [Go](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/go-tracker/), [Ruby](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/ruby-tracker/), [iOS](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/ios-tracker/) (Swift/Objective-C), and [Android](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/android-tracker/) (Kotlin/Java). + +While the syntax is idiomatic to each language, the underlying paradigm remains the same across all trackers. The developer will: + +1. Initialize the tracker with the endpoint of their [Snowplow collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/) +2. Use builder patterns or helper classes to construct self-describing events and entity objects, referencing the schema URIs from the Iglu registry. For example, the Java tracker uses a `SelfDescribing.builder()` to construct the event payload +3. Use a `track` method to send the fully constructed event to the collector + +The consistency of the event-entity model across all trackers ensures that data from every platform will arrive in the warehouse in a unified, coherent structure. + +### Step 4: Configure downstream integrations + +After implementing tracking, you'll want to connect your Snowplow data to downstream systems. Snowplow provides two main approaches for this: + +**Event forwarding** enables real-time streaming of enriched events to various destinations. This capability allows you to send data to custom endpoints, message queues, or third-party services as events flow through the pipeline. You can configure forwarding rules to send specific event types or filtered data streams to different destinations. For detailed setup instructions, see the [event forwarding documentation](https://docs.snowplow.io/docs/destinations/forwarding-events/). + +**Reverse ETL workflows** leverage your data warehouse as the source of truth for activating processed data in operational systems. 
Through Snowplow's partnership with Census, you can build sophisticated audience segments, computed user properties, and behavioral scores in your warehouse, then sync these insights to marketing automation platforms, CRM systems, and personalization tools. This approach enables data-driven activation workflows that would be difficult to achieve with traditional CDP routing. + +### Step 5: Ensure a smooth transition: Validation, testing, and cutover + +The final technical step is to rigorously validate the new implementation and manage the cutover. A smooth transition is non-negotiable. + +#### Local validation with Snowplow Micro + +To empower developers and "shift-left" on data quality, customers should incorporate **[Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/)**. Micro is a partial Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline. + +#### End-to-end data reconciliation strategies + +During the parallel-run phase, it is essential to perform end-to-end data reconciliation in the data warehouse. This involves writing a suite of SQL queries to compare the data collected by the two systems. Analysts should compare high-level metrics like daily event counts and unique user counts, as well as the values of specific, critical properties. The goal is not to achieve 100% identical data—the data models are different, which is the point of the migration. The goal is to be able to confidently explain any variances and to prove that the new Snowplow pipeline is capturing all critical business logic correctly. + +#### Final cutover: Decommission Segment senders + +Once the data has been thoroughly reconciled and all downstream dependencies (e.g., BI dashboards, ML models, marketing automation workflows) have been successfully migrated to use the new, richer Snowplow data tables, the team can proceed with the final cutover. This involves a coordinated deployment to remove the Segment SDKs and all `analytics.track()` calls from the codebases. Following general data migration best practices, the old Segment sources should be left active for a short period as a final fallback before being fully decommissioned. From 3ca6b28b4d7a1f4e55373e8b1b9a447dd1e51bf0 Mon Sep 17 00:00:00 2001 From: Miranda Wilson Date: Wed, 6 Aug 2025 14:03:15 +0100 Subject: [PATCH 8/9] Shorten text again --- docs/resources/migration-guides/index.md | 2 +- .../migration-guides/segment/index.md | 307 +++++++----------- 2 files changed, 122 insertions(+), 187 deletions(-) diff --git a/docs/resources/migration-guides/index.md b/docs/resources/migration-guides/index.md index ad28a85fe..9ead1d043 100644 --- a/docs/resources/migration-guides/index.md +++ b/docs/resources/migration-guides/index.md @@ -5,7 +5,7 @@ sidebar_position: 1 This section contains advice for migrating to Snowplow from other solutions. 
-In general, there are two possible migration strategies: parallel-run or full re-architecture. +There are two possible migration strategies: parallel-run or full re-architecture. ## Parallel-run diff --git a/docs/resources/migration-guides/segment/index.md b/docs/resources/migration-guides/segment/index.md index f761a9094..07ab4ff74 100644 --- a/docs/resources/migration-guides/segment/index.md +++ b/docs/resources/migration-guides/segment/index.md @@ -4,231 +4,166 @@ date: "2025-08-04" sidebar_position: 0 --- +This guide helps technical implementers migrate from Segment to Snowplow. -This guide is for technical implementers considering a migration from Segment to Snowplow. This move represents a shift from a managed Customer Data Platform (CDP) to a more flexible, composable behavioral data platform which runs in your cloud environment. +## Platform differences -## The strategic imperative: Why data teams migrate from Segment to Snowplow +There are a number of differences between Segment and Snowplow as a data platform. -The move from Segment to Snowplow is usually driven by a desire for greater control, higher data fidelity, and a more predictable financial model. +| Feature | Segment | Snowplow | +| ----------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------- | +| Deployment model | SaaS-only; data is processed on Segment servers | Private cloud (BDP Enterprise) and SaaS (BDP Cloud) are both available | +| Data ownership | Data access in warehouse; vendor controls pipeline | You own your data and control pipeline infrastructure | +| Governance model | Post-hoc validation with Protocols (premium add-on) | Schema validation for every event | +| Data structure | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | +| Warehouse structure | Separate tables for each custom event type | One single `atomic.events` table where possible | +| Pricing model | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | +| Real-time capability | Limited low-latency support and observability | Real-time streaming pipeline supports sub-second use cases | +| Downstream integrations | Native connections to 300+ tools | Event forwarding to custom destinations plus reverse ETL, powered by Census | -### Achieve data ownership and control in your cloud +## What do events look like in tracking? -The key architectural difference is deployment. Segment is a SaaS platform where your data is processed on their servers. Snowplow runs as a set of services in your private cloud (AWS/GCP/Azure), giving you full ownership of your data at all stages. +Segment and Snowplow structure and conceptualize events differently. -This provides several advantages: +### Segment event structure -- **Enhanced security and compliance**: Keeping data within your own cloud simplifies security reviews and compliance audits (e.g., GDPR, CCPA, HIPAA), as no third-party vendor processes raw user data -- **Complete data control**: You can configure, scale, and monitor every component of the pipeline according to your specific needs -- **Elimination of vendor lock-in**: Because you own the infrastructure and the data format is open, you are not locked into a proprietary ecosystem +Segment's core method for tracking user behavior is `track`. 
A `track` call contains a name that describes the action taken, and a `properties` object that contains contextual information about the action. -### A new approach to governance: Foundational data quality +The other Segment tracking methods are: +* `page` and `screen` record page views and screen views +* `identify` describes the user, and associates a `userId` with user `traits` +* `group` associates the user with a group +* `alias` merges user identities, for identity resolution across applications -Segment and Snowplow approach data governance differently. Segment's Protocols feature validates data reactively, acting as a gatekeeper for incoming events. This is a premium feature and, if not rigorously managed, can lead to inconsistent data requiring significant downstream cleaning. +Data about the user's action is tracked separately from data about the user. You'll stitch them together during data modeling in the warehouse. -Snowplow enforces data quality proactively with mandatory, machine-readable **[schemas](https://docs.snowplow.io/docs/fundamentals/schemas/)** for every [event](https://docs.snowplow.io/docs/fundamentals/events/) and [entity](https://docs.snowplow.io/docs/fundamentals/entities/). [Events that fail validation](https://docs.snowplow.io/docs/fundamentals/failed-events/) are quarantined for inspection, ensuring only clean, consistent data lands in your warehouse. This "shift-left" approach moves the cost of data quality from a continuous operational expense to a one-time design investment. +Here's an example showing how you could track an ecommerce transaction event on web using Segment: -### Unlock advanced analytics with greater granularity +```javascript +analytics.track('Transaction Completed', { + order_id: 'T_12345', + revenue: 99.99, + currency: 'USD', + products: [{ + product_id: 'ABC123', + name: 'Widget', + price: 99.99, + quantity: 1 + }] +}) +``` -Segment started out as a data router, excelling at sending event data to third-party tools. Snowplow is designed to create a rich, granular first-party behavioral data asset. Segment's `track` events use a flat JSON `properties` object, limiting contextual depth. Snowplow's [event-entity model](https://docs.snowplow.io/docs/fundamentals/events/) allows a single event to be enriched with numerous contextual entities on the tracker and also in the pipeline, providing over 100 structured data points per event. +The tracked events can be optionally validated against Protocols, defined as part of a tracking plan. They'll detect violations against your tracking plan, and you can choose to filter out events that don't pass validation. -This rich, structured data is ideal for: +### Snowplow event structure -- **Complex data modeling**: Snowplow provides source-available dbt packages to transform raw data into analysis-ready tables -- **AI and machine learning**: High-fidelity data is ideal for training ML models like recommendation engines or churn predictors -- **Deep user behavior analysis**: Rich entities enable multi-faceted exploration of user journeys without complex data wrangling +Snowplow separates the action that occurred (the [event](/docs/fundamentals/events/index.md)) from the contextual objects involved in the action (the entities), such as the user, the device, etc. 
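For example, with the JavaScript tracker several of the built-in entities are switched on in the tracker configuration at initialization. This is a sketch only; the exact option names vary by tracker version, so verify them against the web tracker documentation:

```javascript
snowplow('newTracker', 'sp', 'https://collector.acme.com', {
  appId: 'acme-web',
  contexts: {
    webPage: true,  // attach a web_page entity to every event
    session: true   // attach a client-side session entity
  }
});
```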
-### A predictable, infrastructure-based cost model +Snowplow SDKs also provide methods for tracking page views and screen views, along with many other kinds of events, such as button clicks, form submissions, page pings (activity), media interactions, and so on. -Segment's entry-level pricing is based on Monthly Tracked Users (MTUs), which can become expensive and unpredictable as you scale. This model can penalize growth. +All Snowplow events, whether designed by you or built-in, are defined by [JSON schemas](/docs/fundamentals/schemas/index.md). The events are always validated as they're processed through the Snowplow pipeline, and events that fail validation are separated out for assessment. -Snowplow's costs are based on your cloud infrastructure usage (compute and storage from AWS or GCP) plus a license fee depending on event volume which is more predictable and cost-effective at scale. This model aligns cost directly with data processing volume, not user count, encouraging comprehensive data collection without financial penalty. +The equivalent to Segment's custom `track` method is `track_self_describing_event`. -### Flexible integration with downstream tools +Here's an example showing how you could track a Snowplow ecommerce transaction event on web: -While Segment excels at routing data to third-party marketing and analytics tools, Snowplow provides flexible options for connecting your behavioral data to downstream systems. [Event forwarding](https://docs.snowplow.io/docs/destinations/forwarding-events/) enables real-time streaming of enriched events to various destinations, supporting both analytical and operational use cases. For reverse ETL workflows that send processed data back to operational systems, Snowplow has partnered with Census to provide best-in-class functionality for activating your warehouse data in marketing and sales tools. 
+```javascript +snowplow('trackTransaction', { + transaction_id: 'T_12345', + revenue: 99.99, + currency: 'USD', + products: [{ + id: 'ABC123', + name: 'Widget', + price: 99.99, + quantity: 1 + }] +}) +``` -| Feature | Segment | Snowplow | -| --------------------------- | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------- | -| **Deployment Model** | SaaS-only; data processed on Segment servers hosted by AWS | Private cloud; runs entirely in your AWS/GCP/Azure account | -| **Data Ownership** | Data access in warehouse; vendor controls pipeline | Customer owns data and controls pipeline infrastructure | -| **Governance Model** | Reactive; post-hoc validation with Protocols (a premium add-on) | Proactive; foundational schema validation for every event | -| **Data Structure** | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities | -| **Primary Use Case** | Building a Customer Data Platform for routing to 3rd party marketing/analytics tools | Creating a foundational behavioral data asset for BI and AI | -| **Pricing Model** | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume | -| **Real-Time Capability** | Limited low-latency support and observability | Real-time streaming pipeline (e.g., via Kafka) supports use cases in seconds | -| **Downstream Integrations** | Native connections to 300+ marketing and analytics tools | Event forwarding to custom destinations plus reverse ETL via Census partnership | +Superficially, it looks similar to Segment's `track` call. The first key difference is that the product property here contains a reusable `product` entity. This entity would be added to any other relevant event, such as `add_to_cart` or `view_product`. -## Deconstructing the data model: From flat events to rich context +Secondly, the Snowplow tracking SDKs add multiple entities to all tracked events by default, including information about the specific page or screen, the user's session, and the device or browser. Many other built-in entities can be configured, and you can define your own custom entities to any Snowplow event. -To appreciate the strategic value of migrating to Snowplow, it is essential to understand the fundamental differences in how each platform approaches the modeling of behavioral data. This is not just a technical distinction; it is a difference in approach that has consequences for data quality, flexibility, and analytical power. Segment operates on a simple, action-centric model, while Snowplow introduces a more sophisticated, context-centric paradigm that more accurately reflects the complexity of the real world. +### Tracking comparison -### The Segment model: A review of `track`, `identify`, and the property-centric approach +This table explains how different Segment tracking methods map to Snowplow events. -Segment's data specification is designed for simplicity and ease of use. It is built around a handful of core API methods that capture the essential elements of user interaction. The most foundational of these is the `track` call, which is designed to answer the question, "What is the user doing?". Each `track` call records a single user action, known as an event, which has a human-readable name (e.g., `User Registered`) and an associated `properties` object. 
This object is a simple JSON containing key-value pairs that describe the action (e.g., `plan: 'pro'`, accountType: 'trial'`). +| Segment concept | Segment example | Snowplow equivalent | +| ------------------- | ------------------------------------------------------------- | ---------------------------------------------------------- | +| Core action | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | Self-describing event with custom `order_completed` schema | +| User identification | `identify('user123', {plan: 'pro', created_at: '...'})` | User entity and `setUserId` call | +| Page context | `page('Pricing', {category: 'Products'})` | `trackPageView` with `web_page` entity | +| Reusable properties | `properties.product_sku` in multiple `track` calls | Dedicated `product` entity attached to relevant events | -The other key methods in the Segment spec support this action-centric view: +## What does the data look like in the warehouse? -- **`identify`**: Answers the question, "Who is the user?" It associates a `userId` with a set of `traits` (e.g., `email`, `name`), which describe the user themselves -- **`page` and `screen`**: Record when a user views a webpage or a mobile app screen, respectively -- **`group`**: Associates an individual user with a group, such as a company or organization -- **`alias`**: Used to merge the identities of a user across different systems or states (e.g., anonymous to logged-in) +Segment loads each custom event type into separate tables, for example, `order_completed`, or `product_viewed` tables. Analysts must `UNION` tables together to reconstruct user journeys. -This model forces the world into a verb-centric framework. The event—the action—is the primary object of interest. All other information, whether it describes the product involved, the user performing the action, or the page on which it occurred, is relegated to being a "property" or a "trait" attached to that action. While this approach is intuitive, it lacks a formal, structured way to define and reuse the *nouns* of the business—the users, products, content, and campaigns—as first-class, independent components of the data model itself. This architectural choice leads to data being defined and repeated within the context of each individual action, rather than as a set of interconnected, reusable concepts. It often requires a consolidation period down the line as downstream users struggle with data quality issues. +Snowplow uses a single [`atomic.events`](/docs/fundamentals/canonical-event/index.md) table in warehouses like Snowflake and BigQuery. Events and entities are stored as structured columns within that table, simplifying analysis. -### The Snowplow approach: Understanding the event-entity distinction +## Migration phases -Snowplow introduces a more nuanced and powerful paradigm that separates the *event* (the action that occurred at a point in time) from the *entities* (the nouns that were involved in that action). In Snowplow, every tracked event can be decorated with an array of contextual entities. This is the core of the event-entity model. +We recommend using a parallel-run migration approach. This process can be divided into three phases: +1. Assess and plan +2. Implement and validate +3. Cutover and finalize -An **[event](https://docs.snowplow.io/docs/fundamentals/events/)** is an immutable record of something that happened. 
A **[self-describing event](https://docs.snowplow.io/docs/fundamentals/events/#self-describing-events)** in Snowplow is the equivalent of a Segment `track` call, capturing a specific action like `add_to_cart`. +### Assess and plan -An **[entity](https://docs.snowplow.io/docs/fundamentals/entities/)**, however, is a reusable, self-describing JSON object that provides rich, structured context about the circumstances surrounding an event. This distinction is a key differentiator. Consider a retail example. Instead of adding properties like `product_sku`, `product_name`, and `product_price` to every single event related to a product, you define a single, reusable `product` entity. This one entity can then be attached to a multitude of different events throughout the customer journey: +#### Audit existing implementation +- Audit the Segment tracking calls in your application code +- Document all downstream data consumers, such as BI dashboards, dbt models, or ML pipelines +- Export your complete Segment tracking plan, using one of these methods: + - Ideally, use the Segment Public API to obtain the full JSON structure for each event + - Manually download CSVs from the Segment UI + - Infer it from warehouse data -- `view_product` -- `add_to_basket` -- `remove_from_basket` -- `purchase_product` -- `review_product` +#### Design Snowplow tracking plan +- Translate Segment events into Snowplow self-describing events + - The [Snowplow CLI](/docs/data-product-studio/snowplow-cli/index.md) MCP server can help with this +- Identify reusable entities that can replace repeated properties +- Create JSON schemas for all events and entities -This approach reflects the real world more accurately. An "event" is a momentary action, while "entities" like users, products, and marketing campaigns are persistent objects that participate in many events over time. This separation provides immense power. It allows you to analyze the `product` entity across its entire lifecycle, from initial discovery to final purchase, by querying a single, consistent data structure. You are no longer forced to hunt for and coalesce disparate property fields (`viewed_product_sku`, `purchased_product_sku`, etc.) across different event tables. +#### Deploy infrastructure +- Confirm that your Snowplow infrastructure is up and running +- Publish your schemas so they're available to your pipeline + - Use Snowplow BDP Console or the Snowplow CLI -Furthermore, Snowplow comes with a rich set of [out-of-the-box entities](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/tracking-events/#out-of-the-box-entity-tracking) that can be enabled to automatically enrich every event with crucial context, such as the `webPage` entity, the `performanceTiming` entity for site speed, and the `user` entity. This moves the data model from being action-centric to being context-centric, providing a much richer and more interconnected view of behavior from the moment of collection. 
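One low-risk way to implement the dual-tracking step above is a thin wrapper that emits each action to both systems during the parallel run, so the two datasets can be reconciled in the warehouse afterwards. This is a sketch only; the helper name and schema URI are hypothetical.

```javascript
// Temporary wrapper used only during the parallel-run phase
function trackDual(segmentName, snowplowSchema, properties) {
  // Existing Segment instrumentation stays untouched
  analytics.track(segmentName, properties);

  // New Snowplow instrumentation runs alongside it
  snowplow('trackSelfDescribingEvent', {
    event: { schema: snowplowSchema, data: properties }
  });
}

trackDual(
  'Order Completed',
  'iglu:com.acme/order_completed/jsonschema/1-0-0',
  { revenue: 99.99, currency: 'USD' }
);
```

Once reconciliation is complete, the wrapper is removed along with the Segment SDK as part of the cutover.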
+### Implement and validate
-### The language of your business: Building composable data structures with self-describing schemas (data contracts)
+#### Set up dual tracking
+- Add [Snowplow tracking](/docs/sources/index.md) to run in parallel with existing Segment tracking (see the example at the end of this phase)
+  - Use [Snowtype](/docs/data-product-studio/snowtype/index.md) to generate type-safe tracking code
+- Use [Snowplow Micro](/docs/data-product-studio/data-quality/snowplow-micro/index.md) for local testing and validation
-The technical foundation that makes the event-entity model possible is Snowplow's use of **self-describing schemas**. In the Segment world, developers often start by implementing events, and then the data team then retrospectively classifies and governs them via Segment Protocols. While they do provide tracking plan capabilities, these are hard to find or tied to enterprise-level onboarding.
+#### Data validation
+- Compare high-level metrics between systems, for example daily event counts or unique users
+- Validate critical business logic and property values
+- Perform end-to-end data reconciliation in your warehouse
+- Decide what to do about historical data
-In the Snowplow ecosystem, the schema registry *is* the single source of truth. Every self-describing event and every custom entity is defined by a formal JSON Schema, which is stored and versioned in a schema registry called **[Iglu](https://docs.snowplow.io/docs/fundamentals/schemas/#iglu)**. Each schema is a machine-readable contract that specifies:
+For historical data, you have a choice of approaches:
+- Coexistence: leave historical Segment data in existing tables. Write queries that `UNION` data from both systems, using a transformation layer (for example, in dbt) to create compatible structures.
+- Unification: transform and backfill historical Segment data into Snowplow format. This requires a custom engineering project to export Segment data, reshape it into the Snowplow enriched event format, and load it into the warehouse. The result is a unified historical dataset.
-- **Identity**: A unique URI comprising a vendor (`com.acme`), a name (`product`), a format (`jsonschema`), and a version (`1-0-0`)
-- **Structure**: The exact properties the event or entity should contain (e.g., `sku`, `name`, `price`)
-- **Validation Rules**: The data type for each property (`string`, `number`, `boolean`), as well as constraints like minimum/maximum length, regular expression patterns, or enumerated values
+#### Gradual rollout
+- Start with non-critical pages or features
+- Gradually expand to cover all tracking points
+- Monitor data quality and pipeline health
+- Update [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/) to use the new data structure
-The data payload itself contains a reference to the specific schema and version that defines it, which is why it's called a "self-describing JSON". This creates a powerful, unambiguous, and shared language for data across the entire organization. When a product manager designs a new feature, they collaborate with engineers and analysts to define the schemas for the new events and entities involved. This contract is then stored in Iglu. The engineers implement tracking based on this contract, and the analysts know exactly what data to expect in the warehouse because they can reference the same contract. This is a cultural shift that treats data as a deliberately designed product, not as a byproduct of application code.
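+
+During the parallel run, both SDKs track the same interaction side by side. Here's a rough sketch; the tracker name, collector URL, and schema URI are placeholders, and during local development you could point the Snowplow tracker at Snowplow Micro (by default on `http://localhost:9090`) to validate events against your schemas before deploying.
+
+```javascript
+// Existing Segment tracking stays in place for now
+analytics.track('Checkout Started', { cartValue: 120.0 })
+
+// Snowplow tracking runs alongside it (tracker setup shown inline for brevity)
+snowplow('newTracker', 'sp', 'https://collector.acme.com', { appId: 'web' })
+snowplow('setUserId', 'user123')
+snowplow('trackSelfDescribingEvent', {
+  event: {
+    schema: 'iglu:com.acme/checkout_started/jsonschema/1-0-0',
+    data: { cart_value: 120.0 }
+  }
+})
+```
+
+Once events from both systems are landing in the warehouse, compare daily event counts and key property values before expanding the rollout.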
+### Cutover and finalize
-### Analytical implications: How the event-entity model unlocks deeper, contextual insights
+#### Update downstream consumers
+- Migrate BI dashboards to query Snowplow tables
+- Test all data-dependent workflows
-The architectural advantage of the event-entity model becomes apparent in the data warehouse. In a Segment implementation, each custom event type is loaded into its own table (e.g., `order_completed`, `product_viewed`). While this provides structure, it can lead to a large number of tables in the warehouse, a challenge sometimes referred to as "schema sprawl." A significant amount of analytical work involves discovering the correct tables and then `UNION`-ing them together to reconstruct a user's complete journey.
+#### Configure integrations
+- Set up [event forwarding](https://docs.snowplow.io/docs/destinations/forwarding-events/) for real-time destinations
+- Configure reverse ETL workflows to use your new modeled data
-Snowplow's data model for modern warehouses like Snowflake and BigQuery simplifies this downstream work by using a "one big table" approach. All data is loaded into a single, wide [`atomic.events` table](https://docs.snowplow.io/docs/fundamentals/canonical-event/). Self-describing events and their associated entities are not loaded into separate tables. Instead, they are stored as dedicated, structured columns within that one table—for example, as an `OBJECT` in Snowflake or a `REPEATED RECORD` in BigQuery. This model avoids the schema sprawl of the Segment approach.
-
-For an analyst, this means that to get a complete picture of an `add_to_cart` event and the product involved, they query a single, predictable table. The event and all its contextual entities are present in the same row. This structure can simplify data modeling in tools like dbt and accelerate time-to-insight, as the analytical work shifts from joining many disparate event tables to unnesting or accessing data within the structured columns of a single table. It is important to note that this loading behavior is different for Amazon Redshift, where each entity type does get loaded into its own separate table.
-
-| Segment Concept | Segment Example | Snowplow Equivalent | Snowplow Implementation Detail |
-| ----------------------- | ------------------------------------------------------------- | --------------------------------------- | ------------------------------ |
-| **Core Action** | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | **Self-Describing Event** | `trackSelfDescribingEvent` with a custom `order_completed` schema containing `revenue` (number) and `currency` (string) properties. |
-| **User Identification** | `identify('user123', {plan: 'pro', created_at: '...'})` | **User Entity & `setUserId`** | A call to `setUserId('user123')` to populate the atomic `user_id` field, plus attaching a custom `user` entity with a schema containing properties like `plan` and `created_at`. |
-| **Page/Screen Context** | `page('Pricing', {category: 'Products'})` | **`trackPageView` & `web_page` Entity** | A `trackPageView` call with a `title` of 'Pricing'. This automatically attaches the standard `web_page` entity. The `category` would be a custom property added to a custom `web_page` context or a separate content entity. |
-| **Reusable Properties** | `properties.product_sku` in multiple `track` calls | **Dedicated `product` Entity** | A single, reusable `product` entity schema is defined with a `sku` property. This entity is then attached as context to all relevant events (`product_viewed`, `add_to_cart`, etc.). |
-
-## Architecting your migration: A phased framework
-
-A successful migration requires a well-defined strategy that manages risk and ensures data continuity. This section outlines a high-level project plan, including different strategic scenarios and a plan for handling historical data.
-
-### The three-phase migration roadmap
-
-A migration from Segment to Snowplow can be broken down into three phases:
-
-- **Phase 1: Assess and plan**
-  - Audit all existing Segment `track`, `identify`, `page`, and `group` calls
-  - Export the complete Segment Tracking Plan via API (if you still have an active account) or infer it from data in a data warehouse
-  - Translate the Segment plan into a Snowplow tracking plan, defining event schemas and identifying reusable entities - using the Snowplow CLI MCP Server
-  - Deploy the Snowplow pipeline components ([Collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/), [Enrich](https://docs.snowplow.io/docs/pipeline-components-and-applications/enrichment-components/), [Loaders](https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/)) and the [Iglu Schema Registry](https://docs.snowplow.io/docs/pipeline-components-and-applications/iglu/) in your cloud
-- **Phase 2: Implement and validate**
-  - Add [Snowplow trackers](https://docs.snowplow.io/docs/collecting-data/) to your applications to run in parallel with existing Segment trackers (dual-tracking)
-  - Use tools like [Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/) for local testing and validation before deployment
-  - Perform end-to-end data reconciliation in your data warehouse by comparing Segment and Snowplow data to ensure accuracy
-- **Phase 3: Cutover and optimize**
-  - Update all downstream data consumers (BI dashboards, [dbt models](https://docs.snowplow.io/docs/modeling-data/modeling-your-data/dbt/)) to query the new Snowplow data tables
-  - Remove the Segment trackers and SDKs from application codebases
-  - Decommission the Segment sources and, eventually, the subscription
-
-### Migration scenario 1: The parallel-run approach
-
-The parallel-run approach is the recommended, lowest-risk strategy. It involves running both systems simultaneously (dual-tracking) to validate data integrity before cutting over. Existing Segment-powered workflows remain operational while you test and reconcile the new Snowplow data in the warehouse. This approach builds confidence and allows you to resolve discrepancies without impacting production systems.
-
-### Migration scenario 2: The full re-architecture
-
-A "rip-and-replace" approach is faster but riskier, involving a direct switch from Segment to Snowplow SDKs. This is best suited for:
-
-- New projects or applications with no legacy system
-- Major application refactors where the switch can be part of a larger effort
-- Teams with high risk tolerance and robust automated testing frameworks
-
-This strategy requires thorough pre-launch testing in a staging environment to prevent data loss.
-### A strategy for historical data
-
-You have two main options for handling historical data from Segment:
-
-- **Option A: Coexistence (Pragmatic)** Leave historical Segment data in its existing tables. For longitudinal analysis, write queries that `UNION` data from both Segment and Snowplow tables, using a transformation layer (e.g., in dbt) to create a compatible structure. This avoids a large backfill project
-- **Option B: Unification (Backfill)** For a single, unified dataset, undertake a custom engineering project to transform and backfill historical data. This involves exporting Segment data, writing a script to reshape it into the Snowplow enriched event format, and loading it into the warehouse. This is a significant effort but provides a consistent historical dataset
-
-## The technical playbook: Executing your migration
-
-This section provides a detailed, hands-on playbook for the technical execution of the migration. A central theme of this playbook is the use of the Snowplow CLI and its integrated AI capabilities to accelerate the most challenging part of the migration: designing a new, high-quality tracking plan.
-
-### Step 1: Deconstruct your legacy: Export and analyze the Segment tracking plan
-
-Before building the new data foundation, you must create a complete blueprint of the existing structure. The first practical step is to export your Segment Tracking Plan into a machine-readable format that can serve as the raw material for your redesign.
-
-There are two primary methods for this export:
-
-1. **Manual CSV download**: The Segment UI provides an option to download your Tracking Plan as a CSV file. This is a quick way to get a human-readable inventory of your events and properties. However, it can be less ideal for programmatic analysis and may not capture the full structural detail of your plan
-2. **Programmatic API export (recommended)**: The superior method is to use the Segment Public API. The API allows you to programmatically list all Tracking Plans in your workspace and retrieve the full definition of each plan, including its rules, in a structured JSON format. This JSON output is invaluable because it often includes the underlying JSON Schema that Segment uses to validate the `properties` of each event
-
-The result of this step is a definitive, version-controlled artifact (e.g., a `segment_plan.json` file) that represents the ground truth of your current tracking implementation. This file will be the primary input for the next step of the process.
-
-### Step 2: AI-assisted design: Build your Snowplow tracking plan with the CLI and MCP server
-
-Next, you'll need to translate that tracking plan into a Snowplow-appropriate format (Data Products and Data Structures).
-
-The [Snowplow CLI](https://docs.snowplow.io/docs/data-product-studio/snowplow-cli/) is a command-line utility that includes a Model Context Protocol (MCP) server, so you can use an AI agent to generate idiomatic Snowplow tracking. For more information on how to do this, read the [tutorial](https://docs.snowplow.io/tutorials/snowplow-cli-mcp/introduction/).
-
-### Step 3: Re-instrument your codebase: A conceptual guide
-
-With a robust and well-designed tracking plan published to your Iglu registry, the next step is to update your application code to send events to Snowplow. While the specific code will vary by language and platform, the core concepts are consistent. We recommend using [Snowtype](https://docs.snowplow.io/docs/data-product-studio/snowtype/), our Code Generation tool, to automatically generate type-safe tracking code.
-
-#### Migrate client-side tracking: From `analytics.js` to the Snowplow Browser Tracker
-
-The [Snowplow JavaScript/Browser tracker](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/javascript-trackers/web-tracker/) introduces a more modern and readable API. The most significant change from Segment's `analytics.js` is the move from function calls with long, ordered parameter lists to calls that accept a single object with named arguments.
-
-- A Segment call like `analytics.track('Event', {prop: 'value'})` becomes a Snowplow call like `snowplow('trackSelfDescribingEvent', {schema: 'iglu:com.acme/event/jsonschema/1-0-0', data: {prop: 'value'}})`
-- A Segment `identify` call is replaced by a combination of a `setUserId` call to set the primary user identifier and the attachment of a custom `user` entity to provide the user traits
-
-This object-based approach improves code readability, as the purpose of each value is explicit, and makes the tracking calls more extensible for the future.
-
-#### Migrate server-side and mobile tracking: An overview of Snowplow's polyglot trackers
-
-Snowplow provides a comprehensive suite of [trackers](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/) for virtually every common back-end language and mobile platform, including [Java](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/java-tracker/), [Python](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/python-tracker/), [.NET](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/net-tracker/), [Go](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/go-tracker/), [Ruby](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/ruby-tracker/), [iOS](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/ios-tracker/) (Swift/Objective-C), and [Android](https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/mobile-trackers/android-tracker/) (Kotlin/Java).
-
-While the syntax is idiomatic to each language, the underlying paradigm remains the same across all trackers. The developer will:
-
-1. Initialize the tracker with the endpoint of their [Snowplow collector](https://docs.snowplow.io/docs/pipeline-components-and-applications/stream-collector/)
-2. Use builder patterns or helper classes to construct self-describing events and entity objects, referencing the schema URIs from the Iglu registry. For example, the Java tracker uses a `SelfDescribing.builder()` to construct the event payload
-3. Use a `track` method to send the fully constructed event to the collector
-
-The consistency of the event-entity model across all trackers ensures that data from every platform will arrive in the warehouse in a unified, coherent structure.
-
-### Step 4: Configure downstream integrations
-
-After implementing tracking, you'll want to connect your Snowplow data to downstream systems. Snowplow provides two main approaches for this:
-
-**Event forwarding** enables real-time streaming of enriched events to various destinations. This capability allows you to send data to custom endpoints, message queues, or third-party services as events flow through the pipeline. You can configure forwarding rules to send specific event types or filtered data streams to different destinations. For detailed setup instructions, see the [event forwarding documentation](https://docs.snowplow.io/docs/destinations/forwarding-events/).
-
-**Reverse ETL workflows** leverage your data warehouse as the source of truth for activating processed data in operational systems. Through Snowplow's partnership with Census, you can build sophisticated audience segments, computed user properties, and behavioral scores in your warehouse, then sync these insights to marketing automation platforms, CRM systems, and personalization tools. This approach enables data-driven activation workflows that would be difficult to achieve with traditional CDP routing.
-
-### Step 5: Ensure a smooth transition: Validation, testing, and cutover
-
-The final technical step is to rigorously validate the new implementation and manage the cutover. A smooth transition is non-negotiable.
-
-#### Local validation with Snowplow Micro
-
-To empower developers and "shift-left" on data quality, customers should incorporate **[Snowplow Micro](https://docs.snowplow.io/docs/testing-debugging/snowplow-micro/)**. Micro is a partial Snowplow pipeline packaged into a single Docker container that can be run on a developer's local machine. Before committing any new tracking code, a developer can point their application's tracker to their local Micro instance. They can then interact with the application and see the events they generate appear in the Micro UI in real-time. Micro performs the same validation against the Iglu registry as the production pipeline, allowing developers to instantly confirm that their events are well-formed and pass schema validation. This catches errors early, reduces the feedback loop from hours to seconds, and prevents bad data from ever reaching the production pipeline.
-
-#### End-to-end data reconciliation strategies
-
-During the parallel-run phase, it is essential to perform end-to-end data reconciliation in the data warehouse. This involves writing a suite of SQL queries to compare the data collected by the two systems. Analysts should compare high-level metrics like daily event counts and unique user counts, as well as the values of specific, critical properties. The goal is not to achieve 100% identical data—the data models are different, which is the point of the migration. The goal is to be able to confidently explain any variances and to prove that the new Snowplow pipeline is capturing all critical business logic correctly.
-
-#### Final cutover: Decommission Segment senders
-
-Once the data has been thoroughly reconciled and all downstream dependencies (e.g., BI dashboards, ML models, marketing automation workflows) have been successfully migrated to use the new, richer Snowplow data tables, the team can proceed with the final cutover. This involves a coordinated deployment to remove the Segment SDKs and all `analytics.track()` calls from the codebases. Following general data migration best practices, the old Segment sources should be left active for a short period as a final fallback before being fully decommissioned.
+#### Complete transition
+- Remove Segment tracking from codebases
+- Decommission Segment sources
+- Cancel Segment subscription once validation period is complete

From 6358029d89a44ad4af901ac4167e24fb669a628b Mon Sep 17 00:00:00 2001
From: Miranda Wilson
Date: Wed, 6 Aug 2025 14:29:21 +0100
Subject: [PATCH 9/9] Remove comparison table

---
 .../migration-guides/segment/index.md | 48 +++++++------------
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/docs/resources/migration-guides/segment/index.md b/docs/resources/migration-guides/segment/index.md
index 07ab4ff74..2a6f3bd34 100644
--- a/docs/resources/migration-guides/segment/index.md
+++ b/docs/resources/migration-guides/segment/index.md
@@ -8,22 +8,7 @@ This guide helps technical implementers migrate from Segment to Snowplow.

 ## Platform differences

-There are a number of differences between Segment and Snowplow as a data platform.
-
-| Feature | Segment | Snowplow |
-| ----------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------- |
-| Deployment model | SaaS-only; data is processed on Segment servers | Private cloud (BDP Enterprise) and SaaS (BDP Cloud) are both available |
-| Data ownership | Data access in warehouse; vendor controls pipeline | You own your data and control pipeline infrastructure |
-| Governance model | Post-hoc validation with Protocols (premium add-on) | Schema validation for every event |
-| Data structure | Flat events with properties, user traits and context objects | Rich events enriched by multiple, reusable entities |
-| Warehouse structure | Separate tables for each custom event type | One single `atomic.events` table where possible |
-| Pricing model | Based on Monthly Tracked Users (MTUs) or API calls | Based on event volume |
-| Real-time capability | Limited low-latency support and observability | Real-time streaming pipeline supports sub-second use cases |
-| Downstream integrations | Native connections to 300+ tools | Event forwarding to custom destinations plus reverse ETL, powered by Census |
-## What do events look like in tracking?
-
-Segment and Snowplow structure and conceptualize events differently.
+There are a number of differences between Segment and Snowplow as a data platform. For migration, it's important to be aware of how Snowplow structures events differently from Segment. This affects how you'll implement tracking and how you'll model the warehouse data.

 ### Segment event structure

@@ -35,9 +20,9 @@ The other Segment tracking methods are:
 * `group` associates the user with a group
 * `alias` merges user identities, for identity resolution across applications

-Data about the user's action is tracked separately from data about the user. You'll stitch them together during data modeling in the warehouse.
+With Segment, you track data about the user's action separately from data about the user. These are stitched together during data modeling in the warehouse.
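+
+For example, the user's traits and the user's actions are sent as two separate calls (a minimal sketch with illustrative values):
+
+```javascript
+// Who the user is: traits are attached to the identify call
+analytics.identify('user123', { plan: 'pro', email: 'jo@example.com' })
+
+// What the user did: properties are attached to the track call
+analytics.track('Plan Upgraded', { previousPlan: 'free' })
+```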
-Here's an example showing how you could track an ecommerce transaction event on web using Segment: +Here's an example showing how you can track an ecommerce transaction event on web using Segment: ```javascript analytics.track('Transaction Completed', { @@ -57,13 +42,14 @@ The tracked events can be optionally validated against Protocols, defined as par ### Snowplow event structure -Snowplow separates the action that occurred (the [event](/docs/fundamentals/events/index.md)) from the contextual objects involved in the action (the entities), such as the user, the device, etc. +Snowplow separates the action that occurred (the [event](/docs/fundamentals/events/index.md)) from the contextual objects involved in the action (the [entities](/docs/fundamentals/entities/index.md)), such as the user, the device, etc. Snowplow SDKs also provide methods for tracking page views and screen views, along with many other kinds of events, such as button clicks, form submissions, page pings (activity), media interactions, and so on. +The equivalent to Segment's custom `track` method is `trackSelfDescribingEvent`. + All Snowplow events, whether designed by you or built-in, are defined by [JSON schemas](/docs/fundamentals/schemas/index.md). The events are always validated as they're processed through the Snowplow pipeline, and events that fail validation are separated out for assessment. -The equivalent to Segment's custom `track` method is `track_self_describing_event`. Here's an example showing how you could track a Snowplow ecommerce transaction event on web: @@ -81,22 +67,24 @@ snowplow('trackTransaction', { }) ``` -Superficially, it looks similar to Segment's `track` call. The first key difference is that the product property here contains a reusable `product` entity. This entity would be added to any other relevant event, such as `add_to_cart` or `view_product`. +Superficially, it looks similar to Segment's `track` call. The first key difference is that the products property here contains a reusable `product` entity. You'd add this entity to any other relevant event, such as `add_to_cart` or `view_product`. -Secondly, the Snowplow tracking SDKs add multiple entities to all tracked events by default, including information about the specific page or screen, the user's session, and the device or browser. Many other built-in entities can be configured, and you can define your own custom entities to any Snowplow event. +Secondly, the Snowplow tracking SDKs add multiple entities to all tracked events by default, including information about the specific page or screen view, the user's session, and the device or browser. Many other built-in entities can be configured, and you can define your own custom entities to add to any or all Snowplow events. ### Tracking comparison -This table explains how different Segment tracking methods map to Snowplow events. +This table gives examples of how the different Segment tracking methods map to Snowplow tracking. 
-| Segment concept | Segment example | Snowplow equivalent | -| ------------------- | ------------------------------------------------------------- | ---------------------------------------------------------- | -| Core action | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | Self-describing event with custom `order_completed` schema | -| User identification | `identify('user123', {plan: 'pro', created_at: '...'})` | User entity and `setUserId` call | -| Page context | `page('Pricing', {category: 'Products'})` | `trackPageView` with `web_page` entity | -| Reusable properties | `properties.product_sku` in multiple `track` calls | Dedicated `product` entity attached to relevant events | +| Segment API Call | Segment Example | Snowplow Implementation | +| ---------------- | --------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `track()` | `track('Order Completed', {revenue: 99.99, currency: 'USD'})` | Use one of the built-in event types, or define a custom `order_completed` schema containing `revenue` and `currency` properties, and track `trackSelfDescribingEvent`. | +| `page()` | `page('Pricing')` | Use `trackPageView`. The tracker SDK will capture details such as `title` and `url`. | +| `screen()` | `screen('Home Screen')` | Use `trackScreenView`. | +| `identify()` | `identify('user123', {plan: 'pro', created_at: '2024-01-15'})` | Call `setUserId('user123')` to track the ID in all events. Attach a custom `user` entity with schema containing `plan` and `created_at` properties. | +| `group()` | `group('company-123', {name: 'Acme Corp', plan: 'Enterprise'})` | No direct equivalent. Attach a custom `group` entity to your events, or track group membership changes as custom events with `group_joined` or `group_updated` schemas. | +| `alias()` | `alias('new-user-id', 'anonymous-id')` | No direct equivalent. Track identity changes as custom events with `user_alias_created` schema. Use `setUserId` to update the current user identifier for subsequent events. | -## What does the data look like in the warehouse? +### Warehouse data structure Segment loads each custom event type into separate tables, for example, `order_completed`, or `product_viewed` tables. Analysts must `UNION` tables together to reconstruct user journeys.