A complete data lake solution built on DuckDB, S3, and Apache Iceberg for high-performance streaming data analytics.
QuixLake is a production-ready data lake platform that enables real-time ingestion, storage, and querying of streaming data. It combines Apache Kafka for streaming, S3 for object storage, Apache Iceberg for table format, and DuckDB for blazing-fast SQL analytics.
This template deploys a fully configured QuixLake instance with:
- Pre-built, production-ready container images for core services
- Example data pipeline with Time Series Benchmark Suite (TSBS) data
- Interactive query UI for data exploration
- S3-compatible MinIO storage for local development
┌──────────────────────────────────────────────────────────┐
│ QuixLake DataLake │
│ │
│ ┌─────────────────┐ ┌─────────────────────────┐ │
│ │ Iceberg Catalog│◄────►│ PostgreSQL Database │ │
│ │ (REST API) │ │ - Table metadata │ │
│ └────────┬────────┘ │ - File manifests │ │
│ │ │ - Partition info │ │
│ │ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ MinIO (S3) │ │
│ │ Object Storage │ │
│ │ │ │
│ │ Parquet Files: │ │
│ │ ├── table_1/ │ │
│ │ │ └── data/ │ │
│ │ ├── table_2/ │ │
│ │ └── data/ │ │
│ └────────▲────────┘ │
│ │ │
│ │ │
│ ┌────────┴────────┐ │
│ │ QuixLake API │ Query Engine │
│ │ (DuckDB) │ - SQL execution │
│ │ │ - Table discovery │
│ │ │ - Partition pruning │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Query UI │ Interactive SQL Interface │
│ │ (Data Explorer) │ - Query editor │
│ │ │ - Results viewer │
│ └─────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ TSBS Example Data Pipeline │
│ │
│ ┌─────────────────┐ │
│ │ TSBS Data Gen │ Generate sample time-series data │
│ │ (Job) │ - CPU metrics │
│ │ │ - DevOps data │
│ └────────┬────────┘ - IoT sensor data │
│ │ │
│ v │
│ Kafka Topic │
│ (tsbs_data) │
│ │ │
│ v │
│ ┌─────────────────┐ │
│ │ TSBS Transformer│ Transform and enrich │
│ │ (Service) │ - Add metadata │
│ │ │ - Format timestamps │
│ └────────┬────────┘ - Normalize schema │
│ │ │
│ v │
│ Kafka Topic │
│ (tsbs_data_transformed) │
│ │ │
│ v │
│ ┌─────────────────────────┐ │
│ │ Quix TS Datalake Sink │ Write to DataLake │
│ │ (Service) │ │
│ │ │ - Batch messages │
│ │ ┌──────────────────┐ │ - Partition data │
│ │ │ Partition by: │ │ - Write Parquet │
│ │ │ - region │ │ - Register in catalog │
│ │ │ - datacenter │ │ │
│ │ │ - hostname │ │ │
│ │ └──────────────────┘ │ │
│ └────────┬────────────────┘ │
│ │ │
│ v │
│ [To DataLake] │
│ MinIO (S3) │
│ │
└──────────────────────────────────────────────────────────┘
QuixLake API: a DuckDB-based REST API for querying S3 data with:
- SQL query execution over Parquet files
- Table discovery and automatic registration
- Partition management and pruning
- Schema evolution support
- Grafana datasource integration
Query UI: an interactive web interface with:
- SQL query editor with syntax highlighting
- Real-time query results
- Table and partition browsing
- Embedded in Quix platform as "Data Explorer"
Iceberg Catalog: a metadata catalog service featuring:
- PostgreSQL backend for reliability
- Manifest management to avoid expensive S3 ListObjects
- Query optimization with partition filtering
- Automatic table discovery
Quix TS Datalake Sink: a Kafka-to-S3 sink connector that:
- Writes streaming data as Hive-partitioned Parquet files
- Supports time-based and custom partitioning
- Automatically registers tables in catalog
- Handles schema evolution
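To make the partitioning behaviour concrete, here is a minimal sketch (not the template's actual quixlake_sink.py) of how a Hive-style key can be derived from a record's partition columns and written as a Parquet batch file; the bucket, prefix, and MinIO endpoint are placeholders:

```python
# Illustration only -- not the template's actual quixlake_sink.py.
# Derives a Hive-style key from a record's partition columns and writes the
# batch as one Parquet file. Bucket, prefix, and endpoint are placeholders.
import uuid
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

HIVE_COLUMNS = ["region", "datacenter", "hostname"]  # mirrors the HIVE_COLUMNS setting

def hive_prefix(record: dict) -> str:
    # e.g. "region=us-east/datacenter=dc1/hostname=server01"
    return "/".join(f"{col}={record[col]}" for col in HIVE_COLUMNS)

def write_batch(records, bucket="quixdatalaketest", prefix="ts_test", table="sensordata"):
    # Credentials are read from the environment; client_kwargs targets MinIO.
    fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "http://minio:9000"})
    # Assumes the batch was already grouped by partition values.
    key = f"{bucket}/{prefix}/{table}/{hive_prefix(records[0])}/batch_{uuid.uuid4().hex}.parquet"
    with fs.open(key, "wb") as f:
        pq.write_table(pa.Table.from_pylist(records), f)
```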
TSBS Data Generator: generates realistic time-series benchmark data for testing:
- Configurable data types (cpu-only, devops, iot)
- Adjustable scale and time ranges
- Deterministic generation with seed support
TSBS Transformer: transforms raw TSBS data into an optimized format for the data lake.
PostgreSQL: database backend for the Iceberg Catalog, storing:
- Table metadata and schemas
- File manifests
- Partition information
MinIO: S3-compatible object storage for development and testing.
MinIO Proxy: public access proxy for MinIO in the Quix platform.
- Real-time Ingestion: Stream data from Kafka directly to S3 as Parquet
- High-Performance Queries: DuckDB provides sub-second analytical queries
- Schema Evolution: Automatic schema detection and evolution support
- Partition Pruning: Efficient queries using Hive-style partitioning
- Table Discovery: Automatically discover and register existing S3 data
- Scalable Storage: S3-based storage scales to petabytes
- Standard Formats: Parquet files compatible with any analytics tool
- Production Ready: Pre-built, tested container images
- Embedded UI: Query interface integrated in Quix platform
- Quix account (sign up at https://portal.platform.quix.io/signup)
- AWS account (optional - template includes MinIO for local testing)
1. Deploy to Quix
- Log in to your Quix account
- Navigate to Templates → "QuixLake Template"
- Click "Deploy template"
2. Configure Secrets
Set the following secrets in your Quix environment:
minio_user: admin
minio_password: <your-secure-password>
postgres_password: <your-secure-password>
config_ui_auth_token: <your-auth-token>
3. Start the Pipeline
The template deploys with:
- All core services (API, UI, Catalog) automatically running
- MinIO storage ready for data
- PostgreSQL catalog initialized
- Example pipeline in "Example pipeline" group
4. Generate Sample Data
Start the TSBS Data Generator job to produce sample CPU metrics data.
5. Query Your Data
Access the Query UI through:
- Public URL: https://query-ui-<your-workspace>.deployments.quix.io
- Or via the "Data Explorer" sidebar in Quix platform
Try this query:
SELECT * FROM sensordata LIMIT 10;
Configure S3 storage (or use included MinIO):
S3_BUCKET: quixdatalaketest # Your bucket name
S3_PREFIX: ts_test # Data folder prefix
AWS_REGION: eu-west-2 # AWS region
AWS_ENDPOINT_URL: http://minio:9000 # For MinIO; remove for AWS S3
PostgreSQL backend configuration:
CATALOG_BACKEND: postgres
POSTGRES_HOST: postgresql
POSTGRES_PORT: 80
POSTGRES_DB: iceberg_catalog
POSTGRES_USER: admin
Configure how data is written to the lake:
BATCH_SIZE: 1000 # Messages per batch
COMMIT_INTERVAL: 30 # Commit interval (seconds)
HIVE_COLUMNS: region,datacenter,hostname # Partition columns
TIMESTAMP_COLUMN: ts_ms # Time column for partitioning
AUTO_DISCOVER: true # Auto-register in catalog
See quix.yaml for complete configuration options.
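As a rough illustration of how BATCH_SIZE and COMMIT_INTERVAL interact, the following hypothetical loop (not the sink's real implementation) flushes a batch when either the size threshold or the commit interval is reached; consume_message and flush_to_s3 are stand-ins:

```python
# Hypothetical loop illustrating BATCH_SIZE / COMMIT_INTERVAL semantics
# (not the sink's real implementation). consume_message and flush_to_s3 are stand-ins.
import time

BATCH_SIZE = 1000       # messages per batch
COMMIT_INTERVAL = 30    # seconds

def run(consume_message, flush_to_s3):
    batch, last_flush = [], time.monotonic()
    while True:
        msg = consume_message(timeout=1.0)   # returns None when nothing arrives
        if msg is not None:
            batch.append(msg)
        interval_elapsed = (time.monotonic() - last_flush) >= COMMIT_INTERVAL
        if batch and (len(batch) >= BATCH_SIZE or interval_elapsed):
            flush_to_s3(batch)               # write one Parquet file and register it
            batch, last_flush = [], time.monotonic()
```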
- Data Generation: TSBS generator produces time-series metrics
- Transformation: Transformer enriches and formats the data
- Streaming: Data flows through Kafka topics
- Storage: Sink writes batches to S3 as partitioned Parquet files
- Registration: Tables automatically register in Iceberg Catalog
- Query: API uses DuckDB to query Parquet files from S3
- Visualization: Query UI provides interactive data exploration
Data is organized in S3 using Hive-style partitioning:
s3://bucket/prefix/table_name/
├── region=us-east/
│ ├── datacenter=dc1/
│ │ ├── hostname=server01/
│ │ │ ├── batch_001_uuid.parquet
│ │ │ └── batch_002_uuid.parquet
For time-based partitioning:
s3://bucket/prefix/table_name/
├── year=2025/month=01/day=20/hour=10/
│ ├── batch_001_uuid.parquet
│ └── batch_002_uuid.parquet
This structure enables:
- Partition Pruning: Only scan relevant files based on WHERE clauses
- Time-based Queries: Efficiently filter by time ranges
- Dimensional Analysis: Group and filter by business dimensions
- Cost Optimization: Minimize data scanned and transfer costs
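Outside the API, the same layout can be queried directly with DuckDB. A sketch follows; the bucket, prefix, and credentials are placeholders, and the MinIO-specific settings should be dropped for AWS S3:

```python
# Query the Hive-partitioned layout directly with DuckDB (illustration only).
# Bucket, prefix, and credentials are placeholders; drop the MinIO-specific
# settings (endpoint, url_style, use_ssl) when using AWS S3.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='minio:9000'")
con.execute("SET s3_url_style='path'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_region='eu-west-2'")
con.execute("SET s3_access_key_id='admin'")
con.execute("SET s3_secret_access_key='<your-secure-password>'")

# hive_partitioning exposes region/datacenter/hostname as columns, so the WHERE
# clause prunes to matching directories instead of scanning every file.
rows = con.execute("""
    SELECT hostname, AVG(usage_user) AS avg_cpu
    FROM read_parquet('s3://quixdatalaketest/ts_test/sensordata/*/*/*/*.parquet',
                      hive_partitioning = true)
    WHERE region = 'us-east' AND datacenter = 'dc1'
    GROUP BY hostname
""").fetchall()
print(rows)
```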
SELECT COUNT(*) as total_records
FROM sensordata
WHERE hostname = 'host_0';
SELECT
ts_ms as timestamp,
hostname,
AVG(usage_user) as avg_cpu
FROM sensordata
GROUP BY ts_ms, hostname
ORDER BY ts_ms DESC
LIMIT 100;
SELECT
region,
datacenter,
COUNT(*) as record_count,
AVG(usage_system) as avg_system_cpu
FROM sensordata
WHERE region = 'us-east-1'
AND datacenter = 'dc1'
GROUP BY region, datacenter;
SELECT
hostname,
MIN(usage_idle) as min_idle,
MAX(usage_idle) as max_idle,
AVG(usage_idle) as avg_idle
FROM sensordata
GROUP BY hostname
ORDER BY avg_idle ASC;
Execute SQL Query
POST /query
Content-Type: text/plain
SELECT * FROM sensordata LIMIT 10
List Tables
GET /tables
Get Table Schema
GET /schema?table=sensordata
Get Partitions
GET /partitions?table=sensordata
Discover Table from S3
POST /discover?table=my_table&s3_path=s3://bucket/path
QuixLake API includes built-in Grafana datasource support:
POST /grafana/query
POST /grafana/metrics
GET /hive-folders
Configure a Grafana datasource with your QuixLake API URL.
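For example, the endpoints above can be called from Python with requests; the base URL is a placeholder for your deployment:

```python
# Calling the QuixLake API from Python with requests (base URL is a placeholder).
import requests

BASE = "https://quixlake-<your-workspace>.deployments.quix.io"

# Execute a SQL query -- the body is plain text, as documented above
resp = requests.post(
    f"{BASE}/query",
    data="SELECT * FROM sensordata LIMIT 10",
    headers={"Content-Type": "text/plain"},
)
print(resp.text)

# List tables and inspect a schema
print(requests.get(f"{BASE}/tables").text)
print(requests.get(f"{BASE}/schema", params={"table": "sensordata"}).text)
```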
1. Use Partition Filters
SELECT * FROM sensordata
WHERE region = 'us-east-1'   -- Partition column
  AND datacenter = 'dc1';    -- Partition column
2. Optimize Batch Size
- Larger batches = fewer, larger files = faster queries
- Target 128-256MB Parquet files for optimal performance
3. Choose Appropriate Partitioning
- High cardinality: Avoid (e.g., user_id with millions of values)
- Low to medium cardinality: Good (e.g., region, sensor_type, date)
- Time-based: Excellent for time-series data
- Compact Small Files: Use the API's /compact endpoint
- Repartition if Needed: Change partitioning strategy with /repartition
- Monitor Storage: Check file sizes and distribution in MinIO console
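A hedged sketch of calling these maintenance endpoints from Python; the paths come from this README, but the query parameters shown are assumptions and should be checked against the API:

```python
# Hedged sketch: the /compact and /repartition paths come from this README,
# but the query parameters below are assumptions -- check them against the API.
import requests

BASE = "https://quixlake-<your-workspace>.deployments.quix.io"

# Merge small Parquet files for one table (the 'table' parameter is an assumption)
requests.post(f"{BASE}/compact", params={"table": "sensordata"})

# Rewrite data under a new partitioning scheme (parameters are assumptions)
requests.post(f"{BASE}/repartition",
              params={"table": "sensordata", "hive_columns": "region,hostname"})
```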
- QuixLake API configures DuckDB with optimized memory settings
- Adjust deployment resources in quix.yaml if needed
- Monitor query performance through logs
Check service status:
- QuixLake API: GET https://quixlake-<workspace>.deployments.quix.io/health
- Query UI: Access via public URL
- MinIO Console: Access via MinIO proxy public URL
- Catalog: GET https://iceberg-catalog-<workspace>.deployments.quix.io/cache-status
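These checks can also be scripted; a small sketch (replace <workspace> with your own):

```python
# Script the health checks above (replace <workspace> with your own).
import requests

endpoints = {
    "QuixLake API": "https://quixlake-<workspace>.deployments.quix.io/health",
    "Iceberg Catalog": "https://iceberg-catalog-<workspace>.deployments.quix.io/cache-status",
}

for name, url in endpoints.items():
    try:
        r = requests.get(url, timeout=10)
        print(f"{name}: HTTP {r.status_code}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```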
Monitor through Quix platform:
- Message throughput in Kafka topics
- CPU and memory usage per service
- Storage size in MinIO
- Query execution times in logs
To ingest your own data instead of sample data:
1. Prepare Your Data Source
- Ensure data is in JSON format
- Publish to a Kafka topic in your Quix environment (see the producer sketch after this list)
2. Configure the Sink
Update the quix-ts-datalake-sink deployment:
input: your-topic-name
TABLE_NAME: your-table-name
HIVE_COLUMNS: your,partition,columns
TIMESTAMP_COLUMN: your_timestamp_field
3. Adjust Partitioning
Choose partition columns based on your query patterns:
- Frequently filtered columns
- Low to medium cardinality
- Time-based for time-series data
4. Deploy and Test
- Start publishing data
- Query via the UI to verify
- Monitor file sizes and adjust BATCH_SIZE if needed
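For step 1 above, here is a minimal producer sketch using the Quix Streams Python client; the topic name and record fields are placeholders chosen to match the sink configuration shown earlier:

```python
# Minimal producer sketch for step 1: publish JSON records to the sink's input topic.
# Topic name and field names are placeholders matching the sink config above.
# On the Quix platform, Application() picks up broker settings from the environment.
import time
from quixstreams import Application

app = Application()  # or Application(broker_address="localhost:9092") locally
topic = app.topic("your-topic-name", value_serializer="json")

record = {
    "ts_ms": int(time.time() * 1000),   # TIMESTAMP_COLUMN
    "region": "us-east-1",              # HIVE_COLUMNS values
    "datacenter": "dc1",
    "hostname": "server01",
    "usage_user": 12.5,
}

with app.get_producer() as producer:
    msg = topic.serialize(key=record["hostname"], value=record)
    producer.produce(topic=topic.name, key=msg.key, value=msg.value)
```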
1. Tables Not Appearing
- Check if sink is running: View deployment logs
- Verify data is flowing: Check Kafka topic messages
- Check catalog registration: GET /tables from the API
2. S3 Access Denied
- Verify MinIO credentials in secrets
- Check bucket exists in MinIO
- Ensure endpoint URL is correct
3. Slow Queries
- Add partition filters to WHERE clause
- Check file sizes (target 128-256MB)
- Verify partition strategy matches query patterns
4. Schema Errors
- Ensure consistent data types in source
- Check partition column values are valid
- View table schema: GET /schema?table=your_table
Enable debug logging:
# In quix-ts-datalake-sink
LOGLEVEL: DEBUG
# In Query UI
DEBUG_LOG_LEVEL: DEBUG
View logs in the Quix platform deployment pages.
- Ingest sensor data from thousands of devices
- Partition by device location, type, and time
- Query historical trends and detect anomalies
- Stream application logs and metrics to the data lake
- Partition by service, environment, and time
- Query for debugging and performance analysis
- Capture clickstream and event data
- Partition by user segments, campaigns, and time
- Run ad-hoc analytical queries for insights
- Store metrics from distributed systems
- Partition by metric type and time windows
- Query for dashboards, alerts, and reports
Daily partitions with region:
HIVE_COLUMNS: region,year,month,day
TIMESTAMP_COLUMN: ts_ms
Hourly partitions:
HIVE_COLUMNS: year,month,day,hour
TIMESTAMP_COLUMN: event_time
Multi-dimensional:
HIVE_COLUMNS: customer_id,product_category,year,month
TIMESTAMP_COLUMN: purchase_date
1. Create S3 Bucket in your AWS account
2. Update Configuration in all deployments:
S3_BUCKET: your-aws-bucket-name
AWS_REGION: us-east-1 # Your region
AWS_ENDPOINT_URL: "" # Remove this line
3. Configure Secrets:
minio_user: <your-aws-access-key-id>
minio_password: <your-aws-secret-access-key>
4. Remove MinIO Deployments if not needed (optional)
For High Throughput:
- Increase MAX_WRITE_WORKERS in the sink (default: 10)
- Increase sink CPU/memory resources
- Use larger BATCH_SIZE (1000-5000)
For Many Tables:
- Consider multiple sink deployments per table
- Adjust catalog cache settings
- Monitor PostgreSQL performance
For Large Queries:
- Increase API CPU/memory resources
- Use partition filters aggressively
- Consider compacting small files
template-quixlake/
├── minio/ # MinIO application
├── minio-proxy/ # MinIO proxy application
├── postgresql/ # PostgreSQL application
├── quix-ts-datalake-sink/ # Sink application (source included)
├── tsbs-quix-data-generator/ # Data generator application
├── tsbs-transformer/ # Transformer application
├── quix.yaml # Quix deployment configuration
├── template.json # Template metadata
└── README.md # This file
Note: Core services (API, UI, Catalog) use pre-built container images from Quix container registry.
The sink source code is included in quix-ts-datalake-sink/. To customize:
- Edit quixlake_sink.py or main.py
- Update the dockerfile if needed
- Build and push to your own registry
- Update image: in quix.yaml
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit a pull request
- Quix Platform: https://quix.io/
- Documentation: https://docs.quix.io/
- DuckDB: https://duckdb.org/
- Apache Iceberg: https://iceberg.apache.org/
- Apache Parquet: https://parquet.apache.org/
- GitHub Issues: https://github.com/quixio/template-quixlake/issues
- Quix Community: https://quix.io/slack-invite
- Email: [email protected]
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Built with open source technologies:
- DuckDB - Fast analytical database
- Apache Kafka - Streaming platform
- Apache Parquet - Columnar storage format
- MinIO - S3-compatible object storage
- PostgreSQL - Metadata database
- TSBS - Time series benchmarking suite