A complete data lake solution built on DuckDB, S3, and Apache Iceberg for high-performance streaming data analytics.
QuixLake is a production-ready data lake platform that enables real-time ingestion, storage, and querying of streaming data. It combines Apache Kafka for streaming, S3 for object storage, Apache Iceberg for table format, and DuckDB for blazing-fast SQL analytics.
This template deploys a fully configured QuixLake instance with:
- Pre-built, production-ready container images for core services
- Example data pipeline with Time Series Benchmark Suite (TSBS) data
- Interactive query UI for data exploration
- S3-compatible MinIO storage for local development
┌──────────────────────────────────────────────────────────┐
│ QuixLake DataLake │
│ │
│ ┌─────────────────┐ ┌─────────────────────────┐ │
│ │ Iceberg Catalog│◄────►│ PostgreSQL Database │ │
│ │ (REST API) │ │ - Table metadata │ │
│ └────────┬────────┘ │ - File manifests │ │
│ │ │ - Partition info │ │
│ │ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ MinIO (S3) │ │
│ │ Object Storage │ │
│ │ │ │
│ │ Parquet Files: │ │
│ │ ├── table_1/ │ │
│ │ │ └── data/ │ │
│ │ ├── table_2/ │ │
│ │ └── data/ │ │
│ └────────▲────────┘ │
│ │ │
│ │ │
│ ┌────────┴────────┐ │
│ │ QuixLake API │ Query Engine │
│ │ (DuckDB) │ - SQL execution │
│ │ │ - Table discovery │
│ │ │ - Partition pruning │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Query UI │ Interactive SQL Interface │
│ │ (Data Explorer) │ - Query editor │
│ │ │ - Results viewer │
│ └─────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ TSBS Example Data Pipeline │
│ │
│ ┌─────────────────┐ │
│ │ TSBS Data Gen │ Generate sample time-series data │
│ │ (Job) │ - CPU metrics │
│ │ │ - DevOps data │
│ └────────┬────────┘ - IoT sensor data │
│ │ │
│ v │
│ Kafka Topic │
│ (tsbs_data) │
│ │ │
│ v │
│ ┌─────────────────┐ │
│ │ TSBS Transformer│ Transform and enrich │
│ │ (Service) │ - Add metadata │
│ │ │ - Format timestamps │
│ └────────┬────────┘ - Normalize schema │
│ │ │
│ v │
│ Kafka Topic │
│ (tsbs_data_transformed) │
│ │ │
│ v │
│ ┌─────────────────────────┐ │
│ │ Quix TS Datalake Sink │ Write to DataLake │
│ │ (Service) │ │
│ │ │ - Batch messages │
│ │ ┌──────────────────┐ │ - Partition data │
│ │ │ Partition by: │ │ - Write Parquet │
│ │ │ - region │ │ - Register in catalog │
│ │ │ - datacenter │ │ │
│ │ │ - hostname │ │ │
│ │ └──────────────────┘ │ │
│ └────────┬────────────────┘ │
│ │ │
│ v │
│ [To DataLake] │
│ MinIO (S3) │
│ │
└──────────────────────────────────────────────────────────┘
QuixLake API: a DuckDB-based REST API for querying S3 data with:
- SQL query execution over Parquet files
- Table discovery and automatic registration
- Partition management and pruning
- Schema evolution support
- Grafana datasource integration
Query UI: an interactive web interface with:
- SQL query editor with syntax highlighting
- Real-time query results
- Table and partition browsing
- Embedded in Quix platform as "Data Explorer"
Iceberg Catalog: a metadata catalog service featuring:
- PostgreSQL backend for reliability
- Manifest management to avoid expensive S3 ListObjects
- Query optimization with partition filtering
- Automatic table discovery
Quix TS Datalake Sink: a Kafka-to-S3 sink connector that:
- Writes streaming data as Hive-partitioned Parquet files
- Supports time-based and custom partitioning
- Automatically registers tables in catalog
- Handles schema evolution
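To make the partitioning behaviour concrete, here is a minimal sketch (not the template's actual quixlake_sink.py) of how a Hive-style key can be derived from a record's partition columns and written as a Parquet batch file; the bucket, prefix, and MinIO endpoint are placeholders:

```python
# Illustration only -- not the template's actual quixlake_sink.py.
# Derives a Hive-style key from a record's partition columns and writes the
# batch as one Parquet file. Bucket, prefix, and endpoint are placeholders.
import uuid
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

HIVE_COLUMNS = ["region", "datacenter", "hostname"]  # mirrors the HIVE_COLUMNS setting

def hive_prefix(record: dict) -> str:
    # e.g. "region=us-east/datacenter=dc1/hostname=server01"
    return "/".join(f"{col}={record[col]}" for col in HIVE_COLUMNS)

def write_batch(records, bucket="quixdatalaketest", prefix="ts_test", table="sensordata"):
    # Credentials are read from the environment; client_kwargs targets MinIO.
    fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "http://minio:9000"})
    # Assumes the batch was already grouped by partition values.
    key = f"{bucket}/{prefix}/{table}/{hive_prefix(records[0])}/batch_{uuid.uuid4().hex}.parquet"
    with fs.open(key, "wb") as f:
        pq.write_table(pa.Table.from_pylist(records), f)
```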
TSBS Data Generator: generates realistic time-series benchmark data for testing:
- Configurable data types (cpu-only, devops, iot)
- Adjustable scale and time ranges
- Deterministic generation with seed support
TSBS Transformer: transforms raw TSBS data into an optimized format for the data lake.
PostgreSQL: database backend for the Iceberg Catalog, storing:
- Table metadata and schemas
- File manifests
- Partition information
MinIO: S3-compatible object storage for development and testing.
MinIO Proxy: public access proxy for MinIO in the Quix platform.
- Real-time Ingestion: Stream data from Kafka directly to S3 as Parquet
- High-Performance Queries: DuckDB provides sub-second analytical queries
- Schema Evolution: Automatic schema detection and evolution support
- Partition Pruning: Efficient queries using Hive-style partitioning
- Table Discovery: Automatically discover and register existing S3 data
- Scalable Storage: S3-based storage scales to petabytes
- Standard Formats: Parquet files compatible with any analytics tool
- Production Ready: Pre-built, tested container images
- Embedded UI: Query interface integrated in Quix platform
- Quix account (sign up at https://portal.platform.quix.io/signup)
- AWS account (optional - template includes MinIO for local testing)
1. Deploy to Quix
- Log in to your Quix account
- Navigate to Templates → "QuixLake Template"
- Click "Deploy template"
2. Configure Secrets
Set the following secrets in your Quix environment:
minio_user: admin
minio_password: <your-secure-password>
postgres_password: <your-secure-password>
config_ui_auth_token: <your-auth-token>
3. Start the Pipeline
The template deploys with:
- All core services (API, UI, Catalog) automatically running
- MinIO storage ready for data
- PostgreSQL catalog initialized
- Example pipeline in "Example pipeline" group
4. Generate Sample Data
Start the TSBS Data Generator job to produce sample CPU metrics data.
5. Query Your Data
Access the Query UI through:
- Public URL: https://query-ui-<your-workspace>.deployments.quix.io
- Or via the "Data Explorer" sidebar in Quix platform
Try this query:
SELECT * FROM sensordata LIMIT 10;
Configure S3 storage (or use included MinIO):
S3_BUCKET: quixdatalaketest # Your bucket name
S3_PREFIX: ts_test # Data folder prefix
AWS_REGION: eu-west-2 # AWS region
AWS_ENDPOINT_URL: http://minio:9000 # For MinIO; remove for AWS S3
PostgreSQL backend configuration:
CATALOG_BACKEND: postgres
POSTGRES_HOST: postgresql
POSTGRES_PORT: 80
POSTGRES_DB: iceberg_catalog
POSTGRES_USER: admin
Configure how data is written to the lake:
BATCH_SIZE: 1000 # Messages per batch
COMMIT_INTERVAL: 30 # Commit interval (seconds)
HIVE_COLUMNS: region,datacenter,hostname # Partition columns
TIMESTAMP_COLUMN: ts_ms # Time column for partitioning
AUTO_DISCOVER: true # Auto-register in catalog
See quix.yaml for complete configuration options.
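As a rough illustration of how BATCH_SIZE and COMMIT_INTERVAL interact, the following hypothetical loop (not the sink's real implementation) flushes a batch when either the size threshold or the commit interval is reached; consume_message and flush_to_s3 are stand-ins:

```python
# Hypothetical loop illustrating BATCH_SIZE / COMMIT_INTERVAL semantics
# (not the sink's real implementation). consume_message and flush_to_s3 are stand-ins.
import time

BATCH_SIZE = 1000       # messages per batch
COMMIT_INTERVAL = 30    # seconds

def run(consume_message, flush_to_s3):
    batch, last_flush = [], time.monotonic()
    while True:
        msg = consume_message(timeout=1.0)   # returns None when nothing arrives
        if msg is not None:
            batch.append(msg)
        interval_elapsed = (time.monotonic() - last_flush) >= COMMIT_INTERVAL
        if batch and (len(batch) >= BATCH_SIZE or interval_elapsed):
            flush_to_s3(batch)               # write one Parquet file and register it
            batch, last_flush = [], time.monotonic()
```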
- Data Generation: TSBS generator produces time-series metrics
- Transformation: Transformer enriches and formats the data
- Streaming: Data flows through Kafka topics
- Storage: Sink writes batches to S3 as partitioned Parquet files
- Registration: Tables automatically register in Iceberg Catalog
- Query: API uses DuckDB to query Parquet files from S3
- Visualization: Query UI provides interactive data exploration
Data is organized in S3 using Hive-style partitioning:
s3://bucket/prefix/table_name/
├── region=us-east/
│ ├── datacenter=dc1/
│ │ ├── hostname=server01/
│ │ │ ├── batch_001_uuid.parquet
│ │ │ └── batch_002_uuid.parquet
For time-based partitioning:
s3://bucket/prefix/table_name/
├── year=2025/month=01/day=20/hour=10/
│ ├── batch_001_uuid.parquet
│ └── batch_002_uuid.parquet
This structure enables:
- Partition Pruning: Only scan relevant files based on WHERE clauses
- Time-based Queries: Efficiently filter by time ranges
- Dimensional Analysis: Group and filter by business dimensions
- Cost Optimization: Minimize data scanned and transfer costs
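Outside the API, the same layout can be queried directly with DuckDB. A sketch follows; the bucket, prefix, and credentials are placeholders, and the MinIO-specific settings should be dropped for AWS S3:

```python
# Query the Hive-partitioned layout directly with DuckDB (illustration only).
# Bucket, prefix, and credentials are placeholders; drop the MinIO-specific
# settings (endpoint, url_style, use_ssl) when using AWS S3.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='minio:9000'")
con.execute("SET s3_url_style='path'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_region='eu-west-2'")
con.execute("SET s3_access_key_id='admin'")
con.execute("SET s3_secret_access_key='<your-secure-password>'")

# hive_partitioning exposes region/datacenter/hostname as columns, so the WHERE
# clause prunes to matching directories instead of scanning every file.
rows = con.execute("""
    SELECT hostname, AVG(usage_user) AS avg_cpu
    FROM read_parquet('s3://quixdatalaketest/ts_test/sensordata/*/*/*/*.parquet',
                      hive_partitioning = true)
    WHERE region = 'us-east' AND datacenter = 'dc1'
    GROUP BY hostname
""").fetchall()
print(rows)
```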
SELECT COUNT(*) as total_records
FROM sensordata
WHERE hostname = 'host_0';
SELECT
ts_ms as timestamp,
hostname,
AVG(usage_user) as avg_cpu
FROM sensordata
GROUP BY ts_ms, hostname
ORDER BY ts_ms DESC
LIMIT 100;
SELECT
region,
datacenter,
COUNT(*) as record_count,
AVG(usage_system) as avg_system_cpu
FROM sensordata
WHERE region = 'us-east-1'
AND datacenter = 'dc1'
GROUP BY region, datacenter;
SELECT
hostname,
MIN(usage_idle) as min_idle,
MAX(usage_idle) as max_idle,
AVG(usage_idle) as avg_idle
FROM sensordata
GROUP BY hostname
ORDER BY avg_idle ASC;
Execute SQL Query
POST /query
Content-Type: text/plain
SELECT * FROM sensordata LIMIT 10
List Tables
GET /tables
Get Table Schema
GET /schema?table=sensordata
Get Partitions
GET /partitions?table=sensordata
Discover Table from S3
POST /discover?table=my_table&s3_path=s3://bucket/path
QuixLake API includes built-in Grafana datasource support:
POST /grafana/query
POST /grafana/metrics
GET /hive-folders
Configure a Grafana datasource with your QuixLake API URL.
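For example, the endpoints above can be called from Python with requests; the base URL is a placeholder for your deployment:

```python
# Calling the QuixLake API from Python with requests (base URL is a placeholder).
import requests

BASE = "https://quixlake-<your-workspace>.deployments.quix.io"

# Execute a SQL query -- the body is plain text, as documented above
resp = requests.post(
    f"{BASE}/query",
    data="SELECT * FROM sensordata LIMIT 10",
    headers={"Content-Type": "text/plain"},
)
print(resp.text)

# List tables and inspect a schema
print(requests.get(f"{BASE}/tables").text)
print(requests.get(f"{BASE}/schema", params={"table": "sensordata"}).text)
```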
1. Use Partition Filters
SELECT * FROM sensordata
WHERE region = 'us-east-1'   -- Partition column
  AND datacenter = 'dc1';    -- Partition column
2. Optimize Batch Size
- Larger batches = fewer, larger files = faster queries
- Target 128-256MB Parquet files for optimal performance
3. Choose Appropriate Partitioning
- High cardinality: Avoid (e.g., user_id with millions of values)
- Low to medium cardinality: Good (e.g., region, sensor_type, date)
- Time-based: Excellent for time-series data
- Compact Small Files: Use the API's /compact endpoint
- Repartition if Needed: Change partitioning strategy with /repartition
- Monitor Storage: Check file sizes and distribution in MinIO console
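A hedged sketch of calling these maintenance endpoints from Python; the paths come from this README, but the query parameters shown are assumptions and should be checked against the API:

```python
# Hedged sketch: the /compact and /repartition paths come from this README,
# but the query parameters below are assumptions -- check them against the API.
import requests

BASE = "https://quixlake-<your-workspace>.deployments.quix.io"

# Merge small Parquet files for one table (the 'table' parameter is an assumption)
requests.post(f"{BASE}/compact", params={"table": "sensordata"})

# Rewrite data under a new partitioning scheme (parameters are assumptions)
requests.post(f"{BASE}/repartition",
              params={"table": "sensordata", "hive_columns": "region,hostname"})
```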
- QuixLake API configures DuckDB with optimized memory settings
- Adjust deployment resources in quix.yaml if needed
- Monitor query performance through logs
Check service status:
- QuixLake API: GET https://quixlake-<workspace>.deployments.quix.io/health
- Query UI: Access via public URL
- MinIO Console: Access via MinIO proxy public URL
- Catalog: GET https://iceberg-catalog-<workspace>.deployments.quix.io/cache-status
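These checks can also be scripted; a small sketch (replace <workspace> with your own):

```python
# Script the health checks above (replace <workspace> with your own).
import requests

endpoints = {
    "QuixLake API": "https://quixlake-<workspace>.deployments.quix.io/health",
    "Iceberg Catalog": "https://iceberg-catalog-<workspace>.deployments.quix.io/cache-status",
}

for name, url in endpoints.items():
    try:
        r = requests.get(url, timeout=10)
        print(f"{name}: HTTP {r.status_code}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```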
Monitor through Quix platform:
- Message throughput in Kafka topics
- CPU and memory usage per service
- Storage size in MinIO
- Query execution times in logs
To ingest your own data instead of sample data:
1. Prepare Your Data Source
- Ensure data is in JSON format
- Publish to a Kafka topic in your Quix environment (see the producer sketch after this list)
2. Configure the Sink
Update the quix-ts-datalake-sink deployment:
input: your-topic-name
TABLE_NAME: your-table-name
HIVE_COLUMNS: your,partition,columns
TIMESTAMP_COLUMN: your_timestamp_field
3. Adjust Partitioning
Choose partition columns based on your query patterns:
- Frequently filtered columns
- Low to medium cardinality
- Time-based for time-series data
4. Deploy and Test
- Start publishing data
- Query via the UI to verify
- Monitor file sizes and adjust BATCH_SIZE if needed
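For step 1 above, here is a minimal producer sketch using the Quix Streams Python client; the topic name and record fields are placeholders chosen to match the sink configuration shown earlier:

```python
# Minimal producer sketch for step 1: publish JSON records to the sink's input topic.
# Topic name and field names are placeholders matching the sink config above.
# On the Quix platform, Application() picks up broker settings from the environment.
import time
from quixstreams import Application

app = Application()  # or Application(broker_address="localhost:9092") locally
topic = app.topic("your-topic-name", value_serializer="json")

record = {
    "ts_ms": int(time.time() * 1000),   # TIMESTAMP_COLUMN
    "region": "us-east-1",              # HIVE_COLUMNS values
    "datacenter": "dc1",
    "hostname": "server01",
    "usage_user": 12.5,
}

with app.get_producer() as producer:
    msg = topic.serialize(key=record["hostname"], value=record)
    producer.produce(topic=topic.name, key=msg.key, value=msg.value)
```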
1. Tables Not Appearing
- Check if sink is running: View deployment logs
- Verify data is flowing: Check Kafka topic messages
- Check catalog registration: GET /tables from the API
2. S3 Access Denied
- Verify MinIO credentials in secrets
- Check bucket exists in MinIO
- Ensure endpoint URL is correct
3. Slow Queries
- Add partition filters to WHERE clause
- Check file sizes (target 128-256MB)
- Verify partition strategy matches query patterns
4. Schema Errors
- Ensure consistent data types in source
- Check partition column values are valid
- View table schema: GET /schema?table=your_table
Enable debug logging:
# In quix-ts-datalake-sink
LOGLEVEL: DEBUG
# In Query UI
DEBUG_LOG_LEVEL: DEBUG
View logs in the Quix platform deployment pages.
- Ingest sensor data from thousands of devices
- Partition by device location, type, and time
- Query historical trends and detect anomalies
- Stream application logs and metrics to the data lake
- Partition by service, environment, and time
- Query for debugging and performance analysis
- Capture clickstream and event data
- Partition by user segments, campaigns, and time
- Run ad-hoc analytical queries for insights
- Store metrics from distributed systems
- Partition by metric type and time windows
- Query for dashboards, alerts, and reports
Daily partitions with region:
HIVE_COLUMNS: region,year,month,day
TIMESTAMP_COLUMN: ts_ms
Hourly partitions:
HIVE_COLUMNS: year,month,day,hour
TIMESTAMP_COLUMN: event_time
Multi-dimensional:
HIVE_COLUMNS: customer_id,product_category,year,month
TIMESTAMP_COLUMN: purchase_date
1. Create S3 Bucket in your AWS account
2. Update Configuration in all deployments:
S3_BUCKET: your-aws-bucket-name
AWS_REGION: us-east-1 # Your region
AWS_ENDPOINT_URL: "" # Remove this line
3. Configure Secrets:
minio_user: <your-aws-access-key-id>
minio_password: <your-aws-secret-access-key>
4. Remove MinIO Deployments if not needed (optional)
For High Throughput:
- Increase MAX_WRITE_WORKERS in the sink (default: 10)
- Increase sink CPU/memory resources
- Use larger BATCH_SIZE (1000-5000)
For Many Tables:
- Consider multiple sink deployments per table
- Adjust catalog cache settings
- Monitor PostgreSQL performance
For Large Queries:
- Increase API CPU/memory resources
- Use partition filters aggressively
- Consider compacting small files
template-quixlake/
├── minio/ # MinIO application
├── minio-proxy/ # MinIO proxy application
├── postgresql/ # PostgreSQL application
├── quix-ts-datalake-sink/ # Sink application (source included)
├── tsbs-quix-data-generator/ # Data generator application
├── tsbs-transformer/ # Transformer application
├── quix.yaml # Quix deployment configuration
├── template.json # Template metadata
└── README.md # This file
Note: Core services (API, UI, Catalog) use pre-built container images from Quix container registry.
The sink source code is included in quix-ts-datalake-sink/. To customize:
- Edit quixlake_sink.py or main.py
- Update the dockerfile if needed
- Build and push to your own registry
- Update image: in quix.yaml
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit a pull request
- Quix Platform: https://quix.io/
- Documentation: https://docs.quix.io/
- DuckDB: https://duckdb.org/
- Apache Iceberg: https://iceberg.apache.org/
- Apache Parquet: https://parquet.apache.org/
- GitHub Issues: https://github.com/quixio/template-quixlake/issues
- Quix Community: https://quix.io/slack-invite
- Email: [email protected]
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Built with open source technologies:
- DuckDB - Fast analytical database
- Apache Kafka - Streaming platform
- Apache Parquet - Columnar storage format
- MinIO - S3-compatible object storage
- PostgreSQL - Metadata database
- TSBS - Time series benchmarking suite