You have a bunch of data inside your lovely Postgres, but its design does not allow efficient aggregation queries? You then end up acquiring another database just for this use case (AlloyDB, ClickHouse...), or, as at my old company, throwing everything into BigQuery.
I started this project for fun, but with a question in mind: can I provide a self-hosted solution with significant enough cost savings?
Tech stack I will use:
- Postgres and logical replication (I just want to see how it works; there is a small setup sketch right after this list). The same principle can be applied to the MySQL binlog, the MongoDB oplog, or, more generally, Debezium
- Rust (I cannot apply for Rust jobs because people don't believe I can efficiently write Rust 😢)
- DataFusion (a big fan, and also a project I am contributing to)
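For reference, the Postgres-side setup for logical replication boils down to one config change plus two commands. A minimal sketch, kept the way the ingestor could run it at startup; `cdc_pub` and `cdc_slot` are placeholder names I made up:

```rust
// Requires `wal_level = logical` in postgresql.conf (and a restart).
// `pgoutput` is Postgres's built-in logical decoding plugin.
const SETUP_REPLICATION: &str = r#"
CREATE PUBLICATION cdc_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
"#;
```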
This looks somewhat similar to the Debezium + Apache Iceberg combo: https://debezium.io/blog/2021/10/20/using-debezium-create-data-lake-with-apache-iceberg/
The ingestion path:

```
Postgres -> logical replication -> Ingestor -> Object store (Parquet)
                                      |
                                      +--> Secondary index (Postgres)
```
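The Ingestor tails the replication slot, buffers decoded changes, and periodically flushes them as Parquet files; the row-group statistics of each flushed file are what feed the secondary index. A minimal sketch of the flush step, assuming a made-up `ChangeEvent` type (pgoutput decoding is out of scope here) and using the `arrow`/`parquet` crates:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

// Hypothetical decoded change event; in reality this comes out of the
// pgoutput stream of the replication slot.
struct ChangeEvent {
    id: i64,
    payload: String,
}

// Flush a buffer of change events as one Parquet file. The returned
// footer metadata carries the per-row-group statistics that the
// ingestor would then record in the secondary index.
fn flush_batch(events: &[ChangeEvent], path: &str) -> parquet::errors::Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("payload", DataType::Utf8, false),
    ]));
    let ids = Int64Array::from_iter_values(events.iter().map(|e| e.id));
    let payloads = StringArray::from_iter_values(events.iter().map(|e| e.payload.as_str()));
    let columns: Vec<ArrayRef> = vec![Arc::new(ids), Arc::new(payloads)];
    let batch = RecordBatch::try_new(schema.clone(), columns)?;

    let file = std::fs::File::create(path)?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    let file_meta = writer.close()?; // finalizes the footer, incl. row-group statistics
    println!(
        "wrote {} rows in {} row group(s)",
        file_meta.num_rows,
        file_meta.row_groups.len()
    );
    Ok(())
}
```

How often we flush (size/time thresholds) decides file and row-group sizes, which in turn decides how much the index can skip later.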
The query path:

```
Query -> DataFusion framework -> Custom Postgres TableProvider -> Object store (Parquet)
                                             |
                                             v
                                  Secondary index (Postgres)
```
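I haven't written the custom TableProvider yet, so here is the idea expressed with DataFusion's high-level API instead: ask the secondary index for candidate files first, then let DataFusion scan only those. `lookup_candidate_files` is a hypothetical stand-in for a SELECT against the index:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

// Hypothetical helper: in the real design this is a query against the
// secondary index in Postgres, returning only the files whose
// row-group statistics could possibly match the predicate.
fn lookup_candidate_files(_predicate: &str) -> Vec<String> {
    vec!["data/part-0001.parquet".to_string()]
}

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Prune first via the secondary index, then scan only the
    // surviving Parquet files.
    let files = lookup_candidate_files("id = 42");
    let df = ctx
        .read_parquet(files, ParquetReadOptions::default())
        .await?
        .filter(col("id").eq(lit(42)))?;
    df.show().await?;
    Ok(())
}
```

A real `TableProvider` would do the same lookup inside its `scan()` method, so the pruning stays transparent to whoever writes the SQL.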
I'll try to keep this project simple at first. Different queries call for different kinds of metadata maintained inside the secondary index; I will put the basic items here to remind myself (it won't cover everything, though):
- Row group metadata
- Bloom filter
- Page index metadata (for finer page skipping)
- Full-text search???
We need a schema for this index:
| file_location (string) | row_group_offset (int) | statistics (json) | bloomfilter | page_index_metadata |
The purpose here is learning, so I'll use JSON where possible (although it is not the most optimal storage format).
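As a sketch, that row maps to Postgres DDL like this, kept as a constant the ingestor could run at startup (the concrete column types, the BYTEA bloom filter, and the table name `parquet_index` are my assumptions):

```rust
// Assumed DDL for the secondary index; one row per Parquet row group.
const CREATE_INDEX_TABLE: &str = r#"
CREATE TABLE IF NOT EXISTS parquet_index (
    file_location        TEXT  NOT NULL,
    row_group_offset     INT   NOT NULL,
    statistics           JSONB,  -- row-group statistics, example below
    bloomfilter          BYTEA,  -- serialized bloom filter
    page_index_metadata  JSONB,
    PRIMARY KEY (file_location, row_group_offset)
);
"#;
```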
An example of row_group_statistics:

```json
{
  "num_row": 10000,
  "column_statistics": [
    {
      "null_count": 10,
      "max": "<binary>",
      "min": "<binary>",
      "sum": "<binary>",
      "distinct_count": 10
    }
  ]
}
```
With this metadata, we can skip scanning at the row-group level, or even skip whole Parquet files:
https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
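To make the skipping concrete, here is a toy version of the pruning check (my own sketch; the real query path would delegate this to DataFusion's `PruningPredicate`, as the post above describes), assuming i64 min/max values decoded from the statistics JSON:

```rust
// Min/max statistics for one column of one row group, as stored in the
// secondary index (i64 here for simplicity).
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Can a predicate `col = needle` match anything in this row group?
/// If not, we skip the row group (or the whole file) without ever
/// fetching it from object storage.
fn may_match(stats: &ColumnStats, needle: i64) -> bool {
    stats.min <= needle && needle <= stats.max
}

fn main() {
    let rg = ColumnStats { min: 100, max: 200 };
    assert!(!may_match(&rg, 50)); // 50 is outside [100, 200]: skip
    assert!(may_match(&rg, 150)); // might contain matches: must read
}
```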