Web3Insights

Web3Insights is a data lake indexer designed to analyze GitHub events. It efficiently downloads, processes, and stores GitHub event data from GHArchive, making it accessible for further analysis and insights.

🚀 Features

Efficient Data Processing: Downloads and processes GitHub event data from GHArchive
Advanced Data Lake Architecture:
- Store data in DuckDB for local analysis
- PostgreSQL integration with Apache Iceberg via pg_mooncake extension
Configurable Processing:
- Filter bot accounts, event bodies, and payloads
- Adjustable concurrency for file processing and database operations
- Flexible time range selection for targeted data collection

📋 Requirements

Rust (2024 edition)
PostgreSQL with pg_mooncake extension (optional)
Internet connection (for downloading GitHub event data)

🔧 Installation & Usage

Setup:

git clone https://github.com/Web3Insights/web3insights.git
cd web3insights
cp .env.example .env  # Edit with your configuration

Run:

cargo run --release

cargo build --release

Note: If use x86_64 cpu，enable modern cpu optimization:

RUSTFLAGS='-C target-cpu=x86-64-v4' cargo run --release

RUSTFLAGS='-C target-cpu=x86-64-v4' cargo build --release

Process Flow:
- Downloads GitHub event data from GHArchive for the specified time range
- Processes and filters events according to your configuration
- Stores data in selected database(s) (DuckDB and/or PostgreSQL with Iceberg)

⚙️ Configuration

Variable	Description	Default
`DATABASE_URL`	PostgreSQL connection string	`postgresql://web3insights:web3insights@localhost/web3insights`
`FILTER_OUT_PAYLOAD`	Whether to filter out event payloads	`true`
`FILTER_OUT_BODY`	Whether to filter out event bodies	`true`
`FILTER_OUT_BOT`	Whether to filter out bot accounts	`false`
`MAX_FILE_CONCURRENT`	Maximum concurrent file processing	`12`
`MAX_DB_CONCURRENT`	Maximum concurrent database operations	`4`
`DB_SELECT`	Database selection (`duckdb`, `pg`, or `both`)	`duckdb`
`TIME_START`	Start time for event processing	`2020-01-01T00:00:00Z`
`TIME_END`	End time for event processing	`2020-02-01T00:00:00Z`
`GHARCHIVE_FILE_PATH`	Path to store GHArchive files	`./gharchive`

📊 Data Storage and Analysis

DuckDB for Analytics

When using DuckDB (DB_SELECT=duckdb or DB_SELECT=both), Web3Insights leverages DuckDB's powerful analytical capabilities:

Key Benefits:
- High Performance Analytics: Exceptional query performance for analytical workloads
- Column-Oriented Storage: Optimized for analytical queries with efficient data compression
- Zero-ETL Analytics: Directly query data in various formats without transformation
- Native File Format Support: Directly read and write Parquet, CSV, JSON and other formats
- SQL Interface: Familiar SQL syntax with rich analytical functions
- Lightweight & Embeddable: Zero configuration, single-file database with minimal footprint
- Vectorized Execution: SIMD-accelerated processing for maximum throughput
- Parallel Query Execution: Efficient utilization of multi-core processors
- Data Science Integration: Seamless integration with Python, R, and other tools
- Cross-Platform Compatibility: Works on all major operating systems
- Iceberg Support: Capable of reading Apache Iceberg tables

PostgreSQL with Apache Iceberg

When using PostgreSQL (DB_SELECT=pg or DB_SELECT=both), Web3Insights leverages the pg_mooncake extension to store data using Apache Iceberg:

pg_mooncake Extension: Provides Apache Iceberg integration for PostgreSQL
- GitHub: pg_mooncake
- Related: moonlink

Data Model

GitHub event data is stored using the following schema:

Field	Type	Description
`id`	BIGINT	Event ID
`actor_id`	BIGINT	GitHub user ID
`actor_login`	TEXT	GitHub username
`repo_id`	BIGINT	Repository ID
`repo_name`	TEXT	Repository name
`org_id`	BIGINT	Organization ID (if applicable)
`org_login`	TEXT	Organization name (if applicable)
`event_type`	TEXT	Type of GitHub event
`payload`	JSON	Event payload data
`body`	TEXT	Event body content
`created_at`	TIMESTAMP	Event creation time

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.github/workflows		.github/workflows
.vscode		.vscode
migrations		migrations
src		src
.env.example		.env.example
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
README_CN.md		README_CN.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Web3Insights

🚀 Features

📋 Requirements

🔧 Installation & Usage

⚙️ Configuration

📊 Data Storage and Analysis

DuckDB for Analytics

PostgreSQL with Apache Iceberg

Data Model

🤝 Contributing

📜 License

🌐 Languages

About

Uh oh!

Releases 3

Packages

Contributors 2

Uh oh!

Languages

License

web3insight-ai/web3insight-indexer

Folders and files

Latest commit

History

Repository files navigation

Web3Insights

🚀 Features

📋 Requirements

🔧 Installation & Usage

⚙️ Configuration

📊 Data Storage and Analysis

DuckDB for Analytics

PostgreSQL with Apache Iceberg

Data Model

🤝 Contributing

📜 License

🌐 Languages

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Uh oh!

Languages

Packages