Spark History Server (Rust)


A high-performance, analytics-first Spark History Server built in Rust. Unlike traditional history servers, which focus on the details of individual applications, this server excels at cross-application analytics and trends, backed by DuckDB's analytical engine.

πŸ† Proven Performance: 100,000+ applications, 2M events, 20,000-30,000 events/sec (with DuckDB Appender optimization)


✨ What Makes This Different

✅ Analytics-First Design

  • Cross-application insights: Query metrics across thousands of Spark applications simultaneously
  • Performance trends: Time-series analysis of resource usage and system health
  • Resource optimization: Identify underutilized executors and memory bottlenecks across your entire Spark estate

✅ Enterprise-Ready Infrastructure

  • Multi-storage support: Local filesystem, HDFS with Kerberos, and S3-compatible storage
  • DuckDB analytical backend for lightning-fast aggregations
  • Circuit breaker protection for fault-tolerant operations
  • Single binary deployment - no external dependencies

❌ What We Don't Do (Use Standard Spark History Server Instead)

  • Individual job/stage/task drill-down details
  • Real-time application debugging
  • SQL query execution plan analysis

🚀 Quick Start

Prerequisites

Before starting, ensure you have:

  • Rust toolchain installed (1.70+): curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

1. Build the Server

# Clone the repository
git clone https://github.com/zemin-piao/spark-history-server-rs
cd spark-history-server-rs

# Build the release binary
cargo build --release

2. Run the Server

# Start with local Spark events
./target/release/spark-history-server --log-directory ~/logs/

# Or with HDFS
./target/release/spark-history-server \
  --hdfs \
  --hdfs-namenode hdfs://namenode:9000 \
  --log-directory /spark-events

# With S3 storage
./target/release/spark-history-server \
  --s3 \
  --s3-bucket my-spark-events \
  --s3-region us-east-1 \
  --log-directory /spark-events

# With Kerberos authentication
./target/release/spark-history-server \
  --hdfs \
  --hdfs-namenode hdfs://secure-namenode:9000 \
  --kerberos-principal [email protected] \
  --keytab-path /etc/security/keytabs/spark.keytab \
  --log-directory /hdfs/spark-events

Note: The server automatically creates the database directory (./data/) and log directory if they don't exist. You should see log messages like:

INFO Starting Spark History Server
INFO Log directory: /Users/user/logs/
INFO Storage backend: Local filesystem
INFO DuckDB workers initialized at: "./data/events.db" with 8 workers
INFO Server listening on http://0.0.0.0:18080

If your log directory is empty, the server will start successfully and wait for event logs to be added.

3. Access the Dashboard

Open your browser and navigate to http://localhost:18080 to access the web dashboard.

🎯 Key Features

🎨 Built-in Web Dashboard

  • Cluster Overview: Real-time cluster status and key metrics at http://localhost:18080
  • Optimization Dashboard: Resource hogs, efficiency analysis, cost optimization at http://localhost:18080/optimize
  • Performance Trends: Historical analysis and capacity planning insights

📊 Cross-Application Analytics

  • Enterprise-wide Spark metrics and performance analysis
  • Resource optimization and efficiency insights
  • Historical trends and capacity planning data
  • Standard Spark History Server API v1 compatibility
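
Because the standard v1 API is supported, the application list can also be queried programmatically. Below is a minimal sketch in Rust, assuming the server is running locally on the default port and that v1 compatibility covers the standard /api/v1/applications endpoint; it requires the reqwest crate with its "blocking" feature:

// Fetch the application list from the standard v1 endpoint.
// Host and port come from the Quick Start defaults; adjust as needed.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = reqwest::blocking::get("http://localhost:18080/api/v1/applications")?
        .text()?;
    println!("{body}");
    Ok(())
}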

βš™οΈ Configuration

Create config/settings.toml:

[server]
host = "0.0.0.0"
port = 18080

[history]
# Local, HDFS, or S3 path to Spark event logs
log_directory = "/tmp/spark-events"
# or log_directory = "hdfs://namenode:9000/spark-events"
# or log_directory = "s3://my-bucket/spark-events"

max_applications = 1000
update_interval_seconds = 60
compression_enabled = true
database_directory = "./data"

# Optional HDFS configuration
[history.hdfs]
namenode_url = "hdfs://namenode:9000"
connection_timeout_ms = 30000
read_timeout_ms = 60000

[history.hdfs.kerberos]
principal = "[email protected]" 
keytab_path = "/etc/security/keytabs/spark.keytab"

# Optional S3 configuration
[history.s3]
bucket_name = "my-spark-events"
region = "us-east-1"
# endpoint_url = "https://s3.amazonaws.com"  # For S3-compatible services like MinIO
# access_key_id = "your-access-key"
# secret_access_key = "your-secret-key"
connection_timeout_ms = 30000
read_timeout_ms = 60000
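
As an illustration only, here is a minimal sketch of loading such a file with the serde and toml crates; the struct names are hypothetical and do not reflect the project's actual configuration types (serde ignores keys not modeled here, such as the HDFS and S3 tables):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Settings {
    server: ServerConfig,
    history: HistoryConfig,
}

#[derive(Debug, Deserialize)]
struct ServerConfig {
    host: String,
    port: u16,
}

#[derive(Debug, Deserialize)]
struct HistoryConfig {
    log_directory: String,
    max_applications: u32,
    update_interval_seconds: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("config/settings.toml")?;
    let settings: Settings = toml::from_str(&raw)?;
    println!("listening on {}:{}", settings.server.host, settings.server.port);
    println!("event logs at {}", settings.history.log_directory);
    Ok(())
}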

πŸ—οΈ Architecture

  • Event Processing: Multi-storage support (local, HDFS, S3) with circuit breaker protection
  • Storage: DuckDB embedded analytical database optimized for cross-app queries
  • APIs: Dual support for standard Spark History Server v1 + advanced analytics
  • Dashboard: Built-in web interface with multiple analytical views

⚡ Scalability Architecture

Scaling to 40K+ Applications

This system achieves enterprise-scale performance through several key architectural optimizations:

🔄 Multi-Writer DuckDB Architecture

  • 8 parallel database workers with round-robin load balancing
  • Batched writes (5,000 events/batch) for optimal throughput
  • Connection pooling eliminates single-connection bottlenecks
  • Result: 5,130 events/sec sustained, 205K+ events/sec capacity
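
A minimal sketch of this round-robin, batched dispatch using plain threads and channels; the Event type, payloads, and printing are illustrative stand-ins, and the real workers write batches to DuckDB via the Appender:

use std::sync::mpsc;
use std::thread;

const NUM_WORKERS: usize = 8;    // parallel database workers
const BATCH_SIZE: usize = 5_000; // events per batched write

struct Event {
    raw_json: String,
}

fn main() {
    // One channel per worker; in the real server each worker owns its own
    // DuckDB connection, which removes the single-connection bottleneck.
    let mut senders = Vec::new();
    let mut workers = Vec::new();
    for id in 0..NUM_WORKERS {
        let (tx, rx) = mpsc::channel::<Event>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            let mut batch: Vec<Event> = Vec::with_capacity(BATCH_SIZE);
            for event in rx {
                batch.push(event);
                if batch.len() == BATCH_SIZE {
                    // Stand-in for a bulk insert of the whole batch.
                    println!("worker {id}: flushing {} events", batch.len());
                    batch.clear();
                }
            }
            if !batch.is_empty() {
                println!("worker {id}: final flush of {} events", batch.len());
            }
        }));
    }

    // Round-robin load balancing: event i goes to worker i % NUM_WORKERS.
    for i in 0..12_000 {
        let event = Event { raw_json: format!("{{\"id\":{i}}}") };
        senders[i % NUM_WORKERS].send(event).expect("worker alive");
    }

    drop(senders); // closing the channels lets the workers drain and exit
    for worker in workers {
        worker.join().unwrap();
    }
}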

πŸ—‚οΈ Optimized HDFS Processing

  • Hierarchical caching with 5-minute TTL reduces NameNode pressure by 98.8%
  • 50 concurrent application processors for parallel event log scanning
  • Bulk directory operations minimize HDFS round-trips
  • Persistent cache survives process restarts (<200Β΅s recovery vs 30+ minutes)
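
The caching idea can be sketched with a small TTL map: a directory listing is reused until its entry outlives the TTL, so repeated scans hit memory instead of the NameNode. This is a simplified illustration, not the project's actual hierarchical, disk-persisted cache:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// A minimal TTL cache mapping a directory path to its file listing.
struct TtlCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, Vec<String>)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    // Returns a cached listing only if it is still fresh.
    fn get(&self, dir: &str) -> Option<&Vec<String>> {
        self.entries
            .get(dir)
            .filter(|(stored_at, _)| stored_at.elapsed() < self.ttl)
            .map(|(_, listing)| listing)
    }

    // Stores a fresh listing, resetting its clock.
    fn insert(&mut self, dir: String, listing: Vec<String>) {
        self.entries.insert(dir, (Instant::now(), listing));
    }
}

fn main() {
    let mut cache = TtlCache::new(Duration::from_secs(300)); // 5-minute TTL
    cache.insert("/spark-events".into(), vec!["app-1".into(), "app-2".into()]);

    // A rescan within the TTL reuses the cached listing instead of
    // issuing another NameNode request.
    if let Some(listing) = cache.get("/spark-events") {
        println!("cache hit: {listing:?}");
    }
}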

💾 Memory-Bounded Operations

  • LRU cache eviction prevents OOM scenarios
  • Semaphore-controlled concurrency (20 concurrent HDFS operations)
  • Background persistence with dirty flag tracking
  • Circuit breaker protection for external dependencies
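
The concurrency bound itself is a standard semaphore pattern; here is a minimal sketch with tokio, where the fetch function is a placeholder and only the limit of 20 comes from the description above:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-in for an HDFS read.
async fn fetch_event_log(path: String) -> String {
    format!("contents of {path}")
}

#[tokio::main]
async fn main() {
    let limit = Arc::new(Semaphore::new(20)); // at most 20 reads in flight
    let mut tasks = Vec::new();

    for i in 0..100 {
        // The loop stalls here once 20 permits are outstanding.
        let permit = limit.clone().acquire_owned().await.expect("semaphore open");
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // released when the task finishes
            fetch_event_log(format!("/spark-events/app-{i}")).await
        }));
    }

    for task in tasks {
        task.await.expect("task panicked");
    }
}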

🧪 Testing

# Run all tests
cargo test

# HDFS integration tests  
./scripts/run-hdfs-tests.sh

# S3 integration tests
cargo test --test s3_integration_test

# Performance testing
cargo test test_100k_applications_load --release

# Code quality
cargo clippy --all-targets --all-features -- -D warnings

📈 Performance

  • Scale: 100K+ applications, 2M+ events tested
  • Throughput: 10,700+ events/sec sustained
  • Query Speed: <10ms for analytical queries
  • Storage: 229 bytes per event average
  • Deployment: Single binary, embedded database

🀝 Use Cases

Perfect For:

  • Platform engineering teams analyzing Spark cluster performance
  • Capacity planning and resource optimization
  • Historical trend analysis across multiple applications
  • Cost optimization and efficiency insights

Not Ideal For:

  • Debugging individual Spark applications
  • Real-time application monitoring
  • Detailed task-level performance analysis

License: Apache 2.0 | Language: Rust | Database: DuckDB | UI: Built-in Web Dashboard
