This document describes the architecture of arrow-lean, a pure C implementation of Apache Arrow for Lean 4.
Arrow-lean provides a complete implementation of the Arrow columnar memory format and Parquet file writing without any C++ dependencies. This design choice offers several benefits:
- Simplified Build: No C++ toolchain required, just a C99 compiler
- Reduced Binary Size: No heavy C++ runtime or Apache Arrow C++ library
- Easier Deployment: Fewer system dependencies
- Better Portability: Pure C compiles on more platforms
┌─────────────────────────────────────────────────────────┐
│ Lean Application │
├─────────────────────────────────────────────────────────┤
│ ToArrow Typeclasses │
│ (ToArrowColumn, ToArrowBatch) │
├─────────────────────────────────────────────────────────┤
│ TypedBuilders API │
│ (Int64Builder, StringBuilder, SchemaBuilder, etc.) │
├─────────────────────────────────────────────────────────┤
│ BuilderFFI │
│ (Lean FFI Declarations) │
├─────────────────────────────────────────────────────────┤
│ lean_builder_wrapper.c │
│ (Lean-to-C FFI Bridge) │
├─────────────────────────────────────────────────────────┤
│ arrow_builders.c │
│ (Pure C Arrow Array Builders) │
├─────────────────────────────────────────────────────────┤
│ parquet_writer_impl.c │
│ (Pure C Parquet Writer with Thrift Serialization) │
├─────────────────────────────────────────────────────────┤
│ Arrow C Data Interface │
│ (ArrowSchema, ArrowArray, etc.) │
└─────────────────────────────────────────────────────────┘
The foundation is the Arrow C Data Interface - a minimal, stable ABI for sharing Arrow data between systems. Key structures:
struct ArrowSchema {
const char* format; // Type format string (e.g., "l" for int64)
const char* name; // Field name
const char* metadata; // Custom metadata (key-value pairs)
int64_t flags; // ARROW_FLAG_NULLABLE, etc.
int64_t n_children; // Number of child schemas
struct ArrowSchema** children;
struct ArrowSchema* dictionary;
void (*release)(struct ArrowSchema*);
void* private_data;
};
struct ArrowArray {
int64_t length; // Number of elements
int64_t null_count; // Number of null values
int64_t offset; // Starting offset in buffers
int64_t n_buffers; // Number of data buffers
int64_t n_children; // Number of child arrays
const void** buffers; // Data buffers (validity bitmap, values, offsets)
struct ArrowArray** children;
struct ArrowArray* dictionary;
void (*release)(struct ArrowArray*);
void* private_data;
};Type-safe builders for constructing Arrow arrays:
- Int64Builder: Build arrays of 64-bit integers
- Float64Builder: Build arrays of double-precision floats
- StringBuilder: Build variable-length string arrays
- BoolBuilder: Build boolean arrays with bit-packed storage
- TimestampBuilder: Build timestamp arrays with timezone support
- SchemaBuilder: Construct multi-column schemas
- RecordBatch: Bundle schema with column arrays
Each builder manages:
- Dynamic memory allocation with automatic resizing
- Validity bitmaps for nullable values
- Proper Arrow buffer layout
A complete pure C Parquet writer implementing:
- Thrift Compact Protocol: Serialization for Parquet metadata
- Plain Encoding: Data encoding for all supported types
- RLE Encoding: For definition levels (nullable columns)
- Page Structure: Data pages with headers
- Footer Serialization: FileMetaData with schema and row groups
Supported Parquet types:
- BOOLEAN
- INT32, INT64
- FLOAT, DOUBLE
- BYTE_ARRAY (strings)
- Converted types: UTF8, TIMESTAMP_MILLIS, TIMESTAMP_MICROS
Declares opaque types and external functions:
opaque Int64BuilderPointed : NonemptyType
def Int64Builder := Int64BuilderPointed.type
@[extern "lean_int64_builder_create"]
opaque Int64Builder.create : USize → IO (Option Int64Builder)
@[extern "lean_int64_builder_append"]
opaque Int64Builder.append : Int64Builder → Int64 → IO BoolHigh-level API wrapping the FFI:
def Int64Builder.appendOption (b : Int64Builder) (val : Option Int64) : IO Bool :=
match val with
| some v => b.append v
| none => b.appendNull
def buildInt64Column (values : Array Int64) : IO (Option ArrowArray) := do
match ← Int64Builder.create values.size.toUSize with
| some builder =>
for v in values do
let _ ← builder.append v
builder.finish
| none => return noneTypeclass-based serialization for custom types:
class ToArrowColumn (α : Type) where
arrowFormat : String
toColumn : Array α → IO (Option ArrowArray)
class ToArrowBatch (α : Type) where
columnSpecs : Array ColumnSpec
buildColumns : Array α → IO (Option (Array ArrowArray))Lean Type (e.g., Trade)
↓
ToArrowBatch.buildColumns
↓
TypedBuilders (Int64Builder, StringBuilder, etc.)
↓
lean_builder_wrapper.c (FFI)
↓
arrow_builders.c (Pure C)
↓
ArrowArray (C Data Interface)
Array of Records
↓
ToArrowBatch.buildColumns
↓
RecordBatch (schema + columns)
↓
ArrowArrayStream
↓
ParquetWriter.writeTable
↓
parquet_reader_writer.c
↓
parquet_writer_impl.c
↓
Thrift Serialization + Data Pages
↓
.parquet file
- Create: Allocate builder with initial capacity
- Append: Add values (automatic resize if needed)
- Finish: Convert to ArrowArray, builder becomes invalid
- Free: Release builder resources (automatic on finish)
- Creation: Via builder.finish() or external source
- Usage: Access buffers, child arrays
- Release: Call array.release() to free buffers
Lean's automatic memory management works with Arrow through:
- Opaque pointer types for builders
- Explicit release calls for Arrow structures
- RecordBatch bundles schema and array for RAII-like management
arrow/
├── arrow_c_abi.h # Arrow C Data Interface definitions
├── arrow_schema.c # Schema creation and management
├── arrow_array.c # Array creation and management
├── arrow_stream.c # Stream implementation
├── arrow_data_access.c # Type-specific value extraction
├── arrow_buffer.c # Buffer allocation and management
├── arrow_builders.h # Builder declarations
├── arrow_builders.c # Builder implementations
├── arrow_ipc.c # IPC serialization
├── parquet_writer_impl.h # Parquet writer declarations
├── parquet_writer_impl.c # Parquet writer implementation
├── parquet_reader_writer.c # High-level Parquet API
├── lean_arrow_wrapper.c # Lean FFI for core Arrow
├── lean_builder_wrapper.c # Lean FFI for builders
├── lean_arrow_ipc.c # Lean FFI for IPC
└── lean_parquet_wrapper.c # Lean FFI for Parquet
The lakefile.lean compiles all C sources:
extern_lib libarrow_wrapper pkg := do
let schemaObj ← arrow_schema_o.fetch
let arrayObj ← arrow_array_o.fetch
let buildersObj ← arrow_builders_o.fetch
let parquetWriterImplObj ← parquet_writer_impl_o.fetch
-- ... etc
buildStaticLib (pkg.staticLibDir / nameToStaticLib "arrow_wrapper")
#[schemaObj, arrayObj, buildersObj, parquetWriterImplObj, ...]Compiler flags: -fPIC -O2 -std=c99
-
Add builder in
arrow_builders.c:typedef struct { ... } NewTypeBuilder; NewTypeBuilder* new_type_builder_create(size_t); int new_type_builder_append(NewTypeBuilder*, NewType); struct ArrowArray* new_type_builder_finish(NewTypeBuilder*);
-
Add FFI wrapper in
lean_builder_wrapper.c:LEAN_EXPORT lean_obj_res lean_new_type_builder_create(...);
-
Add Lean declarations in
BuilderFFI.lean:opaque NewTypeBuilderPointed : NonemptyType @[extern "lean_new_type_builder_create"] opaque NewTypeBuilder.create : ... -
Add high-level API in
TypedBuilders.lean:def buildNewTypeColumn (values : Array NewType) : IO (Option ArrowArray) := ...
-
Add ToArrowColumn instance in
ToArrow.lean:instance : ToArrowColumn NewType where arrowFormat := "..." toColumn := buildNewTypeColumn
The Parquet writer can be extended with:
- Dictionary encoding for repeated strings
- Delta encoding for integers
- Compression codecs (snappy, gzip, zstd)
- Statistics (min/max values)
- Bloom filters
- Nested types (lists, maps)
- Batch Operations: Process data in batches (RecordBatch) for efficiency
- Zero-Copy: Arrow C Data Interface enables zero-copy data sharing
- Memory Locality: Columnar layout is cache-friendly for analytics
- Streaming: ArrowArrayStream supports processing data incrementally