Skip to content

codedthinking/stata-dta

Repository files navigation

DuckDB Stata DTA Extension

A high-performance DuckDB extension for reading Stata .dta files directly into DuckDB tables without intermediate conversion.

Overview

This extension allows you to seamlessly read Stata data files (.dta format) directly into DuckDB, preserving data types, handling missing values, and supporting multiple Stata file versions. It provides native SQL access to Stata datasets with full DuckDB functionality.

Authors

  • Miklós Koren - Project Lead
  • Claude (Anthropic AI Assistant) - Implementation

Features

✅ Implemented

  • Multi-version Support: Stata versions 105, 108, 111, 113-119
  • Complete Data Types:
    • Numeric: byte (int8), int (int16), long (int32), float, double
    • String: Variable-length strings (1-244 characters)
    • Missing value detection and handling
  • Performance Optimized:
    • Memory-efficient chunked reading for large files
    • Streaming data processing
    • Native byte-order handling (big-endian/little-endian)
  • Full SQL Integration: Use standard SQL operations on Stata data
  • Production Ready: Comprehensive error handling and validation

🔄 Planned Features

  • Value labels support
  • Variable labels preservation
  • Stata date/time format conversion
  • Compressed DTA file support
  • Advanced metadata extraction

Installation & Building

Prerequisites

  • CMake 3.5+
  • C++17 compatible compiler
  • DuckDB development environment

Build Steps

  1. Clone with submodules:
git clone --recurse-submodules https://github.com/codedthinking/stata-dta.git
cd stata-dta
  1. Install dependencies (optional - for enhanced performance):
# Install build accelerators
brew install ninja ccache  # macOS
# or
sudo apt-get install ninja-build ccache  # Ubuntu
  1. Build the extension:
# Standard build
make

# Fast build with ninja and ccache
GEN=ninja make

# Debug build
make debug

Build Outputs

./build/release/duckdb                           # DuckDB CLI with extension
./build/release/extension/stata_dta/stata_dta.duckdb_extension  # Loadable extension
./build/release/test/unittest                    # Test runner

Usage

Basic Usage

Start DuckDB with the extension pre-loaded:

./build/release/duckdb

Read Stata files directly:

-- Load a Stata file as a table
SELECT * FROM read_stata_dta('path/to/your/file.dta');

-- Count observations
SELECT COUNT(*) FROM read_stata_dta('data/survey.dta');

-- Filter and aggregate
SELECT 
    region, 
    AVG(income) as avg_income,
    COUNT(*) as obs_count
FROM read_stata_dta('data/households.dta')
WHERE age >= 18
GROUP BY region;

Advanced Examples

-- Handle missing values
SELECT 
    id,
    COALESCE(score, 0) as score_filled,
    CASE WHEN score IS NULL THEN 'Missing' ELSE 'Valid' END as status
FROM read_stata_dta('data/test_scores.dta');

-- Join with other data sources
SELECT 
    s.*,
    d.description
FROM read_stata_dta('data/survey.dta') s
JOIN dictionary_table d ON s.code = d.code;

-- Export to other formats
COPY (
    SELECT * FROM read_stata_dta('data/analysis.dta')
    WHERE year >= 2020
) TO 'output/recent_data.csv' (HEADER);

-- Schema inspection
DESCRIBE SELECT * FROM read_stata_dta('data/variables.dta');

Extension Information

-- Get extension version and details
SELECT stata_dta_info('version');

Testing

Run Tests

# Run all tests
make test

# Run specific test categories  
./build/release/test/unittest --test-dir test/sql stata_dta*

Test Data Generation

Create test datasets for development:

python3 test/create_test_data.py

This generates various .dta files in test/data/:

  • simple.dta - Basic numeric data
  • mixed_types.dta - Multiple data types
  • with_missing.dta - Missing value handling
  • large_dataset.dta - Performance testing (10k rows)
  • Version-specific files for compatibility testing

Supported Stata Versions

Stata Version File Format Status Notes
Stata 7SE 111 ✅ Supported Basic format
Stata 8/9 113 ✅ Supported Enhanced features
Stata 10/11 114 ✅ Supported Improved encoding
Stata 12 115 ✅ Supported Extended variables
Stata 13 117 ✅ Supported Unicode support
Stata 14/15/16 118 ✅ Supported Large datasets
Stata 15/16+ 119 ✅ Supported >32K variables

Performance

Benchmarks

  • Small files (<1MB): Instant loading
  • Medium files (1-100MB): Typical loading time 1-5 seconds
  • Large files (>100MB): Chunked processing with constant memory usage
  • Very large files (>1GB): Streaming support with minimal memory footprint

Memory Usage

  • Efficient chunking: Processes data in configurable chunks (default: 2048 rows)
  • Minimal overhead: ~10MB baseline memory usage
  • Scalable: Memory usage independent of file size for large datasets

Data Type Mapping

Stata Type DuckDB Type Size Notes
byte TINYINT 1B -127 to 100
int SMALLINT 2B -32,767 to 32,740
long INTEGER 4B -2B to 2B
float FLOAT 4B IEEE 754 single
double DOUBLE 8B IEEE 754 double
str1-str244 VARCHAR Var UTF-8 strings

Architecture

Core Components

  1. StataParser: Base parser handling file I/O and type mappings
  2. StataReader: Main reader with version-specific logic
  3. Extension Interface: DuckDB table function integration

Key Features

  • Version Detection: Automatic format detection and parsing
  • Byte Order Handling: Native support for both endianness
  • Error Recovery: Robust error handling with detailed messages
  • Memory Management: RAII patterns and smart pointers
  • Thread Safety: Stateless design for concurrent access

Error Handling

Common error scenarios and solutions:

-- File not found
SELECT * FROM read_stata_dta('missing.dta');
-- Error: Cannot open Stata file: missing.dta

-- Unsupported version
SELECT * FROM read_stata_dta('old_format.dta');  
-- Error: Unsupported Stata file version: 104

-- Corrupted file
SELECT * FROM read_stata_dta('corrupted.dta');
-- Error: Unexpected end of Stata file

Contributing

We welcome contributions! Areas for enhancement:

  1. Value Labels: Implement categorical variable labels
  2. Variable Labels: Preserve descriptive variable names
  3. Date Formats: Convert Stata date/time to DuckDB temporals
  4. Compression: Support for compressed DTA files
  5. Metadata: Extract and expose file metadata
  6. Performance: Further optimization for very large files

Development Setup

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/your-feature
  3. Build and test: make && make test
  4. Submit pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Based on the pandas Stata reader implementation (pandas development team, 2020). The pandas BSD 3-Clause License allows reuse and modification.
  • Inspired by DuckDB's extensible architecture
  • Thanks to the Stata community for format documentation

Trademark Notice

Stata® is a registered trademark of StataCorp LLC. This extension is an independent third-party tool designed to help Stata users access their data through DuckDB. It is not endorsed by, affiliated with, or supported by StataCorp LLC. This extension is provided for compatibility purposes only to facilitate data analysis workflows.

References

pandas development team. (2020). pandas-dev/pandas: Pandas (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.3509134

Support

  • Issues: Report bugs via GitHub Issues
  • Documentation: See examples/ directory
  • Performance: Check test/sql/stata_dta_performance.test
  • Community: Join discussions in GitHub Discussions

About

No description, website, or topics provided.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages