A high-performance DuckDB extension for reading Stata .dta
files directly into DuckDB tables without intermediate conversion.
This extension allows you to seamlessly read Stata data files (.dta
format) directly into DuckDB, preserving data types, handling missing values, and supporting multiple Stata file versions. It provides native SQL access to Stata datasets with full DuckDB functionality.
- Miklós Koren - Project Lead
- Claude (Anthropic AI Assistant) - Implementation
- Multi-version Support: Stata versions 105, 108, 111, 113-119
- Complete Data Types:
- Numeric: byte (int8), int (int16), long (int32), float, double
- String: Variable-length strings (1-244 characters)
- Missing value detection and handling
- Performance Optimized:
- Memory-efficient chunked reading for large files
- Streaming data processing
- Native byte-order handling (big-endian/little-endian)
- Full SQL Integration: Use standard SQL operations on Stata data
- Production Ready: Comprehensive error handling and validation
- Value labels support
- Variable labels preservation
- Stata date/time format conversion
- Compressed DTA file support
- Advanced metadata extraction
- CMake 3.5+
- C++17 compatible compiler
- DuckDB development environment
- Clone with submodules:
git clone --recurse-submodules https://github.com/codedthinking/stata-dta.git
cd stata-dta
- Install dependencies (optional - for enhanced performance):
# Install build accelerators
brew install ninja ccache # macOS
# or
sudo apt-get install ninja-build ccache # Ubuntu
- Build the extension:
# Standard build
make
# Fast build with ninja and ccache
GEN=ninja make
# Debug build
make debug
./build/release/duckdb # DuckDB CLI with extension
./build/release/extension/stata_dta/stata_dta.duckdb_extension # Loadable extension
./build/release/test/unittest # Test runner
Start DuckDB with the extension pre-loaded:
./build/release/duckdb
Read Stata files directly:
-- Load a Stata file as a table
SELECT * FROM read_stata_dta('path/to/your/file.dta');
-- Count observations
SELECT COUNT(*) FROM read_stata_dta('data/survey.dta');
-- Filter and aggregate
SELECT
region,
AVG(income) as avg_income,
COUNT(*) as obs_count
FROM read_stata_dta('data/households.dta')
WHERE age >= 18
GROUP BY region;
-- Handle missing values
SELECT
id,
COALESCE(score, 0) as score_filled,
CASE WHEN score IS NULL THEN 'Missing' ELSE 'Valid' END as status
FROM read_stata_dta('data/test_scores.dta');
-- Join with other data sources
SELECT
s.*,
d.description
FROM read_stata_dta('data/survey.dta') s
JOIN dictionary_table d ON s.code = d.code;
-- Export to other formats
COPY (
SELECT * FROM read_stata_dta('data/analysis.dta')
WHERE year >= 2020
) TO 'output/recent_data.csv' (HEADER);
-- Schema inspection
DESCRIBE SELECT * FROM read_stata_dta('data/variables.dta');
-- Get extension version and details
SELECT stata_dta_info('version');
# Run all tests
make test
# Run specific test categories
./build/release/test/unittest --test-dir test/sql stata_dta*
Create test datasets for development:
python3 test/create_test_data.py
This generates various .dta
files in test/data/
:
simple.dta
- Basic numeric datamixed_types.dta
- Multiple data typeswith_missing.dta
- Missing value handlinglarge_dataset.dta
- Performance testing (10k rows)- Version-specific files for compatibility testing
Stata Version | File Format | Status | Notes |
---|---|---|---|
Stata 7SE | 111 | ✅ Supported | Basic format |
Stata 8/9 | 113 | ✅ Supported | Enhanced features |
Stata 10/11 | 114 | ✅ Supported | Improved encoding |
Stata 12 | 115 | ✅ Supported | Extended variables |
Stata 13 | 117 | ✅ Supported | Unicode support |
Stata 14/15/16 | 118 | ✅ Supported | Large datasets |
Stata 15/16+ | 119 | ✅ Supported | >32K variables |
- Small files (<1MB): Instant loading
- Medium files (1-100MB): Typical loading time 1-5 seconds
- Large files (>100MB): Chunked processing with constant memory usage
- Very large files (>1GB): Streaming support with minimal memory footprint
- Efficient chunking: Processes data in configurable chunks (default: 2048 rows)
- Minimal overhead: ~10MB baseline memory usage
- Scalable: Memory usage independent of file size for large datasets
Stata Type | DuckDB Type | Size | Notes |
---|---|---|---|
byte | TINYINT | 1B | -127 to 100 |
int | SMALLINT | 2B | -32,767 to 32,740 |
long | INTEGER | 4B | -2B to 2B |
float | FLOAT | 4B | IEEE 754 single |
double | DOUBLE | 8B | IEEE 754 double |
str1-str244 | VARCHAR | Var | UTF-8 strings |
- StataParser: Base parser handling file I/O and type mappings
- StataReader: Main reader with version-specific logic
- Extension Interface: DuckDB table function integration
- Version Detection: Automatic format detection and parsing
- Byte Order Handling: Native support for both endianness
- Error Recovery: Robust error handling with detailed messages
- Memory Management: RAII patterns and smart pointers
- Thread Safety: Stateless design for concurrent access
Common error scenarios and solutions:
-- File not found
SELECT * FROM read_stata_dta('missing.dta');
-- Error: Cannot open Stata file: missing.dta
-- Unsupported version
SELECT * FROM read_stata_dta('old_format.dta');
-- Error: Unsupported Stata file version: 104
-- Corrupted file
SELECT * FROM read_stata_dta('corrupted.dta');
-- Error: Unexpected end of Stata file
We welcome contributions! Areas for enhancement:
- Value Labels: Implement categorical variable labels
- Variable Labels: Preserve descriptive variable names
- Date Formats: Convert Stata date/time to DuckDB temporals
- Compression: Support for compressed DTA files
- Metadata: Extract and expose file metadata
- Performance: Further optimization for very large files
- Fork the repository
- Create feature branch:
git checkout -b feature/your-feature
- Build and test:
make && make test
- Submit pull request
MIT License - see LICENSE file for details.
- Based on the pandas Stata reader implementation (pandas development team, 2020). The pandas BSD 3-Clause License allows reuse and modification.
- Inspired by DuckDB's extensible architecture
- Thanks to the Stata community for format documentation
Stata® is a registered trademark of StataCorp LLC. This extension is an independent third-party tool designed to help Stata users access their data through DuckDB. It is not endorsed by, affiliated with, or supported by StataCorp LLC. This extension is provided for compatibility purposes only to facilitate data analysis workflows.
pandas development team. (2020). pandas-dev/pandas: Pandas (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.3509134
- Issues: Report bugs via GitHub Issues
- Documentation: See
examples/
directory - Performance: Check
test/sql/stata_dta_performance.test
- Community: Join discussions in GitHub Discussions