nonodename · nonodename · Apr 8, 2026 · Apr 8, 2026 · Apr 9, 2026
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
@@ -51,6 +51,8 @@ Or load the extension and run queries interactively:
 ./build/release/duckdb
 ```
 
+Any test that involves filenames must account for directory separators differing between Windows and Unix-based systems. The simplest solution is to normalize the filename column, for example with `replace(filename,'\','/')`.
+
 ## Architecture
 
 Note that you must stick with C++ 11 and earlier as that's the standard that DuckDB uses.
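
Review note on the path-separator guidance added to `.claude/CLAUDE.md`: the fix amounts to a single character substitution. A minimal C++11 sketch of the same normalization that `replace(filename,'\','/')` performs in SQL follows. The helper name `NormalizePathSeparators` is hypothetical and is not part of this PR.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not part of the extension): converts Windows-style
// backslash separators to forward slashes so that expected paths compare
// equal on every platform.
inline std::string NormalizePathSeparators(std::string path) {
    std::replace(path.begin(), path.end(), '\\', '/');
    return path;
}
```

Paths that already use forward slashes pass through unchanged, so the helper is safe to apply unconditionally.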

diff --git a/.gitignore b/.gitignore
@@ -7,4 +7,5 @@ testext
 test/python/__pycache__/
 .Rhistory
 src/test_runner
-.cache/*
+.cache/*
+.vscode/
diff --git a/.vscode/c_cpp_properties.json b/.vscode/c_cpp_properties.json
diff --git a/.vscode/settings.json b/.vscode/settings.json
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -25,6 +25,7 @@ set(EXTENSION_SOURCES
     src/rdf_profiler.cpp
     src/profile_rdf.cpp
     src/pivot_rdf.cpp
+    src/read_rdf_prefixes.cpp
 )
 
 # ------------------------------------------------------------

diff --git a/README.md b/README.md
@@ -85,6 +85,35 @@ SELECT * FROM read_rdf('data/shards/*.dat', file_type = 'ttl', strict_parsing =
 
 If the pattern matches no files an `IO Error` is raised.
 
+## Reading RDF Prefixes
+
+`read_rdf_prefixes()` returns the `@prefix` and `@base` declarations from Turtle or TriG files. It is useful for namespace introspection, documentation, and building CURIE-aware tooling. NTriples, NQuads, and RDF/XML are not supported (they have no `@prefix` or `@base` declarations) and raise an error.
+
+```sql
+SELECT prefix, uri, is_base FROM read_rdf_prefixes('test/rdf/tests.ttl');
+```
+
+```
+┌────────┬───────────────────────────────┬─────────┐
+│ prefix │              uri              │ is_base │
+│varchar │            varchar            │ boolean │
+├────────┼───────────────────────────────┼─────────┤
+│ foaf   │ http://xmlns.com/foaf/0.1/    │ false   │
+│ dc     │ http://purl.org/dc/elements/… │ false   │
+│        │ http://example.org/           │ true    │
+│ uni    │ http://unicode.org/           │ false   │
+└────────┴───────────────────────────────┴─────────┘
+```
+
+`read_rdf_prefixes()` accepts the same `strict_parsing`, `file_type`, and `include_filenames` parameters as `read_rdf()` and supports glob patterns:
+
+```sql
+-- Collect all unique prefixes across a set of Turtle files
+SELECT DISTINCT prefix, uri
+FROM read_rdf_prefixes('ontologies/*.ttl')
+ORDER BY prefix;
+```
+
 ## Pivoting RDF
 
 `pivot_rdf()` takes the same path/glob argument as `read_rdf()` and returns a pivoted table, one column per predicate, at least one row per subject. (To operate on arbitrary file sizes subjects _may_ be repeated if encountered out of sequence). While a pivot is possible in the SQL domain, it is subject to memory limits which this function aims to avoid by doing two passes on the RDF.

diff --git a/TODO.md b/TODO.md
@@ -9,8 +9,9 @@ Currently all 6 columns are VARCHAR. The object_datatype column contains XSD typ
 2. **Source filename column** ✅
 When reading multiple files via glob, there's no way to know which triple came from which file. Adding a filename column (like DuckDB's read_parquet does) would be very useful for tracing provenance.
 
-3. **read_rdf_prefixes() table function**
+3. **read_rdf_prefixes() table function** ✅
 A companion function that returns the prefix declarations (@prefix / @base) from a Turtle/TriG file. Useful for documentation and for building CURIE-aware tooling.
+Implemented in `src/read_rdf_prefixes.cpp`. Returns three columns: `prefix` (VARCHAR), `uri` (VARCHAR), `is_base` (BOOLEAN). Supports the same `strict_parsing`, `file_type`, and `include_filenames` parameters as `read_rdf()`, and glob patterns. Throws `InvalidInputException` for NTriples, NQuads, and RDF/XML.
 
 4. **SPARQL endpoint reader** ✅
 `read_sparql(endpoint, query)` is now implemented. It sends a SPARQL SELECT against an HTTP/HTTPS endpoint and returns the result set as a table.
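
Review note on the `TODO.md` entry: the format restriction (Turtle and TriG accepted, everything else rejected) could be sketched as below. This is an illustration under assumptions, not the code in `src/read_rdf_prefixes.cpp`: the function name `ResolvePrefixFileType` is hypothetical, detection by file extension is assumed, and `std::invalid_argument` stands in for DuckDB's `InvalidInputException` so the sketch compiles without DuckDB headers.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical sketch: resolves the format from an explicit file_type
// override or, failing that, from the text after the last '.' in the path.
// Returns "turtle" or "trig"; throws for any other format.
inline std::string ResolvePrefixFileType(const std::string &path, const std::string &file_type) {
    std::string type = file_type;
    if (type.empty()) {
        // No override given: fall back to the file extension.
        std::string::size_type dot = path.find_last_of('.');
        type = (dot == std::string::npos) ? "" : path.substr(dot + 1);
    }
    // Compare case-insensitively.
    for (std::string::size_type i = 0; i < type.size(); i++) {
        type[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(type[i])));
    }
    if (type == "ttl" || type == "turtle") {
        return "turtle";
    }
    if (type == "trig") {
        return "trig";
    }
    // Stand-in for InvalidInputException in the real extension.
    throw std::invalid_argument("read_rdf_prefixes: format '" + type + "' is not supported");
}
```

With this shape, a non-standard extension such as `.dat` only succeeds when the caller passes `file_type` explicitly.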

diff --git a/docs/functions.md b/docs/functions.md
@@ -63,6 +63,65 @@ ORDER BY filename;
 
 ---
 
+## `read_rdf_prefixes(path, [options])`
+
+Table function. Reads one or more Turtle or TriG files and returns their `@prefix` and `@base` declarations as rows. Useful for namespace introspection, documentation, and building CURIE-aware tooling.
+
+Throws an error for NTriples, NQuads, and RDF/XML, as those formats have no `@prefix` or `@base` declarations.
+
+**Parameters**
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `path` | VARCHAR | Yes | — | File path or glob pattern |
+| `strict_parsing` | BOOLEAN | No | `true` | When `false`, skips malformed content instead of raising an error |
+| `file_type` | VARCHAR | No | auto-detect | Override format detection. Values: `ttl`, `turtle`, `trig` |
+| `include_filenames` | BOOLEAN | No | `false` | When `true`, adds a 4th column `filename` containing the source file path |
+
+**Returns**
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `prefix` | VARCHAR | Prefix name; `NULL` for `@base` declarations (which have no prefix name) |
+| `uri` | VARCHAR | Namespace URI |
+| `is_base` | BOOLEAN | `true` for `@base` declarations, `false` for `@prefix` declarations |
+| `filename` | VARCHAR | Source file path; only present when `include_filenames = true` |
+
+**Supported formats**
+
+| Format | Extensions |
+|--------|-----------|
+| Turtle | `.ttl` |
+| TriG | `.trig` |
+
+**Examples**
+
+```sql
+-- List all prefixes declared in a Turtle file
+SELECT prefix, uri FROM read_rdf_prefixes('data.ttl');
+
+-- Find the base URI
+SELECT uri FROM read_rdf_prefixes('data.ttl') WHERE is_base = true;
+
+-- Collect all prefixes from multiple files
+SELECT DISTINCT prefix, uri
+FROM read_rdf_prefixes('ontologies/*.ttl')
+ORDER BY prefix;
+
+-- Show which file each prefix came from
+SELECT filename, prefix, uri
+FROM read_rdf_prefixes('ontologies/*.ttl', include_filenames = true)
+ORDER BY filename, prefix;
+
+-- Count prefix declarations per file across a glob
+SELECT filename, COUNT(*) AS prefix_count
+FROM read_rdf_prefixes('data/*.ttl', include_filenames = true)
+GROUP BY filename
+ORDER BY prefix_count DESC;
+```
+
+---
+
 ## `profile_rdf(path, [options])`
 
 Table function. Reads one or more RDF files and returns a statistical profile with one row per unique predicate. Useful for exploring an unfamiliar dataset, understanding its type distribution, and validating data quality before building a full pipeline.
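
Review note on the `read_rdf_prefixes()` documentation above: the mapping from declarations to the `prefix`, `uri`, and `is_base` columns can be illustrated with a simplified line-based scanner. This is a sketch under stated assumptions and not the extension's parser: the names `PrefixRow`, `ExtractIri`, and `ScanPrefixes` are hypothetical, each declaration is assumed to sit on its own line, and SPARQL-style `PREFIX`/`BASE` forms are ignored.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One output row; has_prefix == false models the SQL NULL used for @base.
struct PrefixRow {
    std::string prefix;
    bool has_prefix;
    std::string uri;
    bool is_base;
};

// Copies the text between the first '<' and the next '>' into iri.
static bool ExtractIri(const std::string &s, std::string &iri) {
    std::string::size_type open = s.find('<');
    if (open == std::string::npos) return false;
    std::string::size_type close = s.find('>', open + 1);
    if (close == std::string::npos) return false;
    iri = s.substr(open + 1, close - open - 1);
    return true;
}

// Hypothetical scanner: collects @prefix and @base declarations, one per line.
inline std::vector<PrefixRow> ScanPrefixes(const std::string &turtle) {
    std::vector<PrefixRow> rows;
    std::istringstream in(turtle);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type start = line.find_first_not_of(" \t\r");
        if (start == std::string::npos) continue; // blank line
        std::string rest = line.substr(start);
        PrefixRow row;
        if (rest.compare(0, 7, "@prefix") == 0) {
            std::string::size_type colon = rest.find(':', 7);
            std::string::size_type lt = rest.find('<', 7);
            // The prefix name's colon must come before the IRI.
            if (colon == std::string::npos || lt == std::string::npos || lt < colon) continue;
            std::string name = rest.substr(7, colon - 7);
            std::string::size_type b = name.find_first_not_of(" \t");
            std::string::size_type e = name.find_last_not_of(" \t");
            row.prefix = (b == std::string::npos) ? "" : name.substr(b, e - b + 1);
            row.has_prefix = true;
            row.is_base = false;
            if (!ExtractIri(rest.substr(colon + 1), row.uri)) continue;
        } else if (rest.compare(0, 5, "@base") == 0) {
            row.has_prefix = false;
            row.is_base = true;
            if (!ExtractIri(rest.substr(5), row.uri)) continue;
        } else {
            continue; // not a declaration
        }
        rows.push_back(row);
    }
    return rows;
}
```

Lines that are not declarations, such as ordinary triples, are skipped, which loosely mirrors the documented behaviour of returning only `@prefix` and `@base` rows.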

diff --git a/src/include/read_rdf_prefixes.hpp b/src/include/read_rdf_prefixes.hpp
@@ -0,0 +1,9 @@
+#pragma once
+
+#include "duckdb.hpp"
+
+namespace duckdb {
+
+void RegisterReadRDFPrefixes(ExtensionLoader &loader);
+
+} // namespace duckdb
diff --git a/src/rdf_extension.cpp b/src/rdf_extension.cpp
@@ -9,6 +9,7 @@
 #include "include/r2rml_copy.hpp"
 #include "include/profile_rdf.hpp"
 #include "include/pivot_rdf.hpp"
+#include "include/read_rdf_prefixes.hpp"
 #include "duckdb/common/exception.hpp"
 #include "duckdb/common/string_util.hpp"
 #include "duckdb/function/table_function.hpp"
@@ -224,6 +225,7 @@ static void LoadInternal(ExtensionLoader &loader) {
 	RegisterSPARQLReader(loader);
 	RegisterProfileRDF(loader);
 	RegisterPivotRDF(loader);
+	RegisterReadRDFPrefixes(loader);
 }
 
 void RdfExtension::Load(ExtensionLoader &loader) {