2 changes: 2 additions & 0 deletions .claude/CLAUDE.md
@@ -51,6 +51,8 @@ Or load the extension and run queries interactively:
./build/release/duckdb
```

Any test that involves filenames must account for directory separators differing between Windows and Unix-based systems. The simplest solution is to normalize the filename column, for example with `replace(filename, '\', '/')`.

## Architecture

Note that you must stick with C++11 or earlier, as that is the standard DuckDB uses.
3 changes: 2 additions & 1 deletion .gitignore
@@ -7,4 +7,5 @@ testext
test/python/__pycache__/
.Rhistory
src/test_runner
.cache/*
.cache/*
.vscode/
16 changes: 0 additions & 16 deletions .vscode/c_cpp_properties.json

This file was deleted.

72 changes: 0 additions & 72 deletions .vscode/settings.json

This file was deleted.

1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -25,6 +25,7 @@ set(EXTENSION_SOURCES
src/rdf_profiler.cpp
src/profile_rdf.cpp
src/pivot_rdf.cpp
src/read_rdf_prefixes.cpp
)

# ------------------------------------------------------------
29 changes: 29 additions & 0 deletions README.md
@@ -85,6 +85,35 @@ SELECT * FROM read_rdf('data/shards/*.dat', file_type = 'ttl', strict_parsing =

If the pattern matches no files an `IO Error` is raised.

## Reading RDF Prefixes

`read_rdf_prefixes()` returns the `@prefix` and `@base` declarations from Turtle or TriG files. It is useful for namespace introspection, documentation, and building CURIE-aware tooling. NTriples, NQuads, and RDF/XML are not supported because they have no `@prefix` or `@base` directives; passing such a file raises an error.

```sql
SELECT prefix, uri, is_base FROM read_rdf_prefixes('test/rdf/tests.ttl');
```

```
┌────────┬───────────────────────────────┬─────────┐
│ prefix │ uri                           │ is_base │
│varchar │ varchar                       │ boolean │
├────────┼───────────────────────────────┼─────────┤
│ foaf   │ http://xmlns.com/foaf/0.1/    │ false   │
│ dc     │ http://purl.org/dc/elements/… │ false   │
│        │ http://example.org/           │ true    │
│ uni    │ http://unicode.org/           │ false   │
└────────┴───────────────────────────────┴─────────┘
```

`read_rdf_prefixes()` accepts the same `strict_parsing`, `file_type`, and `include_filenames` parameters as `read_rdf()` and supports glob patterns:

```sql
-- Collect all unique prefixes across a set of Turtle files
SELECT DISTINCT prefix, uri
FROM read_rdf_prefixes('ontologies/*.ttl')
ORDER BY prefix;
```

## Pivoting RDF

`pivot_rdf()` takes the same path/glob argument as `read_rdf()` and returns a pivoted table with one column per predicate and at least one row per subject. (To handle files of arbitrary size, a subject _may_ be repeated if its triples are encountered out of sequence.) While a pivot is possible in SQL, it is subject to memory limits, which this function aims to avoid by making two passes over the RDF.
3 changes: 2 additions & 1 deletion TODO.md
@@ -9,8 +9,9 @@ Currently all 6 columns are VARCHAR. The object_datatype column contains XSD typ
2. **Source filename column** ✅
When reading multiple files via glob, there's no way to know which triple came from which file. Adding a filename column (like DuckDB's read_parquet does) would be very useful for tracing provenance.

3. **read_rdf_prefixes() table function**
3. **read_rdf_prefixes() table function**
A companion function that returns the prefix declarations (@prefix / @base) from a Turtle/TriG file. Useful for documentation and for building CURIE-aware tooling.
Implemented in `src/read_rdf_prefixes.cpp`. Returns three columns: `prefix` (VARCHAR), `uri` (VARCHAR), `is_base` (BOOLEAN). Supports the same `strict_parsing`, `file_type`, and `include_filenames` parameters as `read_rdf()`, and glob patterns. Throws `InvalidInputException` for NTriples, NQuads, and RDF/XML.

4. **SPARQL endpoint reader** ✅
`read_sparql(endpoint, query)` is now implemented. It sends a SPARQL SELECT against an HTTP/HTTPS endpoint and returns the result set as a table.
59 changes: 59 additions & 0 deletions docs/functions.md
@@ -63,6 +63,65 @@ ORDER BY filename;

---

## `read_rdf_prefixes(path, [options])`

Table function. Reads one or more Turtle or TriG files and returns their `@prefix` and `@base` declarations as rows. Useful for namespace introspection, documentation, and building CURIE-aware tooling.

Throws an error for NTriples, NQuads, and RDF/XML, as those formats do not contain prefix declarations.

**Parameters**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `path` | VARCHAR | Yes | — | File path or glob pattern |
| `strict_parsing` | BOOLEAN | No | `true` | When `false`, skips malformed content instead of raising an error |
| `file_type` | VARCHAR | No | auto-detect | Override format detection. Values: `ttl`, `turtle`, `trig` |
| `include_filenames` | BOOLEAN | No | `false` | When `true`, adds a 4th column `filename` containing the source file path |

**Returns**

| Column | Type | Description |
|--------|------|-------------|
| `prefix` | VARCHAR | Prefix name; `NULL` for `@base` declarations (which have no prefix name) |
| `uri` | VARCHAR | Namespace URI |
| `is_base` | BOOLEAN | `true` for `@base` declarations, `false` for `@prefix` declarations |
| `filename` | VARCHAR | Source file path; only present when `include_filenames = true` |
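
To make the row shape concrete, here is a minimal C++11 sketch that pulls `@prefix` and `@base` directives out of Turtle text and produces rows with the columns listed above. It is an illustration only, not the extension's implementation: `PrefixRow`, `ExtractPrefixes`, and the `has_prefix` flag (standing in for SQL `NULL`) are hypothetical names. The line-based scan assumes one whitespace-separated directive per line and ignores comments, SPARQL-style `PREFIX`/`BASE`, and multi-line directives, all of which a real Turtle parser has to handle.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical row type mirroring the documented columns.
struct PrefixRow {
    bool has_prefix;    // false models SQL NULL (used for @base rows)
    std::string prefix; // prefix name without the trailing ':'
    std::string uri;    // namespace or base URI without the angle brackets
    bool is_base;       // true for @base, false for @prefix
};

// Simplified scan: assumes each directive sits on its own line.
std::vector<PrefixRow> ExtractPrefixes(const std::string &turtle) {
    std::vector<PrefixRow> rows;
    std::istringstream lines(turtle);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream tokens(line);
        std::string keyword;
        if (!(tokens >> keyword)) {
            continue; // blank line
        }
        const bool is_base = (keyword == "@base");
        if (!is_base && keyword != "@prefix") {
            continue; // not a directive
        }
        PrefixRow row;
        row.is_base = is_base;
        row.has_prefix = false;
        if (!is_base) {
            std::string name;
            if (!(tokens >> name) || name[name.size() - 1] != ':') {
                continue; // malformed prefix name
            }
            row.has_prefix = true;
            row.prefix = name.substr(0, name.size() - 1);
        }
        std::string iri;
        if (!(tokens >> iri)) {
            continue; // directive without an IRI
        }
        const std::size_t open = iri.find('<');
        const std::size_t close = iri.find('>');
        if (open == std::string::npos || close == std::string::npos || close < open) {
            continue; // no <...> IRI found
        }
        row.uri = iri.substr(open + 1, close - open - 1);
        rows.push_back(row);
    }
    return rows;
}
```

Modelling `NULL` with an explicit flag keeps the sketch within C++11, which has no `std::optional`.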

**Supported formats**

| Format | Extensions |
|--------|-----------|
| Turtle | `.ttl` |
| TriG | `.trig` |
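
The format table above can be read as a simple gate on the file extension. The C++11 sketch below shows one way such a check might look. It assumes detection is driven by the extension and is case-insensitive, neither of which the source spells out; `HasPrefixCapableExtension` and `ToLowerAscii` are hypothetical helpers, not functions from the extension.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: lower-cases ASCII text so the comparison ignores case.
static std::string ToLowerAscii(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// Returns true when the path ends in one of the extensions from the table
// (.ttl or .trig); any other path would be rejected by the caller.
bool HasPrefixCapableExtension(const std::string &path) {
    const std::size_t dot = path.find_last_of('.');
    if (dot == std::string::npos) {
        return false; // no extension at all
    }
    const std::string ext = ToLowerAscii(path.substr(dot));
    return ext == ".ttl" || ext == ".trig";
}
```

A caller would raise its error for any path where this returns `false` and no explicit `file_type` override is given.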

**Examples**

```sql
-- List all prefixes declared in a Turtle file
SELECT prefix, uri FROM read_rdf_prefixes('data.ttl');

-- Find the base URI
SELECT uri FROM read_rdf_prefixes('data.ttl') WHERE is_base = true;

-- Collect all prefixes from multiple files
SELECT DISTINCT prefix, uri
FROM read_rdf_prefixes('ontologies/*.ttl')
ORDER BY prefix;

-- Show which file each prefix came from
SELECT filename, prefix, uri
FROM read_rdf_prefixes('ontologies/*.ttl', include_filenames = true)
ORDER BY filename, prefix;

-- Count prefix declarations per file across a glob
SELECT filename, COUNT(*) AS prefix_count
FROM read_rdf_prefixes('data/*.ttl', include_filenames = true)
GROUP BY filename
ORDER BY prefix_count DESC;
```

---

## `profile_rdf(path, [options])`

Table function. Reads one or more RDF files and returns a statistical profile with one row per unique predicate. Useful for exploring an unfamiliar dataset, understanding its type distribution, and validating data quality before building a full pipeline.
9 changes: 9 additions & 0 deletions src/include/read_rdf_prefixes.hpp
@@ -0,0 +1,9 @@
#pragma once

#include "duckdb.hpp"

namespace duckdb {

void RegisterReadRDFPrefixes(ExtensionLoader &loader);

} // namespace duckdb
2 changes: 2 additions & 0 deletions src/rdf_extension.cpp
@@ -9,6 +9,7 @@
#include "include/r2rml_copy.hpp"
#include "include/profile_rdf.hpp"
#include "include/pivot_rdf.hpp"
#include "include/read_rdf_prefixes.hpp"
#include "duckdb/common/exception.hpp"
#include "duckdb/common/string_util.hpp"
#include "duckdb/function/table_function.hpp"
@@ -224,6 +225,7 @@ static void LoadInternal(ExtensionLoader &loader) {
RegisterSPARQLReader(loader);
RegisterProfileRDF(loader);
RegisterPivotRDF(loader);
RegisterReadRDFPrefixes(loader);
}

void RdfExtension::Load(ExtensionLoader &loader) {