Automatically generate DCAT-AP Feed from dump #1

Open

wants to merge 14 commits into master
Conversation


@smessie commented on Feb 19, 2025

This PR introduces an RDF-Connect pipeline that creates an LDES feed from the DCAT-AP dump datagovbe_edp.xml(.gz).

The Pipeline

The pipeline consists of multiple processors run one after another.
First, the gzipped datagovbe_edp.xml.gz file is read from disk.
This file is passed to the gunzip processor, which decompresses it and passes its contents to the DumpsToFeed processor.

This DumpsToFeed processor extracts all entities from the dump.
It does so by first finding all focus nodes it needs to consider in the dump, extracting every node that corresponds to one of the standalone entities, as per 1.1.1 Standalone entities.
Once it has all focus nodes, it extracts their contents using the member extraction algorithm with the DCAT-AP Feed ActivityShape.
The entities are embedded and described as activities using the ActivityStreams 2.0 ontology (https://www.w3.org/ns/activitystreams#).
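
Below is a minimal TypeScript sketch of these two steps, assuming the dump is already parsed into an N3 Store. The class list, the activity IRI scheme, and the function names are illustrative assumptions, not the actual DumpsToFeed implementation.

import { Store, DataFactory, NamedNode, Quad } from "n3";
const { namedNode, quad, literal } = DataFactory;

const RDF_TYPE = namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
const AS = "https://www.w3.org/ns/activitystreams#";

// Hypothetical subset of the standalone entity classes from the spec.
const STANDALONE_CLASSES = [
  "http://www.w3.org/ns/dcat#Catalog",
  "http://www.w3.org/ns/dcat#Dataset",
  "http://www.w3.org/ns/dcat#DataService",
];

// Step 1: every node typed as one of the standalone entity classes is a focus node.
function findFocusNodes(dump: Store): NamedNode[] {
  return STANDALONE_CLASSES.flatMap(
    (cls) => dump.getSubjects(RDF_TYPE, namedNode(cls), null) as NamedNode[],
  );
}

// Step 2: describe an extracted entity as an ActivityStreams activity.
function wrapAsActivity(focusNode: NamedNode, memberQuads: Quad[]): Quad[] {
  const activity = namedNode(`${focusNode.value}#activity`); // assumed IRI scheme
  return [
    quad(activity, RDF_TYPE, namedNode(`${AS}Create`)),
    quad(activity, namedNode(`${AS}object`), focusNode),
    quad(
      activity,
      namedNode(`${AS}published`),
      literal(new Date().toISOString(), namedNode("http://www.w3.org/2001/XMLSchema#dateTime")),
    ),
    // The quads produced by the member extraction algorithm are embedded alongside.
    ...memberQuads,
  ];
}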

The extracted entities are streamed to the next processors, Sdsify, Bucketize, and LdesDiskWriter, which persist them as an LDES feed on disk.
During the bucketization step, the fragmentation of the LDES is determined. In this PR it is configured to generate a time-based fragmentation in which a bucket can hold at most 100 members. If this limit is exceeded, the bucket is split over time into 4 equal sub-buckets, with a minimum bucket span of 1 day. Once a bucket cannot be split any further, additional pages are created for that bucket as needed.
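
A minimal TypeScript sketch of that splitting rule; the actual logic lives in the bucketizer processor, and the types and names here are hypothetical.

interface Bucket {
  start: Date;       // inclusive start of the bucket's time span
  end: Date;         // exclusive end of the bucket's time span
  members: string[]; // IRIs of the members in this bucket
}

const MAX_MEMBERS = 100;                 // a bucket holds at most 100 members
const SPLIT_FACTOR = 4;                  // an overflowing bucket splits into 4 equal sub-buckets
const MIN_SPAN_MS = 24 * 60 * 60 * 1000; // minimum bucket span of 1 day

function splitIfNeeded(bucket: Bucket): Bucket[] {
  if (bucket.members.length <= MAX_MEMBERS) return [bucket];

  const span = bucket.end.getTime() - bucket.start.getTime();
  if (span / SPLIT_FACTOR < MIN_SPAN_MS) {
    // The bucket cannot be split further in time; extra pages are
    // created for it instead (pagination not shown in this sketch).
    return [bucket];
  }

  // Split the time span into 4 equal sub-intervals; each member is then
  // reassigned to the sub-bucket covering its timestamp (lookup omitted).
  const step = span / SPLIT_FACTOR;
  return Array.from({ length: SPLIT_FACTOR }, (_, i) => ({
    start: new Date(bucket.start.getTime() + i * step),
    end: new Date(bucket.start.getTime() + (i + 1) * step),
    members: [],
  }));
}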

As a final step, the LdesDiskWriter stores the LDES as a set of files on disk, in this repository, in the docs/ directory.
The decision to store the feed as a set of files in the docs/ directory follows from the requirement to run the pipeline automatically using GitHub Actions (see Automation below).

Execution

Executing the pipeline takes a single command once its dependencies are installed:

# Make sure you are in the right directory where the pipeline is defined.
cd pipeline

# Install the dependencies 
npm i

# Execute the pipeline
npx @rdfc/js-runner dumps-to-feed-pipeline.ttl

Once the dump is updated, it suffices to re-execute the pipeline. The processors keep state about the entities already processed, so only new (or updated) entities in the dump are processed and added to the existing LDES.
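
A minimal sketch of how such incremental state could work, assuming the state maps each entity IRI to a hash of its last-seen description; the actual state format of DumpsToFeed may differ.

import { Level } from "level";
import { createHash } from "node:crypto";

// Same on-disk location the workflow persists, per the Automation section.
const state = new Level<string, string>("pipeline/leveldb/state-of-belgium");

// Returns true if the entity was never seen before or changed since the last run.
async function isNewOrUpdated(iri: string, canonicalDescription: string): Promise<boolean> {
  const hash = createHash("sha256").update(canonicalDescription).digest("hex");
  const previous = await state.get(iri).catch(() => undefined);
  if (previous === hash) return false; // unchanged: skip this entity
  await state.put(iri, hash);          // remember it for the next run
  return true;
}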

However, with the GitHub Action described in the next section, you should not have to execute the pipeline manually.

Automation

This PR leverages GitHub Actions and GitHub Pages to generate and host the feed.

Using the create-feed.yml workflow, the LDES feed is updated automatically on every push to master.
The workflow stores the state of the bucketizer in pipeline/feed-state/buckets_save.json, the state of the DumpsToFeed processor in pipeline/leveldb/state-of-belgium/, and the contents of the LDES in the docs/ folder.
The docs/ folder was chosen because of a limitation of GitHub Pages: a branch can only be deployed from its root or from docs/.
As the next step is automatically hosting the LDES using GitHub Pages, the LDES source is stored at docs/.
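
For illustration, a minimal sketch of what such a workflow could look like; the actual create-feed.yml in this PR may differ in step names and details.

name: Create feed
on:
  push:
    branches: [master]

jobs:
  create-feed:
    runs-on: ubuntu-latest
    permissions:
      contents: write # needed to commit the state and the docs/ feed back
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm i
        working-directory: pipeline
      - run: npx @rdfc/js-runner dumps-to-feed-pipeline.ttl
        working-directory: pipeline
      - name: Commit the updated state and feed
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add pipeline/feed-state pipeline/leveldb docs
          git commit -m "Update LDES feed" || echo "Nothing to commit"
          git push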

It is important that the GitHub repository settings are configured correctly:

  • Make sure the GH Action has permission to write its state and feed to the repository: set Settings > Actions > General > Workflow permissions to Read and write permissions.
  • Make sure GH Pages is configured to publish the docs/ folder: under Settings > Pages > Build and deployment, set Source to Deploy from a branch, and Branch to master with /docs.

Shortly after, the LDES will be accessible at https://fedict.github.io/dcat/index.trig, just as my version is already accessible now at https://smessie.github.io/dcat/index.trig.

You can verify this by running the ldes-client over the LDES:

npx ldes-client https://smessie.github.io/dcat/index.trig
# OR
npx ldes-client https://fedict.github.io/dcat/index.trig

You can then also set the website URL of this GitHub repository (About > Website) to https://fedict.github.io/dcat/index.trig so people can easily discover the LDES feed hosted by this repository.

@barthanssens (Member) commented

Thanks for the contribution!
I'm working on deploying it in our infrastructure instead of GitHub Actions.
