Automatically generate DCAT-AP Feed from dump #1

Open

wants to merge 14 commits into master
Conversation


@smessie commented on Feb 19, 2025

This PR introduces an RDF-Connect pipeline that creates an LDES feed from the DCAT-AP dump datagovbe_edp.xml(.gz).

The Pipeline

The pipeline consists of multiple processors run one after another.
First, the gzipped datagovbe_edp.xml.gz file is read from disk.
This file is passed to the gunzip processor, which decompresses it and passes its contents to the DumpsToFeed processor.

This DumpsToFeed processor extracts all entities from the dump.
It does so by first finding all focus nodes it needs to consider in the dump, extracting every node that corresponds to one of the standalone entities, as per 1.1.1 Standalone entities.
Once it has all focus nodes, it extracts their contents using the member extraction algorithm with the DCAT-AP Feed ActivityShape.
The entities are embedded and described as activities using the ActivityStreams 2.0 ontology (https://www.w3.org/ns/activitystreams#).
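
Below is a minimal TypeScript sketch of these two steps, assuming the dump is already parsed into an N3 Store. The class list, the activity IRI scheme, and the function names are illustrative assumptions, not the actual DumpsToFeed implementation.

import { Store, DataFactory, NamedNode, Quad } from "n3";
const { namedNode, quad, literal } = DataFactory;

const RDF_TYPE = namedNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
const AS = "https://www.w3.org/ns/activitystreams#";

// Hypothetical subset of the standalone entity classes from the spec.
const STANDALONE_CLASSES = [
  "http://www.w3.org/ns/dcat#Catalog",
  "http://www.w3.org/ns/dcat#Dataset",
  "http://www.w3.org/ns/dcat#DataService",
];

// Step 1: every node typed as one of the standalone entity classes is a focus node.
function findFocusNodes(dump: Store): NamedNode[] {
  return STANDALONE_CLASSES.flatMap(
    (cls) => dump.getSubjects(RDF_TYPE, namedNode(cls), null) as NamedNode[],
  );
}

// Step 2: describe an extracted entity as an ActivityStreams activity.
function wrapAsActivity(focusNode: NamedNode, memberQuads: Quad[]): Quad[] {
  const activity = namedNode(`${focusNode.value}#activity`); // assumed IRI scheme
  return [
    quad(activity, RDF_TYPE, namedNode(`${AS}Create`)),
    quad(activity, namedNode(`${AS}object`), focusNode),
    quad(
      activity,
      namedNode(`${AS}published`),
      literal(new Date().toISOString(), namedNode("http://www.w3.org/2001/XMLSchema#dateTime")),
    ),
    // The quads produced by the member extraction algorithm are embedded alongside.
    ...memberQuads,
  ];
}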

The extracted entities are streamed to the next processors, Sdsify, Bucketize, and LdesDiskWriter, which persist them as an LDES feed on disk.
During the bucketization step, the fragmentation of the LDES is determined. In this PR it is configured to generate a time-based fragmentation in which a bucket can hold at most 100 members. If this limit is exceeded, the bucket is split over time into 4 equal sub-buckets, with a minimum bucket span of 1 day. Once a bucket cannot be split any further, additional pages are created for that bucket as needed.
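
A minimal TypeScript sketch of that splitting rule; the actual logic lives in the bucketizer processor, and the types and names here are hypothetical.

interface Bucket {
  start: Date;       // inclusive start of the bucket's time span
  end: Date;         // exclusive end of the bucket's time span
  members: string[]; // IRIs of the members in this bucket
}

const MAX_MEMBERS = 100;                 // a bucket holds at most 100 members
const SPLIT_FACTOR = 4;                  // an overflowing bucket splits into 4 equal sub-buckets
const MIN_SPAN_MS = 24 * 60 * 60 * 1000; // minimum bucket span of 1 day

function splitIfNeeded(bucket: Bucket): Bucket[] {
  if (bucket.members.length <= MAX_MEMBERS) return [bucket];

  const span = bucket.end.getTime() - bucket.start.getTime();
  if (span / SPLIT_FACTOR < MIN_SPAN_MS) {
    // The bucket cannot be split further in time; extra pages are
    // created for it instead (pagination not shown in this sketch).
    return [bucket];
  }

  // Split the time span into 4 equal sub-intervals; each member is then
  // reassigned to the sub-bucket covering its timestamp (lookup omitted).
  const step = span / SPLIT_FACTOR;
  return Array.from({ length: SPLIT_FACTOR }, (_, i) => ({
    start: new Date(bucket.start.getTime() + i * step),
    end: new Date(bucket.start.getTime() + (i + 1) * step),
    members: [],
  }));
}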

As a final step, the LdesDiskWriter stores the LDES as a set of files on disk, in this repository, in the docs/ directory.
The decision to store the feed as a set of files in the docs/ directory follows from the requirement to run the pipeline automatically using GitHub Actions (see Automation below).

Execution

Executing the pipeline takes a single command once its dependencies are installed:

# Make sure you are in the right directory where the pipeline is defined.
cd pipeline

# Install the dependencies 
npm i

# Execute the pipeline
npx @rdfc/js-runner dumps-to-feed-pipeline.ttl

Once the dump is updated, it suffices to re-execute the pipeline. The processors keep state about the entities already processed, so only new (or updated) entities in the dump are processed and added to the existing LDES.
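
A minimal sketch of how such incremental state could work, assuming the state maps each entity IRI to a hash of its last-seen description; the actual state format of DumpsToFeed may differ.

import { Level } from "level";
import { createHash } from "node:crypto";

// Same on-disk location the workflow persists, per the Automation section.
const state = new Level<string, string>("pipeline/leveldb/state-of-belgium");

// Returns true if the entity was never seen before or changed since the last run.
async function isNewOrUpdated(iri: string, canonicalDescription: string): Promise<boolean> {
  const hash = createHash("sha256").update(canonicalDescription).digest("hex");
  const previous = await state.get(iri).catch(() => undefined);
  if (previous === hash) return false; // unchanged: skip this entity
  await state.put(iri, hash);          // remember it for the next run
  return true;
}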

However, with the GitHub Action described in the next section, you should not have to execute the pipeline manually.

Automation

This PR leverages GitHub Actions and GitHub Pages to generate and host the feed.

Using the create-feed.yml workflow, the LDES feed is updated automatically on every push to master.
The workflow stores the state of the bucketizer in pipeline/feed-state/buckets_save.json, the state of the DumpsToFeed processor in pipeline/leveldb/state-of-belgium/, and the contents of the LDES in the docs/ folder.
The docs/ folder was chosen because of a limitation of GitHub Pages: a branch can only be deployed from its root or from docs/.
As the next step is automatically hosting the LDES using GitHub Pages, the LDES source is stored at docs/.
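
For illustration, a minimal sketch of what such a workflow could look like; the actual create-feed.yml in this PR may differ in step names and details.

name: Create feed
on:
  push:
    branches: [master]

jobs:
  create-feed:
    runs-on: ubuntu-latest
    permissions:
      contents: write # needed to commit the state and the docs/ feed back
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm i
        working-directory: pipeline
      - run: npx @rdfc/js-runner dumps-to-feed-pipeline.ttl
        working-directory: pipeline
      - name: Commit the updated state and feed
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add pipeline/feed-state pipeline/leveldb docs
          git commit -m "Update LDES feed" || echo "Nothing to commit"
          git push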

It is important that the GitHub repository settings are configured correctly:

  • Make sure the GH Action has permission to write its state and feed to the repository: set Settings > Actions > General > Workflow permissions to Read and write permissions.
  • Make sure GH Pages is configured to publish the docs/ folder: under Settings > Pages > Build and deployment, set Source to Deploy from a branch, and Branch to master with /docs.

Shortly after, the LDES will be accessible at https://fedict.github.io/dcat/index.trig, just as my version is already accessible now at https://smessie.github.io/dcat/index.trig.

You can verify this by running the ldes-client over the LDES:

npx ldes-client https://smessie.github.io/dcat/index.trig
# OR
npx ldes-client https://fedict.github.io/dcat/index.trig

You can then also set the website URL of this GitHub repository (About > Website) to https://fedict.github.io/dcat/index.trig so people can easily discover the LDES feed hosted by this repository.

@barthanssens (Member) commented

Thanks for the contribution!
I'm working on deploying it in our infrastructure instead of GitHub Actions.
