Automatically generate DCAT-AP Feed from dump #1
This PR introduces an RDF-Connect pipeline that creates an LDES feed from the DCAT-AP dump `datagovbe_edp.xml(.gz)`.
The Pipeline
The pipeline consists of multiple processors that run one after the other.
First, the gzipped `datagovbe_edp.xml.gz` file is read from disk.
It is passed to the gunzip processor, which decompresses it and passes its contents on to the DumpsToFeed processor.
This DumpsToFeed processor extracts all entities from the dump.
It first finds all focus nodes it needs to consider in the dump, by extracting all nodes corresponding to one of the standalone entities, as per 1.1.1 Standalone entities.
Once it has all focus nodes, it extracts their contents using the member extraction algorithm with the DCAT-AP Feed ActivityShape.
The entities are embedded and described as activities using the ActivityStreams 2.0 ontology (https://www.w3.org/ns/activitystreams#).
The entities found are then streamed to the next processors, Sdsify, Bucketize, and LdesDiskWriter, which persist them as an LDES feed on disk.
During the bucketization step, the fragmentation of the LDES is determined. In this PR, it is configured to generate a time-based fragmentation in which a bucket can hold at most 100 members. If this limit is exceeded, the bucket is split over time into 4 equal buckets, with a minimum bucket span of 1 day. Once that minimum span is reached, additional pages are created for the bucket as needed.
As a final step, the LdesDiskWriter stores the LDES as a set of files on disk, in this repository, in the `docs/` directory. The decision to store the LDES as a set of files in the `docs/` directory follows from the requirement to run the pipeline and generate the feed automatically using GitHub Actions (see Automation below).

Execution
To execute the pipeline, it suffices to run a single command once you have installed its dependencies.
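For illustration, a full run could look like the following (a sketch only: the `pipeline/` directory, the `pipeline.ttl` file name, and the `js-runner` invocation are assumptions about the repository layout, not verbatim from this PR):

```sh
# Sketch of a manual run; directory, file, and runner names are assumptions.
cd pipeline
npm install                 # install the pipeline's dependencies (once)
npx js-runner pipeline.ttl  # run the RDF-Connect pipeline: dump -> LDES in docs/
```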
Once the dump is updated, it suffices to re-run the pipeline. The processors keep state about the entities they have already processed, so only new (or updated) entities in the dump are processed and appended to the existing LDES.
However, with the GitHub Action described in the next section, you should not have to execute the pipeline manually.
Automation
This PR leverages GitHub Actions and GitHub Pages to generate and host the feed.
Using the `create-feed.yml` workflow, the LDES feed is updated automatically on every push to master. The workflow stores the state of the bucketizer in `pipeline/feed-state/buckets_save.json`, the state of the DumpsToFeed processor in `pipeline/leveldb/state-of-belgium/`, and the contents of the LDES in the `docs/` folder.
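In shell terms, a workflow run roughly boils down to the following steps (a hedged sketch of the idea, not the literal contents of `create-feed.yml`; the runner invocation and commit message are assumptions):

```sh
# Rough sketch of what create-feed.yml effectively does on each push to master;
# commands and paths are assumptions, not copied from the workflow file.
npm install --prefix pipeline                # install pipeline dependencies
(cd pipeline && npx js-runner pipeline.ttl)  # regenerate the LDES into docs/
git add docs/ pipeline/feed-state/ pipeline/leveldb/
git commit -m "Update LDES feed"             # commit message is an assumption
git push                                     # needs read/write workflow permissions
```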
The choice for `docs/` is due to the limitations of GitHub Pages, which can only be configured to deploy a branch at its root or at `docs/`. As the next step is automatically hosting the LDES using GitHub Pages, the LDES source is stored at `docs/`.
It is important that the GitHub repository settings are configured correctly:

- `Settings > Actions > General > Workflow permissions` is set to `Read and write permissions`.
- GitHub Pages is set up to serve the `docs/` folder. Do so by making sure `Settings > Pages > Build and deployment` is configured as `Deploy from a branch` for Source, and `master` / `docs` for Branch.
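If you prefer to script these settings, the same configuration can presumably be applied through the GitHub REST API, for example with the GitHub CLI (a sketch; it assumes `gh` is installed and authenticated with admin rights on the repository):

```sh
# Sketch using the GitHub CLI; endpoints are from the GitHub REST API.
# Give workflows read and write permissions:
gh api -X PUT repos/fedict/dcat/actions/permissions/workflow \
  -f default_workflow_permissions=write
# Enable GitHub Pages, deploying the docs/ folder of the master branch:
gh api -X POST repos/fedict/dcat/pages \
  -f "source[branch]=master" -f "source[path]=/docs"
```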
Then shortly after, the LDES will be accessible at https://fedict.github.io/dcat/index.trig, just like my version is already accessible now at https://smessie.github.io/dcat/index.trig.

You can verify by running the ldes-client over the LDES:

```sh
npx ldes-client https://smessie.github.io/dcat/index.trig
# OR
npx ldes-client https://fedict.github.io/dcat/index.trig
```
You can then also configure the URL for your GitHub repository at `About > Website` to https://fedict.github.io/dcat/index.trig so people can easily find out about the LDES feed hosted by this repository.
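That setting, too, can be applied from the command line (again a sketch, assuming the GitHub CLI is installed and authenticated):

```sh
# Set the repository's About > Website URL via the GitHub CLI.
gh repo edit fedict/dcat --homepage "https://fedict.github.io/dcat/index.trig"
```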