77 changes: 77 additions & 0 deletions docs/node/guides/lab21.md
@@ -0,0 +1,77 @@
---
sidebar_position: 20
---

# Lab 21: Delta Sharing

This lab provides an example configuration for VILLASnode's `delta_sharing` node-type.

The example connects to the public open-source Delta Sharing server at `https://sharing.delta.io/delta-sharing/`.

The `delta_sharing` node connects to the server named in the share file that the configuration references.
It then reads the requested table, which is identified in `table_path` by its schema and table name.

## VILLASnode configuration file

### Delta Sharing client

``` url="external/node/etc/labs/lab21.conf" title="node/etc/labs/lab21.conf"
nodes = {
    node1 = {
        type = "delta_sharing"
        profile_path = "open-datasets.share"
        table_path = "open-datasets.share#delta_sharing.default.COVID_19_NYT"
        cache_dir = "cache"
        op = "read"
    },
    node2 = {
        type = "file"
        uri = "delta_output.dat"
        in = {
            epoch_mode = "direct"
            read_mode = "all"
            eof = "stop"
        }
        out = {

        }
    }

}
paths = (
    {
        in = "node1"
        out = "node2"
    }
)
```

### Share file

``` url="external/node/etc/labs/open-datasets.share" title="node/etc/labs/open-datasets.share"
{
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.delta.io/delta-sharing/",
    "bearerToken": "faaie590d541265bcab1f2de9813274bf233"
}
```
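As a rough illustration of how a client might use such a share file, the following Python sketch parses a profile and builds the request that lists the shares on the server. The helper names `load_profile` and `list_shares_request` are hypothetical and not part of VILLASnode; the sketch assumes the `/shares` endpoint and bearer-token header described by the Delta Sharing protocol, and it only constructs the request without sending it.

```python
import json
import urllib.request

# Inline copy of a profile for illustration; the endpoint and token are placeholders.
PROFILE_TEXT = """
{
    "shareCredentialsVersion": 1,
    "endpoint": "https://example.invalid/delta-sharing/",
    "bearerToken": "example-token"
}
"""


def load_profile(text: str) -> dict:
    """Parse a profile and check the three fields used in this lab (hypothetical helper)."""
    profile = json.loads(text)
    for key in ("shareCredentialsVersion", "endpoint", "bearerToken"):
        if key not in profile:
            raise ValueError(f"profile is missing required field: {key}")
    return profile


def list_shares_request(profile: dict) -> urllib.request.Request:
    """Build, but do not send, a GET request for the server's list of shares."""
    url = profile["endpoint"].rstrip("/") + "/shares"
    headers = {"Authorization": f"Bearer {profile['bearerToken']}"}
    return urllib.request.Request(url, headers=headers, method="GET")


request = list_shares_request(load_profile(PROFILE_TEXT))
print(request.get_method(), request.full_url)
```

Passing the resulting request to `urllib.request.urlopen` would perform the actual call; that step is left out here so the sketch runs without network access.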

This configuration reads from Delta Sharing tables using Apache Arrow/Parquet; writing to tables is planned. Files downloaded from the server are cached locally.

The cache directory defaults to `cache` in the current working directory unless another location is set with the `cache_dir` parameter.
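The default described above can be pictured with a small Python sketch. The function `resolve_cache_dir` is hypothetical, and resolving a relative `cache_dir` against the current working directory is an assumption rather than confirmed node behaviour.

```python
from pathlib import Path
from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str] = None) -> Path:
    """Return the cache directory, falling back to <cwd>/cache (hypothetical helper)."""
    # Assumption: a relative cache_dir is interpreted relative to the working directory.
    # An absolute cache_dir replaces the working directory entirely.
    return Path.cwd() / (cache_dir if cache_dir else "cache")


print(resolve_cache_dir())            # default location
print(resolve_cache_dir("my_cache"))  # explicit cache_dir
```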

Supported keys in the configuration:

- `profile_path`: path to a Delta Sharing profile JSON file.
- `table_path`: path of the table, giving the server, share, schema and table in the format `server#share.schema.table`.
- `batch_size`: batch size used for parsing rows in the Arrow table. Currently not implemented.
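To make the `server#share.schema.table` format concrete, here is a minimal Python sketch that splits a table path into its four parts. The names `TablePath` and `parse_table_path` are hypothetical, and the node's own parser may behave differently, for example in how it reports malformed input.

```python
from typing import NamedTuple


class TablePath(NamedTuple):
    server: str
    share: str
    schema: str
    table: str


def parse_table_path(table_path: str) -> TablePath:
    """Split 'server#share.schema.table' into its components (hypothetical helper)."""
    # Split on the last '#' first, because the server part may itself contain dots.
    server, separator, remainder = table_path.rpartition("#")
    if not separator or not server:
        raise ValueError("expected '#' between the server and the share")
    parts = remainder.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected share.schema.table after '#'")
    return TablePath(server, *parts)


print(parse_table_path("open-datasets.share#delta_sharing.default.COVID_19_NYT"))
```

Splitting on `#` before splitting on `.` matters here because the server part in this lab, `open-datasets.share`, contains a dot.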

The output is then written to a `.dat` file using the `file` node-type.

To start the Delta Sharing node, run the following in a terminal:

```shell
villas node lab21.conf
```

The data received from the remote table should then be displayed in the terminal and written to the `.dat` file.

47 changes: 47 additions & 0 deletions docs/node/nodes/delta_sharing.md
@@ -0,0 +1,47 @@
---
hide_table_of_contents: true
---

# Delta Sharing

The `delta_sharing` node type integrates with a Delta Sharing server to read from Delta tables using Apache Arrow.

## Prerequisites

- A reachable Delta Sharing server and a valid Delta Sharing profile path (`profile_path`).
Contributor

Can we please add some hyperlinks to the project of the delta sharing server, as well as add the exact name of the libraries and links which need to be installed as a pre-requisite?

Please make sure to add these new dependencies also to this table, as well as the two command invocations below it (apt|dnf install):

- Apache Arrow and Parquet are required at build time. They are core dependencies for this node type.
- A local cache directory to store the downloaded parquet files.

Supported Keys:

- `profile_path` (string, required): Path to a Delta Sharing profile file.
Contributor

Are these referring to the configuration file? If so, please move them to the configuration section.

Please try to follow the same structure as in the other node-type pages.

- `cache_dir` (string, optional): Local directory for caching fetched parquet files.
- `table_path` (string, required for `read`/`write`): Table path in the format `server#share.schema.table`.
- `op` (string, optional): One of `read`, `write`, `noop`. Defaults to `noop`.
Contributor

Please try to avoid abbreviations for options. A full operation is easier to understand by the user.

- `batch_size` (integer, optional): Batch size for chunk I/O (currently not implemented).
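The rules in the key list above (which keys are required, and the default for `op`) can be summarised in a short Python sketch. The function `validate_node` is hypothetical and only mirrors the documented constraints; it is not the node's actual configuration parser.

```python
VALID_OPS = ("read", "write", "noop")


def validate_node(node: dict) -> dict:
    """Check a delta_sharing node configuration against the documented keys (hypothetical helper)."""
    config = dict(node)
    # `op` is optional and defaults to `noop`.
    config.setdefault("op", "noop")
    if config["op"] not in VALID_OPS:
        raise ValueError(f"op must be one of {VALID_OPS}")
    # `profile_path` is always required.
    if "profile_path" not in config:
        raise ValueError("profile_path is required")
    # `table_path` is required only when the node reads or writes a table.
    if config["op"] in ("read", "write") and "table_path" not in config:
        raise ValueError("table_path is required for read/write")
    # `batch_size` is optional, but must be an integer when present.
    if "batch_size" in config and not isinstance(config["batch_size"], int):
        raise ValueError("batch_size must be an integer")
    return config


print(validate_node({"type": "delta_sharing", "profile_path": "dataset.share"}))
```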

## Behaviour

- On start, the node initializes a Delta Sharing client from `profile_path` and lists the available shares, schemas and tables.
- For `op = read`, the node parses `table_path`, populates the cache from each file and loads the first file as an Arrow table. It then maps the Arrow types to the datatypes supported by VILLASnode.
- For `op = write`, the node constructs an in-memory Arrow `Table` from outgoing VILLASnode samples based on the supported signal types. The current implementation does not yet upload to a Delta Sharing server.
- Supported datatypes for reading are DOUBLE, FLOAT, INT64 and INT32. Other types are classified as unsupported and filled with default values.
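The type mapping in the last bullet can be sketched in Python as follows. This is an illustration only: the target names `float` and `integer`, the fallback value `0.0`, and the helpers `signal_type` and `convert_row` are assumptions, and the node itself works on Arrow tables rather than plain Python lists.

```python
from typing import Any, List, Sequence, Tuple

# Assumed mapping from the Arrow type names listed above to signal types.
ARROW_TO_SIGNAL = {
    "DOUBLE": "float",
    "FLOAT": "float",
    "INT64": "integer",
    "INT32": "integer",
}

# Assumption: unsupported columns are filled with 0.0; the real default may differ.
FALLBACK_VALUE = 0.0


def signal_type(arrow_type: str) -> str:
    """Return the assumed signal type for an Arrow type name, or 'unsupported'."""
    return ARROW_TO_SIGNAL.get(arrow_type.upper(), "unsupported")


def convert_row(schema: Sequence[Tuple[str, str]], row: Sequence[Any]) -> List[Any]:
    """Convert one row, replacing values of unsupported columns with the fallback."""
    values: List[Any] = []
    for (_name, arrow_type), raw in zip(schema, row):
        kind = signal_type(arrow_type)
        if kind == "float":
            values.append(float(raw))
        elif kind == "integer":
            values.append(int(raw))
        else:
            values.append(FALLBACK_VALUE)
    return values


schema = [("cases", "INT32"), ("rate", "DOUBLE"), ("county", "STRING")]
print(convert_row(schema, [5, 1.5, "Kings"]))
```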

## Example

``` url="external/node/etc/examples/nodes/delta_sharing.conf" title="node/etc/examples/nodes/delta_sharing.conf"

nodes = {
delta_node = {
type = "delta_sharing"


Comment on lines +35 to +37
Contributor

Suggested change
type = "delta_sharing"
type = "delta_sharing"

### The following settings are specific to the delta sharing node type!! ###

profile_path = "dataset.share" # Path to the profile file that stores the server credentials
table_path = "dataset.share#delta_sharing.default.example_table" # Table path in the format server#share.schema.table
cache_dir = "cache" # Path to the local cache directory

op = "read" # One of read, write or noop
}
}