Add documentation for Delta Sharing node type #129
base: node-delta-sharing
Changes from 2 commits
565ee20
              0bdeb2c
              efe69e8
              0b3dd90
              3c97f5b
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| --- | ||
| sidebar_position: 20 | ||
| --- | ||
|  | ||
| # Lab 21: Delta Sharing | ||
|  | ||
| This lab contains an example configuration for VILLASnode's `delta_sharing` node-type. | ||
|  | ||
| The example connects to an open-source Delta Sharing server available at "https://sharing.delta.io/delta-sharing/". | ||
|  | ||
| The `delta_sharing` node connects to the server named in the share file that the configuration references. | ||
| It then reads the requested table, which is identified by its schema and table name. | ||
|  | ||
| ## VILLASnode configuration file | ||
|  | ||
| ### Delta Sharing client | ||
|  | ||
| ``` url="external/node/etc/labs/lab21.conf" title="node/etc/labs/lab21.conf" | ||
| nodes = { | ||
| node1 = { | ||
| type = "delta_sharing" | ||
| profile_path = "open-datasets.share" | ||
| table_path = "open-datasets.share#delta_sharing.default.COVID_19_NYT" | ||
| cache_dir = "cache" | ||
| op = "read" | ||
| }, | ||
| node2 = { | ||
| type = "file" | ||
| uri = "delta_output.dat" | ||
| in = { | ||
| epoch_mode = "direct" | ||
| read_mode = "all" | ||
| eof = "stop" | ||
| } | ||
| out = { | ||
|  | ||
| } | ||
| } | ||
|  | ||
| } | ||
| paths = ( | ||
| { | ||
| in = "node1" | ||
| out = "node2" | ||
| } | ||
| ) | ||
| ``` | ||
|  | ||
| ### Share file | ||
|  | ||
| ``` url="external/node/etc/labs/open-datasets.share" title="node/etc/labs/open-datasets.share" | ||
| { | ||
| "shareCredentialsVersion": 1, | ||
| "endpoint": "https://sharing.delta.io/delta-sharing/", | ||
| "bearerToken": "faaie590d541265bcab1f2de9813274bf233" | ||
| } | ||
| ``` | ||
|  | ||
| This configuration reads from Delta Sharing tables using Apache Arrow/Parquet; writing to tables is planned. Files downloaded from the server are cached locally. | ||
|  | ||
| The default cache directory is `cache` in the current working directory, unless another one is set with the `cache_dir` parameter. | ||
|  | ||
| Supported keys in the configuration: | ||
|  | ||
| - `profile_path`: path to a Delta Sharing profile JSON file. | ||
| - `table_path`: path of the table, naming the server, share, schema and table in the format `server#share.schema.table`. | ||
| - `batch_size`: batch size used for parsing rows in the Arrow table. Currently not implemented. | ||
|  | ||
| The output is then written to a .dat file using the `file` node-type. | ||
|  | ||
| To start the delta sharing node, in a terminal: | ||
|  | ||
| ```shell | ||
| villas node lab21.conf | ||
| ``` | ||
|  | ||
| The data received from the remote table should then be displayed in the terminal and written to the `delta_output.dat` file. | ||
|  | 
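
The `server#share.schema.table` format used by `table_path` can be illustrated with a small parser. This is only a sketch in Python, not the node's actual implementation: the names `TablePath` and `parse_table_path` are hypothetical, and it assumes that the part before `#` is the profile file path and that the part after it has exactly three dot-separated components.

```python
from typing import NamedTuple


class TablePath(NamedTuple):
    """Hypothetical container for the four parts of a table path."""
    profile: str  # path to the share (profile) file, e.g. "open-datasets.share"
    share: str
    schema: str
    table: str


def parse_table_path(table_path: str) -> TablePath:
    """Split 'server#share.schema.table' into its four parts."""
    # Split on the last '#': the profile path may itself contain dots,
    # so it has to be separated before the fragment is split on '.'.
    profile, sep, fragment = table_path.rpartition("#")
    if not sep or not profile:
        raise ValueError(f"missing 'server#' prefix in {table_path!r}")

    parts = fragment.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'share.schema.table' after '#', got {fragment!r}")

    share, schema, table = parts
    return TablePath(profile, share, schema, table)


if __name__ == "__main__":
    # The table path from lab21.conf above.
    print(parse_table_path("open-datasets.share#delta_sharing.default.COVID_19_NYT"))
```

For the lab configuration this yields the profile `open-datasets.share`, the share `delta_sharing`, the schema `default` and the table `COVID_19_NYT`.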
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,47 @@ | ||||||||
| --- | ||||||||
| hide_table_of_contents: true | ||||||||
| --- | ||||||||
|  | ||||||||
| # Delta Sharing | ||||||||
|  | ||||||||
| The `delta_sharing` node type integrates with a Delta Sharing server to read from Delta tables using Apache Arrow. | ||||||||
|  | ||||||||
| ## Prerequisites | ||||||||
|  | ||||||||
| - A reachable Delta Sharing server and a valid Delta Sharing profile path (`profile_path`). | ||||||||
| - Apache Arrow and Parquet are required at build time. They are core dependencies for this node type. | ||||||||
| - A local cache directory to store the downloaded Parquet files. | ||||||||
|  | ||||||||
| Supported Keys: | ||||||||
|  | ||||||||
| - `profile_path` (string, required): Path to a Delta Sharing profile file. | ||||||||
Are these referring to the configuration file? If so, please move them to the configuration section. Please try to follow the same structure as in the other node-type pages.
| - `cache_dir` (string, optional): Local directory for caching fetched Parquet files. | ||||||||
| - `table_path` (string, required for `read`/`write`): Table path in the format `server#share.schema.table`. | ||||||||
| - `op` (string, optional): One of `read`, `write`, `noop`. Defaults to `noop`. | ||||||||
Please try to avoid abbreviations for options. A full
| - `batch_size` (integer, optional): Batch size for chunk I/O (currently not implemented). | ||||||||
|  | ||||||||
| ## Behaviour: | ||||||||
|  | ||||||||
| - On start, the node initializes a Delta Sharing Client from `profile_path` and lists available shares, schemas and tables. | ||||||||
| - For `op=read`, the node parses `table_path`, populates the cache from each file, and loads the first file as an Arrow table. It then maps Arrow types to the datatypes supported by VILLASnode. | ||||||||
| - For `op=write`, the node constructs an in-memory Arrow `Table` from outgoing VILLASnode samples based on the supported signal types. The current implementation does not yet upload to a Delta Sharing server. | ||||||||
| - Supported datatypes for reading are DOUBLE, FLOAT, INT64, INT32. Others are classified as unsupported and filled with defaults. | ||||||||
|  | ||||||||
| ## Example | ||||||||
|  | ||||||||
| ``` url="external/node/etc/examples/nodes/delta_sharing.conf" title="node/etc/examples/nodes/delta_sharing.conf" | ||||||||
|  | ||||||||
| nodes = { | ||||||||
| delta_node = { | ||||||||
| type = "delta_sharing" | ||||||||
|  | ||||||||
|  | ||||||||
|  | ||||||||
Comment on lines +35 to +37. Suggested change
| ### The following settings are specific to the delta sharing node type!! ### | ||||||||
|  | ||||||||
| profile_path = "dataset.share" # This specifies the URI where the server credentials are saved | ||||||||
| table_path = "dataset.share#delta_sharing.default.example_table" # Table to access, in the format server#share.schema.table | ||||||||
| cache_dir = "cache" # This specifies the uri for the cache directory | ||||||||
|  | ||||||||
| op = "read" # One of read, write or noop | ||||||||
| } | ||||||||
| } | ||||||||
Can we please add some hyperlinks to the project of the delta sharing server, as well as add the exact name of the libraries and links which need to be installed as a pre-requisite?
Please make sure to add these new dependencies also to this table, as well as the two command invocations below it (apt|dnf install):