docs: add databricks integration documentation (#100)
* docs: Add new integrations section
* docs: add documentation for the connectors.
* fix(linting): code formatting

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
docs/integrations/databricks/integration_connectors_catalog.md
# Connectors & Catalog

^^[YData Fabric](https://ydata.ai/products/fabric)^^ provides a seamless integration with Databricks, allowing you to connect,
query, and manage your data in Databricks Unity Catalog and Delta Lake with ease. This section will guide you through the benefits,
setup, and usage of the Databricks connectors available in Fabric.

!!! note "Prerequisites"
    Before using the Databricks connectors in YData Fabric, ensure the following prerequisites are met:

    - Access to a Databricks workspace
    - A valid YData Fabric account and API key
    - Credentials for Databricks (tokens, Databricks host, warehouse, database, schema, etc.)

## Delta Lake

Databricks Delta Lake is an open-source storage layer that brings reliability to data lakes. Built on top of Apache Spark,
Delta Lake provides ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees,
scalable metadata handling, and unifies streaming and batch data processing.

This tutorial covers how you can leverage ^^[YData Fabric connectors](../../data_catalog/connectors/supported_connections.md)^^
to integrate with Databricks Delta Lake.

### Setting Up the Delta Lake Connector

To create a Delta Lake connector in the YData Fabric UI you need to meet the ^^[following prerequisites](overview.md)^^.

#### Step-by-step creation through the UI

To create a connector in YData Fabric, select the *"Connectors"* page from the left-side menu, as illustrated in the image below.

{: style="width:75%"}

Now, click the *"Create Connector"* button and the following menu with the available connectors will be shown.

{: style="width:50%"}

Depending on the cloud vendor where your Databricks instance is deployed, select the Delta Lake connector for AWS or Azure.
After selecting the connector type *"Databricks Delta Lake"*, the below menu will be shown.
This is where you can configure the connection to your Delta Lake. For that you will need the following information:

{: style="width:45%; padding-right:10px", align=left}

- **Databricks Host:** The URL of your Databricks cluster
- **Access token:** Your Databricks user token
- **Catalog:** The name of the catalog that you want to connect to
- **Schema:** The name of the schema that you want to connect to

Depending on the cloud selected, you will be asked for the credentials to your staging storage (**AWS S3** or **Azure Blob Storage**).
In this example we are using AWS and for that reason the below inputs refer to *AWS S3*.

- **Key ID:** The AWS access key ID for the S3 staging storage.
- **Key Secret:** The AWS secret access key associated with the key ID above.

And finally, the name for your connector:

- **Display name:** A unique name for your connector.
<br><br>
Test your connection and that's it! 🚀

You are now ready to create different **Datasources** using this connector - read the data from a table,
evaluate the quality of the data, or even read a full database and generate a synthetic replica of your data!
Read more about ^^[Fabric Datasources in here](../datasources/index.md)^^.

### Use it inside the Labs

👨💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Databricks%20_%20Delta%20Lake.ipynb)^^.

In case you prefer a Python interface, we also have connectors available through the Fabric SDK inside the Labs.
For a seamless integration between the UI and the Labs environment, Fabric offers an SDK that allows you to re-use connectors,
datasources and even synthesizers.

Start by creating your code environment through the Labs.
In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^.

```python
# Importing YData's packages
from ydata.labs import Connectors

# Getting a previously created Connector
connector = Connectors.get(uid='insert-connector-id',
                           namespace='insert-namespace-id')
print(connector)
```

#### Read from your Delta Lake

Using the Delta Lake connector, it is possible to do the following (see the sketch after this list):

- Get the data from a Delta Lake table
- Get a sample from a Delta Lake table
- Get the data from a query to a Delta Lake instance
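
The snippet below is a minimal sketch of what this looks like inside a Lab, reusing the `connector` object retrieved above. The method names and parameters (`read_table`, `sample_size`, `query`) are assumptions for illustration and may differ from the exact connector API - refer to the full example notebook linked above for the definitive interface.

```python title="Read data with the Delta Lake connector (illustrative)"
# NOTE: method names and arguments below are assumptions for illustration only.

# Read a full table from the catalog/schema configured in the connector
table = connector.read_table(table_name='insert-table-name')

# Read only a sample of the same table
sample = connector.read_table(table_name='insert-table-name', sample_size=100)

# Run an ad-hoc query against the Delta Lake instance
result = connector.query("SELECT * FROM insert-schema-name.insert-table-name LIMIT 10")

print(table)
```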

## Unity Catalog

Databricks Unity Catalog is a unified governance solution for all data and AI assets within the Databricks Lakehouse Platform.

Databricks Unity Catalog leverages the concept of [Delta Sharing](https://www.databricks.com/product/delta-sharing),
meaning this is a great way not only to ensure alignment between Catalogs but also to limit access to data.
This means that by leveraging the Unity Catalog connector, users can only access the set of data assets that were authorized
for a given Share.

### Step-by-step creation through the UI

:fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=_12AfMB8hiQ&t=2s"><u>How to create a connector to Databricks Unity Catalog in Fabric?</u></a>

The process to create a new *Databricks Unity Catalog* connector in YData Fabric is similar to the one covered above for Delta Lake.

After selecting the connector *"Databricks Unity Catalog"*, you will be requested to upload your Delta Sharing token as
depicted in the image below.

{: style="width:50%"}

Test your connection and that's it! 🚀
### Use it inside the Labs | ||
|
||
👨💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Databricks%20_%20Unity%20Catalog.ipynb)^^. | ||
|
||
In case you prefer a Python interface, we also have connectors available through Fabric inside the labs. | ||
Start by creating your code environment through the Labs. In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^. | ||
|
||
#### Navigate your Delta Share | ||
With your connector created you are now able to explore the schemas and tables available in a Delta share. | ||
|
||
```python title="List available shares" | ||
#List the available shares for the provided authentication | ||
connector.list_shares() | ||
``` | ||
|
||
```python title="List available schemas" | ||
#List the available schemas for a given share | ||
connector.list_schemas(share_name='teste') | ||
``` | ||
|
||
```python title="List available tables" | ||
#List the available tables for a given schema in a share | ||
connector.list_tables(schema_name='berka', | ||
share_name='teste') | ||
|
||
#List all the tables regardless of share and schema | ||
connector.list_all_tables() | ||
``` | ||

#### Read from your Delta Share

Using the Unity Catalog connector, it is possible to:

- Get the data from a Delta Lake table
- Get a sample from a Delta Lake table

```python title="Read the data from a table"
# This method reads all the data records in the table
table = connector.read_table(table_name='insert-table-name',
                             schema_name='insert-schema-name',
                             share_name='insert-share-name')
print(table)
```

```python title="Read a data sample from a table"
# This method reads only a sample of the data records in the table
table = connector.read_table(table_name='insert-table-name',
                             schema_name='insert-schema-name',
                             share_name='insert-share-name',
                             sample_size=100)
print(table)
```

I hope you enjoyed this quick tutorial on seamlessly integrating Databricks with your data preparation workflows. 🚀
# YData SDK in Databricks Notebooks

The [YData Fabric SDK](https://pypi.org/project/ydata-sdk/) provides a powerful set of tools for integrating and enhancing data within Databricks notebooks.
This guide covers the installation, basic usage, and advanced features of the Fabric SDK, helping users maximize
the potential of their data for AI and machine learning applications.

👨💻 ^^[Full code example and recipe can be found here](https://raw.githubusercontent.com/ydataai/academy/master/5%20-%20Integrations/databricks/YData%20Fabric%20SDK%20in%20Databricks%20notebooks)^^.

!!! note "Prerequisites"
    Before using the YData Fabric SDK in Databricks notebooks, ensure the following prerequisites are met:

    - Access to a Databricks workspace
    - A valid YData Fabric account and API key
    - Basic knowledge of Python and Databricks notebooks
    - A secure connection between your Databricks cluster and Fabric

**Best Practices**

- *Data Security:* Ensure API keys and sensitive data are securely managed (see the secret-scope sketch in the *Connecting to YData Fabric* section below).
- *Efficient Coding:* Use vectorized operations for data manipulation where possible.
- *Resource Management:* Monitor and manage the resources used by your Databricks and Fabric clusters to optimize performance.

### Installation

To install the YData SDK in a Databricks notebook, use the following command:

```python
%pip install ydata-sdk
dbutils.library.restartPython()
```

Ensure the installation is successful before proceeding to the next steps.

## Basic Usage - data integration

This section provides step-by-step instructions on connecting to YData Fabric and performing essential
data operations using the YData SDK within Databricks notebooks. This includes establishing a secure connection
to YData Fabric and accessing datasets.

### Connecting to YData Fabric

First, establish a connection to YData Fabric using your API key:

```python
import os

# Add your Fabric token as part of your environment variables for authentication
os.environ["YDATA_TOKEN"] = '<TOKEN>'
```
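
Hard-coding the token in a notebook is convenient for quick tests but discouraged in shared workspaces. A safer pattern - sketched below under the assumption that a Databricks secret scope named `ydata` with a key `fabric-token` already exists (both names are placeholders) - is to read the token from Databricks secrets at runtime:

```python
import os

# Read the Fabric token from a Databricks secret scope instead of hard-coding it.
# The scope ("ydata") and key ("fabric-token") names are assumptions - use your own.
os.environ["YDATA_TOKEN"] = dbutils.secrets.get(scope="ydata", key="fabric-token")
```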

### Data access & manipulation

Once connected, you can access and manipulate data within YData Fabric. For example, to list the available datasources:

```python
from ydata.sdk.datasources import DataSource

# Return the list of available DataSources
DataSource.list()
```

To retrieve an existing datasource by its identifier:

```python
# Get the data from an existing datasource
dataset = DataSource.get('<DATASOURCE-ID>')
```

## Advanced Usage - Synthetic data generation

This section explores one of the most powerful features of the Fabric SDK for enhancing and refining data
within Databricks notebooks: generating synthetic data to augment
datasets or to produce privacy-preserving data.
By leveraging these advanced capabilities, users can significantly enhance the robustness and performance of their AI
and machine learning models, unlocking the full potential of their data.

### Privacy-preserving

Leveraging synthetic data allows you to create privacy-preserving datasets that maintain real-world value,
enabling you to work with sensitive information securely while retaining the utility of real data.

Check the SDK documentation for more information regarding [privacy-controls and anonymization](../../sdk/examples/synthesize_with_privacy_control.md).
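
As a quick illustration of that capability - based on the privacy-control example in the SDK documentation linked above, so the exact parameter names should be double-checked there - a training call can request a higher privacy level:

```python title='Train with a higher privacy level (illustrative)'
from ydata.sdk.synthesizers import PrivacyLevel, RegularSynthesizer

# Request a privacy-oriented trade-off when fitting the synthesizer
synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset, privacy_level=PrivacyLevel.HIGH_PRIVACY)
```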

#### From a datasource in YData Fabric

Users can generate synthetic data from datasources existing in Fabric:

```python title="Train a synthetic data generator"
# From an existing Fabric datasource
from ydata.sdk.synthesizers import RegularSynthesizer

synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset)
```

After your synthetic data generator has been trained successfully, you can generate as many synthetic datasets as needed:

```python title='Sampling from the model that we have just trained'
sample = synth.sample(100)
sample.head()
```

It is also possible to generate data from other synthetic data generation models previously trained:

```python title='Generating synthetic data from a previously trained model'
from ydata.sdk.synthesizers import RegularSynthesizer

existing_synth = RegularSynthesizer('<INSERT-SYNTHETIC-DATA-GENERATOR-ID>').get()
sample = existing_synth.sample(100)
```

#### From a datasource in Databricks

Another important integration is to train a synthetic data generator from a dataset that you are currently exploring
in your notebook environment.
In order to do so, we recommend that you create your dataset using the
[YData Fabric integration connector to your Delta Lake](integration_connectors_catalog.md) and follow the flow for the creation
of a synthetic data generation model from existing Fabric datasources.

For a small dataset you can also follow [this tutorial](../../sdk/examples/synthesize_tabular_data.md); a minimal sketch of that path is shown below.
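
The sketch below illustrates that small-dataset path under the assumption that the table (here, the `ydata.default.credit_scoring_labeled` table used later in this guide) comfortably fits in the driver's memory; it reuses the `RegularSynthesizer` interface shown above.

```python title='Train directly from a pandas DataFrame (small datasets only)'
from ydata.sdk.synthesizers import RegularSynthesizer

# Read a small table from the catalog and move it to the driver as a pandas DataFrame
small_df = spark.sql(
    "SELECT * FROM ydata.default.credit_scoring_labeled LIMIT 1000"
).toPandas()

# Train a synthesizer directly on the in-memory DataFrame
small_synth = RegularSynthesizer(name='Synth credit scoring | Small sample')
small_synth.fit(small_df)

# Generate a synthetic sample of the same size as the original
synthetic_small = small_synth.sample(len(small_df))
synthetic_small.head()
```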

### Data augmentation

Another key focus is on generating synthetic data to augment existing datasets.
This technique, particularly through conditional synthetic data generation, allows users to create targeted,
realistic datasets. By addressing data imbalances and enriching the training data, conditional synthetic data generation
significantly enhances the robustness and performance of machine learning (ML) models,
leading to more accurate and reliable outcomes.

```python title='Read data from a delta table'
# Read data from the catalog
df = spark.sql("SELECT * FROM ydata.default.credit_scoring_labeled")

# Display the dataframe
display(df)
```

After reading the data, we need to convert it to a pandas DataFrame in order to create our synthetic data generation model.
For the augmentation use case, we will be leveraging conditional synthetic data generation.

```python title='Training a conditional synthetic data generator'
from ydata.sdk.synthesizers import RegularSynthesizer

# Convert the Spark dataframe to a pandas dataframe and drop the unused ID column
pandas_df = df.toPandas()
pandas_df = pandas_df.drop('ID', axis=1)

# Train a synthetic data generator using ydata-sdk, conditioned on the Label column
synth = RegularSynthesizer(name='Synth credit scoring | Conditional')
synth.fit(pandas_df, condition_on='Label')

# Display the trained synthesizer
display(synth)
```

Now that we have a trained conditional synthetic data generator, we are able to generate samples while controlling the
population behaviour through the column(s) that the generation process was conditioned on.

```python title="Generating a synthetic sample conditioned to column 'Label'"
# Generate synthetic samples conditioned on the Label column
synthetic_sample = synth.sample(
    n_samples=len(pandas_df),
    condition_on={
        "Label": {
            "categories": [{
                "category": 1,
                "percentage": 0.7
            }]
        }
    }
)
```

After generating the synthetic data, we can combine it with our dataset.

```python title='Convert the dataframe to Spark dataframe'
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Create a Spark dataframe from the synthetic dataframe
synthetic_df = spark.createDataFrame(synthetic_sample)

display(synthetic_df)
```

```python title="Combining the datasets"
# Concatenate the original dataframe with the synthetic dataframe,
# removing the column ID as it is not used
df = df.drop('ID')
concatenated_df = df.union(synthetic_df)

# Display the concatenated dataframe
display(concatenated_df)
```

Afterwards you can use your augmented dataset to train a ^^[Machine Learning model using MLflow](https://docs.databricks.com/en/mlflow/tracking-ex-scikit.html)^^.
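
As a closing illustration (not part of the original recipe), the sketch below shows one way to do that with scikit-learn and MLflow autologging. It assumes `Label` is the target column, that the remaining feature columns are numeric, and that the augmented data fits in the driver's memory:

```python title='Train a model on the augmented data with MLflow (illustrative)'
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Move the augmented dataset to the driver as a pandas DataFrame (assumes it fits in memory)
augmented_pdf = concatenated_df.toPandas()
X = augmented_pdf.drop('Label', axis=1)
y = augmented_pdf['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Autolog parameters, metrics and the model to the MLflow tracking server
mlflow.sklearn.autolog()
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))
```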