docs: add databricks integration documentation (#100)
* docs: Add new integrations section
* docs: add documentation for the connectors.
* fix(linting): code formatting

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
docs/integrations/databricks/integration_connectors_catalog.md
# Connectors & Catalog

^^[YData Fabric](https://ydata.ai/products/fabric)^^ provides a seamless integration with Databricks, allowing you to connect,
query, and manage your data in Databricks Unity Catalog and Delta Lake with ease. This section will guide you through the benefits,
setup, and usage of the Databricks connectors available in Fabric.

!!! note "Prerequisites"
    Before using the Databricks connectors in YData Fabric, ensure the following prerequisites are met:

    - Access to a Databricks workspace
    - A valid YData Fabric account and API key
    - Credentials for Databricks (tokens, Databricks host, warehouse, database, schema, etc.)

## Delta Lake

Databricks Delta Lake is an open-source storage layer that brings reliability to data lakes. Built on top of Apache Spark,
Delta Lake provides ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees,
scalable metadata handling, and unifies streaming and batch data processing.

This tutorial covers how you can leverage ^^[YData Fabric connectors](../../data_catalog/connectors/supported_connections.md)^^
to integrate with Databricks Delta Lake.

### Setting Up the Delta Lake Connector

To create a Delta Lake connector in the YData Fabric UI you need to meet the ^^[following prerequisites](overview.md)^^.

#### Step-by-step creation through the UI

To create a connector in YData Fabric, select the *"Connectors"* page from the left-side menu, as illustrated in the image below.

{: style="width:75%"}

Now, click the *"Create Connector"* button and the following menu with the available connectors will be shown.

{: style="width:50%"}

Depending on the cloud vendor where your Databricks instance is deployed, select the Delta Lake connector for AWS or Azure.
After selecting the connector type *"Databricks Delta Lake"*, the below menu will be shown.
This is where you can configure the connection to your Delta Lake. For that you will need the following information:

{: style="width:45%; padding-right:10px", align=left}

- **Databricks Host:** The URL of your Databricks cluster
- **Access token:** Your Databricks user token
- **Catalog:** The name of the catalog that you want to connect to
- **Schema:** The name of the schema that you want to connect to

Depending on the cloud selected, you will be asked for the credentials to your staging storage (**AWS S3** or **Azure Blob Storage**).
In this example we are using AWS and for that reason the below inputs refer to *AWS S3*.

- **Key ID:** The AWS access key ID for the S3 staging storage.
- **Key Secret:** The AWS secret access key associated with the key ID above.

And finally, the name for your connector:

- **Display name:** A unique name for your connector.
<br><br>
Test your connection and that's it! 🚀

You are now ready to create different **Datasources** using this connector - read the data from a table,
evaluate the quality of the data, or even read a full database and generate a synthetic replica of your data!
Read more about ^^[Fabric Datasources in here](../datasources/index.md)^^.

### Use it inside the Labs

👨💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Databricks%20_%20Delta%20Lake.ipynb)^^.

In case you prefer a Python interface, we also have connectors available through the Fabric SDK inside the Labs.
For a seamless integration between the UI and the Labs environment, Fabric offers an SDK that allows you to re-use connectors,
datasources and even synthesizers.

Start by creating your code environment through the Labs.
In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^.

```python
# Importing YData's packages
from ydata.labs import Connectors

# Getting a previously created Connector
connector = Connectors.get(uid='insert-connector-id',
                           namespace='insert-namespace-id')
print(connector)
```

#### Read from your Delta Lake

Using the Delta Lake connector, it is possible to do the following (see the sketch after this list):

- Get the data from a Delta Lake table
- Get a sample from a Delta Lake table
- Get the data from a query to a Delta Lake instance
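
The snippet below is a minimal sketch of what this looks like inside a Lab, reusing the `connector` object retrieved above. The method names and parameters (`read_table`, `sample_size`, `query`) are assumptions for illustration and may differ from the exact connector API - refer to the full example notebook linked above for the definitive interface.

```python title="Read data with the Delta Lake connector (illustrative)"
# NOTE: method names and arguments below are assumptions for illustration only.

# Read a full table from the catalog/schema configured in the connector
table = connector.read_table(table_name='insert-table-name')

# Read only a sample of the same table
sample = connector.read_table(table_name='insert-table-name', sample_size=100)

# Run an ad-hoc query against the Delta Lake instance
result = connector.query("SELECT * FROM insert-schema-name.insert-table-name LIMIT 10")

print(table)
```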

## Unity Catalog

Databricks Unity Catalog is a unified governance solution for all data and AI assets within the Databricks Lakehouse Platform.

Databricks Unity Catalog leverages the concept of [Delta Sharing](https://www.databricks.com/product/delta-sharing),
meaning this is a great way not only to ensure alignment between Catalogs but also to limit access to data.
This means that by leveraging the Unity Catalog connector, users can only access the set of data assets that were authorized
for a given Share.

### Step-by-step creation through the UI

:fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=_12AfMB8hiQ&t=2s"><u>How to create a connector to Databricks Unity Catalog in Fabric?</u></a>

The process to create a new *Databricks Unity Catalog* connector in YData Fabric is similar to the one covered above for Delta Lake.

After selecting the connector *"Databricks Unity Catalog"*, you will be requested to upload your Delta Sharing token as
depicted in the image below.

{: style="width:50%"}

Test your connection and that's it! 🚀
### Use it inside the Labs | ||
|
||
👨💻 ^^[Full code example and recipe can be found here](https://github.com/ydataai/academy/blob/master/1%20-%20Data%20Catalog/1.%20Connectors/Databricks%20_%20Unity%20Catalog.ipynb)^^. | ||
|
||
In case you prefer a Python interface, we also have connectors available through Fabric inside the labs. | ||
Start by creating your code environment through the Labs. In case you need to get started with the Labs, ^^[check this step-by-step guide](../../get-started/create_lab.md)^^. | ||
|
||
#### Navigate your Delta Share | ||
With your connector created you are now able to explore the schemas and tables available in a Delta share. | ||
|
||
```python title="List available shares" | ||
#List the available shares for the provided authentication | ||
connector.list_shares() | ||
``` | ||
|
||
```python title="List available schemas" | ||
#List the available schemas for a given share | ||
connector.list_schemas(share_name='teste') | ||
``` | ||
|
||
```python title="List available tables" | ||
#List the available tables for a given schema in a share | ||
connector.list_tables(schema_name='berka', | ||
share_name='teste') | ||
|
||
#List all the tables regardless of share and schema | ||
connector.list_all_tables() | ||
``` | ||

#### Read from your Delta Share

Using the Unity Catalog connector, it is possible to:

- Get the data from a Delta Lake table
- Get a sample from a Delta Lake table

```python title="Read the data from a table"
# This method reads all the data records in the table
table = connector.read_table(table_name='insert-table-name',
                             schema_name='insert-schema-name',
                             share_name='insert-share-name')
print(table)
```

```python title="Read a data sample from a table"
# This method reads only a sample of the data records in the table
table = connector.read_table(table_name='insert-table-name',
                             schema_name='insert-schema-name',
                             share_name='insert-share-name',
                             sample_size=100)
print(table)
```

I hope you enjoyed this quick tutorial on seamlessly integrating Databricks with your data preparation workflows. 🚀
# YData SDK in Databricks Notebooks

The [YData Fabric SDK](https://pypi.org/project/ydata-sdk/) provides a powerful set of tools for integrating and enhancing data within Databricks notebooks.
This guide covers the installation, basic usage, and advanced features of the Fabric SDK, helping users maximize
the potential of their data for AI and machine learning applications.

👨💻 ^^[Full code example and recipe can be found here](https://raw.githubusercontent.com/ydataai/academy/master/5%20-%20Integrations/databricks/YData%20Fabric%20SDK%20in%20Databricks%20notebooks)^^.

!!! note "Prerequisites"
    Before using the YData Fabric SDK in Databricks notebooks, ensure the following prerequisites are met:

    - Access to a Databricks workspace
    - A valid YData Fabric account and API key
    - Basic knowledge of Python and Databricks notebooks
    - A secure connection between your Databricks cluster and Fabric

**Best Practices**

- *Data Security:* Ensure API keys and sensitive data are securely managed (see the secret-scope sketch in the *Connecting to YData Fabric* section below).
- *Efficient Coding:* Use vectorized operations for data manipulation where possible.
- *Resource Management:* Monitor and manage the resources used by your Databricks and Fabric clusters to optimize performance.

### Installation

To install the YData SDK in a Databricks notebook, use the following command:

```python
%pip install ydata-sdk
dbutils.library.restartPython()
```

Ensure the installation is successful before proceeding to the next steps.

## Basic Usage - data integration

This section provides step-by-step instructions on connecting to YData Fabric and performing essential
data operations using the YData SDK within Databricks notebooks. This includes establishing a secure connection
to YData Fabric and accessing datasets.

### Connecting to YData Fabric

First, establish a connection to YData Fabric using your API key:

```python
import os

# Add your Fabric token as part of your environment variables for authentication
os.environ["YDATA_TOKEN"] = '<TOKEN>'
```
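
Hard-coding the token in a notebook is convenient for quick tests but discouraged in shared workspaces. A safer pattern - sketched below under the assumption that a Databricks secret scope named `ydata` with a key `fabric-token` already exists (both names are placeholders) - is to read the token from Databricks secrets at runtime:

```python
import os

# Read the Fabric token from a Databricks secret scope instead of hard-coding it.
# The scope ("ydata") and key ("fabric-token") names are assumptions - use your own.
os.environ["YDATA_TOKEN"] = dbutils.secrets.get(scope="ydata", key="fabric-token")
```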

### Data access & manipulation

Once connected, you can access and manipulate data within YData Fabric. For example, to list the available datasources:

```python
from ydata.sdk.datasources import DataSource

# Return the list of available DataSources
DataSource.list()
```

To retrieve an existing datasource by its identifier:

```python
# Get the data from an existing datasource
dataset = DataSource.get('<DATASOURCE-ID>')
```

## Advanced Usage - Synthetic data generation

This section explores one of the most powerful features of the Fabric SDK for enhancing and refining data
within Databricks notebooks: generating synthetic data to augment
datasets or to produce privacy-preserving data.
By leveraging these advanced capabilities, users can significantly enhance the robustness and performance of their AI
and machine learning models, unlocking the full potential of their data.

### Privacy-preserving

Leveraging synthetic data allows you to create privacy-preserving datasets that maintain real-world value,
enabling you to work with sensitive information securely while retaining the utility of real data.

Check the SDK documentation for more information regarding [privacy-controls and anonymization](../../sdk/examples/synthesize_with_privacy_control.md).
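
As a quick illustration of that capability - based on the privacy-control example in the SDK documentation linked above, so the exact parameter names should be double-checked there - a training call can request a higher privacy level:

```python title='Train with a higher privacy level (illustrative)'
from ydata.sdk.synthesizers import PrivacyLevel, RegularSynthesizer

# Request a privacy-oriented trade-off when fitting the synthesizer
synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset, privacy_level=PrivacyLevel.HIGH_PRIVACY)
```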

#### From a datasource in YData Fabric

Users can generate synthetic data from datasources existing in Fabric:

```python title="Train a synthetic data generator"
# From an existing Fabric datasource
from ydata.sdk.synthesizers import RegularSynthesizer

synth = RegularSynthesizer(name='<NAME-YOUR-MODEL>')
synth.fit(X=dataset)
```

After your synthetic data generator has been trained successfully, you can generate as many synthetic datasets as needed:

```python title='Sampling from the model that we have just trained'
sample = synth.sample(100)
sample.head()
```

It is also possible to generate data from other synthetic data generation models previously trained:

```python title='Generating synthetic data from a previously trained model'
from ydata.sdk.synthesizers import RegularSynthesizer

existing_synth = RegularSynthesizer('<INSERT-SYNTHETIC-DATA-GENERATOR-ID>').get()
sample = existing_synth.sample(100)
```

#### From a datasource in Databricks

Another important integration is to train a synthetic data generator from a dataset that you are currently exploring
in your notebook environment.
In order to do so, we recommend that you create your dataset using the
[YData Fabric integration connector to your Delta Lake](integration_connectors_catalog.md) and follow the flow for the creation
of a synthetic data generation model from existing Fabric datasources.

For a small dataset you can also follow [this tutorial](../../sdk/examples/synthesize_tabular_data.md); a minimal sketch of that path is shown below.
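
The sketch below illustrates that small-dataset path under the assumption that the table (here, the `ydata.default.credit_scoring_labeled` table used later in this guide) comfortably fits in the driver's memory; it reuses the `RegularSynthesizer` interface shown above.

```python title='Train directly from a pandas DataFrame (small datasets only)'
from ydata.sdk.synthesizers import RegularSynthesizer

# Read a small table from the catalog and move it to the driver as a pandas DataFrame
small_df = spark.sql(
    "SELECT * FROM ydata.default.credit_scoring_labeled LIMIT 1000"
).toPandas()

# Train a synthesizer directly on the in-memory DataFrame
small_synth = RegularSynthesizer(name='Synth credit scoring | Small sample')
small_synth.fit(small_df)

# Generate a synthetic sample of the same size as the original
synthetic_small = small_synth.sample(len(small_df))
synthetic_small.head()
```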

### Data augmentation

Another key focus is on generating synthetic data to augment existing datasets.
This technique, particularly through conditional synthetic data generation, allows users to create targeted,
realistic datasets. By addressing data imbalances and enriching the training data, conditional synthetic data generation
significantly enhances the robustness and performance of machine learning (ML) models,
leading to more accurate and reliable outcomes.

```python title='Read data from a delta table'
# Read data from the catalog
df = spark.sql("SELECT * FROM ydata.default.credit_scoring_labeled")

# Display the dataframe
display(df)
```

After reading the data, we need to convert it to a pandas DataFrame in order to create our synthetic data generation model.
For the augmentation use case, we will be leveraging conditional synthetic data generation.

```python title='Training a conditional synthetic data generator'
from ydata.sdk.synthesizers import RegularSynthesizer

# Convert the Spark dataframe to a pandas dataframe and drop the unused ID column
pandas_df = df.toPandas()
pandas_df = pandas_df.drop('ID', axis=1)

# Train a synthetic data generator using ydata-sdk, conditioned on the Label column
synth = RegularSynthesizer(name='Synth credit scoring | Conditional')
synth.fit(pandas_df, condition_on='Label')

# Display the trained synthesizer
display(synth)
```

Now that we have a trained conditional synthetic data generator, we are able to generate samples while controlling the
population behaviour through the column(s) that the generation process was conditioned on.

```python title="Generating a synthetic sample conditioned to column 'Label'"
# Generate synthetic samples conditioned on the Label column
synthetic_sample = synth.sample(
    n_samples=len(pandas_df),
    condition_on={
        "Label": {
            "categories": [{
                "category": 1,
                "percentage": 0.7
            }]
        }
    }
)
```

After generating the synthetic data, we can combine it with our dataset.

```python title='Convert the dataframe to Spark dataframe'
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Create a Spark dataframe from the synthetic dataframe
synthetic_df = spark.createDataFrame(synthetic_sample)

display(synthetic_df)
```

```python title="Combining the datasets"
# Concatenate the original dataframe with the synthetic dataframe,
# removing the column ID as it is not used
df = df.drop('ID')
concatenated_df = df.union(synthetic_df)

# Display the concatenated dataframe
display(concatenated_df)
```

Afterwards you can use your augmented dataset to train a ^^[Machine Learning model using MLflow](https://docs.databricks.com/en/mlflow/tracking-ex-scikit.html)^^.
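
As a closing illustration (not part of the original recipe), the sketch below shows one way to do that with scikit-learn and MLflow autologging. It assumes `Label` is the target column, that the remaining feature columns are numeric, and that the augmented data fits in the driver's memory:

```python title='Train a model on the augmented data with MLflow (illustrative)'
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Move the augmented dataset to the driver as a pandas DataFrame (assumes it fits in memory)
augmented_pdf = concatenated_df.toPandas()
X = augmented_pdf.drop('Label', axis=1)
y = augmented_pdf['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Autolog parameters, metrics and the model to the MLflow tracking server
mlflow.sklearn.autolog()
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    print("Test accuracy:", model.score(X_test, y_test))
```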