feat(synthesizer): Support for MultiTable #81

Merged 11 commits on Jan 16, 2024
4 changes: 2 additions & 2 deletions docs/examples/synthesize_timeseries_data.md
@@ -2,9 +2,9 @@

**Use YData's *TimeSeriesSynthesizer* to generate time-series synthetic data**

- Tabular data is the most common type of data we encounter in data problems.
+ Timeseries is the most common type of data we encounter in data problems.

- When thinking about tabular data, we assume independence between different records, but this does not happen in reality. Suppose we check events from our day-to-day life, such as room temperature changes, bank account transactions, stock price fluctuations, and air quality measurements in our neighborhood. In that case, we might end up with datasets where measures and records evolve and are related through time. This type of data is known to be sequential or time-series data.
+ When thinking about timeseries data, we assume independence between different records, but this does not happen in reality. Suppose we check events from our day-to-day life, such as room temperature changes, bank account transactions, stock price fluctuations, and air quality measurements in our neighborhood. In that case, we might end up with datasets where measures and records evolve and are related through time. This type of data is known to be sequential or time-series data.

Thus, sequential or time-series data refers to any data containing elements ordered into sequences in a structured format.
Dissecting any time-series dataset, we see differences in variables' behavior that need to be understood for an effective generation of synthetic data. Typically any time-series dataset is composed of the following:
17 changes: 17 additions & 0 deletions docs/examples/synthesizer_multitable.md
@@ -0,0 +1,17 @@
# Synthesize Multi Table

**Use YData's *MultiTableSynthesizer* to generate multi table synthetic data from multiple RDBMS tables**

The MultiTableSynthesizer synthesizes data from multiple tables in a database, with the relationships between those tables in mind...

Quickstart example:

```python
--8<-- "examples/synthesizers/multi_table_quickstart.py"
```

Example that overrides the write connector when sampling:

```python
--8<-- "examples/synthesizers/multi_table_sample_write_override.py"
```
1 change: 1 addition & 0 deletions docs/sdk/reference/api/synthesizers/multitable.md
@@ -0,0 +1 @@
::: ydata.sdk.synthesizers.multitable.MultiTableSynthesizer
25 changes: 25 additions & 0 deletions examples/synthesizers/multi_table_quickstart.py
@@ -0,0 +1,25 @@
import os

from ydata.sdk.datasources import DataSource
from ydata.sdk.synthesizers import MultiTableSynthesizer

# Do not forget to set your token as an environment variable
os.environ["YDATA_TOKEN"] = '<TOKEN>' # Remove if already defined

# In this example, we demonstrate how to train a synthesizer from an existing multi-table RDBMS datasource.
# After training a Multi Table Synthesizer, we request a sample.
# In this case, the sample is not returned as a Dataset; it is saved in the database
# that the connector refers to.

X = DataSource.get('<DATASOURCE_UID>')

# Initialize a multi-table synthesizer with the connector to write to.
# Until `fit` is called, the synthesizer exists only locally.
# write_connector can be a UID or a Connector instance.
synth = MultiTableSynthesizer(write_connector='<CONNECTOR_UID>')

# The synthesizer training is requested
synth.fit(X)

# We request a synthetic dataset with a fraction of 1.5
synth.sample(frac=1.5)
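
The example above writes a placeholder token straight into `os.environ`. As a minimal sketch of a safer pattern, the snippet below checks that a real token is already defined and fails early if it is missing or still the `<TOKEN>` placeholder. The helper name `require_token` and the placeholder check are assumptions for illustration; they are not part of the YData SDK.

```python
import os
from typing import Mapping, Optional

# The literal placeholder used in the example; treated as "not configured".
PLACEHOLDER = "<TOKEN>"


def require_token(env: Optional[Mapping[str, str]] = None,
                  name: str = "YDATA_TOKEN") -> str:
    """Return the token stored under `name`, or raise if it is unusable.

    Hypothetical helper, not an SDK function. `env` defaults to os.environ
    and can be replaced by a plain dict for testing.
    """
    source = os.environ if env is None else env
    token = source.get(name, "").strip()
    if not token or token == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {name} environment variable to a real token "
            "before running the example."
        )
    return token
```

Calling `require_token()` at the top of a script, before `DataSource.get(...)`, would surface a missing token as a clear error rather than as a failed request later on.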
32 changes: 32 additions & 0 deletions examples/synthesizers/multi_table_sample_write_override.py
@@ -0,0 +1,32 @@
import os

from ydata.sdk.connectors import Connector
from ydata.sdk.datasources import DataSource
from ydata.sdk.synthesizers import MultiTableSynthesizer

# Do not forget to set your token as an environment variable
os.environ["YDATA_TOKEN"] = '<TOKEN>' # Remove if already defined

# In this example, we demonstrate how to train a synthesizer from an existing multi-table RDBMS datasource.
# After training a Multi Table Synthesizer, we request a sample.
# In this case, the sample is not returned as a Dataset; it is saved in the database
# that the connector refers to.

X = DataSource.get('<DATASOURCE_UID>')

# For demonstration purposes, we use a Connector instance, but you can pass the UID instead

write_connector = Connector.get('<CONNECTOR_UID>')

# Initialize a multi-table synthesizer with the connector to write to.
# Until `fit` is called, the synthesizer exists only locally.
# write_connector can be a UID or a Connector instance.
synth = MultiTableSynthesizer(write_connector=write_connector)

# The synthesizer training is requested
synth.fit(X)

# We request a synthetic dataset with a fraction of 1.5.
# In this case we use a Connector instance.
# You can also pass the <CONNECTOR_UID> directly; you don't need to fetch the connector upfront.
synth.sample(frac=1.5, write_connector=write_connector)
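
The override example passes `write_connector` both at construction and at `sample` time. The sketch below illustrates the precedence that the example's name suggests: a value given at sample time wins over the one given at construction. This is an assumption about the intended behavior, not the SDK's actual implementation, and `resolve_write_connector` is a hypothetical helper.

```python
from typing import Optional, TypeVar

# The connector may be a UID string or a Connector instance; the sketch
# treats both the same way, so a generic type variable is enough.
T = TypeVar("T")


def resolve_write_connector(init_value: Optional[T],
                            sample_value: Optional[T] = None) -> T:
    """Pick the connector to write to (hypothetical helper, not SDK code).

    Assumed precedence: the value passed at sample time overrides the one
    passed when the synthesizer was constructed.
    """
    if sample_value is not None:
        return sample_value
    if init_value is not None:
        return init_value
    raise ValueError(
        "No write connector was provided at construction or at sample time."
    )
```

For instance, `resolve_write_connector("uid-a", "uid-b")` returns `"uid-b"`, while `resolve_write_connector("uid-a")` falls back to `"uid-a"`; the UID strings here are made up.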