ydata-sdk is available through PyPi, allowing an easy process of installation and integration with data science programming environments (Google Colab, Jupyter Notebooks, Visual Studio Code, PyCharm) and stack (pandas, numpy, scikit-learn).
Installing the package
Currently, the package supports Python versions from 3.9 up to 3.12, and can be installed on Windows, Linux, or macOS operating systems.
Prior to the package installation, it is recommended to create a virtual or conda environment:
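For example, using conda (the environment name and Python version below are taken from the following sentence; adapt the commands to your preferred environment manager):

conda create -n synth-env python=3.12
conda activate synth-env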
The above command creates and activates a new environment called "synth-env" with Python version 3.12.X. In the new environment, you can then install ydata-sdk:
pypi
pip install ydata-sdk
Installing the package
5min – Step-by-step installation guide
Using Google Colab
To install inside a Google Colab notebook, you can use the following:
!pip install ydata-sdk
Make sure your Google Colab is running Python versions >=3.9, <=3.12. Learn how to configure Python versions on Google Colab here.
Installing the Streamlit App
Since version 1.0.0, ydata-synthetic includes a GUI experience provided by a Streamlit app. The UI supports the data synthesization process from reading the data to profiling the synthetic data generation, and can be installed as follows:
pip install "ydata-synthetic[streamlit]"
Note that Jupyter or Colab Notebooks are not yet supported, so use it in your Python environment.
ydata-synthetic is equipped to handle both tabular (comprising numeric and categorical features) and sequential, time-series data. In this section we explain how you can quickstart the synthesization of tabular and time-series datasets.
Synthesizing a Tabular Dataset
The following example showcases how to synthesize the Adult Census Income dataset with CTGAN:
Tabular Data
# Import the necessary modules
from pmlb import fetch_data
from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Load data
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'education-num', 'marital-status',
            'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']

# Define model and training parameters
ctgan_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
train_args = TrainParameters(epochs=501)

# Train the generator model
synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Generate 1000 new synthetic samples
synth_data = synth.sample(1000)
Synthesizing a Time-Series Dataset
The following example showcases how to synthesize the Yahoo Stock Price dataset with TimeGAN:
Time-Series Data
# Import the necessary modules
import pandas as pd
from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Define model parameters
gan_args = ModelParameters(batch_size=128,
                           lr=5e-4,
                           noise_dim=32,
                           layers_dim=128,
                           latent_dim=24,
                           gamma=1)

train_args = TrainParameters(epochs=50000,
                             sequence_length=24,
                             number_sequences=6)

# Read the data
stock_data = pd.read_csv("stock_data.csv")

# Training the TimeGAN synthesizer
synth = TimeSeriesSynthesizer(modelname='timegan', model_parameters=gan_args)
synth.fit(stock_data, train_args, num_cols=list(stock_data.columns))

# Generating new synthetic samples
synth_data = synth.sample(n_samples=500)
Running the Streamlit App
Once the package is installed with the "streamlit" extra, the app can be launched as:
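A minimal sketch of launching the app, based on the ydata-synthetic 1.x Streamlit integration (treat the exact entry point as an assumption if you are on a different release):

from ydata_synthetic import streamlit_app

# Starts the local Streamlit UI for guided synthetic data generation
streamlit_app.run()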
ydata-synthetic is the go-to Python package for synthetic data generation for tabular and time-series data. It uses the latest Generative AI models to learn the properties of real data and create realistic synthetic data. This project was created to educate the community about synthetic data and its applications in real-world domains, such as data augmentation, bias mitigation, data sharing, and privacy engineering. To learn more about Synthetic Data and its applications, check this article.
Current Functionality
🤖 Create Realistic Synthetic Data using Generative AI Models: ydata-synthetic supports the state-of-the-art generative adversarial networks for data generation, namely Vanilla GAN, CGAN, WGAN, WGAN-GP, DRAGAN, Cramer GAN, CWGAN-GP, CTGAN, and TimeGAN. Learn more about the use of GANs for Synthetic Data generation.
📀 Synthetic Data Generation for Tabular and Time-Series Data: The package supports the synthesization of tabular and time-series data, covering a wide range of real-world applications. Learn how to leverage ydata-synthetic for tabular and time-series data.
💻 Best Generation Experience in Open Source: Including a guided UI experience for the generation of synthetic data, from reading the data to visualization of synthetic data. All served by a slick Streamlit app. Here's a quick overview – 1min
Question
Looking for an end-to-end solution to Synthetic Data Generation?
YData Fabric enables the generation of high-quality datasets within a full UI experience, from data preparation to synthetic data generation and evaluation. Check out the Community Version.
YData-Synthetic is an open-source package developed in 2020 with the primary goal of educating users about generative models for synthetic data generation.
Designed as a collection of models, it was intended for exploratory studies and educational purposes.
However, it was not optimized for the quality, performance, and scalability needs typically required by organizations.
We are now ydata-sdk!
Even though the journey was fun and we have learned a lot from the community, it is now time to upgrade ydata-synthetic.
Heading towards the future of synthetic data generation, we recommend that users transition to ydata-sdk, which provides a superior experience with enhanced performance, precision, and ease of use, making it the preferred tool for synthetic data generation and a perfect introduction to Generative AI.
Supported Data Types
Tabular Data | Time-Series Data | Multi-Table Data
Tabular data does not have a temporal dependence, and can be structured and organized in a table-like format, where features are represented in columns, whereas observations correspond to the rows.
Additionally, tabular data usually comprises both numeric and categorical features. Numeric features are those that encode quantitative values, whereas categorical features represent qualitative measurements. Categorical features can be further divided into ordinal, binary or boolean, and nominal features.
Learn more about synthesizing tabular data in this article, or check the quickstart guide to get started with the synthesization of tabular datasets.
Time-series data exhibit a sequential, temporal dependency between records, and may present a wide range of patterns and trends, including seasonality (patterns that repeat at calendar periods -- days, weeks, months -- such as holiday sales, for instance) or periodicity (patterns that repeat over time).
Read more about generating time-series data in this article and check this quickstart guide to get started with time-series data synthesization.
Multi-Table data or databases exhibit a referential behaviour between entities and a database schema that is expected to be replicated and respected by the synthetic data generated.
Read more about database synthetic data generation in this article and check this quickstart guide for Multi-Table synthetic data generation.
Validate the quality of your generated synthetic data
Validating the quality of synthetic data is essential to ensure its usefulness and privacy. YData Fabric provides tools for comprehensive synthetic data evaluation through:
Profile Comparison Visualization:
Fabric delivers side-by-side visual comparisons of key data properties (e.g., distributions, correlations, and outliers) between synthetic and original datasets, allowing users to assess fidelity at a glance.
PDF Report with Metrics:
Fabric generates a PDF report that includes key metrics to evaluate:
Fidelity: How closely synthetic data matches the original.
Utility: How well it performs in real-world tasks.
Privacy: Risk assessment of data leakage and re-identification.
These tools ensure a thorough validation of synthetic data quality, making it reliable for real-world use.
Supported Generative AI Models
With the upcoming update of ydata-synthetic to ydata-sdk, users will now have access to a single API that automatically selects and optimizes the best generative model for their data. This streamlined approach eliminates the need to choose between various models manually, as the API intelligently identifies the optimal model based on the specific dataset and use case.
Instead of having to manually select from models such as Vanilla GAN, CGAN, WGAN, WGAN-GP, DRAGAN, Cramer GAN, CWGAN-GP, CTGAN, and TimeGAN, the new API handles model selection automatically, optimizing for the best performance in fidelity, utility, and privacy.
This significantly simplifies the synthetic data generation process, ensuring that users get the highest quality output without the need for manual intervention and tedious hyperparameter tuning.
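As an illustration of this single-API workflow, below is a minimal sketch using ydata-sdk. The sample dataset name and the YDATA_TOKEN authentication variable follow the ydata-sdk quickstart and are assumptions that may differ in your setup:

import os

from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import RegularSynthesizer

# Authentication token for the YData platform (assumed variable name)
os.environ["YDATA_TOKEN"] = "<your-token>"

# Load a sample dataset and fit a synthesizer; model selection and tuning are handled by the SDK
data = get_dataset("census")
synth = RegularSynthesizer()
synth.fit(data)

# Generate 1000 synthetic rows
synth_data = synth.sample(n_samples=1000)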
Great Expectations then uses this statement to validate whether the column passenger_count in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides several dozen highly expressive built-in Expectations, and allows you to write custom Expectations.
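For illustration, a minimal sketch of evaluating such an Expectation directly against a pandas DataFrame is shown below; the passenger_count example mirrors the statement described above, and the from_pandas entry point belongs to the classic Great Expectations API, so treat it as an assumption if you are on a newer release:

import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so Expectations can be evaluated on it directly
df = ge.from_pandas(pd.DataFrame({"passenger_count": [1, 2, 4, 6]}))

# Validate that every value falls between 1 and 6
result = df.expect_column_values_to_be_between("passenger_count", min_value=1, max_value=6)
print(result.success)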
Great Expectations renders Expectations to clean, human-readable documentation called Data Docs. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run – think of it as a continuously updated data quality report.
Validating your Synthetic Data with Great Expectations
!!! note "Outdated"
    From ydata-synthetic vx onwards this example will no longer work. Please check ydata-sdk and its synthetic data generation examples.
1. Install the required libraries:
We recommend you create a virtual environment and install ydata-synthetic and great-expectations by running the following command on your terminal.
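For example (pin versions as appropriate for your environment):

pip install ydata-synthetic great-expectations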
This data processor works like a scikit-learn transformer, with the methods fit, transform and inverse_transform.
Args:
    num_cols (list of strings):
        List of names of numerical columns.
    cat_cols (list of strings):
        List of names of categorical columns.

Source code in ydata_synthetic/preprocessing/base_processor.py
@typechecked
class BaseProcessor(ABC, BaseEstimator, TransformerMixin):
    """
    This data processor works like a scikit learn transformer in with the methods fit, transform and inverse transform.
    Args:
        num_cols (list of strings):
            List of names of numerical columns.
        cat_cols (list of strings):
            List of names of categorical columns.
    """
    def __init__(self, num_cols: Optional[List[str]] = None, cat_cols: Optional[List[str]] = None):
        self.num_cols = [] if num_cols is None else num_cols
        self.cat_cols = [] if cat_cols is None else cat_cols

        self._num_pipeline = None  # To be overriden by child processors
        self._cat_pipeline = None  # To be overriden by child processors

        self._col_transform_info = None  # Metadata object mapping inputs/outputs of each pipeline

    @property
    def num_pipeline(self) -> BaseEstimator:
        """Returns the pipeline applied to numerical columns."""
        return self._num_pipeline

    @property
    def cat_pipeline(self) -> BaseEstimator:
        """Returns the pipeline applied to categorical columns."""
        return self._cat_pipeline

    @property
    def types(self) -> Series:
        """Returns a Series with the dtypes of each column in the fitted DataFrame."""
        return self._types

    @property
    def col_transform_info(self) -> SimpleNamespace:
        """Returns a ProcessorInfo object specifying input/output feature mappings of this processor's pipelines."""
        self._check_is_fitted()
        if self._col_transform_info is None:
            self._col_transform_info = self.__create_metadata_synth()
        return self._col_transform_info

    def __create_metadata_synth(self) -> SimpleNamespace:
        def new_pipeline_info(feat_in, feat_out):
            return SimpleNamespace(feat_names_in=feat_in, feat_names_out=feat_out)
        if self.num_cols:
            num_info = new_pipeline_info(self.num_pipeline.feature_names_in_, self.num_pipeline.get_feature_names_out())
        else:
            num_info = new_pipeline_info([], [])
        if self.cat_cols:
            cat_info = new_pipeline_info(self.cat_pipeline.feature_names_in_, self.cat_pipeline.get_feature_names_out())
        else:
            cat_info = new_pipeline_info([], [])
        return SimpleNamespace(numerical=num_info, categorical=cat_info)

    def _check_is_fitted(self):
        """Checks if the processor is fitted by testing the numerical pipeline.
        Raises NotFittedError if not."""
        if self._num_pipeline is None:
            raise NotFittedError("This data processor has not yet been fitted.")

    def _validate_cols(self, x_cols):
        """Ensures validity of the passed numerical and categorical columns.
        The following is verified:
            1) Num cols and cat cols are disjoint sets;
            2) The union of these sets should equal x_cols;.
        Assertion errors are raised in case any of the tests fails."""
        missing = set(x_cols).difference(set(self.num_cols).union(set(self.cat_cols)))
        intersection = set(self.num_cols).intersection(set(self.cat_cols))
        assert intersection == set(), f"num_cols and cat_cols share columns {intersection} but should be disjoint."
        assert missing == set(), f"The columns {missing} of the provided dataset were not attributed to a pipeline."

    # pylint: disable=C0103
    @abstractmethod
    def fit(self, X: DataFrame) -> BaseProcessor:
        """Fits the DataProcessor to a passed DataFrame.
        Args:
            X (DataFrame):
                DataFrame used to fit the processor parameters.
                Should be aligned with the num/cat columns defined in initialization.
        Returns:
            self (DataProcessor): The fitted data processor.
        """
        raise NotImplementedError

    # pylint: disable=C0103
    @abstractmethod
    def transform(self, X: DataFrame) -> ndarray:
        """Transforms the passed DataFrame with the fit DataProcessor.
        Args:
            X (DataFrame):
                DataFrame used to fit the processor parameters.
                Should be aligned with the columns types defined in initialization.
        Returns:
            transformed (ndarray): Processed version of the passed DataFrame.
        """
        raise NotImplementedError

    # pylint: disable=C0103
    @abstractmethod
    def inverse_transform(self, X: ndarray) -> DataFrame:
        """Inverts the data transformation pipelines on a passed DataFrame.
        Args:
            X (ndarray):
                Numpy array to be brought back to the original data format.
                Should share the schema of data transformed by this DataProcessor.
                Can be used to revert transformations of training data or for synthetic samples.
        Returns:
            result (DataFrame):
                DataFrame with all performed transformations inverted.
        """
        raise NotImplementedError
CTGAN data preprocessing class.
It works like any other transformer in scikit-learn with the methods fit, transform and inverse_transform.
Args:
    n_clusters (int), default=10:
        Number of clusters.
    epsilon (float), default=0.005:
        Epsilon value.
    num_cols (list of strings):
        List of names of numerical columns.
    cat_cols (list of strings):
        List of names of categorical columns.

Source code in ydata_synthetic/preprocessing/regular/ctgan_processor.py
@typechecked
class CTGANDataProcessor(BaseProcessor):
    """
    CTGAN data preprocessing class.
    It works like any other transformer in scikit-learn with the methods fit, transform and inverse_transform.
    Args:
        n_clusters (int), default=10:
            Number of clusters.
        epsilon (float), default=0.005:
            Epsilon value.
        num_cols (list of strings):
            List of names of numerical columns.
        cat_cols (list of strings):
            List of names of categorical columns.
    """
    SUPPORTED_MODEL = 'CTGAN'

    def __init__(self, n_clusters=10, epsilon=0.005,
                 num_cols: Optional[List[str]] = None,
                 cat_cols: Optional[List[str]] = None):
        super().__init__(num_cols, cat_cols)

        self._n_clusters = n_clusters
        self._epsilon = epsilon
        self._metadata = None
        self._dtypes = None
        self._output_dimensions = None

    @property
    def metadata(self) -> list[ColumnMetadata]:
        """
        Returns the metadata for each column.
        """
        return self._metadata

    @property
    def output_dimensions(self) -> int:
        """
        Returns the dataset dimensionality after the preprocessing.
        """
        return int(self._output_dimensions)

    @ignore_warnings(category=ConvergenceWarning)
    def fit(self, X: pd.DataFrame) -> CTGANDataProcessor:
        """
        Fits the data processor to a passed DataFrame.

        Args:
            X (DataFrame):
                DataFrame used to fit the processor parameters.
                Should be aligned with the num/cat columns defined in initialization.
        Returns:
            self (CTGANDataProcessor): The fitted data processor.
        """
        self._dtypes = X.infer_objects().dtypes
        self._metadata = []
        cur_idx = 0
        for column in X.columns:
            column_data = X[[column]].values
            if column in self.cat_cols:
                ohe = OneHotEncoder(sparse_output=False)
                ohe.fit(column_data)
                n_categories = len(ohe.categories_[0])
                self._metadata.append(
                    ColumnMetadata(
                        start_idx=cur_idx,
                        end_idx=cur_idx + n_categories,
                        discrete=True,
                        output_dim=n_categories,
                        model=ohe,
                        components=None,
                        name=column
                    )
                )
                cur_idx += n_categories
            else:
                bgm = BayesianGaussianMixture(
                    n_components=self._n_clusters,
                    weight_concentration_prior_type='dirichlet_process',
                    weight_concentration_prior=0.001,
                    n_init=1
                )
                bgm.fit(column_data)
                components = bgm.weights_ > self._epsilon
                output_dim = components.sum() + 1
                self._metadata.append(
                    ColumnMetadata(
                        start_idx=cur_idx,
                        end_idx=cur_idx + output_dim,
                        discrete=False,
                        output_dim=output_dim,
                        model=bgm,
                        components=components,
                        name=column
                    )
                )
                cur_idx += output_dim
        self._output_dimensions = cur_idx
        return self

    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """
        Transforms the passed DataFrame with the fitted data processor.

        Args:
            X (DataFrame):
                DataFrame used to fit the processor parameters.
                Should be aligned with the columns types defined in initialization.
        Returns:
            Processed version of the passed DataFrame.
        """
        if self._metadata is None:
            raise NotFittedError("This data processor has not yet been fitted.")

        transformed_data = []
        for col_md in self._metadata:
            column_data = X[[col_md.name]].values
            if col_md.discrete:
                ohe = col_md.model
                transformed_data.append(ohe.transform(column_data))
            else:
                bgm = col_md.model
                components = col_md.components

                means = bgm.means_.reshape((1, self._n_clusters))
                stds = np.sqrt(bgm.covariances_).reshape((1, self._n_clusters))
                features = (column_data - means) / (4 * stds)

                probabilities = bgm.predict_proba(column_data)
                n_opts = components.sum()
                features = features[:, components]
                probabilities = probabilities[:, components]

                opt_sel = np.zeros(len(column_data), dtype='int')
                for i in range(len(column_data)):
                    norm_probs = probabilities[i] + 1e-6
                    norm_probs = norm_probs / norm_probs.sum()
                    opt_sel[i] = np.random.choice(np.arange(n_opts), p=norm_probs)

                idx = np.arange((len(features)))
                features = features[idx, opt_sel].reshape([-1, 1])
                features = np.clip(features, -.99, .99)

                probs_onehot = np.zeros_like(probabilities)
                probs_onehot[np.arange(len(probabilities)), opt_sel] = 1
                transformed_data.append(
                    np.concatenate([features, probs_onehot], axis=1).astype(float))

        return np.concatenate(transformed_data, axis=1).astype(float)

    def inverse_transform(self, X: np.ndarray) -> pd.DataFrame:
        """
        Reverts the data transformations on a passed DataFrame.

        Args:
            X (ndarray):
                Numpy array to be brought back to the original data format.
                Should share the schema of data transformed by this data processor.
                Can be used to revert transformations of training data or for synthetic samples.
        Returns:
            DataFrame with all performed transformations reverted.
        """
        if self._metadata is None:
            raise NotFittedError("This data processor has not yet been fitted.")

        transformed_data = []
        col_names = []
        for col_md in self._metadata:
            col_data = X[:, col_md.start_idx:col_md.end_idx]
            if col_md.discrete:
                inv_data = col_md.model.inverse_transform(col_data)
            else:
                mean = col_data[:, 0]
                variance = col_data[:, 1:]
                mean = np.clip(mean, -1, 1)

                v_t = np.ones((len(col_data), self._n_clusters)) * -100
                v_t[:, col_md.components] = variance
                variance = v_t
                means = col_md.model.means_.reshape([-1])
                stds = np.sqrt(col_md.model.covariances_).reshape([-1])

                p_argmax = np.argmax(variance, axis=1)
                std_t = stds[p_argmax]
                mean_t = means[p_argmax]
                inv_data = mean * 4 * std_t + mean_t

            transformed_data.append(inv_data)
            col_names.append(col_md.name)

        transformed_data = np.column_stack(transformed_data)
        transformed_data = pd.DataFrame(transformed_data, columns=col_names).astype(self._dtypes)
        return transformed_data
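As a usage note (not part of the original reference page), a minimal sketch of the fit/transform/inverse_transform round trip with CTGANDataProcessor; the toy DataFrame and column names are illustrative:

import numpy as np
import pandas as pd

from ydata_synthetic.preprocessing.regular.ctgan_processor import CTGANDataProcessor

# Illustrative data: one numerical and one categorical column
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.normal(100.0, 25.0, size=200),
    "channel": rng.choice(["web", "store", "phone"], size=200),
})

processor = CTGANDataProcessor(n_clusters=10, epsilon=0.005,
                               num_cols=["amount"], cat_cols=["channel"])
processor.fit(df)

encoded = processor.transform(df)               # mode-specific normalization + one-hot encoding
print(processor.output_dimensions)              # width of the encoded representation
decoded = processor.inverse_transform(encoded)  # back to the original schema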
Main class for Regular/Tabular Data Preprocessing.
It works like any other transformer in scikit-learn with the methods fit, transform and inverse_transform.
Args:
    num_cols (list of strings):
        List of names of numerical columns.
    cat_cols (list of strings):
        List of names of categorical columns.

Source code in ydata_synthetic/preprocessing/regular/processor.py
@typechecked
class RegularDataProcessor(BaseProcessor):
    """
    Main class for Regular/Tabular Data Preprocessing.
    It works like any other transformer in scikit learn with the methods fit, transform and inverse transform.
    Args:
        num_cols (list of strings):
            List of names of numerical columns.
        cat_cols (list of strings):
            List of names of categorical columns.
    """
    def __init__(self, num_cols: Optional[List[str]] = None, cat_cols: Optional[List[str]] = None):
        super().__init__(num_cols, cat_cols)

        self._col_order_ = None
        self._num_col_idx_ = None
        self._cat_col_idx_ = None

    # pylint: disable=W0106
    def fit(self, X: DataFrame) -> RegularDataProcessor:
        """Fits the DataProcessor to a passed DataFrame.
        Args:
            X (DataFrame):
                DataFrame used to fit the processor parameters.
                Should be aligned with the num/cat columns defined in initialization.
        Returns:
            self (RegularDataProcessor): The fitted data processor.
        """
        self._validate_cols(X.columns)

        self._col_order_ = [c for c in X.columns if c in self.num_cols + self.cat_cols]

        self._types = X.dtypes

        self._num_pipeline = Pipeline([
            ("scaler", MinMaxScaler()),
        ])
        self._cat_pipeline = Pipeline([
            ("encoder", OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
        ])

        self.num_pipeline.fit(X[self.num_cols]) if self.num_cols else zeros([len(X), 0])
        self.cat_pipeline.fit(X[self.cat_cols]) if self.num_cols else zeros([len(X), 0])

        self._num_col_idx_ = len(self.num_pipeline.get_feature_names_out())
        self._cat_col_idx_ = self._num_col_idx_ + len(self.cat_pipeline.get_feature_names_out())

        return self

    def transform(self, X: DataFrame) -> ndarray:
        """Transforms the passed DataFrame with the fit DataProcessor.
        Args:
            X (DataFrame):
                DataFrame used to fit the processor parameters.
                Should be aligned with the columns types defined in initialization.
        Returns:
            transformed (ndarray):
                Processed version of the passed DataFrame.
        """
        self._check_is_fitted()

        num_data = self.num_pipeline.transform(X[self.num_cols]) if self.num_cols else zeros([len(X), 0])
        cat_data = self.cat_pipeline.transform(X[self.cat_cols]) if self.cat_cols else zeros([len(X), 0])

        transformed = concatenate([num_data, cat_data], axis=1)

        return transformed

    def inverse_transform(self, X: ndarray) -> DataFrame:
        """Inverts the data transformation pipelines on a passed DataFrame.
        Args:
            X (ndarray):
                Numpy array to be brought back to the original data format.
                Should share the schema of data transformed by this DataProcessor.
                Can be used to revert transformations of training data or for synthetic samples.
        Returns:
            result (DataFrame):
                DataFrame with all performed transformations inverted.
        """
        self._check_is_fitted()

        num_data, cat_data, _ = split(X, [self._num_col_idx_, self._cat_col_idx_], axis=1)

        num_data = self.num_pipeline.inverse_transform(num_data) if self.num_cols else zeros([len(X), 0])
        cat_data = self.cat_pipeline.inverse_transform(cat_data) if self.cat_cols else zeros([len(X), 0])

        result = concat([DataFrame(num_data, columns=self.num_cols),
                         DataFrame(cat_data, columns=self.cat_cols)], axis=1)

        result = result.loc[:, self._col_order_]

        for col in result.columns:
            result[col] = result[col].astype(self._types[col])

        return result
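As a usage note (not part of the original reference page), a minimal sketch of the same round trip with RegularDataProcessor; the toy DataFrame and column names are illustrative:

import pandas as pd

from ydata_synthetic.preprocessing.regular.processor import RegularDataProcessor

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 55000.0, 72000.0, 61000.0],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})

processor = RegularDataProcessor(num_cols=["age", "income"], cat_cols=["workclass"])
processor.fit(df)

encoded = processor.transform(df)                 # MinMax-scaled numericals + one-hot categoricals (ndarray)
recovered = processor.inverse_transform(encoded)  # DataFrame with the original columns and dtypes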
CGAN model for discrete conditions
Source code in ydata_synthetic/synthesizers/regular/cgan/model.py

class CGAN(ConditionalModel):
    "CGAN model for discrete conditions"

    __MODEL__ = 'CGAN'

    def __init__(self, model_parameters):
        self._col_order = None
        super().__init__(model_parameters)

    def define_gan(self, activation_info: Optional[NamedTuple] = None):
        """Define the trainable model components.

        Args:
            activation_info (Optional[NamedTuple]): Defaults to None
        """
        self.generator = Generator(self.batch_size). \
            build_model(input_shape=(self.noise_dim,),
                        label_shape=(self.label_dim),
                        dim=self.layers_dim, data_dim=self.data_dim,
                        activation_info=activation_info, tau=self.tau)

        self.discriminator = Discriminator(self.batch_size). \
            build_model(input_shape=(self.data_dim,),
                        label_shape=(self.label_dim,),
                        dim=self.layers_dim)

        g_optimizer = Adam(self.g_lr, beta_1=self.beta_1, beta_2=self.beta_2)
        d_optimizer = Adam(self.d_lr, beta_1=self.beta_1, beta_2=self.beta_2)

        # Build and compile the discriminator
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=d_optimizer,
                                   metrics=['accuracy'])

        # The generator takes noise as input and generates imgs
        noise = Input(shape=(self.noise_dim,))
        label = Input(shape=(1,))  # A label vector is expected
        record = self.generator([noise, label])

        # For the combined model we will only train the generator
        self.discriminator.trainable = False

        # The discriminator takes generated images as input and determines validity
        validity = self.discriminator([record, label])

        # The combined model (stacked generator and discriminator)
        # Trains the generator to fool the discriminator
        self._model = Model([noise, label], validity)
        self._model.compile(loss='binary_crossentropy', optimizer=g_optimizer)

    def _generate_noise(self):
        """Gaussian noise for the generator input."""
        while True:
            yield random.uniform(shape=(self.noise_dim,))

    def get_batch_noise(self):
        """Create a batch iterator for the generator gaussian noise input."""
        return iter(tfdata.Dataset.from_generator(self._generate_noise, output_types=dtypes.float32)
                    .batch(self.batch_size)
                    .repeat())

    def get_data_batch(self, data, batch_size, seed=0):
        """Produce real data batches from the passed data object.

        Args:
            data: real data.
            batch_size: batch size.
            seed (int, optional): Defaults to 0.

        Returns:
            data batch.
        """
        start_i = (batch_size * seed) % len(data)
        stop_i = start_i + batch_size
        shuffle_seed = (batch_size * seed) // len(data)
        np.random.seed(shuffle_seed)
        data_ix = np.random.choice(data.shape[0], replace=False, size=len(data))  # wasteful to shuffle every time
        return data[data_ix[start_i:stop_i]]

    def fit(self,
            data: DataFrame,
            label_cols: List[str],
            train_arguments: TrainParameters,
            num_cols: List[str],
            cat_cols: List[str]):
        """Trains and fit a synthesizer model to a given input dataset.

        Args:
            data: A pandas DataFrame with the data to be synthesized
            label_cols: The name of the column to be used as a label and condition for the training
            train_arguments: GAN training arguments.
            num_cols: List of columns of the data object to be handled as numerical
            cat_cols: List of columns of the data object to be handled as categorical
        """
        data, label = self._prep_fit(data, label_cols, num_cols, cat_cols)

        processed_data = self.processor.transform(data)
        self.data_dim = processed_data.shape[1]
        self.label_dim = len(label_cols)

        # Init the GAN model and optimizers
        self.define_gan(self.processor.col_transform_info)

        # Merging labels with processed data
        processed_data = hstack([processed_data, label])

        noise_batches = self.get_batch_noise()

        iterations = int(abs(processed_data.shape[0] / self.batch_size) + 1)
        # Adversarial ground truths
        valid = np.ones((self.batch_size, 1))
        fake = np.zeros((self.batch_size, 1))

        for epoch in trange(train_arguments.epochs):
            for _ in range(iterations):
                # ---------------------
                #  Train Discriminator
                # ---------------------
                batch_x = self.get_data_batch(processed_data, self.batch_size)  # Batches are retrieved with labels
                batch_x, label = batch_x[:, :-1], batch_x[:, -1]  # Separate labels from batch
                noise = next(noise_batches)

                # Generate a batch of new records
                gen_records = self.generator([noise, label], training=True)

                # Train the discriminator
                d_loss_real = self.discriminator.train_on_batch([batch_x, label], valid)  # Separate labels
                d_loss_fake = self.discriminator.train_on_batch([gen_records, label], fake)  # Separate labels
                d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

                # ---------------------
                #  Train Generator
                # ---------------------
                noise = next(noise_batches)
                # Train the generator (to have the discriminator label samples as valid)
                g_loss = self._model.train_on_batch([noise, label], valid)

            # Plot the progress
            print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100 * d_loss[1], g_loss))

            # If at save interval => save model state and generated image samples
            if epoch % train_arguments.sample_interval == 0:
                self._run_checkpoint(train_arguments, epoch, label)

    def _run_checkpoint(self, train_arguments, epoch, label):
        """Run checkpoint and store model state and generated samples.

        Args:
            train_arguments: GAN training arguments.
            epoch: training epoch
            label: deprecated
        """
        if path.exists('./cache') is False:
            os.mkdir('./cache')
        model_checkpoint_base_name = './cache/' + train_arguments.cache_prefix + '_{}_model_weights_step_{}.h5'
        self.generator.save_weights(model_checkpoint_base_name.format('generator', epoch))
        self.discriminator.save_weights(model_checkpoint_base_name.format('discriminator', epoch))
Base class of GAN synthesizer models.
The main methods are train (for fitting the synthesizer), save/load and sample (obtain synthetic records).
Args:
    model_parameters (ModelParameters):
        Set of architectural parameters for model definition.

Source code in ydata_synthetic/synthesizers/base.py
@typechecked
class BaseGANModel(BaseModel):
    """
    Base class of GAN synthesizer models.
    The main methods are train (for fitting the synthesizer), save/load and sample (obtain synthetic records).
    Args:
        model_parameters (ModelParameters):
            Set of architectural parameters for model definition.
    """
    def __init__(
            self,
            model_parameters: ModelParameters
    ):
        gpu_devices = tfconfig.list_physical_devices('GPU')
        if len(gpu_devices) > 0:
            try:
                tfconfig.experimental.set_memory_growth(gpu_devices[0], True)
            except (ValueError, RuntimeError):
                # Invalid device or cannot modify virtual devices once initialized.
                pass
        # Validate the provided model parameters
        if model_parameters.betas is not None:
            assert len(model_parameters.betas) == 2, "Please provide the betas information as a tuple."

        self.batch_size = model_parameters.batch_size
        self._set_lr(model_parameters.lr)
        self.beta_1 = model_parameters.betas[0]
        self.beta_2 = model_parameters.betas[1]
        self.noise_dim = model_parameters.noise_dim
        self.data_dim = None
        self.layers_dim = model_parameters.layers_dim

        # Additional parameters for the CTGAN
        self.generator_dims = model_parameters.generator_dims
        self.critic_dims = model_parameters.critic_dims
        self.l2_scale = model_parameters.l2_scale
        self.latent_dim = model_parameters.latent_dim
        self.gp_lambda = model_parameters.gp_lambda
        self.pac = model_parameters.pac

        self.use_tanh = model_parameters.tanh
        self.processor = None
        if self.__MODEL__ in RegularModels.__members__ or \
                self.__MODEL__ == CTGANDataProcessor.SUPPORTED_MODEL:
            self.tau = model_parameters.tau_gs

    # pylint: disable=E1101
    def __call__(self, inputs, **kwargs):
        return self.model(inputs=inputs, **kwargs)

    # pylint: disable=C0103
    def _set_lr(self, lr):
        if isinstance(lr, float):
            self.g_lr = lr
            self.d_lr = lr
        elif isinstance(lr, (list, tuple)):
            assert len(lr) == 2, "Please provide a two values array for the learning rates or a float."
            self.g_lr = lr[0]
            self.d_lr = lr[1]

    def define_gan(self):
        """Define the trainable model components.

        Optionally validate model structure with mock inputs and initialize optimizers."""
        raise NotImplementedError

    @property
    def model_parameters(self):
        "Returns the parameters of the model."
        return self._model_parameters

    @property
    def model_name(self):
        "Returns the model (class) name."
        return self.__class__.__name__

    def fit(self,
            data: Union[DataFrame, array],
            num_cols: Optional[List[str]] = None,
            cat_cols: Optional[List[str]] = None,
            train_arguments: Optional[TrainParameters] = None) -> Union[DataFrame, array]:
        """
        Trains and fit a synthesizer model to a given input dataset.

        Args:
            data (Union[DataFrame, array]): Training data
            num_cols (Optional[List[str]]) : List with the names of the categorical columns
            cat_cols (Optional[List[str]]): List of names of categorical columns
            train_arguments (Optional[TrainParameters]): Training parameters

        Returns:
            Fitted synthesizer
        """
        if self.__MODEL__ in RegularModels.__members__:
            self.processor = RegularDataProcessor(num_cols=num_cols, cat_cols=cat_cols).fit(data)
        elif self.__MODEL__ in TimeSeriesModels.__members__:
            self.processor = TimeSeriesDataProcessor(num_cols=num_cols, cat_cols=cat_cols).fit(data)
        elif self.__MODEL__ == CTGANDataProcessor.SUPPORTED_MODEL:
            n_clusters = train_arguments.n_clusters
            epsilon = train_arguments.epsilon
            self.processor = CTGANDataProcessor(n_clusters=n_clusters, epsilon=epsilon,
                                                num_cols=num_cols, cat_cols=cat_cols).fit(data)
        elif self.__MODEL__ == DoppelGANgerProcessor.SUPPORTED_MODEL:
            measurement_cols = train_arguments.measurement_cols
            sequence_length = train_arguments.sequence_length
            sample_length = train_arguments.sample_length
            self.processor = DoppelGANgerProcessor(num_cols=num_cols, cat_cols=cat_cols,
                                                   measurement_cols=measurement_cols,
                                                   sequence_length=sequence_length,
                                                   sample_length=sample_length,
                                                   normalize_tanh=self.use_tanh).fit(data)
        else:
            print(f'A DataProcessor is not available for the {self.__MODEL__}.')

    def sample(self, n_samples: int):
        """
        Generates samples from the trained synthesizer.

        Args:
            n_samples (int): Number of rows to generated.

        Returns:
            synth_sample (pandas.DataFrame): generated synthetic samples.
        """
        steps = n_samples // self.batch_size + 1
        data = []
        for _ in tqdm.trange(steps, desc='Synthetic data generation'):
            z = random.uniform([self.batch_size, self.noise_dim], dtype=tf.dtypes.float32)
            records = self.generator(z, training=False).numpy()
            data.append(records)
        return self.processor.inverse_transform(array(vstack(data)))

    def save(self, path):
        """
        Saves a synthesizer as a pickle.

        Args:
            path (str): Path to write the synthesizer as a pickle object.
        """
        # Save only the generator?
        if self.__MODEL__ == 'WGAN' or self.__MODEL__ == 'WGAN_GP' or self.__MODEL__ == 'CWGAN_GP':
            del self.critic
        make_keras_picklable()
        dump(self, path)

    @classmethod
    def load(cls, path):
        """
        Loads a saved synthesizer from a pickle.

        Args:
            path (str): Path to read the synthesizer pickle from.
        """
        gpu_devices = tfconfig.list_physical_devices('GPU')
        if len(gpu_devices) > 0:
            try:
                tfconfig.experimental.set_memory_growth(gpu_devices[0], True)
            except (ValueError, RuntimeError):
                # Invalid device or cannot modify virtual devices once initialized.
                pass
        synth = load(path)
        return synth
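The save, load and sample methods documented above are exposed through the concrete synthesizers. A minimal sketch reusing the tabular quickstart setup (the file name is illustrative and not from the original page):

from pmlb import fetch_data
from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Train a CTGAN synthesizer as in the tabular quickstart
data = fetch_data('adult')
synth = RegularSynthesizer(modelname='ctgan',
                           model_parameters=ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9)))
synth.fit(data=data, train_arguments=TrainParameters(epochs=501),
          num_cols=['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week'],
          cat_cols=['workclass', 'education', 'education-num', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'native-country', 'target'])

# Persist the fitted synthesizer and reload it later to draw more samples
synth.save('adult_ctgan_synth.pkl')
restored = RegularSynthesizer.load('adult_ctgan_synth.pkl')
synth_data = restored.sample(1000)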
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- model_name
-
-
- property
-
-
-
-
-
-
-
-
Returns the model (class) name.
-
-
-
-
-
-
-
-
-
- model_parameters
-
-
- property
-
-
-
-
-
-
-
-
Returns the parameters of the model.
-
-
-
-
-
-
-
-
-
-
-
-
- define_gan()
-
-
-
-
-
-
-
Define the trainable model components.
-
Optionally validate model structure with mock inputs and initialize optimizers.
-
-
- Source code in ydata_synthetic/synthesizers/base.py
-
defdefine_gan(self):
-"""Define the trainable model components.
-
- Optionally validate model structure with mock inputs and initialize optimizers."""
-raiseNotImplementedError
-
deffit(self,
-data:Union[DataFrame,array],
-num_cols:Optional[List[str]]=None,
-cat_cols:Optional[List[str]]=None,
-train_arguments:Optional[TrainParameters]=None)->Union[DataFrame,array]:
-"""
- Trains and fit a synthesizer model to a given input dataset.
-
- Args:
- data (Union[DataFrame, array]): Training data
- num_cols (Optional[List[str]]) : List with the names of the categorical columns
- cat_cols (Optional[List[str]]): List of names of categorical columns
- train_arguments (Optional[TrainParameters]): Training parameters
-
- Returns:
- Fitted synthesizer
- """
-ifself.__MODEL__inRegularModels.__members__:
-self.processor=RegularDataProcessor(num_cols=num_cols,cat_cols=cat_cols).fit(data)
-elifself.__MODEL__inTimeSeriesModels.__members__:
-self.processor=TimeSeriesDataProcessor(num_cols=num_cols,cat_cols=cat_cols).fit(data)
-elifself.__MODEL__==CTGANDataProcessor.SUPPORTED_MODEL:
-n_clusters=train_arguments.n_clusters
-epsilon=train_arguments.epsilon
-self.processor=CTGANDataProcessor(n_clusters=n_clusters,epsilon=epsilon,
-num_cols=num_cols,cat_cols=cat_cols).fit(data)
-elifself.__MODEL__==DoppelGANgerProcessor.SUPPORTED_MODEL:
-measurement_cols=train_arguments.measurement_cols
-sequence_length=train_arguments.sequence_length
-sample_length=train_arguments.sample_length
-self.processor=DoppelGANgerProcessor(num_cols=num_cols,cat_cols=cat_cols,
-measurement_cols=measurement_cols,
-sequence_length=sequence_length,
-sample_length=sample_length,
-normalize_tanh=self.use_tanh).fit(data)
-else:
-print(f'A DataProcessor is not available for the {self.__MODEL__}.')
-
-
-
-
-
-
-
-
-
-
-
-
load(path) (classmethod)

Loads a saved synthesizer from a pickle.

Parameters:
    path (str): Path to read the synthesizer pickle from. (required)

Source code in ydata_synthetic/synthesizers/base.py (shown above in the class definition).
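Putting save and load together, a typical round trip looks like the sketch below; synth stands for any already-fitted synthesizer, the file name is arbitrary, and the RegularSynthesizer.load call assumes the loading pattern used elsewhere in the library.

# Sketch: persist a fitted synthesizer and restore it later.
# `synth` is assumed to be an already-fitted synthesizer instance.
synth.save('adult_synth.pkl')   # pickles the whole object (the critic is dropped for WGAN variants)

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
restored = RegularSynthesizer.load('adult_synth.pkl')   # classmethod re-applies GPU memory-growth settings
samples = restored.sample(1000)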
class CGAN(ConditionalModel):
    "CGAN model for discrete conditions"

    __MODEL__ = 'CGAN'

    def __init__(self, model_parameters):
        self._col_order = None
        super().__init__(model_parameters)

    def define_gan(self, activation_info: Optional[NamedTuple] = None):
        """Define the trainable model components.

        Args:
            activation_info (Optional[NamedTuple]): Defaults to None
        """
        self.generator = Generator(self.batch_size). \
            build_model(input_shape=(self.noise_dim,),
                        label_shape=(self.label_dim),
                        dim=self.layers_dim, data_dim=self.data_dim,
                        activation_info=activation_info, tau=self.tau)

        self.discriminator = Discriminator(self.batch_size). \
            build_model(input_shape=(self.data_dim,),
                        label_shape=(self.label_dim,),
                        dim=self.layers_dim)

        g_optimizer = Adam(self.g_lr, beta_1=self.beta_1, beta_2=self.beta_2)
        d_optimizer = Adam(self.d_lr, beta_1=self.beta_1, beta_2=self.beta_2)

        # Build and compile the discriminator
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=d_optimizer,
                                   metrics=['accuracy'])

        # The generator takes noise as input and generates records
        noise = Input(shape=(self.noise_dim,))
        label = Input(shape=(1,))  # A label vector is expected
        record = self.generator([noise, label])

        # For the combined model we will only train the generator
        self.discriminator.trainable = False

        # The discriminator takes generated records as input and determines validity
        validity = self.discriminator([record, label])

        # The combined model (stacked generator and discriminator)
        # Trains the generator to fool the discriminator
        self._model = Model([noise, label], validity)
        self._model.compile(loss='binary_crossentropy', optimizer=g_optimizer)

    def _generate_noise(self):
        """Gaussian noise for the generator input."""
        while True:
            yield random.uniform(shape=(self.noise_dim,))

    def get_batch_noise(self):
        """Create a batch iterator for the generator gaussian noise input."""
        return iter(tfdata.Dataset.from_generator(self._generate_noise, output_types=dtypes.float32)
                    .batch(self.batch_size)
                    .repeat())

    def get_data_batch(self, data, batch_size, seed=0):
        """Produce real data batches from the passed data object.

        Args:
            data: real data.
            batch_size: batch size.
            seed (int, optional): Defaults to 0.

        Returns:
            data batch.
        """
        start_i = (batch_size * seed) % len(data)
        stop_i = start_i + batch_size
        shuffle_seed = (batch_size * seed) // len(data)
        np.random.seed(shuffle_seed)
        data_ix = np.random.choice(data.shape[0], replace=False, size=len(data))  # wasteful to shuffle every time
        return data[data_ix[start_i:stop_i]]

    def fit(self,
            data: DataFrame,
            label_cols: List[str],
            train_arguments: TrainParameters,
            num_cols: List[str],
            cat_cols: List[str]):
        """Trains and fits a synthesizer model to a given input dataset.

        Args:
            data: A pandas DataFrame with the data to be synthesized
            label_cols: The names of the columns to be used as labels and conditions for the training
            train_arguments: GAN training arguments.
            num_cols: List of columns of the data object to be handled as numerical
            cat_cols: List of columns of the data object to be handled as categorical
        """
        data, label = self._prep_fit(data, label_cols, num_cols, cat_cols)

        processed_data = self.processor.transform(data)
        self.data_dim = processed_data.shape[1]
        self.label_dim = len(label_cols)

        # Init the GAN model and optimizers
        self.define_gan(self.processor.col_transform_info)

        # Merging labels with processed data
        processed_data = hstack([processed_data, label])

        noise_batches = self.get_batch_noise()

        iterations = int(abs(processed_data.shape[0] / self.batch_size) + 1)
        # Adversarial ground truths
        valid = np.ones((self.batch_size, 1))
        fake = np.zeros((self.batch_size, 1))

        for epoch in trange(train_arguments.epochs):
            for _ in range(iterations):
                # ---------------------
                #  Train Discriminator
                # ---------------------
                batch_x = self.get_data_batch(processed_data, self.batch_size)  # Batches are retrieved with labels
                batch_x, label = batch_x[:, :-1], batch_x[:, -1]  # Separate labels from batch
                noise = next(noise_batches)

                # Generate a batch of new records
                gen_records = self.generator([noise, label], training=True)

                # Train the discriminator
                d_loss_real = self.discriminator.train_on_batch([batch_x, label], valid)
                d_loss_fake = self.discriminator.train_on_batch([gen_records, label], fake)
                d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

                # ---------------------
                #  Train Generator
                # ---------------------
                noise = next(noise_batches)
                # Train the generator (to have the discriminator label samples as valid)
                g_loss = self._model.train_on_batch([noise, label], valid)

            # Plot the progress
            print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100 * d_loss[1], g_loss))

            # If at save interval => save model state and generated samples
            if epoch % train_arguments.sample_interval == 0:
                self._run_checkpoint(train_arguments, epoch, label)

    def _run_checkpoint(self, train_arguments, epoch, label):
        """Run checkpoint and store model state and generated samples.

        Args:
            train_arguments: GAN training arguments.
            epoch: training epoch
            label: deprecated
        """
        if path.exists('./cache') is False:
            os.mkdir('./cache')
        model_checkpoint_base_name = './cache/' + train_arguments.cache_prefix + '_{}_model_weights_step_{}.h5'
        self.generator.save_weights(model_checkpoint_base_name.format('generator', epoch))
        self.discriminator.save_weights(model_checkpoint_base_name.format('discriminator', epoch))
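A minimal training sketch for the class above, assuming a pandas DataFrame df whose 'label' column holds a discrete condition; the parameter values and column lists are illustrative, not recommendations.

# Sketch: condition a CGAN on a discrete label column (illustrative values).
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

gan_args = ModelParameters(batch_size=128, lr=5e-4, betas=(0.5, 0.9))
train_args = TrainParameters(epochs=201, sample_interval=50, cache_prefix='cgan_adult')

synth = CGAN(gan_args)
synth.fit(data=df,
          label_cols=['label'],
          train_arguments=train_args,
          num_cols=['age', 'hours-per-week'],
          cat_cols=['workclass'])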
define_gan(activation_info=None)

    Define the trainable model components.
    Parameters:
        activation_info (Optional[NamedTuple]): Defaults to None

fit(data, label_cols, train_arguments, num_cols, cat_cols)

    Trains and fits the synthesizer to a given input dataset, conditioned on the label columns.

get_batch_noise()

    Create a batch iterator for the generator gaussian noise input.

get_data_batch(data, batch_size, seed=0)

    Produce real data batches from the passed data object.
    Parameters:
        data: real data (required)
        batch_size: batch size (required)
        seed (int, optional): Defaults to 0
    Returns: data batch.

Source code in ydata_synthetic/synthesizers/regular/cgan/model.py (shown above in the class definition).
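get_data_batch walks a reshuffled index array in fixed windows: start_i advances by batch_size on every call, and the shuffle seed only changes once a full pass over the data completes. A small, self-contained NumPy illustration of that arithmetic on toy data (independent of the synthesizer classes):

import numpy as np

data = np.arange(10).reshape(-1, 1)   # toy "real data" with 10 rows
batch_size = 4

for seed in range(3):                 # three consecutive batch requests
    start_i = (batch_size * seed) % len(data)
    stop_i = start_i + batch_size
    shuffle_seed = (batch_size * seed) // len(data)
    np.random.seed(shuffle_seed)
    ix = np.random.choice(data.shape[0], replace=False, size=len(data))
    print(seed, start_i, stop_i, data[ix[start_i:stop_i]].ravel())

Note that the last window can run past the end of the index array and return a short batch; the CramerGAN variant below duplicates the shuffled index list precisely to cover that case.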
class CRAMERGAN(BaseGANModel):

    __MODEL__ = 'CRAMERGAN'

    def __init__(self, model_parameters, gradient_penalty_weight=10):
        """Create a base CramerGAN.

        Based on the CramerGAN paper - https://arxiv.org/pdf/1705.10743.pdf
        (the Cramer distance as a solution to biased Wasserstein gradients, https://arxiv.org/abs/1705.10743)."""
        self.gradient_penalty_weight = gradient_penalty_weight
        super().__init__(model_parameters)
-
-defdefine_gan(self,activation_info:Optional[NamedTuple]=None):
-"""Define the trainable model components.
-
- Args:
- activation_info (Optional[NamedTuple], optional): Defaults to None.
-
- Returns:
- (generator_optimizer, critic_optimizer): Generator and critic optimizers
- """
-self.generator=Generator(self.batch_size). \
-build_model(input_shape=(self.noise_dim,),dim=self.layers_dim,data_dim=self.data_dim,
-activation_info=activation_info,tau=self.tau)
-
-self.critic=Critic(self.batch_size). \
-build_model(input_shape=(self.data_dim,),dim=self.layers_dim)
-
-g_optimizer=Adam(self.g_lr,beta_1=self.beta_1,beta_2=self.beta_2)
-c_optimizer=Adam(self.d_lr,beta_1=self.beta_1,beta_2=self.beta_2)
-
-# The generator takes noise as input and generates records
-z=Input(shape=(self.noise_dim,),batch_size=self.batch_size)
-fake=self.generator(z)
-logits=self.critic(fake)
-
-returng_optimizer,c_optimizer
-
-defgradient_penalty(self,real,fake):
-"""Compute gradient penalty.
-
- Args:
- real: real event.
- fake: fake event.
- Returns:
- gradient_penalty.
- """
-gp=gradient_penalty(self.f_crit,real,fake,mode=Mode.CRAMER)
-returngp
-
-defupdate_gradients(self,x,g_optimizer,c_optimizer):
-"""Compute and apply the gradients for both the Generator and the Critic.
-
- Args:
- x: real data event
- g_optimizer: generator optimizer
- c_optimizer: critic optimizer
- Returns:
- (critic loss, generator loss)
- """
-# Update the gradients of critic for n_critic times (Training the critic)
-
-##New generator gradient_tape
-noise=tf.random.normal([x.shape[0],self.noise_dim],dtype=tf.dtypes.float32)
-noise2=tf.random.normal([x.shape[0],self.noise_dim],dtype=tf.dtypes.float32)
-
-withtf.GradientTape()asg_tape,tf.GradientTape()asd_tape:
-fake=self.generator(noise,training=True)
-fake2=self.generator(noise2,training=True)
-
-g_loss=self.g_lossfn(x,fake,fake2)
-
-c_loss=self.c_lossfn(x,fake,fake2)
-
-# Get the gradients of the generator
-g_gradients=g_tape.gradient(g_loss,self.generator.trainable_variables)
-
-# Update the weights of the generator
-g_optimizer.apply_gradients(
-zip(g_gradients,self.generator.trainable_variables)
-)
-
-c_gradient=d_tape.gradient(c_loss,self.critic.trainable_variables)
-# Update the weights of the critic using the optimizer
-c_optimizer.apply_gradients(
-zip(c_gradient,self.critic.trainable_variables)
-)
-
-returnc_loss,g_loss
-
-defg_lossfn(self,real,fake,fake2):
-"""Compute generator loss function according to the CramerGAN paper.
-
- Args:
- real: A real sample
- fake: A fake sample
-        fake2: A second fake sample
-
- Returns:
- Loss of the generator
- """
-g_loss=tf.norm(self.critic(real,training=True)-self.critic(fake,training=True),axis=1)+ \
-tf.norm(self.critic(real,training=True)-self.critic(fake2,training=True),axis=1)- \
-tf.norm(self.critic(fake,training=True)-self.critic(fake2,training=True),axis=1)
-returntf.reduce_mean(g_loss)
-
-deff_crit(self,real,fake):
-"""
- Computes the critic distance function f between two samples.
-
- Args:
- real: A real sample
- fake: A fake sample
- Returns:
- Loss of the critic
- """
-returntf.norm(self.critic(real,training=True)-self.critic(fake,training=True),axis=1)-tf.norm(self.critic(real,training=True),axis=1)
-
-defc_lossfn(self,real,fake,fake2):
-"""Compute the loss of the critic.
-
- Args:
- real: A real sample
- fake: A fake sample
- fake2: A second fake sample
-
- Returns:
- Loss of the critic
- """
-f_real=self.f_crit(real,fake2)
-f_fake=self.f_crit(fake,fake2)
-loss_surrogate=f_real-f_fake
-gp=self.gradient_penalty(real,[fake,fake2])
-returntf.reduce_mean(-loss_surrogate+self.gradient_penalty_weight*gp)
-
-@staticmethod
-defget_data_batch(train,batch_size,seed=0):
-"""Get real data batches from the passed data object.
-
- Args:
- train: real data.
- batch_size: batch size.
- seed (int, optional):Defaults to 0.
-
- Returns:
- data batch.
- """
-# np.random.seed(seed)
-# x = train.loc[ np.random.choice(train.index, batch_size) ].values
-# iterate through shuffled indices, so every sample gets covered evenly
-start_i=(batch_size*seed)%len(train)
-stop_i=start_i+batch_size
-shuffle_seed=(batch_size*seed)//len(train)
-np.random.seed(shuffle_seed)
-train_ix=np.random.choice(train.shape[0],replace=False,size=len(train))# wasteful to shuffle every time
-train_ix=list(train_ix)+list(train_ix)# duplicate to cover ranges past the end of the set
-returntrain[train_ix[start_i:stop_i]]
-
-deftrain_step(self,train_data,optimizers):
-"""Perform a training step.
-
- Args:
- train_data: training data
- optimizers: generator and critic optimizers
-
- Returns:
- (critic_loss, generator_loss): Critic and generator loss.
- """
-critic_loss,g_loss=self.update_gradients(train_data,*optimizers)
-returncritic_loss,g_loss
-
-deffit(self,data,train_arguments:TrainParameters,num_cols:List[str],cat_cols:List[str]):
-"""Fit a synthesizer model to a given input dataset.
-
- Args:
- data: A pandas DataFrame or a Numpy array with the data to be synthesized
- train_arguments: GAN training arguments.
- num_cols: List of columns of the data object to be handled as numerical
- cat_cols: List of columns of the data object to be handled as categorical
- """
-super().fit(data,num_cols,cat_cols)
-
-data=self.processor.transform(data)
-self.data_dim=data.shape[1]
-optimizers=self.define_gan(self.processor.col_transform_info)
-
-iterations=int(abs(data.shape[0]/self.batch_size)+1)
-
-# Create a summary file
-train_summary_writer=tf.summary.create_file_writer(path.join('..\cramergan_test','summaries','train'))
-
-withtrain_summary_writer.as_default():
-forepochintrange(train_arguments.epochs):
-foriterationinrange(iterations):
-batch_data=self.get_data_batch(data,self.batch_size)
-c_loss,g_loss=self.train_step(batch_data,optimizers)
-
-ifiteration%train_arguments.sample_interval==0:
-# Test here data generation step
-# save model checkpoints
-ifpath.exists('./cache')isFalse:
-os.mkdir('./cache')
-model_checkpoint_base_name='./cache/'+train_arguments.cache_prefix+'_{}_model_weights_step_{}.h5'
-self.generator.save_weights(model_checkpoint_base_name.format('generator',iteration))
-self.critic.save_weights(model_checkpoint_base_name.format('critic',iteration))
-print(f"Epoch: {epoch} | critic_loss: {c_loss} | gen_loss: {g_loss}")
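The generator loss above is the Cramér (energy-distance) surrogate: with h(.) the critic embedding, it averages ||h(x) - h(g1)|| + ||h(x) - h(g2)|| - ||h(g1) - h(g2)|| over the batch. A NumPy-only sketch of that formula, with a fixed random projection standing in for the trained critic:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
critic = lambda v: v @ W                     # stand-in embedding h(.) for illustration

real, fake, fake2 = (rng.normal(size=(32, 8)) for _ in range(3))

g_loss = np.mean(
    np.linalg.norm(critic(real) - critic(fake), axis=1)
    + np.linalg.norm(critic(real) - critic(fake2), axis=1)
    - np.linalg.norm(critic(fake) - critic(fake2), axis=1)
)
print(g_loss)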
Based on the CramerGAN paper - https://arxiv.org/pdf/1705.10743.pdf (the Cramer distance as a solution to biased Wasserstein gradients, https://arxiv.org/abs/1705.10743).

Source code in ydata_synthetic/synthesizers/regular/cramergan/model.py (shown above in the class definition).

__init__(model_parameters, gradient_penalty_weight=10)
    Create a base CramerGAN.

c_lossfn(real, fake, fake2)
    Compute the loss of the critic.
    Parameters:
        real: a real sample (required)
        fake: a fake sample (required)
        fake2: a second fake sample (required)
    Returns: loss of the critic.

define_gan(activation_info=None)
    Define the trainable model components.
    Parameters:
        activation_info (Optional[NamedTuple]): Defaults to None
    Returns: (generator_optimizer, critic_optimizer), the generator and critic optimizers.

f_crit(real, fake)
    Computes the critic distance function f between two samples.
    Parameters:
        real: a real sample (required)
        fake: a fake sample (required)
    Returns: loss of the critic.

fit(data, train_arguments, num_cols, cat_cols)
    Fit a synthesizer model to a given input dataset.
    Parameters:
        data: a pandas DataFrame or a Numpy array with the data to be synthesized (required)
        train_arguments (TrainParameters): GAN training arguments (required)
        num_cols (List[str]): columns of the data object to be handled as numerical (required)
        cat_cols (List[str]): columns of the data object to be handled as categorical (required)

g_lossfn(real, fake, fake2)
    Compute the generator loss function according to the CramerGAN paper.
    Parameters:
        real: a real sample (required)
        fake: a fake sample (required)
        fake2: a second fake sample (required)
    Returns: loss of the generator.

get_data_batch(train, batch_size, seed=0) (staticmethod)
    Get real data batches from the passed data object.
    Parameters:
        train: real data (required)
        batch_size: batch size (required)
        seed (int, optional): Defaults to 0
    Returns: data batch.

gradient_penalty(real, fake)
    Compute gradient penalty.
    Parameters:
        real: real event (required)
        fake: fake event (required)
    Returns: gradient_penalty.

update_gradients(x, g_optimizer, c_optimizer)
    Compute and apply the gradients for both the Generator and the Critic.
    Parameters:
        x: real data event
        g_optimizer: generator optimizer
        c_optimizer: critic optimizer
    Returns: (critic loss, generator loss).
load(class_dict) (staticmethod)

Load the CTGAN model from a pickle file. Only the required components to sample new data are loaded.

Source code in ydata_synthetic/synthesizers/regular/ctgan/model.py

@staticmethod
def load(class_dict):
    """
    Load the CTGAN model from a pickle file.
    Only the required components to sample new data are loaded.

    Args:
        class_dict: Class dict loaded from the pickle file.
    """
    new_instance = CTGAN(class_dict["model_parameters"])
    setattr(new_instance, "generator_dims", class_dict["generator_dims"])
    setattr(new_instance, "tau", class_dict["tau"])
    setattr(new_instance, "batch_size", class_dict["batch_size"])
    setattr(new_instance, "latent_dim", class_dict["latent_dim"])

    new_instance._conditional_sampler = ConditionalSampler()
    new_instance._conditional_sampler.__dict__ = class_dict["conditional_sampler"]
    new_instance.processor = CTGANDataProcessor()
    new_instance.processor.__dict__ = class_dict["processor"]

    new_instance._generator_model = new_instance._create_generator_model(
        class_dict["gen_input_dim"], class_dict["generator_dims"],
        class_dict["data_dim"], class_dict["metadata"], class_dict["tau"])

    new_instance._generator_model.build((class_dict["batch_size"], class_dict["gen_input_dim"]))
    new_instance._generator_model.set_weights(class_dict['generator_model_weights'])
    return new_instance
sample(n_samples)

Samples new data from the CTGAN.

Parameters:
    n_samples (int): Number of samples to be generated. (required)

Source code in ydata_synthetic/synthesizers/regular/ctgan/model.py

def sample(self, n_samples: int):
    """
    Samples new data from the CTGAN.

    Args:
        n_samples: Number of samples to be generated.
    """
    if n_samples <= 0:
        raise ValueError("Invalid number of samples.")

    steps = n_samples // self.batch_size + 1
    data = []
    for _ in tf.range(steps):
        fake_z = tf.random.normal([self.batch_size, self.latent_dim])
        cond_vec = self._conditional_sampler.sample(self.batch_size, from_active_bits=True)
        if cond_vec is not None:
            cond = tf.constant(cond_vec)
            fake_z = tf.concat([fake_z, cond], 1)

        fake = self._generator_model(fake_z)[1]
        data.append(fake.numpy())

    data = np.concatenate(data, 0)
    data = data[:n_samples]
    return self.processor.inverse_transform(data)
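Because the generator always produces whole batches, sample rounds the request up to a full number of generator passes and trims the surplus rows. A quick illustration of that arithmetic (the values are only an example):

batch_size, n_samples = 500, 1200
steps = n_samples // batch_size + 1      # 3 generator passes
generated = steps * batch_size           # 1500 candidate rows
print(generated - n_samples)             # 300 surplus rows discarded by data[:n_samples]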
save(path)

Save the CTGAN model in a pickle file. Only the required components to sample new data are saved.

Parameters:
    path: Path of the pickle file. (required)

Source code in ydata_synthetic/synthesizers/regular/ctgan/model.py

def save(self, path):
    """
    Save the CTGAN model in a pickle file.
    Only the required components to sample new data are saved.

    Args:
        path: Path of the pickle file.
    """
    dump({
        "model_parameters": self._model_parameters,
        "data_dim": self.processor.output_dimensions,
        "gen_input_dim": self.latent_dim + self._conditional_sampler.output_dimensions,
        "generator_dims": self.generator_dims,
        "tau": self.tau,
        "metadata": self.processor.metadata,
        "batch_size": self.batch_size,
        "latent_dim": self.latent_dim,
        "conditional_sampler": self._conditional_sampler.__dict__,
        "generator_model_weights": self._generator_model.get_weights(),
        "processor": self.processor.__dict__
    }, path)
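save and load above deliberately persist only what sampling needs (generator weights, the conditional sampler, and the fitted processor), not the critic. A hedged round-trip sketch, assuming synth is a fitted CTGAN and that dump/load are the same pickle helpers used in the source above (dump(obj, path) / load(path)):

# Sketch under assumptions; file name and helper signatures are illustrative.
synth.save('ctgan_sampler.pkl')       # writes the dict shown in save() above

state = load('ctgan_sampler.pkl')     # read the dict back with the library's pickle helper
restored = CTGAN.load(state)          # rebuild generator, sampler and processor
new_data = restored.sample(5000)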
class CWGANGP(ConditionalModel, WGAN_GP):

    __MODEL__ = 'CWGAN_GP'

    def __init__(self, model_parameters,
                 n_generator: Optional[int] = 1,
                 n_critic: Optional[int] = 1,
                 gradient_penalty_weight: int = 10):
        """
        Adapts the WGAN_GP synthesizer implementation to be conditional.

        Several conditional WGAN implementations can be found online, here are a few:
            https://cameronfabbri.github.io/papers/conditionalWGAN.pdf
            https://www.sciencedirect.com/science/article/abs/pii/S0020025519309715
            https://arxiv.org/pdf/2008.09202.pdf
        """
        WGAN_GP.__init__(self, model_parameters,
                         n_generator=n_generator,
                         n_critic=n_critic,
                         gradient_penalty_weight=gradient_penalty_weight)
-
-defdefine_gan(self,activation_info:Optional[NamedTuple]=None):
-"""Define the trainable model components.
-
- Args:
- activation_info (Optional[NamedTuple]): Defaults to None
- """
-self.generator=Generator(self.batch_size). \
-build_model(input_shape=(self.noise_dim,),
-label_shape=(self.label_dim,),
-dim=self.layers_dim,
-data_dim=self.data_dim,
-activation_info=activation_info,
-tau=self.tau)
-
-self.critic=Critic(self.batch_size). \
-build_model(input_shape=(self.data_dim,),
-label_shape=(self.label_dim,),
-dim=self.layers_dim)
-
-g_optimizer=Adam(self.g_lr,beta_1=self.beta_1,beta_2=self.beta_2)
-c_optimizer=Adam(self.d_lr,beta_1=self.beta_1,beta_2=self.beta_2)
-returng_optimizer,c_optimizer
-
-defgradient_penalty(self,real,fake,label):
-"""Compute gradient penalty.
-
- Args:
- real: real event.
- fake: fake event.
- label: ground truth.
- Returns:
- gradient_penalty
- """
-epsilon=random.uniform([real.shape[0],1],0.0,1.0,dtype=dtypes.float32)
-x_hat=epsilon*real+(1-epsilon)*fake
-withGradientTape()ast:
-t.watch(x_hat)
-d_hat=self.critic([x_hat,label])
-gradients=t.gradient(d_hat,x_hat)
-ddx=sqrt(reduce_sum(gradients**2))
-d_regularizer=reduce_mean((ddx-1.0)**2)
-returnd_regularizer
-
-@staticmethod
-defget_data_batch(data,batch_size,seed=0):
-"""Produce real data batches from the passed data object.
-
- Args:
-        data: real data.
- batch_size: batch size.
- seed (int, optional):Defaults to 0.
-
- Returns:
- data batch.
- """
-start_i=(batch_size*seed)%len(data)
-stop_i=start_i+batch_size
-shuffle_seed=(batch_size*seed)//len(data)
-np.random.seed(shuffle_seed)
-data_ix=np.random.choice(data.shape[0],replace=False,size=len(data))# wasteful to shuffle every time
-returndtypes.cast(data[data_ix[start_i:stop_i]],dtype=dtypes.float32)
-
-defc_lossfn(self,real):
-"""Compute the critic loss.
-
- Args:
- real: A real sample
-
- Returns:
- Critic loss
- """
-real,label=real
-# generating noise from a uniform distribution
-noise=random.uniform([real.shape[0],self.noise_dim],minval=0.999,maxval=1.0,dtype=dtypes.float32)
-# run noise through generator
-fake=self.generator([noise,label])
-# discriminate x and x_gen
-logits_real=self.critic([real,label])
-logits_fake=self.critic([fake,label])
-
-# gradient penalty
-gp=self.gradient_penalty(real,fake,label)
-# getting the loss of the critic.
-c_loss=(reduce_mean(logits_fake)
--reduce_mean(logits_real)
-+gp*self.gradient_penalty_weight)
-returnc_loss
-
-defg_lossfn(self,real):
-"""
- Forward pass on the generator and computes the loss.
-
- Args:
- real: Data batch we are analyzing
- Returns:
- Generator loss
- """
-real,label=real
-
-# generating noise from a uniform distribution
-noise=random.uniform([real.shape[0],self.noise_dim],minval=0.0,maxval=0.001,dtype=dtypes.float32)
-
-fake=self.generator([noise,label])
-logits_fake=self.critic([fake,label])
-g_loss=-reduce_mean(logits_fake)
-returng_loss
-
-deffit(self,data:DataFrame,
-label_cols:List[str],
-train_arguments:TrainParameters,
-num_cols:List[str],
-cat_cols:List[str]):
-"""
- Train the synthesizer on a provided dataset based on a specified condition column.
-
- Args:
- data: A pandas DataFrame with the data to be synthesized
-        label_cols: The names of the columns to be used as labels and conditions for the training
- train_arguments: GAN training arguments.
- num_cols: List of columns of the data object to be handled as numerical
- cat_cols: List of columns of the data object to be handled as categorical
- """
-data,label=self._prep_fit(data,label_cols,num_cols,cat_cols)
-
-processed_data=self.processor.transform(data)
-self.data_dim=processed_data.shape[1]
-self.label_dim=len(label_cols)
-
-#Init the GAN model and optimizers
-optimizers=self.define_gan(self.processor.col_transform_info)
-
-# Merging labels with processed data
-processed_data=hstack([processed_data,label])
-
-iterations=int(abs(processed_data.shape[0]/self.batch_size)+1)
-print(f'Number of iterations per epoch: {iterations}')
-
-forepochintrange(train_arguments.epochs):
-for_inrange(iterations):
-# ---------------------
-# Train Discriminator
-# ---------------------
-batch_x=self.get_data_batch(processed_data,self.batch_size)# Batches are retrieved with labels
-batch_x,label=batch_x[:,:-self.label_dim],batch_x[:,-self.label_dim:]# Separate labels from batch
-
-cri_loss,ge_loss=self.train_step((batch_x,label),optimizers)
-
-print(
-"Epoch: {} | critic_loss: {} | gen_loss: {}".format(
-epoch,cri_loss,ge_loss
-))
-
-# If at save interval => save model state and generated image samples
-ifepoch%train_arguments.sample_interval==0:
-self._run_checkpoint(train_arguments,epoch)
-
-def_run_checkpoint(self,train_arguments,epoch):
-"Run checkpoint and store model state and generated samples."
-ifpath.exists('./cache')isFalse:
-os.mkdir('./cache')
-model_checkpoint_base_name='./cache/'+train_arguments.cache_prefix+'_{}_model_weights_step_{}.h5'
-self.generator.save_weights(model_checkpoint_base_name.format('generator',epoch))
-self.critic.save_weights(model_checkpoint_base_name.format('critic',epoch))
-
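The conditional gradient penalty above follows the WGAN-GP recipe: sample a random interpolation x_hat between the real and generated batch, push it (with the label) through the critic, and penalise the squared deviation of the gradient norm from 1. A compact TensorFlow sketch of the unconditional version of that computation, mirroring the norm reduction used in the source:

import tensorflow as tf

def interpolation_penalty(critic, real, fake):
    # epsilon in [0, 1) per sample, broadcast across features
    epsilon = tf.random.uniform([real.shape[0], 1], 0.0, 1.0, dtype=tf.float32)
    x_hat = epsilon * real + (1 - epsilon) * fake
    with tf.GradientTape() as t:
        t.watch(x_hat)
        d_hat = critic(x_hat)
    gradients = t.gradient(d_hat, x_hat)
    ddx = tf.sqrt(tf.reduce_sum(gradients ** 2))   # gradient norm, as in the source above
    return tf.reduce_mean((ddx - 1.0) ** 2)

# usage sketch: critic can be any callable (e.g. a Keras model) mapping a batch to logits
# penalty = interpolation_penalty(critic_model, real_batch, fake_batch)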
Adapts the WGAN_GP synthesizer implementation to be conditional.

Several conditional WGAN implementations can be found online, here are a few:
    https://cameronfabbri.github.io/papers/conditionalWGAN.pdf
    https://www.sciencedirect.com/science/article/abs/pii/S0020025519309715
    https://arxiv.org/pdf/2008.09202.pdf

Source code in ydata_synthetic/synthesizers/regular/cwgangp/model.py (shown above in the class definition).

c_lossfn(real)
    Compute the critic loss.
    Parameters:
        real: a real sample (required)
    Returns: critic loss.

define_gan(activation_info=None)
    Define the trainable model components.
    Parameters:
        activation_info (Optional[NamedTuple]): Defaults to None

fit(data, label_cols, train_arguments, num_cols, cat_cols)
    Train the synthesizer on a provided dataset based on specified condition columns.
    Parameters:
        data: a pandas DataFrame with the data to be synthesized (required)
        label_cols: the names of the columns to be used as labels and conditions for the training (required)
        train_arguments (TrainParameters): GAN training arguments (required)
        num_cols (List[str]): columns of the data object to be handled as numerical (required)
        cat_cols (List[str]): columns of the data object to be handled as categorical (required)

g_lossfn(real)
    Forward pass on the generator and computes the loss.
    Parameters:
        real: data batch we are analyzing (required)
    Returns: generator loss.

get_data_batch(data, batch_size, seed=0) (staticmethod)
    Produce real data batches from the passed data object.
    Parameters:
        data: real data (required)
        batch_size: batch size (required)
        seed (int, optional): Defaults to 0
    Returns: data batch.

gradient_penalty(real, fake, label)
    Compute gradient penalty.
    Parameters:
        real: real event (required)
        fake: fake event (required)
        label: ground truth (required)
    Returns: gradient_penalty.
class DRAGAN(BaseGANModel):

    __MODEL__ = 'DRAGAN'

    def __init__(self, model_parameters, n_discriminator, gradient_penalty_weight=10):
        """DRAGAN model architecture implementation.

        Args:
            model_parameters: Set of architectural parameters for model definition.
            n_discriminator: Number of discriminator updates per generator update.
            gradient_penalty_weight (int, optional): Defaults to 10.
        """
        # As recommended in DRAGAN paper - https://arxiv.org/abs/1705.07215
        self.n_discriminator = n_discriminator
        self.gradient_penalty_weight = gradient_penalty_weight
        super().__init__(model_parameters)
-
-defdefine_gan(self,col_transform_info:Optional[NamedTuple]=None):
-"""Define the trainable model components.
-
- Args:
- col_transform_info (Optional[NamedTuple], optional): Defaults to None.
-
- Returns:
- (generator_optimizer, discriminator_optimizer): Generator and discriminator optimizers
- """
-# define generator/discriminator
-self.generator=Generator(self.batch_size). \
-build_model(input_shape=(self.noise_dim,),dim=self.layers_dim,data_dim=self.data_dim,
-activation_info=col_transform_info,tau=self.tau)
-
-self.discriminator=Discriminator(self.batch_size). \
-build_model(input_shape=(self.data_dim,),dim=self.layers_dim)
-
-g_optimizer=Adam(self.g_lr,beta_1=self.beta_1,beta_2=self.beta_2,clipvalue=0.001)
-d_optimizer=Adam(self.d_lr,beta_1=self.beta_1,beta_2=self.beta_2,clipvalue=0.001)
-returng_optimizer,d_optimizer
-
-defgradient_penalty(self,real,fake):
-"""Compute gradient penalty.
-
- Args:
- real: real event.
- fake: fake event.
- Returns:
- gradient_penalty.
- """
-gp=gradient_penalty(self.discriminator,real,fake,mode=Mode.DRAGAN)
-returngp
-
-defupdate_gradients(self,x,g_optimizer,d_optimizer):
-"""Compute the gradients for Generator and Discriminator.
-
- Args:
- x (tf.tensor): real data event
- g_optimizer (tf.OptimizerV2): Optimizer for the generator model
-        d_optimizer (tf.OptimizerV2): Optimizer for the discriminator model
- Returns:
- (discriminator loss, generator loss)
- """
-# Update the gradients of critic for n_critic times (Training the critic)
-for_inrange(self.n_discriminator):
-withtf.GradientTape()asd_tape:
-d_loss=self.d_lossfn(x)
-# Get the gradients of the critic
-d_gradient=d_tape.gradient(d_loss,self.discriminator.trainable_variables)
-# Update the weights of the critic using the optimizer
-d_optimizer.apply_gradients(
-zip(d_gradient,self.discriminator.trainable_variables)
-)
-
-# Update the generator
-withtf.GradientTape()asg_tape:
-gen_loss=self.g_lossfn(x)
-
-# Get the gradients of the generator
-gen_gradients=g_tape.gradient(gen_loss,self.generator.trainable_variables)
-
-# Update the weights of the generator
-g_optimizer.apply_gradients(
-zip(gen_gradients,self.generator.trainable_variables)
-)
-
-returnd_loss,gen_loss
-
-defd_lossfn(self,real):
-"""Calculates the critic losses.
-
- Args:
- real: real data examples.
-
- Returns:
- discriminator loss
- """
-noise=tf.random.normal((self.batch_size,self.noise_dim),dtype=tf.dtypes.float64)
-# run noise through generator
-fake=self.generator(noise)
-# discriminate x and x_gen
-logits_real=self.discriminator(real,training=True)
-logits_fake=self.discriminator(fake,training=True)
-
-# gradient penalty
-gp=self.gradient_penalty(real,fake)
-
-# getting the loss of the discriminator.
-d_loss=(tf.reduce_mean(logits_fake)
--tf.reduce_mean(logits_real)
-+gp*self.gradient_penalty_weight)
-returnd_loss
-
-defg_lossfn(self,real):
-"""Calculates the Generator losses.
-
- Args:
- real: real data.
- Returns:
- generator loss
- """
-# generating noise from a uniform distribution
-noise=tf.random.normal((real.shape[0],self.noise_dim),dtype=tf.float64)
-
-fake=self.generator(noise,training=True)
-logits_fake=self.discriminator(fake,training=True)
-g_loss=-tf.reduce_mean(logits_fake)
-returng_loss
-
-defget_data_batch(self,train,batch_size):
-"""Get real data batches from the passed data object.
-
- Args:
- train: real data.
- batch_size: batch size.
- seed (int, optional):Defaults to 0.
-
- Returns:
- data batch.
- """
-buffer_size=len(train)
-#tensor_data = pd.concat([x_train, y_train], axis=1)
-train_loader=tf.data.Dataset.from_tensor_slices(train) \
-.batch(batch_size).shuffle(buffer_size)
-returntrain_loader
-
-deftrain_step(self,train_data,optimizers):
-"""Perform a training step.
-
- Args:
- train_data: training data
- optimizers: generator and critic optimizers
-
- Returns:
- (critic_loss, generator_loss): Critic and generator loss.
- """
-d_loss,g_loss=self.update_gradients(train_data,*optimizers)
-returnd_loss,g_loss
-
-deffit(self,data,train_arguments,num_cols,cat_cols):
-"""Fit a synthesizer model to a given input dataset.
-
- Args:
- data: A pandas DataFrame or a Numpy array with the data to be synthesized
- train_arguments: GAN training arguments.
- num_cols: List of columns of the data object to be handled as numerical
- cat_cols: List of columns of the data object to be handled as categorical
- """
-super().fit(data,num_cols,cat_cols)
-
-processed_data=self.processor.transform(data)
-self.data_dim=processed_data.shape[1]
-optimizers=self.define_gan(self.processor.col_transform_info)
-
-train_loader=self.get_data_batch(processed_data,self.batch_size)
-
-# Create a summary file
-train_summary_writer=tf.summary.create_file_writer(path.join('..\dragan_test','summaries','train'))
-
-withtrain_summary_writer.as_default():
-forepochintqdm.trange(train_arguments.epochs):
-forbatch_dataintrain_loader:
-batch_data=tf.cast(batch_data,dtype=tf.float32)
-d_loss,g_loss=self.train_step(batch_data,optimizers)
-
-print(
-"Epoch: {} | disc_loss: {} | gen_loss: {}".format(
-epoch,d_loss,g_loss
-))
-
-ifepoch%train_arguments.sample_interval==0:
-# Test here data generation step
-# save model checkpoints
-ifpath.exists('./cache')isFalse:
-os.mkdir('./cache')
-model_checkpoint_base_name='./cache/'+train_arguments.cache_prefix+'_{}_model_weights_step_{}.h5'
-self.generator.save_weights(model_checkpoint_base_name.format('generator',epoch))
-self.discriminator.save_weights(model_checkpoint_base_name.format('discriminator',epoch))
-
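Unlike the window-based batching used by the other models, DRAGAN streams the processed data through a tf.data pipeline (from_tensor_slices, then batch, then shuffle), so each epoch simply iterates the loader. A standalone sketch of that pipeline on toy data:

import numpy as np
import tensorflow as tf

train = np.random.normal(size=(1000, 16)).astype('float32')   # stand-in for processed data
batch_size = 128

train_loader = tf.data.Dataset.from_tensor_slices(train) \
    .batch(batch_size).shuffle(buffer_size=len(train))

for batch_data in train_loader:
    batch_data = tf.cast(batch_data, dtype=tf.float32)
    # here DRAGAN.fit would call self.train_step(batch_data, optimizers)
    print(batch_data.shape)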
g_lossfn(real)

    Calculates the Generator losses.
    Parameters:
        real: real data (required)
    Returns: generator loss.
Base class of GAN synthesizer models.
The main methods are train (for fitting the synthesizer), save/load, and sample (to obtain synthetic records).

Args:
    model_parameters (ModelParameters): Set of architectural parameters for model definition.

Source code in ydata_synthetic/synthesizers/base.py
@typechecked
-classBaseGANModel(BaseModel):
-"""
- Base class of GAN synthesizer models.
- The main methods are train (for fitting the synthesizer), save/load and sample (obtain synthetic records).
- Args:
- model_parameters (ModelParameters):
- Set of architectural parameters for model definition.
- """
-def__init__(
-self,
-model_parameters:ModelParameters
-):
-gpu_devices=tfconfig.list_physical_devices('GPU')
-iflen(gpu_devices)>0:
-try:
-tfconfig.experimental.set_memory_growth(gpu_devices[0],True)
-except(ValueError,RuntimeError):
-# Invalid device or cannot modify virtual devices once initialized.
-pass
-#Validate the provided model parameters
-ifmodel_parameters.betasisnotNone:
-assertlen(model_parameters.betas)==2,"Please provide the betas information as a tuple."
-
-self.batch_size=model_parameters.batch_size
-self._set_lr(model_parameters.lr)
-self.beta_1=model_parameters.betas[0]
-self.beta_2=model_parameters.betas[1]
-self.noise_dim=model_parameters.noise_dim
-self.data_dim=None
-self.layers_dim=model_parameters.layers_dim
-
-# Additional parameters for the CTGAN
-self.generator_dims=model_parameters.generator_dims
-self.critic_dims=model_parameters.critic_dims
-self.l2_scale=model_parameters.l2_scale
-self.latent_dim=model_parameters.latent_dim
-self.gp_lambda=model_parameters.gp_lambda
-self.pac=model_parameters.pac
-
-self.use_tanh=model_parameters.tanh
-self.processor=None
-ifself.__MODEL__inRegularModels.__members__or \
-self.__MODEL__==CTGANDataProcessor.SUPPORTED_MODEL:
-self.tau=model_parameters.tau_gs
-
-# pylint: disable=E1101
-def__call__(self,inputs,**kwargs):
-returnself.model(inputs=inputs,**kwargs)
-
-# pylint: disable=C0103
-def_set_lr(self,lr):
-ifisinstance(lr,float):
-self.g_lr=lr
-self.d_lr=lr
-elifisinstance(lr,(list,tuple)):
-assertlen(lr)==2,"Please provide a two values array for the learning rates or a float."
-self.g_lr=lr[0]
-self.d_lr=lr[1]
-
-defdefine_gan(self):
-"""Define the trainable model components.
-
- Optionally validate model structure with mock inputs and initialize optimizers."""
-raiseNotImplementedError
-
-@property
-defmodel_parameters(self):
-"Returns the parameters of the model."
-returnself._model_parameters
-
-@property
-defmodel_name(self):
-"Returns the model (class) name."
-returnself.__class__.__name__
-
-deffit(self,
-data:Union[DataFrame,array],
-num_cols:Optional[List[str]]=None,
-cat_cols:Optional[List[str]]=None,
-train_arguments:Optional[TrainParameters]=None)->Union[DataFrame,array]:
-"""
- Trains and fit a synthesizer model to a given input dataset.
-
- Args:
- data (Union[DataFrame, array]): Training data
-        num_cols (Optional[List[str]]): List with the names of the numerical columns
- cat_cols (Optional[List[str]]): List of names of categorical columns
- train_arguments (Optional[TrainParameters]): Training parameters
-
- Returns:
- Fitted synthesizer
- """
-ifself.__MODEL__inRegularModels.__members__:
-self.processor=RegularDataProcessor(num_cols=num_cols,cat_cols=cat_cols).fit(data)
-elifself.__MODEL__inTimeSeriesModels.__members__:
-self.processor=TimeSeriesDataProcessor(num_cols=num_cols,cat_cols=cat_cols).fit(data)
-elifself.__MODEL__==CTGANDataProcessor.SUPPORTED_MODEL:
-n_clusters=train_arguments.n_clusters
-epsilon=train_arguments.epsilon
-self.processor=CTGANDataProcessor(n_clusters=n_clusters,epsilon=epsilon,
-num_cols=num_cols,cat_cols=cat_cols).fit(data)
-elifself.__MODEL__==DoppelGANgerProcessor.SUPPORTED_MODEL:
-measurement_cols=train_arguments.measurement_cols
-sequence_length=train_arguments.sequence_length
-sample_length=train_arguments.sample_length
-self.processor=DoppelGANgerProcessor(num_cols=num_cols,cat_cols=cat_cols,
-measurement_cols=measurement_cols,
-sequence_length=sequence_length,
-sample_length=sample_length,
-normalize_tanh=self.use_tanh).fit(data)
-else:
-print(f'A DataProcessor is not available for the {self.__MODEL__}.')
-
-defsample(self,n_samples:int):
-"""
- Generates samples from the trained synthesizer.
-
- Args:
- n_samples (int): Number of rows to generated.
-
- Returns:
- synth_sample (pandas.DataFrame): generated synthetic samples.
- """
-steps=n_samples//self.batch_size+1
-data=[]
-for_intqdm.trange(steps,desc='Synthetic data generation'):
-z=random.uniform([self.batch_size,self.noise_dim],dtype=tf.dtypes.float32)
-records=self.generator(z,training=False).numpy()
-data.append(records)
-returnself.processor.inverse_transform(array(vstack(data)))
-
-defsave(self,path):
-"""
- Saves a synthesizer as a pickle.
-
- Args:
- path (str): Path to write the synthesizer as a pickle object.
- """
-#Save only the generator?
-ifself.__MODEL__=='WGAN'orself.__MODEL__=='WGAN_GP'orself.__MODEL__=='CWGAN_GP':
-delself.critic
-make_keras_picklable()
-dump(self,path)
-
-@classmethod
-defload(cls,path):
-"""
- Loads a saved synthesizer from a pickle.
-
- Args:
- path (str): Path to read the synthesizer pickle from.
- """
-gpu_devices=tfconfig.list_physical_devices('GPU')
-iflen(gpu_devices)>0:
-try:
-tfconfig.experimental.set_memory_growth(gpu_devices[0],True)
-except(ValueError,RuntimeError):
-# Invalid device or cannot modify virtual devices once initialized.
-pass
-synth=load(path)
-returnsynth
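Note from _set_lr above that the lr field of ModelParameters can be either a single float (shared by generator and discriminator/critic) or a two-element tuple/list with separate rates, while betas must always be a two-value tuple. For example (values illustrative):

from ydata_synthetic.synthesizers import ModelParameters

# One learning rate for both networks
shared = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))

# Separate generator / discriminator learning rates (g_lr, d_lr)
split = ModelParameters(batch_size=500, lr=(5e-4, 3e-4), betas=(0.5, 0.9))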
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- model_name
-
-
- property
-
-
-
-
-
-
-
-
Returns the model (class) name.
-
-
-
-
-
-
-
-
-
- model_parameters
-
-
- property
-
-
-
-
-
-
-
-
Returns the parameters of the model.
-
-
-
-
-
-
-
-
-
-
-
-
- define_gan()
-
-
-
-
-
-
-
Define the trainable model components.
-
Optionally validate model structure with mock inputs and initialize optimizers.
-
-
- Source code in ydata_synthetic/synthesizers/base.py
-
defdefine_gan(self):
-"""Define the trainable model components.
-
- Optionally validate model structure with mock inputs and initialize optimizers."""
-raiseNotImplementedError

fit(data, num_cols=None, cat_cols=None, train_arguments=None)
Trains and fits a synthesizer model to a given input dataset.
Source code in ydata_synthetic/synthesizers/base.py

    def fit(self,
            data: Union[DataFrame, array],
            num_cols: Optional[List[str]] = None,
            cat_cols: Optional[List[str]] = None,
            train_arguments: Optional[TrainParameters] = None) -> Union[DataFrame, array]:
        """
        Trains and fits a synthesizer model to a given input dataset.

        Args:
            data (Union[DataFrame, array]): Training data
            num_cols (Optional[List[str]]): List with the names of the numerical columns
            cat_cols (Optional[List[str]]): List of names of categorical columns
            train_arguments (Optional[TrainParameters]): Training parameters

        Returns:
            Fitted synthesizer
        """
        if self.__MODEL__ in RegularModels.__members__:
            self.processor = RegularDataProcessor(num_cols=num_cols, cat_cols=cat_cols).fit(data)
        elif self.__MODEL__ in TimeSeriesModels.__members__:
            self.processor = TimeSeriesDataProcessor(num_cols=num_cols, cat_cols=cat_cols).fit(data)
        elif self.__MODEL__ == CTGANDataProcessor.SUPPORTED_MODEL:
            n_clusters = train_arguments.n_clusters
            epsilon = train_arguments.epsilon
            self.processor = CTGANDataProcessor(n_clusters=n_clusters, epsilon=epsilon,
                                                num_cols=num_cols, cat_cols=cat_cols).fit(data)
        elif self.__MODEL__ == DoppelGANgerProcessor.SUPPORTED_MODEL:
            measurement_cols = train_arguments.measurement_cols
            sequence_length = train_arguments.sequence_length
            sample_length = train_arguments.sample_length
            self.processor = DoppelGANgerProcessor(num_cols=num_cols, cat_cols=cat_cols,
                                                   measurement_cols=measurement_cols,
                                                   sequence_length=sequence_length,
                                                   sample_length=sample_length,
                                                   normalize_tanh=self.use_tanh).fit(data)
        else:
            print(f'A DataProcessor is not available for the {self.__MODEL__}.')

load(path) classmethod
Loads a saved synthesizer from a pickle.
Parameters: path (str, required): Path to read the synthesizer pickle from.
Source code in ydata_synthetic/synthesizers/base.py (see the class listing above)
class VanilllaGAN(BaseGANModel):

    __MODEL__ = 'GAN'

    def __init__(self, model_parameters):
        super().__init__(model_parameters)

    def define_gan(self, activation_info: Optional[NamedTuple]):
        """Define the trainable model components.

        Args:
            activation_info (Optional[NamedTuple], optional): Defaults to None.

        Returns:
            (generator_optimizer, critic_optimizer): Generator and critic optimizers
        """
        self.generator = Generator(self.batch_size).\
            build_model(input_shape=(self.noise_dim,), dim=self.layers_dim, data_dim=self.data_dim,)

        self.discriminator = Discriminator(self.batch_size).\
            build_model(input_shape=(self.data_dim,), dim=self.layers_dim)

        g_optimizer = Adam(self.g_lr, beta_1=self.beta_1, beta_2=self.beta_2)
        d_optimizer = Adam(self.d_lr, beta_1=self.beta_1, beta_2=self.beta_2)

        # Build and compile the discriminator
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=d_optimizer,
                                   metrics=['accuracy'])

        # The generator takes noise as input and generates records
        z = Input(shape=(self.noise_dim,))
        record = self.generator(z)

        # For the combined model we will only train the generator
        self.discriminator.trainable = False

        # The discriminator takes generated records as input and determines validity
        validity = self.discriminator(record)

        # The combined model (stacked generator and discriminator)
        # Trains the generator to fool the discriminator
        self._model = Model(z, validity)
        self._model.compile(loss='binary_crossentropy', optimizer=g_optimizer)

    def get_data_batch(self, train, batch_size, seed=0):
        """Get real data batches from the passed data object.

        Args:
            train: real data
            batch_size: batch size
            seed (int, optional): Defaults to 0.

        Returns:
            data batch
        """
        # random sampling - some samples will have excessively low or high sampling, but easy to implement
        # np.random.seed(seed)
        # x = train.loc[np.random.choice(train.index, batch_size)].values
        # iterate through shuffled indices, so every sample gets covered evenly
        start_i = (batch_size * seed) % len(train)
        stop_i = start_i + batch_size
        shuffle_seed = (batch_size * seed) // len(train)
        np.random.seed(shuffle_seed)
        train_ix = np.random.choice(train.shape[0], replace=False, size=len(train))  # wasteful to shuffle every time
        train_ix = list(train_ix) + list(train_ix)  # duplicate to cover ranges past the end of the set
        return train[train_ix[start_i:stop_i]]

    def fit(self, data, train_arguments: TrainParameters, num_cols: List[str], cat_cols: List[str]):
        """Fit a synthesizer model to a given input dataset.

        Args:
            data: A pandas DataFrame or a Numpy array with the data to be synthesized
            train_arguments: GAN training arguments.
            num_cols (List[str]): List of columns of the data object to be handled as numerical
            cat_cols (List[str]): List of columns of the data object to be handled as categorical
        """
        super().fit(data, num_cols, cat_cols)

        processed_data = self.processor.transform(data)
        self.data_dim = processed_data.shape[1]
        self.define_gan(self.processor.col_transform_info)

        iterations = int(abs(data.shape[0] / self.batch_size) + 1)

        # Adversarial ground truths
        valid = np.ones((self.batch_size, 1))
        fake = np.zeros((self.batch_size, 1))

        for epoch in trange(train_arguments.epochs):
            for _ in range(iterations):
                # ---------------------
                #  Train Discriminator
                # ---------------------
                batch_data = self.get_data_batch(processed_data, self.batch_size)
                noise = tf.random.normal((self.batch_size, self.noise_dim))

                # Generate a batch of events
                gen_data = self.generator(noise, training=True)

                # Train the discriminator
                d_loss_real = self.discriminator.train_on_batch(batch_data, valid)
                d_loss_fake = self.discriminator.train_on_batch(gen_data, fake)
                d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

                # ---------------------
                #  Train Generator
                # ---------------------
                noise = tf.random.normal((self.batch_size, self.noise_dim))
                # Train the generator (to have the discriminator label samples as valid)
                g_loss = self._model.train_on_batch(noise, valid)

            # Plot the progress
            print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100 * d_loss[1], g_loss))

            # If at save interval => save generated events
            if epoch % train_arguments.sample_interval == 0:
                # Test here data generation step
                # save model checkpoints
                if path.exists('./cache') is False:
                    os.mkdir('./cache')
                model_checkpoint_base_name = './cache/' + train_arguments.cache_prefix + '_{}_model_weights_step_{}.h5'
                self.generator.save_weights(model_checkpoint_base_name.format('generator', epoch))
                self.discriminator.save_weights(model_checkpoint_base_name.format('discriminator', epoch))

                # Here is generating the data
                z = tf.random.normal((432, self.noise_dim))
                gen_data = self.generator(z)
                print('generated_data')

define_gan(activation_info)
Define the trainable model components.
Parameters: activation_info (Optional[NamedTuple], required): Defaults to None.
Returns: (generator_optimizer, critic_optimizer): Generator and critic optimizers
Source code in ydata_synthetic/synthesizers/regular/vanillagan/model.py (see the class listing above)

fit(data, train_arguments, num_cols, cat_cols)
Fit a synthesizer model to a given input dataset.
Parameters:
    data (required): A pandas DataFrame or a Numpy array with the data to be synthesized
    train_arguments (TrainParameters, required): GAN training arguments.
    num_cols (List[str], required): List of columns of the data object to be handled as numerical
    cat_cols (List[str], required): List of columns of the data object to be handled as categorical
Source code in ydata_synthetic/synthesizers/regular/vanillagan/model.py (see the class listing above)

get_data_batch(train, batch_size, seed=0)
Get real data batches from the passed data object.
Parameters:
    train (required): real data
    batch_size (required): batch size
    seed (int, default 0)
Returns: data batch
Source code in ydata_synthetic/synthesizers/regular/vanillagan/model.py (see the class listing above)
class WGAN(BaseGANModel):

    __MODEL__ = 'WGAN'

    def __init__(self, model_parameters, n_critic, clip_value=0.01):
        # As recommended in WGAN paper - https://arxiv.org/abs/1701.07875
        # WGAN-GP - WGAN with Gradient Penalty
        self.n_critic = n_critic
        self.clip_value = clip_value
        super().__init__(model_parameters)

    def wasserstein_loss(self, y_true, y_pred):
        """Calculate wasserstein loss.

        Args:
            y_true: ground truth.
            y_pred: predictions.

        Returns:
            wasserstein loss.
        """
        return K.mean(y_true * y_pred)

    def define_gan(self, activation_info: Optional[NamedTuple] = None):
        """Define the trainable model components.

        Args:
            activation_info (Optional[NamedTuple], optional): Defaults to None.

        Returns:
            (generator_optimizer, critic_optimizer): Generator and critic optimizers.
        """
        self.generator = Generator(self.batch_size). \
            build_model(input_shape=(self.noise_dim,), dim=self.layers_dim, data_dim=self.data_dim,
                        activation_info=activation_info, tau=self.tau)

        self.critic = Critic(self.batch_size). \
            build_model(input_shape=(self.data_dim,), dim=self.layers_dim)

        optimizer = Adam(self.g_lr, beta_1=self.beta_1, beta_2=self.beta_2)
        critic_optimizer = Adam(self.d_lr, beta_1=self.beta_1, beta_2=self.beta_2)

        # Build and compile the critic
        self.critic.compile(loss=self.wasserstein_loss,
                            optimizer=critic_optimizer,
                            metrics=['accuracy'])

        # The generator takes noise as input and generates records
        z = Input(shape=(self.noise_dim,))
        record = self.generator(z)
        # The critic takes generated records as input and determines validity
        validity = self.critic(record)

        # For the combined model we will only train the generator
        self.critic.trainable = False

        # The combined model (stacked generator and critic)
        # Trains the generator to fool the critic
        # For the WGAN model use the Wasserstein loss
        self._model = Model(z, validity)
        self._model.compile(loss='binary_crossentropy', optimizer=optimizer)

    def get_data_batch(self, train, batch_size, seed=0):
        """Get real data batches from the passed data object.

        Args:
            train: real data.
            batch_size: batch size.
            seed (int, optional): Defaults to 0.

        Returns:
            data batch.
        """
        # np.random.seed(seed)
        # x = train.loc[np.random.choice(train.index, batch_size)].values
        # iterate through shuffled indices, so every sample gets covered evenly
        start_i = (batch_size * seed) % len(train)
        stop_i = start_i + batch_size
        shuffle_seed = (batch_size * seed) // len(train)
        np.random.seed(shuffle_seed)
        train_ix = np.random.choice(train.shape[0], replace=False, size=len(train))  # wasteful to shuffle every time
        train_ix = list(train_ix) + list(train_ix)  # duplicate to cover ranges past the end of the set
        return train[train_ix[start_i:stop_i]]

    def fit(self, data, train_arguments: TrainParameters, num_cols: List[str],
            cat_cols: List[str]):
        """Fit a synthesizer model to a given input dataset.

        Args:
            data: A pandas DataFrame or a Numpy array with the data to be synthesized.
            train_arguments: GAN training arguments.
            num_cols (List[str]): List of columns of the data object to be handled as numerical.
            cat_cols (List[str]): List of columns of the data object to be handled as categorical.
        """
        super().fit(data, num_cols, cat_cols)

        processed_data = self.processor.transform(data)
        self.data_dim = processed_data.shape[1]
        self.define_gan(self.processor.col_transform_info)

        # Create a summary file
        iterations = int(abs(data.shape[0] / self.batch_size) + 1)
        train_summary_writer = tf.summary.create_file_writer(path.join('.', 'summaries', 'train'))

        # Adversarial ground truths
        valid = np.ones((self.batch_size, 1))
        fake = -np.ones((self.batch_size, 1))

        with train_summary_writer.as_default():
            for epoch in trange(train_arguments.epochs, desc='Epoch Iterations'):
                for _ in range(iterations):
                    for _ in range(self.n_critic):
                        # ---------------------
                        #  Train the Critic
                        # ---------------------
                        batch_data = self.get_data_batch(processed_data, self.batch_size)
                        noise = tf.random.normal((self.batch_size, self.noise_dim))

                        # Generate a batch of events
                        gen_data = self.generator(noise)

                        # Train the Critic
                        d_loss_real = self.critic.train_on_batch(batch_data, valid)
                        d_loss_fake = self.critic.train_on_batch(gen_data, fake)
                        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

                        for l in self.critic.layers:
                            weights = l.get_weights()
                            weights = [np.clip(w, -self.clip_value, self.clip_value) for w in weights]
                            l.set_weights(weights)

                    # ---------------------
                    #  Train Generator
                    # ---------------------
                    noise = tf.random.normal((self.batch_size, self.noise_dim))
                    # Train the generator (to have the critic label samples as valid)
                    g_loss = self._model.train_on_batch(noise, valid)
                    # Plot the progress
                    print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100 * d_loss[1], g_loss))

                # If at save interval => save generated events
                if epoch % train_arguments.sample_interval == 0:
                    # Test here data generation step
                    # save model checkpoints
                    if path.exists('./cache') is False:
                        mkdir('./cache')
                    model_checkpoint_base_name = './cache/' + train_arguments.cache_prefix + '_{}_model_weights_step_{}.h5'
                    self.generator.save_weights(model_checkpoint_base_name.format('generator', epoch))
                    self.critic.save_weights(model_checkpoint_base_name.format('critic', epoch))

define_gan(activation_info=None)
Define the trainable model components.
Parameters: activation_info (Optional[NamedTuple], default None)
Returns: (generator_optimizer, critic_optimizer): Generator and critic optimizers.
Source code in ydata_synthetic/synthesizers/regular/wgan/model.py (see the class listing above)

fit(data, train_arguments, num_cols, cat_cols)
Fit a synthesizer model to a given input dataset.
Parameters:
    data (required): A pandas DataFrame or a Numpy array with the data to be synthesized.
    train_arguments (TrainParameters, required): GAN training arguments.
    num_cols (List[str], required): List of columns of the data object to be handled as numerical.
    cat_cols (List[str], required): List of columns of the data object to be handled as categorical.
Source code in ydata_synthetic/synthesizers/regular/wgan/model.py (see the class listing above)

get_data_batch(train, batch_size, seed=0)
Get real data batches from the passed data object.
Parameters:
    train (required): real data.
    batch_size (required): batch size.
    seed (int, default 0)
Returns: data batch.
Source code in ydata_synthetic/synthesizers/regular/wgan/model.py (see the class listing above)

wasserstein_loss(y_true, y_pred)
Calculate wasserstein loss.
Parameters:
    y_true (required): ground truth.
    y_pred (required): predictions.
Returns: wasserstein loss.
Source code in ydata_synthetic/synthesizers/regular/wgan/model.py (see the class listing above)

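In terms of the adversarial ground truths used in fit above (y = +1 for real batches, y = -1 for generated ones), the critic objective implemented by wasserstein_loss is simply

$$L_{critic} = \mathbb{E}\left[\, y \cdot D(x) \,\right],$$

with the weight clipping to $[-c, c]$ in the training loop acting as a crude way of keeping the critic approximately 1-Lipschitz, as required by the Kantorovich-Rubinstein formulation of the Wasserstein distance.
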
class WGAN_GP(BaseGANModel):

    __MODEL__ = 'WGAN_GP'

    def __init__(self, model_parameters, n_generator: int = 1, n_critic: int = 1, gradient_penalty_weight: int = 10):
        # As recommended in WGAN paper - https://arxiv.org/abs/1701.07875
        # WGAN-GP - WGAN with Gradient Penalty
        self.n_critic = n_critic
        self.n_generator = n_generator
        self.gradient_penalty_weight = gradient_penalty_weight
        super().__init__(model_parameters)

    def define_gan(self, activation_info: Optional[NamedTuple] = None):
        """Define the trainable model components.

        Args:
            activation_info (Optional[NamedTuple], optional): Defaults to None.

        Returns:
            (generator_optimizer, critic_optimizer): Generator and critic optimizers.
        """
        self.generator = Generator(self.batch_size). \
            build_model(input_shape=(self.noise_dim,), dim=self.layers_dim, data_dim=self.data_dim,
                        activation_info=activation_info, tau=self.tau)

        self.critic = Critic(self.batch_size). \
            build_model(input_shape=(self.data_dim,), dim=self.layers_dim)

        g_optimizer = Adam(self.g_lr, beta_1=self.beta_1, beta_2=self.beta_2)
        c_optimizer = Adam(self.d_lr, beta_1=self.beta_1, beta_2=self.beta_2)
        return g_optimizer, c_optimizer

    def gradient_penalty(self, real, fake):
        """Compute gradient penalty.

        Args:
            real: real event.
            fake: fake event.
        Returns:
            gradient_penalty.
        """
        epsilon = tf.random.uniform([real.shape[0], 1], minval=0.0, maxval=1.0, dtype=tf.dtypes.float32)
        x_hat = epsilon * real + (1 - epsilon) * fake
        with tf.GradientTape() as t:
            t.watch(x_hat)
            d_hat = self.critic(x_hat)
        gradients = t.gradient(d_hat, x_hat)
        ddx = tf.sqrt(tf.reduce_sum(gradients ** 2))
        d_regularizer = tf.reduce_mean((ddx - 1.0) ** 2)
        return d_regularizer

    @tf.function
    def update_gradients(self, x, g_optimizer, c_optimizer):
        """Compute and apply the gradients for both the Generator and the Critic.

        Args:
            x: real data event
            g_optimizer: generator optimizer
            c_optimizer: critic optimizer
        Returns:
            (critic loss, generator loss)
        """
        for _ in range(self.n_critic):
            with tf.GradientTape() as d_tape:
                critic_loss = self.c_lossfn(x)
            # Get the gradients of the critic
            d_gradient = d_tape.gradient(critic_loss, self.critic.trainable_variables)
            # Update the weights of the critic using the optimizer
            c_optimizer.apply_gradients(
                zip(d_gradient, self.critic.trainable_variables)
            )

        # Add here the n_generator
        # Update the generator
        for _ in range(self.n_generator):
            with tf.GradientTape() as g_tape:
                gen_loss = self.g_lossfn(x)
            # Get the gradients of the generator
            gen_gradients = g_tape.gradient(gen_loss, self.generator.trainable_variables)
            # Update the weights of the generator
            g_optimizer.apply_gradients(
                zip(gen_gradients, self.generator.trainable_variables)
            )

        return critic_loss, gen_loss

    def c_lossfn(self, real):
        """Compute critic loss.

        Args:
            real: real data

        Returns:
            critic loss
        """
        # generating noise from a uniform distribution
        noise = tf.random.normal([real.shape[0], self.noise_dim], dtype=tf.dtypes.float32)
        # run noise through generator
        fake = self.generator(noise)
        # discriminate x and x_gen
        logits_real = self.critic(real)
        logits_fake = self.critic(fake)

        # gradient penalty
        gp = self.gradient_penalty(real, fake)
        # getting the loss of the critic.
        c_loss = (tf.reduce_mean(logits_fake)
                  - tf.reduce_mean(logits_real)
                  + gp * self.gradient_penalty_weight)
        return c_loss

    def g_lossfn(self, real):
        """Compute generator loss.

        Args:
            real: A real sample

        Returns:
            Loss of the generator
        """
        # generating noise from a uniform distribution
        noise = tf.random.normal([real.shape[0], self.noise_dim], dtype=tf.dtypes.float32)

        fake = self.generator(noise)
        logits_fake = self.critic(fake)
        g_loss = -tf.reduce_mean(logits_fake)
        return g_loss

    def get_data_batch(self, train, batch_size, seed=0):
        """Get real data batches from the passed data object.

        Args:
            train: real data.
            batch_size: batch size.
            seed (int, optional): Defaults to 0.

        Returns:
            data batch.
        """
        # np.random.seed(seed)
        # x = train.loc[np.random.choice(train.index, batch_size)].values
        # iterate through shuffled indices, so every sample gets covered evenly
        start_i = (batch_size * seed) % len(train)
        stop_i = start_i + batch_size
        shuffle_seed = (batch_size * seed) // len(train)
        np.random.seed(shuffle_seed)
        train_ix = np.random.choice(train.shape[0], replace=False, size=len(train))  # wasteful to shuffle every time
        train_ix = list(train_ix) + list(train_ix)  # duplicate to cover ranges past the end of the set
        return train[train_ix[start_i:stop_i]]

    def train_step(self, train_data, optimizers):
        """Perform a training step.

        Args:
            train_data: training data
            optimizers: generator and critic optimizers

        Returns:
            (critic_loss, generator_loss): Critic and generator loss.
        """
        cri_loss, ge_loss = self.update_gradients(train_data, *optimizers)
        return cri_loss, ge_loss

    def fit(self, data, train_arguments: TrainParameters, num_cols: List[str], cat_cols: List[str]):
        """Fit a synthesizer model to a given input dataset.

        Args:
            data: A pandas DataFrame or a Numpy array with the data to be synthesized.
            train_arguments: GAN training arguments.
            num_cols (List[str]): List of columns of the data object to be handled as numerical.
            cat_cols (List[str]): List of columns of the data object to be handled as categorical.
        """
        super().fit(data, num_cols, cat_cols)

        processed_data = self.processor.transform(data)
        self.data_dim = processed_data.shape[1]
        optimizers = self.define_gan(self.processor.col_transform_info)

        iterations = int(abs(data.shape[0] / self.batch_size) + 1)

        # Create a summary file
        train_summary_writer = tf.summary.create_file_writer(path.join('..\wgan_gp_test', 'summaries', 'train'))

        with train_summary_writer.as_default():
            for epoch in trange(train_arguments.epochs):
                for _ in range(iterations):
                    batch_data = self.get_data_batch(processed_data, self.batch_size).astype(np.float32)
                    cri_loss, ge_loss = self.train_step(batch_data, optimizers)

                print(
                    "Epoch: {} | disc_loss: {} | gen_loss: {}".format(
                        epoch, cri_loss, ge_loss
                    ))

                if epoch % train_arguments.sample_interval == 0:
                    # Test here data generation step
                    # save model checkpoints
                    if path.exists('./cache') is False:
                        os.mkdir('./cache')
                    model_checkpoint_base_name = './cache/' + train_arguments.cache_prefix + '_{}_model_weights_step_{}.h5'
                    self.generator.save_weights(model_checkpoint_base_name.format('generator', epoch))
                    self.critic.save_weights(model_checkpoint_base_name.format('critic', epoch))

c_lossfn(real)
Compute critic loss.
Parameters: real (required): real data
Returns: critic loss
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

define_gan(activation_info=None)
Define the trainable model components.
Parameters: activation_info (Optional[NamedTuple], default None)
Returns: (generator_optimizer, critic_optimizer): Generator and critic optimizers.
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

fit(data, train_arguments, num_cols, cat_cols)
Fit a synthesizer model to a given input dataset.
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

g_lossfn(real)
Compute generator loss.
Parameters: real (required): A real sample. (The fake and fak2 arguments mentioned in the docstring are not part of the signature.)
Returns: Loss of the generator
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

get_data_batch(train, batch_size, seed=0)
Get real data batches from the passed data object.
Parameters:
    train (required): real data.
    batch_size (required): batch size.
    seed (int, default 0)
Returns: data batch.
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

gradient_penalty(real, fake)
Compute gradient penalty.
Parameters:
    real (required): real event.
    fake (required): fake event.
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

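For reference, up to the batch-level norm used in this implementation, the quantity computed above is the standard WGAN-GP penalty

$$GP = \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \qquad \hat{x} = \epsilon\, x_{real} + (1 - \epsilon)\, x_{fake}, \quad \epsilon \sim U(0, 1),$$

which c_lossfn then scales by gradient_penalty_weight before adding it to the critic loss.
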

update_gradients(x, g_optimizer, c_optimizer)
Compute and apply the gradients for both the Generator and the Critic.
Source code in ydata_synthetic/synthesizers/regular/wgangp/model.py (see the class listing above)

class DoppelGANger(BaseGANModel):
    """
    DoppelGANger model.
    Based on the paper https://dl.acm.org/doi/pdf/10.1145/3419394.3423643.

    Args:
        model_parameters: Parameters used to create the DoppelGANger model.
    """
    __MODEL__ = 'DoppelGANger'

    def __init__(self, model_parameters: ModelParameters):
        super().__init__(model_parameters)
        self._model_parameters = model_parameters
        self._gan_model = None
        self._tf_session = None
        self._sequence_length = None
        tf.compat.v1.disable_eager_execution()

    def fit(self, data: DataFrame,
            train_arguments: TrainParameters,
            num_cols: list[str] | None = None,
            cat_cols: list[str] | None = None):
        """
        Fits the DoppelGANger model.

        Args:
            data: A pandas DataFrame with the data to be synthesized.
            train_arguments: DoppelGANger training arguments.
            num_cols: List of columns to be handled as numerical
            cat_cols: List of columns to be handled as categorical
        """
        super().fit(data=data, num_cols=num_cols, cat_cols=cat_cols, train_arguments=train_arguments)

        self._sequence_length = train_arguments.sequence_length
        self._sample_length = train_arguments.sample_length
        self._rounds = train_arguments.rounds

        if data.shape[0] % self._sequence_length != 0:
            raise ValueError("The number of samples must be a multiple of the sequence length.")

        if self._sequence_length % self._sample_length != 0:
            raise ValueError("The sequence length must be a multiple of the sample length.")

        data_features, data_attributes = self.processor.transform(data)
        measurement_cols_metadata = self.processor.measurement_cols_metadata
        attribute_cols_metadata = self.processor.attribute_cols_metadata

        generator = DoppelGANgerGenerator(
            feed_back=False,
            noise=True,
            use_tanh=self.use_tanh,
            measurement_cols_metadata=measurement_cols_metadata,
            attribute_cols_metadata=attribute_cols_metadata,
            sample_len=self._sample_length)
        discriminator = Discriminator()
        attr_discriminator = AttrDiscriminator()

        self._tf_session = tf.compat.v1.Session()
        with self._tf_session.as_default() as sess:
            self._gan_model = DoppelGANgerNetwork(
                sess=sess,
                epoch=train_arguments.epochs,
                batch_size=self.batch_size,
                data_feature=data_features,
                data_attribute=data_attributes,
                attribute_cols_metadata=attribute_cols_metadata,
                sample_len=self._sample_length,
                generator=generator,
                discriminator=discriminator,
                rounds=self._rounds,
                attr_discriminator=attr_discriminator,
                d_gp_coe=self.gp_lambda,
                attr_d_gp_coe=self.gp_lambda,
                g_attr_d_coe=self.gp_lambda,
                num_packing=self.pac,
                attribute_latent_dim=self.latent_dim,
                feature_latent_dim=self.latent_dim,
                fix_feature_network=False,
                g_lr=self.g_lr,
                g_beta1=self.beta_1,
                d_lr=self.d_lr,
                d_beta1=self.beta_1,
                attr_d_lr=self.d_lr,
                attr_d_beta1=self.beta_1)
            self._gan_model.build()
            self._gan_model.train()

    def sample(self, n_samples: int):
        """
        Samples new data from the DoppelGANger.

        Args:
            n_samples: Number of samples to be generated.
        """
        if n_samples <= 0:
            raise ValueError("Invalid number of samples.")

        real_attribute_input_noise = self._gan_model.gen_attribute_input_noise(n_samples)
        addi_attribute_input_noise = self._gan_model.gen_attribute_input_noise(n_samples)
        length = int(self._sequence_length / self._sample_length)
        feature_input_noise = self._gan_model.gen_feature_input_noise(n_samples, length=length)
        input_data = self._gan_model.gen_feature_input_data_free(n_samples)

        with self._tf_session.as_default() as sess:
            self._gan_model.sess = sess
            data_features, data_attributes, gen_flags, _ = self._gan_model.sample_from(
                real_attribute_input_noise, addi_attribute_input_noise,
                feature_input_noise, input_data)

        return self.processor.inverse_transform(data_features, data_attributes, gen_flags)

    def save(self, path):
        """
        Save the DoppelGANger model in a directory.

        Args:
            path: Path of the directory where the files will be saved.
        """
        saver = tf.compat.v1.train.Saver()
        with self._tf_session.as_default() as sess:
            saver.save(sess, os.path.join(path, "doppelganger"), write_meta_graph=False)
        self._gan_model.save(os.path.join(path, "doppelganger_network.pkl"))
        dump({
            "processor": self.processor.__dict__,
            "measurement_cols_metadata": self.processor.measurement_cols_metadata,
            "attribute_cols_metadata": self.processor.attribute_cols_metadata,
            "_sequence_length": self._sequence_length,
            "_sample_length": self._sample_length
        }, os.path.join(path, "doppelganger_metadata.pkl"))

    @staticmethod
    def load(path):
        """
        Load the DoppelGANger model from a directory.
        Only the required components to sample new data are loaded.

        Args:
            path: Path of the directory where the files were saved.
        """
        dp_model = DoppelGANger(ModelParameters())
        dp_network_parms = load(os.path.join(path, "doppelganger_network.pkl"))
        dp_metadata = load(os.path.join(path, "doppelganger_metadata.pkl"))

        dp_model.processor = DoppelGANgerProcessor()
        dp_model.processor.__dict__ = dp_metadata["processor"]
        dp_model._sequence_length = dp_metadata["_sequence_length"]
        dp_model._sample_length = dp_metadata["_sample_length"]

        generator = DoppelGANgerGenerator(
            feed_back=False,
            noise=True,
            measurement_cols_metadata=dp_metadata["measurement_cols_metadata"],
            attribute_cols_metadata=dp_metadata["attribute_cols_metadata"],
            sample_len=dp_network_parms["sample_len"])
        discriminator = Discriminator()
        attr_discriminator = AttrDiscriminator()

        with tf.compat.v1.Session().as_default() as sess:
            dp_model._gan_model = DoppelGANgerNetwork(
                sess=sess,
                epoch=dp_network_parms["epoch"],
                batch_size=dp_network_parms["batch_size"],
                data_feature=None,
                data_attribute=None,
                attribute_cols_metadata=dp_metadata["attribute_cols_metadata"],
                sample_len=dp_network_parms["sample_len"],
                generator=generator,
                discriminator=discriminator,
                rounds=dp_network_parms["rounds"],
                attr_discriminator=attr_discriminator,
                d_gp_coe=dp_network_parms["d_gp_coe"],
                attr_d_gp_coe=dp_network_parms["attr_d_gp_coe"],
                g_attr_d_coe=dp_network_parms["g_attr_d_coe"],
                num_packing=dp_network_parms["num_packing"],
                attribute_latent_dim=dp_network_parms["attribute_latent_dim"],
                feature_latent_dim=dp_network_parms["feature_latent_dim"],
                fix_feature_network=dp_network_parms["fix_feature_network"],
                g_lr=dp_network_parms["g_lr"],
                g_beta1=dp_network_parms["g_beta1"],
                d_lr=dp_network_parms["d_lr"],
                d_beta1=dp_network_parms["d_beta1"],
                attr_d_lr=dp_network_parms["attr_d_lr"],
                attr_d_beta1=dp_network_parms["attr_d_beta1"])

            dp_model._gan_model.sample_time = dp_network_parms["sample_time"]
            dp_model._gan_model.sample_feature_dim = dp_network_parms["sample_feature_dim"]
            dp_model._gan_model.sample_attribute_dim = dp_network_parms["sample_attribute_dim"]
            dp_model._gan_model.sample_real_attribute_dim = dp_network_parms["sample_real_attribute_dim"]
            dp_model._gan_model.build()

            saver = tf.compat.v1.train.Saver()
            saver.restore(sess, tf.compat.v1.train.latest_checkpoint(path))
            dp_model._tf_session = sess

        return dp_model

load(path) staticmethod
Load the DoppelGANger model from a directory.
Only the required components to sample new data are loaded.
Parameters: path (required): Path of the directory where the files were saved.
Source code in ydata_synthetic/synthesizers/timeseries/doppelganger/model.py (see the class listing above)

sample(n_samples)
Samples new data from the DoppelGANger.
Parameters: n_samples (required): Number of samples to be generated.
Source code in ydata_synthetic/synthesizers/timeseries/doppelganger/model.py (see the class listing above)

save(path)
Save the DoppelGANger model in a directory.
Parameters: path (required): Path of the directory where the files will be saved.
Source code in ydata_synthetic/synthesizers/timeseries/doppelganger/model.py (see the class listing above)
    def fit(self, data: DataFrame,
            train_arguments: TrainParameters,
            num_cols: list[str] | None = None,
            cat_cols: list[str] | None = None):
        """
        Fits the TimeGAN model.

        Args:
            data: A pandas DataFrame with the data to be synthesized.
            train_arguments: TimeGAN training arguments.
            num_cols: List of columns to be handled as numerical
            cat_cols: List of columns to be handled as categorical
        """
        super().fit(data=data, num_cols=num_cols, cat_cols=cat_cols, train_arguments=train_arguments)
        if cat_cols:
            raise NotImplementedError("TimeGAN does not support categorical features.")
        self.num_cols = num_cols
        self.seq_len = train_arguments.sequence_length
        self.n_seq = train_arguments.number_sequences
        processed_data = real_data_loading(data[self.num_cols].values, seq_len=self.seq_len)
        self.train(data=processed_data, train_steps=train_arguments.epochs)

sample(n_samples)
Samples new data from the TimeGAN.
Parameters: n_samples (int, required): Number of samples to be generated.
Source code in ydata_synthetic/synthesizers/timeseries/timegan/model.py
    def sample(self, n_samples: int):
        """
        Samples new data from the TimeGAN.

        Args:
            n_samples: Number of samples to be generated.
        """
        Z_ = next(self.get_batch_noise(size=n_samples))
        records = self.generator(Z_)
        data = []
        for i in range(records.shape[0]):
            data.append(DataFrame(records[i], columns=self.num_cols))
        return data

ydata-synthetic is a powerful library designed to generate synthetic data. As part of our ongoing efforts to improve user experience and functionality, ydata-synthetic includes a telemetry feature. This feature collects anonymous usage data, helping us understand how the library is used and identify areas for improvement.
The primary goal of collecting telemetry data is to:

Enhance the functionality and performance of the ydata-synthetic library
Prioritize new features based on user engagement
Identify common issues and bugs to improve overall user experience

Data Collected

The telemetry system collects non-personal, anonymous information such as:

Python version
ydata-synthetic version
Frequency of use of ydata-synthetic features
Errors or exceptions thrown within the library

Disabling usage analytics

We respect your choice not to participate in our telemetry collection. If you prefer to disable telemetry, you can do so by setting an environment variable on your system. Disabling telemetry will not affect the functionality of the ydata-synthetic library, except for the ability to contribute to its usage analytics.

Set an Environment Variable

In your notebook or script, make sure to set the YDATA_SYNTHETIC_NO_ANALYTICS environment variable to True.
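For example, in a notebook or script the variable can be set before the library is imported (a minimal illustration; exporting the variable at the operating-system level works just as well):

import os

# Opt out of ydata-synthetic usage analytics for this process
os.environ["YDATA_SYNTHETIC_NO_ANALYTICS"] = "True"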
You are always welcome to contribute to this incredible ecosystem for synthetic data generation. There are several areas in which we are always looking for an extra pair of hands to help us get things going:

Documentation: we all love it, but keeping it up-to-date can be challenging! Guess what? This is also the fastest way for you to get to know ydata-synthetic and start contributing, even by adding a new example that could help the community go from zero to hero with synthetic data!

Getting started: Issues that use this tag are usually the most friendly for someone who has just begun the journey into open-source contributions. So don't be shy; assign the task to yourself and introduce yourself in the GitHub issue! We will be there to guide you.

But we always look for contributions that go beyond documentation and fixes:

Synthetic data for NLP: If you are an expert, or just someone who would like to dive into this topic, we welcome you to participate in a small project around the generation of synthetic text.

Synthetic data for images: if you are a computer vision practitioner and would like to share, feel free to add some examples for images!

Any other research around synthetic data you would like to share with the community is more than welcome! If you want your research to be part of a rich ecosystem, open a PR!

Issues and where to find more info about contributing

If you need help figuring out where to start, or want to learn what other contributors are doing, go to the Data-Centric AI community Discord channel and introduce yourself in the #ydata-synthetic channel.

If you can't find an issue that interests you and want to add an improvement or change, create a new one.
How to get accurate data from my synthetic data generation processes?

For a use-case oriented UI experience, try YData Fabric. From an interactive and complete data profiling to an efficient synthetization, your data preparation process will be seamlessly adjusted to your data characteristics.

How can I run the Streamlit app?

To try ydata-synthetic using the Streamlit app, you need to install it using the [] notation that encodes the extras that the package incorporates. In this case, you can simply create your virtual environment and install ydata-synthetic as:

pip install "ydata-synthetic[streamlit]"

Note that Jupyter or Colab Notebooks are not yet supported, so you need to work in your Python environment. Once the package is installed, you can use the following snippet to start the app:
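The launch snippet itself is not reproduced on this page; based on the ydata-synthetic documentation it is expected to look along these lines (treat it as a sketch rather than a guaranteed API):

from ydata_synthetic import streamlit_app

# Starts the Streamlit UI and prints the local URL to the console
streamlit_app.run()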
And that's it! After running the command, the console will output the URL from which you can access the app!

Example

For a step-by-step installation guide, check this 5-min video that will help you get started!

What is the best way to evaluate the quality of my synthetic data?

The most appropriate metrics to evaluate the quality of your synthetic data depend on the goal for which the synthetic data will be used. Nevertheless, we may define three essential pillars for synthetic data quality: privacy, fidelity, and utility:
Most issues with installations are usually associated with unsupported Python versions or misalignment between Python environments and package requirements.
Let's see how you can get both right:
Python Versions

Note that ydata-sdk currently requires Python >=3.9, <3.13, so if you're trying to run our code in Google Colab you need to update your Google Colab's Python version accordingly. The same goes for your development environment.
Virtual Environments
A lot of troubleshooting arises due to misalignments between environments and package requirements.
Virtual Environments isolate your installations from the "global" environment so that you don't have to worry about conflicts.
Using conda, creating a new environment is as easy as running this on your shell:
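A minimal example, assuming the synth-env name and the Python version referenced elsewhere in these docs:

conda create -n synth-env python=3.12
conda activate synth-env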
Now you can open up your Python editor or Jupyter Lab and use the synth-env as your development environment, without having to worry about conflicting versions or packages between projects!
Synthetic data is data that has been created artificially through computer simulation or that algorithms can generate to
take the place of real-world data. The data can be used as an alternative or supplement to real-world data when real-world
data is not readily available. It can also be used as a Machine Learning performance booster.
The ydata-sdk package is a Python package developed by YData’s team that allows users to easily benefit from Generative AI and generate synthetic data. The main goal of the package is to serve as a way for data scientists to get familiar with synthetic data and its applications in real-world domains, as well as the potential of Generative AI.
The package also aims to facilitate the exploration and understanding of synthetic data generation methods and their limitations.
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
CGAN is a deep learning model that combines GANs with conditional models to generate data samples based on specific conditions:
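To make the conditioning idea concrete, here is a minimal, illustrative sketch (not the package's internal implementation): the condition, for example a class label, is one-hot encoded and appended to the noise vector that feeds the generator.

import numpy as np

def cgan_generator_input(noise, labels, n_classes):
    # Illustrative only: CGAN conditions generation by concatenating a
    # one-hot encoded label to the generator's noise input.
    one_hot = np.eye(n_classes)[labels]
    return np.concatenate([noise, one_hot], axis=1)

# Example: 4 noise vectors of size 8, conditioned on 3 possible classes
z = np.random.normal(size=(4, 8))
conditioned_z = cgan_generator_input(z, labels=np.array([0, 2, 1, 0]), n_classes=3)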
Using CRAMER GAN to generate tabular synthetic data:
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
CRAMER GAN is a variant of GAN that employs the Cramer distance as a measure of similarity between real and generated data distributions to improve training stability and enhance sample quality:
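As a rough illustration of the idea (Cramer GAN actually applies it to learned critic representations rather than raw rows), the energy (Cramer-type) distance between two batches of samples can be estimated as follows:

import numpy as np

def energy_distance(x, y):
    # Sample estimate of 2*E||X-Y|| - E||X-X'|| - E||Y-Y'|| over two batches of rows
    pairwise_mean = lambda a, b: np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2.0 * pairwise_mean(x, y) - pairwise_mean(x, x) - pairwise_mean(y, y)

real = np.random.normal(size=(128, 5))
fake = np.random.normal(loc=0.5, size=(128, 5))
print(energy_distance(real, fake))  # larger values indicate more dissimilar batches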
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
Additionally, real-world data usually comprises both numeric and categorical features. Numeric features are those that encode quantitative values, whereas categorical features represent qualitative measurements.
Using CWGAN-GP to generate tabular synthetic data:
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
CWGAN GP is a variant of GAN that incorporates conditional information to generate data samples, while leveraging the Wasserstein distance to improve training stability and sample quality:
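A minimal, illustrative sketch of how the two ingredients combine (assuming a tabular batch and its condition vectors; this is not the package's internal implementation): the critic always scores the sample concatenated with its condition, and the gradient-penalty interpolation mixes real and synthetic rows while the condition stays fixed.

import numpy as np

def conditional_interpolation(real, fake, condition, alpha):
    # Mix real and synthetic rows for the gradient penalty, keeping the
    # condition fixed; the critic then scores the concatenation of both parts.
    mixed = alpha * real + (1.0 - alpha) * fake
    return np.concatenate([mixed, condition], axis=1)

batch, features, cond_dim = 32, 10, 3
alpha = np.random.uniform(size=(batch, 1))
critic_input = conditional_interpolation(
    np.random.normal(size=(batch, features)),
    np.random.normal(size=(batch, features)),
    np.random.normal(size=(batch, cond_dim)),
    alpha,
)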
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
DRAGAN is a GAN variant that uses a gradient penalty to improve training stability and mitigate mode collapse:
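The distinguishing detail is where the penalty is applied: DRAGAN penalises the critic's gradient norm around noisy perturbations of the real data. A minimal TensorFlow sketch, illustrative only, assuming critic is a callable model and real is a float32 batch of rows:

import tensorflow as tf

def dragan_gradient_penalty(critic, real, gp_lambda=10.0):
    # Perturb real samples locally and push the critic's gradient norm
    # at those points towards 1 (illustrative sketch of the DRAGAN penalty).
    std = tf.math.reduce_std(real, axis=0, keepdims=True)
    perturbed = real + 0.5 * std * tf.random.uniform(tf.shape(real), 0.0, 1.0)
    with tf.GradientTape() as tape:
        tape.watch(perturbed)
        scores = critic(perturbed)
    grads = tape.gradient(scores, perturbed)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return gp_lambda * tf.reduce_mean(tf.square(grad_norm - 1.0))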
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
WGAN is a variant of GAN that utilizes the Wasserstein distance to improve training stability and generate higher quality samples:
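In practice, the critic is trained to maximise the gap between its average scores on real and generated samples, which estimates the Wasserstein-1 distance. A minimal, illustrative sketch of that loss:

import numpy as np

def wgan_critic_loss(real_scores, fake_scores):
    # The critic maximises E[f(real)] - E[f(fake)]; minimising the negation
    # is equivalent. The original WGAN additionally clips the critic's weights
    # to keep it approximately 1-Lipschitz.
    return -(np.mean(real_scores) - np.mean(fake_scores))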
Real-world domains are often described by tabular data i.e., data that can be structured and organized in a table-like format, where features/variables are represented in columns, whereas observations correspond to the rows.
WGANGP is a variant of GAN that incorporates a gradient penalty term to enhance training stability and improve the diversity of generated samples:
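The gradient penalty replaces WGAN's weight clipping: the critic's gradient norm is pushed towards 1 on random interpolations between real and generated rows. A minimal TensorFlow sketch, illustrative only, assuming critic is a callable model and real/fake are float32 batches of the same shape:

import tensorflow as tf

def wgan_gp_gradient_penalty(critic, real, fake, gp_lambda=10.0):
    # Interpolate real and generated rows and penalise deviations of the
    # critic's gradient norm from 1 at those points.
    alpha = tf.random.uniform([tf.shape(real)[0], 1], 0.0, 1.0)
    interpolated = alpha * real + (1.0 - alpha) * fake
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated)
    grads = tape.gradient(scores, interpolated)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return gp_lambda * tf.reduce_mean(tf.square(grad_norm - 1.0))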
The UI guided experience for Synthetic Data generation
The guided UI experience for synthetic data generation is now provided by YData Fabric, which supports the full flow, from data upload and profiling to synthesizer configuration, training, and the evaluation of the generated synthetic samples. See the YData Fabric UI section below for a step-by-step description of the workflow, and the Getting started with YData Fabric (Community version) section to try it for free.
Using DoppelGANger to generate synthetic time-series data:
Although tabular data may be the most frequently discussed type of data, a great number of real-world domains — from traffic and daily trajectories to stock prices and energy consumption patterns — produce time-series data which introduces several aspects of complexity to synthetic data generation.
Time-series data is structured sequentially, with observations ordered chronologically based on their associated timestamps or time intervals. It explicitly incorporates the temporal aspect, allowing for the analysis of trends, seasonality, and other dependencies over time.
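As a rough sketch of how this looks with the ydata-synthetic package (assuming the TimeSeriesSynthesizer wrapper of recent releases, a time-ordered pandas DataFrame df, and purely illustrative hyper-parameters; sequence-related settings are also configured through TrainParameters, so check the API reference for the exact argument names):

from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# df, num_cols and cat_cols are assumed: a time-ordered pandas DataFrame and
# the lists of its numerical/categorical column names.
model_args = ModelParameters(batch_size=100, lr=1e-3)  # illustrative values
train_args = TrainParameters(epochs=400)                # illustrative value

synth = TimeSeriesSynthesizer(modelname='doppelganger', model_parameters=model_args)
synth.fit(df, train_args, num_cols=num_cols, cat_cols=cat_cols)
synth_data = synth.sample(n_samples=100)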
YData Fabric offers advanced capabilities for time-series synthetic data generation, surpassing TimeGAN in terms of flexibility, scalability, and ease of use. With YData Fabric, users can generate high-quality synthetic time-series data while benefiting from built-in data profiling tools that ensure the integrity and consistency of the data. Unlike TimeGAN, which is a single model for time-series, YData Fabric offers a solution that is suitable for different types of datasets and behaviours. Additionally, YData Fabric is designed for scalability, enabling seamless handling of large, complex time-series datasets. Its guided UI makes it easy to adapt to different time-series scenarios, from healthcare to financial data, making it a more comprehensive and flexible solution for time-series data generation.
Using TimeGAN to generate synthetic time-series data
Although tabular data may be the most frequently discussed type of data, a great number of real-world domains — from traffic and daily trajectories to stock prices and energy consumption patterns — produce time-series data which introduces several aspects of complexity to synthetic data generation.
Time-series data is structured sequentially, with observations ordered chronologically based on their associated timestamps or time intervals. It explicitly incorporates the temporal aspect, allowing for the analysis of trends, seasonality, and other dependencies over time.
TimeGAN is a model that uses a Generative Adversarial Network (GAN) framework to generate synthetic time series data by learning the underlying temporal dependencies and characteristics of the original data:
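As a rough sketch of how this looks with the ydata-synthetic package (assuming the TimeSeriesSynthesizer wrapper of recent releases and a pandas DataFrame stock_data containing only the numerical columns listed in cols; the hyper-parameters shown are illustrative, so check the API reference for the exact argument names):

from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# stock_data and cols are assumed: a pandas DataFrame of numerical series
# and the list of its column names.
model_args = ModelParameters(batch_size=128, lr=5e-4, noise_dim=32, latent_dim=24)
train_args = TrainParameters(epochs=5000, sequence_length=24, number_sequences=6)

synth = TimeSeriesSynthesizer(modelname='timegan', model_parameters=model_args)
synth.fit(stock_data, train_args, num_cols=cols)
synth_data = synth.sample(n_samples=10)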
UI interface - YData Fabric
The YData Fabric UI organizes the synthetic data generation process into a structured, step-by-step workflow. Each stage of the process is clearly defined and supported by guidance within the interface, helping users navigate tasks like data profiling, metadata and synthesizer configuration, and synthetic data quality evaluation.
Data Upload and Profiling: Users start by uploading their datasets directly into the platform. YData Fabric’s profiling tool automatically scans the data, generating insights into key attributes such as data distributions, correlations, and missing values. These insights are presented in an intuitive, visual format, ensuring users can quickly assess the quality and structure of their data.
Alerts for Data Issues: The UI will alert users to potential issues such as data imbalances, outliers, or incomplete fields that may affect the quality of the synthetic data.
Synthetic Data Generation Model Configuration: Once the data is profiled, the UI supports metadata configuration (categorical, numerical, dates, etc.) and the integration of anonymization.
Model Performance Insights: During the model training phase, YData Fabric monitors key performance indicators (KPIs) such as fidelity, utility, and privacy. These KPIs, such as data fidelity and privacy scores, are displayed on the dashboard, allowing users to evaluate how closely the synthetic data aligns with the original dataset.
Customization and Advanced Controls: For more experienced users, YData Fabric provides customization options within the guided UI. Users have access to advanced settings, such as conditional synthetic data generation or business rules.
Preserving Data Integrity: For datasets requiring strict adherence to structural patterns (e.g., time-series data, healthcare records or databases), YData Fabric preserves the structure and relationships present in the original data.
Getting started with YData Fabric (Community version)
YData Fabric’s Community Version offers users a free, accessible entry point to explore synthetic data generation. To get started, users can sign up for the Community Version and access the guided UI directly. Once registered, users are provided with a range of features, including data profiling, synthetic data generation, pipelines, and access to YData’s proprietary models for data quality!