Simple library to make pipelines or ETL

```shell
$ pip install pypelines-etl
```

pypelines allows you to build ETL pipelines. For that, you simply need
the combination of an Extractor, some Transformers or Filters, and a Loader.
Making an extractor is fairly easy. Simply decorate a function that returns
the data with `Extractor`:

```python
import pandas

from pypelines import Extractor


@Extractor
def read_iris_dataset(filepath: str) -> pandas.DataFrame:
    return pandas.read_csv(filepath)
```

The Transformer and Filter decorators are equivalent. Making a
Transformer or a Filter is even easier:
```python
import pandas

from pypelines import Filter, Transformer


@Filter
def keep_setosa(df: pandas.DataFrame) -> pandas.DataFrame:
    return df[df['class'] == 'Iris-setosa']


@Filter
def keep_petal_length(df: pandas.DataFrame) -> pandas.Series:
    return df['petallength']


@Transformer
def mean(series: pandas.Series) -> float:
    return series.mean()
```

Note that it is possible to combine Transformers and Filters
to shorten the pipeline syntax. For example:
```python
new_transformer = keep_setosa | keep_petal_length | mean
pipeline = read_iris_dataset('filepath.csv') | new_transformer
print(pipeline.value)
# 1.464
```

To build a Loader, it suffices to decorate a function that takes at
least one `data` parameter:
```python
import json

from pypelines import Loader


@Loader
def write_to_json(output_filepath: str, data: float) -> None:
    with open(output_filepath, 'w') as file:
        json.dump({'mean-petal-length': {'value': data, 'units': 'cm'}}, file)
```

A Loader can be called without the `data` parameter, in which case it simply
stores its other arguments (such as a URL or a path). For example, calling
`write_to_json('output.json')` will not execute the function, but will store
the `output_filepath` argument until the Loader executes in a pipeline.
The standard execution of the function (with the `data` argument) remains
available: `write_to_json('output.json', data=1.464)`.
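To illustrate this deferred-execution behavior, here is a minimal, hypothetical sketch of how such a decorator could work (this is not pypelines' actual implementation; the `deferred_loader` and `describe` names are made up for the example): when the `data` keyword is absent, the call stores the other arguments instead of running the function.

```python
import functools


def deferred_loader(func):
    """Toy stand-in for a Loader-style decorator (illustration only)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if 'data' in kwargs:
            # `data` supplied: execute immediately, as in the standard call.
            return func(*args, **kwargs)
        # No `data` yet: store the arguments for a later execution.
        return functools.partial(func, *args, **kwargs)
    return wrapper


@deferred_loader
def describe(label: str, data: float) -> str:
    return f'{label}={data}'


pending = describe('mean')           # stores 'mean', does not run yet
print(pending(data=1.464))           # later call supplies the data: mean=1.464
print(describe('mean', data=1.464))  # immediate execution: mean=1.464
```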
To make and run the pipeline, simply combine the Extractor with the Transformers, the Filters, and the Loader:

```python
read_iris_dataset('filepath.csv') | keep_setosa | keep_petal_length | mean | write_to_json('output.json')
```
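The `|` chaining used throughout can be emulated with Python's `__or__` operator. The sketch below is a toy model of the idea, assuming nothing about pypelines' real internals; the `Step` class and its `run` method are invented for the example:

```python
class Step:
    """Toy pipeline stage: composes with `|`, left to right."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: 'Step') -> 'Step':
        # `a | b` builds a new step that applies a first, then b.
        return Step(lambda value: other.func(self.func(value)))

    def run(self, value):
        return self.func(value)


double = Step(lambda x: x * 2)
increment = Step(lambda x: x + 1)

pipeline = double | increment  # reads left to right, like pypelines
print(pipeline.run(5))         # (5 * 2) + 1 = 11
```

Overloading `__or__` is what lets a pipeline read as a left-to-right data flow rather than nested function calls.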