Remove Dask and Spark DataFrame Support (#2705)
* mass deletion

* cleanup tests

* fix

* try to fix unit tests

* fix ww main test ci yaml

* more ci work

* fix fixture

* update release notes

* update miniconda hash

* more cleanup

* docs cleanup

* update release notes

* revert 3.12 change

* remove sql and update checker

* fix test

* try install test fix

* remove dask references

* remove sql from complete install due to psycopg2 issue

* more install fixes

* lint

* fix complete install

* remove dask_tokenize

* remove outdated link

* revert dask_tokenize change

* remove isinstance checks

* remove agg_type

* remove use of cache for install test
thehomebrewnerd authored May 1, 2024
1 parent 21d0bf0 commit 12ad75a
Showing 174 changed files with 1,297 additions and 6,487 deletions.
1 change: 0 additions & 1 deletion .github/workflows/build_docs.yaml
@@ -50,7 +50,6 @@ jobs:
sudo apt update
sudo apt install -y pandoc
sudo apt install -y graphviz
sudo apt install -y openjdk-11-jre-headless
python -m pip check
- name: Build docs
run: make -C docs/ -e "SPHINXOPTS=-W -j auto" clean html
23 changes: 5 additions & 18 deletions .github/workflows/install_test.yaml
@@ -14,10 +14,7 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
python_version: ["3.9", "3.10"]
exclude:
- python_version: "3.10"
os: macos-latest
python_version: ["3.9", "3.10", "3.11"]
runs-on: ${{ matrix.os }}
steps:
- name: Checkout repository
@@ -31,29 +28,19 @@ jobs:
python-version: ${{ matrix.python_version }}
cache: 'pip'
cache-dependency-path: 'pyproject.toml'
- uses: actions/cache@v3
id: cache
with:
path: ${{ env.pythonLocation }}
key: ${{ matrix.os- }}-${{ matrix.python_version }}-install-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01
- name: Build featuretools package
run: |
make package
- name: Install complete version of featuretools from sdist (not using cache)
if: steps.cache.outputs.cache-hit != 'true'
- name: Install complete version of featuretools from sdist
run: |
python -m pip install "unpacked_sdist/[complete]"
- name: Install complete version of featuretools from sdist (using cache)
if: steps.cache.outputs.cache-hit == 'true'
run: |
python -m pip install "unpacked_sdist/[complete]" --no-deps
- name: Test by importing packages
run: |
python -c "import alteryx_open_src_update_checker"
python -c "from featuretools_sql import DBConnector"
python -c "import premium_primitives"
python -c "from nlp_primitives import PolarityScore"
- name: Check package conflicts
run: |
python -m pip check
- name: Verify extra_requires commands
run: |
python -m pip install "unpacked_sdist/[nlp,spark,updater,sql]"
python -m pip install "unpacked_sdist/[nlp]"
2 changes: 1 addition & 1 deletion .github/workflows/latest_dependency_checker.yaml
@@ -23,7 +23,7 @@ jobs:
- name: Update dependencies
run: |
python -m pip install --upgrade pip
python -m pip install -e ".[dask,spark,test]"
python -m pip install -e ".[dask,test]"
make checkdeps OUTPUT_PATH=featuretools/tests/requirement_files/latest_requirements.txt
- name: Create pull request
uses: peter-evans/create-pull-request@v3
@@ -20,7 +20,7 @@ jobs:
strategy:
fail-fast: true
matrix:
test_type: ["pandas", "dask", "spark"]
test_type: ["pandas"]
steps:
- name: Generate default ISO timestamp
run: |
8 changes: 0 additions & 8 deletions .github/workflows/minimum_dependency_checker.yaml
@@ -38,14 +38,6 @@ jobs:
options: 'dependencies'
extras_require: 'dask'
output_filepath: featuretools/tests/requirement_files/minimum_dask_requirements.txt
- name: Run min dep generator - spark
id: min_dep_gen_spark
uses: alteryx/minimum-dependency-generator@v3
with:
paths: 'pyproject.toml'
options: 'dependencies'
extras_require: 'spark'
output_filepath: featuretools/tests/requirement_files/minimum_spark_requirements.txt
- name: Create Pull Request
uses: peter-evans/create-pull-request@v3
with:
48 changes: 6 additions & 42 deletions .github/workflows/tests_with_latest_deps.yaml
@@ -8,13 +8,12 @@ on:
workflow_dispatch:
jobs:
tests:
name: ${{ matrix.python_version }} tests ${{ matrix.libraries }}
name: ${{ matrix.python_version }} unit tests
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python_version: ["3.9", "3.10", "3.11"]
libraries: ["core", "spark/dask - misc", "spark/dask - computational", "spark/dask - entityset_1", "spark/dask - entityset_2", "spark/dask - primitives"]

steps:
- uses: actions/setup-python@v4
@@ -32,57 +31,22 @@ jobs:
pip config --site set global.progress_bar off
python -m pip install --upgrade pip
sudo apt update && sudo apt install -y graphviz
- if: ${{ !startsWith(matrix.libraries, 'spark/dask') }}
name: Install featuretools with test requirements
- name: Install featuretools with test requirements
run: |
python -m pip install -e unpacked_sdist/
python -m pip install -e unpacked_sdist/[test]
- if: ${{ startsWith(matrix.libraries, 'spark/dask') }}
name: Install spark pkg, featuretools with test requirements and spark/dask requirements
run: |
sudo apt install -y openjdk-11-jre-headless
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
python -m pip install -e unpacked_sdist/[dask]
python -m pip install -e unpacked_sdist/[spark]
python -m pip install -e unpacked_sdist/[test]
- if: ${{ matrix.python_version == 3.9 && startsWith(matrix.libraries, 'spark/dask') }}
- if: ${{ matrix.python_version == 3.9 }}
name: Generate coverage args
run: echo "coverage_args=--cov=featuretools --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml" >> $GITHUB_ENV
- if: ${{ env.coverage_args }}
name: Erase coverage files
run: |
cd unpacked_sdist
coverage erase
- if: ${{ !startsWith(matrix.libraries, 'spark/dask') }}
name: Run unit tests (no code coverage)
run: |
cd unpacked_sdist
pytest featuretools/ -n auto
- if: ${{ matrix.libraries == 'spark/dask - misc' }}
name: Run unit tests (misc)
run: |
cd unpacked_sdist
pytest featuretools/ -n auto --ignore=featuretools/tests/computational_backend --ignore=featuretools/tests/entityset_tests --ignore=featuretools/tests/primitive_tests ${{ env.coverage_args }}
- if: ${{ matrix.libraries == 'spark/dask - computational' }}
name: Run unit tests (computational backend)
run: |
cd unpacked_sdist
pytest featuretools/tests/computational_backend/ -n auto ${{ env.coverage_args }}
- if: ${{ matrix.libraries == 'spark/dask - entityset_1' }}
name: Run unit tests (entityset batch 1)
run: |
cd unpacked_sdist
pytest featuretools/tests/entityset_tests -n auto --ignore=featuretools/tests/entityset_tests/test_es.py --ignore=featuretools/tests/entityset_tests/test_ww_es.py ${{ env.coverage_args }}
- if: ${{ matrix.libraries == 'spark/dask - entityset_2' }}
name: Run unit tests (entityset batch 2)
run: |
cd unpacked_sdist
pytest featuretools/tests/entityset_tests/test_es.py featuretools/tests/entityset_tests/test_ww_es.py ${{ env.coverage_args }}
- if: ${{ matrix.libraries == 'spark/dask - primitives' }}
name: Run unit tests (primitives)
- name: Run unit tests
run: |
cd unpacked_sdist
pytest featuretools/tests/primitive_tests -n auto ${{ env.coverage_args }}
pytest featuretools/ -n auto ${{ env.coverage_args }}
- if: ${{ env.coverage_args }}
name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
@@ -109,7 +73,7 @@ jobs:
$ProgressPreference = "silentlyContinue"
Invoke-WebRequest -Uri $Uri -Outfile "$env:USERPROFILE/$File"
$hashFromFile = Get-FileHash "$env:USERPROFILE/$File" -Algorithm SHA256
$hashFromUrl = "ff53a36b7024f8398cbfd043020f1f662cd4c5c2095c0007ddb4348aa5459375"
$hashFromUrl = "21b56b75861573ec8ab146d555b20e1ed4462a06aa286d7e92a1cd31acc64dba"
if ($hashFromFile.Hash -ne "$hashFromUrl") {
Throw "$File hashes do not match"
}
57 changes: 6 additions & 51 deletions .github/workflows/tests_with_minimum_deps.yaml
@@ -7,13 +7,13 @@ on:
- main
workflow_dispatch:
jobs:
py38_tests_minimum_dependencies:
py39_tests_minimum_dependencies:
name: Tests - 3.9 Minimum Dependencies
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
libraries: ["core", "dask", "spark - misc", "spark - computational", "spark - entityset_1", "spark - entityset_2", "spark - primitives"]
python_version: ["3.9"]
steps:
- name: Checkout repository
uses: actions/checkout@v3
@@ -33,59 +33,14 @@ jobs:
- name: Install featuretools with no dependencies
run: |
python -m pip install -e . --no-dependencies
- if: ${{ startsWith(matrix.libraries, 'spark') }}
name: Install numpy for spark
run: |
NUMPY_VERSION=$(cat featuretools/tests/requirement_files/minimum_spark_requirements.txt | grep numpy)
python -m pip uninstall numpy -y
python -m pip install $NUMPY_VERSION --no-build-isolation
- if: ${{ matrix.libraries == 'core' }}
name: Install numpy for core
run: |
NUMPY_VERSION=$(cat featuretools/tests/requirement_files/minimum_core_requirements.txt | grep numpy)
python -m pip uninstall numpy -y
python -m pip install $NUMPY_VERSION --no-build-isolation
- if: ${{ matrix.libraries == 'dask' }}
name: Install numpy for dask
run: |
NUMPY_VERSION=$(cat featuretools/tests/requirement_files/minimum_dask_requirements.txt | grep numpy)
python -m pip uninstall numpy -y
python -m pip install $NUMPY_VERSION --no-build-isolation
- name: Install featuretools - minimum tests dependencies
run: |
python -m pip install -r featuretools/tests/requirement_files/minimum_test_requirements.txt
- if: ${{ startsWith(matrix.libraries, 'spark') }}
name: Install featuretools - minimum spark, core dependencies
run: |
sudo apt install -y openjdk-11-jre-headless
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
python -m pip install -r featuretools/tests/requirement_files/minimum_spark_requirements.txt
- if: ${{ matrix.libraries == 'core' }}
name: Install featuretools - minimum core dependencies
- name: Install featuretools - minimum core dependencies
run: |
python -m pip install -r featuretools/tests/requirement_files/minimum_core_requirements.txt
- if: ${{ matrix.libraries == 'dask' }}
name: Install featuretools - minimum dask dependencies
- name: Install featuretools - minimum Dask dependencies
run: |
python -m pip install -r featuretools/tests/requirement_files/minimum_dask_requirements.txt
- if: ${{ matrix.libraries == 'core' }}
name: Run unit tests without code coverage
run: python -m pytest -x -n auto featuretools/tests/
- if: ${{ matrix.libraries == 'dask' }}
name: Run dask unit tests without code coverage
run: python -m pytest -x -n auto featuretools/tests/
- if: ${{ matrix.libraries == 'spark - misc' }}
name: Run unit tests (misc)
run: pytest featuretools/ -n auto --ignore=featuretools/tests/computational_backend --ignore=featuretools/tests/entityset_tests --ignore=featuretools/tests/primitive_tests
- if: ${{ matrix.libraries == 'spark - computational' }}
name: Run unit tests (computational backend)
run: pytest featuretools/tests/computational_backend/ -n auto
- if: ${{ matrix.libraries == 'spark - entityset_1' }}
name: Run unit tests (entityset batch 1)
run: pytest featuretools/tests/entityset_tests -n auto --ignore=featuretools/tests/entityset_tests/test_es.py --ignore=featuretools/tests/entityset_tests/test_ww_es.py
- if: ${{ matrix.libraries == 'spark - entityset_2' }}
name: Run unit tests (entityset batch 2)
run: pytest featuretools/tests/entityset_tests/test_es.py featuretools/tests/entityset_tests/test_ww_es.py
- if: ${{ matrix.libraries == 'spark - primitives' }}
name: Run unit tests (primitives)
run: pytest featuretools/tests/primitive_tests -n auto
- name: Run unit tests without code coverage
run: python -m pytest -x -n auto featuretools/tests/
28 changes: 2 additions & 26 deletions .github/workflows/tests_with_woodwork_main_branch.yaml
@@ -10,7 +10,6 @@ jobs:
fail-fast: true
matrix:
python_version: ["3.9", "3.10", "3.11"]
libraries: ["core", "spark - misc", "spark - computational", "spark - entityset_1", "spark - entityset_2", "spark - primitives"]

steps:
- uses: actions/setup-python@v4
@@ -25,40 +24,17 @@ jobs:
pip config --site set global.progress_bar off
python -m pip install -U pip
sudo apt update && sudo apt install -y graphviz
- if: ${{ startsWith(matrix.libraries, 'spark')}}
name: Install Woodwork & Featuretools with spark pkg - spark requirements
run: |
sudo apt install -y openjdk-11-jre-headless
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
python -m pip install -e unpacked_sdist/[spark]
- name: Install Woodwork & Featuretools - test requirements
run: |
python -m pip install -e unpacked_sdist/[test]
python -m pip uninstall -y woodwork
python -m pip install https://github.com/alteryx/woodwork/archive/main.zip
- name: Log test run info
run: |
echo "Run unit tests without code coverage for ${{ matrix.python_version }} and ${{ matrix.libraries }}"
echo "Run unit tests without code coverage for ${{ matrix.python_version }}"
echo "Testing with woodwork version:" `python -c "import woodwork; print(woodwork.__version__)"`
- if: ${{ matrix.libraries == 'core' }}
name: Run unit tests without code coverage
- name: Run unit tests without code coverage
run: pytest featuretools/ -n auto
- if: ${{ matrix.libraries == 'spark - misc' }}
name: Run unit tests (misc)
run: pytest featuretools/ -n auto --ignore=featuretools/tests/computational_backend --ignore=featuretools/tests/entityset_tests --ignore=featuretools/tests/primitive_tests
- if: ${{ matrix.libraries == 'spark - computational' }}
name: Run unit tests (computational backend)
run: pytest featuretools/tests/computational_backend/ -n auto
- if: ${{ matrix.libraries == 'spark - entityset_1' }}
name: Run unit tests (entityset batch 1)
run: pytest featuretools/tests/entityset_tests -n auto --ignore=featuretools/tests/entityset_tests/test_es.py --ignore=featuretools/tests/entityset_tests/test_ww_es.py
- if: ${{ matrix.libraries == 'spark - entityset_2' }}
name: Run unit tests (entityset batch 2)
run: pytest featuretools/tests/entityset_tests/test_es.py featuretools/tests/entityset_tests/test_ww_es.py
- if: ${{ matrix.libraries == 'spark - primitives' }}
name: Run unit tests (primitives)
run: pytest featuretools/tests/primitive_tests -n auto

slack_alert_failure:
name: Send Slack alert if failure
2 changes: 1 addition & 1 deletion Makefile
@@ -41,7 +41,7 @@ installdeps-test: upgradepip

.PHONY: checkdeps
checkdeps:
$(eval allow_list='holidays|scipy|numpy|pandas|tqdm|cloudpickle|distributed|dask|psutil|pyspark|woodwork')
$(eval allow_list='holidays|scipy|numpy|pandas|tqdm|cloudpickle|distributed|dask|psutil|woodwork')
pip freeze | grep -v "alteryx/featuretools.git" | grep -E $(allow_list) > $(OUTPUT_PATH)

.PHONY: upgradepip
21 changes: 5 additions & 16 deletions README.md
@@ -47,41 +47,30 @@ conda install -c conda-forge featuretools

### Add-ons

You can install add-ons individually or all at once by running
You can install add-ons individually or all at once by running:

```
python -m pip install "featuretools[complete]"
```

**Update checker** - Receive automatic notifications of new Featuretools releases

```
python -m pip install "featuretools[updater]"
```

**Premium Primitives** - Use Premium Primitives, including Natural Language Processing primitives:
**Premium Primitives** - Use Premium Primitives from the premium-primitives repo

```
python -m pip install "featuretools[premium]"
```

**TSFresh Primitives** - Use 60+ primitives from [tsfresh](https://tsfresh.readthedocs.io/en/latest/) within Featuretools
**NLP Primitives** - Use Natural Language Primitives from the nlp-primitives repo

```
python -m pip install "featuretools[tsfresh]"
python -m pip install "featuretools[nlp]"
```
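
As a quick sanity check after installing the add-on (mirroring the import test in `install_test.yaml` above), the primitives can be imported and instantiated directly; passing them to DFS via `trans_primitives` is the typical next step, noted here only as a comment.

```
# Mirrors the CI smoke test: the add-on exposes NLP primitives such as PolarityScore
from nlp_primitives import PolarityScore

polarity = PolarityScore()
# Typically supplied to DFS, e.g. ft.dfs(..., trans_primitives=[PolarityScore])
```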

**Dask Support** - Use Dask Dataframes to create EntitySets or run DFS with njobs > 1
**Dask Support** - Use Dask to run DFS with njobs > 1

```
python -m pip install "featuretools[dask]"
```
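
For orientation, here is a minimal sketch of what running DFS with `n_jobs > 1` looks like once the Dask add-on is installed. It builds an EntitySet from the bundled mock-customer demo data; the parameter names (`target_dataframe_name`, `n_jobs`) reflect the 1.x API and may differ in other releases.

```
import featuretools as ft

# Build a small EntitySet from the demo data that ships with Featuretools
es = ft.demo.load_mock_customer(return_entityset=True)

# With n_jobs > 1, the feature matrix calculation is parallelized
# across a local Dask cluster behind the scenes
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    n_jobs=2,
)
print(feature_matrix.head())
```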

**SQL** - Automatic EntitySet generation from relational data stored in a SQL database:

```
python -m pip install "featuretools[sql]"
```
## Example
Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.
