Retire skater, update spaCy
Retired the skater module. Updated to scikit-learn 0.23.1 which has the new capability to calculate feature importances. Updated to spaCy 2.2.4 and applied related fixes. Doc updates.
Nabeel committed May 28, 2020
1 parent 9dbf33c commit 5bfad12
Showing 11 changed files with 154 additions and 82 deletions.
8 changes: 4 additions & 4 deletions Qlik-Py-Init.bat
@@ -14,13 +14,13 @@ cd ..
echo.
echo Installing required packages... & echo.
python -m pip install --upgrade setuptools pip
pip install wheel==0.34.2
pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3
pip install pystan==2.17
pip install fbprophet==0.4.post2
-pip install scikit-learn==0.21.3
-pip install hdbscan==0.8.23
-pip install skater==1.1.2
-pip install spacy==2.1.4
+pip install scikit-learn==0.23.1
+pip install hdbscan==0.8.26
+pip install spacy==2.2.4
pip install efficient_apriori==1.0.0
pip install tensorflow==1.14.0
pip install keras==2.2.5
8 changes: 4 additions & 4 deletions Qlik-Py-Init.ps1
@@ -8,13 +8,13 @@ Write-Output "Activating the virtual environment..."
& $PSScriptRoot\qlik-py-env\Scripts\activate.ps1
Write-Output "Installing required packages..."
python -m pip install --upgrade setuptools pip
pip install wheel==0.34.2
pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3
pip install pystan==2.17
pip install fbprophet==0.4.post2
-pip install scikit-learn==0.21.3
-pip install hdbscan==0.8.23
-pip install skater==1.1.2
-pip install spacy==2.1.4
+pip install scikit-learn==0.23.1
+pip install hdbscan==0.8.26
+pip install spacy==2.2.4
pip install efficient_apriori==1.0.0
pip install tensorflow==1.14.0
pip install keras==2.2.5
18 changes: 9 additions & 9 deletions README.md
@@ -2,13 +2,9 @@

## Announcements

-Version 7.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools).
+Version 8.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools).

-This release adds the capability to use pre-trained scikit-learn and Keras models with Qlik. More on this [here](docs/Pretrained.md).
-
-With version 6, Deep Learning capabilities were added through integration with Keras and Tensorflow. This offers powerful capabilities for sequence predictions and complex timeseries forecasting.
-
-PyTools now also includes the ability to use [Additional Regressors](docs/Prophet.md#additional-regressors) with Prophet, allowing you to model more complex timeseries.
+This release adds the capability to use pre-trained scikit-learn, Keras or REST API based models with Qlik. More on this [here](docs/Pretrained.md).

## Table of Contents

@@ -31,10 +27,10 @@ Sample Qlik Sense apps are included and explained so that the techniques shown h

The current implementation includes:

-- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html).
+- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik.
- **Unsupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering.
- **Deep Learning** : Implemented using [Keras](https://keras.io/) and [TensorFlow](https://www.tensorflow.org/). This SSE implements the full flow of setting up a neural network, training and evaluating it, and using it to make predictions. Deep Learning models can be used for sequence predictions and complex timeseries forecasting.
-- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn and Keras models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models.
+- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn, Keras and REST API based models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models.
- **Named Entity Recognition** : Implemented using [spaCy](https://spacy.io/), an excellent Natural Language Processing library that comes with pre-trained neural networks. This SSE allows you to use spaCy's models for Named Entity Recognition or retrain them with your data for even better results.
- **Association rules** : Implemented using [Efficient-Apriori](https://github.com/tommyod/Efficient-Apriori). Association Rules Analysis is a data mining technique to uncover how items are associated to each other. This technique is best known for Market Basket Analysis, but can be used more generally for finding interesting associations between sets of items that occur together, for example, in a transaction, a paragraph, or a diagnosis.
- **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis.
@@ -57,6 +53,10 @@ Deep Learning & Additional Regressors with Prophet:

[![Demonstration Video 2](docs/images/YouTube-02.png)](https://youtu.be/KM0Fo1wdMYw)

+Clustering COVID-19 Literature:
+
+[![Demonstration Video 3](docs/images/YouTube-03.png)](https://youtu.be/5fYWgglx84M)

## Note on the approach
In this project we have defined functions that expose open source algorithms to Qlik using the [gRPC framework](http://www.grpc.io/). Each function allows the user to define input data and parameters to control the underlying algorithm's output.

@@ -143,7 +143,7 @@ This installation requires Internet access. To install this SSE on a machine wit
4. Right click `Qlik-Py-Init.bat` and choose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication.
- Note that the script always ends with an "All done" message and does not check for errors.
- If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges).
-- Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
+- Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
- If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`.

5. Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`.
2 changes: 1 addition & 1 deletion core/_common.py
@@ -485,7 +485,7 @@ def _get_rest_response(self, X):
predictions_df = pd.DataFrame(predictions).astype("str")

# Return the required column from the response dataframe
-y = predictions_df.iloc[:,0].values if self.prediction_func == 'predict' else predictions_df['self.prediction_func'].values
+y = predictions_df.iloc[:,0].values if self.prediction_func == 'predict' else predictions_df[self.prediction_func].values
return y

def _add_model_path(self, model_path):
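
Note on the `core/_common.py` change: the old line used the quoted literal `'self.prediction_func'` as the column key, while the new line uses the attribute's value. A minimal sketch of the difference, using a plain dict as a stand-in for the response DataFrame's columns; the column names and the `prediction_func` value are hypothetical and not taken from the project:

```python
# Stand-in for the columns of the REST response (names are made up for illustration).
response_columns = {"predict": ["0", "1"], "predict_proba": ["0.2", "0.9"]}

prediction_func = "predict_proba"

# Old behaviour: the quoted literal is looked up as-is, so no such key is found.
try:
    response_columns["self.prediction_func"]
    literal_lookup_failed = False
except KeyError:
    literal_lookup_failed = True

# Fixed behaviour: the variable's value selects the intended column.
y = response_columns[prediction_func]
```

A pandas DataFrame raises the same `KeyError` when indexed with a column label that does not exist.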
46 changes: 14 additions & 32 deletions core/_sklearn.py
@@ -54,8 +54,7 @@
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, FeatureAgglomeration, KMeans,\
MiniBatchKMeans, MeanShift, SpectralClustering

-from skater.model import InMemoryModel
-from skater.core.explanations import Interpretation
+from sklearn.inspection import permutation_importance

# Workaround for Keras issue #1406
# "Using X backend." always printed to stdout #1406
@@ -1604,6 +1603,7 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
self.model.estimator_kwargs = {}
self.model.missing = "zeros"
self.model.calc_feature_importances = False
+self.model.importances_n_repeats = 30
self.model.lags= None
self.model.lag_target = False
self.model.scale_target = False
@@ -1714,6 +1714,10 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
# Flag to determine if feature importances should be calculated when the fit method is called
if 'calculate_importances' in execution_args:
self.model.calc_feature_importances = 'true' == execution_args['calculate_importances'].lower()

+# Sets the number of times a feature is randomly shuffled during the feature importance calculation
+if 'importances_n_repeats' in execution_args:
+self.model.importances_n_repeats = utils.atoi(execution_args['importances_n_repeats'])

# Set the debug option for generating execution logs
# Valid values are: true, false
@@ -1734,7 +1738,8 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
"time_series_split": self.model.time_series_split, "max_train_size":self.model.max_train_size, "lags":self.model.lags,\
"lag_target":self.model.lag_target, "scale_target":self.model.scale_target, "make_stationary":self.model.make_stationary,\
"random_state":self.model.random_state, "compress":self.model.compress, "retain_data":self.model.retain_data,\
-"calculate_importances": self.model.calc_feature_importances, "debug":self.model.debug}
+"calculate_importances": self.model.calc_feature_importances, "importances_n_repeats": self.model.importances_n_repeats,\
+"debug":self.model.debug}

self._print_log(1)

@@ -2301,41 +2306,18 @@ def _prep_confusion_matrix(self, y_test, y_pred, labels):
def _calc_importances(self, X=None, y=None):
"""
Calculate feature importances.
-Importances are calculated using the Skater library to provide this capability for all sklearn algorithms.
-For more information: https://www.datascience.com/resources/tools/skater
+Importances are calculated using sklearn.inspection.permutation_importance to provide this capability for all sklearn algorithms.
+https://scikit-learn.org/stable/modules/permutation_importance.html
"""

# Fill null values in the test set according to the model settings
X_test = utils.fillna(X, method=self.model.missing)

-# Calculate model agnostic feature importances using the skater library
-interpreter = Interpretation(X_test, training_labels=y, feature_names=self.model.features_df.index.tolist())
-
-if self.model.estimator_type == "classifier":
-try:
-# We use the predicted probabilities from the estimator if available
-predictor = self.model.pipe.predict_proba
-
-# Set up keyword arguments accordingly
-imm_kwargs = {"probability": True}
-except AttributeError:
-# Otherwise we simply use the predict method
-predictor = self.model.pipe.predict
+# Calculate mean importances
+importances = permutation_importance(self.model.pipe, X, y, n_repeats=self.model.importances_n_repeats, random_state=self.model.random_state)

-# Set up keyword arguments accordingly
-imm_kwargs = {"probability": False, "unique_values": self.model.pipe.classes_}
-
-# Set up a skater InMemoryModel to calculate feature importances
-imm = InMemoryModel(predictor, examples = X_test[:10], model_type="classifier", **imm_kwargs)
-
-elif self.model.estimator_type == "regressor":
-# Set up a skater InMemoryModel to calculate feature importances using the predict method
-imm = InMemoryModel(self.model.pipe.predict, examples = X_test[:10], model_type="regressor")
-
-# Add the feature importances to the model as a sorted data frame
-self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False)
-self.model.importances = pd.DataFrame(self.model.importances).reset_index()
-self.model.importances.columns = ["feature_name", "importance"]
+# Structure into a dataframe
+self.model.importances = pd.DataFrame({"feature_name": X_test.columns, "importance": importances.importances_mean})

def _send_table_description(self, variant):
"""
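
Note on the `core/_sklearn.py` change: `_calc_importances` now delegates to `sklearn.inspection.permutation_importance`, with `importances_n_repeats` controlling how many times each feature is shuffled. The idea can be sketched in pure Python as follows. This is an illustration only and not scikit-learn's implementation: the function name, the toy data, the stand-in model and the use of accuracy as the score are all assumptions made for this sketch.

```python
import random
from statistics import mean

def permutation_importance_sketch(predict, X, y, n_repeats=30, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return mean(1.0 if predict(r) == t else 0.0 for r, t in zip(rows, y))

    baseline = accuracy(X)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving the others untouched
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(permuted))
        importances.append(mean(drops))
    return importances

# Toy data: the label depends only on the first feature.
X = [[i % 2, (i * 7) % 5] for i in range(40)]
y = [row[0] for row in X]
model = lambda row: row[0]  # hypothetical "fitted" model that only reads feature 0

importances = permutation_importance_sketch(model, X, y, n_repeats=10, seed=42)
```

Shuffling a feature the model ignores leaves the score unchanged, so its importance is zero, while shuffling the feature the model relies on lowers the score. More repeats give a steadier mean at the cost of extra predictions.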
36 changes: 32 additions & 4 deletions core/_spacy.py
@@ -9,6 +9,7 @@
import warnings
import numpy as np
import pandas as pd
+from copy import copy

# Suppress warnings
if not sys.warnoptions:
@@ -327,15 +328,42 @@ def _prep_data(self):
entities = sample[1]["entities"]
entity_boundaries = []

-# For each entity
-for entity in entities:
+# Structure the entities and types into a DataFrame
+entities_df = pd.DataFrame(zip(*entities)).T
+entities_df.columns = ['ents', 'types']
+
+# For each unique entity
+for entity in entities_df.ents.unique():

# Set up a regex pattern to look for the entity w.r.t. word boundaries
-pattern = re.compile(r"\b" + entity[0] + r"\b")
+pattern = re.compile(r"\b" + entity + r"\b")
+
+# Get entity types for the entity. This may be a series of values if the entity appears more than once.
+types = entities_df[entities_df.ents == entity].types.reset_index(drop=True)
+has_multiple_types = True if len(types.unique()) > 1 else False
+i = 0

# Find all occurrences of the entity in the text
for match in re.finditer(pattern, text):
-entity_boundaries.append((match.start(), match.end(), entity[1]))
+entity_boundaries.append((match.start(), match.end(), types[i]))
+
+# Assign types according to the series
+if has_multiple_types:
+i += 1
+
+if len(entity_boundaries) > 0:
+# Prepare variables to check for overlapping entity boundaries
+start, stop, entity_type = map(list, zip(*entity_boundaries))
+
+# Drop overlapping entities, i.e. where an entity is a subset of a longer entity
+for i in range(len(start)):
+other_start, other_stop = copy(start), copy(stop)
+del other_start[i]
+del other_stop[i]
+
+for j in range(len(other_start)):
+if start[i] >= other_start[j] and stop[i] <= other_stop[j]:
+entity_boundaries.remove((start[i], stop[i], entity_type[i]))

# Add the entity boundaries to the sample
sample[1]["entities"] = entity_boundaries
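
Note on the `core/_spacy.py` change: the hunk locates each entity in the text with a word-boundary regex and then drops any span that sits inside a longer one. A simplified sketch of those two steps follows. It assumes one type per entity, omits the per-occurrence type series from the diff, and only drops spans nested in a strictly longer span. The function name and sample text are hypothetical, and `re.escape` is an addition of this sketch that the diff does not use.

```python
import re

def find_entity_boundaries(text, entities):
    """Return (start, end, type) spans, dropping spans nested inside longer ones."""
    boundaries = []
    for entity, entity_type in entities:
        # re.escape is an addition here; the diff concatenates the raw entity text
        pattern = re.compile(r"\b" + re.escape(entity) + r"\b")
        for match in pattern.finditer(text):
            boundaries.append((match.start(), match.end(), entity_type))

    def is_nested(span):
        start, end, _ = span
        # Nested means fully covered by a strictly longer span
        return any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in boundaries
        )

    return [span for span in boundaries if not is_nested(span)]

text = "New York City is larger than York."
entities = [("New York City", "GPE"), ("York", "GPE")]
spans = find_entity_boundaries(text, entities)
```

Here the "York" inside "New York City" is dropped, while the standalone "York" at the end of the sentence is kept. Building a new list rather than removing items from the list being checked avoids errors when a span is nested in more than one longer span.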
16 changes: 12 additions & 4 deletions core/_utils.py
@@ -195,7 +195,7 @@ def get_kwargs_by_type(dict_kwargs):

# Dictionary used to convert argument values to the correct type
types = {"boolean":ast.literal_eval, "bool":ast.literal_eval, "integer":atoi, "int":atoi,\
-"float":atof, "string":str, "str":str}
+"float":atof, "string":str, "str":str, "none":atonone, "None":atonone}

result_dict = {}

@@ -228,7 +228,7 @@ def get_kwargs_by_type(dict_kwargs):
b = b.capitalize()

# Handle None as an item in the dictionary
-if b == "None":
+if b in ("None", "none"):
d[types[split[2]](a)] = None
else:
d[types[split[2]](a)] = types[split[3]](b)
@@ -245,8 +245,8 @@ def get_kwargs_by_type(dict_kwargs):
if split[2] in ("boolean", "bool"):
i = i.capitalize()

-# Handle None as an item in the dictionary
-if i == "None":
+# Handle None as an item
+if i in ("None", "none"):
l.append(None)
else:
l.append(types[split[2]](i))
@@ -417,6 +417,14 @@ def atof(a):

return float(s.replace(",", "."))

+def atonone(a):
+"""
+Return None.
+Convenience function for type conversions.
+"""
+
+return None

def dict_to_sse_arg(d):
"""
Converts a dictionary to the argument syntax for this SSE
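
Note on the `core/_utils.py` change: the `types` dictionary maps a type name to a converter, and the new `atonone` entry lets `none`/`None` be used as a type name alongside the value-level check for `"None"` and `"none"`. A minimal sketch of that dispatch pattern follows. The `convert` helper and the simplified `atoi` are assumptions made for illustration, and the project's own parsers and argument syntax may differ.

```python
import ast

def atoi(a):
    """Stand-in integer parser; the project's own atoi may differ."""
    return int(float(a))

def atof(a):
    """Parse a float, accepting a comma as the decimal separator."""
    return float(str(a).replace(",", "."))

def atonone(a):
    """Return None regardless of input, as in the diff."""
    return None

TYPES = {"boolean": ast.literal_eval, "bool": ast.literal_eval, "integer": atoi, "int": atoi,
         "float": atof, "string": str, "str": str, "none": atonone, "None": atonone}

def convert(value, type_name):
    """Hypothetical helper: convert a string by type name, mapping 'None'/'none' to None."""
    if type_name in ("boolean", "bool"):
        # ast.literal_eval needs 'True'/'False', so normalise the case first
        value = value.capitalize()
    if value in ("None", "none"):
        return None
    return TYPES[type_name](value)

converted = [convert("3", "int"), convert("2,5", "float"), convert("true", "bool"),
             convert("none", "str"), convert("anything", "none")]
```

Keeping `atonone` in the same dictionary as the other converters means callers need no special case when the declared type itself is `none`.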
37 changes: 37 additions & 0 deletions docker/Dockerfile v.8.0
@@ -0,0 +1,37 @@
# Use an official Python runtime as a parent image
FROM python:3.6.8

# Set the working directory to /qlik-py-tools
WORKDIR /qlik-py-tools

# Copy the current directory contents into the container at /qlik-py-tools
COPY . /qlik-py-tools

# Install dependencies
RUN apt-get update
RUN apt-get install build-essential

# Upgrade pip and setuptools
RUN python -m pip install --upgrade setuptools pip

# Install required packages
RUN pip install wheel==0.34.2
RUN pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3
RUN pip install pystan==2.17
RUN pip install fbprophet==0.4.post2
RUN pip install scikit-learn==0.23.1
RUN pip install hdbscan==0.8.26
RUN pip install spacy==2.2.4
RUN pip install efficient_apriori==1.0.0
RUN pip install tensorflow==1.14.0
RUN pip install keras==2.2.5
RUN python -m spacy download en

# Make ports 80 and 50055 available to the world outside this container
EXPOSE 80 50055

# Set the working directory to /qlik-py-tools/core
WORKDIR /qlik-py-tools/core

# Run __main__.py when the container launches
CMD ["python", "__main__.py"]