Retire skater, update spaCy

Retired the skater module. Updated to scikit-learn 0.23.1, which has the new capability to calculate feature importances. Updated to spaCy 2.2.4 and applied related fixes. Doc updates.
nabeel-oz · May 28, 2020 · 5bfad12
1 parent 9dbf33c

Showing 11 changed files with 154 additions and 82 deletions.
diff --git a/Qlik-Py-Init.bat b/Qlik-Py-Init.bat
@@ -14,13 +14,13 @@ cd ..
 echo.
 echo Installing required packages... & echo.
 python -m pip install --upgrade setuptools pip
+pip install wheel==0.34.2
 pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3
 pip install pystan==2.17
 pip install fbprophet==0.4.post2
-pip install scikit-learn==0.21.3
-pip install hdbscan==0.8.23
-pip install skater==1.1.2
-pip install spacy==2.1.4
+pip install scikit-learn==0.23.1
+pip install hdbscan==0.8.26
+pip install spacy==2.2.4
 pip install efficient_apriori==1.0.0
 pip install tensorflow==1.14.0
 pip install keras==2.2.5

diff --git a/Qlik-Py-Init.ps1 b/Qlik-Py-Init.ps1
@@ -8,13 +8,13 @@ Write-Output "Activating the virtual environment..."
 & $PSScriptRoot\qlik-py-env\Scripts\activate.ps1
 Write-Output "Installing required packages..."
 python -m pip install --upgrade setuptools pip
+pip install wheel==0.34.2
 pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3
 pip install pystan==2.17
 pip install fbprophet==0.4.post2
-pip install scikit-learn==0.21.3
-pip install hdbscan==0.8.23
-pip install skater==1.1.2
-pip install spacy==2.1.4
+pip install scikit-learn==0.23.1
+pip install hdbscan==0.8.26
+pip install spacy==2.2.4
 pip install efficient_apriori==1.0.0
 pip install tensorflow==1.14.0
 pip install keras==2.2.5

diff --git a/README.md b/README.md
@@ -2,13 +2,9 @@
 
 ## Announcements
 
-Version 7.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools).
+Version 8.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools).
 
-This release adds the capability to use pre-trained scikit-learn and Keras models with Qlik. More on this [here](docs/Pretrained.md).
-
-With version 6, Deep Learning capabilities were added through integration with Keras and Tensorflow. This offers powerful capabilities for sequence predictions and complex timeseries forecasting.
-
-PyTools now also includes the ability to use [Additional Regressors](docs/Prophet.md#additional-regressors) with Prophet, allowing you to model more complex timeseries.
+This release adds the capability to use pre-trained scikit-learn, Keras or REST API based models with Qlik. More on this [here](docs/Pretrained.md).
 
 ## Table of Contents
 
@@ -31,10 +27,10 @@ Sample Qlik Sense apps are included and explained so that the techniques shown h
 
 The current implementation includes:
 
-- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html).
+- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. 
 - **Unsupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering.
 - **Deep Learning** : Implemented using [Keras](https://keras.io/) and [TensorFlow](https://www.tensorflow.org/). This SSE implements the full flow of setting up a neural network, training and evaluating it, and using it to make predictions. Deep Learning models can be used for sequence predictions and complex timeseries forecasting.
-- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn and Keras models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models.
+- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn, Keras and REST API based models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models.
 - **Named Entity Recognition** : Implemented using [spaCy](https://spacy.io/), an excellent Natural Language Processing library that comes with pre-trained neural networks. This SSE allows you to use spaCy's models for Named Entity Recognition or retrain them with your data for even better results.
 - **Association rules** : Implemented using [Efficient-Apriori](https://github.com/tommyod/Efficient-Apriori). Association Rules Analysis is a data mining technique to uncover how items are associated to each other. This technique is best known for Market Basket Analysis, but can be used more generally for finding interesting associations between sets of items that occur together, for example, in a transaction, a paragraph, or a diagnosis.
 - **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis.  
@@ -57,6 +53,10 @@ Deep Learning & Additional Regressors with Prophet:
 
 [![Demonstration Video 2](docs/images/YouTube-02.png)](https://youtu.be/KM0Fo1wdMYw)
 
+Clustering COVID-19 Literature:
+
+[![Demonstration Video 3](docs/images/YouTube-03.png)](https://youtu.be/5fYWgglx84M)
+
 ## Note on the approach
 In this project we have defined functions that expose open source algorithms to Qlik using the [gRPC framework](http://www.grpc.io/). Each function allows the user to define input data and parameters to control the underlying algorithm's output. 
 
@@ -143,7 +143,7 @@ This installation requires Internet access. To install this SSE on a machine wit
 4. Right-click `Qlik-Py-Init.bat` and choose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication. 
      - Note that the script always ends with an "All done" message and does not check for errors. 
      - If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges).
-     - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
+     - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory.
      - If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`.
 
 5. Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`.

diff --git a/core/_common.py b/core/_common.py
@@ -485,7 +485,7 @@ def _get_rest_response(self, X):
         predictions_df = pd.DataFrame(predictions).astype("str")
 
         # Return the required column from the response dataframe
-        y = predictions_df.iloc[:,0].values if self.prediction_func == 'predict' else predictions_df['self.prediction_func'].values
+        y = predictions_df.iloc[:,0].values if self.prediction_func == 'predict' else predictions_df[self.prediction_func].values
         return y
 
     def _add_model_path(self, model_path):
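
The one-line fix in `_get_rest_response` matters because quoting the attribute name turns it into a literal column label rather than the value it holds. A minimal sketch of the difference, assuming a plain dict as a stand-in for the pandas DataFrame and a hypothetical `predict_proba` column name (neither is taken from the project):

```python
# Hypothetical stand-in for the REST response: a dict of columns replaces the DataFrame
predictions = {"predict": ["0", "1"], "predict_proba": ["0.2", "0.9"]}
prediction_func = "predict_proba"  # assumed value of self.prediction_func


def lookup_quoted(columns):
    # Old behaviour: looks for a column literally named 'self.prediction_func'
    return columns['self.prediction_func']


def lookup_variable(columns, name):
    # Fixed behaviour: uses the value held by the variable as the column name
    return columns[name]


try:
    lookup_quoted(predictions)
    quoted_failed = False
except KeyError:
    # No column carries the literal label, so the quoted lookup cannot succeed
    quoted_failed = True

selected = lookup_variable(predictions, prediction_func)
print(quoted_failed, selected)
```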

diff --git a/core/_sklearn.py b/core/_sklearn.py
@@ -54,8 +54,7 @@
 from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, FeatureAgglomeration, KMeans,\
                             MiniBatchKMeans, MeanShift, SpectralClustering
 
-from skater.model import InMemoryModel
-from skater.core.explanations import Interpretation
+from sklearn.inspection import permutation_importance
 
 # Workaround for Keras issue #1406
 # "Using X backend." always printed to stdout #1406 
@@ -1604,6 +1603,7 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
         self.model.estimator_kwargs = {}
         self.model.missing = "zeros"
         self.model.calc_feature_importances = False
+        self.model.importances_n_repeats = 30
         self.model.lags= None
         self.model.lag_target = False
         self.model.scale_target = False
@@ -1714,6 +1714,10 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
             # Flag to determine if feature importances should be calculated when the fit method is called
             if 'calculate_importances' in execution_args:
                 self.model.calc_feature_importances = 'true' == execution_args['calculate_importances'].lower()
+
+            # Sets the number of times a feature is randomly shuffled during the feature importance calculation
+            if 'importances_n_repeats' in execution_args:
+                self.model.importances_n_repeats = utils.atoi(execution_args['importances_n_repeats'])
 
             # Set the debug option for generating execution logs
             # Valid values are: true, false
@@ -1734,7 +1738,8 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N
                     "time_series_split": self.model.time_series_split, "max_train_size":self.model.max_train_size, "lags":self.model.lags,\
                     "lag_target":self.model.lag_target, "scale_target":self.model.scale_target, "make_stationary":self.model.make_stationary,\
                     "random_state":self.model.random_state, "compress":self.model.compress, "retain_data":self.model.retain_data,\
-                    "calculate_importances": self.model.calc_feature_importances, "debug":self.model.debug}
+                    "calculate_importances": self.model.calc_feature_importances, "importances_n_repeats": self.model.importances_n_repeats,\
+                    "debug":self.model.debug}
 
                     self._print_log(1)
 
@@ -2301,41 +2306,18 @@ def _prep_confusion_matrix(self, y_test, y_pred, labels):
     def _calc_importances(self, X=None, y=None):
         """
         Calculate feature importances.
-        Importances are calculated using the Skater library to provide this capability for all sklearn algorithms.
-        For more information: https://www.datascience.com/resources/tools/skater
+        Importances are calculated using sklearn.inspection.permutation_importance to provide this capability for all sklearn algorithms.
+        https://scikit-learn.org/stable/modules/permutation_importance.html
         """
 
         # Fill null values in the test set according to the model settings
         X_test = utils.fillna(X, method=self.model.missing)
 
-        # Calculate model agnostic feature importances using the skater library
-        interpreter = Interpretation(X_test, training_labels=y, feature_names=self.model.features_df.index.tolist())
-
-        if self.model.estimator_type == "classifier":
-            try:
-                # We use the predicted probabilities from the estimator if available
-                predictor = self.model.pipe.predict_proba
-
-                # Set up keyword arguments accordingly
-                imm_kwargs = {"probability": True}
-            except AttributeError:
-                # Otherwise we simply use the predict method
-                predictor = self.model.pipe.predict
+        # Calculate mean importances
+        importances = permutation_importance(self.model.pipe, X, y, n_repeats=self.model.importances_n_repeats, random_state=self.model.random_state)
 
-                # Set up keyword arguments accordingly
-                imm_kwargs = {"probability": False, "unique_values": self.model.pipe.classes_}
-
-            # Set up a skater InMemoryModel to calculate feature importances
-            imm = InMemoryModel(predictor, examples = X_test[:10], model_type="classifier", **imm_kwargs)
-
-        elif self.model.estimator_type == "regressor":
-            # Set up a skater InMemoryModel to calculate feature importances using the predict method
-            imm = InMemoryModel(self.model.pipe.predict, examples = X_test[:10], model_type="regressor")
-
-        # Add the feature importances to the model as a sorted data frame
-        self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False)
-        self.model.importances = pd.DataFrame(self.model.importances).reset_index()
-        self.model.importances.columns = ["feature_name", "importance"]
+        # Structure into a dataframe
+        self.model.importances = pd.DataFrame({"feature_name": X_test.columns, "importance": importances.importances_mean})
 
     def _send_table_description(self, variant):
         """

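The new `importances_n_repeats` argument controls how many times each feature is shuffled before the score drops are averaged. The idea can be sketched with the standard library alone. This is a hand-rolled illustration, not scikit-learn's implementation: the function `permutation_importance_sketch`, the R² scoring, the toy data and the hard-coded `predict` function are all assumptions made for the example.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination, used here as the model score."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_repeats=30, random_state=0):
    """Mean drop in score when each column of X (a list of row lists) is shuffled."""
    rng = random.Random(random_state)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving the others untouched
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the target depends only on the first feature
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def predict(row):
    # Hypothetical "fitted model" that has learned the true relationship
    return 3.0 * row[0]


importances = permutation_importance_sketch(predict, X, y, n_repeats=30, random_state=0)
print(importances)
```

Shuffling the second column leaves the predictions unchanged, so its importance is zero, while shuffling the first column degrades the score substantially. More repeats give a steadier mean at the cost of more predictions.
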
diff --git a/core/_spacy.py b/core/_spacy.py
@@ -9,6 +9,7 @@
 import warnings
 import numpy as np
 import pandas as pd
+from copy import copy
 
 # Suppress warnings
 if not sys.warnoptions:
@@ -327,15 +328,42 @@ def _prep_data(self):
             entities = sample[1]["entities"]
             entity_boundaries = []
 
-            # For each entity
-            for entity in entities:
+            # Structure the entities and types into a DataFrame
+            entities_df = pd.DataFrame(zip(*entities)).T
+            entities_df.columns = ['ents', 'types']
+
+            # For each unique entity
+            for entity in entities_df.ents.unique():
 
                 # Set up a regex pattern to look for the entity w.r.t. word boundaries 
-                pattern = re.compile(r"\b" + entity[0] + r"\b")
+                pattern = re.compile(r"\b" + entity + r"\b")
+
+                # Get entity types for the entity. This may be a series of values if the entity appears more than once.
+                types = entities_df[entities_df.ents == entity].types.reset_index(drop=True)
+                has_multiple_types = True if len(types.unique()) > 1 else False
+                i = 0
 
                 # Find all occurrences of the entity in the text
                 for match in re.finditer(pattern, text):
-                    entity_boundaries.append((match.start(), match.end(), entity[1]))
+                    entity_boundaries.append((match.start(), match.end(), types[i]))
+
+                    # Assign types according to the series
+                    if has_multiple_types:
+                        i += 1
+
+            if len(entity_boundaries)  > 0:
+                # Prepare variables to check for overlapping entity boundaries
+                start, stop, entity_type = map(list, zip(*entity_boundaries))
+
+                # Drop overlapping entities, i.e. where an entity is a subset of a longer entity
+                for i in range(len(start)):
+                    other_start, other_stop = copy(start), copy(stop)
+                    del other_start[i]
+                    del other_stop[i]
+
+                    for j in range(len(other_start)):
+                        if start[i] >= other_start[j] and stop[i] <= other_stop[j]:
+                            entity_boundaries.remove((start[i], stop[i], entity_type[i]))
 
             # Add the entity boundaries to the sample
             sample[1]["entities"] = entity_boundaries
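
The boundary detection and overlap handling in `_prep_data` can be illustrated in isolation. The sketch below is a simplified variant rather than the committed code: the function names are hypothetical, each entity keeps the first type it is listed with (the per-occurrence type handling is omitted), the entity text is passed through `re.escape`, and nested spans are filtered into a new list instead of being removed in place.

```python
import re


def find_entity_boundaries(text, entities):
    """Return (start, end, type) tuples for each unique (entity, type) pair found in text."""
    boundaries = []
    seen = set()
    for entity, entity_type in entities:
        if entity in seen:
            continue
        seen.add(entity)
        # Match the entity on word boundaries; escape it so punctuation is taken literally
        pattern = re.compile(r"\b" + re.escape(entity) + r"\b")
        for match in pattern.finditer(text):
            boundaries.append((match.start(), match.end(), entity_type))
    return boundaries


def drop_nested(boundaries):
    """Keep only spans that are not contained within another span."""
    kept = []
    for i, (start, stop, label) in enumerate(boundaries):
        nested = any(
            j != i and start >= other_start and stop <= other_stop
            for j, (other_start, other_stop, _) in enumerate(boundaries)
        )
        if not nested:
            kept.append((start, stop, label))
    return kept


text = "Bank of America opened an office in America."
entities = [("Bank of America", "ORG"), ("America", "GPE")]

raw = find_entity_boundaries(text, entities)
cleaned = drop_nested(raw)
print(raw)
print(cleaned)
```

Here the "America" inside "Bank of America" is dropped because it sits within the longer span, while the standalone "America" at the end of the sentence is kept.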

diff --git a/core/_utils.py b/core/_utils.py
@@ -195,7 +195,7 @@ def get_kwargs_by_type(dict_kwargs):
 
     # Dictionary used to convert argument values to the correct type
     types = {"boolean":ast.literal_eval, "bool":ast.literal_eval, "integer":atoi, "int":atoi,\
-             "float":atof, "string":str, "str":str}
+             "float":atof, "string":str, "str":str, "none":atonone, "None":atonone}
 
     result_dict = {}
 
@@ -228,7 +228,7 @@ def get_kwargs_by_type(dict_kwargs):
                         b = b.capitalize()
 
                     # Handle None as an item in the dictionary
-                    if b == "None":
+                    if b in ("None", "none"):
                         d[types[split[2]](a)] = None
                     else:
                         d[types[split[2]](a)] = types[split[3]](b)
@@ -245,8 +245,8 @@ def get_kwargs_by_type(dict_kwargs):
                     if split[2] in ("boolean", "bool"):
                         i = i.capitalize()
 
-                    # Handle None as an item in the dictionary
-                    if i == "None":
+                    # Handle None as an item
+                    if i in ("None", "none"):
                         l.append(None)
                     else:
                         l.append(types[split[2]](i))
@@ -417,6 +417,14 @@ def atof(a):
 
     return float(s.replace(",", "."))
 
+def atonone(a):
+    """
+    Return None.
+    Convenience function for type conversions.
+    """
+
+    return None
+
 def dict_to_sse_arg(d):
     """
     Converts a dictionary to the argument syntax for this SSE
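
The `atonone` helper lets `None` take part in the same type-dispatch table as the other converters. A minimal sketch of that pattern follows, assuming a simplified table that uses the built-in `int` and `float` in place of the project's `atoi` and `atof` helpers; `CONVERTERS`, `convert` and `convert_list` are hypothetical names for the example.

```python
import ast


def atonone(value):
    """Ignore the input and return None, mirroring the helper added in this commit."""
    return None


# Simplified converter table keyed by type name (assumption: built-ins replace atoi/atof)
CONVERTERS = {
    "bool": ast.literal_eval,
    "int": int,
    "float": float,
    "str": str,
    "none": atonone,
    "None": atonone,
}


def convert(value, type_name):
    """Convert a string value using the converter registered for type_name."""
    return CONVERTERS[type_name](value)


def convert_list(items, type_name):
    """Convert list items, treating 'None' or 'none' as None regardless of the item type."""
    return [None if i in ("None", "none") else convert(i, type_name) for i in items]


print(convert("anything", "none"))
print(convert_list(["1", "none", "3"], "int"))
```

Registering a converter for `none` means callers can look up every type name in the same way, without a special case before the lookup.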

diff --git a/docker/Dockerfile v.8.0 b/docker/Dockerfile v.8.0
@@ -0,0 +1,37 @@
+# Use an official Python runtime as a parent image
+FROM python:3.6.8
+
+# Set the working directory to /qlik-py-tools
+WORKDIR /qlik-py-tools
+
+# Copy the current directory contents into the container at /qlik-py-tools
+COPY . /qlik-py-tools
+
+# Install dependencies
+RUN apt-get update
+RUN apt-get install build-essential
+
+# Upgrade pip and setuptools
+RUN python -m pip install --upgrade setuptools pip
+
+# Install required packages
+RUN pip install wheel==0.34.2
+RUN pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3
+RUN pip install pystan==2.17
+RUN pip install fbprophet==0.4.post2
+RUN pip install scikit-learn==0.23.1
+RUN pip install hdbscan==0.8.26
+RUN pip install spacy==2.2.4
+RUN pip install efficient_apriori==1.0.0
+RUN pip install tensorflow==1.14.0
+RUN pip install keras==2.2.5
+RUN python -m spacy download en
+
+# Make ports 80 and 50055 available to the world outside this container
+EXPOSE 80 50055
+
+# Set the working directory to /qlik-py-tools/core
+WORKDIR /qlik-py-tools/core
+
+# Run __main__.py when the container launches
+CMD ["python", "__main__.py"]