diff --git a/Qlik-Py-Init.bat b/Qlik-Py-Init.bat index be2b079..9073d49 100644 --- a/Qlik-Py-Init.bat +++ b/Qlik-Py-Init.bat @@ -14,13 +14,13 @@ cd .. echo. echo Installing required packages... & echo. python -m pip install --upgrade setuptools pip +pip install wheel==0.34.2 pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3 pip install pystan==2.17 pip install fbprophet==0.4.post2 -pip install scikit-learn==0.21.3 -pip install hdbscan==0.8.23 -pip install skater==1.1.2 -pip install spacy==2.1.4 +pip install scikit-learn==0.23.1 +pip install hdbscan==0.8.26 +pip install spacy==2.2.4 pip install efficient_apriori==1.0.0 pip install tensorflow==1.14.0 pip install keras==2.2.5 diff --git a/Qlik-Py-Init.ps1 b/Qlik-Py-Init.ps1 index 13d9269..b068211 100644 --- a/Qlik-Py-Init.ps1 +++ b/Qlik-Py-Init.ps1 @@ -8,13 +8,13 @@ Write-Output "Activating the virtual environment..." & $PSScriptRoot\qlik-py-env\Scripts\activate.ps1 Write-Output "Installing required packages..." python -m pip install --upgrade setuptools pip +pip install wheel==0.34.2 pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3 pip install pystan==2.17 pip install fbprophet==0.4.post2 -pip install scikit-learn==0.21.3 -pip install hdbscan==0.8.23 -pip install skater==1.1.2 -pip install spacy==2.1.4 +pip install scikit-learn==0.23.1 +pip install hdbscan==0.8.26 +pip install spacy==2.2.4 pip install efficient_apriori==1.0.0 pip install tensorflow==1.14.0 pip install keras==2.2.5 diff --git a/README.md b/README.md index 20d76c5..ff8c7f9 100644 --- a/README.md +++ b/README.md @@ -2,13 +2,9 @@ ## Announcements -Version 7.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools). +Version 8.0 has been released. 
Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools). -This release adds the capability to use pre-trained scikit-learn and Keras models with Qlik. More on this [here](docs/Pretrained.md). - -With version 6, Deep Learning capabilities were added through integration with Keras and Tensorflow. This offers powerful capabilities for sequence predictions and complex timeseries forecasting. - -PyTools now also includes the ability to use [Additional Regressors](docs/Prophet.md#additional-regressors) with Prophet, allowing you to model more complex timeseries. +This release adds the capability to use pre-trained scikit-learn, Keras or REST API based models with Qlik. More on this [here](docs/Pretrained.md). ## Table of Contents @@ -31,10 +27,10 @@ Sample Qlik Sense apps are included and explained so that the techniques shown h The current implementation includes: -- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html). +- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. - **Unsupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering. - **Deep Learning** : Implemented using [Keras](https://keras.io/) and [TensorFlow](https://www.tensorflow.org/). 
This SSE implements the full flow of setting up a neural network, training and evaluating it, and using it to make predictions. Deep Learning models can be used for sequence predictions and complex timeseries forecasting. -- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn and Keras models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models. +- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn, Keras and REST API based models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models. - **Named Entity Recognition** : Implemented using [spaCy](https://spacy.io/), an excellent Natural Language Processing library that comes with pre-trained neural networks. This SSE allows you to use spaCy's models for Named Entity Recognition or retrain them with your data for even better results. - **Association rules** : Implemented using [Efficient-Apriori](https://github.com/tommyod/Efficient-Apriori). Association Rules Analysis is a data mining technique to uncover how items are associated to each other. This technique is best known for Market Basket Analysis, but can be used more generally for finding interesting associations between sets of items that occur together, for example, in a transaction, a paragraph, or a diagnosis. - **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis. 
@@ -57,6 +53,10 @@ Deep Learning & Additional Regressors with Prophet: [![Demonstration Video 2](docs/images/YouTube-02.png)](https://youtu.be/KM0Fo1wdMYw) +Clustering COVID-19 Literature: + +[![Demonstration Video 3](docs/images/YouTube-03.png)](https://youtu.be/5fYWgglx84M) + ## Note on the approach In this project we have defined functions that expose open source algorithms to Qlik using the [gRPC framework](http://www.grpc.io/). Each function allows the user to define input data and parameters to control the underlying algorithm's output. @@ -143,7 +143,7 @@ This installation requires Internet access. To install this SSE on a machine wit 4. Right click `Qlik-Py-Init.bat` and choose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication. - Note that the script always ends with an "All done" message and does not check for errors. - If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges). - - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. 
+ - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. - If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`. 5. Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`. diff --git a/core/_common.py b/core/_common.py index 16d9f34..4699c6a 100644 --- a/core/_common.py +++ b/core/_common.py @@ -485,7 +485,7 @@ def _get_rest_response(self, X): predictions_df = pd.DataFrame(predictions).astype("str") # Return the required column from the response dataframe - y = predictions_df.iloc[:,0].values if self.prediction_func == 'predict' else predictions_df['self.prediction_func'].values + y = predictions_df.iloc[:,0].values if self.prediction_func == 'predict' else predictions_df[self.prediction_func].values return y def _add_model_path(self, model_path): diff --git a/core/_sklearn.py b/core/_sklearn.py index d2d7d81..7448b16 100644 --- a/core/_sklearn.py +++ b/core/_sklearn.py @@ -54,8 +54,7 @@ from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, Birch, DBSCAN, FeatureAgglomeration, KMeans,\ MiniBatchKMeans, MeanShift, SpectralClustering -from skater.model import InMemoryModel -from skater.core.explanations import Interpretation +from sklearn.inspection import permutation_importance # Workaround for Keras issue #1406 # "Using X backend." 
always printed to stdout #1406 @@ -1604,6 +1603,7 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N self.model.estimator_kwargs = {} self.model.missing = "zeros" self.model.calc_feature_importances = False + self.model.importances_n_repeats = 30 self.model.lags= None self.model.lag_target = False self.model.scale_target = False @@ -1714,6 +1714,10 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N # Flag to determine if feature importances should be calculated when the fit method is called if 'calculate_importances' in execution_args: self.model.calc_feature_importances = 'true' == execution_args['calculate_importances'].lower() + + # Sets the number of times a feature is randomly shuffled during the feature importance calculation + if 'importances_n_repeats' in execution_args: + self.model.importances_n_repeats = utils.atoi(execution_args['importances_n_repeats']) # Set the debug option for generating execution logs # Valid values are: true, false @@ -1734,7 +1738,8 @@ def _set_params(self, estimator_args, scaler_args, execution_args, metric_args=N "time_series_split": self.model.time_series_split, "max_train_size":self.model.max_train_size, "lags":self.model.lags,\ "lag_target":self.model.lag_target, "scale_target":self.model.scale_target, "make_stationary":self.model.make_stationary,\ "random_state":self.model.random_state, "compress":self.model.compress, "retain_data":self.model.retain_data,\ - "calculate_importances": self.model.calc_feature_importances, "debug":self.model.debug} + "calculate_importances": self.model.calc_feature_importances, "importances_n_repeats": self.model.importances_n_repeats,\ + "debug":self.model.debug} self._print_log(1) @@ -2301,41 +2306,18 @@ def _prep_confusion_matrix(self, y_test, y_pred, labels): def _calc_importances(self, X=None, y=None): """ Calculate feature importances. 
- Importances are calculated using the Skater library to provide this capability for all sklearn algorithms. - For more information: https://www.datascience.com/resources/tools/skater + Importances are calculated using sklearn.inspection.permutation_importance to provide this capability for all sklearn algorithms. + https://scikit-learn.org/stable/modules/permutation_importance.html """ # Fill null values in the test set according to the model settings X_test = utils.fillna(X, method=self.model.missing) - # Calculate model agnostic feature importances using the skater library - interpreter = Interpretation(X_test, training_labels=y, feature_names=self.model.features_df.index.tolist()) - - if self.model.estimator_type == "classifier": - try: - # We use the predicted probabilities from the estimator if available - predictor = self.model.pipe.predict_proba - - # Set up keyword arguments accordingly - imm_kwargs = {"probability": True} - except AttributeError: - # Otherwise we simply use the predict method - predictor = self.model.pipe.predict + # Calculate mean importances + importances = permutation_importance(self.model.pipe, X, y, n_repeats=self.model.importances_n_repeats, random_state=self.model.random_state) - # Set up keyword arguments accordingly - imm_kwargs = {"probability": False, "unique_values": self.model.pipe.classes_} - - # Set up a skater InMemoryModel to calculate feature importances - imm = InMemoryModel(predictor, examples = X_test[:10], model_type="classifier", **imm_kwargs) - - elif self.model.estimator_type == "regressor": - # Set up a skater InMemoryModel to calculate feature importances using the predict method - imm = InMemoryModel(self.model.pipe.predict, examples = X_test[:10], model_type="regressor") - - # Add the feature importances to the model as a sorted data frame - self.model.importances = interpreter.feature_importance.feature_importance(imm, progressbar=False, ascending=False) - self.model.importances = 
pd.DataFrame(self.model.importances).reset_index() - self.model.importances.columns = ["feature_name", "importance"] + # Structure into a dataframe + self.model.importances = pd.DataFrame({"feature_name": X_test.columns, "importance": importances.importances_mean}) def _send_table_description(self, variant): """ diff --git a/core/_spacy.py b/core/_spacy.py index 3371cda..5820222 100644 --- a/core/_spacy.py +++ b/core/_spacy.py @@ -9,6 +9,7 @@ import warnings import numpy as np import pandas as pd +from copy import copy # Suppress warnings if not sys.warnoptions: @@ -327,15 +328,42 @@ def _prep_data(self): entities = sample[1]["entities"] entity_boundaries = [] - # For each entity - for entity in entities: + # Structure the entities and types into a DataFrame + entities_df = pd.DataFrame(zip(*entities)).T + entities_df.columns = ['ents', 'types'] + + # For each unique entity + for entity in entities_df.ents.unique(): # Set up a regex pattern to look for the entity w.r.t. word boundaries - pattern = re.compile(r"\b" + entity[0] + r"\b") + pattern = re.compile(r"\b" + entity + r"\b") + + # Get entity types for the entity. This may be a series of values if the entity appears more than once. + types = entities_df[entities_df.ents == entity].types.reset_index(drop=True) + has_multiple_types = True if len(types.unique()) > 1 else False + i = 0 # Find all occurrences of the entity in the text for match in re.finditer(pattern, text): - entity_boundaries.append((match.start(), match.end(), entity[1])) + entity_boundaries.append((match.start(), match.end(), types[i])) + + # Assign types according to the series + if has_multiple_types: + i += 1 + + if len(entity_boundaries) > 0: + # Prepare variables to check for overlapping entity boundaries + start, stop, entity_type = map(list, zip(*entity_boundaries)) + + # Drop overlapping entities, i.e. 
where an entity is a subset of a longer entity + for i in range(len(start)): + other_start, other_stop = copy(start), copy(stop) + del other_start[i] + del other_stop[i] + + for j in range(len(other_start)): + if start[i] >= other_start[j] and stop[i] <= other_stop[j]: + entity_boundaries.remove((start[i], stop[i], entity_type[i])) # Add the entity boundaries to the sample sample[1]["entities"] = entity_boundaries diff --git a/core/_utils.py b/core/_utils.py index 928e4c6..448c54c 100644 --- a/core/_utils.py +++ b/core/_utils.py @@ -195,7 +195,7 @@ def get_kwargs_by_type(dict_kwargs): # Dictionary used to convert argument values to the correct type types = {"boolean":ast.literal_eval, "bool":ast.literal_eval, "integer":atoi, "int":atoi,\ - "float":atof, "string":str, "str":str} + "float":atof, "string":str, "str":str, "none":atonone, "None":atonone} result_dict = {} @@ -228,7 +228,7 @@ def get_kwargs_by_type(dict_kwargs): b = b.capitalize() # Handle None as an item in the dictionary - if b == "None": + if b in ("None", "none"): d[types[split[2]](a)] = None else: d[types[split[2]](a)] = types[split[3]](b) @@ -245,8 +245,8 @@ def get_kwargs_by_type(dict_kwargs): if split[2] in ("boolean", "bool"): i = i.capitalize() - # Handle None as an item in the dictionary - if i == "None": + # Handle None as an item + if i in ("None", "none"): l.append(None) else: l.append(types[split[2]](i)) @@ -417,6 +417,14 @@ def atof(a): return float(s.replace(",", ".")) +def atonone(a): + """ + Return None. + Convenience function for type conversions. 
+ """ + + return None + def dict_to_sse_arg(d): """ Converts a dictionary to the argument syntax for this SSE diff --git a/docker/Dockerfile v.8.0 b/docker/Dockerfile v.8.0 new file mode 100644 index 0000000..b04c1d2 --- /dev/null +++ b/docker/Dockerfile v.8.0 @@ -0,0 +1,37 @@ +# Use an official Python runtime as a parent image +FROM python:3.6.8 + +# Set the working directory to /qlik-py-tools +WORKDIR /qlik-py-tools + +# Copy the current directory contents into the container at /qlik-py-tools +COPY . /qlik-py-tools + +# Install dependencies +RUN apt-get update +RUN apt-get install build-essential + +# Upgrade pip and setuptools +RUN python -m pip install --upgrade setuptools pip + +# Install required packages +RUN pip install wheel==0.34.2 +RUN pip install grpcio==1.26.0 grpcio-tools==1.26.0 numpy==1.17.5 scipy==1.4.1 pandas==0.25.3 cython==0.29.14 joblib==0.11 holidays==0.9.11 pyyaml==5.3 +RUN pip install pystan==2.17 +RUN pip install fbprophet==0.4.post2 +RUN pip install scikit-learn==0.23.1 +RUN pip install hdbscan==0.8.26 +RUN pip install spacy==2.2.4 +RUN pip install efficient_apriori==1.0.0 +RUN pip install tensorflow==1.14.0 +RUN pip install keras==2.2.5 +RUN python -m spacy download en + +# Make ports 80 and 50055 available to the world outside this container +EXPOSE 80 50055 + +# Set the working directory to /qlik-py-tools/core +WORKDIR /qlik-py-tools/core + +# Run __main__.py when the container launches +CMD ["python", "__main__.py"] \ No newline at end of file diff --git a/docs/Pretrained.md b/docs/Pretrained.md index 4046b12..2584f7d 100644 --- a/docs/Pretrained.md +++ b/docs/Pretrained.md @@ -27,31 +27,31 @@ This SSE also provides capabilities for training machine learning models entirel ## Pre-requisites -- This SSE currently supports scikit-learn and Keras models that have been saved to disk. -- The models need to be built with the same version of Python that is being used by the SSE (3.6.x). 
+- This SSE supports REST API based models, as well as scikit-learn and Keras models that have been saved to disk. +- Keras and scikit-learn models need to be built with the same version of Python that is being used by the SSE (3.6.x). - scikit-learn models need to be saved using the Pickle library. - Keras models need to be saved using the Keras model.save method. -- The Keras version needs to match the SSE. +- The scikit-learn and Keras version used to build the model needs to match the SSE. - Preprocessing (e.g. scaling, OHE) needs to be handled by the model / pipeline. ## Setting up the model This SSE will handle the communication between Qlik and Python and call the specified model. However, we need certain details for the model to translate the incoming data and call the model correctly. This information has to be supplied through a YAML file. -The YAML file needs to be placed in the SSE's `qlik-py-env/models` directory. The file needs to provide: +The YAML file needs to be placed in the SSE's `qlik-py-env/models` directory. The tags available for defining a model are provided below: -- **path**: Relative or absolute path to the model. -- **type**: Type of the model. - - Currently supported values are `scikit-learn`, `sklearn` and `keras`. -- **preprocessor**: Optional preprocessor to prepare data for the model. - - This has to be a path to a Python object that implements the `transform` method and has been stored using `Pickle`. - - The SSE will call the preprocessor's `transform` method on the samples received from Qlik and use its output to call the model's prediction function. -- **features**: List of features expected by the model together with their data types. - - The order of the features is important and needs to be followed by the model and the Qlik app calling the model. - - The data types are required for correctly interpreting the data received from Qlik. Valid types are `int`, `float`, `str`, `bool`. 
- - The names of the features should correspond to fields in the Qlik app. +| Tag | Scope | Description | Sample Values | Remarks | +| --- | --- | --- | --- | --- | +| **path** | Mandatory | Relative or absolute path to the model | `../pretrained/HR-Attrition-v1.pkl`

`http://xxx:123/public/api/v1/model/predict` | This will be a URL if the model is exposed through a REST API. | +| **type** | Mandatory | Type of the model | `scikit-learn`, `sklearn`, `keras`, `rest` | Currently this SSE only supports scikit-learn, Keras and REST API based models. | +| **response_section** | Optional. Only applicable for REST. | Defines the section of the JSON response that will be returned to Qlik | `result` | A JSON response will typically contain several sections under the root. This tag can be used to specify the section which contains the predictions to be returned to Qlik. | +| **user** | Optional. Only applicable for REST. | Used to pass the endpoint key or user in the REST call | `qlik` | Note that this SSE does not handle encryption of the YAML file on disk. | +| **password** | Optional. Only applicable for REST. | Used to pass the password in the REST call | `password` | Note that this SSE does not handle masking of the password or encryption of the YAML file on disk. | +| **payload_header** | Optional. Only applicable for REST. | Used to nest the data/payload within a section of the JSON request | `features` | A REST API may require the JSON payload in the request to be contained within a parent object, e.g. `features`. | +| **preprocessor** | Optional | Pickled preprocessor that will be called to prepare data for the model | `../pretrained/HR-Attrition-prep-v1.pkl` | This has to be a path to a Python object that implements the `transform` method and has been stored using `Pickle`. The SSE will call the preprocessor's `transform` method on the samples received from Qlik and use its output to call the model's prediction function. | +| **features** | Mandatory | List of features expected by the model together with their data types | `overtime : str`

`salary : float` | The order of the features is important and needs to be followed by the model and the Qlik app calling the model.

The data types are required for correctly interpreting the data received from Qlik. Valid types are `int`, `float`, `str`, `bool`.

The names of the features should correspond to fields in the Qlik app.

Please refer to the examples below for formatting this list. | -Here is a sample YAML file. You can also find complete examples [here](sample-scripts/HR-Attrition-v1.yaml) and [here](sample-scripts/HR-Attrition-v2.yaml). +Here is a sample YAML file for a scikit-learn model. You can also find complete examples [here](sample-scripts/HR-Attrition-v1.yaml) and [here](sample-scripts/HR-Attrition-v2.yaml). ``` --- @@ -62,6 +62,23 @@ features: salary : float ... ``` +Here is a sample YAML file for a deployed model that is exposed via a REST API. +``` +--- +path: http://xxx:123/public/api/v1/procedures/predict +type: rest +user: abc +response_section: result +payload_header: features +features: + admit_date: str + patient_status: str + proc_date: str + proc_desc: str + surg_descrp: str + surgery_type: str +... +``` ## Calling the model diff --git a/docs/README.md b/docs/README.md index 8ab86d3..0294335 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,13 +2,9 @@ ## Announcements -Version 7.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools). +Version 8.0 has been released. Get it [here](https://github.com/nabeel-oz/qlik-py-tools/releases) or with [Docker](https://hub.docker.com/r/nabeeloz/qlik-py-tools). -This release adds the capability to use pre-trained scikit-learn and Keras models with Qlik. More on this [here](Pretrained.md). - -With version 6, Deep Learning capabilities were added through integration with Keras and Tensorflow. This offers powerful capabilities for sequence predictions and complex timeseries forecasting. - -PyTools now also includes the ability to use [Additional Regressors](Prophet.md#additional-regressors) with Prophet, allowing you to model more complex timeseries. +This release adds the capability to use pre-trained scikit-learn, Keras or REST API based models with Qlik. More on this [here](Pretrained.md). 
## Table of Contents @@ -31,10 +27,10 @@ Sample Qlik Sense apps are included and explained so that the techniques shown h The current implementation includes: -- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. In addition, models can be interpreted using [Skater](https://datascienceinc.github.io/Skater/overview.html). +- **Supervised Machine Learning** : Implemented using [scikit-learn](http://scikit-learn.org/stable/index.html), the go-to machine learning library for Python. This SSE implements the full machine learning flow from data preparation, model training and evaluation, to making predictions in Qlik. - **Unsupervised Machine Learning** : Also implemented using [scikit-learn](http://scikit-learn.org/stable/index.html). This provides capabilities for dimensionality reduction and clustering. - **Deep Learning** : Implemented using [Keras](https://keras.io/) and [TensorFlow](https://www.tensorflow.org/). This SSE implements the full flow of setting up a neural network, training and evaluating it, and using it to make predictions. Deep Learning models can be used for sequence predictions and complex timeseries forecasting. -- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn and Keras models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models. +- **Use of pretrained ML models in Qlik** : Pre-trained scikit-learn, Keras and REST API based models can be called from this SSE, allowing predictions to be exposed within the broader analysis and business context of a Qlik app. The implementation also allows for What-if analysis using the models. 
- **Named Entity Recognition** : Implemented using [spaCy](https://spacy.io/), an excellent Natural Language Processing library that comes with pre-trained neural networks. This SSE allows you to use spaCy's models for Named Entity Recognition or retrain them with your data for even better results. - **Association rules** : Implemented using [Efficient-Apriori](https://github.com/tommyod/Efficient-Apriori). Association Rules Analysis is a data mining technique to uncover how items are associated to each other. This technique is best known for Market Basket Analysis, but can be used more generally for finding interesting associations between sets of items that occur together, for example, in a transaction, a paragraph, or a diagnosis. - **Clustering** : Implemented using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html), a high performance algorithm that is great for exploratory data analysis. @@ -57,6 +53,10 @@ Deep Learning & Additional Regressors with Prophet: [![Demonstration Video 2](images/YouTube-02.png)](https://youtu.be/KM0Fo1wdMYw) +Clustering COVID-19 Literature: + +[![Demonstration Video 3](images/YouTube-03.png)](https://youtu.be/5fYWgglx84M) + ## Note on the approach In this project we have defined functions that expose open source algorithms to Qlik using the [gRPC framework](http://www.grpc.io/). Each function allows the user to define input data and parameters to control the underlying algorithm's output. @@ -143,7 +143,7 @@ This installation requires Internet access. To install this SSE on a machine wit 4. Right click `Qlik-Py-Init.bat` and choose 'Run as Administrator'. You can open this file in a text editor to review the commands that will be executed. If everything goes smoothly you will see a Python virtual environment being set up, project files being copied, some packages being installed and TCP Port `50055` being opened for inbound communication. 
- Note that the script always ends with an "All done" message and does not check for errors. - If you need to change the port you can do so in the file `core\__main__.py` by opening the file with a text editor, changing the value of the `_DEFAULT_PORT` variable, and then saving the file. You will also need to update `Qlik-Py-Init.bat` to use the same port in the `netsh` command. This command will only work if you run the batch file through an elevated command prompt (i.e. with administrator privileges). - - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `skater`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. + - Once the execution completes, do a quick scan of the log to see everything installed correctly. The libraries imported are: `grpcio`, `grpcio-tools`, `numpy`, `scipy`, `pandas`, `cython`, `joblib`, `pyyaml`, `pystan`, `fbprophet`, `scikit-learn`, `hdbscan`, `spacy`, `efficient-apriori`, `tensorflow`, `keras` and their dependencies. Also, check that the `core` and `generated` directories have been copied successfully to the newly created `qlik-py-env` directory. - If the initialization fails for any reason, you can simply delete the `qlik-py-env` directory and re-run `Qlik-Py-Init.bat`. 5. Now whenever you want to start this Python service you can run `Qlik-Py-Start.bat`. diff --git a/docs/images/YouTube-03.png b/docs/images/YouTube-03.png new file mode 100644 index 0000000..1ea6659 Binary files /dev/null and b/docs/images/YouTube-03.png differ
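The `core/_spacy.py` hunk in this change locates each entity in the sample text with a word-boundary regex and then drops any entity span that sits inside a longer one. The standalone sketch below illustrates that idea only and is not the code from the commit. The function name `find_entity_boundaries` is hypothetical, the use of `re.escape` is an addition, and it assumes each entity text carries a single label (the commit also handles an entity that appears with several types).

```python
import re


def find_entity_boundaries(text, entities):
    """Return (start, end, label) spans for entities found in text.

    `entities` is a list of (entity_text, label) pairs. This helper is
    hypothetical and assumes one label per entity text.
    """
    boundaries = []
    for entity, label in entities:
        # re.escape is an assumption added here so that entities containing
        # regex metacharacters are matched literally
        pattern = re.compile(r"\b" + re.escape(entity) + r"\b")
        for match in pattern.finditer(text):
            boundaries.append((match.start(), match.end(), label))

    # Drop exact duplicates while keeping the original order
    boundaries = list(dict.fromkeys(boundaries))

    def is_nested(span):
        # A span is nested if a strictly longer span fully contains it
        start, end, _ = span
        return any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in boundaries
        )

    return [span for span in boundaries if not is_nested(span)]


if __name__ == "__main__":
    sample = "New York City is in New York State."
    ents = [("New York", "GPE"), ("New York City", "GPE")]
    print(find_entity_boundaries(sample, ents))
```

Building a filtered list, rather than removing items from `entity_boundaries` in place, is a design choice of this sketch: a span contained in more than one longer span is then discarded only once.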