[DOCS] Adds feature importance regression example (#1360) (#1361)

elastic · Sep 15, 2020
commit a0d4355 · 1 parent 9685023

Showing 6 changed files with 90 additions and 45 deletions.
diff --git a/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc b/docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
@@ -106,9 +106,13 @@ image::images/flights-regression-job-1.png["Creating a {dfanalytics-job} in {kib
 [role="screenshot"]
 image::images/flights-regression-job-2.png["Creating a {dfanalytics-job} in {kib}" – continued]
 
+[role="screenshot"]
+image::images/flights-regression-job-3.png["Creating a {dfanalytics-job} in {kib}" – advanced options]
+
 
 .. Choose `kibana_sample_data_flights` as the source index.
 .. Choose `regression` as the job type.
+.. Optionally improve the quality of the analysis by adding a query that removes erroneous data. In this case, we omit flights with a distance of 0 kilometers or less.
 .. Choose `FlightDelayMin` as the dependent variable, which is the field that we
 want to predict with the {reganalysis}.
 .. Add `Cancelled`, `FlightDelay`, and `FlightDelayType` to the list of excluded
@@ -117,16 +121,18 @@ exclude fields that either contain erroneous data or describe the
 `dependent_variable`.
 .. Choose a training percent of `90` which means it randomly selects 90% of the
 source data for training.
-.. Use the default feature importance values.
+.. If you want to experiment with <<ml-feature-importance,feature importance>>,
+specify a value in the advanced configuration options. In this example, we
+choose to return a maximum of 5 feature importance values per document. This
+option affects the speed of the analysis, so by default it is disabled.
 .. Use the default memory limit for the job. If the job requires more than this 
 amount of memory, it fails to start. If the available memory on the node is
 limited, this setting makes it possible to prevent job execution.
 .. Add a job ID and optionally a job description.
 .. Add the name of the destination index that will contain the results of the 
-analysis. It will contain a copy of the source index data where each document is 
-annotated with the results. If the index does not exist, it will be created 
-automatically.
-
+analysis. In {kib}, the index name matches the job ID by default. It will
+contain a copy of the source index data where each document is annotated with
+the results. If the index does not exist, it will be created automatically.
 
 .API example
 [%collapsible]
@@ -139,7 +145,7 @@ PUT _ml/data_frame/analytics/model-flight-delays
     "index": [
       "kibana_sample_data_flights"
     ],
-    "query": { <1>
+    "query": {
       "range": {
         "DistanceKilometers": { 
           "gt": 0
@@ -148,7 +154,7 @@ PUT _ml/data_frame/analytics/model-flight-delays
     }
   },
   "dest": {
-    "index": "df-flight-delays"
+    "index": "model-flight-delays"
   },
   "analysis": {
     "regression": {
@@ -167,9 +173,6 @@ PUT _ml/data_frame/analytics/model-flight-delays
 }
 --------------------------------------------------
 // TEST[skip:setup kibana sample data]
-
-<1> Optional query that removes erroneous data from the analysis to improve 
-quality.
 ====
 --
 
@@ -263,37 +266,36 @@ The API call returns the following response:
         "skipped_docs_count" : 0
       },
       "memory_usage" : {
-        "timestamp" : 1596237978801,
-        "peak_usage_bytes" : 2204548,
+        "timestamp" : 1599773614155,
+        "peak_usage_bytes" : 50156565,
         "status" : "ok"
       },
       "analysis_stats" : {
         "regression_stats" : {
-          "timestamp" : 1596237978801,
+          "timestamp" : 1599773614155,
           "iteration" : 18,
           "hyperparameters" : {
-            "alpha" : 168825.7788898173,
-            "downsample_factor" : 0.9033277769849748,
-            "eta" : 0.04884738703731517,
-            "eta_growth_rate_per_tree" : 1.0299887790757198,
+            "alpha" : 19042.721566629778,
+            "downsample_factor" : 0.911884068909842,
+            "eta" : 0.02331774683318904,
+            "eta_growth_rate_per_tree" : 1.0143154178910303,
             "feature_bag_fraction" : 0.5504020748926737,
-            "gamma" : 1454.4275926774008,
-            "lambda" : 2.1114872989215074,
+            "gamma" : 53.373570122718846,
+            "lambda" : 2.94058933878574,
             "max_attempts_to_add_tree" : 3,
             "max_optimization_rounds_per_hyperparameter" : 2,
-            "max_trees" : 427,
+            "max_trees" : 894,
             "num_folds" : 4,
             "num_splits_per_feature" : 75,
-            "soft_tree_depth_limit" : 5.8014874129785,
+            "soft_tree_depth_limit" : 2.945317520946171,
             "soft_tree_depth_tolerance" : 0.13448633124842999
           },
           "timing_stats" : {
-            "elapsed_time" : 124851,
-            "iteration_time" : 15081
+            "elapsed_time" : 302959,
+            "iteration_time" : 13075
           },
           "validation_loss" : {
-            "loss_type" : "mse",
-            "fold_values" : [ ]
+            "loss_type" : "mse"
           }
         }
       }
@@ -325,6 +327,27 @@ table to show only testing or training data and you can select which fields are
 shown in the table. You can also enable histogram charts to get a better
 understanding of the distribution of values in your data.
 
+If you chose to calculate feature importance, the destination index also
+contains `ml.feature_importance` objects. Every field that is included in the
+{reganalysis} (known as a _feature_ of the data point) is assigned a feature
+importance value. However, only the most significant values (in this case, the
+top 5) are stored in the index. These values indicate which features had the
+biggest (positive or negative) impact on each prediction. In {kib}, you can see
+this information displayed in the form of a decision plot:
+
+[role="screenshot"]
+image::images/flights-regression-importance.png["A decision plot for feature importance values in {kib}"]
+
+The decision path starts at a baseline, which is the average of the predictions
+for all the data points in the training data set. From there, the feature
+importance values are added to the decision path until it arrives at its final
+prediction. The features with the most significant positive or negative impact
+appear at the top. Thus in this example, the features related to the flight
+distance had the most significant influence on this particular predicted flight
+delay. This type of information can help you to understand how models arrive at  
+their predictions. It can also indicate which aspects of your data set are most
+influential or least useful when you are training and tuning your model.
+
 If you do not use {kib}, you can see the same information by using the standard
 {es} search command to view the results in the destination index.
 
@@ -333,7 +356,7 @@ If you do not use {kib}, you can see the same information by using the standard
 ====
 [source,console]
 --------------------------------------------------
-GET df-flight-delays/_search
+GET model-flight-delays/_search
 --------------------------------------------------
 // TEST[skip:TBD]
 
@@ -342,13 +365,35 @@ The snippet below shows a part of a document with the annotated results:
 [source,console-result]
 ----  
           ...
-          "DestCountry" : "GB",
-          "DestRegion" : "GB-ENG",
-          "OriginAirportID" : "CAN",
-          "DestCityName" : "London",
-          "ml" : {
-            "FlightDelayMin_prediction" : 10.039840698242188,
-            "is_training" : true
+          "DestCountry" : "CH",
+          "DestRegion" : "CH-ZH",
+          "OriginAirportID" : "VIE",
+          "DestCityName" : "Zurich",
+          "ml": {
+            "FlightDelayMin_prediction": 277.5392150878906,
+            "feature_importance": [
+            {
+              "feature_name": "DestCityName",
+              "importance": 0.6285966753441136
+            },
+            {
+              "feature_name": "DistanceKilometers",
+              "importance": 84.4982943868267
+            },
+            {
+              "feature_name": "DistanceMiles",
+              "importance": 103.90011847132116
+            },
+            {
+              "feature_name": "FlightTimeHour",
+              "importance": 3.7119156097309345
+            },
+            {
+              "feature_name": "FlightTimeMin",
+              "importance": 38.700587425831365
+            }
+            ],
+            "is_training": true
           }
           ...
 ----
@@ -385,16 +430,16 @@ You can alternatively generate these metrics with the
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
- "index": "df-flight-delays",   <1>
+ "index": "model-flight-delays",
   "query": {
       "bool": {
-        "filter": [{ "term":  { "ml.is_training": true } }]  <2>
+        "filter": [{ "term":  { "ml.is_training": true } }]  <1>
       }
     },
  "evaluation": {
    "regression": {
-     "actual_field": "FlightDelayMin",   <3>
-     "predicted_field": "ml.FlightDelayMin_prediction", <4>
+     "actual_field": "FlightDelayMin",   <2>
+     "predicted_field": "ml.FlightDelayMin_prediction", <3>
      "metrics": {  
        "r_squared": {},
        "mse": {}                            
@@ -405,10 +450,9 @@ POST _ml/data_frame/_evaluate
 --------------------------------------------------
 // TEST[skip:TBD]
 
-<1> The destination index which is the output of the {dfanalytics-job}.
-<2> Calculate the training error by evaluating only the training data.
-<3> The field that contains the actual (ground truth) value.
-<4> The field that contains the predicted value.
+<1> Calculate the training error by evaluating only the training data.
+<2> The field that contains the actual (ground truth) value.
+<3> The field that contains the predicted value.
 
 The API returns a response like this:
 
@@ -417,10 +461,10 @@ The API returns a response like this:
 {
   "regression" : {
     "mse" : {
-      "value" : 3125.3396943667544
+      "value" : 2604.920215688451
     },
     "r_squared" : {
-      "value" : 0.6659988649180306
+      "value" : 0.7162091232654141
     }
   }
 }
@@ -432,7 +476,7 @@ Next, we calculate the generalization error:
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
- "index": "df-flight-delays",
+ "index": "model-flight-delays",
   "query": {
       "bool": {
         "filter": [{ "term":  { "ml.is_training": false } }] <1>
@@ -460,4 +504,5 @@ about new data. Those steps are not covered in this example. See
 
 If you don't want to keep the {dfanalytics-job}, you can delete it. For example,
 use {kib} or the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API].
-When you delete {dfanalytics-jobs}, the destination indices remain intact.
+When you delete {dfanalytics-jobs} in {kib}, you have the option to also remove
+the destination indices and index patterns.
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-details.png b/docs/en/stack/ml/df-analytics/images/flights-regression-details.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-importance.png b/docs/en/stack/ml/df-analytics/images/flights-regression-importance.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-job-1.png b/docs/en/stack/ml/df-analytics/images/flights-regression-job-1.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-job-3.png b/docs/en/stack/ml/df-analytics/images/flights-regression-job-3.png
diff --git a/docs/en/stack/ml/df-analytics/images/flights-regression-results.png b/docs/en/stack/ml/df-analytics/images/flights-regression-results.png
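
The decision path paragraph added in this commit can be illustrated with a small sketch. The Python below is not part of the commit and is not how {kib} draws the plot. It only takes the `ml` object from the result snippet in the diff, ranks the stored feature importance values by absolute size, and accumulates them from a starting value. The function names `rank_features` and `decision_path` and the `baseline` argument are hypothetical. The snippet does not show the real baseline, and only the top 5 values are stored, so the accumulated total is not expected to equal the prediction.

```python
from typing import Dict, List, Tuple

# The "ml" object copied from the result snippet in the diff.
ml_result: Dict = {
    "FlightDelayMin_prediction": 277.5392150878906,
    "feature_importance": [
        {"feature_name": "DestCityName", "importance": 0.6285966753441136},
        {"feature_name": "DistanceKilometers", "importance": 84.4982943868267},
        {"feature_name": "DistanceMiles", "importance": 103.90011847132116},
        {"feature_name": "FlightTimeHour", "importance": 3.7119156097309345},
        {"feature_name": "FlightTimeMin", "importance": 38.700587425831365},
    ],
    "is_training": True,
}


def rank_features(ml: Dict) -> List[Tuple[str, float]]:
    """Return (feature_name, importance) pairs, largest absolute impact first."""
    pairs = [(f["feature_name"], f["importance"]) for f in ml["feature_importance"]]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


def decision_path(baseline: float, ranked: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Add each importance value to a running total that starts at `baseline`.

    `baseline` is a hypothetical input: the snippet does not contain the
    average training prediction, so the caller has to supply a value.
    """
    path = []
    total = baseline
    for name, importance in ranked:
        total += importance
        path.append((name, total))
    return path


ranked = rank_features(ml_result)
# Start from 0.0 so the final total is just the sum of the stored values.
path = decision_path(0.0, ranked)
stored_total = path[-1][1]
# Whatever the stored values do not explain: the baseline plus any features
# whose importance was not among the top 5 kept in the index.
remainder = ml_result["FlightDelayMin_prediction"] - stored_total

for name, running_total in path:
    print(f"{name:>20}  running total: {running_total:.2f}")
print(f"prediction minus stored importance: {remainder:.2f}")
```

In this document the two distance features rank first, which matches the commit's statement that the flight distance features had the most significant influence on this prediction.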