Decoupling Lorax from Triage Specific config #12
base: master
Conversation
# NOTE-KA: I feel like the method should be independent of these as these seem very triage specific.
# We can always have a script that bridges the triage data with the explain API
# Leaving this decoupling to another PR
self.id_col = id_col
self.date_col = date_col
self.outcome_col = outcome_col
That's fair, though probably most true for the date column. I can't remember if we use the date anywhere as such; otherwise, we could maybe simply change this to a list of columns (in addition to the id and outcome) to be excluded from analysis (but either way, agree we can worry about that in a future PR)...
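The suggestion above (a configurable list of non-feature columns instead of hard-coded triage columns) could look roughly like this. This is a minimal sketch with a hypothetical select_feature_columns helper, not Lorax's actual API:

```python
import pandas as pd

def select_feature_columns(df, id_col, outcome_col, exclude_cols=None):
    """Return feature columns, dropping id/outcome plus any extra exclusions.

    Hypothetical helper: `exclude_cols` generalizes the hard-coded
    triage-specific date column to an arbitrary exclusion list.
    """
    drop = {id_col, outcome_col} | set(exclude_cols or [])
    return [c for c in df.columns if c not in drop]

df = pd.DataFrame({'entity_id': [1], 'as_of_date': ['2020-01-01'],
                   'outcome': [0], 'f1': [0.5], 'f2': [1.2]})
features = select_feature_columns(df, 'entity_id', 'outcome',
                                  exclude_cols=['as_of_date'])
# features -> ['f1', 'f2']
```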
raise ValueError('Must specify name patterns to aggregate over. '
                 'Use TheLorax.set_name_patterns() first.')
elif how not in ['features', 'patterns']:
    # NOTE-KA: Minor, in this case, should we default to features and let the code run with a warning?
yeah, that seems reasonable to me
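A minimal sketch of that fallback, using a hypothetical normalize_how helper (the real code would presumably do this inline in the elif branch):

```python
import warnings

def normalize_how(how):
    """Fall back to 'features' with a warning instead of raising.

    Hypothetical sketch of the behavior suggested in the review.
    """
    if how not in ('features', 'patterns'):
        warnings.warn(f"Unknown how={how!r}; defaulting to 'features'.")
        return 'features'
    return how
```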
self.model_info['aggregated_dict'] = return_tuple[3]

elif isinstance(self.clf, LogisticRegression):
    # Getting values for Random Forest Classifier
The comment should read # Getting values for Logistic Regression.
Also, should add a TODO here to handle ScaledLogisticRegression (which is what we actually generally use in triage).
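The dispatch under discussion could be sketched like this; classify_model is a hypothetical stand-in for the real branching code, and ScaledLogisticRegression handling is left as the TODO suggested above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def classify_model(clf):
    """Hypothetical sketch of the isinstance dispatch in the real method."""
    if isinstance(clf, RandomForestClassifier):
        # Getting values for Random Forest Classifier
        return 'random_forest'
    elif isinstance(clf, LogisticRegression):
        # Getting values for Logistic Regression (corrected comment)
        # TODO: also handle triage's ScaledLogisticRegression
        return 'logistic_regression'
    raise TypeError(f"Unsupported model type: {type(clf).__name__}")
```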
if pred_class is None:
    # TODO: Multiclass adaptation
    # use np.argmax(), or clf.predict()
    pred_class = np.argmax(scores)
I would probably log a warning here, since in most real cases we would likely have some other threshold/top-k, so if we get here it's probably more likely a mistake than intentional.
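A sketch of the suggested warning, using a hypothetical resolve_pred_class helper in place of the inline code:

```python
import logging
import numpy as np

logger = logging.getLogger(__name__)

def resolve_pred_class(pred_class, scores):
    """Fall back to argmax, but warn: in ranked/top-k settings this is
    probably a caller mistake rather than intentional (hypothetical sketch)."""
    if pred_class is None:
        pred_class = int(np.argmax(scores))
        logger.warning(
            "pred_class not given; falling back to argmax=%d. This is "
            "usually unintended when a threshold/top-k is in use.",
            pred_class)
    return pred_class
```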
# TODO: If descriptive is set, the importance scores
# are supported with the context provided by a test dataset
# The code is available in the original constructor, move it here
I think the TODO comment might be out of date?
    test_mat = self.X_test
    fstats = self.feature_stats
else:
    fstats = self.populate_feature_stats(test_mat)
It does feel like it would be easy for users to fall into a pattern where they end up doing a lot of extra work if they repeatedly invoke explain_example() with the same test matrix (rather than registering it first). The only potentially better option I see is to compare the passed matrix with what's already registered and calculate the stats only if they differ (probably within populate_feature_stats() rather than here, actually). However, I'm not sure comparing pandas matrices is efficient enough to be worth the overhead of avoiding the other calculations. Maybe add it as an issue to look into sometime down the road?
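The caching idea could be sketched as follows. StatsCache is hypothetical, and describe() stands in for the real feature-stats computation; DataFrame.equals() does the full comparison whose cost the comment above worries about:

```python
import pandas as pd

class StatsCache:
    """Hypothetical sketch: recompute feature stats only when the passed
    matrix differs from the one already registered."""

    def __init__(self):
        self._registered = None
        self._stats = None

    def populate_feature_stats(self, test_mat):
        if self._registered is not None and self._registered.equals(test_mat):
            return self._stats  # cache hit: skip recomputation
        self._registered = test_mat
        self._stats = test_mat.describe()  # stand-in for the real stats
        return self._stats
```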
else:
    contrib_df = self._build_contrib_df_sample(contrib_list, how=how)

return contrib_df
Also not necessary for this PR, but I wonder if we should somehow try to return the matplotlib axis/figure object as well when graphing. Might make sense as an issue to look at in the future, though.
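A sketch of what returning the plot objects might look like; plot_contributions is a hypothetical stand-in, not the actual Lorax method:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

def plot_contributions(contrib_df, graph=True):
    """Return the figure/axes alongside the contribution data so callers
    can tweak or save the plot (hypothetical sketch)."""
    fig, ax = None, None
    if graph:
        fig, ax = plt.subplots()
        contrib_df['contribution'].plot.barh(ax=ax)
        ax.set_xlabel('contribution')
    return contrib_df, fig, ax
```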
date_col='as_of_date', outcome_col='outcome',
name_patterns=None):
# TODO: This method should be removed after verifying that the new init compares
Seems like we could go ahead and remove it at this point, right?
self._populate_feature_stats()
self.feature_stats = self.populate_feature_stats(test_mat=self.X_test)

# TODO: make protected again. Making public for testing
should probably go ahead and do this here, unless there's a reason to keep it public?
- idx: index for example
- sample: the row matrix of the sample. Either an idx or a sample should be provided
should include feature_stats in the input list
Hmmm... if I'm reading correctly, it seems like you'd have to explicitly pass None for either idx or sample here. Could either make it a single required parameter (where you check the type) or default both to None, so they're not required/positional arguments.
# for arbitrary feature groupings), but if just using patterns for categoricals/imputed flags
# we should still be able to show relevant distribution info...

def explain_example_old(self, idx, pred_class=None, num_features=10, graph=True, how='features'):
should be ok to remove now as well, right?
import os
import sys
project_path = os.path.join(os.path.dirname(__file__), '../')
sys.path.append(project_path)
Pretty sure you shouldn't have to do this... where are you invoking the tests from (e.g., this directory or its parent)? At least according to this Stack Overflow answer:
https://stackoverflow.com/questions/1896918/running-unittest-with-typical-test-directory-structure
""" | ||
pass | ||
|
||
def test_old_vs_new_lorax(self): |
probably ok to remove these along with the old methods themselves.
There are different methods to get a descriptive explanation.
This test asserts all those methods yield the same answer.
"""
pass
looks like this might still need to be filled in?
Overall, looks good and thanks for doing all of this!
See a few inline comments for some notes/suggestions. Also, might be good to add a few tests specifically around the different options in the new interface (e.g., where/when you can specify a test matrix, when it gets overwritten, options for specifying an index or sample, etc.).
Anyway, feel free to make changes or create issues for those as you see fit, but approving now so you can merge whenever.
Added the following modifications:
Changed the constructor so that it is not dependent on a test dataset or the triage-specific column information to initialize the object.
Added a new function to load a dataset into the object and precompute feature stats. A dataset can be provided when the object is created or at a later point.
Modified the explain_example method to take in one of the following:
Users can use explain_example to acquire simple feature attribution scores or descriptive explanations (with the input feature stats). For descriptive explanations, the preloaded dataset can be used, or the user can supply a different test set.
Fixed a plotting error that occurred when a generic dataset was used.
Decoupled the feature stats calculation method from the object's test dataset. Now any test matrix can be used.
Tested the functionality of the new explain_example function against the old one to confirm the consistency of outputs.
Now Lorax should be usable with any dataset without the need to add placeholder columns.