Hi,

In my first post, I looked into the explainability of classical machine learning models. As a next step, I’m interested in the explainability of neural networks. Model explainability is easy for simple models (linear regression, decision trees), and some tools exist for more complex algorithms (ensemble trees). For a deeper theoretical understanding, I highly recommend the book Interpretable Machine Learning by Christoph Molnar. All the different approaches to model explainability are shown with a PyTorch model in this Kaggle notebook.

Complex models produce helpful predictions but often behave as a black box, meaning there is no straightforward explanation of the model’s behavior. However, stakeholders often request an explanation of the model’s predictions to check for causality and to gain trust in the model.

I would say that Captum is the standard library for model interpretability with PyTorch. Captum also includes implementations to calculate SHAP values or LIME. This overview shows all implemented algorithms, including application and complexity. Captum separates the algorithms into three groups: primary attribution, layer attribution, and neuron attribution. Primary attribution evaluates the effect of each feature on the model output. Layer attribution evaluates the effect of each neuron of a given layer on the model output. Neuron attribution evaluates the effect of an input feature on a particular neuron.
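To keep the snippets below short, here is a minimal, hypothetical setup that the later sketches reuse: a small feed-forward network standing in for the notebook’s model, a batch of inputs, and an all-zero baseline. The names model, inputs, and baselines are assumptions for illustration, not taken from the notebook.

import torch
import torch.nn as nn

# Hypothetical stand-in for the notebook's model: 4 input features, 2 classes.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

inputs = torch.randn(16, 4)           # batch of 16 samples with 4 features
baselines = torch.zeros_like(inputs)  # all-zero reference points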

Primary Attribution

Primary attribution evaluates the effect of each feature on the model output. The following algorithms are used in this example:

Integrated Gradients

Integrated Gradients is an axiomatic model interpretability algorithm that assigns an importance score to each input feature by approximating the integral of gradients of the model’s output with respect to the inputs along the path (a straight line) from given baselines/references to the inputs.
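A minimal sketch with Captum’s IntegratedGradients, reusing the toy model, inputs, and baselines from the setup above; target=1 is an assumed class index.

from captum.attr import IntegratedGradients

# model, inputs, baselines as defined in the setup sketch above.
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    inputs, baselines=baselines, target=1,
    n_steps=50, return_convergence_delta=True,
)
# attributions has the same shape as inputs; delta estimates the approximation error.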

Gradient SHAP

Gradient SHAP is a gradient method to compute SHAP values. Gradient SHAP adds Gaussian noise to each input sample multiple times, selects a random point along the path between baseline and input, and computes the gradient of the outputs with respect to those randomly selected points. The final SHAP values represent the expected value of gradients * (inputs - baselines). Gradient SHAP assumes that each feature is independent and that the explanation model is linear between the inputs and the given baselines.
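A sketch with Captum’s GradientShap on the toy model from the setup above; here a small random distribution of baselines is assumed, and n_samples/stdevs are illustrative values.

from captum.attr import GradientShap

# torch, model, inputs as defined in the setup sketch above.
baseline_dist = torch.randn(20, 4) * 0.1  # distribution of reference points
gs = GradientShap(model)
attributions = gs.attribute(
    inputs, baselines=baseline_dist, n_samples=50, stdevs=0.09, target=1,
)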

DeepLIFT

DeepLIFT is a back-propagation-based approach that attributes a change to inputs based on the differences between the inputs and corresponding references (or baselines) for non-linear activations. DeepLIFT seeks to explain the difference in the output from reference in terms of the difference in inputs from reference. DeepLIFT uses the concept of multipliers to “blame” specific neurons for the difference in the output.
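The corresponding call with Captum’s DeepLift, again a minimal sketch reusing the assumed setup names.

from captum.attr import DeepLift

# model, inputs, baselines as defined in the setup sketch above.
dl = DeepLift(model)
attributions = dl.attribute(inputs, baselines=baselines, target=1)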

Feature Ablation

Feature ablation is a perturbation-based approach to compute attribution, which involves replacing each input feature with a given baseline/reference value, and computing the difference in the output. Input features can also be grouped and ablated together rather than individually.
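A sketch with Captum’s FeatureAblation on the toy model; without a feature_mask, every scalar input feature is ablated individually.

from captum.attr import FeatureAblation

# model, inputs, baselines as defined in the setup sketch above.
fa = FeatureAblation(model)
attributions = fa.attribute(inputs, baselines=baselines, target=1)
# Passing a feature_mask would group features so they are ablated together.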

Noise Tunnel

Noise Tunnel is a method that can be used on top of any of the attribution methods. Noise tunnel computes attribution multiple times, adding Gaussian noise to the input each time, and combines the calculated attributions based on the chosen type. The supported types for noise tunnels are:

Smoothgrad: The mean of the sampled attributions is returned. This approximates smoothing the given attribution method with a Gaussian kernel.
Smoothgrad Squared: The mean of the squared sample attributions is returned.
Vargrad: The variance of the sample attributions is returned.
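A sketch of NoiseTunnel wrapping Integrated Gradients on the toy model; the keyword names nt_type/nt_samples follow recent Captum releases, and stdevs is an illustrative noise level.

from captum.attr import IntegratedGradients, NoiseTunnel

# model, inputs, baselines as defined in the setup sketch above.
ig = IntegratedGradients(model)
nt = NoiseTunnel(ig)
attributions = nt.attribute(
    inputs, baselines=baselines, target=1,
    nt_type="smoothgrad", nt_samples=10, stdevs=0.1,
)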

Feature importance

Feature importance describes how useful a feature is at predicting a target variable. It also takes into account all interactions with other features. Additionally, feature importance can also be used for dimensionality reduction and feature selection. There are two ways of computing feature importance: via impurity importance (mean decrease in impurity) and via permutation importance (mean decrease in accuracy). Permutation importance is model-agnostic and should be preferred. A general drawback is correlated features: in this case, both features will score a lower importance, even though they may actually be important.
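A minimal, hypothetical sketch of permutation importance for the toy model from the setup above: shuffle one feature column at a time and measure how much the accuracy drops; the labels here are random placeholders, not real data.

# torch, model, inputs as defined in the setup sketch above; labels are placeholders.
labels = torch.randint(0, 2, (inputs.shape[0],))

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

base_acc = accuracy(inputs, labels)
for j in range(inputs.shape[1]):
    permuted = inputs.clone()
    permuted[:, j] = permuted[torch.randperm(inputs.shape[0]), j]  # shuffle feature j
    print(f"feature {j}: importance = {base_acc - accuracy(permuted, labels):.4f}")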

Layer Attribution

As the next step, let’s look at the effect of each neuron in a given layer.

Layer Conductance

Conductance combines the neuron activation with the partial derivatives of both the neuron with respect to the input and the output with respect to the neuron to build a more complete picture of neuron importance.

Conductance builds on Integrated Gradients (IG) by looking at the flow of IG attribution, which occurs through the hidden neuron.
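A sketch with Captum’s LayerConductance on the toy model; model[0] (the first linear layer of the assumed nn.Sequential) stands in for whatever layer is of interest.

from captum.attr import LayerConductance

# model, inputs, baselines as defined in the setup sketch above.
lc = LayerConductance(model, model[0])
attributions = lc.attribute(inputs, baselines=baselines, target=1)
# One attribution value per neuron of the chosen layer, per input sample.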

Layer Activation

Layer Activation is a simple approach for computing layer attribution, returning the activation of each neuron in the identified layer.
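The corresponding sketch with Captum’s LayerActivation, again on the assumed first linear layer of the toy model.

from captum.attr import LayerActivation

# model, inputs as defined in the setup sketch above.
la = LayerActivation(model, model[0])
activations = la.attribute(inputs)  # raw activations of each neuron in the layer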

Layer Integrated Gradients

Layer Integrated Gradients is a variant of Integrated Gradients that assigns an importance score to layer inputs or outputs, depending on whether we attribute to the former or the latter. Integrated Gradients is an axiomatic model interpretability algorithm that assigns an importance score to each input feature by approximating the integral of gradients of the model’s output with respect to the inputs along the path (a straight line) from given baselines/references to the inputs.
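A sketch with Captum’s LayerIntegratedGradients on the toy model and its assumed first linear layer.

from captum.attr import LayerIntegratedGradients

# model, inputs, baselines as defined in the setup sketch above.
lig = LayerIntegratedGradients(model, model[0])
attributions = lig.attribute(inputs, baselines=baselines, target=1)
# attribute_to_layer_input=True would attribute to the layer's inputs instead of its outputs.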

Neuron Attribution

We can also look at the distribution of each neuron’s attributions.

Neuron Conductance

Conductance combines the neuron activation with the partial derivatives of both the neuron with respect to the input and the output with respect to the neuron to build a more complete picture of neuron importance.

Conductance for a particular neuron builds on Integrated Gradients (IG) by looking at the flow of IG attribution from each input through a particular neuron.
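A sketch with Captum’s NeuronConductance on the toy model; the neuron index 3 in the assumed first linear layer is an arbitrary choice for illustration.

from captum.attr import NeuronConductance

# model, inputs, baselines as defined in the setup sketch above.
nc = NeuronConductance(model, model[0])
# Attribution of each input feature flowing through neuron 3 of the chosen layer.
attributions = nc.attribute(inputs, 3, baselines=baselines, target=1)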

Shapley values

SHAP (SHapley Additive exPlanations) is a method to explain individual predictions. It is based on Shapley values and originates from cooperative game theory, where it is a method for assigning payouts to players depending on their contribution to the total payout: players cooperate in a coalition and receive a profit from this cooperation. In machine learning, the “payout” is the difference between a single prediction and the average prediction, and the Shapley value of a feature is its average contribution across all possible coalitions. Keep in mind that the calculation of Shapley values can be computationally expensive.

The difference between the prediction and the average prediction is fairly distributed among the feature values. This allows a contrastive explanation down to a single data point. The Shapley value returns a simple value per feature but no prediction model, so it cannot be used to make statements about changes in the prediction for changes in the input. Another disadvantage is that the calculation needs the training data. And like other permutation-based methods, Shapley values suffer from correlated features.

You can find additional information here:

With a customized prediction function, we can use the SHAP framework.
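A minimal sketch of how the shap package’s KernelExplainer could be wrapped around the toy PyTorch model via such a custom prediction function; the shap import, the background slice, and the function name predict_fn are assumptions for illustration.

import numpy as np
import shap  # the SHAP package, assumed to be installed

# torch, model, inputs as defined in the setup sketch above.
def predict_fn(x: np.ndarray) -> np.ndarray:
    with torch.no_grad():
        return model(torch.from_numpy(x).float()).numpy()

background = inputs[:10].numpy()                  # background data for the explainer
explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(inputs[:5].numpy())  # explain a few samples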

LIME

Local interpretable model-agnostic explanations (LIME) is a method, proposed in the paper of the same name, that uses interpretable surrogate models to explain individual predictions of machine learning models. LIME trains simple models to approximate the predictions of the underlying model. As a first step, a new dataset is generated, consisting of perturbed samples with the corresponding predictions of the model. On this new dataset, LIME trains an interpretable model (decision tree, linear regression), which is weighted by the proximity of the sampled instances to the instance of interest.
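Captum ships its own Lime implementation; a minimal sketch on the toy model from the setup above, where n_samples (the number of perturbed samples used to fit the local surrogate) is an illustrative value.

from captum.attr import Lime

# model, inputs, baselines as defined in the setup sketch above.
lime = Lime(model)
# Explain a single instance of interest.
attributions = lime.attribute(
    inputs[:1], baselines=baselines[:1], target=1, n_samples=200,
)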

One problem is that the generation of the artificial dataset has its weaknesses, and multiple settings need to be tested before a final LIME model can be used. Another problem is that the explanations of close points can vary greatly. This instability means that any explanation should be evaluated critically.

You can find additional information here:

Thank you for your attention.