Explainability Part
— Lecture 1: XAI Intro —
What is Explainable AI/ML
• No consensus on a universal definition: definitions are domain-specific
• Interpretability: ability to explain or to present in understandable terms to a
human
⁃ The degree to which a human can understand the cause of a decision
⁃ The degree to which a human can consistently predict the result of a
model
• Explanation: answer to a why question
⁃ Usually relates the feature values of an instance to its model prediction
in a humanly understandable way
• Molnar: model interpretability (global) vs explanation of an individual
prediction (local)
• Ribeiro: explainable models are interpretable if they use a small set of
features; "an explanation is a local linear approximation of the model's
behavior" (see the sketch below)
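Ribeiro's "local linear approximation" idea can be illustrated in a few lines: perturb the instance of interest, query the black box on the perturbations, and fit a distance-weighted linear model whose coefficients serve as the explanation. A minimal sketch in that spirit (not the actual LIME implementation; the black-box model, perturbation scale, and proximity kernel below are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Toy black-box model; dataset and perturbation scale are illustrative choices.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]                                                # instance to explain
rng = np.random.default_rng(0)
Z = x0 + rng.normal(0.0, 0.5, size=(1000, X.shape[1]))   # perturbed neighbours
p_z = black_box.predict_proba(Z)[:, 1]                   # black-box outputs
weights = np.exp(-np.linalg.norm(Z - x0, axis=1) ** 2)   # closer points count more

# The weighted linear surrogate is the local explanation: its coefficients
# approximate how each feature drives the prediction around x0.
surrogate = Ridge(alpha=1.0).fit(Z, p_z, sample_weight=weights)
for name, coef in zip([f"x{i}" for i in range(X.shape[1])], surrogate.coef_):
    print(name, round(coef, 3))
```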
Motivation: why do we need XAI?
• Scientific understanding
• Bias/fairness issues: does my model discriminate?
• Model debugging and auditing: why did my model make this mistake? How can I
understand/interfere with the model?
• Human-AI cooperation/acceptance
• Regulatory compliance: does my model satisfy legal requirements (e.g. GDPR)?
Healthcare, finance/banking, insurance
• Applications: affect recognition in video games, intelligent tutoring systems,
bank loan decision, bail/parole decisions, critical healthcare predictions (e.g.
cancer, major depression), film/music recommendation, job interview
recommendation/job offer, personality impression prediction for job interview
recommendation, tax exemption
Taxonomy
• Feature statistics: feature importance and interaction strengths (see the
permutation-importance sketch after this list)
• Feature visualizations: partial dependence and feature importance plots
• Model internals: linear model weights, DT structure, CNN filters, etc.
• Data points: exemplars in counterfactual explanations
• Global or local surrogates via intrinsically interpretable models
• Example: play tennis decision tree:
⁃ Intrinsic, model specific, global & local, model internals
• Example: CNN decision areas in images:
⁃ Post-hoc, model specific, local, model internals
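As an illustration of the "feature statistics" category above, a minimal permutation-importance sketch: shuffle one feature at a time and record the drop in test accuracy. The dataset and model are placeholders; any fitted classifier would do.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Placeholder dataset and model; permutation importance is model-agnostic.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
baseline = model.score(X_te, y_te)

rng = np.random.default_rng(0)
importances = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])     # break feature-target link
    importances.append(baseline - model.score(X_perm, y_te))

# Largest accuracy drops = most important features (a global feature statistic).
print(sorted(enumerate(np.round(importances, 4)), key=lambda t: -t[1])[:5])
```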
Scope of Interpretability
• Algorithmic transparency: how does the algorithm generate the model?
• Global, holistic model interpretability:
⁃ How does the trained model make predictions?
⁃ Can we comprehend the entire model at once?
• Global model interpretability on a modular level: how do parts of the model
affect predictions?
• Local interpretability for a single prediction: why did the model make a certain
prediction for an instance?
• Local interpretability for a group of predictions:
⁃ Why did the model make specific predictions for a group of instances?
⁃ May be used for analyzing group-wise bias
Evaluation of interpretability
• Application-level evaluation (real task):
⁃ Deploy the interpretation method on the application
⁃ Let the experts experiment and provide feedback
• Human-level evaluation (simple task): during development, by lay people
• Function-level evaluation (proxy task):
⁃ Does not use humans directly
⁃ Uses measures from a previous human evaluation
• All of the above can be used for evaluating model interpretability as well as
individual explanations
Properties of explanation methods
• Expressive power: the "language" or structure of the explanations
⁃ E.g. IF-THEN rules, tree itself, natural language etc.
• Translucency: describes how much the explanation method relies on looking
into the machine learning model
• Portability: describes the range of machine learning models with which the
explanation method can be used
• Algorithmic complexity: computational complexity of the explanation method
Properties of individual explanations
• Accuracy: how well does an explanation predict unseen data?
• Fidelity: how well does the explanation approximate the prediction of the black
box model?
• Certainty/confidence: does the explanation reflect the certainty of the machine
learning model?
• Comprehensibility/plausibility:
⁃ How well do humans understand the explanations?
⁃ How convincing (trust building) are they?
⁃ Difficult to define and measure, but extremely important to get right
• Consistency: how much does an explanation differ between models that are
trained on the same task and produce similar predictions?
• Stability: how similar are the explanations for similar instances?
⁃ Stability within a model vs consistency across models
• Degree of importance: how well does the explanation reflect the importance of
features or parts of the explanation?
• Novelty and representativeness
How is explainability usually measured?
• Fidelity: should be measured objectively; but not all explanations can be
checked for fidelity
• Plausibility: requires a user study
• Simulatability: the degree to which a human can calculate/predict the
model's outcome, given the explanation
What is a good explanation?
• Contrastive: requires a point of reference for comparison
• Selective: precise, a small set of most important factors; humans can handle
at most 7 ± 2 cognitive entities at once
• Social: considers the social context (environment/audience)
• Truthful (scientifically sound): good explanations prove to be true in reality
• General and probable: a cause that can explain many events is very general
and could be considered a good explanation
• Consistent with prior beliefs of the explainee
Interpretability vs Explainability revisited
• Individual terms may not have a precise definition, but interpretable models:
⁃ Are transparent and simple enough to understand
⁃ Stand for their own explanation
⁃ Thus, their explanations reflect perfect fidelity
• Black-box models:
⁃ Require post-hoc explanations (as an excuse for their opacity)
⁃ Cannot have perfect fidelity with respect to the original model
⁃ Their explanations often do not make sense, or do not provide enough
detail to understand what the black box is doing
⁃ Are often not compatible with situations where information outside the
database needs to be combined with a risk assessment
— Lecture 2: Interpretable models —
Linear models
• Nested model families: Generalized Additive Models ⊃ Additive Models ⊃
Linear Models ⊃ Scoring Systems
• Assumptions:
⁃ Linearity: f(x+y) = f(x) + f(y), f(cx) = cf(x)
⁃ Normality of the target variable
⁃ Homoscedasticity: constant variance
⁃ Independent instance distribution
⁃ Absence of multicollinearity: no pairs of strongly correlated features (a
rough check is sketched after this list)
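Two of these assumptions can be checked informally on a fitted model: pairwise feature correlations for multicollinearity, and residual spread across the fitted range for homoscedasticity. A rough sketch on synthetic data (the data and the split-at-the-median check are illustrative only, not a formal test):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Multicollinearity check: large off-diagonal correlations are a warning sign.
print("feature correlation matrix:\n", np.corrcoef(X, rowvar=False).round(2))

# Rough homoscedasticity check: residual spread should not depend on the fit.
low, high = fitted < np.median(fitted), fitted >= np.median(fitted)
print("residual std (low / high fitted values):",
      residuals[low].std().round(2), residuals[high].std().round(2))
```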
Interpretation of linear models
• Modular view: we assume all remaining feature values are fixed
• Numerical feature weight: increasing the numerical feature by one unit
changes the estimated outcome by its weight
• Binary feature weight: the contribution of the feature when it is set to 1
• Categorical feature with L categories:
⁃ Carry out one-hot-encoding into L binary features (e.g., 3 levels: 1 → [1, 0, 0],
2 → [0, 1, 0], 3 → [0, 0, 1]); see the sketch below
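A small sketch of these interpretation rules on made-up data (feature names and values are invented for illustration): the temperature weight is the predicted change in the outcome per one-unit increase, and each one-hot season weight is that category's contribution when its dummy feature equals 1.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up toy data: predict bike rentals from temperature and season.
df = pd.DataFrame({
    "temperature": [30, 25, 28, 22, 35, 20, 27, 24],
    "season": ["summer", "spring", "summer", "winter",
               "summer", "winter", "spring", "spring"],
    "rentals": [120, 90, 110, 40, 150, 35, 95, 85],
})

# One-hot-encode the categorical feature into L binary columns (here L = 3).
# In practice one level is often dropped as the reference category so that the
# remaining weights are interpreted relative to it.
X = pd.concat([df[["temperature"]],
               pd.get_dummies(df["season"], prefix="season")], axis=1)
model = LinearRegression().fit(X, df["rentals"])

# temperature weight: predicted change in rentals per +1 degree;
# season_* weight: contribution when that dummy feature equals 1.
for name, coef in zip(X.columns, model.coef_):
    print(name, round(coef, 2))
print("intercept:", round(model.intercept_, 2))
```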