Abstract and 1. Introduction

1.1 Post Hoc Explanation

1.2 The Disagreement Problem

1.3 Encouraging Explanation Consensus

  2. Related Work

  3. PEAR: Post Hoc Explainer Agreement Regularizer

  4. The Efficacy of Consensus Training

    4.1 Agreement Metrics

    4.2 Improving Consensus Metrics

    4.3 Consistency At What Cost?

    4.4 Are the Explanations Still Valuable?

    4.5 Consensus and Linearity

    4.6 Two Loss Terms

  5. Discussion

    5.1 Future Work

    5.2 Conclusion, Acknowledgements, and References

Appendix

5 DISCUSSION

The empirical results we present demonstrate that our loss term is effective at its goal of boosting consensus among explainers. As with any first attempt at introducing a new objective to neural network training, we see modest results in some settings, along with evidence that the hyperparameters can likely be tuned further on a case-by-case basis. Our aim is not to leave practitioners with a how-to guide, but rather to begin exploring how they can control where a model lies along the accuracy-agreement trade-off curve.

We introduce a loss term that measures two types of correlation between explainers, which, unfortunately, adds complexity to the machine learning engineer's job of tuning models. However, we show conclusively that there are settings in which using both types of correlation encourages explanation consensus better than using either one alone.
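To make this concrete, the sketch below shows one way such a loss could be assembled: a convex combination of the task loss and an agreement penalty built from a value correlation and a rank correlation between a pair of explainers. The weights `lam` and `mu`, the explainer interfaces, and the hard-ranking shortcut are illustrative assumptions, not our exact formulation (which appears earlier in the paper).

```python
import torch.nn.functional as F


def pearson_corr(a, b, eps=1e-8):
    # Pearson correlation between attribution vectors, computed per example.
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)


def rank_corr(a, b):
    # Spearman-style correlation on feature ranks. Hard ranks are used here
    # for brevity; a differentiable soft-ranking operator (e.g., [3]) would be
    # needed in practice so gradients can flow through this term.
    ranks_a = a.argsort(dim=-1).argsort(dim=-1).float()
    ranks_b = b.argsort(dim=-1).argsort(dim=-1).float()
    return pearson_corr(ranks_a, ranks_b)


def consensus_loss(model, x, y, explainer_a, explainer_b, lam=0.5, mu=0.5):
    """Sketch: convex combination of task loss and a two-part agreement penalty."""
    task_loss = F.cross_entropy(model(x), y)

    # Feature attributions from two post hoc explainers (shape: batch x features).
    attr_a = explainer_a(model, x)
    attr_b = explainer_b(model, x)

    # Penalize disagreement in both attribution values and feature rankings.
    value_term = (1 - pearson_corr(attr_a, attr_b)).mean()
    rank_term = (1 - rank_corr(attr_a, attr_b)).mean()
    agreement_loss = mu * value_term + (1 - mu) * rank_term

    # lam controls where the model lands on the accuracy-agreement curve.
    return (1 - lam) * task_loss + lam * agreement_loss
```

Sweeping `lam` between 0 and 1 is one way to trace out the accuracy-agreement trade-off discussed above.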

Another limitation of these experiments as a guide to training for consensus is that we trained with only one pair of explainers. Our loss is defined for any pair, and another choice may better suit specific applications.

In light of the contentious debate on whether deep models or decision-tree-based methods are better for tabular data [10, 31, 38], we argue that developing new tools for training deep models can help promote wider adoption of deep learning for tabular data. Moreover, we hope that future work improves on the trends we present here, possibly leading to neural models with more agreement (and perhaps more accuracy) than their tree-based counterparts (such as XGBoost).

5.1 Future Work

Armed with the knowledge that training for consensus with PEAR is possible, we describe several exciting directions for future work. First, as alluded to above, we explored training with only one pair of explainers, but other pairs may help data scientists who have a specific type of target agreement. Work to better understand how a given pair of explainers in the loss affects the agreement of other explainers at test time could lead to principled decisions about how to use our loss in practice. Indeed, PEAR could fit into larger learning frameworks [22] that aim to select user- and task-specific explanation methods automatically.

It will be crucial to study the quality of explanations produced with PEAR from a human perspective. Ultimately, both the efficacy of a single explanation and the efficacy of agreement between multiple explanations are tied to how the explanations are used and interpreted. Since our work takes only a quantitative approach to demonstrating improvement when regularizing for explanation consensus, it remains to be seen whether human practitioners would make better judgments about models trained with PEAR than about models trained without it.

In terms of model architecture, we chose standard-sized MLPs for the experiments on our tabular datasets. Recent work proposes transformers [35] and even ResNets [10] for tabular data, so entirely different architectures could also be examined in future work.
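For reference, a tabular MLP baseline of the kind we mean might look like the sketch below; the width, depth, and dropout values are illustrative placeholders, not our exact experimental settings.

```python
import torch.nn as nn


def tabular_mlp(num_features, num_classes, width=128, depth=3, dropout=0.1):
    """A plain MLP baseline for tabular data; sizes here are illustrative."""
    layers, dim = [], num_features
    for _ in range(depth):
        layers += [nn.Linear(dim, width), nn.ReLU(), nn.Dropout(dropout)]
        dim = width
    layers.append(nn.Linear(dim, num_classes))  # logits over classes
    return nn.Sequential(*layers)
```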

Finally, research into developing better explainers could lead to an even more powerful consensus loss term. Recall that IntGrad integrates gradients along a path in input space. The designers of that algorithm point out that a straight path is the canonical choice due to its simplicity and symmetry [37]. Paths through input space that pass through more realistic data points, rather than points constructed by linear interpolation, could lead to even better agreement metrics on actual data.
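As a reminder of the mechanism in question, below is a minimal sketch of IntGrad-style attribution with the canonical straight-line path. The `path_fn` argument is a hypothetical hook, not part of the original algorithm, marking where a more data-realistic path could be substituted; a non-straight path would also require its own path derivative in the final scaling step.

```python
import torch


def integrated_gradients(model, x, baseline, target, steps=50, path_fn=None):
    """Sketch of IntGrad [37]: integrate gradients along a path from a baseline to the input."""
    if path_fn is None:
        # Canonical straight-line path: gamma(alpha) = baseline + alpha * (x - baseline)
        path_fn = lambda alpha: baseline + alpha * (x - baseline)

    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = path_fn(alpha).detach().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target]   # logit of the target class
        grads.append(torch.autograd.grad(score, point)[0])

    avg_grad = torch.stack(grads).mean(dim=0)  # Riemann approximation of the path integral
    return (x - baseline) * avg_grad           # straight-path derivative is (x - baseline)
```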

5.2 Conclusion

In the quest for fair and accessible deep learning, balancing interpretability and performance is key. It is known that common explainers may return conflicting results on the same model and input, to the detriment of an end user. The gains in explainer consensus we achieve with our method, however modest, are intended to kick-start further work on aligning machine learning models with the practical challenge of interpreting complex models for real-life stakeholders.

ACKNOWLEDGEMENTS

We thank Teresa Datta and Daniel Nissani at Arthur for their insights throughout the course of the project. We also thank Satyapriya Krishna, one of the authors of the original Disagreement Problem paper, for informative email exchanges that helped shape our experiments.

REFERENCES

[1] Sercan Ö Arik and Tomas Pfister. TabNet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, number 8, pages 6679–6687, 2021.

[2] Umang Bhatt, Adrian Weller, and José M. F. Moura. Evaluating and aggregating feature-based model explanations. CoRR, abs/2005.00631, 2020. URL https://arxiv.org/abs/2005.00631.

[3] Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differentiable sorting and ranking. In International Conference on Machine Learning (ICML), pages 950–959. PMLR, 2020.

[4] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[5] Vanessa Buhrmester, David Münch, and Michael Arens. Analysis of explainers of black box deep neural networks for computer vision: A survey. Machine Learning and Knowledge Extraction, 3(4):966–989, 2021.

[6] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 785–794, 2016.

[7] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[8] Marzyeh Ghassemi, Luke Oakden-Rayner, and Andrew L Beam. The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health, 3(11):e745–e750, 2021.

[9] Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept-based explanations. Conference on Neural Information Processing Systems (NeurIPS), 32, 2019.

[10] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, 34:18932–18943, 2021.

[11] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2022.

[12] Tessa Han, Suraj Srinivas, and Himabindu Lakkaraju. Which explanation should I choose? A function approximation perspective to characterizing post-hoc explanations. arXiv preprint arXiv:2206.01254, 2022.

[13] Sérgio Jesus, Catarina Belém, Vladimir Balayan, João Bento, Pedro Saleiro, Pedro Bizarro, and João Gama. How can I choose an explainer? An application-grounded evaluation of post-hoc explanations. In Conference on Fairness, Accountability, and Transparency (FAccT), pages 805–815, 2021.

[14] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International Conference on Machine Learning (ICML), pages 5338–5348, 2020.

[15] Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, and Himabindu Lakkaraju. The disagreement problem in explainable machine learning: A practitioner’s perspective. arXiv preprint arXiv:2202.01602, 2022.

[16] Gabriel Laberge, Yann Pequignot, Foutse Khomh, Mario Marchand, and Alexandre Mathieu. Partial order: Finding consensus among uncertain feature attributions. CoRR, abs/2110.13369, 2021. URL https://arxiv.org/abs/2110.13369.

[17] Roman Levin, Valeriia Cherepanova, Avi Schwarzschild, Arpit Bansal, C Bayan Bruss, Tom Goldstein, Andrew Gordon Wilson, and Micah Goldblum. Transfer learning with deep tabular models. arXiv preprint arXiv:2206.15306, 2022.

[18] Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.

[19] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[20] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 150–158, 2012.

[21] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Conference on Neural Information Processing Systems (NeurIPS), 30, 2017.

[22] Vedant Nanda, Duncan C. McElfresh, and John P. Dickerson. Learning to explain machine learning. In Operationalizing Human-Centered Perspectives in Explainable AI (HCXAI) Workshop at CHI-21, 2021.

[23] Petru Rares Sincraian. PePy PyPI download stats. pepy.tech, 2023. Accessed: 2023-01-25.

[24] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 1135–1144, 2016.

[25] Laura Rieger and Lars Kai Hansen. Aggregating explainability methods for neural networks stabilizes explanations. CoRR, abs/1903.00519, 2019. URL http://arxiv.org/abs/1903.00519.

[26] Laura Rieger, Chandan Singh, W. James Murdoch, and Bin Yu. Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. CoRR, abs/1909.13584, 2019. URL http://arxiv.org/abs/1909.13584.

[27] Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717, 2017.

[28] Ivan Rubachev, Artem Alekberov, Yury Gorishniy, and Artem Babenko. Revisiting pretraining objectives for tabular deep learning. arXiv preprint arXiv:2207.03208, 2022.

[29] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.

[30] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning (ICML), pages 3145–3153. PMLR, 2017.

[31] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.

[32] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[33] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Conference on Artificial Intelligence, Ethics, and Society (AIES), page 180–186, 2020.

[34] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

[35] Gowthami Somepalli, Micah Goldblum, Avi Schwarzschild, C Bayan Bruss, and Tom Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv preprint arXiv:2106.01342, 2021.

[36] Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping Yeh-Chiang, Yehuda Dar, Richard Baraniuk, Micah Goldblum, and Tom Goldstein. Can neural nets learn the same model twice? Investigating reproducibility and double descent from the decision boundary perspective. In Computer Vision and Pattern Recognition Conference (CVPR), pages 13699–13708, 2022.

[37] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML), pages 3319–3328. PMLR, 2017.

[38] Bojan Tunguz (@tunguz). Tweet, 2023. URL https://twitter.com/tunguz/status/1618343510784249856?s=20&t=e3EG7tg3pM398-dqzsw3UQ.

[39] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.

[40] Fulton Wang and Cynthia Rudin. Falling rule lists. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1013–1022. PMLR, 2015.

[41] Ethan Weinberger, Joseph D. Janizek, and Su-In Lee. Learned feature attribution priors. CoRR, abs/1912.10065, 2019. URL http://arxiv.org/abs/1912.10065.

Authors:

(1) Avi Schwarzschild, University of Maryland, College Park, Maryland, USA; work completed while at Arthur (avi1umd.edu);

(2) Max Cembalest, Arthur, New York City, New York, USA;

(3) Karthik Rao, Arthur, New York City, New York, USA;

(4) Keegan Hines, Arthur, New York City, New York, USA;

(5) John Dickerson†, Arthur, New York City, New York, USA.


This paper is available on arxiv under CC BY 4.0 DEED license.