Authors:

(1) Goran Muric, InferLink Corporation, Los Angeles, California ([email protected]);

(2) Ben Delay, InferLink Corporation, Los Angeles, California ([email protected]);

(3) Steven Minton, InferLink Corporation, Los Angeles, California ([email protected]).

Abstract and 1 Introduction

1.1 Motivation

2 Related Work and 2.1 Prompting techniques

2.2 In-context learning

2.3 Model interpretability

3 Method

3.1 Generating questions

3.2 Prompting LLM

3.3 Verbalizing the answers and 3.4 Training a classifier

4 Data and 4.1 Clinical trials

4.2 Catalonia Independence Corpus and 4.3 Climate Detection Corpus

4.4 Medical health advice data and 4.5 The European Court of Human Rights (ECtHR) Data

4.6 UNFAIR-ToS Dataset

5 Experiments

6 Results

7 Discussion

7.1 Implications for Model Interpretability

7.2 Limitations and Future Work

Reproducibility

Acknowledgment and References

A Questions used in ICE-T method

7.2 Limitations and Future Work

Despite its strengths, the ICE-T method has some limitations. The quality of the output relies heavily on the initial set of questions generated for the model to answer: poorly formulated questions, or questions that fail to capture the necessary subtleties of the task, limit the effectiveness of the technique. Moreover, the reliance on numerical scoring of textual answers can oversimplify complex responses, leading to a loss of nuance, especially when answers are confined to binary outputs.
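To make the scoring concern concrete, consider a minimal sketch of the verbalization step (the mapping below is illustrative; the exact rules in the released code may differ). Any hedged answer collapses to the same bit as a plain negative:

```python
def verbalize(answers):
    """Map the LLM's textual answers onto binary features.

    Hedged free-text answers ("probably yes", "yes, but only in
    some cases") collapse to a single bit, which is the loss of
    nuance discussed above.
    """
    return [1 if a.strip().lower() == "yes" else 0 for a in answers]

print(verbalize(["Yes", "no", "Yes, but only in some cases"]))  # [1, 0, 0]
```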

Future research could explore more sophisticated methods for question generation, perhaps incorporating active learning where the system identifies and prioritizes questions that would most improve its understanding and performance. Additionally, exploring different methods of encoding responses into feature vectors could further enhance the model’s accuracy and sensitivity to nuances in text.

Expanding the scope of ICE-T to tackle problems beyond binary classification could also prove beneficial. Applying this method to multi-class classification tasks or even regression problems could test the adaptability and scalability of the approach, potentially making it more applicable across a wider array of domains. This expansion could lead to significant advancements in the field of machine learning where interpretability and accuracy are crucial.

In conclusion, the ICE-T method presents a promising avenue for enhancing the performance and interpretability of LLMs in binary classification tasks and beyond. By bridging the gap between traditional machine learning techniques and modern LLM capabilities, this approach offers a valuable tool for applications demanding high accuracy and clear reasoning in decision-making processes. Further refinements and adaptations of this technique could significantly impact the deployment of AI in critical sectors, enhancing both the reliability and accountability of automated systems.

Reproducibility

The experiment consists of two primary phases: 1) collecting outputs from OpenAI's GPT models, specifically gpt-4-0125-preview or gpt-3.5-turbo-0125; and 2) verbalizing the answers (converting the responses into numerical form), training classifiers, and evaluating their performance on a hold-out test set.
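The second phase reduces to standard supervised learning over the verbalized answers. A toy sketch, assuming scikit-learn and placeholder feature vectors (the repository linked below implements the actual setup):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy data: one row per document, one binary feature per question,
# as produced by the verbalization step; y holds the gold labels.
X_train = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y_train = [1, 0, 1, 0]
X_test = [[1, 1, 0], [0, 0, 1]]
y_test = [1, 0]

clf = LogisticRegression().fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test)))
```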

The code to reproduce the verbalization, classifier training and testing is available on GitHub: https://github.com/gmuric/ICE-T

Due to data usage and confidentiality constraints associated with the clinical trial dataset, we are unable to share the complete working code for the first phase. However, we provide the outputs of the LLMs we obtained; they are available in the GitHub repository. We additionally provide pseudocode that illustrates the extraction of outputs from the language models, which can be used to reproduce the first part of the experiment. The complete references to the data used in the experiments are given in Section 4. Additionally, we include a comprehensive list of the questions used to prompt the language models in Appendix A. The pseudocode for obtaining the answers from the LLM is presented below.
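A minimal Python rendering of that loop, assuming the OpenAI client (openai >= 1.0) and an API key in the environment; the prompt template and the shape of the documents and questions collections are illustrative rather than the exact code used in the experiments:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4-0125-preview"  # or "gpt-3.5-turbo-0125"

def collect_answers(documents, questions, model=MODEL):
    """Ask the model every question about every document and keep
    the raw textual answers for later verbalization."""
    answers = {}
    for doc_id, text in documents.items():
        answers[doc_id] = []
        for question in questions:
            prompt = (
                f"Text:\n{text}\n\n"
                f"Question: {question}\n"
                "Answer with 'yes' or 'no' only."
            )
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,  # reduces, but does not eliminate, run-to-run variation
            )
            answers[doc_id].append(response.choices[0].message.content)
    return answers
```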

Note that due to the stochastic nature of large language models, the outputs may vary with each experiment. While these variations are unlikely to significantly impact the results, minor discrepancies are possible.

Acknowledgment

This material is based upon work supported by the Army ASA(ALT) SBIR CCOE under Contract No. W51701-22-C-0035 and the US Air Force under Contract No. FA8750-22-C-0511. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Army ASA(ALT) SBIR CCOE or the US Air Force.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.

Adrien Bibal, Rémi Cardon, David Alfter, Rodrigo Wilkens, Xiaoou Wang, Thomas François, and Patrick Watrin. 2022. Is attention explanation? An introduction to the debate. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3889–3900, Dublin, Ireland. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4317–4323, Florence, Italy. Association for Computational Linguistics.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234.

Bryce Goodman and Seth Flaxman. 2017. European union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine, 38(3):50–57.

Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2021. PPT: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332.

Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.

Marco Lippi, Przemysław Pałka, Giuseppe Contissa, Francesca Lagioia, Hans-Wolfgang Micklitz, Giovanni Sartor, and Paolo Torroni. 2019. CLAUDETTE: An automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law, 27:117–139.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.

Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.

John X Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander M Rush. 2023. Text embeddings reveal (almost) as much as text. arXiv preprint arXiv:2310.06816.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Advances in neural information processing systems, 34:11054–11070.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.

Timo Schick and Hinrich Schütze. 2020. It’s not just size that matters: Small language models are also few-shot learners. arXiv preprint arXiv:2009.07118.

Timo Schick and Hinrich Schütze. 2022. True few-shot learning with prompts—a real-world perspective. Transactions of the Association for Computational Linguistics, 10:716–731.

Chandan Singh, Jeevana Priya Inala, Michel Galley, Rich Caruana, and Jianfeng Gao. 2024. Rethinking interpretability in the era of large language models. arXiv preprint arXiv:2402.01761.

Amber Stubbs, Michele Filannino, Ergin Soysal, Samuel Henry, and Özlem Uzuner. 2019. Cohort selection for clinical trials: n2c2 2018 shared task track 1. Journal of the American Medical Informatics Association: JAMIA, 26:1163.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3319–3328. PMLR.

Dietrich Trautmann. 2023. Large language model prompt chaining for long legal document classification. arXiv preprint arXiv:2308.04138.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022a. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.

Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, and Markus Leippold. 2021. ClimateBert: A pretrained language model for climate-related text. arXiv preprint arXiv:2110.12010.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022a. AI Chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI conference on human factors in computing systems, pages 1–22.

Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. 2022b. Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering. arXiv preprint arXiv:2212.10375.

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2024. GPT4Tools: Teaching large language model to use tools via self-instruction. Advances in Neural Information Processing Systems, 36.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. Advances in neural information processing systems, 35:30378–30392.

Bei Yu, Yingya Li, and Jun Wang. 2019. Detecting causal language use in science findings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4664–4674, Hong Kong, China. Association for Computational Linguistics.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.

Kaitlyn Zhou, Jena D Hwang, Xiang Ren, and Maarten Sap. 2024. Relying on the unreliable: The impact of language models’ reluctance to express uncertainty. arXiv preprint arXiv:2401.06730.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.

Elena Zotova, Rodrigo Agerri, Manuel Nuñez, and German Rigau. 2020. Multilingual stance detection in tweets: The Catalonia Independence Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1368–1375, Marseille, France. European Language Resources Association.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.