Authors:
(1) Goran Muric, InferLink Corporation, Los Angeles, California ([email protected]);
(2) Ben Delay, InferLink Corporation, Los Angeles, California ([email protected]);
(3) Steven Minton, InferLink Corporation, Los Angeles, California ([email protected]).
2 Related Work
Our proposed solution addresses three core aspects of using large language models for inference: prompting, in-context learning, and interpretability. It builds on the ever-growing body of knowledge from these domains.
2.1 Prompting techniques
Numerous techniques have been developed to improve on the basic zero-shot approach. Among these, “chain-of-thought” (CoT) prompting is particularly notable. This method prompts the model to articulate its reasoning process step by step before reaching a conclusion. Research has shown that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks (Wei et al., 2022b,c; Wang et al., 2022a). Even simple tweaks, such as adding “Let’s think step by step” before each answer, can significantly outperform zero-shot LLM performance on diverse benchmark reasoning tasks (Kojima et al., 2022; Nye et al., 2021). However, the reasoning chains generated when models break their reasoning into steps can introduce errors at inference time. To reduce these errors, some researchers employ automatic chain-of-thought prompting. This technique, which generates demonstrations automatically, has proven more effective than earlier, simpler CoT approaches (Zhang et al., 2022b). Lastly, “iterative refinement” involves repeatedly prompting the model with slightly altered versions of the original text or question, homing in on a more accurate or nuanced answer through successive iterations. Each of these strategies can be tailored to the specific needs of a task, leveraging the model’s capabilities in different ways to achieve optimal performance.
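For concreteness, the sketch below contrasts a plain zero-shot prompt with the zero-shot CoT variant described above. The `call_llm` helper is a hypothetical stand-in for any text-completion client, not an interface from the cited work.

```python
# A minimal sketch of zero-shot vs. zero-shot chain-of-thought prompting
# (Kojima et al., 2022). `call_llm` is a hypothetical stand-in for any
# text-completion client and is not part of the cited work.
from typing import Callable

def zero_shot(question: str, call_llm: Callable[[str], str]) -> str:
    # Plain zero-shot: ask for the answer directly.
    return call_llm(f"Q: {question}\nA:")

def zero_shot_cot(question: str, call_llm: Callable[[str], str]) -> str:
    # Zero-shot CoT: the trigger phrase elicits step-by-step reasoning
    # before the final answer.
    return call_llm(f"Q: {question}\nA: Let's think step by step.")
```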
Several approaches involve using multiple prompts in a chain, where the output of one step becomes the input for the next, thus aggregating the gains per step (Wu et al., 2022a), or decomposing complex tasks into smaller, manageable components (Trautmann, 2023). Additionally, “self-instruct” prompting (Wang et al., 2022b; Yang et al., 2024) can be used, where the model generates its own instructions or clarifications based on the initial prompt, attempting to refine or better understand the task before generating a response. Another set of approaches uses multiple models, or multiple instances of the same model, to improve performance. Additionally trained models, called “verifiers,” are used to judge the correctness of model completions; at inference time, the verifiers select the most likely answer (Cobbe et al., 2021).
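The following sketch illustrates two of these ideas under stated assumptions: chaining prompts so that each step’s output feeds the next, and using a separately trained “verifier” to pick the most plausible of several sampled completions. The `call_llm` and `verifier_score` helpers are hypothetical stand-ins, not APIs from the cited work.

```python
# Illustrative sketch: prompt chaining and verifier-based answer selection.
from typing import Callable, List

def chain_prompts(text: str, templates: List[str],
                  call_llm: Callable[[str], str]) -> str:
    """Run a sequence of prompt templates, feeding each output into the next step."""
    current = text
    for template in templates:
        current = call_llm(template.format(input=current))
    return current

def select_with_verifier(question: str, candidates: List[str],
                         verifier_score: Callable[[str, str], float]) -> str:
    """Return the candidate completion that the verifier scores highest."""
    return max(candidates, key=lambda answer: verifier_score(question, answer))
```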
2.2 In-context learning
Large Language Models possess a remarkable ability for in-context learning (ICL), in which they acquire knowledge from a few contextual examples provided during inference or training. Numerous studies have shown that through ICL, LLMs can effectively handle a diverse set of complex tasks (Wei et al., 2022a). ICL offers several advantages, notably the ease with which human knowledge can be incorporated into LLMs through demonstrations and templates (Liu et al., 2021; Wu et al., 2022b). Furthermore, unlike traditional supervised training methods, ICL operates without the need for additional training, significantly lowering the computational cost of applying models to new tasks (Dong et al., 2022).
One of the most recognizable techniques for in-context learning is “few-shot learning” (Schick and Schütze, 2022, 2020; Gu et al., 2021; Perez et al., 2021) during inference[2]. In this approach, the model is provided with a few examples of text and their corresponding labels or desired outputs within the prompt itself. These demonstrations give the model context for the decision-making process, improving its accuracy on similar tasks.
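As a minimal sketch of this idea, the snippet below assembles a few-shot prompt by placing labeled demonstrations ahead of the new input. The instruction, field names, and label set are illustrative assumptions, not taken from the cited papers.

```python
# Minimal sketch of a few-shot classification prompt.
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], new_text: str) -> str:
    """Format (text, label) demonstrations followed by the unlabeled input."""
    lines = ["Classify the following text. Answer Yes or No.", ""]
    for text, label in examples:
        lines.append(f"Text: {text}\nLabel: {label}\n")
    lines.append(f"Text: {new_text}\nLabel:")
    return "\n".join(lines)

# Example usage with made-up demonstrations.
print(build_few_shot_prompt(
    [("The study reports a sample size of 120 adults.", "Yes"),
     ("The abstract does not mention enrollment numbers.", "No")],
    "Participants included 45 children.",
))
```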
Multiple other studies have contributed to refining ICL methods, focusing on the automation, ordering, and selection of prompts. Zhou et al. (2022) introduced the Automatic Prompt Engineer (APE), which automates the generation of instructional prompts, significantly reducing manual effort and improving scalability. Lu et al. (2021) proposed a method to optimize the ordering of prompts, employing entropy statistics to evaluate and identify the most effective prompt sequences. Rubin et al. (2021) and Liu et al. (2021) both contributed to this area, but from different perspectives: Rubin et al. (2021) developed a method for efficiently retrieving prompts using annotated data, streamlining the selection process, while Liu et al. (2021) explored strategic selection methods that go beyond random sampling to leverage the few-shot capabilities of LLMs, aiming to enhance model performance through example selection. Adding to the discussion on selection strategies, Zhang et al. (2022a) approached example selection as a sequential decision problem and proposed a reinforcement learning algorithm to discover policies that improve the generalizability of language models. This perspective introduces a dynamic element to the selection process, aligning with the strategies discussed by Rubin and Liu but through an adaptive, policy-driven approach.