Authors:
(1) Goran Muric, InferLink Corporation, Los Angeles, California ([email protected]);
(2) Ben Delay, InferLink Corporation, Los Angeles, California ([email protected]);
(3) Steven Minton, InferLink Corporation, Los Angeles, California ([email protected]).
3 Method
Training the ICE-T system consists of the following steps:
- Generating questions: the process begins by generating a series of questions designed to prompt the Large Language Model (LLM);
- Prompting the LLM: the previously generated questions are used to prompt the LLM and collect the yes/no answers;
- Verbalizing the answers: for each instance within the training dataset, responses to prompts are collected and converted into numerical form, thus creating a low-dimensional feature vector for each instance;
- Training a classifier: the previously obtained vectors, together with their respective labels, are then used to train a classifier.
The Inference stage mirrors the training process: the LLM is presented with the same collection of questions, and the responses obtained are numerically encoded in the same manner before being processed by the classifier that was trained during the Training stage. The training and inference process is illustrated in Figure 1, and each step is explained below.
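As a rough illustration, the sketch below follows these four steps under a few assumptions: an ask_yes_no(question, document) helper stands in for the LLM call, answers are encoded as 1 for "yes" and 0 for "no" (the verbalization scheme actually used is described in Section 3.3), and a logistic regression model stands in for the classifier discussed in Section 3.4.

```python
# Minimal ICE-T-style sketch. The LLM helper, the yes/no encoding, and the
# classifier choice are illustrative assumptions, not the paper's exact
# implementation (see Sections 3.3, 3.4 and 5 for those details).
from typing import Callable, List
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumed classifier choice


def verbalize(answers: List[str]) -> np.ndarray:
    """Map yes/no answers to a numerical feature vector (yes -> 1, no -> 0)."""
    return np.array([1.0 if a.strip().lower().startswith("yes") else 0.0
                     for a in answers])


def featurize(document: str, questions: List[str],
              ask_yes_no: Callable[[str, str], str]) -> np.ndarray:
    """Prompt the LLM with every question in Q for one document."""
    return verbalize([ask_yes_no(q, document) for q in questions])


def train(documents: List[str], labels: List[int], questions: List[str],
          ask_yes_no: Callable[[str, str], str]) -> LogisticRegression:
    """Build one low-dimensional feature vector per training instance and fit a classifier."""
    X = np.stack([featurize(d, questions, ask_yes_no) for d in documents])
    return LogisticRegression().fit(X, labels)


def predict(document: str, questions: List[str],
            ask_yes_no: Callable[[str, str], str],
            clf: LogisticRegression) -> int:
    """Inference mirrors training: same questions, same encoding, then classify."""
    x = featurize(document, questions, ask_yes_no).reshape(1, -1)
    return int(clf.predict(x)[0])
```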
3.1 Generating questions
To train and use the system, we need to create multiple questions that more closely reflect the core principles behind the initial yes/no question. These questions should be crafted to uncover additional details about the problem.
Consider a use case where an expert is building a classifier to determine eligibility for medical trials based on patient data. In such a scenario, the classifier needs to assess various clinical inclusion criteria, which are typically derived from patient medical records. One of these criteria could be the patient’s language proficiency, for instance, whether they speak English. A naive formulation would be to present the question directly to the LLM in a prompt like the following:
Does this patient speak English according to their medical records?
MEDICAL RECORDS: __RECORDS__
where the __RECORDS__ placeholder is replaced with the text of the patient’s medical records. Records rarely state language proficiency explicitly, so the LLM may be unable to answer such a question directly.
However, a series of “secondary” questions such as:
Is there any documentation of the patient requiring an interpreter for English during medical visits?
Do the medical records contain notes written in English that indicate communication with the patient?
Are there any written consents or forms completed in English by the patient?
Are there any notations from providers about the patient's ability to understand and speak English?
may allow the model to answer directly based on the information already contained in the documents presented to it, while also serving as strong indicators for the primary question. Secondary questions are also yes/no questions.
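For concreteness, a prompt for any question in Q, whether primary or secondary, might be assembled as below. The __RECORDS__ substitution mirrors the example above, while the single-word yes/no instruction line is an assumption added to constrain the model's output; the exact prompt wording used in the experiments is given in Section 5.

```python
# Hypothetical prompt template; only the __RECORDS__ placeholder comes from the
# example above, the yes/no instruction line is an illustrative assumption.
PROMPT_TEMPLATE = (
    "{question}\n"
    "Answer with a single word: yes or no.\n\n"
    "MEDICAL RECORDS: __RECORDS__"
)


def build_prompt(question: str, records: str) -> str:
    """Fill the template with one yes/no question and the patient's records."""
    return PROMPT_TEMPLATE.format(question=question).replace("__RECORDS__", records)
```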
Creating the secondary questions can be done in multiple ways, such as writing them manually using expert knowledge or using an LLM to automatically generate a fixed-size set of questions that might be useful in answering the original question. Starting from the primary question q0, we generate n additional questions, creating a set of all questions Q = {q0, q1, . . . , qn}, where |Q| = n + 1. This process is shown in Figure 1 with a red box, illustrating the creation of the questions and their use during training and inference. The same set Q of questions is used for both training and inference.
The number n of secondary questions is chosen based on factors such as the number of training samples, the availability of expert knowledge, and the level of interpretability needed for a specific task. Our prior small-scale experiments have shown that secondary questions crafted by experts generally lead to better performance than those generated by LLMs. However, in the experiments reported here, we chose a straightforward and reproducible approach in which we exclusively use secondary questions created by an LLM. This choice was made to minimize human bias and to showcase the method’s effectiveness in scenarios where expert input is unavailable. The exact prompts used for creating secondary questions in our experiments are described in Section 5.
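One way to automate this step is sketched below. The meta-prompt wording and the complete(prompt) helper are assumptions introduced for illustration; they are not the prompts used in the experiments, which are described in Section 5.

```python
# Hypothetical LLM-based generation of secondary questions; the meta-prompt
# text and the `complete` helper are illustrative assumptions.
from typing import Callable, List

META_PROMPT = (
    "You are helping to build a text classifier.\n"
    "Primary yes/no question: {primary}\n"
    "Write {n} additional yes/no questions whose answers would help decide "
    "the primary question from the same document. Return one question per line."
)


def generate_question_set(primary: str, n: int,
                          complete: Callable[[str], str]) -> List[str]:
    """Build the set Q = {q0, q1, ..., qn} from the primary question q0."""
    raw = complete(META_PROMPT.format(primary=primary, n=n))
    secondary = [line.strip() for line in raw.splitlines() if line.strip()][:n]
    return [primary] + secondary
```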