Authors:
(1) Goran Muric, InferLink Corporation, Los Angeles, California ([email protected]);
(2) Ben Delay, InferLink Corporation, Los Angeles, California ([email protected]);
(3) Steven Minton, InferLink Corporation, Los Angeles, California ([email protected]).
4 Data
This work utilizes data compiled from a range of sources, covering a variety of domains and document lengths. The data used in the experiments described here spans the fields of medicine, law, climate science, and politics, and includes documents of varying sizes, from brief tweets to extensive legal documents and detailed medical records.
4.1 Clinical trials
This dataset comes from Track 1 of the 2018 National NLP Clinical Challenges (n2c2) shared tasks[3]. It is designed to help identify patients within a corpus of longitudinal medical records who either meet or do not meet predefined selection criteria used to determine a patient's eligibility for inclusion in clinical trials (Stubbs et al., 2019). The data consists of annotated American English clinical narratives for 288 patients, labeled according to whether each patient met a set of specific criteria. There are 13 criteria in total:

DRUG-ABUSE: Drug abuse, current or past;
ALCOHOL-ABUSE: Current alcohol use over weekly recommended limits;
ENGLISH: Patient must speak English;
MAKES-DECISIONS: Patient must make their own medical decisions;
ABDOMINAL: History of intra-abdominal surgery, small or large intestine resection, or small bowel obstruction;
MAJOR-DIABETES: Major diabetes-related complication;
ADVANCED-CAD: Advanced cardiovascular disease (CAD);
MI-6MOS: MI in the past 6 months;
KETO-1YR: Diagnosis of ketoacidosis in the past year;
DIETSUPP-2MOS: Taken a dietary supplement (excluding vitamin D) in the past 2 months;
ASP-FOR-MI: Use of aspirin to prevent MI;
HBA1C: Any hemoglobin A1c (HbA1c) value between 6.5% and 9.5%;
CREATININE: Serum creatinine > upper limit of normal.

For every medical record, each criterion takes one of two values, "met" or "not met," depending on whether the patient fulfills that criterion. The data is split 70/30 into training and test sets: the training set contains 202 medical records, while the test set contains 86. Note that for some criteria, the ratio between the positive and negative classes is highly imbalanced. In our analysis we excluded the KETO-1YR criterion, as it contains no positive samples in the test set and only one positive sample in the training set.[4]
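The per-record labeling described above can be sketched as follows. This is an illustrative example, not the n2c2 data format: the record structure and helper names are assumptions, and the sketch only shows mapping each criterion's "met"/"not met" annotation to a binary label while excluding KETO-1YR.

```python
# Hypothetical sketch of per-criterion binary label encoding for the
# n2c2 clinical trials data. Field names here are illustrative; the
# actual corpus ships as annotated XML records.

CRITERIA = [
    "DRUG-ABUSE", "ALCOHOL-ABUSE", "ENGLISH", "MAKES-DECISIONS",
    "ABDOMINAL", "MAJOR-DIABETES", "ADVANCED-CAD", "MI-6MOS",
    "KETO-1YR", "DIETSUPP-2MOS", "ASP-FOR-MI", "HBA1C", "CREATININE",
]
# Excluded because it has no positive test samples (see footnote [4]).
EXCLUDED = {"KETO-1YR"}

def encode_labels(record: dict) -> dict:
    """Map each kept criterion's 'met'/'not met' value to 1/0."""
    return {
        c: 1 if record[c] == "met" else 0
        for c in CRITERIA
        if c not in EXCLUDED
    }

# Toy record: every criterion "not met" except ENGLISH.
record = {c: "not met" for c in CRITERIA}
record["ENGLISH"] = "met"
labels = encode_labels(record)  # 12 binary labels, KETO-1YR dropped
```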
4.2 Catalonia Independence Corpus
This dataset is a corpus of Spanish Twitter messages annotated for automatic stance detection (Zotova et al., 2020). It encompasses data collected over a 12-day span in February and March 2019 from tweets originating in Barcelona. Each tweet was originally categorized into one of three classes, AGAINST, FAVOR, and NEUTRAL, representing the user's stance towards Catalonia's independence. For the purpose of binary classification, and to facilitate more effective comparisons with the other datasets, we omitted the NEUTRAL class, focusing exclusively on the AGAINST and FAVOR categories.
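The binary reduction described above amounts to dropping NEUTRAL tweets and mapping the remaining stances to 0/1. A minimal sketch, with assumed field layout and label encoding (the corpus's actual file format may differ):

```python
# Illustrative reduction of the three-class stance labels to a binary
# task: NEUTRAL tweets are dropped, AGAINST -> 0, FAVOR -> 1.

def to_binary(tweets):
    """Keep only AGAINST/FAVOR tweets as (text, label) pairs."""
    label_map = {"AGAINST": 0, "FAVOR": 1}
    return [
        (text, label_map[stance])
        for text, stance in tweets
        if stance in label_map  # NEUTRAL is filtered out here
    ]

# Toy input: (tweet text, annotated stance) pairs.
sample = [
    ("tweet a", "FAVOR"),
    ("tweet b", "NEUTRAL"),
    ("tweet c", "AGAINST"),
]
binary = to_binary(sample)  # the NEUTRAL tweet is removed
```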
4.3 Climate Detection Corpus
This dataset contains climate-related paragraphs extracted from financial disclosures by companies. The text has been collected from corporate annual reports and sustainability reports. The paragraphs from those reports are hand-selected and then annotated as yes (climate-related) or no (not climate-related) (Webersinke et al., 2021).
[3] https://n2c2.dbmi.hms.harvard.edu/
[4] Our methodology employs classifiers that are trained on the data distributions. As a result, we consistently achieve peak classification metrics for this criterion, which does not reflect realistic performance, since the minority class is absent from the test set.