Authors:

(1) Ahatsham Hayat, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]);

(2) Mohammad Rashedul Hasan, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]).

Abstract and 1 Introduction

2 Method

2.1 Problem Formulation and 2.2 Missingness Patterns

2.3 Generating Missing Values

2.4 Description of CLAIM

3 Experiments

3.1 Results

4 Related Work

5 Conclusion and Future Directions

6 Limitations and References

4 Related Work

The challenge of missing data in tabular datasets has led to the development of numerous imputation methods, broadly classified into three categories: statistical methods, machine learning techniques, and the more recently developed deep learning approaches.

Statistical Methods. Widely used statistical imputation methods include mean/median imputation, regression imputation, and the expectation-maximization (EM) algorithm [10]. These single imputation approaches are straightforward but fail to capture the uncertainty inherent in the imputation of missing values. To address this, Rubin [31] introduced a technique for multiple imputation, later refined [32,21], which generates multiple imputed datasets to reflect the uncertainty, thereby providing more reliable statistical inferences.
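The distinction between single and multiple imputation can be sketched in a few lines of numpy. This is an illustrative toy, not any method from the paper: mean imputation produces one completed dataset, while a (deliberately simplified) multiple-imputation loop produces several completions whose spread reflects imputation uncertainty, which are then pooled.

```python
import numpy as np

rng = np.random.default_rng(0)
X_true = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
mcar = rng.random(X_true.shape) < 0.2            # 20% missing, MCAR
X_obs = np.where(mcar, np.nan, X_true)

# Single (mean) imputation: one completed dataset, no uncertainty.
col_mean = np.nanmean(X_obs, axis=0)
X_single = np.where(np.isnan(X_obs), col_mean, X_obs)

# Multiple imputation (toy version): several completions whose spread
# reflects the uncertainty of the fill-in; estimates are pooled across
# the completed datasets rather than taken from a single one.
col_std = np.nanstd(X_obs, axis=0)
completions = [
    np.where(np.isnan(X_obs),
             col_mean + rng.normal(0.0, col_std, size=X_obs.shape),
             X_obs)
    for _ in range(5)
]
pooled_mean = np.mean([c.mean(axis=0) for c in completions], axis=0)
```

Proper multiple imputation as introduced by Rubin draws from a posterior predictive distribution; the Gaussian perturbation above only stands in for that draw step.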

Among multiple imputation techniques, Multivariate Imputation by Chained Equations (MICE) [7] stands out. MICE iteratively imputes missing values by fitting a separate conditional model for each incomplete variable given the others, making it particularly effective for handling Missing Completely At Random (MCAR) and Missing At Random (MAR) data.
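The chained-equations idea can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the MICE implementation from [7]: it initializes missing cells with column means, then cycles through the columns, regressing each incomplete column on the others and overwriting only its missing entries (the stochastic predictive-draw step of real MICE is omitted for brevity).

```python
import numpy as np

def chained_equations_impute(X_obs, n_iter=10):
    """Toy MICE-style loop: mean-initialize missing cells, then cycle
    through columns, regressing each incomplete column on all others
    (ordinary least squares with an intercept) and overwriting only
    its missing entries with the regression predictions."""
    miss = np.isnan(X_obs)
    X = np.where(miss, np.nanmean(X_obs, axis=0), X_obs)
    n, d = X.shape
    for _ in range(n_iter):
        for j in range(d):
            if not miss[:, j].any():
                continue
            # Design matrix: all other columns plus an intercept term.
            A = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
            obs = ~miss[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef
    return X
```

On data with strong linear structure between columns, this chained loop recovers missing values far more accurately than a column-mean fill, which is the intuition behind MICE's effectiveness under MCAR and MAR.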

Machine Learning Methods. In the realm of machine learning, several single imputation methods have been explored, including k-nearest neighbors (k-NN) [3], traditional neural networks [18,36], and MissForest [37]. k-NN is a discriminative approach that imputes missing values based on the similarity between instances, typically measured by Euclidean distance, and offers flexibility in handling both continuous and categorical data. MissForest, leveraging the power of random forests, excels on datasets with complex interactions and non-linear relationships, often outperforming other methods in terms of accuracy and robustness. Both k-NN and MissForest have been shown to be highly effective compared with other, more sophisticated imputation methods [13,20].
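A minimal k-NN imputer can be written directly from the description above. This sketch (an illustration, not the implementation from [3]) measures Euclidean distance over the features an incomplete row actually has, finds the k nearest fully observed rows, and averages their values for the missing entries:

```python
import numpy as np

def knn_impute(X_obs, k=3):
    """Fill each incomplete row using the k fully observed rows that are
    nearest in the commonly observed features (Euclidean distance),
    averaging the neighbours' values for the missing entries."""
    X = X_obs.copy()
    miss = np.isnan(X_obs)
    donors = X_obs[~miss.any(axis=1)]        # donor pool: complete rows
    for i in np.where(miss.any(axis=1))[0]:
        obs_cols = ~miss[i]
        # Distance computed only over the features row i has observed.
        d = np.linalg.norm(donors[:, obs_cols] - X_obs[i, obs_cols], axis=1)
        nearest = donors[np.argsort(d)[:k]]
        X[i, miss[i]] = nearest[:, miss[i]].mean(axis=0)
    return X
```

Production implementations additionally handle rows with no complete donors, distance weighting, and categorical features (e.g. via mode rather than mean), which are omitted here.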

Deep Learning Methods. Recent advances in deep learning have inspired novel imputation methods like Denoising Autoencoders (DAE) [41] and Generative Adversarial Nets (GAN) [17]. These approaches, however, often assume complete data during training or struggle with datasets containing mixed variable types. The Generative Adversarial Imputation Nets (GAIN) [44] represent a significant advancement, specifically designed for imputing missing data without the need for complete datasets. Despite their innovative approach, methods like GAIN often rank behind more traditional machine learning methods such as k-NN in terms of performance [20].
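GAIN's key data-handling devices, as described in [44], are a mask matrix marking observed entries, a generator input in which missing slots are seeded with noise, and a hint matrix that reveals part of the true mask to the discriminator. The numpy sketch below illustrates only this data preparation; the generator and discriminator networks and the adversarial training loop are omitted, and the 0.9 hint rate is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
M = (rng.random(X.shape) > 0.3).astype(float)   # mask: 1 observed, 0 missing
X_obs = np.where(M == 1.0, X, 0.0)              # missing entries zeroed out

# Generator input: observed values kept, missing slots seeded with noise,
# so the generator can be trained without any complete dataset.
Z = rng.uniform(0.0, 0.01, size=X.shape)
G_input = M * X_obs + (1.0 - M) * Z

# Hint matrix: reveal a random subset of the true mask to the
# discriminator and leave the remaining entries ambiguous at 0.5.
B = (rng.random(X.shape) < 0.9).astype(float)
H = B * M + 0.5 * (1.0 - B)
```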

Other noteworthy developments include Multiple Imputation with Denoising Autoencoders (MIDA) [16], which builds on the denoising autoencoder [41,23], and the application of Variational Auto-Encoders (VAEs) to data imputation [8,24,27]. The Deep Ladder Imputation Network (DLIN) [19] combines denoising autoencoders with a ladder network architecture, showing promise in handling high missing ratios and spatial/temporal data. Similarly, the Heterogeneous-Incomplete VAE (HI-VAE) [25] offers a tailored approach for imputing missing values in tabular data, demonstrating competitive performance against established methods. An approach specifically tailored to imputing tabular non-numerical data (text and categorical variables) is proposed in [5]. It leverages deep learning techniques to capture complex relationships between columns and impute missing values more accurately than traditional methods. Interestingly, the authors found that in many cases, simpler linear n-gram models performed on par with deep learning models while requiring fewer computational resources.

5 Conclusion and Future Directions

In this paper, we introduced CLAIM, a novel approach that leverages the contextual understanding capabilities of LLMs for data imputation. Through rigorous evaluation across diverse datasets and missingness patterns—including MCAR, MAR, and MNAR—CLAIM has demonstrated superior accuracy, outperforming conventional imputation methods. This consistency in overcoming the challenges posed by different types of missing data unequivocally affirms the effectiveness of CLAIM in a wide array of scenarios, marking a significant leap in the field of data imputation.

The robust performance of CLAIM across various missingness mechanisms not only showcases its broad applicability and reliability but also represents a departure from traditional imputation methods. These conventional approaches often exhibit limitations, excelling under specific conditions or with certain data types. In contrast, CLAIM’s methodology, which involves verbalizing data and employing contextually relevant descriptors for imputation, ensures its adeptness across a multitude of scenarios and data modalities. This adaptability underlines the importance of integrating contextualized natural language models into the data imputation process, offering a more nuanced and effective solution to the pervasive issue of missing data.

Moreover, our exploration into the use of contextually nuanced descriptors further underscores the potential of CLAIM. By engaging LLMs’ general knowledge and their sophisticated understanding of language and context, we have shown that carefully chosen descriptors significantly enhance the model’s ability to handle missing data. This not only boosts the precision of imputations but also leverages the LLM’s inherent strengths, demonstrating the critical role of context in improving data processing tasks.

Building upon the promising results demonstrated by CLAIM, future work will explore several avenues to further enhance its effectiveness and applicability in the field of data imputation. One key area of focus will be extending CLAIM to handle more complex data types, such as time-series data, images, and unstructured text, to evaluate its versatility and efficiency across diverse data formats. Additionally, there is potential to refine the model's performance by incorporating feedback mechanisms that allow CLAIM to learn from its own imputations, thereby improving accuracy over time through reinforcement learning techniques.

Another promising direction involves exploring the integration of CLAIM with domain-specific LLMs. By tailoring the contextual understanding capabilities of LLMs to specific fields, such as healthcare, finance, or environmental science, the imputation process could be significantly enhanced, leading to more accurate and meaningful data imputations within these specialized contexts.

6 Limitations

Despite the notable advancements presented by CLAIM in addressing missing data within tabular datasets, this work has several limitations. First, the efficacy of CLAIM is inherently dependent on the quality and breadth of the training data used to develop the underlying LLMs. In scenarios where the LLMs have not been exposed to data similar to the specific context or domain of the missing information, their ability to generate accurate and relevant imputations may be compromised. Additionally, the approach assumes that the descriptive context provided for missing values is sufficiently informative for the LLM to understand and act upon, which may not always be the case. Furthermore, the computational requirements for processing large datasets with CLAIM, given the need for interaction with sophisticated LLMs, could pose scalability challenges. Lastly, while CLAIM shows promise in handling various missingness mechanisms, its performance in highly specialized or niche domains, where expert knowledge significantly influences data interpretation, has yet to be fully explored.

References

  1. Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling, Synthesis Lectures on Data Management, vol. 10. Morgan & Claypool (2018). https://doi.org/10.2200/s00878ed1v01y201810dtm052

  2. Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  3. Batista, G.E., Monard, M.C.: A study of k-nearest neighbour as an imputation method. In: Frontiers in Artificial Intelligence and Applications. vol. 87, pp. 251–260. HIS (2002)

  4. Bhatia, K., Narayan, A., De Sa, C., Ré, C.: TART: A plug-and-play Transformer module for task-agnostic reasoning (Jun 2023). https://doi.org/10.48550/arXiv.2306.07536, http://arxiv.org/abs/2306.07536, arXiv:2306.07536 [cs]

  5. Biessmann, F., Salinas, D., Schelter, S., Schmidt, P., Lange, D.: "deep" learning for missing value imputation in tables with non-numerical data. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. p. 2017–2025. CIKM ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3269206.3272005, https://doi.org/10.1145/3269206.3272005

  6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners. In: Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020)

  7. Buuren, S.v., Groothuis-Oudshoorn, K.: mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45, 1–67 (2011)

  8. Camino, R.D., Hammerschmidt, C.A., State, R.: Improving missing data imputation with deep generative models. arXiv preprint arXiv:1902.10666 pp. 1–8 (2019)

  9. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A.M., Pillai, T.S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., Fiedel, N.: PaLM: Scaling Language Modeling with Pathways (Oct 2022), http://arxiv.org/abs/2204.02311, arXiv:2204.02311 [cs]

  10. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22 (1977)

  11. Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: Qlora: Efficient finetuning of quantized llms (2023)

  12. Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml

  13. Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., Tabona, O.: A survey on missing data in machine learning. J Big Data 8(1), 140 (2021). https://doi.org/10.1186/s40537-021-00516-9, epub 2021 Oct 27. PMID: 34722113; PMCID: PMC8549433

  14. García-Laencina, P.J., Sancho-Gómez, J., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010). https://doi.org/10.1007/s00521-009-0295-6

  15. Gimpy, M.: Missing value imputation in multi attribute data set. Int. J. Comput. Sci. Inf. Technol. 5(4), 1–7 (2014)

  16. Gondara, L., Wang, K.: Mida: Multiple imputation using denoising autoencoders. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 260–272. Springer (2018)

  17. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems. vol. 27, pp. 2672–2680. Curran Associates, Inc., Montréal, Canada (2014)

  18. Gupta, A., Lam, M.S.: Estimating missing values using neural networks. Journal of the Operational Research Society 47(2), 229–238 (1996)

  19. Hallaji, E., Razavi-Far, R., Saif, M.: Dlin: Deep ladder imputation network. IEEE Transactions on Cybernetics 52(9), 8629–8641 (2021)

  20. Jäger, S., Allhorn, A., Biessmann, F.: A benchmark for data imputation methods. Front Big Data 4, 693674 (2021). https://doi.org/10.3389/fdata.2021.693674, PMID: 34308343; PMCID: PMC8297389

  21. Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 793. John Wiley & Sons, 3 edn. (2019)

  22. Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. John Wiley & Sons, Hoboken, 2 edn. (2002)

  23. Lu, H.m., Perrone, G., Unpingco, J.: Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. arXiv preprint arXiv:2002.08338 (2020)

  24. McCoy, J.T., Kroon, S., Auret, L.: Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine 51(21), 141–146 (2018), 5th IFAC Workshop on Mining, Mineral and Metal Processing MMM 2018

  25. Nazabal, A., Olmos, P.M., Ghahramani, Z., Valera, I.: Handling incomplete heterogeneous data using vaes. arXiv preprint arXiv:1807.03653 (2018)

  26. OpenAI: GPT-4 Technical Report (Mar 2023). https://doi.org/10.48550/arXiv.2303.08774, http://arxiv.org/abs/2303.08774, arXiv:2303.08774 [cs]

  27. Qiu, Y.L., Zheng, H., Gevaert, O.: Genomic data imputation with variational auto-encoders. GigaScience 9(8), giaa082 (2020)

  28. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 140:5485–140:5551 (Jan 2020)

  29. Roberts, A., Raffel, C., Shazeer, N.: How Much Knowledge Can You Pack Into the Parameters of a Language Model? In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 5418–5426. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.437, https://aclanthology.org/2020.emnlp-main.437

  30. Rubin, D.B.: Inference and missing data. Biometrika 63, 581–592 (1976). https://doi.org/10.1093/biomet/63.3.581

  31. Rubin, D.B.: Multiple imputations in sample surveys-a phenomenological bayesian approach to nonresponse. In: Proceedings of the survey research methods section of the American Statistical Association. vol. 1, pp. 20–34. American Statistical Association, Alexandria, VA, USA (1978)

  32. Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York, NY (2004)

  33. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London, UK (1997)

  34. Schelter, S., Biessmann, F., Januschowski, T., Salinas, D., Seufert, S., Szarvas, G.: On challenges in machine learning model management. IEEE Data Eng. Bull. 41(4), 5–15 (2018), http://sites.computer.org/debull/A18dec/p5.pdf

  35. Schelter, S., Rukat, T., Biessmann, F.: JENGA - A framework to study the impact of data errors on the predictions of machine learning models. In: Velegrakis, Y., Zeinalipour-Yazti, D., Chrysanthis, P.K., Guerra, F. (eds.) Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021. pp. 529–534. OpenProceedings.org (2021). https://doi.org/10.5441/002/EDBT.2021.63, https://doi.org/10.5441/002/edbt.2021.63

  36. Sharpe, P.K., Solly, R.: Dealing with missing values in neural network-based diagnostic systems. Neural Computing & Applications 3(2), 73–77 (1995)

  37. Stekhoven, D.J., Bühlmann, P.: Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)

  38. Stoyanovich, J., Howe, B., Jagadish, H.V.: Responsible data management. Proceedings of the VLDB Endowment 13, 3474–3488 (2020). https://doi.org/10.14778/3415478.3415570

  39. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and Efficient Foundation Language Models (Feb 2023), http://arxiv.org/abs/2302.13971, arXiv:2302.13971 [cs]

  40. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models (2023)

  41. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning. pp. 1096–1103 (2008)

  42. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Jan 2023). https://doi.org/10.48550/arXiv.2201.11903, http://arxiv.org/abs/2201.11903, arXiv:2201.11903 [cs]

  43. Yang, K., Huang, B., Stoyanovich, J., Schelter, S.: Fairness-aware instrumentation of preprocessing pipelines for machine learning. In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA'20). ACM (2020). https://doi.org/10.1145/3398730.3399194

  44. Yoon, J., Jordon, J., van der Schaar, M.: Gain: Missing data imputation using generative adversarial nets. In: International Conference on Machine Learning. pp. 5689–5698. PMLR (2018)

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.