4 Discussions

4.1 Can an AI possibly pass these tests?

Making scientific discoveries is different from training LLMs, because it would not be enough to simply feed the model a very large corpus of human-written text. Instead, we will require the AI to explore on its own and learn from that exploration, just as a human scientist would. However, we will probably still need large language models to accomplish such tasks, so a key question is what information can be used to train a model. The answer is exploration, probably similar to how a reinforcement learning agent learns to play StarCraft [24]. An AI scientist must be able to explore, either using an interactive tool or a very large dataset, to gain knowledge about how to accomplish a particular goal. Let us take the fifth test, the initial value problem, as an example. Given a large variety of math functions and the solutions to their initial value problems (i.e., the curves of their integrals), an AI agent should start by randomly exploring the tools at hand, such as SymPy and NumPy, to get closer to the reference answer. For example, the agent should soon find the rule y1 = y0 + f(x0) · ∆x, which can be its first answer. It should then keep exploring, and may find that y1 = y0 + ((f(x0) + f(x1)) / 2) · ∆x is a better solution. After many rounds of exploration, it should gradually transition from random exploration to more informed exploration, either through online learning or reinforcement learning. This process ends when it finds a solution that is at least as good as the fourth-order Runge-Kutta method [13]; the sketch below illustrates this progression. Learning from exploration is just one possible route to passing such tests. Another key method is to use Occam’s razor, which prefers simpler explanations; to be more exact, it prefers explanations that posit fewer entities, or fewer kinds of entities, all other things being equal. That said, we do hope that an AI agent can develop its own methods for solving these tests.
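
To make this exploration loop concrete, here is a minimal Python sketch of how the three candidate update rules above could be expressed as functions and scored against a problem with a known answer. The helper names (euler_step, trapezoid_step, rk4_step, solve) and the test problem f(x) = cos(x) are illustrative assumptions, not part of the proposed benchmark; a real agent would have to discover such rules itself rather than select from a hard-coded list.

```python
import numpy as np

def euler_step(f, x, y, dx):
    # The first rule the agent might find: y1 = y0 + f(x0) * dx
    return y + f(x) * dx

def trapezoid_step(f, x, y, dx):
    # A better rule found by further exploration:
    # y1 = y0 + ((f(x0) + f(x1)) / 2) * dx
    return y + 0.5 * (f(x) + f(x + dx)) * dx

def rk4_step(f, x, y, dx):
    # The target quality bar: classical fourth-order Runge-Kutta.
    # For y' = f(x) the y-dependence drops out, so k2 == k3 here.
    k1 = f(x)
    k2 = f(x + dx / 2)
    k3 = f(x + dx / 2)
    k4 = f(x + dx)
    return y + dx * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def solve(step, f, x0, y0, x_end, n):
    # Integrate the IVP y' = f(x), y(x0) = y0 with n equal steps.
    xs = np.linspace(x0, x_end, n + 1)
    dx = xs[1] - xs[0]
    y = y0
    for x in xs[:-1]:
        y = step(f, x, y, dx)
    return y

# Score each candidate rule against a known ground truth:
# f(x) = cos(x), y(0) = 0, whose exact answer at x = 1 is sin(1).
f, exact = np.cos, np.sin(1.0)
for name, rule in [("euler", euler_step),
                   ("trapezoid", trapezoid_step),
                   ("rk4", rk4_step)]:
    err = abs(solve(rule, f, 0.0, 0.0, 1.0, 100) - exact)
    print(f"{name:10s} error = {err:.2e}")
```

An exploring agent would treat the error measured here as its reward signal, keeping candidate rules that reduce it and discarding the rest, which is exactly the transition from random to informed exploration described above.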

4.2 Why do we need these tests?

The ultimate goal for an AI scientist is to make novel and impactful scientific discoveries that no one has made before. Why, then, do we need these “Turing tests” built on discoveries made decades or centuries ago? There are two main reasons. The first is that we need a benchmark, just as we need ImageNet [25] for studies in computer vision. Suppose a great AI scientist has been built and it makes a discovery that has not been made before. Different people will probably assess the importance of that discovery differently, and it is hard to measure the level of human involvement in the research process. With a well-defined benchmark, including both the targets and the scope of data and tools that may be used, it is much easier to measure the capability of an AI scientist. The second reason is that the ultimate goal of making important novel discoveries is very challenging, as it requires the AI agent to surpass the best human experts in the world. It is analogous to building an AI agent that can beat the best Go player in the world, whereas passing some of our tests is like beating a top Go player from a thousand years ago, when the game was in its early age, or beating an amateur player today. If we could build an AI agent that passes the majority of the seven tests above, we could conclude that we are on the right track toward building an AI scientist, and that it should evolve into a system that can make important scientific discoveries in the foreseeable future.

5 Conclusions and Future Work

Recent advancements have enabled LLMs to solve complex problems, highlighting their potential as tools in daily scientific research. However, the ability to solve predefined problems is fundamentally different from the ability to pioneer scientific discoveries. This distinction prompts the need for a “qualification test for an AI scientist” to determine whether an AI can independently conduct scientific research without human assistance. The proposed framework for such a test is analogous to the Turing Test, which assesses whether machines can exhibit human-like intelligence. Unlike LLM training, which draws on extensive datasets, scientific innovation often stems from exploring uncharted territories. We propose a series of “Turing tests for an AI scientist” based on key historical scientific breakthroughs, such as the heliocentric model and Maxwell’s equations, which were derived from empirical data and critical reasoning about the natural world. Seven such tests are outlined, ranging from astronomy to information theory, each designed to evaluate the AI’s ability to derive fundamental scientific principles from raw data. These tests require the AI to engage with interactive environments or large datasets without prior exposure to human-derived solutions in these fields. This approach not only aims to gauge an AI’s ability to generate scientific insights but also seeks to set a benchmark for AI capabilities in scientific thinking and discovery. The ultimate goal is to develop an AI that not only replicates but also innovates, paving the way for AIs that contribute uniquely to scientific progress.

Conflict of Interest Statement

The authors did not receive support from any organization for the submitted work. The authors have no relevant financial or non-financial interests to disclose.

References

[1] OpenAI: GPT-4 technical report. (2023) arXiv:2303.08774

[2] Microsoft Copilot. https://copilot.microsoft.com/ (2023)

[3] Rozière, B., et al.: Code Llama: Open foundation models for code. (2023) arXiv:2308.12950

[4] Huang, Y., et al.: Competition-level problems are effective LLM evaluators. (2023) arXiv:2312.02143

[5] Azerbayev, Z., et al.: Llemma: An open language model for mathematics. (2023) arXiv:2310.10631

[6] Turing, A.: Computing machinery and intelligence. Mind 59(236), 433–460 (1950)

[7] Astropy Collaboration, et al.: The Astropy Project: sustaining and growing a community-oriented open-source project and the latest major release (v5.0) of the core package. (2022) arXiv:2206.14220

[8] Meurer, A., et al.: SymPy: symbolic computing in Python. PeerJ Computer Science 3, e103 (2017)

[9] Harris, C.R., Millman, K.J., van der Walt, S.J., et al.: NumPy — a fundamental package for scientific computing with Python (2020). https://numpy.org

[10] O’Hanlon, M.: Minecraft: Pi Edition API Python Library. https://github.com/martinohanlon/mcpi

[11] Kurrer, K.E., Ramm, E.: The History of the Theory of Structures: From Arch Analysis to Computational Mechanics (2012)

[12] Laporte, F.: Python 3D FDTD Simulator. https://github.com/flaport/fdtd

[13] Lambert, J.D.: Numerical Methods for Ordinary Differential Systems: The Initial Value Problem (1991)

[14] Huffman, D.A.: A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40(9), 1098–1101 (1952)

[15] Waltz, D., Buchanan, B.G.: Automating science. Science 324(5923), 43–44 (2009)

[16] King, R.D., et al.: The robot scientist Adam. Computer 42(8), 46–54 (2009)

[17] Naik, A.W., et al.: Active machine learning-driven experimentation to determine compound effects on protein patterns. eLife 5, e10047 (2016)

[18] Trinh, T.H., et al.: Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024)

[19] Cranmer, M.: PySR: High-Performance Symbolic Regression in Python and Julia. https://github.com/MilesCranmer/PySR

[20] Madar, R.: Simulating Vibrating Strings with Python. https://github.com/rmadar/vibrating-string

[21] Filipovich, M., Hughes, S.: PyCharge: an open-source Python package for self-consistent electrodynamics simulations of Lorentz oscillators and moving point charges. Comput. Phys. Commun. 274, 108291 (2022)

[22] Filipovich, M., Hughes, S.: PyCharge. https://pycharge.readthedocs.io/

[23] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423 (1948)

[24] Vinyals, O., Babuschkin, I., Czarnecki, W.M., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019)

[25] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
