Table of Links

- Related Works
  - 2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics
- Methodology
- Evaluation
3.4 Evaluation procedure
As stated in research question RQ2, the code LLMs are evaluated against multiple metrics. The evaluation relies on unit tests: each code generation task includes several unit tests against which the model-generated function can be automatically checked. Listing 2 demonstrates the unit tests for the HumanEval_8_sum_product task from the HumanEval benchmark. Both the HumanEval and MBPP benchmarks within MultiPL-E use the LuaUnit[8] third-party library for unit tests. In contrast, MCEVAL uses Lua's native assert function for unit testing. To streamline and standardize the automated evaluation procedure, we translated the native assertions in MCEVAL into LuaUnit-based assertions.
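The translation step can be illustrated with a small sketch. The helper below is hypothetical, not the pipeline's actual implementation; it only handles the simple case of a single top-level equality comparison and assumes the LuaUnit module is bound to the conventional `lu` variable in the test script.

```python
import re

# Hypothetical helper: rewrite a simple MCEVAL-style native assertion,
#   assert(add(2, 3) == 5)
# into a LuaUnit assertion,
#   lu.assertEquals(add(2, 3), 5)
# Only a single top-level '==' comparison is handled; nested comparisons
# or other assertion forms are left untouched.
NATIVE_ASSERT = re.compile(r"^assert\((?P<actual>.+)==(?P<expected>.+)\)\s*$")

def to_luaunit(line: str) -> str:
    match = NATIVE_ASSERT.match(line.strip())
    if match is None:
        return line  # leave anything we cannot confidently rewrite as-is
    actual = match.group("actual").strip()
    expected = match.group("expected").strip()
    return f"lu.assertEquals({actual}, {expected})"

print(to_luaunit('assert(add(2, 3) == 5)'))
# -> lu.assertEquals(add(2, 3), 5)
```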
Table 3 provides a summary of all metrics used to evaluate the performance of the quantized models. First, we measure pass@1, the model's ability to generate a correct function on its first attempt. If the generated function passes all unit tests, the solution is considered correct; if at least one test fails, the solution is considered incorrect.
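With a single completion per task, pass@1 reduces to the fraction of tasks whose generated function passes all of its unit tests. A minimal sketch of this computation, assuming one boolean result per task, might look as follows:

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks whose single generated solution passed all unit tests.

    `results` holds one boolean per task: True if every unit test passed,
    False if at least one test failed (or the run errored out).
    """
    if not results:
        return 0.0
    return sum(results) / len(results)

# Example: 3 of 4 tasks solved on the first attempt -> pass@1 = 0.75
print(pass_at_1([True, True, False, True]))
```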
With LuaUnit, it is also possible to automatically differentiate between failed unit tests, runtime errors, and syntax errors. Failed unit tests and runtime errors are reported on the same stdout stream but with different error messages, whereas syntax errors are reported on stderr. Lastly, if a function call lasted longer than 30 seconds, it was counted as a timeout error. Differentiating between these four types of errors can give more insight into the challenges of code generation by LLMs and the effect of quantization.
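The sketch below shows one way these four outcomes could be distinguished, assuming the generated function and its LuaUnit tests are written to a standalone script that ends with the conventional `os.exit(lu.LuaUnit.run())` call, so the interpreter's exit code reflects the test outcome. The exact marker strings checked here are assumptions, not the pipeline's actual implementation.

```python
import subprocess

def classify_run(script_path: str, timeout_s: int = 30) -> str:
    """Run a Lua script containing the generated function plus its LuaUnit
    tests and classify the outcome into 'passed' or one of four error types."""
    try:
        proc = subprocess.run(
            ["lua", script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"          # the call exceeded the 30-second budget

    if proc.returncode == 0:
        return "passed"           # all LuaUnit assertions succeeded
    if proc.stderr.strip():
        return "syntax_error"     # the Lua interpreter reports syntax errors on stderr
    # LuaUnit reports both failed assertions and runtime errors on stdout,
    # but with different messages; the marker below is an assumption.
    if "ERROR" in proc.stdout:
        return "runtime_error"
    return "failed_test"
```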
In addition to the error types, we also measured inference time. Inference time is the time elapsed between the model receiving the prompt and the model finishing its output generation. While quantization does not aim to speed up LLMs, faster inference can be a side effect of the changes to the inference process and output quality. Hence, it may be insightful to investigate how inference time changes with quantization and how it correlates with solution accuracy.
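A minimal sketch of how inference time could be recorded per task is shown below; the generation call is a placeholder, not the actual pipeline code.

```python
import time

def timed_generation(generate_fn, prompt: str):
    """Measure the wall-clock time between submitting the prompt and the
    model finishing its output. `generate_fn` stands in for whatever
    generation call the pipeline uses."""
    start = time.perf_counter()
    output = generate_fn(prompt)
    inference_time = time.perf_counter() - start
    return output, inference_time
```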
The final evaluation metric is the number of lines of code (LOC) generated. For correct solutions, LOC can represent the relative quality of the code. For incorrect solutions, it can shed some light on the model's behavior and the effect of quantization on it. LOC can also be correlated with inference time, giving us a more granular understanding of dependencies.
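How LOC is counted can vary; the sketch below counts non-empty lines that are not single-line Lua comments, which is an illustrative assumption rather than the paper's exact definition.

```python
def lines_of_code(lua_source: str) -> int:
    """Count non-empty lines that are not single-line Lua comments.

    Illustrative definition only; block comments (--[[ ... ]]) and
    trailing comments are not handled here.
    """
    count = 0
    for line in lua_source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("--"):
            count += 1
    return count
```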
3.5 Model parameters
The first parameter relevant to all models is temperature. Temperature defines how predictable a model is: a lower temperature (t < 1) makes a model more deterministic in its next-token prediction, while a higher temperature makes the model less predictable when inferring the next token in a sequence. For pass@1, a lower temperature is preferred [33]. A lower temperature also makes this study more replicable. For these reasons, the temperature is set to 0.1 for all models in this study.
Top-k is the second parameter that makes a model more predictable. Top-k controls the greediness of next-token sampling by restricting sampling to the k tokens with the highest probability. In this study, top-k is set to 1 to further enforce the models' predictability and the reproducibility of the study.
The last parameter is the set of End-of-Sequence (EOS) tokens, which tell a model when it should stop generating. By default, models can have different EOS tokens. For example, DeepSeek Coder uses '<|EOT|>' as an EOS token, and CodeQwen uses '<|im_end|>'. However, MultiPL-E introduces its own stop tokens: '\nlocal', '\nfunction', '\n--', and '\n\n'. We noticed that these additional EOS tokens make model outputs more succinct. Hence, we use these tokens in addition to the models' default EOS tokens.
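The three parameters can be combined into a single decoding configuration. The sketch below is illustrative: how the settings are passed to the inference backend depends on the library used, and the stop-sequence truncation shown here is an assumption about how the extra stop tokens could be applied.

```python
# Decoding settings used for every model in this study (illustrative sketch).
GENERATION_CONFIG = {
    "temperature": 0.1,  # low temperature -> near-deterministic next-token choice
    "top_k": 1,          # greedy sampling: always take the most probable token
}

# MultiPL-E stop sequences used in addition to each model's own EOS token
# (e.g. '<|EOT|>' for DeepSeek Coder, '<|im_end|>' for CodeQwen).
EXTRA_STOP_SEQUENCES = ["\nlocal", "\nfunction", "\n--", "\n\n"]

def truncate_at_stop(output: str, stop_sequences=EXTRA_STOP_SEQUENCES) -> str:
    """Cut the generated completion at the earliest stop sequence, if any."""
    cut = len(output)
    for stop in stop_sequences:
        idx = output.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return output[:cut]
```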
3.6 Source code and data
The source code and data used in this study are available at https://github.com/E-Nyamsuren/qLMM-Lua-Eval-Pipeline. The source code includes Python pipelines for downloading and preparing the datasets, generating Lua code with the models, and evaluating the generated Lua code, as well as an R script for statistical analysis.
Author:
(1) Enkhbold Nyamsuren, School of Computer Science and IT, University College Cork, Cork, Ireland, T12 XF62 ([email protected]).
This paper is available on arxiv under CC BY-SA 4.0 license.
[8] https://github.com/bluebird75/luaunit