Abstract
Large language models (LLMs) such as ChatGPT have recently risen in prominence, creating a need to analyze their strengths and limitations across tasks. The objective of this work was to evaluate the performance of large language models on model checking, which is used extensively in critical tasks such as software and hardware verification. A set of problems was proposed as a benchmark, and three LLMs (GPT-4, Claude, and Gemini) were evaluated on their ability to solve these problems. The evaluation compared the responses of the three LLMs against the gold standard provided by model checking tools. The results illustrate the limitations of LLMs on these tasks and identify directions for future research. Specifically, the best overall performance (ratio of problems solved correctly) was 60%, indicating a high probability of reasoning errors by the LLMs, especially in more complex scenarios requiring many reasoning steps; the LLMs typically performed better when generating scripts for solving the problems than when solving them directly.
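For readers unfamiliar with the task, the "script" route mentioned in the abstract refers to having the LLM produce code or tool input that a model checker (or a small program) can evaluate, rather than reasoning about the model in natural language. The sketch below is a minimal, illustrative example of such a check, not taken from the paper: a self-contained explicit-state search in Python that verifies a mutual-exclusion safety property on a toy two-process model. The model, state encoding, and function names are assumptions made purely for illustration and do not correspond to the benchmark problems used in the study.

```python
from collections import deque

# Toy model: two processes, each cycling idle -> trying -> critical,
# with a shared "turn" variable that decides who may enter the critical section.
# A state is a tuple (loc0, loc1, turn).

def successors(state):
    """Enumerate the successor states of (loc0, loc1, turn)."""
    loc = [state[0], state[1]]
    turn = state[2]
    for i in (0, 1):
        if loc[i] == "idle":
            nxt = loc.copy(); nxt[i] = "trying"
            yield (nxt[0], nxt[1], turn)
        elif loc[i] == "trying" and turn == i:
            nxt = loc.copy(); nxt[i] = "critical"
            yield (nxt[0], nxt[1], turn)
        elif loc[i] == "critical":
            nxt = loc.copy(); nxt[i] = "idle"
            yield (nxt[0], nxt[1], 1 - i)  # hand the turn to the other process

def check_mutex():
    """BFS over the reachable state space; report any mutual-exclusion violation."""
    init = ("idle", "idle", 0)
    seen, frontier = {init}, deque([init])
    while frontier:
        state = frontier.popleft()
        if state[0] == "critical" and state[1] == "critical":
            return False, state  # safety violation: both processes in the critical section
        for succ in successors(state):
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return True, None

if __name__ == "__main__":
    ok, witness = check_mutex()
    print("mutual exclusion holds" if ok else f"violated in state {witness}")
```

In practice the benchmark's gold standard came from dedicated model checking tools, so a generated script would more likely emit input for such a tool than re-implement the search itself; the self-contained version above is used only to keep the example runnable without external dependencies.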
More Information
| Divisions | School of Built Environment, Engineering and Computing |
|---|---|
| Identification Number | https://doi.org/10.3390/electronics14020401 |
| Status | Published |
| Refereed | Yes |
| Publisher | MDPI AG |
| Additional Information | © 2025 by the authors |
| Uncontrolled Keywords | 0906 Electrical and Electronic Engineering; 4009 Electronics, sensors and digital hardware |
| SWORD Depositor | Symplectic |
| Depositing User | Antoniou, Grigorios |
| Date Deposited | 22 Jan 2025 10:14 |
| Last Modified | 22 Feb 2025 14:47 |
| Item Type | Article |