Abstract
Methods: Sixteen expert scientific reviewers assessed the large language models (LLMs) in terms of depth, accuracy, relevance, and clarity.
Results: Claude 3.5 Sonnet emerged as the highest-scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve response quality. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require further development or additional prompt engineering to reach comparable accuracy. Reviewers’ perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after the evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.
Discussion: The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.
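The RAG approach mentioned in the Results can be illustrated schematically: relevant reference passages are retrieved for a question and prepended to the prompt before it is sent to an LLM. The sketch below is purely illustrative and is not the study's actual pipeline; the corpus, overlap-based scoring, and prompt template are all hypothetical stand-ins.

```python
# Minimal illustrative sketch of retrieval-augmented generation (RAG).
# Scoring, corpus, and prompt format are toy assumptions, not the
# study's method: real systems use embedding similarity, not token overlap.

def score(query: str, passage: str) -> int:
    """Naive relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest overlap score."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Assemble the augmented prompt: retrieved context plus the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini-corpus for demonstration.
corpus = [
    "Claude 3.5 Sonnet scored highest in the expert evaluation.",
    "Prompt refinement improved answer clarity across models.",
    "The exam questions were domain-specific.",
]
print(build_prompt("Which model scored highest in the evaluation?", corpus))
```

In a full system, the assembled prompt would then be passed to the LLM's completion API; retrieval quality, not the template, typically dominates answer accuracy.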
| Original language | English |
|---|---|
| Article number | 1664303 |
| Number of pages | 11 |
| Journal | Frontiers in Artificial Intelligence |
| Volume | 8 |
| Publication status | Published - 9 Oct 2025 |
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the RTI2018-096724-B-C22, PID2021-125188OB-C32, and TED2021-129932B-C21 projects, funded by the Spanish Ministry of Economy and Competitiveness, and by CB12/03/30038 (CIBER Fisiopatologia de la Obesidad y la Nutrición, CIBERobn, Instituto de Salud Carlos III). This research was also funded by the Generalitat Valenciana (grant number PROMETEO/2021/059) and the Agencia Valenciana de la Innovación (grant number INNEST/2022/103). FÁ-M was supported by a “Margarita Salas” fellowship from the Spanish Ministry of Universities. EB-C was supported by a “Requalification for university teachers” grant from the Spanish Ministry of Universities and the European Union Next Generation program.
Keywords
- artificial intelligence
- scientific evaluation
- prompt engineering
- large language models
- retrieval-augmented generation