There are significant differences among artificial intelligence large language models when answering scientific questions

Francisco Javier Álvarez-Martínez, Luis Esteban, Lucas Frungillo, Estefanía Butassi, Alessandro Zambon, María Herranz-López, Mario Aranda, Federica Pollastro, Anne Sylvie Tixier, Jose V Garcia-Perez, David Arráez-Román, Andrew Ross, Pedro Mena, Ru Angelie Edrada-Ebel, James Lyng, Vicente Micol, Fernando Borrás-Rocher, Enrique Barrajón-Catalán

Research output: Contribution to journal › Article › peer-review

Abstract

Introduction: This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B.

Methods: Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity.

Results: Claude 3.5 Sonnet emerged as the highest scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied and prompts were refined to improve LLM performance and response quality. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require further development or additional prompt engineering to reach comparable accuracy. Reviewers’ perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.

Discussion: The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.
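The retrieval-augmented generation (RAG) approach mentioned in the abstract can be illustrated with a minimal sketch: retrieve the passages most relevant to a question, then prepend them to the prompt so the model answers from supplied context. The corpus, token-overlap scoring, and prompt template below are illustrative assumptions, not the study's actual implementation.

```python
import re

def tokenize(text):
    """Lowercase and split text into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Rank corpus passages by token overlap with the query; keep top k."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Assemble an augmented prompt: retrieved context plus the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical corpus for demonstration only.
corpus = [
    "Claude 3.5 Sonnet scored highest among the evaluated models.",
    "Reviewers assessed depth, accuracy, relevance, and clarity.",
    "Prompt refinement improved answer quality for several models.",
]

query = "Which model scored highest?"
prompt = build_prompt(query, retrieve(query, corpus))
```

A production pipeline would replace the overlap score with dense embeddings and pass the assembled prompt to an LLM, but the shape of the technique is the same: retrieval first, generation second.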
Original language: English
Article number: 1664303
Number of pages: 11
Journal: Frontiers in Artificial Intelligence
Volume: 8
DOIs
Publication status: Published - 9 Oct 2025

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the RTI2018-096724-B-C22, PID2021-125188OB-C32 and TED2021-129932B-C21 projects funded by the Spanish Ministry of Economy and Competitiveness and by CB12/03/30038 (CIBER Fisiopatología de la Obesidad y la Nutrición, CIBERobn, Instituto de Salud Carlos III). This research was also funded by the Generalitat Valenciana (grant number PROMETEO/2021/059) and the Agencia Valenciana de la Innovación (grant number INNEST/2022/103). FÁ-M was supported by a “Margarita Salas” fellowship from the Spanish Ministry of Universities. EB-C was supported by the “Requalification for university teachers” grant from the Spanish Ministry of Universities and the European Union Next Generation programme.

Keywords

  • artificial intelligence
  • scientific evaluation
  • prompt engineering
  • large language models
  • retrieval-augmented generation
