It is the first time in the second half of this year that I am not urgently trying to deal with something. So, instead of working on some manuscripts from the lab (sorry!), I took some time to look in more detail at the outputs of two recently announced science AI "assistants" dedicated to scientific publishing: the q.e.d. science peer review system and the Nature Research Assistant tool. This was not a very rigorous or quantitative assessment; instead, I had a look at the tools' outputs for 3 manuscripts from our lab - 2 recent preprints and 1 manuscript that we are still working on.
If you haven't tried it yet, q.e.d. tries to identify and list the claims made in a scientific paper and then flag any major or minor gaps in these claims. Visually, it is presented as a hierarchical tree with a main message for the whole manuscript, main claims and related (sub) claims. It is refreshing and positive to me that they decided to present this in a way that is different from the standard text peer review format but, in essence, this is very much the type of information obtained in a peer review report. In addition, the "What's new" section provides a description of what the model believes is most novel about the work and what might have been done in some way by other studies.

Before getting into more detail about the output of q.e.d., I also tried the same 3 manuscripts in the Nature Research Assistant tool. This is clearly more conservative in scope and provides a series of suggestions, primarily focused on improving the text. The tool does provide a list of identified "overstated claims", which comes closer to the idea of finding gaps in scientific claims/statements as done in q.e.d. science.
How good is the output of these AI assistants?
Regarding the output of these tools, I am really impressed by the level of detail of q.e.d. For every gap, it gives a written explanation of the identified issue and suggestions for additional work or text changes to mitigate it. Many of the identified gaps require quite detailed technical knowledge. In one particular example, the tool found a very non-trivial gap in the null model of a statistical test that required knowledge of proteomics, evolution and bioinformatics. The 3 manuscripts are very computational, which the q.e.d. developers indicate is not an area they have focused on during development. One of the manuscripts was even flagged as coming from a domain that does not fit their current set of domain areas. Still, I could expect to see many of these comments in a human peer review report. Is there something in these gaps that we never considered before and that I absolutely need to act on? Not really, but that can honestly be said of a significant fraction of all peer-review comments. I would generally rank these AI-generated comments as about average: not among the most useful peer-review comments, but certainly better than many we have received over the years.
The output of Nature's research assistant is much more what you would expect of a tool dedicated to improving the text of a manuscript. I think it is most useful for finding parts of the text that could benefit from improved clarity. The way the information is presented also encourages authors to revise the sections themselves, deciding whether or not to use the tool's suggestions, instead of simply feeding the whole text through an LLM. It is more of an assistant than a replacement for writing. Still, I don't think I would pay for a tool like this over, say, a general LLM chatbot.
For comparison purposes, I tried to recreate the output of q.e.d. using a standard LLM chatbot (Gemini Pro in this case). I took one of the manuscripts and tried to formulate a prompt that would likewise produce a list of claims, gaps and suggested changes, roughly along the lines of the sketch below. The output was not as good as q.e.d.'s - some of the identified gaps were the same, but overall it seemed qualitatively a bit more superficial.
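If you want to try something similar, here is a minimal sketch of that kind of prompt, wrapped in the Gemini Python SDK. The model name, file name and exact prompt wording are illustrative assumptions, not exactly what I used.

```python
# Minimal sketch: ask a general LLM for a q.e.d.-style claims/gaps report.
# The model name ("gemini-1.5-pro"), the file "manuscript.txt" and the prompt
# wording below are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes you have a Gemini API key

# Hypothetical plain-text export of the manuscript
with open("manuscript.txt") as f:
    manuscript = f.read()

prompt = (
    "Act as a scientific peer reviewer. For the manuscript below:\n"
    "1. State the main message and list each major claim with its sub-claims.\n"
    "2. For every claim, identify major and minor gaps in the evidence or reasoning.\n"
    "3. For each gap, suggest additional analyses or text changes that would address it.\n\n"
    + manuscript
)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(prompt)
print(response.text)
```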
AI "peer" review is here to stay
Whether we want it or not, these tools are now reaching a point where they can identify gaps in a scientific manuscript in a way that could pass as a human (peer) review report. There are many ways these tools can be used and abused. The most positive outcome might be that authors take advantage of them as assistants to help improve the clarity of their manuscripts before making them public. The most obvious negative outcome is that they will be used for lazy human reviewing, with the output simply copy-pasted to satisfy the ever-growing need to peer-review our ever-growing production of scientific papers. Given that these reports can be generated quickly, potentially as part of the submission process, a good way to preempt this might be for the journal to provide them to peer-reviewers as part of the request for assessment. This would make clear that the editor/journal is already aware of the issues an automated report would bring up and discourage reviewers from simply passing off such a report as their own. Finally, there is also a likely scenario in which editors of scientific journals start to integrate these reports into their initial editorial decisions. In particular, from the editorial perspective, these tools might end up serving as biased and lazy assessments of novelty and impact.
As a peer-reviewer, I don't think these automated reports would reduce the amount of work I need to do. I would still need to spend the time to read through a paper, consider the methods used, and try to figure out whether there are issues the authors might have missed and whether the claims and interpretation make sense relative to the data. Still, having such an automated report, along with a list of related published papers, might be a useful addition.
Perhaps one aspect that is not strongly emphasized in q.e.d. but is more obvious in Nature's research assistant and other tools is the connection of a given manuscript to the broader scientific literature. As scientists, I think it is fair to admit that it is hard to be fully aware of all the work that has been published in a field. Sometimes the connections between our work and the existing literature are less obvious because they happen through analogy and/or shared methods. An easy way to surface such connections while writing up a manuscript would be particularly useful.
Science and scientific publishing in the age of AI slop
We were already drowning in scientific papers before ChatGPT and co. Now there is growing evidence of papers being produced by AI and quite a lot of buzz around the concept of fully automated AI scientists. So it is unfortunately unavoidable that this will translate into an even stronger increase in the number of publications and added pressure on the scientific publishing system. One optimistic take is that the added publications will be easy-to-ignore crap that won't affect our productivity, but at the very least they are likely to mean more wasted money feeding the already rich publishing industry. Unfortunately, I think this will also hurt attempts to move away from our expensive and inefficient traditional publishing system, as scientists worry even more about "high-quality" science. The current (bad) proxies for quality (i.e. high-impact-factor journals) can't easily be changed to something else in an environment where many scientists will, rightfully, be even more worried about scientific rigor.