Researchers at Google have used a large language model AI agent to detect a zero-day vulnerability in real-world code for the first time.
The agent, called Big Sleep, found an exploitable stack buffer underflow in SQLite, a widely used open source database engine. The vulnerability was discovered and reported to the developers in early October, and they fixed it the same day, before it appeared in an official release, so SQLite users were not affected. This was despite the memory-safety flaw not showing up in extensive software testing.
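For readers unfamiliar with the bug class: a stack buffer underflow is a write that lands before the start of a stack-allocated buffer, typically via a negative or under-validated index. A deliberately buggy C sketch of the pattern (hypothetical code, not SQLite's; record_constraint and ROWID_SENTINEL are invented for illustration):

```c
/* Deliberately buggy sketch of a stack buffer underflow (CWE-124).
 * A sentinel value meaning "not a real column" is used, unvalidated,
 * as an array index, so the write lands before the buffer begins. */
#define ROWID_SENTINEL (-1)

void record_constraint(int column) {  /* column may be ROWID_SENTINEL */
    unsigned char used[8] = {0};      /* stack-allocated buffer */

    /* BUG: when column == ROWID_SENTINEL, this writes to used[-1],
     * one byte before the start of the array: undefined behavior that
     * corrupts adjacent stack memory and can be exploitable. */
    used[column] = 1;
}
```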
“We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software,” said the researchers.
This capability evolved from Project Naptime, which developed a framework for large-language-model-assisted vulnerability research and demonstrated its potential by improving on state-of-the-art performance on Meta’s CyberSecEval2 benchmarks. Since then, Naptime has evolved into Big Sleep, a collaboration between Google Project Zero and Google DeepMind.
“We think that this work has tremendous defensive potential. Finding vulnerabilities in software before it’s even released, means that there’s no scope for attackers to compete: the vulnerabilities are fixed before attackers even have a chance to use them,” they said.
Techniques such as fuzz testing have helped significantly, but defenders also need an approach that can find the bugs that are difficult (or impossible) to catch by fuzzing, and Google sees AI agents such as Big Sleep as a way to narrow this gap.
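Fuzzing, for context, feeds a target function large volumes of mutated inputs and watches for crashes. A minimal libFuzzer-style harness sketch in C (parse_header is a hypothetical stand-in for the real code under test):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical target: stands in for the real code under test.
 * Bugs here would surface as crashes or AddressSanitizer reports. */
static int parse_header(const uint8_t *data, size_t len) {
    return len >= 4 && data[0] == 'S';
}

/* libFuzzer entry point: the fuzzer calls this repeatedly with
 * mutated inputs, keeping those that reach new code paths. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_header(data, size);
    return 0;
}
```

Compiled with clang -fsanitize=fuzzer,address, such a harness runs until a sanitizer flags a memory error; bugs that never produce a crash on the generated inputs are exactly the ones that can slip past even extensive fuzzing.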
The team found the vulnerability itself quite interesting, as well as the fact that the existing testing infrastructure for SQLite (both OSS-Fuzz and the project’s own infrastructure) did not find the issue.
A key motivating factor for Naptime, and now for Big Sleep, is to identify variants of previously found and patched vulnerabilities: patches are often incomplete, and the same flawed pattern frequently survives elsewhere in a codebase. As this trend continues, it is clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.
This variant-analysis task is a better fit for current LLMs than the more general, open-ended vulnerability research problem: providing a starting point, such as the details of a previously fixed vulnerability, removes a lot of the ambiguity from vulnerability research.
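The appeal of variant analysis is easy to see in miniature: a patch often fixes one instance of a bug pattern while sibling code keeps the same flaw. A hypothetical C sketch (copy_name and copy_alias are invented for illustration and do not resemble SQLite code):

```c
#include <string.h>

/* Hypothetical sketch: the original bug (an unchecked copy length) was
 * patched in one code path, but the same pattern survives in a sibling. */

void copy_name(char dst[64], const char *src, size_t n) {
    if (n > 64) return;   /* the patch: a bound check was added here */
    memcpy(dst, src, n);
}

void copy_alias(char dst[64], const char *src, size_t n) {
    /* Variant: same copy pattern, no bound check; the patch never
     * reached it. Variant analysis starts from the fixed bug and
     * hunts for lookalikes such as this. */
    memcpy(dst, src, n);
}
```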
The project is still at the research stage: the team evaluates progress using small programs with known vulnerabilities, and SQLite was its first extensive, real-world variant analysis experiment.
However, these are highly experimental results, and the Big Sleep team’s position is that, at present, a target-specific fuzzer would likely be at least as effective at finding vulnerabilities.