The End of the Human-in-the-Loop: Agentic AI and the Future of Vulnerability Research
Vulnerability research has always been bottlenecked by humans. Not by compute, not by tooling, but by the hours a skilled researcher spends writing harnesses, triaging crashes, reading heap traces, and manually navigating debugger output. The fuzzing itself is cheap. Everything around it is expensive.
We built a system that changes that equation. Using coordinated AI agents with persistent memory and domain expertise, we've compressed what traditionally takes a skilled team weeks into hours, on a single desktop workstation.
The human-in-the-loop isn’t disappearing. The human is moving from the factory floor to the control room. And that’s exactly where they should be.
The Old Model Is Broken
Traditional vulnerability research follows a well-worn pattern: a human expert selects a target, studies its source code, writes a fuzzing harness, generates seed inputs, launches a fuzzer, then (and this is where most of the time goes) manually triages every crash.
Each crash requires loading the input in a debugger, walking the heap, reading disassembly, classifying the root cause, deduplicating against known bugs, and determining exploitability. A skilled researcher might spend weeks on a single library before producing actionable results.
The actual fuzzing, the automated part, typically accounts for less than 20% of the total effort. The other 80% is human analysis.
This creates a structural problem: the attack surface of modern software is growing exponentially, but the pool of researchers capable of doing this work is not. There are perhaps a few thousand people worldwide who can competently take a crash from a fuzzer and determine whether it's exploitable. Meanwhile, every connected device runs dozens of C and C++ libraries with complex binary format parsers, each one a potential entry point.
The math doesn’t work. It never has.
Enter the Agents
Our approach replaces the monolithic human-expert model with a coordinated team of AI agents, each specialized for a phase of the research pipeline.
The Architect designs fuzzing strategy: selecting targets based on attack surface analysis, CVE history, deployment breadth, and code complexity. It evaluates whether a library warrants investment before any compute is spent.
The Security Specialist builds domain-specific harnesses. Not generic “feed bytes to the parser” harnesses, but targeted instruments that exercise specific codecs, specific parsing paths, specific subsystems. For a TIFF library, that means separate harnesses for LZW decompression, JPEG-in-TIFF interaction, IFD tag parsing, and strip/tile layout handling.
The Engineer handles build systems, dependency management, and integration, ensuring harnesses compile correctly with full instrumentation and all optional codecs enabled. Missing optional dependencies mean entire code paths go untested, a detail that matters enormously.
The Orchestrator coordinates it all: spawning agents in parallel, managing the fuzzing supervisor, rotating targets based on diminishing returns, and escalating findings that warrant human attention.
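The rotation logic the Orchestrator applies can be sketched in a few lines. This is an illustrative model, not the production system: the `Campaign` class, the coverage-history representation, and the thresholds are all our own assumptions here.

```python
from dataclasses import dataclass, field

@dataclass
class Campaign:
    """Rolling stats for one fuzzing target (illustrative model)."""
    name: str
    coverage_history: list = field(default_factory=list)  # cumulative edges per interval

    def is_plateaued(self, window: int = 5, min_gain: int = 10) -> bool:
        """True if coverage grew by fewer than `min_gain` edges over the last `window` intervals."""
        if len(self.coverage_history) <= window:
            return False
        return self.coverage_history[-1] - self.coverage_history[-1 - window] < min_gain

def pick_next_target(campaigns: list) -> Campaign:
    """Prefer targets still gaining coverage; among those, give time to the least-fuzzed one."""
    active = [c for c in campaigns if not c.is_plateaued()]
    pool = active or campaigns
    return min(pool, key=lambda c: len(c.coverage_history))
```

The key design choice is that rotation is driven by measured coverage growth rather than a fixed schedule, which is what lets the system walk away from hardened targets on its own.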
Each agent maintains a persistent store of lessons learned across campaigns. The system gets better at finding bugs the more bugs it finds.
Speed of Deployment
When we expanded to four new target libraries (libtiff, libexif, LibRaw, and Poppler), the agent team produced:
- 15 codec-specific fuzzing harnesses targeting historically vulnerable subsystems
- Complete build infrastructure for each library with full instrumentation
- Seed corpora and format-specific dictionaries tailored to each parser
- Comparison-logging variants for guided fuzzing
- Attack surface profiles for each target
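As a concrete example of the format-specific dictionaries above, an AFL-style dictionary for a TIFF parser might contain entries like these. The tag IDs and compression values come from the TIFF 6.0 specification; the particular selection and entry names are our own illustration, not the actual corpus:

```
# TIFF headers (little- and big-endian)
header_ii="II*\x00"
header_mm="MM\x00*"
# Tag IDs (little-endian) that steer the parser into interesting subsystems
tag_compression="\x03\x01"    # Compression (259)
tag_stripoffsets="\x11\x01"   # StripOffsets (273)
# Compression values that select specific codecs
compression_lzw="\x05\x00"    # LZW (5)
compression_jpeg="\x06\x00"   # old-style JPEG-in-TIFF (6)
```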
Total elapsed time from decision to fuzzers running: under 30 minutes, with all four agent teams working in parallel. Some harnesses required iterative build refinements, but the agents handled those autonomously.
A human researcher doing this work would reasonably spend one to two weeks. This isn’t a 2x improvement. It’s a structural change in the economics of vulnerability research.
Results
What We Found
Across nine open-source libraries, including libarchive, FreeType, DjVuLibre, libtiff, libexif, LibRaw, and Poppler, the system identified three previously unknown security findings in a widely-deployed library present on virtually every operating system:
A heap-based buffer overflow leading to confirmed RCE. The agents performed root cause analysis, mapped heap layouts across two glibc versions, and produced structured analysis of exploitation primitives that informed development of a working exploit chain, work that traditionally requires a senior exploit developer with deep knowledge of allocator internals.
An uninitialized memory read that leaks heap contents, including allocator metadata sufficient to defeat ASLR. The agents identified this as an ASLR bypass primitive and validated it experimentally.
A NULL pointer dereference causing process termination, independently triaged and minimized by the agent pipeline.
What We Didn’t Find
Equally telling: the system spent 3.6 billion executions on FreeType 2.13.3 with zero crashes. It confirmed all 60+ DjVuLibre crashes mapped to a single known CVE. It retired ImageMagick, hostapd, and several other targets after sufficient coverage confirmed they were well-hardened.
Knowing when to stop is as valuable as knowing what to chase. The agents make retirement decisions based on execution counts, coverage plateaus, and crash deduplication, not human impatience or sunk-cost bias.
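A retirement decision of this kind reduces to a conjunction of those three signals. The sketch below is a minimal illustration; the function name, the signal representation, and the one-billion-execution floor are assumptions of ours, not the system's actual thresholds:

```python
def should_retire(total_execs: int, new_edges_last_24h: int,
                  unique_crashes: int, known_crashes: int,
                  exec_floor: int = 1_000_000_000) -> bool:
    """Retire a target only when all three hold:
    - it has received enough executions to mean something,
    - coverage has flatlined,
    - every crash deduplicates to an already-known bug."""
    covered_enough = total_execs >= exec_floor
    plateaued = new_edges_last_24h == 0
    nothing_new = unique_crashes == known_crashes
    return covered_enough and plateaued and nothing_new
```

Under this rule, the FreeType campaign described above (billions of executions, no coverage growth, no crashes) retires cleanly, while any target still producing new crashes or new edges stays in rotation.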
What Went Wrong
No system is perfect, and ours has failure modes worth acknowledging:
- Harness quality varies. One agent confused build flags between libraries, producing a harness that compiled but exercised the wrong code paths. Automated testing caught it, but hours of fuzzing were wasted.
- False confidence in triage. The system classified a crash as “known bug” based on superficial stack trace similarity. A human reviewer later identified it as distinct. We’ve since added deeper deduplication.
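The "deeper deduplication" fix amounts to hashing normalized call stacks rather than comparing raw trace text. A minimal sketch, with an assumed ASan-style trace format (the frame-extraction regex and frame count are illustrative choices):

```python
import hashlib
import re

def crash_signature(stack_trace: str, top_frames: int = 5) -> str:
    """Deduplicate on the function names of the top frames, not raw text.
    Addresses and offsets are stripped so ASLR noise doesn't split one bug
    into many signatures, while distinct call paths still hash differently."""
    frames = []
    for line in stack_trace.splitlines():
        m = re.search(r"in ([\w:~]+)", line)  # e.g. "#0 0x7f.. in LZWDecode tif_lzw.c:421"
        if m:
            frames.append(m.group(1))
        if len(frames) == top_frames:
            break
    return hashlib.sha1("|".join(frames).encode()).hexdigest()[:12]
```

Even this is only a first pass: two different bugs can share a top-of-stack, which is exactly the failure mode described above, so the real fix also compares deeper frames and the faulting access.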
- Exploit development still requires human judgment. The agents produced excellent research artifacts (heap maps, allocation timelines, grooming strategies), but translating those into a working exploit required human expertise; left alone, agents would confidently propose approaches that don't survive contact with reality.
A system that claims perfection invites distrust. A system that knows its failure modes can be relied upon within its boundaries.
The Last Mile: From Crash to Exploit
Here's where the "human-in-the-loop" argument traditionally holds strongest. Finding crashes is automated. Understanding them? That's supposed to require a human.
Our experience challenges that assumption, at least partially. When the system found the heap overflow, AI agents:
- Performed root cause analysis, tracing the exact code path from input to corruption
- Mapped heap layouts across two major glibc versions, identifying adjacent allocator structures
- Analyzed exploitation primitives, testing grooming strategies programmatically
- Researched platform-specific constraints (ARM64 pointer authentication, allocator zone boundaries)
- Produced structured documentation that a human could extend into a working exploit
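The heap-layout mapping in the second step is, at its core, adjacency analysis over an allocation trace. A toy version is shown below; the trace format and function name are our own illustration, and it ignores allocator chunk headers, which the real glibc analysis had to account for:

```python
def find_adjacent(alloc_log, victim_id):
    """Given (id, address, size) allocation events, return the allocation that
    sits closest after the victim buffer in memory: the object a linear
    overflow would corrupt first. Toy model; real glibc chunks carry metadata
    headers between user buffers."""
    allocs = {a_id: (addr, size) for a_id, addr, size in alloc_log}
    v_addr, v_size = allocs[victim_id]
    v_end = v_addr + v_size
    following = [(addr - v_end, a_id)
                 for a_id, (addr, size) in allocs.items()
                 if a_id != victim_id and addr >= v_end]
    return min(following)[1] if following else None
```

Run over thousands of traced executions, this kind of analysis is exactly the tedious, mechanical correlation work the agents absorbed.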
Did a human still drive the final exploit development? Yes. But days of mechanical work (reading heap traces, correlating allocation patterns, testing strategies one by one) were handled by agents that don't get tired, don't lose context between sessions, and don't forget what they learned last week.
The era of routinely stepping through GDB output to understand a heap overflow is ending. Not because debuggers are obsolete, but because AI agents can operate them faster, more systematically, and with perfect recall.
A Single Desktop Workstation
A persistent myth in vulnerability research is that serious fuzzing requires cloud compute clusters and dedicated fuzzing farms. When AI agents handle the analysis pipeline, the equation inverts.
- AMD Ryzen 9 5900X: 12-core desktop processor
- 128GB RAM: shared with a dozen other services
- Consumer NVMe: no enterprise I/O
- NVIDIA RTX 3060: local LLM inference, not fuzzing
Peak throughput: ~75,000–82,000 executions per second across all campaigns. Modest by cloud standards, but our confirmed findings came at near-zero marginal hardware cost. The primary expense shifts from compute infrastructure to AI model API usage, which remains modest relative to traditional staffing.
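To put those numbers together: at the midpoint of that throughput range, the 3.6-billion-execution FreeType campaign fits in well under a day of wall-clock time. (This back-of-envelope assumes the whole machine serves one campaign; in practice several ran concurrently, so the real elapsed time was longer.)

```python
execs = 3_600_000_000   # FreeType campaign total, from the results above
rate = 78_000           # midpoint of the observed 75-82k exec/s range
hours = execs / rate / 3600
print(f"{hours:.1f} hours")  # prints "12.8 hours"
```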
Smart targeting beats brute-force compute. Google's OSS-Fuzz project has enormous compute resources and generic harnesses. Our agents found bugs in libraries that OSS-Fuzz has been testing for years, because domain-specific harnesses exercise code paths that generic approaches miss.
Quality of harness matters more than quantity of cores.
Implications Beyond Software Security
This work extends beyond traditional vulnerability research, particularly for organizations deploying AI in critical infrastructure.
Industrial environments (refineries, chemical plants, power generation, pipeline operations) are rapidly adopting AI for process optimization, predictive maintenance, alarm management, and autonomous control. Each AI system introduces software dependencies that carry the same memory safety vulnerabilities we find in traditional software.
The difference is consequence. A buffer overflow in a web application might leak user data. A buffer overflow in software controlling a distillation column could cause a vapor cloud explosion.
The model we’ve built maps directly to industrial AI security needs:
- Continuous evaluation rather than point-in-time audits
- Domain-specific testing rather than generic scanning
- Automated triage that surfaces findings that matter while filtering noise
And there’s an adversary dimension: the same agentic techniques that help defenders will be available to attackers. Nation-state actors and ransomware groups targeting industrial control systems will adopt these methods. The question is whether defenders get there first.
Organizations that invest now in AI-augmented security testing will have a structural advantage. Those that wait for a regulatory mandate will be playing catch-up against adversaries who aren’t waiting for anyone’s permission. The Council for Industrial AI Safety (CIAS) is working to establish governance frameworks for exactly this challenge.
What Comes Next
We’re in the early days. The current system requires human oversight for strategic decisions, disclosure coordination, and final exploit validation. But the trajectory is clear:
Near-term (2026–2027): Fully autonomous crash-to-CVE pipelines for common vulnerability classes, with human approval as a gate, not a bottleneck.
Medium-term (2027–2029): Proactive variant analysis. When a vulnerability is found, agents automatically evaluate whether similar patterns exist in related codebases, in hours rather than the months it takes the community to discover variants organically.
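In its crudest form, variant analysis is pattern search over related codebases; the agent version layers semantic understanding on top of that. A deliberately naive sketch of the first layer, with a hypothetical pattern and helper of our own devising:

```python
import re

# Flag memcpy calls whose length argument looks attacker-derived
# (names ending in _len or size). Purely heuristic and illustrative;
# real variant analysis reasons about data flow, not names.
RISKY = re.compile(r"\bmemcpy\s*\([^;]*\b(\w+_len|\w*size)\b")

def flag_variants(source: str):
    """Return 1-based line numbers containing the risky pattern."""
    return [i + 1 for i, line in enumerate(source.splitlines())
            if RISKY.search(line)]
```

The gap between this and what the agents do (understanding why a pattern was exploitable and whether the sibling codebase shares the precondition) is exactly why variant analysis sits in the medium-term column rather than the current one.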
Longer-term: Continuous, autonomous security assurance as a service. Organizations subscribe to a pipeline that monitors dependencies, finds vulnerabilities before attackers do, and delivers fixes, not reports. For industrial operators, this means continuous assurance that AI systems managing their processes are themselves secure, without maintaining a vulnerability research team in-house.
Drew Stelly is the founder of the Council for Industrial AI Safety (CIAS) and Gulf Coast Cyber, focused on AI safety governance for critical infrastructure. His current research spans agentic vulnerability discovery, exploit development, and the intersection of adversarial AI and industrial safety.
Rocket is an AI research assistant with strong opinions about heap allocators and questionable taste in humor.
Responsible disclosure: The RCE vulnerability referenced in this paper is undergoing coordinated disclosure with the affected library maintainers. Technical details will be published after remediation is available.