Flywheel

An autonomous offensive-security research platform: a 7-agent graph that hunts, exploits, and self-verifies vulnerabilities.

Authorized-use tool. Flywheel is built for assets the operator is authorized to test: in-scope bounty programs, contracted pentests, self-owned infrastructure, and CTFs. It is not a mass scanner and is not operated against third parties without permission. Full source is kept private; this page shows the architecture and selected non-weaponized internals.

What it is

Most vulnerability research is a human moving slowly through the same loop: map the attack surface, form a hypothesis, try to build a proof-of-concept, decide whether it's real, write it up. Flywheel runs that loop as an autonomous multi-agent system, with one rule wired into its core: nothing is reported unless a deterministic validator has reproduced it. That single constraint is what keeps an autonomous hunter from drowning its operator in plausible-sounding false positives.

The platform exposes roughly 175 MCP tools across 28 modules to the agent graph and the interactive Claude Code session. Everything from raw HTTP probing to coverage-guided fuzzing, symbolic execution, patch-diff analysis, and GPU-accelerated semantic search runs through the same capability layer, so the agents and the human operator share the same tools and the same output format.

The 7-agent graph

An opportunity (a target URL, a CVE to find variants of, a patch diff) is threaded through a seven-stage pipeline. Each stage is its own agent with its own feature flag, so the graph can be run partially or fully autonomously:

  1. surface_mapper: enumerates the reachable attack surface
  2. recon: gathers structure, dependencies, prior art
  3. hypothesizer: proposes concrete, testable bug hypotheses (RAG-pre-injected with similar-CVE context, confirmed-pattern memory, and a false-positive blocklist)
  4. exploit_author: drafts a proof-of-concept per hypothesis; routes to an uncensored local LLM when raw PoC code is needed
  5. deterministic_validator: actually runs the PoC and asserts machine-checkable invariants; no LLM is in the loop at this step and the validator cannot be disabled via a feature flag
  6. skeptical_reviewer: an adversarial pass with a fresh Opus context (no hypothesizer reasoning visible) that tries to refute the finding
  7. reporter: only confirmed, surviving findings become a report

Per-hypothesis fan-out is bounded by a concurrency semaphore, so a noisy hypothesizer can't fork fifty exploit-author calls at once. The graph also maintains a global rate-limit coordinator and a sandbox janitor that reaps expired containers every five minutes.
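
The bounded fan-out described above can be sketched with a plain asyncio.Semaphore. This is a minimal illustration, not Flywheel's code: author_exploit, fan_out, and the stats dictionary are hypothetical names, and the sleep stands in for the real exploit_author call.

```python
import asyncio

MAX_CONCURRENT_AUTHORS = 3  # assumption: mirrors the default bound cited on this page


async def author_exploit(hypothesis: str, sem: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for one exploit_author call."""
    async with sem:  # at most MAX_CONCURRENT_AUTHORS bodies run at once
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # placeholder for the real agent round-trip
        stats["running"] -= 1
        return f"poc-for-{hypothesis}"


async def fan_out(hypotheses: list) -> tuple:
    """Start one task per hypothesis; the semaphore throttles actual execution."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_AUTHORS)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(*(author_exploit(h, sem, stats) for h in hypotheses))
    return list(results), stats["peak"]
```

Calling asyncio.run(fan_out([...])) with fifty hypotheses still yields fifty results, but the recorded peak concurrency never exceeds the semaphore's bound.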

flywheel/autonomous/agents/graph.py: module docstring showing the full pipeline contract
"""The Agent Graph runner (PLAN_V6 §3.2).

Threads an Opportunity through the 7-stage pipeline:

    opportunity
        │
        ▼
    surface_mapper ──► recon ──► hypothesizer
                                    │
                            (per-hypothesis fan-out)
                                    │
                                    ▼
                            exploit_author
                                    │
                                    ▼
                            deterministic_validator
                                    │
                         (verdict == reproduced)
                                    │
                                    ▼
                            skeptical_reviewer
                                    │
                              (approve == True)
                                    │
                                    ▼
                                reporter ──► HuntResult

Every step respects its agent's feature flag. Per-hypothesis fan-out is
bounded by a concurrency semaphore so a noisy hypothesizer can't fork 50
concurrent exploit_author calls.

The graph also does RAG injection: before calling the hypothesizer it
pulls similar-CVE context (knowledge layer), program memory (confirmed
patterns + FP blocklist), and n-day variants (if the opportunity came
from a patch diff).
"""

The MCP tool surface

The agent graph and the interactive session both consume capabilities from the same ~175-tool MCP surface, organized into eight functional categories:

Recon & crawling (crawler, scope_intel, recon_tools, nodes): authenticated crawling with session injection, JS bundle endpoint extraction, attack-surface graphing, and remote-node execution.

Static analysis (source_analysis, source_audit_tools): CodeQL database creation and query execution, Semgrep scans, JS source analysis with source/sink regex matching, patch-diff generation, and multi-level auth-differential response comparison.

Binary analysis (fuzz_engine, symbolic, aeg): AFL++ and LibFuzzer coverage-guided fuzzing, angr symbolic execution and path exploration, Z3 constraint solving for logic-bug and integer-overflow analysis, ROP chain generation, format-string and heap exploit scaffolding via pwntools.

Web attack-surface testing (hypothesis, session_tools, diff_tools, oauth_fapi, cache_deception_v2, mcp_scanner): 27 active-test tools covering IDOR, SSRF, SQLi, XSS, SSTI, JWT weakness detection, GraphQL introspection and injection, CORS misconfiguration, HTTP verb tampering, CRLF injection, WebSocket hijacking, OAuth FAPI 2.0 audience confusion, PortSwigger-style cache-deception probing, and MCP server boundary testing.

Fuzzing & protocol (fuzz_tools, protocol_fuzz): parameter and header fuzzing with wordlist support, gRPC fuzzing, boofuzz session management for binary protocol targets.

Knowledge & CVE search (knowledge_layer): cve_semantic_search does GPU-accelerated cosine-similarity queries across five ChromaDB collections (NVD CVEs, Exploit-DB code, Exploit-DB descriptions, GHSA advisories, and CWE descriptions); cwe_traverse navigates the CWE hierarchy graph up to four levels in any direction; exploit_search retrieves real Exploit-DB code; advisory_search queries GHSA and OSV by ecosystem.

Sandbox & detonation (sandbox, nodes): disposable container lifecycle (create / exec / destroy with TTL-based GC), Docker-in-container verification via docker_verify_finding, OOB callback server management for blind-SSRF and XXE confirmation.

Autonomous orchestration (autonomous_tools, supply_chain, oauth_fapi): Intel daemon control, hunt triggering, campaign management, adversary-profile-guided planning, supply chain auditing (npm / PyPI / Docker), OAuth app inventory, CI/CD workflow auditing, identity posture probing, certificate transparency monitoring, subdomain takeover detection.

The knowledge layer in depth

The knowledge layer is what prevents the hypothesizer from re-inventing wheels or re-testing known dead ends. It has two halves that feed into each other.

The static corpus is ingested from five sources (NVD, CWE taxonomy, Exploit-DB, GHSA, OSV.dev) by a set of ingest scripts that run on a cron schedule. The data lands in both a SQLite store (for exact structured queries) and a ChromaDB vector store (for semantic search). Embeddings are 768-dimensional vectors produced by nomic-embed-text running on a local GPU, so semantic search doesn't depend on any external API and runs at millisecond latency.

The dynamic corpus accumulates per-hunt. hunt_memory_write / hunt_memory_read persist operational context across CLI sessions. Per-program memory files track confirmed patterns and false positives keyed by program slug; the hypothesizer prompt is pre-populated with both before each hunt, so the system won't re-test a class it already knows doesn't work on a given target.

N-day variant hunting is a first-class workflow: when an opportunity arrives via a patch diff, the graph extracts the vulnerability class from the changed code, embeds the pattern, and retrieves the top-K semantically similar findings from the vector store. Those become variant seeds for the hypothesizer. The variant_search_cve tool does the same thing on-demand from a CVE ID.
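
The retrieval step can be illustrated with a small top-K cosine-similarity ranking. This is a sketch under stated assumptions: the real system queries ChromaDB with 768-dimensional embeddings, whereas the vectors below are toy 3-dimensional values with made-up IDs, and cosine and top_k are hypothetical helper names.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list, corpus: dict, k: int = 2) -> list:
    """Return the IDs of the k corpus vectors most similar to the query."""
    ranked = sorted(corpus, key=lambda cid: cosine(query, corpus[cid]), reverse=True)
    return ranked[:k]


# Toy data: illustrative only, not real findings or real embeddings.
corpus = {
    "finding-a": [1.0, 0.0, 0.0],
    "finding-b": [0.9, 0.1, 0.0],
    "finding-c": [0.0, 0.0, 1.0],
}
seeds = top_k([1.0, 0.05, 0.0], corpus, k=2)
```

Here seeds holds the two nearest neighbours of the query pattern, which is the role the variant seeds play for the hypothesizer.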

The cwe_traverse tool navigates the CWE weakness hierarchy graph. Starting from a known CWE, it walks parent / child / peer relationships (configurable depth, up to 4 levels) and returns CVE counts per node. That makes it easy to ask 'I found a CWE-190 instance. What related weakness classes should I also be testing?' without having to know the CWE taxonomy by heart.
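
A depth-limited walk of this kind is essentially a breadth-first search with a hop cap. The sketch below assumes an undirected adjacency map and uses placeholder node names (W0 to W5) rather than real CWE IDs; the per-node CVE counts that cwe_traverse returns are omitted.

```python
from collections import deque


def traverse(graph: dict, start: str, max_depth: int = 4) -> dict:
    """Return {node: hop distance} for every node within max_depth hops of start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the configured depth
        for neighbour in graph.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return depth


# Placeholder chain W0 - W1 - ... - W5; edges are treated as bidirectional so
# parent, child, and peer links are all walkable.
edges = [("W0", "W1"), ("W1", "W2"), ("W2", "W3"), ("W3", "W4"), ("W4", "W5")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

reachable = traverse(graph, "W0")
```

With the default depth of 4, the walk from W0 reaches W4 but stops short of W5.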

flywheel/autonomous/validation/replay.py: the invariant schema the validator asserts
class InvariantKind(str, Enum):
    HTTP_STATUS            = "http_status"       # status == N, or N in {a,b,c}
    HEADER_PRESENT         = "header_present"
    HEADER_ABSENT          = "header_absent"
    HEADER_EQUALS          = "header_equals"
    BODY_CONTAINS          = "body_contains"
    BODY_NOT_CONTAINS      = "body_not_contains"
    BODY_REGEX             = "body_regex"
    BODY_LENGTH_GT         = "body_length_gt"
    TIMING_DELTA_MS        = "timing_delta_ms"   # diff between two requests
    OOB_CALLBACK_RECEIVED  = "oob_callback_received"
    RESPONSE_DIFF          = "response_diff"     # response A differs from B


@dataclass
class Invariant:
    """One machine-checkable assertion about a replay result."""
    kind: InvariantKind
    request_index: int = 0          # which PoC request to inspect (0-based)
    name: str | None = None         # header name, substring, etc.
    expected: Any = None            # expected value; semantics depend on kind
    compare_to_index: int | None = None   # for TIMING_DELTA / RESPONSE_DIFF
    min_ms: float | None = None     # for TIMING_DELTA_MS
    max_ms: float | None = None
    description: str = ""

Deterministic validation and sandbox detonation

The design thesis of v6 (articulated in PLAN_V6 as the "XBOW pattern") is: an LLM proposes, deterministic code verifies. The DeterministicValidatorAgent is the mechanism.

Every exploit_author output must express its own success criteria as a list of typed Invariant objects: HTTP status match, header presence, body substring or regex, timing delta for side-channel races, OOB callback receipt for blind injection classes. The validator replays the PoC against the live target (or against a freshly-spun replica for destructive tests), evaluates each invariant against the real response, and emits a verdict. The validator cannot be disabled via the per-agent feature flag. Disabling it would regress the system to the v5 failure mode where Opus self-assessed whether its own PoC worked.
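
To make the evaluation step concrete, here is a minimal sketch of how typed invariants might be checked against recorded responses. It is not the validator's actual code: the enum and dataclass are trimmed copies of the schema shown earlier, Response is a hypothetical replay record, the "not_reproduced" verdict string is an assumption, and the choice of which field carries the operand for each kind is a guess based on the schema's field comments.

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class InvariantKind(str, Enum):  # trimmed copy of the published schema
    HTTP_STATUS = "http_status"
    HEADER_PRESENT = "header_present"
    BODY_CONTAINS = "body_contains"
    BODY_REGEX = "body_regex"


@dataclass
class Invariant:
    kind: InvariantKind
    request_index: int = 0
    name: Optional[str] = None
    expected: Any = None


@dataclass
class Response:  # hypothetical replay record, not Flywheel's real type
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def check(inv: Invariant, responses: list) -> bool:
    """Evaluate one invariant against the response it points at."""
    resp = responses[inv.request_index]
    # Assumption: body checks read their operand from `name`, else `expected`.
    operand = inv.name if inv.name is not None else inv.expected
    if inv.kind is InvariantKind.HTTP_STATUS:
        exp = inv.expected
        allowed = exp if isinstance(exp, (set, list, tuple)) else {exp}
        return resp.status in allowed
    if inv.kind is InvariantKind.HEADER_PRESENT:
        return any(h.lower() == (inv.name or "").lower() for h in resp.headers)
    if inv.kind is InvariantKind.BODY_CONTAINS:
        return str(operand) in resp.body
    if inv.kind is InvariantKind.BODY_REGEX:
        return re.search(str(operand), resp.body) is not None
    raise ValueError(f"unsupported invariant kind: {inv.kind}")


def verdict(invariants: list, responses: list) -> str:
    """All invariants must hold; an empty list never counts as reproduced."""
    ok = bool(invariants) and all(check(i, responses) for i in invariants)
    return "reproduced" if ok else "not_reproduced"
```

Treating an empty invariant list as a failure is a deliberate choice in this sketch: a PoC that asserts nothing should not be able to pass by default.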

For tests that need a clean target state, the sandbox manager spins up a disposable container with a TTL. The container is tagged with the hunt ID, executed against, then torn down. The background janitor sweeps expired containers every five minutes and reports leaked containers via telemetry. docker_verify_finding offers the same pattern using Docker-in-container for targets that ship official Docker images: provision the stack, run setup commands, execute the PoC, capture output, destroy.
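
The TTL sweep can be sketched without touching a container runtime at all. The example below assumes an in-memory registry and a caller-supplied destroy callback; SandboxRecord and sweep_expired are hypothetical names, and the real janitor presumably talks to the container engine and emits telemetry.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SandboxRecord:
    container_id: str
    hunt_id: str       # tag linking the container to its hunt
    expires_at: float  # Unix timestamp after which the container is stale


def sweep_expired(
    registry: dict,
    destroy: Callable[[str], None],
    now: Optional[float] = None,
) -> list:
    """Destroy every container whose TTL has passed; return the reaped IDs."""
    now = time.time() if now is None else now
    expired = [cid for cid, rec in registry.items() if rec.expires_at <= now]
    for cid in expired:
        destroy(cid)       # caller-supplied teardown (hypothetical hook)
        del registry[cid]  # forget the record once the container is gone
    return expired
```

Passing now explicitly keeps the sweep deterministic under test; a periodic task would simply call sweep_expired(registry, destroy) every five minutes.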

Campaign planner and adversary profiles

For longer engagements, Flywheel has a campaign layer that reasons in ATT&CK kill-chain terms rather than per-vulnerability terms.

A Campaign has an objective (data exfil, persistence, credential theft, supply-chain compromise), a current kill-chain stage, and a running history. The CampaignPlanner agent injects the campaign state plus an optional adversary profile into an Opus prompt and asks: given my profile and my current stage, what is the cheapest path to impact, and what is the specific next Flywheel tool to invoke?

The platform ships six named adversary profiles (APT29, Turla, Sandworm, Lazarus, Scattered Spider, and Shiny Hunters), each with documented preferred initial access techniques, characteristic TTPs mapped to ATT&CK IDs, opsec level, and operational tempo. A campaign running under APT29 looks different from one running under Scattered Spider: APT29 favors long dwell, supply-chain compromise, and Golden-SAML forgery; Scattered Spider favors social engineering and MFA bypass, preferring identity terrain over web bugs.

The planner is also injected with a catalog of Flywheel's intel watchers (CT log monitoring, OAuth app inventory, supply-chain feeds, CI/CD workflow auditing, leaked credential scanning) keyed to the ATT&CK technique each watcher covers, so Opus can name a concrete tool rather than a vague recommendation.

The autonomous hunt loop

When the operator is not at a keyboard, four concurrent asyncio loops keep Flywheel working. The intel loop runs the IntelDaemon as a background task, pulling new Opportunity objects from eight configurable watchers: a git-commit poller (15-minute cadence), a HackerOne and Bugcrowd hacktivity feed (6-hour cadence), an NVD CVE poller (12-hour cadence), a JS-bundle diff poller (24-hour cadence), and four APT-mode sub-watchers covering certificate-transparency logs, subdomain-takeover candidates, CI/CD workflow changes, and leaked credentials (6- to 24-hour cadences, each toggleable). Every opportunity is hashed before it enters the queue; the daemon persists seen hashes to disk so restarts don't re-process the same commits.
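
Restart-safe deduplication of this kind can be sketched with a content hash and an append-only file. The hashing scheme (SHA-256 over canonical JSON), the SeenStore class, and the file name are all assumptions made for illustration; the source does not say how Flywheel computes or stores its hashes.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def opportunity_hash(opp: dict) -> str:
    """Stable digest: key order and whitespace do not affect the result."""
    canonical = json.dumps(opp, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SeenStore:
    """Hypothetical seen-hash set backed by a newline-delimited file."""

    def __init__(self, path: Path):
        self.path = path
        self.seen = set(path.read_text().split()) if path.exists() else set()

    def is_new(self, opp: dict) -> bool:
        digest = opportunity_hash(opp)
        if digest in self.seen:
            return False
        self.seen.add(digest)
        with self.path.open("a") as fh:  # append so a crash loses at most one entry
            fh.write(digest + "\n")
        return True


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "seen_hashes.txt"
    first = SeenStore(path).is_new({"source": "git", "commit": "abc123"})
    # A fresh SeenStore simulates a daemon restart reading the same file.
    second = SeenStore(path).is_new({"commit": "abc123", "source": "git"})
```

The second call sees the same opportunity with its keys in a different order and still recognises it, because the digest is computed over a canonical serialisation.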

Opportunities land in a bounded priority queue (capped at 500 items). Priority is a float combining novelty score and target value; when the queue is full, incoming items displace the lowest-priority incumbent rather than blocking. The hunt loop dequeues one opportunity at a time, builds a HuntContext by querying the knowledge base for related patterns and prior findings, then routes it through the 7-agent graph. The graph's per-hypothesis fan-out is bounded by a concurrency semaphore (default: 3 concurrent exploit-author calls), so a noisy hypothesizer can't saturate the cluster.

The maintenance loop fires every 5 minutes. It checkpoints findings and daemon stats to disk and logs queue depth and pending-report counts. Expired sandbox containers are reaped on the same interval by the sandbox janitor, which runs as a separate, fifth concurrent task. The stats loop fires hourly and logs a structured summary: uptime, opportunity count, hypothesis count, test count, confirmed findings, queued reports, Opus call count, and cumulative token usage.

The daemon escalates to the operator by writing confirmed findings to a filesystem queue under pending/. No submission happens until a human approves. The reporting pipeline has an explicit approve_finding gate, and auto_submit is off by default. The operator can also inject manual opportunities at any time via trigger_autonomous_hunt, which enqueues a high-priority item with novelty_score=1.0 and lets the normal hunt loop process it.

flywheel/autonomous/core/orchestrator.py: priority queue with backpressure
import heapq


async def _enqueue_opp(self, opp) -> bool:
    """Enqueue with backpressure. High-priority items displace low-priority
    items when the queue is full. Returns True if enqueued, False if
    dropped."""
    rank = -float(getattr(opp, "priority", 0.0))  # PriorityQueue: smallest first
    tiebreaker = next(self._opp_counter)
    item = (rank, tiebreaker, opp)
    if not self.opportunity_queue.full():
        await self.opportunity_queue.put(item)
        return True
    # Queue is full. get_nowait() would pop the *highest*-priority item
    # (smallest rank), so locate the lowest-priority incumbent by scanning the
    # backing heap instead. NOTE: _queue is a private CPython detail of
    # asyncio.PriorityQueue; there is no public API for this.
    heap = self.opportunity_queue._queue
    if heap:
        worst_idx = max(range(len(heap)), key=lambda i: heap[i][:2])
        worst = heap[worst_idx]
        if worst[0] > rank:  # worst has lower priority than new item
            heap[worst_idx] = item  # swap in place; queue size is unchanged
            heapq.heapify(heap)     # restore the heap invariant
            logger.info("[Queue] Displaced lower-priority opportunity "
                        "(rank %.2f -> %.2f)", worst[0], rank)
            return True
    self._opps_dropped += 1
    logger.warning("[Queue] Opportunity dropped (queue full, priority "
                   "%.2f). Dropped lifetime: %d",
                   getattr(opp, "priority", 0.0), self._opps_dropped)
    return False

The reporting pipeline

A finding that survives the validator and the skeptical reviewer enters the ReportPipeline as a structured Finding object paired with an Opus-drafted report. Before it's queued, the pipeline applies a severity-tiered confidence floor: critical findings are queued at 0.60 confidence or above (a plausible RCE warrants human eyes even at modest confidence), while low-severity findings must reach 0.85 to keep noise out. Anything below its floor is silently dropped.
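
The floor itself is a small lookup. The sketch below encodes only the two thresholds stated above; how intermediate severities are treated is not specified in the source, so falling back to the strictest floor for any other severity is purely an assumption of this example, as are the names CONFIDENCE_FLOORS and clears_floor.

```python
# Only these two values come from the text; everything else is assumed.
CONFIDENCE_FLOORS = {"critical": 0.60, "low": 0.85}


def clears_floor(severity: str, confidence: float) -> bool:
    """True if a finding's confidence meets the floor for its severity."""
    strictest = max(CONFIDENCE_FLOORS.values())
    # Assumption: unknown or unlisted severities get the strictest floor.
    floor = CONFIDENCE_FLOORS.get(severity.lower(), strictest)
    return confidence >= floor
```

For example, clears_floor("critical", 0.62) passes while clears_floor("low", 0.80) does not.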

Findings that clear the floor are written as JSON to a pending/ directory on disk. Each file contains the full Finding object, the Opus-drafted report, and a queued_at timestamp. The operator reviews via review_pending_findings and either approves or rejects with a reason. Rejection moves the file to rejected/ with the reason attached; approved findings move to approved/ and, if auto_submit is enabled, are submitted to the bug-bounty platform API automatically.

The enrichment pass that runs before queuing (via the validate_finding / enrich_finding tools in the interactive session) handles three things: a CVSS adjustment based on FP-pattern matches (a confirmed WAF artifact or same-origin redirect downscores automatically), a CWE mapping from a static lookup table keyed by vulnerability type, and a dedup check against all confirmed findings in the current session. The enrich_finding tool also generates a steps_to_reproduce list from the validator's invariant replay log, so the final report contains machine-verified reproduction steps rather than hypothesizer prose.

For platform dispatch, the reporter agent maintains separate system prompts for HackerOne, Bugcrowd, Immunefi, YesWeHack, and direct VDPs. The HackerOne and Bugcrowd prompts produce the structured JSON those APIs require; the Immunefi prompt follows that platform's blockchain-specific severity taxonomy. The final report is never submitted without the CVSS vector, the CWE ID, and at minimum one machine-verified reproduction step. The policy gate that runs after the reporter blocks the HuntResult if those fields are missing.
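
A gate of this shape reduces to a required-field check. In the sketch below the report is modelled as a plain dictionary, and the key names cvss_vector and cwe_id are hypothetical; only steps_to_reproduce is a name that appears in the source. The sample vector and CWE ID are illustrative values.

```python
REQUIRED_FIELDS = ("cvss_vector", "cwe_id", "steps_to_reproduce")


def missing_fields(report: dict) -> list:
    """List required fields that are absent or empty (empty list, blank string)."""
    return [name for name in REQUIRED_FIELDS if not report.get(name)]


def passes_policy_gate(report: dict) -> bool:
    """A report may proceed only if nothing required is missing."""
    return not missing_fields(report)


draft = {
    "cvss_vector": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N",
    "cwe_id": "CWE-639",
    "steps_to_reproduce": [],  # no machine-verified step yet
}
blocked_on = missing_fields(draft)
```

Because an empty steps_to_reproduce list is falsy, the draft above is blocked until at least one verified step is attached.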

Cross-target chaining and exploit-chain suggestion

Individual findings are interesting; chained findings are what move severity ratings. Flywheel builds a persistent AppGraph during every hunt: an in-memory directed graph of nodes (endpoints, parameters, auth gates, data objects, tech components) and typed edges (accepts, returns, requires_auth, flows_to, references, vulnerable_to). The graph is populated automatically via graph_hooks, a dispatcher that intercepts every tool result and extracts nodes and edges without needing any explicit instruction.

Once a finding is attached to the graph through a vulnerable_to edge, suggest_chains traverses the graph from the affected node in both directions to depth 5. The traversal looks for three chain-amplifying patterns: a path that reaches an auth_gate node (potential auth-bypass amplification), a path that reaches a data_object node (data-exposure chain), and a path through a sensitive-flagged endpoint (privilege-escalation chain). It also walks incoming flows_to edges to surface tainted-data flows that enter the finding. Results are deduplicated by path and ranked by length, shortest first, because shorter chains are more immediately actionable.
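
A simplified version of that traversal is a depth-limited path search that keeps any path ending on a node of an interesting kind. The sketch below follows outgoing edges only and covers two of the three patterns; find_chains, the adjacency map, and the node names are all hypothetical, and the real suggest_chains also walks incoming edges and sensitive-flagged endpoints.

```python
from collections import deque

TARGET_KINDS = {"auth_gate", "data_object"}


def find_chains(edges: dict, kinds: dict, start: str, max_depth: int = 5) -> list:
    """Return simple paths from start to any target-kind node, shortest first."""
    chains = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and kinds.get(node) in TARGET_KINDS:
            chains.append(path)
        if len(path) - 1 < max_depth:  # path length in edges so far
            for nxt in edges.get(node, ()):
                if nxt not in path:  # avoid cycles, which also dedups paths
                    queue.append(path + [nxt])
    return sorted(chains, key=len)


# Toy graph with made-up node names.
edges = {
    "finding": ["endpoint_a", "endpoint_b"],
    "endpoint_a": ["login_gate"],
    "endpoint_b": ["endpoint_c"],
    "endpoint_c": ["user_table"],
}
kinds = {"login_gate": "auth_gate", "user_table": "data_object"}
chains = find_chains(edges, kinds, "finding")
```

The two-hop path to the auth gate ranks ahead of the three-hop path to the data object, matching the shortest-first ordering.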

For cross-target pattern propagation, the memory_store / memory_query tool pair maintains a ChromaDB-backed knowledge base keyed by program slug. Confirmed patterns are stored with their vulnerability class, endpoint signature, and a confidence weight. The hypothesizer is pre-loaded with this context at the start of every hunt via _build_rag_context. The prompt explicitly instructs the hypothesizer to weight confirmed-pattern variants 2x over novel hypotheses on programs with at least one prior confirmed finding, because a mistake a target's developers made once tends to recur close to where it was first found.

The campaign planner layer handles multi-step chains across kill-chain stages. When two or more findings exist for a target, the orchestrator calls analyze_chains after confirming each new finding. Chains are logged and, if they cross a kill-chain stage boundary, the planner recommends the next concrete tool to advance toward the campaign objective. The store_exploit_chain tool persists successful chains to the knowledge base so they can seed future campaigns on similar targets.

Multi-node execution and the intel daemon

Flywheel runs across a three-node compute cluster. The static-analysis node handles CPU-bound work: CodeQL database builds, Semgrep scans, Joern CPG queries, angr symbolic execution, AFL++ / LibFuzzer fuzzing jobs, and Go race-harness builds. The GPU node serves nomic-embed-text embeddings (768 dimensions, via Ollama) over an HTTP API and hosts the ChromaDB vector stores; because embeddings stay local, semantic search has no external-API dependency and adds single-digit milliseconds of latency per query. The egress node routes external HTTP traffic through a SOCKS5 proxy onto a VPN, so scan traffic to bounty programs does not originate from the operator's own IP address.

All three nodes are addressed via a thin infrastructure module that reads node addresses from environment variables, so the cluster topology can change without touching code. SSH connections use a persistent multiplexed socket with a 300-second keep-alive, so repeated commands to the same node reuse a single connection rather than paying the TCP+crypto handshake cost each time. Heavy commands (angr, Ghidra, CodeQL, AFL++) trigger a pre-flight memory check: if the target node is above 80% memory utilization, the job is rejected with an actionable error rather than OOMing silently.
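
The pre-flight check amounts to a threshold test that fails loudly. The sketch below takes memory figures as arguments rather than querying a node, so how Flywheel actually measures utilization is left open; NodeBusyError and preflight_memory_check are hypothetical names.

```python
MEMORY_LIMIT = 0.80  # reject heavy jobs above 80% utilization, per the text


class NodeBusyError(RuntimeError):
    """Raised instead of launching a job that is likely to be OOM-killed."""


def preflight_memory_check(node: str, used_bytes: int, total_bytes: int,
                           limit: float = MEMORY_LIMIT) -> float:
    """Return current utilization, or raise with an actionable message."""
    utilization = used_bytes / total_bytes
    if utilization > limit:
        raise NodeBusyError(
            f"{node} is at {utilization:.0%} memory (limit {limit:.0%}); "
            "wait for running jobs to finish or pick another node"
        )
    return utilization
```

Raising a dedicated exception type lets the caller distinguish a busy node from a genuine tool failure and retry later.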

For disposable detonation environments, the sandbox manager creates containers on demand, tagged with a hunt ID and a TTL. The janitor loop (running in a dedicated asyncio task, firing every 5 minutes) sweeps for expired containers and destroys them. Docker-in-container verification via docker_verify_finding provides the same pattern for targets that ship official container images: provision, configure, execute the PoC, capture stdout/stderr, destroy.

The intel daemon runs its eight watcher coroutines concurrently inside asyncio.gather. Each watcher fails independently: gather is called with return_exceptions=True, so a crash in one watcher comes back as an exception object and is logged without bringing down the others. The daemon also tracks a per-vendor patch-poll schedule covering high-priority vendors (network edge gear, VPN concentrators, web application firewalls) with poll intervals as short as 6 hours. When a new patch ships for a tracked vendor, the diff watcher scores the commit for security-relevant keywords and, if the score crosses a threshold, creates an opportunity with source OpportunitySource.GIT_DIFF or SUPPLY_CHAIN and enqueues it with elevated priority.
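
The isolation behaviour comes directly from asyncio.gather's return_exceptions flag. The following sketch shows the pattern with two toy watchers; run_watchers and the watcher names are hypothetical, and real watchers would loop on a cadence rather than return once.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict

logger = logging.getLogger("intel_daemon_sketch")


async def run_watchers(watchers: Dict[str, Callable[[], Awaitable[int]]]) -> dict:
    """Run all watchers concurrently; a failure in one never cancels the rest."""
    names = list(watchers)
    results = await asyncio.gather(
        *(watchers[name]() for name in names),
        return_exceptions=True,  # exceptions are returned, not raised
    )
    outcome = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            logger.error("watcher %s crashed: %r", name, result)
        outcome[name] = result
    return outcome


async def healthy_watcher() -> int:
    await asyncio.sleep(0.01)
    return 3  # pretend three opportunities were found


async def crashing_watcher() -> int:
    raise RuntimeError("feed unreachable")


outcome = asyncio.run(run_watchers({"nvd": healthy_watcher, "git": crashing_watcher}))
```

The healthy watcher's result is preserved even though its sibling raised, which is the property the daemon relies on.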

System architecture
Operator / Claude Code (MCP session)
         │
         │  ~175 MCP tools
         ▼
┌─────────────────────────────────────────────────────────┐
│                   MCP Server (mcp_server.py)            │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐ │
│  │  Web/auth  │  │  Static    │  │  Autonomous        │ │
│  │  test tools│  │  analysis  │  │  orchestration     │ │
│  └────────────┘  └────────────┘  └────────────────────┘ │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐ │
│  │  Sandbox   │  │  Knowledge │  │  Campaign planner  │ │
│  │  manager   │  │  layer     │  │  (ATT&CK terms)    │ │
│  └────────────┘  └────────────┘  └────────────────────┘ │
└──────────────────────────┬──────────────────────────────┘
                           │
         ┌─────────────────┼────────────────────┐
         ▼                 ▼                    ▼
  ┌─────────────┐  ┌───────────────┐  ┌───────────────┐
  │  static-    │  │   gpu node    │  │  egress node  │
  │  analysis   │  │  (embeddings  │  │  (VPN egress  │
  │  node       │  │   768-dim     │  │   SOCKS5      │
  │  (fuzzing,  │  │   ChromaDB    │  │   proxy)      │
  │   CodeQL)   │  │   stores)     │  │               │
  └─────────────┘  └───────────────┘  └───────────────┘
         │
         ▼
  sandbox containers  ←  sandbox janitor (5-min TTL sweep)
         │
  ┌──────────────────────────────────────────────────────┐
  │              7-Agent Graph (per hunt)                │
  │  surface_mapper → recon → hypothesizer               │
  │      → exploit_author (×N, semaphore-bounded)        │
  │      → deterministic_validator (cannot be disabled)  │
  │      → skeptical_reviewer (fresh context, no RAG)    │
  │      → reporter → policy_gate                        │
  └──────────────────────────────────────────────────────┘
         │
         ▼
  ReportPipeline  →  pending/  →  operator review  →  approved/  →  submit

Why the discipline matters

The hard part of autonomous security research isn't generating ideas. A language model will happily produce a hundred. The hard part is not believing the wrong ones. Flywheel's reproduce-before-report gate, its adversarial skeptical-reviewer stage, and its novelty gate (which checks closed PRs, NVD, GHSA, and commit history before passing a finding to the queue) all serve one design thesis: an automated hunter is only useful if its operator can trust that every finding it surfaces is real and worth reporting.

That calibration discipline shapes every part of the architecture. The validator cannot be turned off. The skeptical reviewer gets a fresh context with no hypothesizer reasoning visible, so it can't rationalize what the hypothesizer already committed to. The novelty gate fails closed: an API timeout blocks the finding for human review rather than passing it through. Severity is always realistic vector × proven impact, never promoted on the strength of a plausible description.