SOC-in-a-Box — one LLM, eight hats, a production-bar AI SOC

why most autonomous SOCs get unplugged

An AI that auto-isolates a host at 3 AM gets pulled in a month.

The hard question was never "can the AI decide to contain?" It's "how does it hand off to a human in a way the human will actually trust?" Give an agent the keys to a destructive action and the first false positive costs you the whole program.

3:14 AM · CrowdStrike detection on a finance workstation

encoded PowerShell, outbound to an unfamiliar IP — looks like a beacon

autonomous agent, ungated

Classified malicious. Isolating FIN-WS-04 via CrowdStrike RTR now. ✅

⚠ It was a sanctioned pentest. The agent network-isolated a trading desk machine at market open — no human in the loop, no audit trail anyone trusts. That's the last night this SOC runs an AI.

The fix is structural: agents propose and never execute, every decision is replayable from the record, and the handoff addresses a human by role — "Action required from: IR Lead On-Call," not "click here."
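A minimal sketch of what "propose, never execute" can look like in code, assuming a proposal is plain data and the only thing the system does with it is render a role-addressed card and record the human's decision. Every name here (ActionProposal, render_handoff, record_decision, the field names) is hypothetical and for illustration only; it is not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActionProposal:
    """What an agent may emit: a request, never a side effect."""
    case_id: str
    action: str          # e.g. "network-isolate"
    target: str          # e.g. "FIN-WS-04"
    rationale: str
    required_role: str   # the human role that must decide


@dataclass(frozen=True)
class Decision:
    proposal: ActionProposal
    approved: bool
    decided_by: str
    decided_at: str


def render_handoff(p: ActionProposal) -> str:
    """Address the on-call by role and say exactly what is being asked."""
    return (
        f"Action required from: {p.required_role}\n"
        f"Proposed: {p.action} on {p.target} (case {p.case_id})\n"
        f"Why: {p.rationale}"
    )


def record_decision(p: ActionProposal, approved: bool, decided_by: str, audit_log: list) -> Decision:
    """Append the human's decision to the audit log; nothing is executed here."""
    decision = Decision(p, approved, decided_by, datetime.now(timezone.utc).isoformat())
    audit_log.append(decision)
    return decision


proposal = ActionProposal(
    case_id="C-1",
    action="network-isolate",
    target="FIN-WS-04",
    rationale="encoded PowerShell with outbound traffic to an unfamiliar IP",
    required_role="IR Lead On-Call",
)
card = render_handoff(proposal)
audit_log = []
decision = record_decision(proposal, approved=False, decided_by="ir-lead-oncall", audit_log=audit_log)
print(card)
```

In this shape the 3:14 AM scenario ends with a card in front of the IR Lead and a recorded "no", not an isolated trading desk.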

the centerpiece · interactive

Watch an alert move through the org chart.

Follow an alert down the bus: one LLM, a different hat at each stop. The reactive roles run on their own; then everything stops at the human gate, because no agent can touch a real system alone.

1 · an alert lands from the XSOAR feed

Tier 1 · Sentinel (one LLM) ▸ Tier 2 · Analyst (same LLM) ▸ IR Lead · SEV + plan (same LLM) ▸ Threat Intel · Attribution (same LLM)

🧑 human-in-the-loop gate

the architecture decision that carried everything

The bus is the org chart. LangGraph is the runtime.

A SOC isn't a crew running one task or two LLMs in a chat — it's independent roles with their own uptime, audit and restart semantics. So the split is the point: LangGraph gives each role a clean tool-loop, and Redis Streams gives the roles a way to coordinate without knowing about each other.

per-role reasoning

LangGraph · the agent

Each role is a tool-loop with explicit state, a tool whitelist, and a tool-call budget: Tier 2 gets 30 tool calls, IR Lead 15, Threat Intel 12. The framework does the prompt + tool-loop + state plumbing so the role file doesn't have to.

inter-role coordination

Redis Streams · the org chart

Durable events, consumer groups for at-least-once delivery, and an audit stream that's just another consumer. Adding a role is a 200-line file plus a systemd unit — the existing roles never find out.
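To make the consumer-group idea concrete, here is a toy in-memory stand-in for a single stream. It is not Redis and not the project's bus; InMemoryStream and its methods are hypothetical names. It only illustrates the two properties the text relies on: an entry that was delivered but never acknowledged is delivered again, and the audit trail is simply one more group reading the same log. In real Redis Streams the corresponding commands are XADD, XREADGROUP and XACK, and pending entries are re-read explicitly rather than redelivered automatically as they are here.

```python
from collections import defaultdict


class InMemoryStream:
    """Toy stand-in for one stream with consumer groups (not Redis itself)."""

    def __init__(self):
        self.entries = []                    # append-only log of (entry_id, event)
        self.cursor = defaultdict(int)       # group -> index of the next undelivered entry
        self.pending = defaultdict(dict)     # group -> {entry_id: event}, delivered but unacked

    def add(self, event: dict) -> int:
        entry_id = len(self.entries)
        self.entries.append((entry_id, event))
        return entry_id

    def read_group(self, group: str, count: int = 10) -> list:
        """Hand back unacked entries first, then new ones."""
        if self.pending[group]:
            return list(self.pending[group].items())[:count]
        start = self.cursor[group]
        batch = self.entries[start:start + count]
        self.cursor[group] = start + len(batch)
        for entry_id, event in batch:
            self.pending[group][entry_id] = event
        return batch

    def ack(self, group: str, entry_id: int) -> None:
        self.pending[group].pop(entry_id, None)


bus = InMemoryStream()
bus.add({"type": "tier1.verdict", "case": "C-1", "verdict": "malicious"})

# Tier 2 reads the verdict but "crashes" before acking, so it sees it again.
first = bus.read_group("tier2")
again = bus.read_group("tier2")
bus.ack("tier2", again[0][0])

# The audit trail is another group with its own cursor over the same log.
audit = bus.read_group("audit")
```

Because each group keeps its own cursor, adding a new role means adding a new group name; nothing about the existing readers changes.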

CrewAI

a crew runs one task in-process; our roles are independent processes with their own uptime and HITL. It could still slot into a single role's reasoning.

AutoGen

conversation-shaped. But Tier 2 doesn't talk to Tier 1 — it consumes Tier 1's verdict. Chat-history-as-state taxes a problem that isn't a conversation.

Plain LangChain

a synchronous chain forces every role to wait, kills per-role restart, and makes HITL a hack. Fine for two roles. We had eight.

n8n / visual

leadership likes the boxes, but LLM nodes aren't first-class and the graph lives in a DB, not a reviewable PR. Worse auditability.

LangGraph + bus

what we picked. Agent runtime per role; the bus is the coordination. Don't conflate them: that separation is what let memory ship one release later as a query, not a rewrite.

one intelligence, eight jobs

The whole org chart, one model.

The roles aren't different intelligences — they're the same local LLM with a different system prompt, tool budget and output schema. Some react to each event; some wake on a calendar and replay the audit stream; one is a human.
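One way to picture "same model, different hat" is a role as a small piece of configuration wrapped around a shared tool-loop. The sketch below is framework-free and does not use LangGraph's API; RoleSpec, run_role, stub_model and the lookup_ip tool are all hypothetical, and the stub stands in for the shared LLM. The budgets of 30 and 15 are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RoleSpec:
    name: str
    system_prompt: str
    allowed_tools: frozenset
    max_tool_calls: int


def run_role(role: RoleSpec, model: Callable, tools: dict, event: dict) -> dict:
    """Loop until the model returns a final answer or the budget runs out."""
    transcript = [("system", role.system_prompt), ("event", event)]
    for _ in range(role.max_tool_calls):
        step = model(transcript)  # {"tool": ..., "args": ...} or {"final": ...}
        if "final" in step:
            return step["final"]
        name = step["tool"]
        if name not in role.allowed_tools:
            # A denied call still consumes one unit of budget.
            transcript.append(("tool_error", f"{name} is not allowed for {role.name}"))
            continue
        transcript.append(("tool_result", tools[name](**step["args"])))
    return {"verdict": "needs_human_review", "reason": "tool budget exhausted"}


def stub_model(transcript):
    # Hypothetical stand-in for the shared LLM: one lookup, then a verdict.
    if not any(kind == "tool_result" for kind, _ in transcript):
        return {"tool": "lookup_ip", "args": {"ip": "203.0.113.7"}}
    return {"final": {"verdict": "escalate"}}


tools = {"lookup_ip": lambda ip: {"ip": ip, "reputation": "unknown"}}
tier2 = RoleSpec("tier2", "You are a Tier 2 analyst.", frozenset({"lookup_ip"}), 30)
ir_lead = RoleSpec("ir_lead", "You are the IR Lead.", frozenset(), 15)

tier2_result = run_role(tier2, stub_model, tools, {"case": "C-1"})
ir_lead_result = run_role(ir_lead, stub_model, tools, {"case": "C-1"})
```

The same stub produces a verdict under the Tier 2 hat and runs out of budget under the IR Lead hat, because only the whitelist and the budget differ.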

🛰️

Sentinel

Tier 1 triage

Reacts to each new XSOAR ticket, triages it, and publishes a verdict the rest of the floor picks up.

reactive

🔬

Tier 2 Analyst

deeper investigation

Consumes malicious / high-priority verdicts, digs in, and decides: escalate, close, or send for human review.

reactive

🎯

IR Lead

SEV + containment

Takes escalated cases, assigns severity, and proposes a containment plan — behind the HITL gate, always.

reactive

🧭

Threat Intel

actor + MITRE

Attributes the activity to an actor and maps it to ATT&CK — kept out of IR Lead's prompt so both stay tight.

reactive

📋

SOC Manager

shift summaries

Wakes on a timer three times a day, replays the audit stream, and emits a shift summary card.

periodic

🛠️

Detection Engineer

rule tuning

Weekday mornings, finds noisy rules a reactive role would never see — "fired 47 times, 41 closed benign."

periodic

🏹

Threat Hunter

pattern sweeps

Sweeps the audit log twice a day for patterns no single ticket reveals, then proposes hunts.

periodic

🧑

The Human

approval gate

A Webex card addresses an on-call by role; a two-step Flask page records the decision. The AI never executes.

human
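The periodic roles are mostly aggregations over the audit log. As an illustration, the Detection Engineer's "fired 47 times, 41 closed benign" finding could come from a count like the one below. This is a sketch under assumptions: the record shape, the rule names, and both thresholds (min_fires, benign_ratio) are invented for the example, and the case list is synthetic.

```python
from collections import Counter


def noisy_rules(closed_cases, min_fires=20, benign_ratio=0.8):
    """Return (rule, fired, benign) for rules that mostly close benign."""
    fired = Counter(c["rule"] for c in closed_cases)
    benign = Counter(c["rule"] for c in closed_cases if c["outcome"] == "benign")
    report = [
        (rule, n, benign[rule])
        for rule, n in fired.items()
        if n >= min_fires and benign[rule] / n >= benign_ratio
    ]
    # Noisiest first: highest share of benign closures.
    return sorted(report, key=lambda row: row[2] / row[1], reverse=True)


# Synthetic data, chosen to reproduce the example in the text.
cases = (
    [{"rule": "encoded-powershell", "outcome": "benign"}] * 41
    + [{"rule": "encoded-powershell", "outcome": "malicious"}] * 6
    + [{"rule": "rare-parent-process", "outcome": "malicious"}] * 25
)

for rule, n, b in noisy_rules(cases):
    print(f"{rule}: fired {n} times, {b} closed benign")
```

No single ticket shows this pattern, which is why it belongs to a role that wakes on a calendar and reads the whole log.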

Why one model instead of a different provider per role? Cost — one model loaded once, running 24×7. Latency — no inter-provider hop. Simplicity — one health check, one log file. The resilience comes from a langchain-failover chain that transparently falls back to a backup box and flips back the moment the primary recovers — tool-calls intact across the failover.
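The failover behavior can be sketched in a few lines of plain Python. This is not the langchain-failover API; FailoverModel and the two stub backends are hypothetical. The sketch assumes the simplest possible policy: try the primary on every call and use the backup only when that call fails, so the flip back happens on the first call after the primary recovers. Both backends receive the same messages, which is what keeps the conversation and its tool calls intact across the switch.

```python
class FailoverModel:
    """Try the primary on every call; use the backup only when it fails."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.last_used = None

    def invoke(self, messages):
        try:
            reply = self.primary(messages)
            self.last_used = "primary"
        except ConnectionError:
            reply = self.backup(messages)
            self.last_used = "backup"
        return reply


primary_up = {"ok": False}  # toggled to simulate an outage and a recovery


def primary(messages):
    if not primary_up["ok"]:
        raise ConnectionError("primary box unreachable")
    return {"from": "primary", "echo": messages}


def backup(messages):
    return {"from": "backup", "echo": messages}


llm = FailoverModel(primary, backup)
during_outage = llm.invoke(["triage this"])
primary_up["ok"] = True
after_recovery = llm.invoke(["triage this"])
```

A production version would likely add a timeout and a health probe so a slow primary does not stall every call; those are left out here.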

the layer that makes it senior, not just fast

We gave it a past tense.

v1 was amnesiac — every alert was the first alert it had ever seen. The fix wasn't a vector store bolted on the side. Because the SOC was already event-sourced, memory is a read projection over the log it was already writing — a query, not a migration. Four capabilities fall out of it.

🧠

Recall — agents that cite precedent

Before a role decides, it sees the prior cases that share a strong indicator — same actor, hash, domain, CVE. The precedent is injected as something to weigh, not a rule that forces a verdict. Retrieve mechanically; judge semantically.
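"Retrieve mechanically; judge semantically" can be read as: retrieval is an exact-match lookup on strong indicators, and the model only ever sees the result as context. A minimal sketch, assuming indicators are (kind, value) pairs; CaseIndex, as_prompt_context and the sample cases are hypothetical and not the package's real interface.

```python
from collections import defaultdict


class CaseIndex:
    """Exact-match index from strong indicators to prior cases."""

    def __init__(self):
        self._by_indicator = defaultdict(set)   # (kind, value) -> case ids
        self._verdicts = {}

    def add(self, case_id, indicators, verdict):
        self._verdicts[case_id] = verdict
        for ind in indicators:
            self._by_indicator[ind].add(case_id)

    def recall(self, indicators, exclude=None):
        """Prior cases sharing at least one indicator, with what they share."""
        hits = {}
        for ind in indicators:
            for cid in self._by_indicator.get(ind, ()):
                if cid != exclude:
                    hits.setdefault(cid, set()).add(ind)
        return [
            {"case": cid, "verdict": self._verdicts[cid], "shared": sorted(shared)}
            for cid, shared in sorted(hits.items())
        ]


def as_prompt_context(precedents):
    """Phrase precedent as evidence to weigh, not as a rule."""
    lines = ["Prior cases to weigh (not binding):"]
    for p in precedents:
        shared = ", ".join(f"{kind}={value}" for kind, value in p["shared"])
        lines.append(f"- {p['case']} was judged {p['verdict']}; shares {shared}")
    return "\n".join(lines)


index = CaseIndex()
index.add("C-1", [("domain", "bad.example"), ("hash", "abc123")], "malicious")
index.add("C-2", [("domain", "ok.example")], "benign")

precedents = index.recall([("domain", "bad.example"), ("actor", "hypothetical-actor")])
context = as_prompt_context(precedents)
```

Because the lookup is deterministic, the same case always retrieves the same precedent, which keeps a later "why?" answerable from the record.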

📊

Proof — does memory actually help?

Every verdict is scored against analyst-curated ground truth and the per-role accuracy is published on the page, red bands and all. Recall sits behind a flag, so "did precedent help the IR Lead?" is a measured A/B delta — including when the answer is no.
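The A/B measurement reduces to scoring the same cases twice against the same ground truth, once with recall enabled and once without. A sketch, assuming verdicts and ground truth are simple case-to-label mappings; the function names and the four sample cases are made up for illustration and are not results from the project.

```python
def accuracy(verdicts, truth):
    """Share of scored cases where the role's verdict matches ground truth."""
    scored = [cid for cid in verdicts if cid in truth]
    if not scored:
        raise ValueError("no overlap between verdicts and ground truth")
    return sum(verdicts[cid] == truth[cid] for cid in scored) / len(scored)


def recall_delta(with_recall, without_recall, truth):
    """Accuracy with the flag on, with it off, and the difference."""
    on = accuracy(with_recall, truth)
    off = accuracy(without_recall, truth)
    return {"on": on, "off": off, "delta": on - off}


# Synthetic example data, not measurements.
truth = {"C-1": "escalate", "C-2": "close", "C-3": "escalate", "C-4": "close"}
without_recall = {"C-1": "escalate", "C-2": "escalate", "C-3": "close", "C-4": "close"}
with_recall = {"C-1": "escalate", "C-2": "close", "C-3": "close", "C-4": "close"}

result = recall_delta(with_recall, without_recall, truth)
```

A negative or zero delta is a legitimate outcome of this calculation, which is the point of putting recall behind a flag.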

🔍

Interrogation — explained from the record

"Why did it contain that host?" is answered from the recorded trace, never a fresh rationalization. Recall is deterministic; the narrator LLM may only cite what's in the trace. An agent must never sound smarter than it actually was.

🛰️

Campaigns — the view no one ticket gives

Clustering cases on shared infra surfaces the campaign no single-ticket analyst is positioned to see — "these four boxes phoned the same C2." The agent's edge isn't IQ; it's holding every case in the window at once.
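Clustering on shared infrastructure is, at its simplest, finding connected components: two cases are linked if they share an indicator, and links chain. The sketch below uses union-find for that; it is one reasonable way to do it, not necessarily how the project does, and campaign_clusters and the sample cases are hypothetical.

```python
from collections import defaultdict


def campaign_clusters(cases, min_size=2):
    """Group cases linked, directly or transitively, by shared infrastructure.

    cases: mapping of case id -> set of infrastructure indicators.
    """
    parent = {cid: cid for cid in cases}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    first_seen = {}   # indicator -> first case that carried it
    for cid, infra in cases.items():
        for item in infra:
            if item in first_seen:
                parent[find(cid)] = find(first_seen[item])
            else:
                first_seen[item] = cid

    groups = defaultdict(list)
    for cid in cases:
        groups[find(cid)].append(cid)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)


cases = {
    "C-1": {"198.51.100.9"},
    "C-2": {"198.51.100.9", "c2.example"},
    "C-3": {"c2.example"},
    "C-4": {"198.51.100.9"},
    "C-5": {"unrelated.example"},
}
clusters = campaign_clusters(cases)
print(clusters)
```

C-3 never touched the shared IP, yet it lands in the same campaign through C-2, which is the link a single-ticket view cannot see.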

numbers, not vibes

Proven against 32K real tickets before it ever touched a live alert.

The hardest sell to a SOC director isn't "we built it" — it's "how do you know it works?" So the cascade replays historical CrowdStrike tickets with every side effect neutered, and scores its escalations against what a human analyst actually did.

true positive · human escalated and the cascade escalated to IR Lead

false negative · human escalated but Tier 2 closed it

false positive · human closed but the cascade escalated

true negative · human closed and Tier 2 closed it

Precision and recall on that confusion matrix answer the question leadership actually asks: how often does the AI escalate when a human would, and how often does it cry wolf? A --dry-run mode swaps the LLM for a canned-JSON stub so the whole pipeline validates end-to-end in under two seconds without burning a token. The number lands in JSON the dashboard reads — a metric, not a vibe.
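For reference, the scoring itself is small. Treating "escalate" as the positive class, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below assumes the replay yields one (human_escalated, cascade_escalated) pair per ticket; the function name, the output keys, and the counts in the example are illustrative, not the project's real numbers.

```python
import json


def score(pairs):
    """pairs: iterable of (human_escalated, cascade_escalated) booleans."""
    tp = fp = fn = tn = 0
    for human, cascade in pairs:
        if human and cascade:
            tp += 1
        elif human and not cascade:
            fn += 1
        elif cascade:
            fp += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0   # how often an escalation was warranted
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many human escalations were caught
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "precision": precision, "recall": recall}


# Synthetic replay: 8 TP, 2 FN, 2 FP, 88 TN.
pairs = (
    [(True, True)] * 8
    + [(True, False)] * 2
    + [(False, True)] * 2
    + [(False, False)] * 88
)
metrics = score(pairs)
print(json.dumps(metrics))
```

"Crying wolf" is the complement of precision: here 2 of the 10 escalations were ones a human would have closed.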

we open-sourced the kernel

The reusable heart is on PyPI — bring your own everything.

The event contract, the bus, case memory as a read model, and the role-agent framework are extracted into a vendor-neutral package. Three injection seams — chat model, alert source, tools — mean nothing in the kernel knows about any one vendor. SOC-in-a-Box is literally this package with my environment plugged in.

soc.py

# pip install "aisoc[agent]"  (bring your own LLM, alerts, and tools)
from aisoc import Bus, RoleTeam, Memory

# my_llm, my_tools, alert_source, my_hitl_gate and case are placeholders you supply

bus  = Bus()                                    # stdlib in-memory, or Redis Streams for durable
team = RoleTeam(model=my_llm, tools=my_tools)   # the same intelligence, many hats

# human-in-the-loop: agents propose, a human authorizes, nothing auto-executes
# (registered before consuming, so no proposal can arrive ungated)
team.on_action_proposed(gate=my_hitl_gate)

# an alert lands → the team works the case over the shared log
team.consume(alert_source)                      # Sentinel → Tier 2 → IR Lead → Threat Intel

# memory is a query over the audit log the bus already kept
mem = Memory(bus.audit)
mem.recall_similar(case)                        # precedent by shared strong indicators
mem.find_campaign_clusters()                    # the cross-incident view, read-only

pip install aisoc[agent] | seam · chat model | seam · alert source | seam · tools | bus · in-memory or Redis | brain · langchain-failover

where it came from

An AVP-sponsored experiment that earned its trust.

SOC-in-a-Box is the parent of the OSS family — iocflow and detflow are the lessons from it, packaged.

The most useful thing to copy isn't any one agent — an LLM-with-tools loop isn't novel. It's the four ideas that turned eight separate experiments into one thing a SOC team would actually run: the bus shape, the schema-per-event discipline, the audit-stream-as-truth pattern, and the HITL handoff that addresses a human by role. Get those right and "the model orchestrates but never does the irreversible work" stops being a slogan and becomes the architecture. That's the lesson the whole library family carries forward.