A production-bar AI SOC where one local LLM plays eight analyst roles over a bus — and the agents propose while a human executes.
Sentinel, Tier 2, IR Lead, Threat Intel, SOC Manager, Detection Engineer, Threat Hunter and a human gate — one GLM model wearing every hat, coordinated over Redis Streams, read-only against real systems, with an event-sourced case memory that makes it senior instead of just fast.
The hard question was never "can the AI decide to contain?" It's "how does it hand off to a human in a way the human will actually trust?" Give an agent the keys to a destructive action and the first false positive costs you the whole program.
The fix is structural: agents propose and never execute, every decision is replayable from the record, and the handoff addresses a human by role — "Action required from: IR Lead On-Call," not "click here."
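The propose-never-execute rule can be sketched in a few lines. This is a minimal illustration, not the project's actual gate: `ActionProposal`, `HumanGate`, and their fields are hypothetical names chosen for the example. The agent can only build a proposal object; the gate holds it, renders the role-addressed handoff message, and refuses a decision from anyone who does not hold the required role.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ActionProposal:
    """What an agent may emit: a request for action, never the action itself."""
    case_id: str
    action: str          # hypothetical action name, e.g. "isolate_host"
    target: str          # e.g. a hostname
    rationale: str
    required_role: str   # who must authorize, addressed by role


@dataclass
class HumanGate:
    """Holds proposals until a human in the required role decides."""
    pending: Dict[str, ActionProposal] = field(default_factory=dict)
    audit: List[Tuple[str, str, str]] = field(default_factory=list)

    def submit(self, p: ActionProposal) -> str:
        self.pending[p.case_id] = p
        self.audit.append(("proposed", p.case_id, p.action))
        # The handoff names a role, not a button.
        return f"Action required from: {p.required_role} | {p.action} on {p.target}"

    def decide(self, case_id: str, approver_role: str,
               approve: bool) -> Optional[ActionProposal]:
        p = self.pending[case_id]
        if approver_role != p.required_role:
            raise PermissionError(f"{approver_role} cannot authorize this action")
        del self.pending[case_id]
        verdict = "approved" if approve else "rejected"
        self.audit.append((verdict, case_id, approver_role))
        # Even on approval the gate only returns the proposal;
        # a human (or their tooling) carries out the action.
        return p if approve else None
```

Nothing in this sketch touches a real system: the approved proposal is handed back to the caller, and every step is appended to `audit` so the decision can be replayed later.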
Pick an alert and follow it down the bus — one LLM, a different hat at each stop. The reactive roles run on their own; then everything stops at the human gate, because no agent can touch a real system alone.
A SOC isn't a crew running one task or two LLMs in a chat — it's independent roles with their own uptime, audit and restart semantics. So the split is the point: LangGraph gives each role a clean tool-loop, and Redis Streams gives the roles a way to coordinate without knowing about each other.
Each role is a tool-loop with explicit state and a tool whitelist — Tier 2 gets 30 tool calls, IR Lead 15, Threat Intel 12. The framework does the prompt + tool-loop + state plumbing so the role file doesn't have to.
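To make the whitelist and budget concrete, here is a framework-free sketch. It is not LangGraph's API: `RoleSpec`, `run_tool_loop`, and the tool names are hypothetical, and the `requests` list stands in for the tool calls a model would ask for turn by turn. Only the three budgets (30, 15, 12) come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


class ToolBudgetExceeded(RuntimeError):
    """Raised when a role asks for more tool calls than its budget allows."""


@dataclass(frozen=True)
class RoleSpec:
    name: str
    allowed_tools: FrozenSet[str]   # the whitelist
    max_tool_calls: int             # the budget


def run_tool_loop(role: RoleSpec,
                  requests: List[Tuple[str, str]],
                  tools: Dict[str, Callable[[str], str]]) -> List[str]:
    """Run requested tool calls in order, enforcing budget and whitelist."""
    results: List[str] = []
    for calls_made, (tool_name, arg) in enumerate(requests):
        if calls_made >= role.max_tool_calls:
            raise ToolBudgetExceeded(
                f"{role.name} exhausted {role.max_tool_calls} tool calls")
        if tool_name not in role.allowed_tools:
            raise PermissionError(f"{role.name} may not call {tool_name}")
        results.append(tools[tool_name](arg))
    return results


# Budgets from the text; the tool names are illustrative only.
TIER2 = RoleSpec("Tier 2", frozenset({"lookup_ip", "get_process_tree"}), 30)
IR_LEAD = RoleSpec("IR Lead", frozenset({"lookup_ip"}), 15)
THREAT_INTEL = RoleSpec("Threat Intel", frozenset({"lookup_ip"}), 12)
```

Keeping the limits in a declarative `RoleSpec` rather than in the prompt means a misbehaving model cannot talk its way past them; the loop fails closed.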
Durable events, consumer groups for at-least-once delivery, and an audit stream that's just another consumer. Adding a role is a 200-line file plus a systemd unit — the existing roles never find out.
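The delivery semantics are easier to see in a toy model. The class below is an in-memory stand-in written for this example, not Redis and not the project's bus; in Redis Streams the corresponding commands are XADD, XREADGROUP, and XACK. It simplifies redelivery (real Redis tracks pending entries per consumer and needs an explicit re-read or claim), but it shows the two properties that matter: an unacknowledged event comes back, and a new consumer group such as an audit reader sees the full log without the other groups knowing.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[int, dict]


class MiniStream:
    """Toy append-only stream with per-group cursors and pending lists."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []
        self.cursor: Dict[str, int] = defaultdict(int)                 # next unread index
        self.pending: Dict[str, Dict[int, dict]] = defaultdict(dict)   # delivered, unacked

    def add(self, event: dict) -> int:
        entry_id = len(self.entries)
        self.entries.append((entry_id, event))
        return entry_id

    def read(self, group: str, count: int = 10) -> List[Entry]:
        """Redeliver unacknowledged entries first, then hand out new ones."""
        if self.pending[group]:
            return sorted(self.pending[group].items())[:count]
        start = self.cursor[group]
        batch = self.entries[start:start + count]
        self.cursor[group] = start + len(batch)
        for entry_id, event in batch:
            self.pending[group][entry_id] = event
        return batch

    def ack(self, group: str, entry_id: int) -> None:
        """Mark an entry as fully processed for this group."""
        self.pending[group].pop(entry_id, None)


stream = MiniStream()
stream.add({"type": "alert.created", "alert_id": "a1"})

delivered = stream.read("tier2")      # a worker takes the event...
redelivered = stream.read("tier2")    # ...crashes before ack, so it comes back
stream.ack("tier2", delivered[0][0])  # processed: now it is done for this group

audit_view = stream.read("audit")     # a brand-new group still sees everything
```

At-least-once means a role may see the same event twice, so role handlers in a design like this need to be idempotent.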
The roles aren't different intelligences — they're the same local LLM with a different system prompt, tool budget and output schema. Some react to each event; some wake on a calendar and replay the audit stream; one is a human.
Why one model instead of a different provider per role? Cost — one model loaded once, running 24×7. Latency — no inter-provider hop. Simplicity — one health check, one log file. The resilience comes from a langchain-failover chain that transparently falls back to a backup box and flips back the moment the primary recovers — tool-calls intact across the failover.
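The fail-over-then-flip-back behaviour can be illustrated with a stdlib-only sketch. This is not the API of the failover chain named above; `FailoverChain` and its methods are hypothetical, and the backends are plain callables standing in for model endpoints. Trying the primary first on every call is the simplest way to get "flips back the moment the primary recovers".

```python
from typing import Callable, List, Optional, Tuple

Backend = Tuple[str, Callable[[str], str]]


class FailoverChain:
    """Try backends in priority order on every call.

    Because the primary is always attempted first, traffic returns to it on
    the first call after it recovers. A real deployment would probably add a
    health probe or cooldown so a dead primary does not cost a timeout on
    every request.
    """

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends
        self.last_used: Optional[str] = None

    def invoke(self, prompt: str) -> str:
        errors: List[str] = []
        for name, call in self.backends:
            try:
                result = call(prompt)
            except ConnectionError as exc:
                errors.append(f"{name}: {exc}")
                continue
            self.last_used = name
            return result
        raise RuntimeError("all backends failed: " + "; ".join(errors))


# Simulated endpoints: the primary starts down, then recovers.
primary_up = {"value": False}


def primary(prompt: str) -> str:
    if not primary_up["value"]:
        raise ConnectionError("primary unreachable")
    return "primary:" + prompt


def backup(prompt: str) -> str:
    return "backup:" + prompt


chain = FailoverChain([("primary", primary), ("backup", backup)])
during_outage = chain.invoke("triage alert a1")
primary_up["value"] = True
after_recovery = chain.invoke("triage alert a1")
```

Keeping tool calls intact across the switch additionally requires that both boxes serve the same model and tool schema, which this sketch does not model.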
v1 was amnesiac — every alert was the first alert it had ever seen. The fix wasn't a vector store bolted on the side. Because the SOC was already event-sourced, memory is a read projection over the log it was already writing — a query, not a migration. Four capabilities fall out of it.
Before a role decides, it sees the prior cases that share a strong indicator — same actor, hash, domain, CVE. The precedent is injected as something to weigh, not a rule that forces a verdict. Retrieve mechanically; judge semantically.
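"Retrieve mechanically" can be as plain as an inverted index built by replaying the log. The sketch below assumes a hypothetical event shape (a `case_id` plus a list of `(kind, value)` indicators) and is not the package's implementation. It returns the shared indicators alongside each prior case, so the role receives evidence to weigh and not a verdict.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Indicator = Tuple[str, str]

# Strong indicator kinds named in the text; weaker ones (e.g. a bare IP) are skipped.
STRONG_KINDS = {"actor", "hash", "domain", "cve"}


def build_indicator_index(audit_log: Iterable[dict]) -> Dict[Indicator, Set[str]]:
    """Project the event log into {(kind, value): {case_id, ...}}."""
    index: Dict[Indicator, Set[str]] = defaultdict(set)
    for event in audit_log:
        for kind, value in event.get("indicators", []):
            if kind in STRONG_KINDS:
                index[(kind, value.lower())].add(event["case_id"])
    return index


def recall_similar(index: Dict[Indicator, Set[str]], case_id: str,
                   indicators: Iterable[Indicator]) -> List[Tuple[str, List[Indicator]]]:
    """Prior cases sharing a strong indicator, most overlap first."""
    hits: Dict[str, List[Indicator]] = defaultdict(list)
    for kind, value in indicators:
        key = (kind, value.lower())
        for other in index.get(key, ()):
            if other != case_id:
                hits[other].append(key)
    return sorted(hits.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Illustrative log entries, not real case data.
log = [
    {"case_id": "c1", "indicators": [("domain", "evil.example"), ("ip", "10.0.0.5")]},
    {"case_id": "c2", "indicators": [("hash", "ABC123"), ("domain", "evil.example")]},
    {"case_id": "c3", "indicators": [("ip", "10.0.0.5")]},
]
index = build_indicator_index(log)
precedent = recall_similar(
    index, "c9",
    [("domain", "Evil.Example"), ("hash", "abc123"), ("ip", "10.0.0.5")],
)
```

Because the index is derived entirely from the log, it can be dropped and rebuilt at any time, which is what makes it a read projection and not a second source of truth.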
Every verdict is scored against analyst-curated ground truth, and the per-role accuracy is published on the page, red bands and all. Precedent recall sits behind a feature flag, so "did precedent help the IR Lead?" is a measured A/B delta, including when the answer is no.
"Why did it contain that host?" is answered from the recorded trace, never a fresh rationalization. Recall is deterministic; the narrator LLM may only cite what's in the trace. An agent must never sound smarter than it actually was.
Clustering cases on shared infra surfaces the campaign no single-ticket analyst is positioned to see — "these four boxes phoned the same C2." The agent's edge isn't IQ; it's holding every case in the window at once.
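One simple way to cluster on shared infrastructure is to treat cases as nodes, link any two that share an indicator, and take connected components. The sketch below uses union-find for that; it is an assumption about how such clustering could work, not the project's algorithm, and the case data is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_campaign_clusters(case_infra: Dict[str, Set[str]],
                           min_size: int = 2) -> List[List[str]]:
    """Group cases that are linked, directly or transitively, by shared infra."""
    parent = {case_id: case_id for case_id in case_infra}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen: Dict[str, str] = {}
    for case_id, infra in case_infra.items():
        for indicator in infra:
            if indicator in first_seen:
                union(first_seen[indicator], case_id)
            else:
                first_seen[indicator] = case_id

    groups: Dict[str, Set[str]] = defaultdict(set)
    for case_id in case_infra:
        groups[find(case_id)].add(case_id)

    clusters = [sorted(g) for g in groups.values() if len(g) >= min_size]
    return sorted(clusters, key=lambda c: (-len(c), c))


# Invented example: four hosts touching the same C2, one unrelated case.
cases = {
    "c1": {"c2.evil.example"},
    "c2": {"c2.evil.example", "203.0.113.7"},
    "c3": {"203.0.113.7"},
    "c4": {"c2.evil.example"},
    "c5": {"198.51.100.9"},
}
clusters = find_campaign_clusters(cases)
```

Transitive linking is what surfaces `c3` here even though it never touched the domain directly. It also means a widely shared indicator, such as a CDN address, would merge unrelated cases, so the input should be limited to indicators that are actually discriminating.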
The hardest sell to a SOC director isn't "we built it" — it's "how do you know it works?" So the cascade replays historical CrowdStrike tickets with every side effect neutered, and scores its escalations against what a human analyst actually did.
Precision and recall on that confusion matrix answer the question leadership actually asks: how often does the AI escalate when a human would, and how often does it cry wolf? A --dry-run mode swaps the LLM for a canned-JSON stub so the whole pipeline validates end-to-end in under two seconds without burning a token. The number lands in JSON the dashboard reads — a metric, not a vibe.
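For reference, both numbers fall out of four counts: precision is TP / (TP + FP) and recall is TP / (TP + FN), where a true positive is a ticket both the AI and the human escalated. A minimal sketch follows, with a hypothetical function name and made-up replay outcomes; the JSON keys are assumptions, not the dashboard's actual schema.

```python
import json
from typing import Iterable, Optional, Tuple


def escalation_metrics(pairs: Iterable[Tuple[bool, bool]]) -> dict:
    """pairs: (ai_escalated, human_escalated) for each replayed ticket."""
    tp = fp = fn = tn = 0
    for ai, human in pairs:
        if ai and human:
            tp += 1        # both escalated
        elif ai and not human:
            fp += 1        # the AI cried wolf
        elif human:
            fn += 1        # the AI missed an escalation
        else:
            tn += 1        # both closed it out
    # None, not 0.0, when a ratio is undefined (no escalations to score).
    precision: Optional[float] = tp / (tp + fp) if (tp + fp) else None
    recall: Optional[float] = tp / (tp + fn) if (tp + fn) else None
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Made-up outcomes for ten replayed tickets, purely illustrative.
replayed = ([(True, True)] * 3 + [(True, False)] * 1
            + [(False, True)] * 2 + [(False, False)] * 4)
report = escalation_metrics(replayed)
report_json = json.dumps(report)  # the form a dashboard could read
```

Reporting `None` for an undefined ratio avoids the trap of a replay set with no escalations showing up as a perfect or a zero score.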
The event contract, the bus, case memory as a read model, and the role-agent framework are extracted into a vendor-neutral package. Three injection seams — chat model, alert source, tools — mean nothing in the kernel knows about any one vendor. SOC-in-a-Box is literally this package with my environment plugged in.
# pip install "aisoc[agent]"
# bring your own LLM, alerts, and tools
from aisoc import Bus, RoleTeam, Memory

bus = Bus()  # stdlib in-memory, or Redis Streams for durable
team = RoleTeam(model=my_llm, tools=my_tools)  # the same intelligence, many hats

# an alert lands → the team works the case over the shared log
team.consume(alert_source)  # Sentinel → Tier 2 → IR Lead → Threat Intel

# human-in-the-loop: agents propose, a human authorizes, nothing auto-executes
team.on_action_proposed(gate=my_hitl_gate)

# memory is a query over the audit log the bus already kept
mem = Memory(bus.audit)
mem.recall_similar(case)  # precedent by shared strong indicators
mem.find_campaign_clusters()  # the cross-incident view, read-only
SOC-in-a-Box is the parent of the OSS family — iocflow and detflow are the lessons from it, packaged.
The most useful thing to copy isn't any one agent — an LLM-with-tools loop isn't novel. It's the four ideas that turned eight separate experiments into one thing a SOC team would actually run: the bus shape, the schema-per-event discipline, the audit-stream-as-truth pattern, and the HITL handoff that addresses a human by role. Get those right and "the model orchestrates but never does the irreversible work" stops being a slogan and becomes the architecture. That's the lesson the whole library family carries forward.