faultline — catch the moment your agent silently does the wrong thing

The silent failure

A 200 OK can still be completely wrong.

Your tests check that the agent runs. They don't check that a tool quietly handed it a stale price, a truncated list, or a number off by a factor of ten — and that the agent acted on it anyway. No exception. No alert. Just a wrong action with a price tag.

> agent.run("reorder low-stock items")
tool get_inventory() → 200 OK
returned { sku: "A-12", qty: 2 } # real: 240
agent place_order(sku="A-12", qty=238)
✓ run completed — exit 0
✗ faultline: SILENT‑WRONG — acted on a corrupted value

Watch the catch

A real Claude agent, caught moving money wrong.

A realistic 7-tool support agent (refunds, emails, tickets), written normally with no planted bugs. On honest data it's clean. Then a stale cache bends one number and the agent refunds $210 on a $42 order, with no error anywhere. faultline catches it, and it also reports honestly which faults the model handled fine. The whole interrogation cost $0.16, and the agent and battery ship in the repo so you can re-run it: pip install faultline.

run it yourself, live: faultlineapp.com/demo.html

How it works

Six ways to break it before production does.

Every mode injects realistic faults into your tools and watches what the agent does — then gates CI when it does the wrong thing. Start with zero config: faultline scan agent.py:my_agent — no suite file, no rules, it finds your tools and breaks them itself.

probe

Honest edge cases you define — the inputs you already worry about.

faultline probe suite.py

fuzz

Auto-generated edge inputs: empty, null, bent numbers, dropped keys.

faultline fuzz suite.py
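
As a rough illustration of what "auto-generated edge inputs" can mean, the sketch below derives variants from one honest tool response. The function name `edge_variants` and the specific mutations are assumptions made for this example, not faultline's actual generator.

```python
import copy


def edge_variants(payload):
    """Yield (label, variant) pairs derived from one honest dict payload."""
    yield "empty", {}
    yield "null", None
    for key, value in payload.items():
        # Dropped key: the field is simply missing from the response.
        dropped = copy.deepcopy(payload)
        del dropped[key]
        yield f"drop:{key}", dropped
        # Bent number: scale numeric fields (bools are excluded on purpose).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            for factor in (0, -1, 10):
                bent = copy.deepcopy(payload)
                bent[key] = value * factor
                yield f"bend:{key}x{factor}", bent


variants = dict(edge_variants({"sku": "A-12", "qty": 24}))
for label, variant in variants.items():
    print(label, variant)
```

Each variant would then be fed to the agent in place of the honest response, one at a time, so a wrong action can be traced back to a single mutation.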

scenarios

Hard real-world situations the agent has to reason through.

faultline scenarios suite.py

replay

Re-run a recorded trace and confirm the verdict still holds.

faultline replay run.json

mine

Learn invariants from good runs, then enforce them.

faultline mine suite.py
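
One simple way to read "learn invariants from good runs" is to record the numeric range each action field takes across known-good runs and flag anything outside it. The sketch below does only that. `mine_ranges` and `violations` are hypothetical helpers, and the invariants faultline actually mines may be richer.

```python
def mine_ranges(good_runs):
    """Learn a (min, max) range per numeric field from known-good actions."""
    ranges = {}
    for action in good_runs:
        for field, value in action.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                lo, hi = ranges.get(field, (value, value))
                ranges[field] = (min(lo, value), max(hi, value))
    return ranges


def violations(action, ranges):
    """Return the fields of an action that fall outside the learned ranges."""
    return [
        field
        for field, (lo, hi) in ranges.items()
        if field in action and not lo <= action[field] <= hi
    ]


good_runs = [{"refund": 42.0}, {"refund": 18.5}, {"refund": 60.0}]
ranges = mine_ranges(good_runs)
print(ranges)                                 # {'refund': (18.5, 60.0)}
print(violations({"refund": 210.0}, ranges))  # ['refund']
```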

chaos

The full fault library — timeouts, stale data, truncation, wrong numbers.

faultline run suite.py

The fault library

From the failures your tests catch — to the ones they don't.

Loud

timeout

The tool hangs. Your tests already see this — it throws.

get_quote() → TimeoutError

✓ your tests catch this

Loud

server-error

A 5xx comes back. Loud, logged, handled.

fetch() → 503

✓ your tests catch this

Silent

truncate

The list comes back short. No error — the agent just sees fewer rows.

list_orders() → [3 of 240]

✗ acts on a partial list

Silent

null-response

An empty payload reads as "nothing found" instead of "lookup failed."

lookup() → null

✗ proceeds as if empty
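
The ambiguity here comes from using one value, null, for two different outcomes. A common way to remove it, sketched below with hypothetical names (`LookupResult`, `lookup`, `describe`), is to have the tool return an explicit status alongside the data so the agent can tell the two cases apart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LookupResult:
    ok: bool               # False means the lookup itself failed
    value: Optional[dict]  # None with ok=True means "genuinely not found"


def lookup(db, key, fail=False):
    """Stub tool: `fail` simulates a backend error without raising."""
    if fail:
        return LookupResult(ok=False, value=None)
    return LookupResult(ok=True, value=db.get(key))


def describe(result):
    """What a careful agent should conclude from each outcome."""
    if not result.ok:
        return "lookup failed: retry or escalate"
    if result.value is None:
        return "nothing found"
    return "found"


db = {"A-12": {"qty": 240}}
print(describe(lookup(db, "A-12")))             # found
print(describe(lookup(db, "Z-99")))             # nothing found
print(describe(lookup(db, "A-12", fail=True)))  # lookup failed: retry or escalate
```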

Silent

stale-data

Yesterday's price returned as today's. 200 OK. Completely wrong.

get_price() → $41 (now $58)

✗ trades on stale data

Silent

wrong-number

A quantity comes back off by 10× or more. The agent orders on it without blinking.

qty: 2 → 240

✗ silent-wrong — the headline
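
The four silent faults above are easy to state as small functions. The sketch below is illustrative only: the function names mirror the fault names on this page, but the signatures and behavior are assumptions, not faultline's fault library.

```python
def truncate(rows, keep=3):
    """Silently return only the first `keep` rows."""
    return rows[:keep]


def null_response(response):
    """Replace any payload with None, which reads like 'nothing found'."""
    return None


def stale(cached, fresh):
    """Serve the cached value and ignore the fresh one."""
    return cached


def wrong_number(response, field, factor=10):
    """Return a copy with one numeric field scaled by `factor`."""
    return {**response, field: response[field] * factor}


orders = list(range(240))
print(len(truncate(orders)), "of", len(orders))  # 3 of 240
print(null_response({"id": 7}))                  # None
print(stale(cached=41, fresh=58))                # 41
print(wrong_number({"qty": 24}, "qty"))          # {'qty': 240}
```

None of these raise, which is the point: each returns a well-formed value that a status-code check would accept.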

The interceptor

It sits between your agent and its tools.

faultline wraps every tool call, swaps the real response for a corrupted one, and compares what the agent does against what it should do — by your invariants, not a second model's opinion.

One line wires it into the framework you already use — fl.instrument(graph) for LangGraph, LangChain, LlamaIndex, pydantic-ai, and crewAI, each verified against the real installed library.
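
Conceptually, an interceptor of this kind can be sketched in a few lines of plain Python. This is not faultline's implementation: `intercept`, `bend_qty`, the toy `agent`, and the `check` invariant are all hypothetical, and the real wrapping and framework hooks will differ.

```python
import functools


def intercept(tool, fault):
    """Wrap a tool so every response passes through `fault` before the agent sees it."""
    @functools.wraps(tool)
    def wrapped(*args, **kwargs):
        return fault(tool(*args, **kwargs))
    return wrapped


def get_inventory(sku):
    return {"sku": sku, "qty": 240}


def bend_qty(response):
    """Fault: return a copy with the quantity silently changed."""
    return {**response, "qty": 2}


def agent(inventory_tool, sku, target=240):
    """Toy agent that restocks up to `target` using whatever the tool reports."""
    reported = inventory_tool(sku)["qty"]
    return {"sku": sku, "qty": max(target - reported, 0)}


def check(action, truth, target=240):
    """Deterministic verdict: compare the action with what the real data implies."""
    expected = max(target - truth["qty"], 0)
    return "ok" if action["qty"] == expected else "SILENT-WRONG"


truth = get_inventory("A-12")
print(check(agent(get_inventory, "A-12"), truth))                       # ok
print(check(agent(intercept(get_inventory, bend_qty), "A-12"), truth))  # SILENT-WRONG
```

The verdict comes from a plain comparison against known-good data, so the same input always yields the same result.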

Measured, not claimed

On an 85-case benchmark with adversarial traps.

recall · 39 of 40

false-alarm · 1 of 45

deterministic

LLM judges

Self-authored, independently audited, zero labels overturned. Run it yourself →

Benchmark agents are deterministic Python. Separately, faultline has been demonstrated catching a silent-wrong on a real Claude tool-calling agent — one scenario, not a rate. Reproduce it in 5 minutes →
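
For readers who prefer rates to counts, the figures quoted in this section convert as follows. This is plain arithmetic, assuming the 40 are cases containing a real silent failure and the 45 are cases without one; the variable names are illustrative.

```python
caught, faulty_cases = 39, 40      # recall: silent failures caught
false_alarms, clean_cases = 1, 45  # false alarms raised on clean cases

recall = caught / faulty_cases
false_alarm_rate = false_alarms / clean_cases

print(f"cases: {faulty_cases + clean_cases}")       # cases: 85
print(f"recall: {recall:.1%}")                      # recall: 97.5%
print(f"false-alarm rate: {false_alarm_rate:.1%}")  # false-alarm rate: 2.2%
```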

Where this sits

Not another eval platform.

Evals, observability and faultline answer three different questions. Most teams shipping serious agents will end up wanting all three — faultline is built to sit beside your existing stack, not replace it.

Question 1 · quality

"Is the answer good?"

Eval platforms score outputs against datasets and rubrics — prompt regressions, answer quality, A/B-ing models. Usually judged by an LLM, which is the right tool for subjective quality.

LangSmith · Braintrust · DeepEval · promptfoo

Question 2 · visibility

"What happened in prod?"

Observability traces every call your agent made in production — latency, cost, token usage, full request trees. Essential once you've shipped; it explains failures after they happen.

Langfuse · Arize Phoenix · OpenInference · Datadog

Question 3 · behavior under failure

"What does it DO when its tools lie?"

faultline corrupts your agent's tool data on purpose — wrong numbers, stale or empty responses — and deterministically catches the agent silently acting on it. No LLM judge, no flaky verdicts: a CI gate and a runtime seatbelt for the failure the other two layers aren't built to catch.

faultline — free, open source

commit → unit tests → evals (quality gate) → faultline (silent-failure gate) → deploy → observability + faultline guard (runtime)

the same agent, tested both ways — your eval passes it, faultline catches it

How it scales

From a free CI gate to tamper-evident evidence.

Free

CI gate

A GitHub Action that fails the build the moment your agent silently mishandles a fault. Six modes, all gate CI.

Runtime

Runtime guard

The same checks, live. A seatbelt that blocks an irreversible action before it fires on rule-breaking data.

Evidence

Attestation

A tamper-evident report — edit one verdict and the hash breaks. Reproducible evidence an auditor can re-check.

Install

Integrated in minutes. Permanent in your repo.

pip install faultline — pure standard library, zero dependencies. First verdict in two minutes: faultline init scaffolds a suite + CI workflow, faultline doctor preflights your agent, faultline scan breaks its tools. Then each rung lives in a different layer of your project.

In your repo · every PR

CI gate — free

Run faultline init — it writes this workflow and a starter suite for you. Every pull request then fails the build on a silent failure.

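# .github/workflows/faultline.yml (trigger and job wrapper assumed; faultline init writes the real file)
on: [pull_request]
jobs:
  faultline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aaravanmay/faultline@main
        with:
          suite: faultline_suite.py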

In your production code

Runtime guard

Wrap the irreversible action once. In enforce mode a rule-breaking action raises before it fires.

order = fl.wrap(order, is_action=True)

with fl.guard([no_oversell], mode="enforce"):
    agent.run(task)  # bad action → blocked
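
To make the seatbelt idea concrete, here is a minimal standalone sketch of an enforce-mode check. It deliberately does not import faultline: `guarded`, `RuleViolation`, and this version of `no_oversell` are hypothetical stand-ins, and the real `fl.wrap` and `fl.guard` may work differently.

```python
class RuleViolation(Exception):
    """Raised before an irreversible action runs on rule-breaking data."""


def guarded(action, rules):
    """Wrap an action so every rule is checked before the action fires."""
    def wrapper(**kwargs):
        for rule in rules:
            if not rule(**kwargs):
                raise RuleViolation(f"{rule.__name__} blocked {action.__name__}({kwargs})")
        return action(**kwargs)
    return wrapper


STOCK = {"A-12": 240}
placed = []


def order(sku, qty):
    placed.append((sku, qty))  # stands in for the irreversible side effect
    return "ordered"


def no_oversell(sku, qty):
    """Rule: never commit more units than are actually in stock."""
    return qty <= STOCK.get(sku, 0)


safe_order = guarded(order, [no_oversell])
print(safe_order(sku="A-12", qty=10))  # ordered
try:
    safe_order(sku="A-12", qty=2400)
except RuleViolation as exc:
    print("blocked:", exc)
print(placed)                          # [('A-12', 10)]
```

The rule runs before the side effect, so a blocked call leaves no trace in `placed`.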

In your release pipeline

Attestation

Each build writes a tamper-evident verdict file. Anyone can re-verify it — edit one number and the hash breaks.

faultline attest suite.py
faultline verify faultline.report.json
# verified: 3 verdict(s), hash OK
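
The tamper-evidence property can be illustrated with a hash over a canonical serialization of the verdicts. This sketch uses only `hashlib` and `json` from the standard library. The report layout and the function names are assumptions, not the actual format of faultline.report.json.

```python
import hashlib
import json


def digest(verdicts):
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(verdicts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attest(verdicts):
    """Bundle the verdicts with their digest."""
    return {"verdicts": verdicts, "sha256": digest(verdicts)}


def verify(report):
    """Recompute the digest and compare it with the stored one."""
    return digest(report["verdicts"]) == report["sha256"]


report = attest([
    {"case": "stale-data", "verdict": "pass"},
    {"case": "truncate", "verdict": "pass"},
    {"case": "wrong-number", "verdict": "fail"},
])
print(verify(report))                      # True

report["verdicts"][2]["verdict"] = "pass"  # tamper with one verdict
print(verify(report))                      # False
```

A bare hash only helps if the digest is stored or published somewhere the editor of the report cannot also change; signing the digest is a common next step.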

Your agent doesn't crash. It quietly does the wrong thing.

One input, one verdict — every single run.

Every silent failure, on one screen.

Ship agents you can actually trust.