YouHaveToReadTheOutput

Every. Single. Line.

At every stage of the process

Because garbage in == garbage out

I have used LLMs to do a lot of different kinds of things. Generating text, generating code, summarizing text, doing deep research, parsing data, you name it.

They have made shit up at some point on every single one of those tasks.

If you're going to put your name on it, you'd better know what it says. You'd better know that it's correct. (Not just plausible.) You'd better be willing to stake your reputation on it. Because if something makes it through, that can have real-world consequences.

Always expand the thinking blocks to see what "reasoning" it used to come to its conclusions.

Often you'll find some bit of flawed reasoning, or a "fact" without a source, that will affect the reasoning and logic of everything that follows.

Effect chain types

Linear

A therefore B therefore C therefore D.

graph LR
A-->B-->C-->D
classDef bad stroke:red,stroke-width:2px;
class B,C,D bad

If B isn't quite right, then C might also be wrong, and therefore D might be wrong.

Example: LLM recommends a dated coding pattern that's not compatible with the latest versions of various libraries, which then causes untraceable dependency hell trying to figure out why a feature isn't working.

Generative Infection Spread

graph LR
A-->B
A-->C
B-->D
B-->E
B-->F
C-->G
C-->H

classDef bad stroke:red,stroke-width:2px;
class B,D,E,F bad

Real-life example: When generating detailed requirements for a feature, the LLM invented a feature it thought logical that nobody asked for, and it affected a big chunk of the architecture and subsequently generated code, tests, and documentation.

Tainted Aggregate Result

graph LR

A-->B-->X
A-->C-->X-->W
Y-->Z
A-->D-->W-->Y
A-->E-->V
A-->F-->V-->Y

classDef bad stroke:red,stroke-width:2px;
class B,X,W,Y,Z bad

Example: In the thinking traces of a deep research task, the LLM took some speculative marketing copy as fact, which led it to classify a bunch of other products as inferior because they lacked the nonsensical feature.

Every single one of these can have real-world consequences: misinformed purchasing decisions, missed deadlines and increased project costs, or egg on your face when the customer discovers your mistake.
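All three chain types boil down to the same question: given one bad node, what else is downstream of it? Here is a minimal sketch of that check, using the edges from the diagrams above. The function name `tainted_nodes` and the edge-list representation are my own assumptions for illustration, not part of any tool.

```python
from collections import defaultdict, deque

def tainted_nodes(edges, bad_sources):
    """Return every node reachable from a bad source, sources included."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    tainted = set(bad_sources)
    queue = deque(bad_sources)
    while queue:  # breadth-first walk downstream of each bad node
        node = queue.popleft()
        for child in children[node]:
            if child not in tainted:
                tainted.add(child)
                queue.append(child)
    return tainted

# Edge lists transcribed from the three diagrams.
LINEAR = [("A", "B"), ("B", "C"), ("C", "D")]
SPREAD = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"),
          ("B", "F"), ("C", "G"), ("C", "H")]
AGGREGATE = [("A", "B"), ("B", "X"), ("A", "C"), ("C", "X"), ("X", "W"),
             ("Y", "Z"), ("A", "D"), ("D", "W"), ("W", "Y"),
             ("A", "E"), ("E", "V"), ("A", "F"), ("F", "V"), ("V", "Y")]

print(sorted(tainted_nodes(LINEAR, ["B"])))     # ['B', 'C', 'D']
print(sorted(tainted_nodes(SPREAD, ["B"])))     # ['B', 'D', 'E', 'F']
print(sorted(tainted_nodes(AGGREGATE, ["B"])))  # ['B', 'W', 'X', 'Y', 'Z']
```

Note that in the aggregate case, C, D, E, F, and V are all fine on their own, but everything they feed into is still contaminated by the one bad input.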

How to read every line efficiently

I'm still figuring this part out.

But it's clear that the more serious the potential consequences of an output, the more carefully you should read it.

Reading early outputs closely can also uncover missed MeatspaceContext that will dramatically affect the quality and usefulness of your end result.

A good SocraticMode planning process will hopefully help you uncover these earlier in the process.

My Current Workflow (Nov 2025)

I’m using mostly Claude Code, with auto-approvals turned off. I use a cute little app called Claudio that uses Claude Code hooks to play different sounds based on different events, like reads, writes, tool uses, and most importantly, approval requests.

I’ll make a plan, tell it to proceed with implementation, then listen for the appropriate beep^[1] to know that it’s time to approve something.

Then I’ll read that particular edit, and either approve it, which is most of the time, or give it better steering on where it’s gone astray.
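To make the "different sound per event" idea concrete, here is a rough sketch of the mapping logic. This is not how Claudio is implemented. It assumes the hook receives a JSON payload with `hook_event_name` and `tool_name` fields, and that approval requests arrive as a `Notification` event; check the Claude Code hooks documentation before relying on either. The sound names are made up.

```python
import json

# Hypothetical sound names; swap in whatever your audio player understands.
TOOL_SOUNDS = {"Read": "soft-tick", "Edit": "pencil", "Write": "pencil"}
DEFAULT_SOUND = "soft-tick"
APPROVAL_SOUND = "loud-chime"  # the one that has to cut through everything else

def sound_for_payload(raw: str) -> str:
    """Pick a sound name for one hook payload (assumed to be a JSON object)."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return DEFAULT_SOUND
    if not isinstance(event, dict):
        return DEFAULT_SOUND
    # Assumption: approval requests show up as a "Notification" event.
    if event.get("hook_event_name") == "Notification":
        return APPROVAL_SOUND
    return TOOL_SOUNDS.get(event.get("tool_name", ""), DEFAULT_SOUND)

print(sound_for_payload('{"hook_event_name": "Notification"}'))
print(sound_for_payload('{"hook_event_name": "PreToolUse", "tool_name": "Edit"}'))
```

A real hook script would pass `sys.stdin.read()` to `sound_for_payload` and hand the result to an audio player. The point is only that the approval sound should be unmistakable next to the routine ones.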

It’s definitely slower than YOLO mode, but if I’m writing code that is going to production, or that must be correct, better safe than sorry. It’s rare that I make it through a string of edits without it needing some correction or clarification, even with the best reasoning models. If the edits themselves are sufficiently long, then the very act of making the edits is enough to stretch the context beyond its ability to remain coherent.

Steering helps, but sometimes you have to ABC (Always Be Compacting), or just DeclareContextBankruptcy and start fresh.

For anything beyond the simplest task, you should be using FileSystemMemory to record the design and the reasoning and context behind it.
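One simple way to do this, assuming the memory is just a plain-text log file in the repo, is a tiny helper that appends dated decisions along with the reasoning behind them. The function name, file layout, and entry format here are all hypothetical; use whatever structure you and the model can both read.

```python
from datetime import date
from pathlib import Path

def record_decision(log_path, title, decision, reasoning, today=None):
    """Append one dated entry to a plain-text decision log and return it."""
    stamp = (today or date.today()).isoformat()
    entry = (
        f"## {stamp}: {title}\n\n"
        f"Decision: {decision}\n\n"
        f"Why: {reasoning}\n\n"
    )
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with path.open("a", encoding="utf-8") as f:     # append, never overwrite
        f.write(entry)
    return entry
```

For example, `record_decision("docs/decisions.md", "Auth tokens", "Use short-lived tokens", "Customer requires 15-minute expiry")` adds one entry to the log. After a compaction or a fresh start, pointing the model at that file gives it back the reasoning that would otherwise be lost with the old context.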

footnote: as always, this post is 100% human-written.

Unfortunately Claudio’s default macOS sounds are crap—the sound doesn’t match the severity level at all. I’ll post my sound scheme here at some point. ↩︎