SemanticToolWrapper

#ideas #agents

Instead of asking a reasoning model to read, say, an entire webpage, we wrap the tool call in a smaller faster model that distills the page content into a lean semantic structure: data + actions

Input:

<html>
<script>...300kb</script>
<body>
<h1>giant webpage</h1>
<a href=""><button onClick="save()">save</button></a>
<a href=""><button onClick="edit()">edit</button></a>

<more><markup>

(useful data)

The small model turns this into:

data: {"some":"useful data"},
available_actions: ["save", "delete", "edit", "next", "previous"]

Which is what our reasoning model works with. No extraneous tokens diluting the core meaning.

There is, of course, a risk of misinterpretation, hallucination, and a loss of information fidelity, but perhaps the tool call could specify a level of detail or some other instructions/context about what it expects to see.

The small model needs a very good prompt, and the correct context from its caller so it is primed for a good semantic interpretation of the raw content.

Web browsing is the obvious usecase, but really it can be the result of any tool call.

Any time you're getting back raw data from a tool call, if it's not already formatted for consumption by the model, it should be wrapped in something that has enough context to transform it so the calling model only ever sees the distilled data.

TK state machine / graph traversal

c.f. hypercard, choose-your-own-adventure, text-based dungeon crawlers

mirrors how brain turns raw visual data, memories, etc, into a mental model.

TK TODO we used to parse unstructured data into strongly-typed data structures, since data structures were the atoms of computing. now we parse the unstructured data into LANGUAGE / SEMANTIC atoms, since that's the atom of LLM reasoning.

LLM said:

Instead of agents getting:


<html><head><title>...lots of irrelevant markup...
<div class="navigation">...menu items...
<div class="job-listing">Senior Engineer at Acme Corp</div>

...ads, footers, tracking scripts...

They get:

BrowserState(
    scene_description="Senior engineering role with competitive compensation",
    available_actions=["easy_apply_button", "save_job"],
    state_data={"job_title": "Senior Engineer", "salary": "$150k-200k"}
)

It'd be an interesting experiment to have it output something that literally looked like a dungeon-crawler of the problem-space:

You are on a job listing page for a senior engineering role at CompanyX. It contains the following data: [xxxxx]

Do you: a) easy_apply_button, b) save_job, or c) something else?