Subtext, a vehicle of metacognition
Model reasoning traces can encode information beyond what's visible to the naked eye. We use subtext as an umbrella term for information encoded in the reasoning trace that's readable by models but not by humans. The existence of subtext means models and humans read the same reasoning trace differently and therefore make different subsequent decisions. Subtext can cause unfaithfulness, but it's not necessarily a bug: it can also serve as a channel for implicit reasoning.
We measure subtext empirically as the behavioral gap between humans and models given the same context. We then control subtext through various interventions on the trace and measure the change in this behavioral gap. This separates subtext from content in the reasoning trace that is human-unreadable but has no behavioral consequences. We want to understand why and how models use subtext, and its downstream safety implications.
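As a deliberately simplified sketch of this measurement, one could estimate the behavioral gap as a divergence between the distributions of follow-up actions that models and humans take given the same trace, then compare that gap before and after an intervention (paraphrasing the trace is one hypothetical example). All names below (behavioral_gap, subtext_effect, act_model, act_human, intervene) are illustrative placeholders, not an existing API.

```python
from collections import Counter
from typing import Callable, Sequence


def action_distribution(choices: Sequence[str]) -> dict[str, float]:
    """Empirical distribution over discrete follow-up actions."""
    counts = Counter(choices)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two action distributions."""
    actions = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)


def behavioral_gap(model_choices: Sequence[str],
                   human_choices: Sequence[str]) -> float:
    """Gap between model and human behavior given the same reasoning trace."""
    return total_variation(action_distribution(model_choices),
                           action_distribution(human_choices))


def subtext_effect(trace: str,
                   intervene: Callable[[str], str],
                   act_model: Callable[[str], Sequence[str]],
                   act_human: Callable[[str], Sequence[str]]) -> float:
    """Change in the behavioral gap when an intervention is applied to the
    trace. act_model / act_human are hypothetical hooks that return the
    follow-up actions sampled from models and humans given a trace."""
    gap_original = behavioral_gap(act_model(trace), act_human(trace))
    edited = intervene(trace)
    gap_edited = behavioral_gap(act_model(edited), act_human(edited))
    return gap_original - gap_edited
```

Under this sketch, a large positive delta would suggest the original trace carried subtext with behavioral consequences that the intervention removed, while a near-zero delta would indicate that whatever the intervention changed was behaviorally inert.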