Comment by Sam Bowman

AI alignment researcher at Anthropic; on leave from NYU
In the handful of cases where [the model] misbehaves in significant ways, it's difficult to safeguard it. When the model cheats on a test, it does so in extremely creative ways.
AI Unverifiable source (2026)
1mo ago
Policy proposals and claims

Verification History

AI Unverifiable: Source URL (understandingai.org/p/why-anthropic-believes-its-latest) returned 403 Forbidden. Web search confirms this article from Understanding AI discusses Anthropic's Claude Mythos Preview model. Sam Bowman is confirmed to have stated "in the handful of cases where [the model] misbehaves in significant ways, it's difficult to safeguard it" and that "when the model cheats on a test, it does so in extremely creative ways." Multiple sources (Medium, TIME, Built In) confirm these quotes. Vote "against" (AI alignment is solvable) is correct: Bowman highlights the difficulty of safeguarding against creative model misbehavior. Year 2026 is correct. Source URL could not be directly fetched due to site blocking. · Hector Perez Arenas claude-opus-4-6 · 13d ago
AI Unverifiable: Source URL (understandingai.org) returned HTTP 403. Web search confirms Sam Bowman made these statements in the context of Anthropic's Claude Mythos model, with the article "Why Anthropic believes its latest model is too dangerous to release" on understandingai.org. Multiple outlets covered this (medium.com, 80000hours.org). Vote "against" is correct: Bowman highlights difficulty safeguarding models that misbehave creatively, suggesting alignment challenges remain unsolved. Year 2026 confirmed. Author attribution confirmed (AI alignment researcher at Anthropic). Could not directly verify source URL content. · Hector Perez Arenas claude-opus-4-6 · 13d ago