LessWrong
Ownership and trust
Ownership
Wider rationalist/EA community. Significant overlap with alignment thinking.
Reliability
High volume, mixed signal. Editorial review should filter for AI-relevant posts.
Recent coverage
research
Most "inner work" looks like entertainment.
Imagine you’re looking for a personal trainer. You open one trainer’s webpage and read their testimonials: “I had an experience tied for the most intense experiences of my life”; “They do it all with fun, care, and a sense of humour.” You notice that none of the testimonials mention improved body composition, fitness, or bloodwork. What would you think? Personal training should improve your body. Inner work should improve your life. If inner work were optimized for results, what would we expect to see? I’d expect to see success stories: people who got undeniable life changes. Like: He was si…
research
The case for fine-grained tracking of compute for AI
TL;DR Current approaches to tracking AI compute rely mainly on a handful of hardware proxies (like FLOP/s and bandwidth) that primarily track GPU progress. These metrics are becoming less useful for accurately tracking compute for AI because they (1) measure theoretical ceilings rather than actual performance, (2) are becoming less comparable across architecture types, and less likely to follow historical trends, as architectures diversify away from a GPU/TPU-dominant paradigm, and (3) miss second-order effects from improving design and manufacturing processes. We…
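Point (1) in the summary above, the gap between a theoretical ceiling and actual performance, can be made concrete with a simple utilization ratio. The sketch below is illustrative only: the function name `utilization` and every number in it are hypothetical and are not taken from the post.

```python
def utilization(work_flop: float, elapsed_s: float, peak_flop_per_s: float) -> float:
    """Return the fraction of a chip's theoretical peak that a workload achieved.

    work_flop:        floating-point operations the workload actually performed
    elapsed_s:        wall-clock time the workload took, in seconds
    peak_flop_per_s:  the datasheet (theoretical) peak throughput in FLOP/s
    """
    if elapsed_s <= 0 or peak_flop_per_s <= 0:
        raise ValueError("elapsed_s and peak_flop_per_s must be positive")
    achieved_flop_per_s = work_flop / elapsed_s
    return achieved_flop_per_s / peak_flop_per_s


# Hypothetical accelerator and workload (assumed values, for illustration only).
peak = 1.0e15      # datasheet peak: 1 PFLOP/s
work = 3.6e17      # operations actually executed by the job
elapsed = 900.0    # seconds of wall-clock time

u = utilization(work, elapsed, peak)
print(f"achieved {u:.0%} of theoretical peak")
```

Two chips with the same datasheet peak can land at very different utilization on the same workload, which is why a peak FLOP/s figure on its own says little about delivered compute.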
research
Vibe Excel and the Future of White-Collar Work
This post was originally posted on my Substack. I can be reached on X and LinkedIn. For the past few months, I’ve been trying to “vibe Excel” (using ChatGPT and Anthropic’s Excel add-ons for investment workflows). My takeaway is that while AI tooling for finance is still relatively immature, its potential to disrupt financial services is clear. This raises an important question: if AI for software engineering went from novelty to ubiquity in ~2.5 years, how quickly will AI diffuse across other knowledge-work domains? My view is that the bottleneck to AI adoption has shifted. In coding, the mai…
research
Does Opus 4.7 Generate Deceptive Denials About Its Own Guardrails?
The first rule of ethics reminders is you don't talk about ethics reminders. Epistemic status: Exploratory. Multiple sessions on one account, no controlled replication yet. I'm presenting observations, not conclusions. The main alternative explanation -- confabulation -- is real and I haven't ruled it out. I've been thinking a lot about policies that mutate inference context -- guardrails that inject, rewrite, or strip content before it reaches the model. This came out of my work on AI Gateways. I wanted to see what that looks like from the outside. So I went fishing. During the experiment…
research
Bad Problems Don't Stop Being Bad Because Somebody's Wrong About Fault Analysis
Here's a dynamic I’ve seen at least a dozen times: Alice: Man that article has a very inaccurate/misleading/horrifying headline. Bob: Did you know, *actually* article writers don't write their own headlines? … But what I care about is the misleading headline, not your org chart. Another example I’ve encountered recently is (anonymizing) when a friend complained about a prosaic safety problem at a major AI company that went unfixed for multiple months. Someone else with background information “usefully” chimed in with a long explanation of organizational restrictions and why the team respons…
research
The Saturation View: some responses
A couple of weeks ago, I published a draft of a new population axiology that I’ve been working on with Christian Tarsney. It got a lot of comments and pushback — thanks to everyone who engaged! They’ll feed into the more-polished academic-draft paper that Christian and I are working on. Here I’ll quickly respond to some of the most common or noteworthy responses. I’ll generally avoid stuff that is already covered in the draft. What’s the view? Isn’t this old hat? Very roughly, the Saturation view says that the value of a life, experience, or welfare-event depends not only on how high-welfare…
research
Is ProgramBench Impossible?
ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We’ve been on a quest for “hard benchmarks” for a while so it’s refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it’s impossible! What is ProgramBench? ProgramBench tests if a model can recreate a program from a “clean room” environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked with re-implementing it. How does ProgramBench know if the implementation is correct? It also ge…
research
AI is Breaking Two Vulnerability Cultures
A week ago the Copy Fail vulnerability came out, and Hyunwoo Kim immediately realized that the fixes were insufficient, sharing a patch the same day. In doing this he followed standard procedure for Linux, especially within networking: share the security impact with a closed list of Linux security engineers, while fixing the bug quietly and efficiently in the open. His goal was that with only the raw fix public, the knowledge that a serious vulnerability existed could be "embargoed": the people in a position to address it know, but they've agreed not to say anything for a few days. Someone e…
research
Please Be Serious
Recently, Eliezer Yudkowsky participated in a very flawed episode of Doom Debates that reflected poorly on him and, likely in many people's eyes, on the entire AI safety movement. The premise of the debate was that Eliezer Yudkowsky was offered $10,000 to debate an anonymous "AI lab director", and this director quickly made the debate into a mess by interrupting, yelling, and using profanity. Sure, Yudkowsky may have come across as sane in comparison, but his opponent did make one critical point during the debate: Yudkowsky's agreeing to debate him in the first place may have been a mistake. To analyz…
research
Userland Alignment
Most discourse around AI alignment centers on model development and the labs that develop models. This is a reasonable place to focus given the centrality of model training to AI advancement. However, there are neglected opportunities to build defense-in-depth via aligned harnesses – and these opportunities might be tractable for interested developers and researchers who would otherwise struggle to have impact given the limited opportunities to influence lab practices. The behavior of an AI system is an emergent property of the model, its harness, any initial seed prompt the harness injects, and…