Podcasting
Advertisers
Enterprise
Pricing
Resources
Discover Discover

Log in
Sign up free

LessWrong (Curated & Popular)

“OpenAI:” by Daniel Kokotajlo

2025-03-11

Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.tl;dr: 1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fu...

Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.

tl;dr:

1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:

The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.

That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...]

The original text contained 1 image which was described by AI.

---

First published:
March 11th, 2025

Source:
https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

View more

Comments (3)

More Episodes

You may also like

MPIR Old Time Radio

Ham Radio Crash Course Podcast

Conversations on the Creek

Elliot in the Morning

Podbean Amplified

All-In with Chamath, Jason, Sacks & Friedberg

The Ultimate Art Bell Podcast Feed

Lex Fridman Podcast

Darknet Diaries

Agatha Christie BBC Dramatisations

Get this podcast on your phone, Free

Create Your Podcast In Minutes

Full-featured podcast site
Unlimited storage and bandwidth
Comprehensive podcast stats
Distribute to Apple Podcasts, Spotify, and more
Make money with your podcast

It is Free

Podcast Services
MONETIZATION & MORE
KNOWLEDGE BASE
Support
Podbean

Privacy Policy
Cookie Policy
Terms of Use
Consent Preferences
Copyright © 2015-2025 Podbean.com