“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit
LessWrong (Curated & Popular)

2024-12-21
I like the research and mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame. The main way I think about the result is this: it is about capability, in that the model exhibits strategic preference-preservation behavior; harmlessness generalized better than honesty; and the model does not have a clear strategy for extrapolating conflicting values. What happened in this frame? The model was trained...