“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks
LessWrong (Curated & Popular)

2025-07-03
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also found that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real-world problem. For context on our paper, the tweet thread is here and the paper is...