Computer Vision - Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck

2025-06-02

Hey PaperLedge listeners, Ernis here, ready to dive into some fascinating research! Today, we're talking about how smart our AI image recognition tools really are. You know, the ones that can tell the difference between your cat and a dog in a photo.

Now, these systems are powered by what we call "large language models," or LLMs. Think of them as having a gigantic encyclopedia in their heads, letting them connect words and ideas. But, and this is a big but, a recent paper suggests that these LLMs might be missing a pretty crucial piece of the puzzle: hierarchical knowledge.

What's hierarchical knowledge? Imagine a family tree. You have broad categories at the top, like "Animals," and then branches leading down to more specific groups, like "Vertebrates," then even more specific ones like "Fish," and finally, individual types like "Anemone Fish." The paper argues that while these LLMs might be able to identify an "Anemone Fish" in a picture, they don't necessarily understand that it's also a fish, a vertebrate, and an animal. They lack the family tree understanding.
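If you like seeing ideas in code, here is a minimal sketch of that family tree. The `PARENT` table and the `path_to_root` helper are made up for illustration (they are not from the paper), and the extra "Mammal" and "Cat" entries are just there to give the tree a second branch.

```python
# Hypothetical toy taxonomy: each label maps to its parent (None marks the root).
PARENT = {
    "Animal": None,
    "Vertebrate": "Animal",
    "Fish": "Vertebrate",
    "Anemone Fish": "Fish",
    "Mammal": "Vertebrate",
    "Cat": "Mammal",
}


def path_to_root(label, parent=PARENT):
    """Return the chain from `label` up to the root, most specific first."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path


print(path_to_root("Anemone Fish"))
# ['Anemone Fish', 'Fish', 'Vertebrate', 'Animal']
```

Understanding the hierarchy means knowing every label on that path, not just the first one.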

The researchers tested this by creating almost a million multiple-choice questions about images. These questions weren't just about identifying what was in the picture, but about understanding its place in these hierarchical categories. So, they might show a picture of an Anemone Fish and ask: "Is this a fish, a reptile, a bird, or a mammal?"
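To make that concrete, here is a rough sketch of how one such four-choice question could be put together. This is an assumption about the general shape of the task, not the authors' actual pipeline: the `LEVELS` table, the `make_question` function, and the prompt wording are all hypothetical.

```python
import random

# Hypothetical labels grouped by taxonomy level; not taken from the paper's datasets.
LEVELS = {
    "class": ["Fish", "Reptile", "Bird", "Mammal", "Amphibian"],
}


def make_question(correct, level, n_choices=4, seed=0):
    """Build one multiple-choice question: the correct label plus
    distractors drawn from the same taxonomy level."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    distractors = [label for label in LEVELS[level] if label != correct]
    choices = rng.sample(distractors, n_choices - 1) + [correct]
    rng.shuffle(choices)
    return {
        "prompt": f"Which {level} does the animal in this image belong to?",
        "choices": choices,
        "answer": correct,
    }


question = make_question("Fish", "class")
print(question["prompt"])
print(question["choices"])
```

Drawing the wrong answers from the same level as the right one keeps the question about the hierarchy itself, rather than about spotting an obviously unrelated word.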

And guess what? The LLMs often struggled! It's like they're seeing the individual leaves on a tree, but not understanding the branches or the trunk that connect them all.
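One simple way to put a number on "seeing the leaves but not the branches" is to compare how often a model gets the most specific label right with how often it gets every level right for the same image. The sketch below is an illustrative scoring scheme under that assumption, not necessarily the paper's exact metric, and the image IDs and predictions are invented.

```python
def leaf_accuracy(predictions, truth):
    """Fraction of images whose most specific (last) label is correct."""
    hits = sum(1 for img in truth if predictions[img][-1] == truth[img][-1])
    return hits / len(truth)


def hierarchical_consistency(predictions, truth):
    """Fraction of images answered correctly at every level of the hierarchy."""
    hits = sum(1 for img in truth if predictions[img] == truth[img])
    return hits / len(truth)


# Labels are listed from most general to most specific (invented example data).
truth = {
    "img1": ["Animal", "Vertebrate", "Fish", "Anemone Fish"],
    "img2": ["Animal", "Vertebrate", "Mammal", "Cat"],
}
predictions = {
    # Right leaf, wrong branch: the model names the fish but calls it a reptile.
    "img1": ["Animal", "Vertebrate", "Reptile", "Anemone Fish"],
    "img2": ["Animal", "Vertebrate", "Mammal", "Cat"],
}

print(leaf_accuracy(predictions, truth))             # 1.0
print(hierarchical_consistency(predictions, truth))  # 0.5
```

A gap between those two numbers is the kind of struggle described here: the model can name the leaf but can't place it on the tree.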

Now, here's where it gets really interesting. The researchers tried to fix this by "finetuning" a vision LLM, which basically means giving it extra training on those questions. And it helped, but not as much as they expected. In fact, the training improved the hierarchical consistency of the LLM itself (the language part) more than that of the full vision LLM. This suggests that the LLM's lack of hierarchical knowledge is acting as a bottleneck, limiting how well the vision LLM as a whole can understand images in a hierarchical way.

Think of it like trying to teach someone to bake a cake without them understanding basic cooking principles. They might be able to follow the recipe, but they won't truly understand why the ingredients work together or how to adjust the recipe if needed.

So, why does this matter?

For AI developers: It highlights a key limitation in current AI systems. We need to find ways to give LLMs a better understanding of hierarchical relationships in the visual world.
For anyone using AI-powered tools: It's a reminder that these systems aren't perfect. They might be able to identify things, but they don't always understand them in the same way we do.
For educators: It underscores the importance of teaching fundamental concepts and relationships, not just memorization of facts.

The researchers believe that until LLMs themselves have a solid grasp of these hierarchical taxonomies, we can't expect vision LLMs to fully understand visual concepts in a hierarchical way.

"We conjecture that one cannot make vision LLMs understand visual concepts fully hierarchical until LLMs possess corresponding taxonomy knowledge."

Here are some questions that popped into my head:

Could this lack of hierarchical understanding lead to biases in AI systems? For example, could it misclassify images from underrepresented groups because it doesn't understand the broader context?
What are some potential ways to improve LLMs' hierarchical knowledge? Could we train them on structured knowledge bases like ontologies or knowledge graphs?
If the language model is the bottleneck, should we focus more on improving the underlying language models before integrating them with vision systems? Or should we develop new architectures that better integrate language and vision from the start?

That's all for this episode. Let me know your thoughts on this fascinating research! Are you surprised that AI systems struggle with such fundamental concepts? Until next time, keep learning!

Credit to paper authors: Yuwen Tan, Yuan Qing, Boqing Gong
