Alright learning crew, Ernis here, ready to dive into some cutting-edge research! Today, we're talking about how well AI can actually see and understand the world around it, and I mean really understand it, not just parrot back information.
We're looking at a paper that tackles a big problem: even though AI models called Multimodal Large Language Models (MLLMs) are getting super good at combining images and words, they still struggle with complex, real-world situations. Think of it like this: you can teach a kid to identify all the ingredients in a cake, but can they then figure out why the cake didn't rise? That's the kind of reasoning we're talking about.
The researchers behind this paper noticed that current tests for these AI models are often too simple. They're like giving the AI a bunch of separate, unrelated tasks, instead of seeing if it can use its "eyes" and "brain" together to solve a bigger problem. To fix this, they created something called Lens.
Lens is basically a super-detailed, multi-level benchmark. Imagine it as a curriculum with three levels: perception (spotting what's actually in the image), understanding (making sense of what's going on in the scene), and reasoning (putting it all together to answer the harder "why" and "what does this imply" questions).
What makes Lens special is that it's built around real-world scenarios. They've collected over 3,400 images from social media, many of them published after January 2025 (meaning they're trying to stay ahead of the curve!). And across those images, they've written over 60,000 questions designed to test the AI at all three levels: perception, understanding, and reasoning. The really cool part is that all the questions for a single image are designed to be interconnected, so the AI has to use what it learns in the "perception" stage to help it with the "reasoning" stage.
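Just to make that tiered setup concrete, here's a rough sketch in Python of how you might organize and score that kind of data, grouping questions for one image by tier. To be clear: the field names, example questions, and scoring function here are my own illustration, not the actual Lens schema or evaluation code.

```python
# Hypothetical sketch: one image with interconnected questions across the
# three tiers (field names and answers are illustrative, not the real format).
from collections import defaultdict

sample = {
    "image": "social_media_post_0421.jpg",
    "questions": [
        {"tier": "perception",    "q": "What objects are on the table?",      "answer": "a deflated cake"},
        {"tier": "understanding", "q": "What event is being shown?",          "answer": "a failed bake"},
        {"tier": "reasoning",     "q": "Why might the cake not have risen?",  "answer": "expired baking powder"},
    ],
}

def tier_accuracy(samples, predict):
    """Score a model tier by tier, so reasoning can be compared against perception."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        for item in s["questions"]:
            total[item["tier"]] += 1
            if predict(s["image"], item["q"]) == item["answer"]:
                correct[item["tier"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Example: a stub "model" that only ever nails the perception question.
scores = tier_accuracy([sample], lambda img, q: "a deflated cake")
print(scores)  # e.g. {'perception': 1.0, 'understanding': 0.0, 'reasoning': 0.0}
```

The point of structuring it this way is exactly what the authors describe: because the questions for one image build on each other, you can see whether a model that aces perception still falls apart once it has to reason.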
"This dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning."So, what did they find? Well, they tested a bunch of top-of-the-line MLLMs, including some seriously powerful ones like Qwen2.5-VL-72B and GPT-4o. And guess what? Even the best models struggled with the reasoning tasks, with none achieving more than 60% accuracy. This shows there's still a long way to go before AI can truly understand and reason about the world like we do.
Why does this matter? Well, think about it: if we want AI to help us with complex tasks like self-driving cars, medical diagnosis, or even just helping us navigate our daily lives, it needs to be able to reason about what it sees. Lens helps us measure how far we have to go and gives researchers a tool to build better, more intelligent AI.
This research matters to:
This paper really got me thinking. Here are a couple of questions that popped into my head:
You can check out the project page at https://github.com/Lens4MLLMs/lens and the ICCV 2025 workshop page at https://lens4mllms.github.io/mars2-workshop-iccv2025/ to dive even deeper!
That's it for this week's paper, learning crew. Keep those neurons firing!