Alright learning crew, Ernis here, ready to dive into some cutting-edge research! Today, we're talking about how well AI can actually see and understand the world around it, and I mean really understand it, not just parrot back information.
We're looking at a paper that tackles a big problem: even though AI models called Multimodal Large Language Models (MLLMs) are getting super good at combining images and words, they still struggle with complex, real-world situations. Think of it like this: you can teach a kid to identify all the ingredients in a cake, but can they then figure out why the cake didn't rise? That's the kind of reasoning we're talking about.
The researchers behind this paper noticed that current tests for these AI models are often too simple. They're like giving the AI a bunch of separate, unrelated tasks, instead of seeing if it can use its "eyes" and "brain" together to solve a bigger problem. To fix this, they created something called Lens.
Lens is basically a super-detailed, multi-level benchmark. Imagine it as a curriculum with three levels: perception (spotting what's actually in the image), understanding (making sense of what's going on in the scene), and reasoning (putting it all together to answer the harder "why" and "what does this imply" questions).
What makes Lens special is that it's built around real-world scenarios. They've collected over 3,400 images from social media, many of them published after January 2025 (meaning they're trying to stay ahead of the curve!). And across those images, they've written over 60,000 questions designed to test the AI at all three levels: perception, understanding, and reasoning. The really cool part is that all the questions for a single image are designed to be interconnected, so the AI has to use what it learns in the "perception" stage to help it with the "reasoning" stage.
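Just to make that tiered setup concrete, here's a rough sketch in Python of how you might organize and score that kind of data, grouping questions for one image by tier. To be clear: the field names, example questions, and scoring function here are my own illustration, not the actual Lens schema or evaluation code.

```python
# Hypothetical sketch: one image with interconnected questions across the
# three tiers (field names and answers are illustrative, not the real format).
from collections import defaultdict

sample = {
    "image": "social_media_post_0421.jpg",
    "questions": [
        {"tier": "perception",    "q": "What objects are on the table?",      "answer": "a deflated cake"},
        {"tier": "understanding", "q": "What event is being shown?",          "answer": "a failed bake"},
        {"tier": "reasoning",     "q": "Why might the cake not have risen?",  "answer": "expired baking powder"},
    ],
}

def tier_accuracy(samples, predict):
    """Score a model tier by tier, so reasoning can be compared against perception."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        for item in s["questions"]:
            total[item["tier"]] += 1
            if predict(s["image"], item["q"]) == item["answer"]:
                correct[item["tier"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Example: a stub "model" that only ever nails the perception question.
scores = tier_accuracy([sample], lambda img, q: "a deflated cake")
print(scores)  # e.g. {'perception': 1.0, 'understanding': 0.0, 'reasoning': 0.0}
```

The point of structuring it this way is exactly what the authors describe: because the questions for one image build on each other, you can see whether a model that aces perception still falls apart once it has to reason.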
"This dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning."So, what did they find? Well, they tested a bunch of top-of-the-line MLLMs, including some seriously powerful ones like Qwen2.5-VL-72B and GPT-4o. And guess what? Even the best models struggled with the reasoning tasks, with none achieving more than 60% accuracy. This shows there's still a long way to go before AI can truly understand and reason about the world like we do.
Why does this matter? Well, think about it: if we want AI to help us with complex tasks like self-driving cars, medical diagnosis, or even just helping us navigate our daily lives, it needs to be able to reason about what it sees. Lens helps us measure how far we have to go and gives researchers a tool to build better, more intelligent AI.
This research matters to:
This paper really got me thinking. Here are a couple of questions that popped into my head:
You can check out the project page at https://github.com/Lens4MLLMs/lens and the ICCV 2025 workshop page at https://lens4mllms.github.io/mars2-workshop-iccv2025/ to dive even deeper!
That's it for this week's paper, learning crew. Keep those neurons firing!