ELLIS Honours Student 2025

Part of the ELLIS Honours Programme for the Master of AI

Below is a summary of my experience as an ELLIS Honours Student; you can also find more information here.

Figure 1: Cees Snoek & Matteo Nulli at the ELLIS Honours Presentations.

Title: Object-Guided Visual Tokens: Eliciting Compositional Reasoning in Multimodal Language Models

Links: 📄 Paper 📜 Full Thesis 📝 Blogpost 🧑‍💻 Code

Co-supervisors: Ivona Najdenkoska (ELLIS Postdoc-Amsterdam), Yuki M. Asano (ELLIS Member-Nuremberg), Marcel Worring (ELLIS Fellow-Amsterdam), Vladimir Orshoulevich (eBay Foundation Models Team)

Research Summary:

Standard Multimodal Large Language Models (MLLMs) employ contrastively pre-trained vision encoders whose performance, while strong across a broad range of tasks, falls short in compositional understanding and reasoning over the visual input. This is mostly due to their pre-training objective, which targets retrieval between matching image/caption pairs rather than an in-depth understanding of all components of an image. Moreover, while state-of-the-art image encoding methods yield strong performance, they inflate the number of visual input tokens by roughly two to three times, thereby significantly lengthening both training and inference.

To alleviate these issues, we present OG-LLaVA (Object-Guided LLaVA), a multimodal architecture which, through a novel connector design (OG-Fusion), enhances the model’s ability to understand and reason about visual content without substantially increasing the number of tokens or unfreezing the vision encoder. A core element of OG-Fusion is the combination of CLIP output representations with segmentation masks. By leveraging the descriptive power of advanced segmentation models, OG-LLaVA attains superior performance on tasks that require a deeper understanding of object relationships and spatial arrangements and, more broadly, within the domains of compositional reasoning and visual grounding.
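To make the idea of object-guided fusion a bit more concrete, below is a minimal, hypothetical PyTorch sketch of one way a connector could combine frozen CLIP patch tokens with segmentation masks: the masks pool per-object features, which are then mixed back into the patch tokens before a LLaVA-style projection, so the visual token count stays unchanged. The class, layer names, and shapes are illustrative assumptions of mine, not the actual OG-Fusion implementation from the paper.

```python
import torch
import torch.nn as nn

class ObjectGuidedFusion(nn.Module):
    """Illustrative sketch only: pool CLIP patch features inside each
    segmentation mask to form per-object tokens, then mix that object
    context back into the patch tokens. Names/shapes are assumptions,
    not the published OG-Fusion design."""

    def __init__(self, dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.object_proj = nn.Linear(dim, dim)   # hypothetical per-object projection
        self.fuse = nn.Linear(2 * dim, dim)      # hypothetical patch + object-context fusion
        self.to_llm = nn.Linear(dim, llm_dim)    # LLaVA-style projector into the LLM space

    def forward(self, patch_tokens: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) frozen CLIP patch embeddings
        # masks: (B, K, N) binary segmentation masks flattened over the patch grid
        weights = masks / masks.sum(dim=-1, keepdim=True).clamp(min=1)    # normalize each mask
        object_tokens = self.object_proj(weights @ patch_tokens)          # (B, K, D) mask-pooled objects
        # broadcast each patch's object context back onto the patch tokens
        patch_context = masks.transpose(1, 2) @ object_tokens             # (B, N, D)
        fused = self.fuse(torch.cat([patch_tokens, patch_context], dim=-1))
        return self.to_llm(fused)                                         # (B, N, llm_dim), token count unchanged

# toy usage: 2 images, 576 CLIP patches, 5 object masks each
fusion = ObjectGuidedFusion()
patches = torch.randn(2, 576, 1024)
masks = (torch.rand(2, 5, 576) > 0.8).float()
print(fusion(patches, masks).shape)  # torch.Size([2, 576, 4096])
```

Note that in this sketch the number of visual tokens handed to the language model stays at the original patch count, which is in the spirit of the goal stated above of not inflating the token budget.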

MSc Honours experience:

The MSc Honours Programme has been a real game‑changer for my research. While working on my thesis, I was able to dig deep into multimodal learning, exploring the subtle challenges of compositional reasoning and visual understanding within vision‑language models. The best part was spending a month in Nuremberg: teaming up with Prof. Yuki Asano at the new Foundational AI Lab (University of Technology Nuremberg) let me sharpen my experiments, try out fresh ideas, and see how blue‑sky theory can quickly turn practical. This experience, made possible by the ELLIS Honours Programme, allowed me to learn directly from Europe’s leading AI researchers, while postdoc Ivona Najdenkoska, the VISLab team at the University of Amsterdam, and my supervisors at eBay ensured that each step stayed both rigorous and rewarding.