Object-Guided Visual Tokens: Eliciting Compositional Reasoning
in Multimodal Language Models

M. Nulli, I. Najdenkoska, M. M. Derakhshani, V. Orshulevich, Y. M. Asano
University of Amsterdam, eBay, University of Technology Nuremberg


Motivation

Figure 1: OG-LLaVA architecture with OG-Fusion internal process.

Most Multimodal Large Language Models (MLLMs) use contrastively pre-trained vision encoders.
They perform well on many tasks, but often struggle with compositional understanding and with reasoning about what is actually in an image. This is because these encoders are trained mainly for image–caption retrieval, not for decomposing and understanding every part of a scene. A second issue is efficiency: state-of-the-art vision encoders generate 2–3x more visual tokens, which slows down both training and inference.

To tackle these problems, we introduce OG-LLaVA (Object-Guided LLaVA). With our new connector design, OG-Fusion, the model can reason about visual content more effectively—without adding lots of extra tokens or fine-tuning the vision encoder itself. At the core of OG-Fusion is a simple but powerful idea: combine CLIP representations with segmentation masks. This lets OG-LLaVA leverage the descriptive strength of segmentation models to better capture object relationships and spatial arrangements. The result? OG-LLaVA outperforms existing comparable models on tasks that demand deeper visual reasoning and grounding, all while staying efficient.

Figure 2: OG-LLaVA vs LLaVA-1.5 on Compositional Reasoning Benchmark ConMe.
Figure 3: OG-LLaVA vs LLaVA-1.5 on Vision Grounding benchmark MMVP.

Underlying Procedure

As illustrated in Figure 1, we extract visual features from the input image through a Vision Encoder. Concurrently, we pass the input image through OG-Fusion. Here we:

  1. Use a segmentation model to retrieve the masks,
  2. Downsample the segmentations to the patch grid,
  3. Apply these masks to the visual features, and
  4. Concatenate the results and pass them through a Multi-Layer Perceptron to produce Object-Guided Visual Tokens (OGVT).

The OGVT are then given as input to a Large Language Model together with Textual Tokens to produce an output.
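The four steps above can be sketched roughly in PyTorch as follows. Everything here is our illustrative assumption rather than the released implementation: the module name, the mean-pooling of features inside each mask, and the way object context is broadcast back to patch positions before concatenation may all differ in the actual OG-Fusion connector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OGFusion(nn.Module):
    """Illustrative sketch of an OG-Fusion-style connector (names and pooling are ours)."""

    def __init__(self, vis_dim, llm_dim):
        super().__init__()
        # MLP mapping concatenated [patch; object-context] features into LLM space
        self.mlp = nn.Sequential(
            nn.Linear(2 * vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_feats, masks):
        # vis_feats: (B, N, D) patch features from a frozen vision encoder
        # masks:     (B, M, H, W) binary segmentation masks
        B, N, D = vis_feats.shape
        side = int(N ** 0.5)                          # assume a square patch grid
        # steps 1-2: downsample segmentation masks to the patch grid
        m = F.interpolate(masks.float(), size=(side, side), mode="nearest")
        m = m.flatten(2)                              # (B, M, N)
        # step 3: apply masks, averaging patch features inside each mask
        denom = m.sum(-1, keepdim=True).clamp(min=1.0)
        obj = (m @ vis_feats) / denom                 # (B, M, D) per-object features
        # broadcast each patch's object context back to its position
        per_patch = m.transpose(1, 2)                 # (B, N, M)
        ctx = (per_patch @ obj) / per_patch.sum(-1, keepdim=True).clamp(min=1.0)
        # step 4: concatenate and project to Object-Guided Visual Tokens
        fused = torch.cat([vis_feats, ctx], dim=-1)   # (B, N, 2D)
        return self.mlp(fused)                        # (B, N, llm_dim)
```

In this sketch a frozen vision encoder would supply `vis_feats` and an off-the-shelf segmentation model the `masks`; only the MLP introduces trainable connector parameters.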
The ❄️ (snowflake) and 🔥 (fire) symbols in Figure 1 mark modules whose parameters are kept frozen or trained, respectively.
The LoRA label indicates that not all parameters of the LLM are unfrozen: only the LoRA adapter layers are trained.
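As a minimal illustration of the LoRA idea (a toy implementation of ours, not the training code used here), a LoRA layer keeps the base weight frozen and learns only a low-rank additive update:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # base weights stay frozen
        # low-rank factors: only A and B are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # B starts at zero, so the layer initially behaves exactly like the base
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

With rank r much smaller than the layer width, the trainable parameter count drops from `in_features * out_features` to `r * (in_features + out_features)`, which is why only the LoRA layers need to be unfrozen.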

Visualizations

The images we picked cover all kinds of tricky challenges—spotting tiny details, telling apart subtle colors, reading depth cues, recognizing materials, making sense of spatial layouts, and even detecting small objects. They’re designed to push visual–language reasoning to its limits. What’s key is that these examples are tested at inference time with no extra fine-tuning, so any boost (or drop) in performance comes purely from the Object-Guided priors built into OG-LLaVA.

In Figures 4, 5 and 6 we highlight a range of cases where OG-LLaVA consistently demonstrates sharper perception and more grounded reasoning, from subtle posture cues to tricky color judgments and material recognition.

Together, these examples underline how OG-LLaVA moves beyond surface-level cues. It pays attention to fine details, adapts across diverse tasks, and reasons about entire scenes in a way that more closely reflects human understanding.

Figure 4: OG-LLaVA vs LLaVA-1.5 on ConMe Replace-Relation examples.
Figure 5: OG-LLaVA vs LLaVA-1.5 on ConMe Replace-Object examples.
Figure 6: OG-LLaVA vs LLaVA-1.5 on ConMe Replace-Relation examples.

Results

Our results on compositional reasoning and vision-centric benchmarks (Table 1) show that OG-LLaVA consistently outperforms its baselines across both LLaVA-1.5 and Cambrian-1 training setups. The improvements are not marginal; they are large and systematic.

  • Compositional understanding
    • ARO:
      • +21% on Coco-Order (38.2 → 82.6) and +16% on Flickr-Order (49.1 → 84.0).
      • Visual Genome Attribution on average +10% across backbones and on Visual Genome Relation +20% across training data and model sizes.
    • ConMe: steady +2% gains, peaking at 65.2 in the 8B setting (+3.6 over the strongest baseline).
  • Vision-centric reasoning
    • MMVP: about +3 points on average (e.g. 32.0 → 37.0 in 8B, 61.6 → 66.0 with Cambrian-1 data).
    • CVBench: stable performance, with only ±1 point fluctuations.
Table 1: OG-LLaVA performance on Compositional Reasoning and Vision Centric tasks compared with LLaVA baselines.

In Figure 7, we compare OG-LLaVA-8B with SIT-8B and LLaVA-1.5-8B under the same backbone. SIT stands for Subobject-level Image Tokenization, a recent study employing a comparable segmentation-infusion method. The results are clear: OG-LLaVA consistently outperforms SIT, with more than a 25% advantage on compositional reasoning and a 10% edge in visual grounding.

There’s also a key difference in usability. OG-LLaVA works flexibly both with and without segmentation masks at inference, while SIT requires pre-computed masks every time. This adds non-trivial overhead, since a separate segmentation model must run first, and makes the system less adaptable. In practice, SIT's reduced token count does not outweigh the complexity it introduces, whereas OG-LLaVA preserves efficiency without imposing such constraints.
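This design choice can be illustrated with a hypothetical connector that accepts masks optionally; the class name, dimensions, and simplified pooling below are our own sketch, not the released code.

```python
import torch
import torch.nn as nn

class FlexibleConnector(nn.Module):
    """Hypothetical connector: falls back to a plain LLaVA-style projection
    when no segmentation masks are supplied at inference."""

    def __init__(self, vis_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)            # mask-free path
        self.fused_proj = nn.Linear(2 * vis_dim, llm_dim)  # object-guided path

    def forward(self, vis_feats, masks=None):
        # vis_feats: (B, N, D) patch features; masks: optional (B, M, H, W)
        if masks is None:
            return self.proj(vis_feats)                    # no segmentation model needed
        # toy mask pooling stands in for the full OG-Fusion step
        m = masks.flatten(2).float()                       # (B, M, H*W)
        obj = (m @ vis_feats) / m.sum(-1, keepdim=True).clamp(min=1.0)
        ctx = obj.mean(1, keepdim=True).expand_as(vis_feats)
        return self.fused_proj(torch.cat([vis_feats, ctx], dim=-1))
```

The point of the sketch is the `masks=None` default: a mask-optional forward pass lets the same checkpoint run with or without an upstream segmentation model, which a pipeline requiring pre-computed masks cannot do.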

Figure 7: OG-LLaVA vs Subobject Level Image Tokenization and LLaVA-1.5 on Compositional Reasoning and Vision Centric tasks.

Citation

If you use this work, please cite:

@misc{nulli2025ogllava,
  author       = {Nulli, M. and Najdenkoska, I. and Derakhshani, M. M. and Dorkenwald, M. and Orshulevich, V. and Asano, Y. M.},
  title        = {Object-Guided Visual Tokens: Eliciting Compositional Reasoning in Multimodal Language Models},
  howpublished = {https://matteonulli.github.io/blog/2025/ogllava/},
  year         = {2025},
  note         = {Accessed: 2025-09-05}
}



Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Model Compression for Machine Translation in Large Language Models
  • Confidently Exiting (joanvelja/Confidently_Exiting on GitHub)