Demystifying Multimodal Learning: Impact of Visual Tokens on Inference Latency
Matteo Nulli, Marcin Mazur
Introduction
In previous installments of Demystifying Multimodal Learning, we defined what a Visual Token (VT) is and established formulas to calculate the number of Visual Tokens (\( V \)) for different architectures.
Before we look under the hood of production engines, we need to ask a fundamental question:
What is the true cost of V on inference latency, and why is it critical for Machine Learning Engineers scaling and deploying models?
The answer is simple: Scale is incredibly expensive. As highlighted in the recent Moondream blog, analyzing vision at scale quickly becomes the primary bottleneck for AI applications. When your system needs to process millions of images or sift through thousands of hours of video, compute resources are drained at an alarming rate.
While frontier AI labs are intensely focused on making VLMs faster (Marafioti, Andrés, et al., 2025; Steiner, Andreas, et al., 2024; Gemma-Team, 2025; Bai, Shuai, et al., 2025) at the architectural level (see Figure 1), far fewer people are dissecting the downstream impact of these visual tokens during inference.
Architectural efficiency is a great starting point, but if your deployment strategy is bloated with unnecessary tokens, scaling remains a bottleneck. To serve multimodal models effectively, we need to make more conscious, data-driven decisions about how we handle these tokens in production. Crucially, managing this isn’t just about minimizing token counts—it’s also about how our serving infrastructure processes them.
This requires tightly managing and exploring three critical pillars of inference: Latency, Context Windows, and the VRAM-resident KV Cache, all while leveraging architectural decoupling to maximize hardware efficiency.
Inference Latency
In production use-cases, inference cost is heavily tied to latency. Large companies enforce strict input limits to ensure predictable response times, but Visual Tokens completely disrupt this predictability.
When serving engines like vLLM (Kwon, Woosuk, et al., 2023) process a multimodal request, the request passes through several distinct phases, each carrying its own ‘latency tax’:
Phase 0: Multi-Modal Processing Before the model even sees the image, engines apply Hugging Face Processors to combine the prompt text and multi-modal data. The text is tokenized, and the image positions in the token ID sequence are filled with placeholder tokens (the number of placeholders equals the feature size output by the vision encoder).
- The Bottleneck: Vision processors can be notoriously slow, creating a CPU bottleneck before GPU inference even begins.
- The Solution: To mitigate this, vLLM utilizes Processor Output Caching. When new data arrives, it checks the cache; missing items are processed in a single batch, cached, and then merged. This prevents redundant processing overhead for frequently seen or system-level images. More details here.
Phase 1: Vision Encoding The image is passed through the Vision Encoder (VE) to create the actual embeddings that will replace the placeholder tokens. As seen in Figure 2 below, encoding latency for small VLMs (0.5B-3B parameters) increases massively at high resolutions compared to the LLM prefill stage. \( ^{**} \)
Phase 2: LLM Prefill (Time-To-First-Token) The LLM processes all input tokens (Text + Visual) in parallel to compute the initial keys and values. A high VT count dramatically increases the Time-To-First-Token (TTFT), with prefill latency scaling quadratically with the sequence length:
\[\text{Latency}_{\text{prefill}} \propto (V + T_{\text{total}})^2\] where \( V \) stands for the number of Visual Tokens and \( T_{\text{total}} \) represents the number of text tokens.
Phase 3: LLM Decoding The model generates the output one token at a time. Each generated token must attend over the entire cached history of visual and text tokens. Therefore, per-token decode latency grows roughly linearly:
\[\text{Latency}_{\text{decoding}} \propto (V + T_{\text{total}})\]
\( ^{**} \) See Anasosalu et al., 2025 for a good overview of Fast-ViT architectures.
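To make the two scaling regimes above concrete, here is a toy estimator with unit constants (these are proportionality sketches, not measured hardware numbers). The 64-token and 2900-token figures are illustrative choices for a compact versus a high-resolution grid image.

```python
# Toy illustration of the scaling laws above: prefill is roughly
# quadratic in the total sequence length, per-token decode roughly linear.

def prefill_cost(v: int, t: int) -> int:
    """Relative prefill work for v visual + t text tokens (quadratic)."""
    return (v + t) ** 2

def per_token_decode_cost(v: int, t: int) -> int:
    """Relative work to generate ONE new token (linear in cached history)."""
    return v + t

# Compact image (64 VT) vs high-resolution grid (2900 VT), 200 text tokens:
low = prefill_cost(64, 200)
high = prefill_cost(2900, 200)
print(high / low)  # ~138x more prefill work for ~12x more total tokens
```

This is why a 10x increase in visual tokens hurts Time-To-First-Token far more than it hurts per-token decode speed: the quadratic term dominates the prefill phase.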
Decoupling the Vision Encoder and Prefix Caching
When optimizing multimodal inference, treating the model as a single monolithic block is highly inefficient. State-of-the-art serving engines like vLLM have introduced a critical architectural optimization: the strict separation of the Vision Encoder (VE) and the Large Language Model (LLM).
By extracting the vision embedding process into a dedicated method (such as embed_multimodal), engines can run the VE and LLM asynchronously. This decoupling ensures that heavy image embeddings are computed, queued, and ready exactly when the decoder is prepared to ingest them.
More importantly, this separation solves a major parallelism mismatch. LLMs are massive and often require Tensor Parallelism (TP) or Expert Parallelism to distribute their weights across multiple GPUs to fit into memory. Vision Encoders, however, are typically much smaller. Forcing a small VE to use TP introduces unnecessary cross-GPU communication overhead, which actually slows down the encoding phase. By decoupling the architectures, engineers can apply batch-level Data Parallelism to the multi-modal encoder—processing different images on different GPUs simultaneously—while reserving TP strictly for the heavy lifting of the LLM.
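The parallelism split described above can be sketched conceptually. The code below is a simplification under stated assumptions: `encode_on_gpu` stands in for a full vision-encoder forward pass on one device, threads stand in for per-GPU workers, and the round-robin sharding is one simple data-parallel placement policy, not vLLM's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch: the small Vision Encoder is replicated per GPU and
# images are sharded across replicas (data parallelism), while tensor
# parallelism is reserved for the large LLM (not shown here).

NUM_GPUS = 4

def encode_on_gpu(gpu_id: int, image: str) -> str:
    # Stand-in for a VE forward pass executing on one device.
    return f"embeddings({image})@gpu{gpu_id}"

def data_parallel_encode(images: list[str]) -> list[str]:
    # Round-robin shard: image i goes to replica i % NUM_GPUS, so no
    # cross-GPU communication is needed during encoding.
    with ThreadPoolExecutor(max_workers=NUM_GPUS) as pool:
        futures = [pool.submit(encode_on_gpu, i % NUM_GPUS, img)
                   for i, img in enumerate(images)]
        return [f.result() for f in futures]

outs = data_parallel_encode([f"img{i}" for i in range(6)])
assert outs[4].endswith("@gpu0")  # the 5th image wraps back to GPU 0
```

The key design point is that each VE replica works on a whole image independently, so the expensive all-reduce traffic that tensor parallelism would impose on a small encoder is avoided entirely.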
Multimodal Prefix Caching
Prefix caching is a vital optimization from both the user and infrastructure provider perspectives, as it dramatically reduces redundant compute for shared context (see this for more on prompt caching). For Vision Language Models, this translates to massive performance gains by mitigating the heavy processing tax of visual tokens (Barrios, Wayner, et al., 2026). In serving engines like vLLM, this caching is implemented seamlessly by matching images based on their unique image hash before the Vision Encoder step even begins. If a hash match is found, the system simply retrieves the cached representation, meaning the computationally expensive vision model pass is skipped entirely, resulting in immense latency savings (read more here).
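The hash-match-then-skip behavior can be sketched as follows. The class name, the fake embedding, and the call counter are illustrative assumptions; the point is that an identical image (by content hash) never triggers a second encoder forward pass.

```python
import hashlib

# Minimal sketch of multimodal prefix caching: skip the expensive
# vision-encoder forward pass when an identical image was already
# encoded. Illustrative names, not vLLM's real internals.

class CachedVisionEncoder:
    def __init__(self) -> None:
        self.cache: dict[str, list[float]] = {}
        self.encoder_calls = 0  # counts real (uncached) forward passes

    def _forward(self, image_bytes: bytes) -> list[float]:
        self.encoder_calls += 1
        return [float(b) for b in image_bytes[:4]]  # fake embedding

    def encode(self, image_bytes: bytes) -> list[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self.cache:            # miss: run the vision model once
            self.cache[key] = self._forward(image_bytes)
        return self.cache[key]               # hit: GPU pass skipped entirely

ve = CachedVisionEncoder()
ve.encode(b"company_logo.png bytes")
ve.encode(b"company_logo.png bytes")  # hash hit: encoder not run again
assert ve.encoder_calls == 1
```

For workloads with recurring images (logos, document templates, system-prompt imagery), this turns the dominant visual-token cost into a one-time charge.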
Context Window Budget
Every LLM/VLM operates within a fixed input capacity known as the context window. As agentic AI systems become more prevalent, a variety of techniques have emerged to make more efficient use of this limited space:
- excluding intermediate reasoning from conversation history
- automatically pruning tool-call traces e.g. with Dynamic Context Pruning
- enforcing ultra-compact communication styles e.g. with Caveman
Image tokens, however, are significantly more difficult to optimize. In practice, once images are introduced into the context, they tend to persist in full—unlike text, which can be summarized or selectively removed.
For a Strategy B model (like LLaVA-OneVision-7B (Li, Bo, et al., 2024)) using a \( 3 \times 3 \) grid, a single image might consume \( \approx 2900 \) tokens. Given that the model has a 32k context window, using 3 to 5 images can consume 30-45% of your entire input capacity. Even worse, if you are serving a model with a pre-defined 4k context limit (e.g. to control memory usage), a single image blocks roughly 70% of the total input. These scenarios leave little room for few-shot examples or long conversation history, potentially degrading the model’s ability to follow complex instructions.
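The budget arithmetic above is easy to reproduce. The per-image token count is the \( \approx 2900 \) figure from the text; the window sizes (32,768 and 4,096) are the common power-of-two interpretations of "32k" and "4k".

```python
# Context-window budget: what fraction of the input does each image eat?
TOKENS_PER_IMAGE = 2900  # ~3x3 grid, Strategy B model (from the text)

def context_share(num_images: int, window: int) -> float:
    """Fraction of the context window consumed by visual tokens."""
    return num_images * TOKENS_PER_IMAGE / window

print(f"{context_share(3, 32_768):.0%}")  # 3 images in a 32k window
print(f"{context_share(5, 32_768):.0%}")  # 5 images in a 32k window
print(f"{context_share(1, 4_096):.0%}")   # 1 image in a 4k window
```

Running this gives roughly 27%, 44%, and 71% respectively, matching the ranges quoted above.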
For this reason, it’s valuable to give users explicit control over how many tokens they are willing to allocate to visual inputs. One example of this approach is the variable vision token limits in the Gemma 4 (Farabet, Clement, et al., 2026) model family, which allow dynamic trade-offs between image fidelity and token usage.
The Cascading Impact on VRAM
Perhaps the most critical “hidden” cost of Multimodal Learning is memory. When serving models, your maximum throughput is bound by how many requests can fit into GPU memory simultaneously (your Batch Size). This boundary is dictated heavily by the KV Cache.
The KV Cache stores the computed Key and Value vectors for all previous tokens in a sequence, preventing the model from recomputing them during the decoding phase. Unlike text tokens, which accumulate slowly as a user types or a model generates, Visual Tokens are dumped into the KV Cache all at once during the prefill phase.
- Higher VT Count \( \rightarrow \) Larger KV Cache footprint per request
- Larger KV Cache \( \rightarrow \) Fewer requests fit in VRAM
- Fewer Requests \( \rightarrow \) Smaller Batch Size
This creates a brutal cascading effect on your infrastructure costs. If a high-resolution grid strategy increases your visual tokens by 10x, you might be forced to reduce your batch size by roughly the same factor just to avoid Out-Of-Memory (OOM) errors. You are effectively multiplying your cost per inference, as you now need significantly more GPUs to handle the same amount of user traffic.
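The cascade above can be made concrete with back-of-the-envelope KV-cache sizing. The model shape below (32 layers, 8 KV heads, head dim 128, fp16) and the 40 GB KV budget are illustrative assumptions, not any specific model's configuration; the per-token formula itself is the standard one: 2 (K and V) times layers times KV heads times head dim times bytes per element.

```python
# Back-of-the-envelope KV-cache sizing (illustrative model shape, fp16).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2
BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 128 KiB

def max_batch_size(vram_for_kv_gb: float, tokens_per_request: int) -> int:
    """How many concurrent requests fit in the KV-cache budget."""
    budget_bytes = vram_for_kv_gb * 1024**3
    return int(budget_bytes // (tokens_per_request * BYTES_PER_TOKEN))

# Same 40 GB of KV budget, 500 text tokens per request, with and
# without a 2900-token image attached to every request:
print(max_batch_size(40, 500))         # text-only: 655 requests
print(max_batch_size(40, 500 + 2900))  # with image: 96 requests (~6.8x fewer)
```

A single high-resolution image per request cuts the feasible batch size by nearly 7x under these assumptions, which is exactly the throughput-per-GPU collapse the cascade describes.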
The token calculations from Demystifying Multimodal Learning: The Hidden Inefficiency in Vision Language Modelling are not just theoretical trivia—they are the direct levers that dictate your context limits, compute bottlenecks, and hardware budgets.
Conclusions & Key Takeaways
As we have explored, the number of Visual Tokens a model generates is far more than an architectural quirk—it is a defining metric for production viability. While it might be tempting to chase state-of-the-art benchmark scores by feeding massive, high-resolution token grids into a VLM, doing so blindly sacrifices Latency, Context Windows and VRAM.
We saw how pushing too many tokens starves your available context limit, creates massive latency bottlenecks during the prefill phase, and monopolizes the KV Cache, which ultimately cripples your maximum batch size. Even with highly optimized serving engines like vLLM employing processor caching, decoupled parallelization strategies, and multimodal prefix caching to bypass redundant vision encoding, there is no software magic that can completely erase the hardware tax of a bloated token count.
Answering our initial question: to build commercially viable, scalable multimodal applications, we must treat token efficiency as a primary objective for model selection and architectural design. Caching shared visual context provides a critical safety valve for latency, but moving forward, the most successful multimodal systems won’t necessarily be the ones that process the most visual tokens; they will be the ones that compress visual reality into the fewest, smartest tokens possible.
Citation
If you use this work, please cite:
@misc{nulli2026impactvisualtokens,
title={Demystifying Multimodal Learning: Impact of Visual Tokens on Inference Latency},
author={Nulli, Matteo and Mazur, Marcin},
year={2026},
url={https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-impact-vt-laten},
howpublished={Available at \url{https://matteonulli.github.io/blog/2026/demystifying2/} and \url{https://huggingface.co/blog/MatteoNulli/de-mystifying-multimodal-learning-impact-vt-laten}},
note={Hugging Face Blog}
}