Object-Guided Visual Tokens: Eliciting Compositional Reasoning in Multimodal Language Models

M. Nulli, I. Najdenkoska, M. M. Derakhshani, V. Orshulevich, M. Dorkenwald and Y. M. Asano
University of Amsterdam, eBay, University of Technology Nuremberg, ELLIS Unit Amsterdam
📄 Paper | 📜 Full Thesis | 📝 Blogpost | 🧑‍💻 Code
Accepted at the EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)


Motivation

Matteo Nulli going through Figure 1 at the ELLIS Honours Presentation (left), Matteo Nulli presenting the paper at EurIPS (right)

Most Multimodal Large Language Models (MLLMs) (2, 3, 4, 5, 6) use contrastively pre-trained vision encoders (7). They work well on many tasks, but often struggle with compositional understanding and with reasoning about interdependencies between objects, as highlighted in 8 and 9. That’s because these encoders are mainly trained for image–caption retrieval, not for truly breaking down and understanding all parts of a scene. Another issue is efficiency: state-of-the-art vision encoders generate 2–3x more visual tokens (Any-Resolution in 6 and the Spatial Visual Aggregator in 5), which slows down both training and inference.

To tackle these problems, we introduce OG-LLaVA (Object-Guided LLaVA). With our new connector design, OG-Fusion, the model can reason about visual content more effectively—without adding lots of extra tokens or fine-tuning the vision encoder itself. At the core of OG-Fusion is a simple but powerful idea: combine CLIP representations with segmentation masks. This lets OG-LLaVA leverage the descriptive strength of segmentation models (10) to better capture object relationships and spatial arrangements. The result?
OG-LLaVA outperforms comparable models on tasks demanding deep visual reasoning and grounding, while staying efficient.

Figure 1: OG-LLaVA architecture with the OG-Fusion internal process.

Methodology

Given a single input image $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$, we denote by $\mathbf{M} = \{\mathbf{m}_i \mid i = 1,\dots,N\} \subset \mathbb{R}^{H \times W}$ the corresponding set of $N$ binary segmentation masks, where each mask satisfies $\mathbf{m}_i \in \{0,1\}^{H \times W}$. Our objective is to construct a set of segmentation-aware visual tokens, such that each variable-length token segment is explicitly associated with one object mask.

For clarity, we describe the procedure assuming a batch size of one; a fully rigorous mathematical formulation is deferred to the Appendix of the paper (see Section Object-Guided Visual Tokens).

Mask and Feature Extraction

We begin by extracting object-level structure from each image through a segmentation model, which produces a set of binary masks $\mathbf{M}$. During training, visual features are obtained from a Vision Encoder, yielding representations $\mathbf{X}' \in \mathbb{R}^{V \times F}$. These features are aligned with the corresponding segmentation masks, which are passed through an ad-hoc downsampling operator designed to preserve object-centric information.
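As a rough illustration of this step, the sketch below pairs the HuggingFace CLIP ViT-L/14@336 vision tower with a hypothetical `segmenter` callable standing in for the segmentation model. Function names, shapes, and pre-processing are illustrative assumptions, not the exact OG-LLaVA implementation.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()

def extract_masks_and_features(image, segmenter):
    """Return N binary masks (N, H, W) and patch features X' of shape (V, F)."""
    # `segmenter` is a hypothetical callable; any model returning per-object
    # binary masks would fit here.
    masks = segmenter(image)                              # (N, H, W), values in {0, 1}
    pixels = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        out = encoder(pixel_values=pixels)
    feats = out.last_hidden_state[0, 1:]                  # drop CLS -> (V, F) = (576, 1024)
    return masks, feats
```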

Downsampling Operator

We define a downsampling operator $\Phi_{\alpha}$ that maps a high-resolution binary mask to a lower-resolution representation. Each output bin aggregates neighboring pixels and is marked as foreground if it contains at least $\alpha$ foreground pixels. Applying this operator independently to all $N$ masks results in a set of downsampled binary masks:

\[ \mathbf{M}' = \bigl\{ \Phi_{\alpha}(\mathbf{m}_i) \;\bigm|\; i = 1,\dots,N \bigr\} \subset \{0,1\}^{V} \]

Further implementation details of $\Phi_{\alpha}$ are provided in Appendix A of the paper.
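To make the operator concrete, here is a minimal PyTorch sketch of $\Phi_{\alpha}$ on the 24x24 patch grid of CLIP ViT-L/14@336. The grid size, patch size, and default $\alpha$ are assumptions chosen for illustration, not the values used in the paper.

```python
import torch

def phi_alpha(mask: torch.Tensor, grid: int = 24, patch: int = 14, alpha: int = 1) -> torch.Tensor:
    """Map a binary mask (H, W) to a binary vector over the V = grid * grid ViT bins."""
    # Resize so the mask tiles the patch grid exactly (nearest keeps values binary).
    mask = torch.nn.functional.interpolate(
        mask[None, None].float(), size=(grid * patch, grid * patch), mode="nearest"
    )[0, 0]
    # Count foreground pixels inside each bin, then threshold at alpha.
    bins = mask.reshape(grid, patch, grid, patch).sum(dim=(1, 3))   # (grid, grid)
    return (bins >= alpha).flatten()                                # (V,)
```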

Object-Guided Visual Tokens

Following preprocessing, the downsampled masks $\mathbf{M}'$ are applied to the visual feature matrix $\mathbf{X}'$ through an index-based row-selection matrix $P_i$. This operation extracts object-specific visual fragments:

\[ \mathbf{Y}_i = P_i\,\mathbf{X}' \;\in\; \mathbb{R}^{t_i \times F}. \]

The resulting fragments are then projected into the language-model embedding space via a Multi-Layer Perceptron (MLP), producing Object-Guided Visual Tokens (OGVT):

\[ \boxed{ \textbf{OGVT} \;:=\; MLP(\mathbf{Y}) \;\in\; \mathbb{R}^{T \times D} } \]

Here, $T$ denotes the total number of object-bearing bins retained after downsampling. While $T$ varies per image, its expected value remains close to the original token count ($T \approx V$). When multiple masks overlap on the same ViT bin, that bin is duplicated across different $\mathbf{Y}_i$. Due to mask thresholding and the use of rotary positional embeddings (RoPE), these duplicated tokens yield distinct projections and therefore do not collapse into identical attention keys. As a result, attention is naturally biased toward regions with higher object density, effectively reintroducing spatial grounding that is otherwise weakened in standard Transformer architectures. A formal analysis of token duplication effects is provided in Appendix B of the paper.
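The sketch below illustrates this construction: per-mask row selection (the effect of $P_i$), concatenation of the variable-length fragments $\mathbf{Y}_i$, and the MLP projection into the LLM embedding space. The function and argument names are ours, and it is a simplified batch-size-one view rather than the actual OG-Fusion code.

```python
import torch
import torch.nn as nn

def build_ogvt(feats: torch.Tensor, masks_ds: list[torch.Tensor], mlp: nn.Module) -> torch.Tensor:
    """feats: (V, F) patch features; masks_ds: list of (V,) binary masks Phi_alpha(m_i)."""
    fragments = []
    for m in masks_ds:
        idx = m.nonzero(as_tuple=True)[0]       # rows selected by P_i
        if idx.numel() > 0:
            fragments.append(feats[idx])        # Y_i in R^{t_i x F}
    Y = torch.cat(fragments, dim=0)             # (T, F); bins under overlapping masks repeat
    return mlp(Y)                               # OGVT in R^{T x D}
```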

Model Architecture, Training, and Inference

To preserve architectural compatibility with LLaVA-1.5, we adopt CLIP ViT-L/14@336 (7) as the vision encoder. While CLIP is our primary choice, the proposed method is agnostic to the specific vision backbone.

We experiment with two large language models: Llama 3.1-8B-Instruct (11) and Llama 3.2-3B-Instruct (12), resulting in two variants, OG-LLaVA-8B and OG-LLaVA-3B, respectively. An overview of the full OG-Fusion pipeline is shown in Figure 1. The architecture integrates a frozen Segment Anything Model 2 (SAM2) (10) backbone, followed by the object-guided token construction described above and a two-hidden-layer MLP with GeLU activations.
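For concreteness, a two-hidden-layer GeLU MLP of the kind described above could look as follows. The hidden width is an assumption; the input and output dimensions correspond to CLIP ViT-L/14 features (F = 1024) and the Llama 3.1-8B embedding size (D = 4096; 3072 for the 3B variant).

```python
import torch.nn as nn

def og_fusion_mlp(feat_dim: int = 1024, hidden: int = 4096, llm_dim: int = 4096) -> nn.Sequential:
    """Two-hidden-layer MLP with GeLU activations, projecting Y into the LLM embedding space."""
    return nn.Sequential(
        nn.Linear(feat_dim, hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.GELU(),
        nn.Linear(hidden, llm_dim),
    )
```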

Training follows the visual instruction tuning paradigm of 13 and proceeds in two stages:
(i) Vision–Language Alignment, where only OG-Fusion is unfrozen;
(ii) Supervised Fine-Tuning, where both the LLM and OG-Fusion are trained using LoRA (14).

The OGVTs are then given as input to the Large Language Model together with the textual tokens to produce an output.
The ❄️ (snowflake) and 🔥 (fire) symbols in Figure 1 mark modules whose parameters are kept frozen or trained, respectively.
The LoRA label indicates that only the LoRA adapter layers of the LLM are updated, not its full parameter set.
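A minimal sketch of how these two stages could be configured with the `peft` library is given below. The LoRA rank, alpha, target modules, and the helper name `configure_stage` are illustrative assumptions rather than the paper's hyperparameters.

```python
from peft import LoraConfig, get_peft_model

def configure_stage(llm, og_fusion, stage: int):
    """Set trainable parameters for the two-stage recipe described above."""
    for p in llm.parameters():
        p.requires_grad = False
    for p in og_fusion.parameters():
        p.requires_grad = True              # OG-Fusion is trained in both stages
    if stage == 2:
        # Supervised fine-tuning: wrap the LLM with LoRA adapters; only the
        # adapter weights receive gradients, not the full LLM parameter set.
        cfg = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
        llm = get_peft_model(llm, cfg)
    return llm, og_fusion
```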

Although OGVTs are constructed using segmentation masks during training, the model can be evaluated both with and without mask infusion, demonstrating robustness by preserving the semantic structure of the original visual features $\mathbf{X}'$.
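One plausible way to realize this mask-free mode, sketched below under our own assumptions, is to fall back to selecting every ViT bin exactly once, so that the OGVT reduce to standard MLP-projected patch tokens (reusing `build_ogvt` from the sketch above).

```python
import torch

def ogvt_inference(feats, masks_ds, mlp):
    """Build OGVT with masks when available, otherwise from the plain patch features."""
    if not masks_ds:                       # no segmentation model run at inference time
        masks_ds = [torch.ones(feats.shape[0], dtype=torch.bool)]  # select every bin once
    return build_ogvt(feats, masks_ds, mlp)
```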

Results

Our results on compositional reasoning and vision-centric benchmarks (Table 1) show that OG-LLaVA consistently outperforms its baselines across both the LLaVA-1.5 and Cambrian-1 training setups. The improvements are not marginal; they are large and systematic.

  • Compositional understanding
    • ARO:
      • +21% on COCO-Order (38.2 → 82.6) and +16% on Flickr-Order (49.1 → 84.0).
      • Visual Genome Attribution on average +10% across backbones and on Visual Genome Relation +20% across training data and model sizes.
    • ConMe: steady +2% gains, peaking at 65.2 in the 8B setting (+3.6 over the strongest baseline).
  • Vision-centric reasoning
    • MMVP: about +3 points on average (e.g. 32.0 → 37.0 in 8B, 61.6 → 66.0 with Cambrian-1 data).
    • CVBench: stable performance, with only ±1 point fluctuations.
Table 1: OG-LLaVA performance on Compositional Reasoning and Vision Centric tasks compared with LLaVA baselines.

In Figure 7, we compare OG-LLaVA-8B with SIT-8B and LLaVA-1.5-8B under the same backbone. SIT-8B refers to Subobject-level Image Tokenization (SIT), a recent study employing a comparable segmentation-infusion method. The results are clear: OG-LLaVA consistently outperforms SIT, with more than a 25% advantage on compositional reasoning and a 10% edge in visual grounding.

There’s also a key difference in usability. OG-LLaVA works flexibly both with and without segmentation masks at inference, while SIT requires pre-computed masks every time. This not only adds non-trivial overhead, since a separate segmentation model must run first, but also makes the system less adaptable. In practice, SIT's reduced token count does not outweigh the complexity it introduces, whereas OG-LLaVA preserves efficiency without imposing such constraints.

Figure 7: OG-LLaVA vs Subobject Level Image Tokenization and LLaVA-1.5 on Compositional Reasoning and Vision Centric tasks.

Qualitative Results

Figure 2: OG-LLaVA vs LLaVA-1.5 on Compositional Reasoning Benchmark ConMe.
Figure 3: OG-LLaVA vs LLaVA-1.5 on Vision Grounding benchmark MMVP.

The images we picked cover all kinds of tricky challenges—spotting tiny details, telling apart subtle colors, reading depth cues, recognizing materials, making sense of spatial layouts, and even detecting small objects. They’re designed to push visual–language reasoning to its limits. What’s key is that these examples are tested at inference time with no extra fine-tuning, so any boost (or drop) in performance comes purely from the Object-Guided priors built into OG-LLaVA.

In Figures 4, 5, and 6, we highlight a range of cases where OG-LLaVA consistently demonstrates sharper perception and more grounded reasoning, from subtle posture cues to tricky color judgments and material recognition.

Together, these examples underline how OG-LLaVA moves beyond surface-level cues. It pays attention to fine details, adapts across diverse tasks, and reasons about entire scenes in a way that more closely reflects human understanding.

Figure 4: OG-LLaVA vs LLaVA-1.5 on ConMe Replace-Relation examples.
Figure 5: OG-LLaVA vs LLaVA-1.5 on ConMe Replace-Object examples.
Figure 6: OG-LLaVA vs LLaVA-1.5 on ConMe Replace-Relation examples.

Citation

If you use this work, please cite:

@inproceedings{nulli2025objectguided,
  title={Object-Guided Visual Tokens: Eliciting Compositional Reasoning in Multimodal Language Models},
  author={Matteo Nulli and Ivona Najdenkoska and Mohammad Mahdi Derakhshani and Yuki M Asano},
  booktitle={EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)},
  year={2025},
  url={https://openreview.net/forum?id=yvY1T3hHEQ}
}

References

Huang Irene, Lin Wei, Mirza M. Jehanzeb, Hansen Jacob A., Doveh Sivan, Butoi Victor Ion, Herzig Roei, Arbelle Assaf, Kuehne Hilde, Darrell Trevor, Gan Chuang, Oliva Aude, Feris Rogerio, Karlinsky Leonid. (2024). Conme: Rethinking Evaluation of Compositional Reasoning for Modern VLMs. arXiv preprint arXiv:2406.08164.

Tong Shengbang, Liu Zhuang, Zhai Yuexiang, Ma Yi, LeCun Yann, Xie Saining. (2024). Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. arXiv preprint arXiv:2401.06209.

Liu Haotian, Li Chunyuan, Wu Qingyang, Lee Yong Jae. (2023). Visual Instruction Tuning. arXiv preprint arXiv:2304.08485.

Bai Shuai, Chen Keqin, Liu Xuejing, Wang Jialin, Ge Wenbin, Song Sibo, Dang Kai, Wang Peng, Wang Shijie, Tang Jun, Zhong Humen, Zhu Yuanzhi, Yang Mingkun, Li Zhaohai, Wan Jianqiang, Wang Pengfei, Ding Wei, Fu Zheren, Xu Yiheng, Ye Jiabo, Zhang Xi, Xie Tianbao, Cheng Zesen, Zhang Hang, Yang Zhibo, Xu Haiyang, Lin Junyang. (2025). Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.

OpenGVLab-Team. (2024). InternVL2: Better Than the Best—Expanding Performance Boundaries of Open-Source Multimodal Models with the Progressive Scaling Strategy. Blog post. URL https://internvl.github.io/blog/2024-07-02-InternVL-2.0/.

Gemma-Team. (2025). Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.

Yuksekgonul Mert, Bianchi Federico, Kalluri Pratyusha, Jurafsky Dan, Zou James. (2023). When and Why Vision-Language Models Behave Like Bags-of-Words, and What to Do About It? arXiv preprint arXiv:2210.01936.

Nulli Matteo, Ibrahimi Anesa, Pal Avik, Lee Hoshe, Najdenkoska Ivona. (2024). In-Context Learning Improves Compositional Understanding of Vision-Language Models. In ICML 2024 Workshop on Foundation Models in the Wild. arXiv preprint arXiv:2407.15487.

Awal Rabiul, Ahmadi Saba, Zhang Le, Agrawal Aishwarya. (2025). Vismin: Visual Minimal-Change Understanding. arXiv preprint arXiv:2407.16772.

Tong Shengbang, Brown Ellis, Wu Penghao, Woo Sanghyun, Middepogu Manoj, Akula Sai Charitha, Yang Jihan, Yang Shusheng, Iyer Adithya, Pan Xichen, Wang Austin, Fergus Rob, LeCun Yann, Xie Saining. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv preprint arXiv:2406.16860.

Liu Haotian, Li Chunyuan, Li Yuheng, Li Bo, Zhang Yuanhan, Shen Sheng, Lee Yong Jae. (2024). LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. Blog post (January 2024). URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.

Ravi Nikhila, Gabeur Valentin, Hu Yuan-Ting, Hu Ronghang, Ryali Chaitanya, Ma Tengyu, Khedr Haitham, Rädle Roman, Rolland Chloe, Gustafson Laura, Mintun Eric, Pan Junting, Alwala Kalyan Vasudev, Carion Nicolas, Wu Chao-Yuan, Girshick Ross, Dollár Piotr, Feichtenhofer Christoph. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.

Li Xiangtai, Yuan Haobo, Li Wei, Ding Henghui, Wu Size, Zhang Wenwei, Li Yining, Chen Kai, Loy Chen Change. (2024). OMG-Seg: Is One Model Good Enough for All Segmentation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27948–27959.

Chen Guo, Li Zhiqi, Wang Shihao, Jiang Jindong, Liu Yicheng, Lu Lidong, Huang De-An, Byeon Wonmin, Le Matthieu, Rintamaki Tuomas, Poon Tyler, Ehrlich Max, Lu Tong, Wang Limin, Catanzaro Bryan, Kautz Jan, Tao Andrew, Yu Zhiding, Liu Guilin. (2025). EAGLE 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models. arXiv preprint arXiv:2504.15271.

Zhang Tao, Li Xiangtai, Fei Hao, Yuan Haobo, Wu Shengqiong, Ji Shunping, Loy Chen Change, Yan Shuicheng. (2024). OMG-LLaVA: Bridging Image-Level, Object-Level, Pixel-Level Reasoning and Understanding. arXiv preprint arXiv:2406.19389.

Yuan Haobo, Li Xiangtai, Zhang Tao, Huang Zilong, Xu Shilin, Ji Shunping, Tong Yunhai, Qi Lu, Feng Jiashi, Yang Ming-Hsuan. (2025). SA2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos. arXiv preprint arXiv:2501.04001.

Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, Krueger Gretchen, Sutskever Ilya. (2021). Learning Transferable Visual Models from Natural Language Supervision. arXiv preprint arXiv:2103.00020.

Liu Haotian, Li Chunyuan, Li Yuheng, Lee Yong Jae. (2024). Improved Baselines with Visual Instruction Tuning. arXiv preprint arXiv:2310.03744.

Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, Houlsby Neil. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.

Touvron Hugo, Martin Louis, Stone Kevin, Albert Peter, Almahairi Amjad, Babaei Yasmine, Bashlykov Nikolay, Batra Soumya, Bhargava Prajjwal, Bhosale Shruti, Bikel Dan, Blecher Lukas, Canton Ferrer Cristian, Chen Moya, Cucurull Guillem, Esiobu David, Fernandes Jude, Fu Jeremy, Fu Wenyin, Fuller Brian, Gao Cynthia, Goswami Vedanuj, Goyal Naman, Hartshorn Anthony, Hosseini Saghar, Hou Rui, Inan Hakan, Kardas Marcin, Kerkez Viktor, Khabsa Madian, Kloumann Isabel, Korenev Artem, Koura Punit Singh, Lachaux Marie-Anne, Lavril Thibaut, Lee Jenya, Liskovich Diana, Lu Yinghai, Mao Yuning, Martinet Xavier, Mihaylov Todor, Mishra Pushkar, Molybog Igor, Nie Yixin, Poulton Andrew, Reizenstein Jeremy, Rungta Rashi, Saladi Kalyan, Schelten Alan, Silva Ruan, Smith Eric Michael, Subramanian Ranjan, Tan Xiao-qing Ellen, Tang Binh, Taylor Ross, Williams Adina, Kuan Jian Xiang, Xu Puxin, Yan Zheng, Zarov Iliyan, Zhang Yuchen, Fan Angela, Kambadur Melanie, Narang Sharan, Rodriguez Aurelien, Stojnic Robert, Edunov Sergey, Scialom Thomas. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.

Meta. (2024). Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models. Blog post. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/.

Hu Edward J., Shen Yelong, Wallis Phillip, Allen-Zhu Zeyuan, Li Yuanzhi, Wang Shean, Wang Lu, Chen Weizhu. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.

Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, Zitnick C. Lawrence. (2014). Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pages 740–755. Springer.

Young Peter, Lai Alice, Hodosh Micah, Hockenmaier Julia. (2014). From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Krishna Ranjay, Zhu Yuke, Groth Oliver, Johnson Justin, Hata Kenji, Kravitz Joshua, Chen Stephanie, Kalantidis Yannis, Li Li-Jia, Shamma David A., et al. (2017). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123:32–73.

Hudson Drew A., Manning Christopher D. (2019). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709.

Hsieh Cheng-Yu, Zhang Jieyu, Ma Zixian, Kembhavi Aniruddha, Krishna Ranjay. (2023). SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. Advances in Neural Information Processing Systems, 36:31096–31116.

OpenAI. (2024). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Yunxuan, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, et al. (2024). Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research, 25(70):1–53.

Kembhavi Aniruddha, Salvato Mike, Kolve Eric, Seo Minjoon, Hajishirzi Hannaneh, Farhadi Ali. (2016). A Diagram is Worth a Dozen Images. arXiv preprint arXiv:1603.07396.

Fu Chaoyou, Chen Peixian, Shen Yunhang, Qin Yulei, Zhang Mengdan, Lin Xu, Yang Jinrui, Zheng Xiawu, Li Ke, Sun Xing, Wu Yunsheng, Ji Rongrong. (2024). MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.

Chen Lin, Li Jinsong, Dong Xiaoyi, Zhang Pan, Zang Yuhang, Chen Zehui, Duan Haodong, Wang Jiaqi, Qiao Yu, Lin Dahua, Zhao Feng. (2024). Are We on the Right Way for Evaluating Large Vision-Language Models? arXiv preprint arXiv:2403.20330.

Liu Yuan, Duan Haodong, Zhang Yuanhan, Li Bo, Zhang Songyang, Zhao Wangbo, Yuan Yike, Wang Jiaqi, He Conghui, Liu Ziwei, Chen Kai, Lin Dahua. (2024). MMBench: Is Your Multi-Modal Model an All-Around Player? arXiv preprint arXiv:2307.06281.

Chen Delong, Cahyawijaya Samuel, Liu Jianfeng, Wang Baoyuan, Fung Pascale. (2025). Subobject-Level Image Tokenization. arXiv preprint arXiv:2402.14327.

Rasley Jeff, Rajbhandari Samyam, Ruwase Olatunji, He Yuxiong. (2020). DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20), pages 3505–3506. doi:10.1145/3394486.3406703.

Rajbhandari Samyam, Rasley Jeff, Ruwase Olatunji, He Yuxiong. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. doi:10.1109/SC41405.2020.00024.

Kingma Diederik P., Ba Jimmy. (2017). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Loshchilov Ilya, Hutter Frank. (2019). Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.

Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, Polosukhin Illia. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.

Agrawal Pravesh, Antoniak Szymon, Bou Hanna Emma, Bout Baptiste, Chaplot Devendra, Chudnovsky Jessica, Costa Diogo, De Monicault Baudouin, Garg Saurabh, Gervet Theophile, Ghosh Soham, Héliou Amélie, Jacob Paul, Jiang Albert Q., Khandelwal Kartik, Lacroix Timothée, Lample Guillaume, Las Casas Diego, Lavril Thibaut, Le Scao Teven, Lo Andy, Marshall Louis, Martin Arthur, Mensch Arthur, Muddireddy Pavankumar, Nemychnikova Valera, Pellat Marie, Von Platen Patrick, Raghuraman Nikhil, Bout Rozière Baptiste, Sablayrolles Alexandre, Saulnier Lucile, Sauvestre Romain, Rozière Baptiste, Shang Wendy, Soletskyi Roman, Stewart Lawrence, Stock Pierre, Studnia Joachim, Subramanian Sandeep, Vaze Sagar, Wang Thomas, Yang Sophia. (2024). Pixtral 12B. arXiv preprint arXiv:2410.07073.

Su Jianlin, Lu Yu, Pan Shengfeng, Murtadha Ahmed, Wen Bo, Liu Yunfeng. (2023). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.

Dubey Abhimanyu, et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

Cherti Mehdi, Beaumont Romain, Wightman Ross, Wortsman Mitchell, Ilharco Gabriel, Gordon Cade, Schuhmann Christoph, Schmidt Ludwig, Jitsev Jenia. (2023). Reproducible Scaling Laws for Contrastive Language-Image Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829. doi:10.1109/CVPR52729.2023.00276.

Zhai Xiaohua, Mustafa Basil, Kolesnikov Alexander, Beyer Lucas. (2023). Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343.

Oquab Maxime, Darcet Timothée, Moutakanni Théo, Vo Huy, Szafraniec Marc, Khalidov Vasil, Fernandez Pierre, Haziza Daniel, Massa Francisco, El-Nouby Alaaeldin, Assran Mahmoud, Ballas Nicolas, Galuba Wojciech, Misra Ishan, Rabbat Michael, Sharma Vasu, Synnaeve Gabriel, Xu Hu, Jegou Hervé, Mairal Julien, Labatut Patrick, Joulin Armand, Bojanowski Piotr. (2024). DINOv2: Learning Robust Visual Features Without Supervision. arXiv preprint arXiv:2304.07193.

Cai Zheng, Cao Maosong, Chen Haojiong, Chen Kai, Chen Keyu, Chen Xin, Chen Xun, Chen Zehui, Chen Zhi, Chu Pei, Dong Xiaoyi, Duan Haodong, Fan Qi, Fei Zhaoye, Gao Yang, Ge Jiaye, Gu Chenya, Gu Yuzhe, Gui Tao, Guo Aijia, Guo Qipeng, He Conghui, Hu Yingfan, Huang Ting, Jiang Tao, Jiao Penglong, Jin Zhenjiang, Lei Zhikai, Li Jiaxing, Li Jingwen, Li Linyang, Li Shuaibin, Li Wei, Li Yining, Liu Hongwei, Liu Jiawei, Liu Kaiwen, Liu Kuikun, Liu Xiaoran, Lv Chengqi, Lv Haijun, Lv Kai, Ma Li, Ma Runyuan, Ma Zerun, Ning Wenchang, Ouyang Linke, Qiu Jiantao, Qu Yuan, Shang Fukai, Shao Yunfan, Song Demin, Song Zifan, Sui Zhihao, Sun Peng, Sun Yu, Tang Huanze, Wang Bin, Wang Guoteng, Wang Jiaqi, Wang Jiayu, Wang Rui, Wang Yudong, Wang Ziyi, Wei Xingjian, Weng Qizhen, Wu Fan, Xiong Yingtong, Xu Chao, Xu Ruiliang, Yan Hang, Yan Yirong, Yang Xiaogui, Ye Haochen, Ying Huaiyuan, Yu Jia, Yu Jing, Zang Yuhang, Zhang Chuyu, Zhang Li, Zhang Pan, Zhang Peng, Zhang Ruijie, Zhang Shuo, Zhang Songyang, Zhang Wenjian, Zhang Wenwei, Zhang Xingcheng, Zhang Xinyue, Zhao Hui, Zhao Qian, Zhao Xiaomeng, Zhao Fengzhe, Zhou Zaida, Zhou Jingming, Zhuo Jingming, Zou Yicheng, Qiu Xipeng, Qiao Yu, Lin Dahua. (2024). InternLM2 Technical Report. arXiv preprint arXiv:2403.17297.

Li Xiangtai, Yuan Haobo, Li Wei, Ding Henghui, Wu Size, Zhang Wenwei, Li Yining, Chen Kai, Loy Chen Change. (2024). OMG-Seg: Is One Model Good Enough for All Segmentation? arXiv preprint arXiv:2401.10229.

Zou Xueyan, Yang Jianwei, Zhang Hao, Li Feng, Li Linjie, Wang Jianfeng, Wang Lijuan, Gao Jianfeng, Lee Yong Jae. (2023). Segment Everything Everywhere All at Once. arXiv preprint arXiv:2304.06718.



