Matryoshka Query Transformer for Large Vision-Language Models
Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang, in Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
Download the full text
Abstract
Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA’s fixed 576. Reducing to 16 tokens (8× fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens, with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and the computational cost brought about by the number of visual tokens facilitates future research into achieving the best of both worlds.
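For a concrete picture of the mechanism described in the abstract, the PyTorch sketch below illustrates the Matryoshka-style query transformer: M learnable latent queries cross-attend to the vision encoder's patch embeddings, and during training only a randomly chosen prefix of m ≤ M queries is kept, so any prefix length can be used at inference. This is an illustrative reconstruction under assumptions, not the authors' released implementation; the names (MatryoshkaQueryTransformer, max_queries, dim) are made up for the example, and the real model may stack more layers than the single cross-attention block shown here.

import random
import torch
import torch.nn as nn

class MatryoshkaQueryTransformer(nn.Module):
    """Compress patch embeddings into a flexible number of visual tokens."""

    def __init__(self, max_queries: int = 256, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # M learnable latent query tokens; only a prefix of them is used per step.
        self.latent_queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.max_queries = max_queries

    def forward(self, visual_embeds, m=None):
        # visual_embeds: (batch, num_patches, dim) from the vision encoder.
        if m is None:
            # Training: draw m <= M at random and keep only the first m queries.
            m = random.randint(1, self.max_queries)
        batch = visual_embeds.size(0)
        queries = self.latent_queries[:m].unsqueeze(0).expand(batch, -1, -1)
        # The m latent queries cross-attend to the patch embeddings,
        # compressing the image into m visual tokens for the language model.
        tokens, _ = self.cross_attn(queries, visual_embeds, visual_embeds)
        return tokens  # (batch, m, dim)

# At inference time, m is simply fixed to the available token budget, e.g.:
# mqt = MatryoshkaQueryTransformer()
# visual_tokens = mqt(vision_encoder_output, m=16)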
How to pick a good number of visual tokens? Too few, you have poor performance; too many, you need quadratically more compute. In this work, we introduce a model that works with an elastic number of tokens.
— Wenbo Hu (@gordonhu608), May 30, 2024
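The "quadratically more compute" point can be made concrete with a quick back-of-envelope calculation: prefill self-attention cost grows roughly with the square of the prompt length, so shrinking the visual-token count v shrinks (t + v)² for a fixed text length t. The numbers below are purely illustrative (the text length of 64 is an arbitrary assumption) and are not the TFLOP accounting used in the paper.

# Rough illustration of how attention cost scales with the visual-token budget.
text_tokens = 64  # hypothetical text prompt length, chosen arbitrarily
full = (text_tokens + 576) ** 2  # LLaVA-1.5's fixed 576 visual tokens
for v in (256, 64, 16, 2):
    rel = (text_tokens + v) ** 2 / full
    print(f"{v:>3} visual tokens -> ~{rel:.2f}x the attention cost of 576")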
Bib Entry
@inproceedings{hu2024mqt,
  title = {Matryoshka Query Transformer for Large Vision-Language Models},
  author = {Hu, Wenbo and Dou, Zi-Yi and Li, Liunian Harold and Kamath, Amita and Peng, Nanyun and Chang, Kai-Wei},
  year = {2024},
  booktitle = {Proceedings of The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  project_website = {https://gordonhu608.github.io/mqtllava/}
}
Related Publications
Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, and Nanyun Peng, in Findings of the Association for Computational Linguistics: ACL (ACL-findings), 2024.
Full Text Code BibTeX Details
@inproceedings{Qiu2024,
  title = {Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models},
  author = {Qiu, Haoyi and Hu, Wenbo and Dou, Zi-Yi and Peng, Nanyun},
  booktitle = {Findings of the Association for Computational Linguistics: ACL (ACL-findings)},
  year = {2024},
  project_website = {https://gordonhu608.github.io/VALOR-Eval/}
}
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, and Nanyun Peng, in Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
Full Text Abstract BibTeX Details
The ability to sequence unordered events is evidence of comprehension and reasoning about real world tasks/procedures, and is essential for applications such as task planning and multi-source instruction summarization. It often requires thorough understanding of temporal common sense and multimodal information, since these procedures are often conveyed by a combination of texts and images. While humans are capable of reasoning about and sequencing unordered procedural instructions, the extent to which the current machine learning methods possess such a capability is still an open question. In this work, we benchmark models’ capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from online instructional manuals and collecting comprehensive human annotations. We find current state-of-the-art models not only perform significantly worse than humans but also seem incapable of efficiently utilizing multimodal information. To improve machines’ performance on multimodal event sequencing, we propose sequence-aware pretraining techniques exploiting the sequential alignment properties of both texts and images, resulting in >5% improvements on perfect match ratio.
@inproceedings{wu2022procedural,
  title = {Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals},
  author = {Wu, Te-Lin and Spangher, Alex and Alipoormolabashi, Pegah and Freedman, Marjorie and Weischedel, Ralph and Peng, Nanyun},
  booktitle = {Proceedings of the Conference of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2022}
}
Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding
Zi-Yi Dou and Nanyun Peng, in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2021.
Full Text Code Abstract BibTeX Details
Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks requiring identifying objects based on language. With pre-trained vision-and-language models achieving impressive performance across tasks, it remains unclear if we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings and propose four fine-tuning objectives to improve the model phrase grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.
@inproceedings{dou2021improving,
  title = {Improving Pre-trained Vision-and-Language Embeddings for Phrase Grounding},
  author = {Dou, Zi-Yi and Peng, Nanyun},
  booktitle = {The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), short},
  year = {2021}
}
MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification
Te-Lin Wu, Shikhar Singh, Sayan Paul, Gully Burns, and Nanyun Peng, in The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
Full Text Code Abstract BibTeX Details
We introduce a new dataset, MELINDA, for Multimodal Biomedical Experiment Method Classification. The dataset is collected in a fully automated distant-supervision manner, where the labels are obtained from an existing curated database and the actual contents are extracted from the papers associated with each of the records in the database. We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which take only caption texts or images as inputs, and multimodal models. Our extensive experimental results show that multimodal models, despite outperforming the other benchmarked models, still require certain improvements, especially a less-supervised way of grounding visual concepts with language and better transfer learning for low-resource tasks. We release our dataset and the benchmarks to facilitate future research in multimodal learning, especially to motivate targeted improvements for applications in scientific domains.
@inproceedings{wu2021melinda,
  title = {MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification},
  author = {Wu, Te-Lin and Singh, Shikhar and Paul, Sayan and Burns, Gully and Peng, Nanyun},
  booktitle = {The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)},
  year = {2021}
}