https://arxiv.org/pdf/2403.09611.pdf
In this paper, the authors describe how to build performant Multimodal Large Language Models (MLLMs), focusing on the interplay between architectural choices and the selection of pre-training data.
The core finding is that a carefully balanced mix of data types, including image-caption pairs, interleaved image-text documents, and text-only data, is crucial for achieving strong few-shot results across multiple benchmarks.
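To make the idea of a pre-training data mixture concrete, here is a minimal sketch of a weighted sampler over the three data sources. The weights below are illustrative placeholders, not the exact ratios reported in the paper, and the function names are my own.

```python
import random

# Hypothetical mixture weights -- illustrative only, not the exact
# proportions reported in the paper. The point is that captioned images,
# interleaved image-text documents, and text-only corpora are all sampled
# during pre-training according to fixed ratios.
MIXTURE_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_data_source(weights=MIXTURE_WEIGHTS, rng=random):
    """Pick which data source the next pre-training batch is drawn from."""
    sources = list(weights.keys())
    probs = list(weights.values())
    return rng.choices(sources, weights=probs, k=1)[0]

# Sanity check: over many draws, the batch counts track the target mixture.
counts = {name: 0 for name in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_data_source()] += 1
print(counts)
```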
They emphasize that the configuration of the image encoder, especially the image resolution and token count, significantly impacts model performance, whereas the design of the vision-language connector plays a lesser role.
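The resolution/token-count link is easy to see with a ViT-style encoder: the number of visual tokens grows quadratically with image resolution. The sketch below illustrates this, paired with a deliberately simple linear projection standing in for the vision-language connector; all dimensions and names are hypothetical, chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- chosen for illustration, not from the paper.
IMAGE_RESOLUTION = 336   # input image side length in pixels
PATCH_SIZE = 14          # ViT patch side length in pixels
VISION_DIM = 1024        # image-encoder output width
LLM_DIM = 4096           # language-model hidden width

# Token count scales with the square of resolution:
# (336 / 14)^2 = 576 tokens, versus (224 / 14)^2 = 256 tokens.
num_tokens = (IMAGE_RESOLUTION // PATCH_SIZE) ** 2

# A minimal vision-language connector: one linear projection mapping each
# visual token into the language model's embedding space.
connector = nn.Linear(VISION_DIM, LLM_DIM)

# Fake encoder output for a batch of 2 images: (batch, tokens, vision_dim).
vision_features = torch.randn(2, num_tokens, VISION_DIM)
llm_inputs = connector(vision_features)   # shape: (2, 576, 4096)
print(num_tokens, llm_inputs.shape)
```

The contrast in the finding is that scaling the encoder side (resolution, and with it the token count) moves the needle far more than swapping in a fancier connector than this simple projection.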