The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages.
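The bridging idea can be sketched in a few lines: a small set of learned query vectors cross-attends to frozen image features, and the attended outputs are projected into the language model's embedding space as soft visual tokens. This is a minimal single-head sketch, assuming toy dimensions and a plain dot-product attention; the actual Querying Transformer is a full transformer pre-trained in two stages.

```python
import numpy as np

# Illustrative sketch (toy shapes, not the paper's exact dimensions) of how
# a Querying Transformer bridges a frozen image encoder and a frozen LLM:
# learned queries cross-attend to frozen image features, and the K outputs
# are projected into the LLM embedding space as soft visual prompts.

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_bridge(image_feats, queries, w_proj):
    """One cross-attention step: queries attend over frozen image features,
    then a linear projection maps them to the LLM embedding dimension."""
    d = queries.shape[-1]
    attn = softmax(queries @ image_feats.T / np.sqrt(d))  # (K, N_patches)
    attended = attn @ image_feats                          # (K, d_vis)
    return attended @ w_proj                               # (K, d_llm)

n_patches, d_vis, n_queries, d_llm = 257, 64, 32, 128      # hypothetical sizes
image_feats = rng.standard_normal((n_patches, d_vis))      # frozen encoder output
queries = rng.standard_normal((n_queries, d_vis))          # learned query tokens
w_proj = rng.standard_normal((d_vis, d_llm))               # projection to LLM space

visual_prompts = qformer_bridge(image_feats, queries, w_proj)
print(visual_prompts.shape)  # (32, 128): K soft tokens fed to the frozen LLM
```

Only the queries and projection are trainable here; the image features (and, downstream, the language model) stay frozen, which is what makes the pre-training cheap.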
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi (2023)
Ranked #1 on Image Retrieval on COCO
https://arxiv.org/pdf/2301.12597v1.pdf