Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen [52], we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input.
2021: Constantin Eichenberg, Sid Black, Samuel Weinbach, Letitia Parcalabescu, A. Frank
https://arxiv.org/pdf/2112.05253v1.pdf
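
The abstract describes the core idea of adapter-based multimodal augmentation: image features are projected into the language model's embedding space as a "visual prefix", and only small adapter modules inside the otherwise frozen LM are trained. The sketch below is an illustrative toy implementation of that pattern, not the authors' code; the module names, dimensions, number of prefix tokens, and the toy transformer LM are all assumptions made for clarity.

```python
# Minimal sketch of adapter-based multimodal augmentation in the spirit of
# MAGMA/Frozen. Image features become a visual prefix prepended to the text
# embeddings of a frozen LM; bottleneck adapters are the trainable parts.
# All sizes and module names here are illustrative assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class VisualPrefix(nn.Module):
    """Maps pooled image-encoder features to n_prefix LM embedding vectors."""
    def __init__(self, img_dim: int, lm_dim: int, n_prefix: int = 4):
        super().__init__()
        self.n_prefix = n_prefix
        self.proj = nn.Linear(img_dim, lm_dim * n_prefix)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, img_dim) -> (batch, n_prefix, lm_dim)
        return self.proj(img_feats).view(img_feats.size(0), self.n_prefix, -1)


class ToyMultimodalLM(nn.Module):
    """Toy frozen LM whose blocks are each followed by a trainable adapter."""
    def __init__(self, vocab: int = 1000, dim: int = 128, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.adapters = nn.ModuleList([Adapter(dim) for _ in range(n_layers)])
        self.lm_head = nn.Linear(dim, vocab)
        # Freeze the pretrained LM weights; only adapters (and the visual
        # prefix projection, trained outside this class) receive gradients.
        for p in (list(self.embed.parameters())
                  + list(self.blocks.parameters())
                  + list(self.lm_head.parameters())):
            p.requires_grad = False

    def forward(self, prefix: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # prefix: (batch, n_prefix, dim); tokens: (batch, seq)
        x = torch.cat([prefix, self.embed(tokens)], dim=1)
        seq_len = x.size(1)
        # Causal mask so generation stays autoregressive over the joint sequence.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        for block, adapter in zip(self.blocks, self.adapters):
            x = adapter(block(x, src_mask=mask))
        return self.lm_head(x)  # next-token logits for prefix + text positions


if __name__ == "__main__":
    img_feats = torch.randn(2, 512)          # stand-in for image-encoder output
    tokens = torch.randint(0, 1000, (2, 8))  # stand-in for a tokenized caption
    prefix = VisualPrefix(img_dim=512, lm_dim=128)(img_feats)
    logits = ToyMultimodalLM()(prefix, tokens)
    print(logits.shape)  # torch.Size([2, 12, 1000]) -> 4 prefix + 8 text positions
```

Because the visual prefix tokens are ordinary embeddings from the LM's point of view, text and image inputs can be interleaved in any order, which is what allows autoregressive generation from arbitrary combinations of visual and textual input.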