In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object- and image-centric representations in the OVD setting.
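The weight transfer idea described above can be illustrated with a minimal sketch: classifier weights derived from one alignment strategy are passed through a learned linear map and combined with weights from the other. The function name, the additive blending, and the identity transfer matrix below are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_transfer(w_region, w_pseudo, transfer_matrix):
    """Hypothetical weight transfer: project region-alignment classifier
    weights through a linear map and blend them with pseudo-label weights.
    Purely illustrative; not the authors' formulation."""
    projected = w_region @ transfer_matrix      # (C, D) @ (D, D) -> (C, D)
    combined = projected + w_pseudo             # aggregate the two strategies
    # L2-normalize rows so they act as CLIP-style classifier embeddings
    return combined / np.linalg.norm(combined, axis=1, keepdims=True)

C, D = 4, 8  # toy sizes: 4 classes, 8-dim embeddings
w_region = rng.normal(size=(C, D))   # stand-in for region-aligned weights
w_pseudo = rng.normal(size=(C, D))   # stand-in for pseudo-label weights
T = np.eye(D)                        # identity stands in for a learned map

w = weight_transfer(w_region, w_pseudo, T)
print(w.shape)
```

In a real system the transfer matrix would be learned end-to-end so that gradients flowing through one branch also shape the other, which is one plausible way two alignment strategies could share strengths.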
Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, F. Khan (2022)
https://arxiv.org/pdf/2207.03482v1.pdf