---
title: Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, which causes catastrophic degradation of the language models’ capability.
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: zhang23a
month: 0
tex_title: Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
firstpage: 127
lastpage: 133
page: 127-133
order: 127
cycles: false
bibtex_author: Zhang, Yuhui and McKinzie, Brandon and Gan, Zhe and Shankar, Vaishaal and Toshev, Alexander
author:
- given: Yuhui
  family: Zhang
- given: Brandon
  family: McKinzie
- given: Zhe
  family: Gan
- given: Vaishaal
  family: Shankar
- given: Alexander
  family: Toshev
date: 2023-04-24
address:
container-title: 'Proceedings on "I Can''t Believe It''s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops'
volume: 239
genre: inproceedings
issued:
  date-parts:
  - 2023
  - 4
  - 24
extras: []
---