Hello! Thank you so much for the contribution of this repo.
I'm very interested in this work, and I'm surveying papers with keywords like "captioning anything", "instance-level captioning", or "per-pixel captioning". Could you recommend some related work?
@LengZhuo0831 As far as I know, dense captioning is the most closely related topic: it generates captions at the region/object level. Scene graph generation is another way to describe an image at the instance level, treating instances as graph nodes and the relationships between them as edges.
Here are several early seminal works:
image-based: DenseCap: Fully Convolutional Localization Networks for Dense Captioning
video-based: Dense-Captioning Events in Videos
3D data-based: Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
Some recent works combine LLMs with fine-grained visual experts to generate dense captions:
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions