# SherlookingArt

A project conducted during the King's College Prompting Hackathon.

To what extent can LLMs be useful for multi-modal knowledge acquisition and inference?

- Prior work has leveraged text-only LLMs for knowledge extraction and knowledge graph (KG) completion (overview here).
- We would like to extend such approaches to multi-modal knowledge, covering not only text and images but also audio, video, haptics, etc.
- The goal is to test the ability of multi-modal LLMs such as GPT-4 (among others) to construct and complete a multi-modal KG in the context of the MuseIT project (https://www.muse-it.eu/); a minimal sketch of such a triple-extraction step follows this list.
- It would be particularly interesting to explore the capabilities of LLMs for multi-modal reasoning and inference.
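
As an illustration of the extraction step envisaged above, here is a minimal sketch that asks a vision-capable chat model to propose KG triples for an artwork image and its caption. It assumes the OpenAI Python SDK; the model name, prompt wording, and image URL are illustrative placeholders, not project code.

```python
# Minimal sketch: prompting a multi-modal LLM to propose knowledge-graph
# triples for an artwork image plus its caption. Model name, prompt, and
# image URL are illustrative assumptions, not part of the project itself.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

IMAGE_URL = "https://example.org/artwork.jpg"  # placeholder image
CAPTION = "Oil on canvas, attributed to an unknown Flemish master."

PROMPT = (
    "Extract knowledge-graph triples from this artwork image and caption. "
    "Respond with a JSON array of [subject, predicate, object] triples "
    "and nothing else.\n\nCaption: " + CAPTION
)

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model would do here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)

# Assumes the model returned bare JSON; a real pipeline would validate this.
triples = json.loads(response.choices[0].message.content)
for subj, pred, obj in triples:
    print(f"({subj}, {pred}, {obj})")
```

In a full pipeline, the returned triples would be validated and merged into the existing KG rather than printed.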
