Red Hen Lab's Super Rapid Annotator Powered By Large Language Models
Welcome to the Super Rapid Annotator project! This tool is designed for video annotation by leveraging advanced multimodal vision and language models. 🚀
Annotating videos, especially identifying specific entities and their temporal relationships, is a complex and time-consuming task. Traditional methods lack efficiency and accuracy, particularly in handling multiple videos simultaneously. Our Super Rapid Annotator addresses these challenges by integrating state-of-the-art multimodal models with sophisticated spatial-temporal analysis, streamlining the annotation process.
We have updated the model to MiniCPM-V 2.6 which is a multimodal large language model (MLLM) designed for vision-language understanding. This model accepts images, videos, and text as inputs and generates high-quality text outputs. Since February 2024, five versions of the model have been released, focusing on strong performance and efficient deployment.
Model: MiniCPM-V 2.6 🤗 | Demo 🤖
Currently, the model and its dependencies are hosted as a Gradio app on Hugging Face.
Gradio App: GSOC Super Rapid Annotator 🤗
First, clone the repository and navigate into the project directory:
git clone https://github.com/manishkumart/Super-Rapid-Annotator-Multimodal-Annotation-Tool.git
cd Super-Rapid-Annotator-Multimodal-Annotation-Tool/gradio
Create a virtual environment and install the required packages:
conda create -n env python=3.10 -y
conda activate env
pip install -r requirements.txt
If you want to cache the model locally, you can download it by following the "Preliminary Setup" steps below, changing the location and the model name as needed.
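For reference, a minimal sketch of caching the weights with the huggingface_hub library (assuming the openbmb/MiniCPM-V-2_6 repository and a local ./models directory; adjust both to your setup) might look like this:

```python
from huggingface_hub import snapshot_download

# Download the MiniCPM-V 2.6 weights once so later runs load from disk.
# Change repo_id / local_dir to cache a different model or location.
snapshot_download(
    repo_id="openbmb/MiniCPM-V-2_6",
    local_dir="./models/MiniCPM-V-2_6",
)
```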
The project has been structured as follows:
- src
  - utils.py
  - video_model.py
- data
  - videos
  - train.csv
- multi_video_app.py (batch video processing)
- app.py (single video processing)
- requirements.txt
- README.md
Initially, the approach was to combine a vision-language large language model (VLLM) to process videos and a smaller LLM like Phi to structure the outputs. This method worked well for processing a single video. However, when processing multiple videos in batches, having two LLMs in the pipeline introduced excessive context, making the LLM prone to hallucinations.
To address this, the constraints of using two LLMs were removed, focusing solely on using the VLLM for both video processing and output structuring. The key challenges faced during this refinement are outlined below.
- Video Length and Frame Extraction: Each video was 4 seconds long. Processing 300 videos amounted to 1200 seconds (20 minutes) of video, with each video containing an average of 80 frames. To process this efficiently, we extracted 15-20 random frames spanning each video from beginning to end. These frames were stitched together into a single image, making it easier to understand the video's nuances.
2016-01-01_0000_US_KNBC_The_Ellen_DeGeneres_Show_91.07-95.45_today.mp4
The above video can be translated into 16 frames stitched into a grid format:
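As an illustration, here is a minimal sketch of this sampling-and-stitching step, assuming OpenCV and NumPy are available. It samples evenly spaced frames, whereas the project also experimented with random sampling; the repository's actual implementation may differ.

```python
import cv2
import numpy as np

def stitch_video_frames(video_path, n_frames=16, grid=(4, 4), size=(320, 180)):
    """Sample n_frames evenly spaced frames from a video and tile them
    into a single grid image (row-major order)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append(cv2.resize(frame, size))
    cap.release()

    # Pad with black frames if decoding fell short, then tile row by row.
    while len(frames) < grid[0] * grid[1]:
        frames.append(np.zeros((size[1], size[0], 3), dtype=np.uint8))
    rows = [np.hstack(frames[r * grid[1]:(r + 1) * grid[1]]) for r in range(grid[0])]
    return np.vstack(rows)

# Example: write the 4x4 grid for one clip to disk for inspection.
# grid_img = stitch_video_frames("data/videos/example.mp4")
# cv2.imwrite("example_grid.jpg", grid_img)
```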
While this approach worked for some videos, it did not capture all the details. Since limiting the frames to just 16 didn't yield appropriate results, the focus shifted to processing the videos as they are. Freeing the GPU after each video solved the batch-processing issue. The next challenge was structuring the outputs.
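The per-video GPU cleanup mentioned above can be sketched roughly as follows; here annotate_video is a hypothetical stand-in for the actual MiniCPM-V inference call in video_model.py, not the project's real function.

```python
import gc
import torch

def process_batch(video_paths, annotate_video):
    """Process videos one at a time, releasing GPU memory between runs."""
    results = {}
    for path in video_paths:
        results[path] = annotate_video(path)
        gc.collect()                      # drop lingering Python references first
        if torch.cuda.is_available():
            torch.cuda.empty_cache()      # return cached GPU memory to the driver
    return results
```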
Previously, we used a Pydantic class to process outputs with another LLM in the pipeline. With the removal of the second LLM, the functionality could still be retained, but research showed that using HTML tags was a more efficient approach. With minimal prompting, it became easier to pass content within HTML tags. For example:
Prompt: Provide the results in <annotation> tags, where 0 indicates False, 1 indicates True, and None indicates that no information is present. Follow the examples below:
<annotation>indoors: 0</annotation>
<annotation>standing: 1</annotation>
<annotation>hands.free: 0</annotation>
<annotation>screen.interaction_yes: 0</annotation>
Parsing the responses from these tags was simplified using the following Python function:
import re

def parse_string(string, tags):
    """
    Extracts the content between the specified HTML tags from the given string.

    Args:
        string (str): The input string to search for the tag content.
        tags (list): A list of HTML tags to search for.

    Returns:
        dict: A dictionary with tags as keys and lists of content as values.

    Example:
        >>> parse_string("<code>Hello, World!</code><note>Important</note>", ["code", "note"])
        {'code': ['Hello, World!'], 'note': ['Important']}
    """
    results = {}
    for tag in tags:
        # Non-greedy match of everything between <tag> and </tag>, across newlines.
        pattern = rf"<{tag}>(.*?)</{tag}>"
        matches = re.findall(pattern, string, re.DOTALL)
        results[tag] = matches if matches else None
    return results
With this update, all the required fields were successfully extracted and processed into a dataframe.
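For example, a rough sketch of turning the parsed <annotation> tags into a dataframe and a CSV file might look like this; the responses dictionary and its contents are illustrative, not the project's exact pipeline.

```python
import pandas as pd

# Illustrative raw outputs keyed by video name; in the real pipeline these
# come from the MiniCPM-V response for each processed video.
responses = {
    "video_001.mp4": "<annotation>indoors: 1</annotation><annotation>standing: 0</annotation>",
}

rows = []
for video, text in responses.items():
    parsed = parse_string(text, ["annotation"])
    row = {"video": video}
    for item in parsed["annotation"] or []:
        key, _, value = item.partition(":")
        row[key.strip()] = value.strip()
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv("annotations.csv", index=False)
```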
Evaluating the accuracy of each annotation using the best-performing model revealed that LLMs are proficient at understanding body posture (standing/sitting) and location (indoors/outdoors) annotations. However, they often fail to validate the other two annotations, screen interaction and hands-free, as depicted in the table below.
For improved multimodal modeling, you can swap in a different model by updating the Hugging Face repository name in the video_model.py file and testing the annotations, as sketched below.
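For instance, if video_model.py loads the model through Hugging Face transformers, changing the repository name might look like the sketch below; MODEL_REPO and the loading arguments are assumptions, so check the actual file for the exact code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Replace MODEL_REPO with any compatible vision-language repository on the Hub.
MODEL_REPO = "openbmb/MiniCPM-V-2_6"

model = AutoModel.from_pretrained(
    MODEL_REPO,
    trust_remote_code=True,       # MiniCPM-V ships custom modeling code
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO, trust_remote_code=True)
model = model.eval().cuda()      # assumes a CUDA-capable GPU is available
```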
First, clone the repository and navigate into the project directory:
git clone https://github.com/manishkumart/Super-Rapid-Annotator-Multimodal-Annotation-Tool.git
cd Super-Rapid-Annotator-Multimodal-Annotation-Tool
Create a virtual environment and install the required packages:
conda create -n env python=3.10 -y
conda activate env
pip install -r requirements.txt
npm i cors-anywhere
Download the necessary models using the below command:
python models/download_models.py --m1 ./models/ChatUniVi --m2 ./models/Phi3
Head over to chat_uni.py and update the model path at line 77, and do the same in struct_phi3.py at line 90.
You need three terminals for this. Run each of the following commands in three different terminals with the full path specified:
cd backend
- Start the chat_uni server, which is responsible for video annotation:
  uvicorn chat_uni:app --reload --port 8001
  This server will run on port 8001. If the port is busy, you can use another port and then update the port in script.js under src.
- Start the struct_phi3 server:
  uvicorn struct_phi3:app --reload --port 8002
  This server will run on port 8002. If the port is busy, you can use another port and then update the port in script.js under src.
- Start the Node.js server:
  node backend/server.js
  This proxy server will run on port 8080.
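To verify that both FastAPI servers are up before moving on, a quick check like the following can help; it assumes the default uvicorn/FastAPI /docs endpoint and the ports used above.

```python
import requests

# Ping the interactive docs page that FastAPI serves by default.
for name, port in [("chat_uni", 8001), ("struct_phi3", 8002)]:
    try:
        r = requests.get(f"http://localhost:{port}/docs", timeout=5)
        print(f"{name}: HTTP {r.status_code}")
    except requests.ConnectionError:
        print(f"{name}: not reachable on port {port}")
```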
Open a new terminal and run the command below:
python frontend/serve_html.py
The frontend server can be accessed at http://localhost:5500.
- Upload a Video: Click the "Upload Video" button to select and upload your video file.
- Select Annotation Options: Choose any combination of the available annotation options:
  - Standing/Sitting
  - Screen interaction or not
  - Hands free or not
  - Indoor/Outdoor
- Start the Annotation Process: Click the "Start" button. This will display the selected options and the name of the uploaded video.
- Annotate the Video: Click the "Video Annotate" button. This will use the prompt and the uploaded video to generate annotations.
- View the Prompt: Click the "Prompt" button to see the prompt used in the background based on the selected options.
- Get the Output: Click the "Output" button to receive the structured output of the annotations.
- An Experiment to Unlock Ollama’s Potential in Video Question Answering: Read here
- Vertical Scaling in Video Annotation with Large Language Models: A Journey with GSoC’24 @ Red Hen Labs: Read here
- My Journey with Red Hen Labs at GSoC ’24: Read here
- Why Google Summer Of Code?: Read here
- Automatic Video Annotation: Uses the best vision language models for rapid and accurate annotation.
- Multimodal Capabilities: Combines vision and language models to enhance understanding and entity detection.
- Concurrent Processing: Efficiently processes multiple videos at once.
- CSV Output: Annotations are compiled into a user-friendly CSV format.
At Red Hen Labs, through Google Summer of Code, I am contributing to vertical growth by developing an annotation product for the video space using large language models. This approach ensures that we build effective, domain-specific applications rather than generic models.
We cannot always use models out of the box; hence, we must structure them well to achieve the desired outputs. Following the recommendations from the mentors, my first step is to test the capabilities of Video Large Language Models by annotating the following four key entities among many others:
- Screen Interaction: Determine if the subject in the video is interacting with a screen in the background.
- Hands-Free: Check if the subject’s hands are free or if they are holding anything.
- Indoors: Identify whether the subject is indoors or outdoors.
- Standing: Observe if the subject is sitting or standing.
We are in an era where new open-source models emerge monthly, continuously improving. This progress necessitates focusing on developing great products around these models, which involve vertical scaling, such as fine-tuning models for specific domains. This approach not only optimizes the use of existing models but also accelerates the development of practical and effective solutions.
Here is a glimpse of the news dataset that we will be annotating, showcasing the real-world application of our annotation models.
All of the video frames we analyzed are sourced from news segments, each lasting approximately 4–5 seconds. To accurately capture the key entities with these models, I have extensively experimented with prompt engineering, employing multiple variations and different models. The most effective prompt, yielding outstanding results, is provided below.
For each question, analyze the given video carefully and base your answers on the observations made.
- Examine the subject’s right and left hands in the video to check if they are holding anything like a microphone, book, paper (white color), object, or any electronic device, try segmentations and decide if the hands are free or not.
- Evaluate the subject’s body posture and movement within the video. Are they standing upright with both feet planted firmly on the ground? If so, they are standing. If they seem to be seated, they are seated.
- Assess the surroundings behind the subject in the video. Do they seem to interact with any visible screens, such as laptops, TVs, or digital billboards? If yes, then they are interacting with a screen. If not, they are not interacting with a screen.
- Consider the broader environmental context shown in the video’s background. Are there signs of an open-air space, like greenery, structures, or people passing by? If so, it’s an outdoor setting. If the setting looks confined with furniture, walls, or home decorations, it’s an indoor environment.
By taking these factors into account when watching the video, please answer the questions accurately.
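In the app, the final prompt is assembled from the selected annotation options. A hypothetical sketch of that assembly is shown below; the GUIDELINES texts are abbreviated and build_prompt is not the project's actual function.

```python
# Abbreviated guideline text per annotation option (see the full prompt above).
GUIDELINES = {
    "hands.free": "Examine the subject's right and left hands ... decide if the hands are free or not.",
    "standing": "Evaluate the subject's body posture and movement ... standing or seated.",
    "screen.interaction_yes": "Assess the surroundings behind the subject ... interacting with a screen or not.",
    "indoors": "Consider the broader environmental context ... indoor or outdoor setting.",
}

def build_prompt(selected):
    """Combine the guidelines for the selected options with the output-format instruction."""
    lines = ["For each question, analyze the given video carefully and base your answers on the observations made."]
    lines += [f"- {GUIDELINES[key]}" for key in selected]
    lines.append(
        "Provide the results in <annotation> tags, where 0 indicates False, "
        "1 indicates True, and None indicates that no information is present."
    )
    return "\n".join(lines)

print(build_prompt(["indoors", "standing"]))
```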
- Special thanks to Raúl Sánchez Sánchez for his continuous support and guidance throughout this project.
- OpenBMB and team: OpenBMB
This project is licensed under the MIT License.
Contributions are welcome! Please feel free to submit a pull request.
For any questions, please reach out to me on LinkedIn.