
how to use the inference_demo.py #12

Open
aoji0606 opened this issue Nov 28, 2024 · 17 comments

@aoji0606

Thanks for your work!

But how do I use inference_demo.py?

@XuGW-Kevin
Collaborator

> Thanks for your work!
>
> But how do I use inference_demo.py?

This demo can be used with VLMEvalKit. I'll upload a Gradio App demo later.

@ramkumarkoppu

The HF demo app is not available; I get this:
[screenshot]

@XuGW-Kevin
Collaborator

> The HF demo app is not available; I get this: [screenshot]

That's very strange, because I just tried it and the Gradio app works.
Can you run other Gradio apps normally?

@ramkumarkoppu

ramkumarkoppu commented Dec 2, 2024

This time the HF demo is available, but when I run the example image

[example image]

with the same prompt as at https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot ("Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6"), I get a different answer:

(Here begins the SUMMARY stage) To solve the problem, I will identify and count the objects in the image, excluding the tiny shiny balls and purple objects, and then determine how many objects remain. (Here ends the SUMMARY stage)
(Here begins the CAPTION stage) The image shows a collection of variously shaped objects on a flat surface. These include spheres, cubes, and cylinders in different colors, including blue, green, yellow, red, and purple. (Here ends the CAPTION stage)
(Here begins the REASONING stage) First, I will identify all the objects in the image: there are spheres, cubes, and cylinders. Next, I will exclude the tiny shiny balls, which are the spheres, and the purple objects. The purple objects are a cylinder and a sphere. After removing these, I will count the remaining objects. The remaining objects are a blue cylinder, a green sphere, a yellow cylinder, a red cube, and a yellow sphere. This totals to five objects. (Here ends the REASONING stage)
(Here begins the CONCLUSION stage) 5 (Here ends the CONCLUSION stage)
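
Since the stage markers are literal strings, the response can be split apart programmatically. A minimal sketch, assuming `output` holds the full response text shown above:

```python
# Split a LLaVA-CoT response into its four stages using the literal
# "(Here begins/ends the ... stage)" markers shown above.
import re

def split_stages(output: str) -> dict:
    stages = {}
    for name in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        m = re.search(
            rf"\(Here begins the {name} stage\)(.*?)\(Here ends the {name} stage\)",
            output,
            re.DOTALL,
        )
        if m:
            stages[name] = m.group(1).strip()
    return stages

# e.g. split_stages(output)["CONCLUSION"] -> "5" for the response above
```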

@XuGW-Kevin
Collaborator

> This time the HF demo is available, but when I run the example image with the same prompt, I get a different answer: [...]

Exactly. This is expected, because the model in this Gradio app does not use inference-time scaling.
The examples were generated using BS=2 with inference-time scaling, which is 4~5 times slower than the current model.
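
The real implementation ships in inference_demo.py; purely as a rough illustration of the idea behind stage-level scaling with BS=2, where `sample_stage` and `pick_better` are hypothetical stand-ins rather than repo APIs:

```python
# Rough illustration only -- see the repo's inference_demo/inference_demo.py
# for the actual algorithm. The idea: sample two candidates for each stage,
# keep the preferred one, then continue to the next stage.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def sample_stage(context: str, stage: str) -> str:
    """Hypothetical: sample one '(Here begins the ... stage) ...' block."""
    raise NotImplementedError  # plug in a sampling call to the model

def pick_better(context: str, a: str, b: str) -> str:
    """Hypothetical: ask the model which candidate continuation is better."""
    raise NotImplementedError  # plug in a judging call to the model

def scaled_inference(prompt: str) -> str:
    context = prompt
    for stage in STAGES:
        candidates = [sample_stage(context, stage) for _ in range(2)]  # BS=2
        context += pick_better(context, *candidates)
    return context  # roughly 4~5x the cost of a single pass, as noted above
```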

@XuGW-Kevin
Collaborator

But I'll definitely upload the model with inference-time scaling later, hopefully in 3~4 days. Thanks for your interest!

@ramkumarkoppu

I am surprised to see the wrong answer (5) from the model. Will the model with inference-time scaling be available to download from HF?

@XuGW-Kevin
Collaborator

XuGW-Kevin commented Dec 2, 2024

> I am surprised to see the wrong answer (5) from the model. Will the model with inference-time scaling be available to download from HF?

The model with inference-time scaling is identical to the current model. Actually, if you set a non-zero temperature (temperature=0.6, top_p=0.9), you may observe that the model answers the question correctly with some probability; inference-time scaling just improves that probability.

You may also try the base Llama-3.2V model. That model can hardly generate a correct answer even with multiple tries.
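
A minimal sketch of sampling with those settings via Hugging Face transformers, assuming this checkpoint loads the same way as Llama-3.2-11B-Vision-Instruct (the image path is a hypothetical local file):

```python
# Sample from Xkev/Llama-3.2V-11B-cot with temperature=0.6, top_p=0.9.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("clevr_example.png")  # hypothetical path to the demo image
question = (
    "Subtract all tiny shiny balls. Subtract all purple objects. "
    "How many objects are left? Options: A. 4, B. 8, C. 2, D. 6"
)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": question},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Non-zero temperature: each run samples a different reasoning chain, so the
# final answer varies from run to run.
output = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9
)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```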

@ramkumarkoppu

I don't see an option to set temperature and top_p in the HF demo to try this.

@XuGW-Kevin
Collaborator

XuGW-Kevin commented Dec 2, 2024

> I don't see an option to set temperature and top_p in the HF demo to try this.

Hi, thank you for raising the issue! Now the Gradio App is set to {temperature=0.6, top_p=0.9}.

I've reviewed the results we tested earlier. The two demos in our paper are both picked from the MMStar benchmark, and we used VLMEvalKit to generate and test the results.

Today, I replicated the results without using inference-time scaling.
For the second question (science), the model has about a 30-40% chance of answering correctly.
For the first question, the model rarely gives the correct answer of 10-2=8 (I tried dozens of times before getting the correct reasoning once). Instead, it often provides incorrect responses such as 8-2=6 or 10-3=7. However, it is worth mentioning that the model at least gets one part (either the 10 or the 2) correct in these cases.

I understand that the probability of correctly answering the two demos is not very high. This is mainly because I deliberately selected the most difficult questions from the MMStar benchmark among those the model answered correctly in a single generation. As a result, the model won't get them right every time. However, it's worth mentioning that our model's performance on these two questions is as good as GPT-4o, and in fact even GPT-4o can hardly solve them.

If you attempt to compare the performance of Llama-3.2-11B-Vision-Instruct, you will find that it has almost no chance of providing the correct logic for these questions.
For the first question, I tested Llama-3.2-11B-Vision-Instruct on poe.com 10 times, and it never got any intermediate steps (10 or 2) correct.
For the second question, I tested Llama-3.2-11B-Vision-Instruct on poe.com and had to try 15 times before it answered correctly once.
This highlights that the progress made by LLaVA-CoT compared to Llama-3.2-11B-Vision-Instruct is still significant.

I apologize for any confusion this may have caused and hope this addresses your concerns. Feel free to reach out with any follow-up questions!
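
To check figures like the 30-40% above yourself, repeated sampling is enough. A rough sketch, where `ask_model` is a hypothetical stand-in for a sampling call like the transformers sketch earlier:

```python
# Estimate a per-question success rate by sampling N independent responses
# (a non-zero temperature makes every run different).
def ask_model(image_path: str, prompt: str) -> str:
    raise NotImplementedError  # plug in your generation call

def estimate_success_rate(image_path, prompt, is_correct, n_trials=20) -> float:
    hits = sum(is_correct(ask_model(image_path, prompt)) for _ in range(n_trials))
    return hits / n_trials

# is_correct could, e.g., run split_stages() from the earlier sketch and check
# whether the CONCLUSION text contains the expected option letter.
```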

@ramkumarkoppu

I tried the same question again with the updated demo app, and it still gives me the wrong answer: C.

@XuGW-Kevin
Collaborator

> I tried the same question again with the updated demo app, and it still gives me the wrong answer: C.

Could you try a few more times? I think most of the time the model will give answer D, because either 10-3 or 8-2 will lead to answer D.

@ramkumarkoppu

This time it answered with the object count rather than choosing one of the multiple-choice options:
This totals to six objects. (Here ends the REASONING stage)

(Here begins the CONCLUSION stage) 6 (Here ends the CONCLUSION stage)

@XuGW-Kevin
Collaborator

Yes, this is also possible, and our base model has this issue as well.
I just found out that in the MMStar benchmark every problem starts with "Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end." Adding this prompt will alleviate the issue (but not eliminate it).
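
Concretely, prepending that hint to the earlier question would look like this (plain string handling, nothing model-specific):

```python
# Prepend the MMStar-style hint so the model is nudged to answer with an
# option letter instead of a bare object count.
HINT = ("Hint: Please answer the question and provide the correct option "
        "letter, e.g., A, B, C, D, at the end.")
question = ("Subtract all tiny shiny balls. Subtract all purple objects. "
            "How many objects are left? Options: A. 4, B. 8, C. 2, D. 6")
prompt = f"{HINT}\n{question}"
```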

I'll provide you with some more examples generated by LLaVA-CoT:

Question 1:
Correct answers (8):
[two screenshots of correct reasoning traces]

Partly correct answers that get 10:
[two screenshots]

Partly correct answers that get 2:
[one screenshot]

Question 2:
[two screenshots]

@XuGW-Kevin
Collaborator

> This time the HF demo is available, but when I run the example image with the same prompt, I get a different answer: [...]

I also found another model developed by the community (they used our dataset to train it):
https://huggingface.co/BarraHome/Mistroll-3.0-CoT-Llama-3.2-11B-Vision-Instruct

They also show a similar demo there:
[screenshot]

@ramkumarkoppu

Hi Guowei,

How can I reproduce the original results, like the exact answer B. 8? Any tuning parameters?

@XuGW-Kevin
Collaborator

XuGW-Kevin commented Dec 6, 2024

> Hi Guowei,
>
> How can I reproduce the original results, like the exact answer B. 8? Any tuning parameters?

Hi Ramkumar,

It's not possible to reproduce the exact original answers, as multiple candidate answers are generated randomly at different stages, making exact replication infeasible. However, the statistical results are fully reproducible. This specific example comes from CLEVR-MATH, a task in the MMStar benchmark, and you can reproduce the statistical results on MMStar using VLMEvalKit.

MMStar includes many similar types of questions. You can explore the ones LLaVA-CoT successfully answers. Due to randomness, the specific questions LLaVA-CoT answers correctly may vary between runs, but the statistical results remain nearly identical.

You can find the guide for reproducing results without inference-time scaling at https://huggingface.co/Xkev/Llama-3.2V-11B-cot. For inference-time scaling, you only need to replace the original Llama-3.2V inference with the script we provide at https://github.com/PKU-YuanGroup/LLaVA-CoT/blob/main/inference_demo/inference_demo.py.
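
For reference, VLMEvalKit is driven from the command line; a sketch of kicking off the MMStar run from Python, where the model name is an assumption to check against the identifiers registered in VLMEvalKit's config:

```python
# Sketch: reproduce the MMStar statistics with VLMEvalKit, which is normally
# invoked as `python run.py --data <benchmark> --model <name>`.
# "Llama-3.2V-11B-cot" is an assumed model name -- verify it in VLMEvalKit.
import subprocess

subprocess.run(
    ["python", "run.py", "--data", "MMStar", "--model", "Llama-3.2V-11B-cot"],
    check=True,  # raise if the evaluation run fails
)
```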

Let me know if you have further questions!
