From 41adf356511b544fb8887e4ba9dba070908755fb Mon Sep 17 00:00:00 2001 From: Boyuan Zheng <58822425+boyuanzheng010@users.noreply.github.com> Date: Wed, 27 Dec 2023 13:03:40 -0500 Subject: [PATCH] Add files via upload --- index.html | 2237 ++++++++++++++++++++++++++-------------------------- 1 file changed, 1115 insertions(+), 1122 deletions(-) diff --git a/index.html b/index.html index 65314c3..9e7cae4 100644 --- a/index.html +++ b/index.html @@ -14,7 +14,7 @@ content="GPT-4V(ision) is a Generalist Web Agent, if Grounded"> - SeeAct: GPT-4V(ision) is a Generalist Web Agent, if Grounded + GPT-4V(ision) is a Generalist Web Agent, if Grounded @@ -117,14 +117,6 @@



SEEACT is a generalist web agent based on GPT-4V. Given a web-based task (e.g., “Rent a truck with the lowest rate” on a car rental website), we examine two essential capabilities of GPT-4V as a generalist web agent: (i) Action Generation, producing a description of the action to take at each step (e.g., “Move the cursor over the ‘Find Your Truck’ button and perform a click”) toward completing the task, and (ii) Element Grounding, identifying the corresponding HTML element (e.g., “[button] Find Your Truck”) on the current webpage.
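The two-stage loop described above can be sketched in Python. This is an illustrative outline only, assuming a generic `ask_gpt4v(images, prompt)` callable; the function names and prompts are hypothetical, not the actual SeeAct implementation or API.

```python
def seeact_step(task, screenshot, html_elements, ask_gpt4v):
    """One step of a hypothetical SeeAct-style loop: generate, then ground."""
    # Stage 1: Action Generation — describe the next action in natural language.
    action_description = ask_gpt4v(
        images=[screenshot],
        prompt=f"Task: {task}\nDescribe the next action to take on this page.",
    )
    # Stage 2: Element Grounding — map the description onto one of the
    # candidate HTML elements extracted from the current page.
    choices = "\n".join(f"{i}. {el}" for i, el in enumerate(html_elements))
    grounded = ask_gpt4v(
        images=[screenshot],
        prompt=(
            f"Action: {action_description}\n"
            f"Candidate elements:\n{choices}\n"
            "Return the index of the element to act on."
        ),
    )
    return action_description, int(grounded)
```

In practice the grounding stage is where most errors occur: the model may describe the right action in stage 1 yet select the wrong element in stage 2.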


@@ -257,7 +250,7 @@

@@ -402,1129 +395,1129 @@

Visualization


Experiment Results


Leaderboard


- We evaluate various models, including LLMs and LMMs. In each type, we consider both closed- and open-source models. Our evaluation is conducted in a zero-shot setting to assess the models' ability to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark. For all models, we use the default prompt provided by each model for multiple-choice or open QA, if available. If a model does not provide prompts for the task types in MMMU, we conduct prompt engineering on the validation set and use the most effective prompt in the subsequent zero-shot experiments.
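The prompt-selection step described above amounts to scoring each candidate prompt on the validation split and keeping the best one. A minimal sketch, assuming a hypothetical `evaluate(prompt, example)` callable that returns whether the model answers an example correctly:

```python
def pick_prompt(candidate_prompts, val_set, evaluate):
    """Return the candidate prompt with the highest validation accuracy.

    `evaluate(prompt, example)` is assumed to return True iff the model,
    prompted with `prompt`, answers `example` correctly.
    """
    def accuracy(prompt):
        return sum(evaluate(prompt, ex) for ex in val_set) / len(val_set)

    return max(candidate_prompts, key=accuracy)
```

Because selection happens only on the validation set, the test-set evaluation itself remains zero-shot.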

- Model                      | Overall | Art & Design | Business | Science | Health & Medicine | Human. & Social Sci. | Tech & Eng.
- GPT-4V(ision) (Playground) | 55.7    | 65.3         | 64.3     | 48.4    | 63.5              | 76.3                 | 41.7
- Qwen-VL-PLUS*              | 40.8    | 59.9         | 34.5     | 32.8    | 43.7              | 65.5                 | 32.9
- BLIP-2 FLAN-T5-XXL         | 34.0    | 49.2         | 28.6     | 27.3    | 33.7              | 51.5                 | 30.4
- InstructBLIP-T5-XXL        | 33.8    | 48.5         | 30.6     | 27.6    | 33.6              | 49.8                 | 29.4
- LLaVA-1.5-13B              | 33.6    | 49.8         | 28.2     | 25.9    | 34.9              | 54.7                 | 28.3
- Qwen-VL-7B                 | 32.9    | 47.7         | 29.8     | 25.6    | 33.6              | 45.3                 | 30.2
- mPLUG-OWL2*                | 32.1    | 48.5         | 25.6     | 24.9    | 32.8              | 46.7                 | 29.6
- BLIP-2 FLAN-T5-XL          | 31.0    | 43.0         | 25.6     | 25.1    | 31.8              | 48.0                 | 27.8
- InstructBLIP-T5-XL         | 30.6    | 43.3         | 25.2     | 25.2    | 29.3              | 45.8                 | 28.6
- CogVLM                     | 30.1    | 38.0         | 25.6     | 25.1    | 31.2              | 41.5                 | 28.9
- Otter                      | 29.1    | 37.4         | 24.0     | 24.1    | 29.6              | 35.9                 | 30.2
- LLaMA-Adapter2-7B          | 27.7    | 35.2         | 25.4     | 25.6    | 30.0              | 29.1                 | 25.7
- MiniGPT4-Vicuna-13B        | 27.6    | 30.2         | 27.0     | 26.2    | 26.9              | 30.9                 | 27.2
- Fuyu-8B                    | 27.4    | 29.9         | 27.0     | 25.6    | 27.0              | 32.5                 | 26.4
- Kosmos2                    | 26.6    | 28.8         | 23.7     | 26.6    | 27.2              | 26.3                 | 26.8
- OpenFlamingo2-9B           | 26.3    | 31.7         | 23.5     | 26.3    | 26.3              | 27.9                 | 25.1
- Frequent Choice            | 25.8    | 26.7         | 28.4     | 24.0    | 24.4              | 25.2                 | 26.5
- Random Choice              | 23.9    | 24.1         | 24.9     | 21.6    | 25.3              | 22.8                 | 24.8

Overall results of different models on the MMMU test set. The best-performing model in each category is in-bold, and the second best is underlined. *: results provided by the authors.


Different Image Types


- We compare the performance of various models across the most frequent image types. Across all types, GPT-4V consistently outperforms the other models by a huge margin. Open-source models demonstrate relatively strong performance on categories such as Photos and Paintings, which are seen more frequently during training. However, on less common image categories such as Geometric Shapes, Music Sheets, and Chemical Structures, all models obtain very low scores (some close to random guessing). This indicates that existing models generalize poorly to these image types.

- Models compared: Fuyu-8B, Qwen-VL-7B, LLaVA-1.5-13B, InstructBLIP-T5-XXL, BLIP-2 FLAN-T5-XXL, GPT-4V.
- [Per-type bar charts:] Diagrams (3184); Tables (2267); Plots and Charts (840); Chemical Structures (573); Photographs (770); Paintings (453); Geometric Shapes (336); Sheet Music (335); Medical Images (272); Pathological Images (253); Microscopic Images (226); MRI, CT scans, and X-rays (198); Sketches and Drafts (184); Maps (170); Technical Blueprints (162); Trees and Graphs (146); Mathematical Notations (133); Comics and Cartoons (131); Sculpture (117); Portraits (91); Screenshots (70); Other (60); Posters (57); Icons and Symbols (42); Historical Timelines (30); 3D Renderings (21); DNA Sequences (20); Landscapes (16); Logos and Branding (14); Advertisements (10)

Selected models' performance on 30 different image types. Note that a single image may have multiple image types.
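Since a single image may carry several type labels, a per-type score counts each image once under every type it belongs to. A minimal sketch of that tally, assuming hypothetical `image_types` and `is_correct` fields on each example record:

```python
from collections import defaultdict

def per_type_accuracy(examples):
    """Accuracy per image type; multi-type images contribute to every type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        for t in ex["image_types"]:
            total[t] += 1
            correct[t] += ex["is_correct"]  # bool counts as 0/1
    return {t: correct[t] / total[t] for t in total}
```

One consequence of this convention is that the per-type counts sum to more than the number of images, so per-type scores cannot simply be averaged back into the overall score.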


Different Difficulty Levels


- We compare the performance of selected models across three difficulty levels. GPT-4V demonstrates significantly higher proficiency, with a success rate of 76.1% in the “Easy” category, compared to open-source models. In the “Medium” category the gap narrows, but GPT-4V still leads at 55.6%. The further diminishing performance gap in the “Hard” category indicates that as task complexity increases, the advantage of more advanced models like GPT-4V almost disappears. This may reflect a current limitation in handling expert-level challenging queries, even for the most advanced models.


Result decomposition across question difficulty levels.


Error Analysis


- We delve into the analysis of errors made by GPT-4V, a pivotal aspect for understanding its operational capabilities and limitations. This analysis serves not only to identify the model's current shortcomings but also to guide future enhancements in its design and training. We examine 150 randomly sampled error instances from GPT-4V's predictions. These instances are analyzed by expert annotators, who identify the root causes of each misprediction based on their knowledge and the gold explanations, if available. The distribution of these errors is illustrated in the figure below, and a selection of 100 notable cases, along with detailed analyses, is included in the Appendix.
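The sampling step described above can be sketched as follows. This is an illustrative outline, assuming hypothetical `pred`/`gold` fields on each prediction record; it is not the actual annotation pipeline.

```python
import random

def sample_errors(predictions, k=150, seed=0):
    """Collect mispredictions and draw a fixed-size random sample for annotation."""
    errors = [p for p in predictions if p["pred"] != p["gold"]]
    rng = random.Random(seed)  # fixed seed so the annotated sample is reproducible
    return rng.sample(errors, min(k, len(errors)))
```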


Error distribution over 150 annotated GPT-4V errors.


Error Examples


Correct Examples

@@ -1534,10 +1527,10 @@

Correct Examples

BibTeX


-      @article{yue2023mmmu,
-        title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
-        author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
-        journal={arXiv preprint arXiv:2311.16502},
+      @article{zheng2023seeact,
+        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
+        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
+        journal={arXiv preprint arXiv:xxxx.xxxx},
         year={2023},
       }