excerpt: 'My experience building an AI based browsing agent. I go through the various insights I came across.'
date: '2024-03-24T00:29:32.431Z'
For a hackathon at work, I finally decided to build something with AI agents. In the previous hackathon, I had started experimenting with embeddings and built a simple wiki search tool. This time, I really wanted to start experimenting with LLM agents in a practical application.
A team member had given a presentation at work describing the recent rapid advancements in AI image reasoning. The talk was very interesting, and I found myself going through the various sources and articles he had linked. I forked this repo and started to experiment.
Basics
The basic structure for the existing project was as follows:
```mermaid
sequenceDiagram
    User->>+Orchestrator: Sends objective
    Orchestrator->>+Browser Layer: Initializes browser
    loop Loops until objective is achieved
        Browser Layer->>-Orchestrator: Sends browser screenshot
        Orchestrator->>+AI Layer: Sends objective with screenshot
        AI Layer->>+Vision Layer: Builds prompt
        Vision Layer->>+OpenAI: Sends prompt and screenshot
        OpenAI->>-Vision Layer: Returns action to be performed
        Vision Layer->>-Orchestrator: Returns action to be performed
        Orchestrator->>-Browser Layer: Execute action
    end
```
Essentially, the Vision layer would send a screenshot to the AI agent and ask it to return the instructions for the next action. Actions included things like clicking, typing, or navigating. Alternatively, if the AI determined that the objective was complete, it could return a signal that would terminate the program. The browser layer was a wrapper on top of Playwright that spawns a new browser which the program can control.
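To make the flow concrete, here is a minimal sketch of that loop in Python, assuming the OpenAI SDK and Playwright's sync API. The model name, prompt wording, and click handling are illustrative, not the actual project code; the real project routes clicks through Vimium hint IDs (more on that below).

```python
# Minimal sketch of the screenshot -> decide -> act loop (illustrative only).
import base64
import json

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def run(objective: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://todoist.com/app")

        while True:
            # 1. Capture the current state of the page as an image.
            screenshot = base64.b64encode(page.screenshot()).decode()

            # 2. Ask a vision-capable model for the next action as JSON.
            response = client.chat.completions.create(
                model="gpt-4o",  # illustrative; any vision-capable model
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text",
                         "text": f"Objective: {objective}. "
                                 "Respond in JSON with one key: navigate, type, click, or done."},
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/png;base64,{screenshot}"}},
                    ],
                }],
            )
            action = json.loads(response.choices[0].message.content)

            # 3. Execute the action, or stop if the model says it's done.
            if "done" in action:
                break
            if "navigate" in action:
                page.goto(action["navigate"])
            elif "click" in action:
                page.click(f"text={action['click']}")  # the real project clicks Vimium hints
            elif "type" in action:
                page.keyboard.type(action["type"])

        browser.close()
```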
Vimium Chrome extension
The forked project made use of the Vimium Chrome extension as a clever way to highlight interactable elements on the screen. The prompt asks the agent to return the ID of the element it should use in the next action. This keeps the agent focused on actions it can actually perform.
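For example, if the screenshot shows Vimium's yellow hint "AF" over the button the agent needs, the expected reply is just that hint rather than a selector (the values here are hypothetical):

```python
# Hypothetical exchange: Vimium overlays short letter hints on interactable
# elements, and the model answers with the hint of the element to act on.
action = {"click": "AF"}  # "AF" is the yellow hint id, not a CSS selector
```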
Of course, this is not fool-proof. Websites often do weird things: certain divs that are actually interactable don't have accessible roles, or are only interactable through JavaScript. This reduces the effectiveness of the Chrome extension in those situations. More on that later.
I also modified the extension for my use case so it would recognize a few more types of elements and would not rely on a keyboard shortcut to trigger hint mode. My fork is here if you're interested.
Observations and ideas for improvement
The basic operation of this project was surprisingly simple. The AI model was remarkably good at making simple decisions given an objective and a screenshot of a webpage. However, for many different reasons, it struggled, and it struggled a lot. Below, I've documented a number of these instances and some strategies I used to get past them (or not).
Sticking to a single app
I decided to stick with a single app for my experiments. This allowed me to have better control over the environment and avoid too much variability. It's true that the tool should eventually handle unpredictability with ease, but I wanted it to graduate to those use cases over time. Ideally, I should have built my own "realistic" website, which would have given me even more control over the experiments, but I decided to skip that step for now and go with something that already existed.
The website I chose was Todoist. The main reasons were:
- Generally, when you're experimenting with any new tech, after a basic Hello World app you try to build a simple to-do app.
- It was a realistic use case that could inspire automation: users could ask the agent to go add a todo item, or retrieve their existing ones.
- Its interface is clean and accessible. Many blogs and WordPress sites are littered with ads and popups (which introduce more unpredictability); Todoist's interface is simple and without much noise.
Some sample objectives I played around with were:
- Add a task called Do Laundry to my task list
- Retrieve all my tasks from the task list
An example of what a typical session looked like: the program takes a screenshot and sends it to the agent, which decides on the next action, until the objective is complete.
Unending sessions
Initially, the prompt I had forked looked something along the lines of this:
You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block.
Notice the "When the page seems satisfactory, return done as a key with no value". The problem with keeping the prompt fairly open and the completion condition be along the lines of "when you think the objective is complete" is that sometimes it's not clear to the model that the objective was successfully completed. In my experiments, sometimes the model would continue to try to add tasks even though it was successfully able to do it. It wasn't clear what the meaning of "satisfactory" really is.
To get past this, I introduced a completion_condition parameter (alongside the objective parameter). This seemed to work fairly well. With an explicit completion condition, the model seemed to have a clear understanding of what "done" meant, and the unending-session problem seemed to be solved.
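In practice that looked roughly like the following; the parameter names mirror the ones above, but the wording is a sketch rather than the exact production prompt:

```python
# Sketch of folding an explicit completion condition into the prompt.
def build_prompt(objective: str, completion_condition: str) -> str:
    return (
        f"You need to choose which action to take to help a user do this task: {objective}. "
        f"The task is only complete when the following is true: {completion_condition}. "
        "When that condition is met, return done as a key with no value. "
        "Otherwise respond with one of navigate, type, or click, as described above."
    )

prompt = build_prompt(
    objective="Add a task called Do Laundry to my task list",
    completion_condition="The task 'Do Laundry' appears in the task list",
)
```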
Authentication
Ah, authentication. I decided to simply punt on this problem. Yes, authentication will need to be solved eventually, but it's boring (though important), so I decided to tackle it later, at least until I was more confident in the core tech.
For this reason, I decided to sign in to the Chromium browser and use a persistent browser context as opposed to a non-persistent one.
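With Playwright, that amounts to launching a persistent context pointed at a profile directory that is already signed in; the profile path below is illustrative:

```python
# Reuse a Chromium profile that is already signed in to Todoist so the agent
# never has to deal with the login flow.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        user_data_dir="./user-data",  # illustrative path to the signed-in profile
        headless=False,
    )
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://todoist.com/app")
```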
Desktop vs Mobile
In the same presentation mentioned above, one of the slides referenced this paper, titled "Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception", which showed a really impressive success rate on mobile surfaces.
Here, CR refers to completion rate, which is phenomenal. The idea is that the lack of thumb real estate forces websites to be thoughtful about what they put on the screen. Additionally, this tweet from Greg Kamradt reinforces the notion that the model's performance degrades as more text is present in the image.
To mimic that success, I decided to reduce the size of the browser so that the app would switch into its mobile-responsive view. The results were rather surprising.
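Forcing the mobile layout was just a matter of launching the context with a phone-sized viewport; the dimensions below are illustrative:

```python
# Same persistent context as before, but with a phone-sized viewport so the
# app serves its mobile-responsive layout.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        user_data_dir="./user-data",
        headless=False,
        viewport={"width": 390, "height": 844},  # roughly an iPhone-sized screen
    )
    page = context.pages[0] if context.pages else context.new_page()
```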
For my use case (add a task to the task list), the model often got confused:
- Desktop: the desktop version has explicit text on its buttons that helps the model understand what each button does.
- Mobile: the mobile version relies more on icons. As humans, we have learned what these icons mean (send, accept, confirm), but the model was not so sure.
The agent would often get confused about which button to press once it had filled out the task name. It would often click the cancel button and then keep trying, since the completion_condition wasn't met.
Introducing hints
I considered the accessibility of the website in situations like this. Blind users can often still use these websites even though they can't see exactly what icon or color a button has; they do this through HTML roles and ARIA attributes. I decided to use the same principle when building the prompt for the agent. In addition to the screenshot, I would also give it some helpful hints about what elements were present on the screen. I could have sent it the DOM nodes themselves, but to save on tokens and reduce noise, I wrote a custom function that gives it some useful hints for all interactable components. Here is a rough idea of what that looks like:
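The selectors, helper name, and sample output below are an illustrative sketch of the approach, not the exact code:

```python
# Collect interactable elements and distill their accessibility attributes
# into short text hints that get appended to the prompt (illustrative sketch).
def build_hints(page) -> str:
    hints = []
    for el in page.query_selector_all("a, button, input, textarea, [role], [aria-label]"):
        if not el.is_visible():
            continue
        role = el.get_attribute("role") or el.evaluate("e => e.tagName.toLowerCase()")
        label = el.get_attribute("aria-label") or (el.inner_text() or "").strip()
        if label:
            hints.append(f"- {role}: {label}")
    return "\n".join(hints)

# Hypothetical output for the mobile add-task view:
# - button: Add task
# - textarea: Task name
# - button: Cancel
```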
This dramatically improved its accuracy. The model knows what a button likely means from its attributes, even when that meaning isn't visually obvious.
Unpredictability of websites
Websites are unpredictable. They are built with humans in mind and are often optimized for visual appeal over accessibility. Because of this, developers often ignore best practices, for example by rebuilding native UI controls from scratch (I saw a select box rebuilt with plain divs).
All of this results in websites that are not friendly to automation. The browser agent I built uses Playwright, which is a testing tool. Playwright encourages developers to build their sites correctly, otherwise testing becomes painful. But when the tables are turned and Playwright is used to drive arbitrary, unfriendly websites rather than to test your own, its APIs can be too cumbersome to handle them generically.
Non-determinism of the model
The AI models are also not deterministic. You might get the same, expected result 10 times in a row, and then the 11th time it fails. Here is a super interesting article that talks about how non-determinism in GPT-4 can be caused by Sparse MoE:
Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens often compete against each other for available spots in expert buffers. As a consequence, the model is no longer deterministic at the sequence-level, but only at the batch-level, as some input sequences may affect the final prediction for other inputs
This was super unexpected to me. It means that depending on your luck and the traffic a particular GPU is experiencing, you may or may not get the result you expect. The linked article is a great read and sheds light on a lot of the non-determinism issues I ran into while building this. Naturally, as more investment flows into this technology, this is sure to improve, but for now it cannot be relied upon to be consistent.
Conclusion
These are my discoveries so far as I build this out and experiment more with the technology. I'll probably do another post with more details in the future.
For anyone interested in the code, check out the repo here. I'd love to hear more thoughts on what works for people and what doesn't, and any ideas for improving this tech.