Proposal: AI Assistant API #541
What you are describing is already possible using the current extension system, since extensions can run arbitrary code on pages via content scripts, can inspect DOM, can synthesize events or trigger arbitrary functions, and even capture video feed of pages to perform computer vision analysis on videos. If you have a specific request about a particular use case not served by the current APIs, please share it. |
Some of them, but not in a way that's conducive to an AI assistant.
The point is not to parse DOM, but to see and be able to act like a human would.
Most websites ignore synthesized events by checking isTrusted. In addition, synthesized events aren't equivalent to a virtual cursor, touch, and keyboard. Anything a person can do, an AI should be able to do through this proposed API. In theory, you should even be able to play two-player games with the AI assistant.
I haven't used the DesktopCapture API much, but the TabCapture API is heavily restricted: it requires invoking the extension for every tab. If DesktopCapture is similar, it's not conducive to a seamless AI assistant experience. The AI should be able to navigate between tabs, and pretty much have the flexibility to do anything the user can. |
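The isTrusted point is easy to demonstrate: any event constructed from script reports isTrusted === false, so a page can reject synthetic input with a one-line check. A minimal sketch using the standard EventTarget/Event interfaces (which also exist in Node):

```javascript
// Any script-constructed event has isTrusted === false; only events
// generated by the user agent in response to real input are trusted.
const target = new EventTarget();

let observedTrusted = null;
target.addEventListener("click", (event) => {
  observedTrusted = event.isTrusted;
});

// Synthesize a click the way a content script might.
target.dispatchEvent(new Event("click"));

console.log(observedTrusted); // false — the page can detect and ignore it
```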
Could you please be more specific? What exact issue are you facing?
The "act like a human would" part is very abstract and sounds like you are proposing some kind of AGI. In practice, most AI is much simpler and actually has very well-defined input and output. The DOM tree is a great input for almost any AI.
Extensions can trivially programmatically forge the isTrusted attribute on the event. If you show me a game, I can trivially construct an API/adapter for the AI to use.
Extensions already can do this. If you have a specific question, I might be able to answer it. |
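To make the "DOM tree is a great input" point concrete, here is a toy sketch that flattens a DOM-like node tree into a text outline an extension could feed to a model. All names here are invented for illustration, and plain objects stand in for real DOM nodes so the example is self-contained:

```javascript
// Toy serializer: turn a DOM-like tree into an indented text outline.
// In a real content script you would walk document.body instead of
// this hand-built object tree.
function outline(node, depth = 0) {
  const label = node.text ? `${node.tag}: ${node.text}` : node.tag;
  const line = "  ".repeat(depth) + label;
  const children = (node.children || []).map((c) => outline(c, depth + 1));
  return [line, ...children].join("\n");
}

const page = {
  tag: "main",
  children: [
    { tag: "h1", text: "Checkout" },
    { tag: "button", text: "Pay now" },
  ],
};

console.log(outline(page));
// main
//   h1: Checkout
//   button: Pay now
```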
Yes, technically you might be able to code around this, and code around that. But, I'm proposing a flexible API where you wouldn't need to. The AI will figure it out, the browsers just need to provide the tools. |
Could you be more specific about the API you are proposing? |
|
@polywock all of your suggestions are neat! None of them are actual proposals, though. They are rough use cases for APIs that don't exist, and that no one has researched or developed. They are all built on a vague concept of "AI", when none of the vendors in this group ship a browser with any sort of AI APIs today. Your suggestions, while they are neat ideas, are missing the entire fundamental building block of how they would even work. What AI? How are the models provided?
To illustrate my point: I would find it cool if there were a simple API I could use to interact with my home automation. But that isn't a proposal. A proposal would be something like "create browser.mqtt to expose simplified IoT device pairing", in which I would go over the suggested structure of the API, the interfaces, methods, and events. |
Well, there is the Web Neural Network API, which has decent polyfills and is fairly close to shipping... but it's a bit off-topic. It is a generic web API to execute (and train) ML models on-device via platform-specific native ML APIs. Should be really cool once it actually ships. The larger point that "None of them are actual proposals" is spot-on. |
I'm confused by this response. A proposal is a suggestion; what makes the first one not a proposal?
The more powerful or proprietary AI assistants will capture the browser's screen and send it to a server over WebRTC, sending mouse and keyboard instructions back the same way. Some weaker models, like 7B LLMs, will be shipped to the client and run locally. |
A proposal is a suggestion, what makes the first one not a suggestion?
No, not in standards parlance. A proposal would look like this -
https://github.com/w3c/webextensions/blob/main/proposals/secure-storage.md
|
@bershanskiy This is not true; could you show some examples? |
Input events sent via the debugger API are trusted
|
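For reference, the debugger-API route mentioned above can be sketched as follows. This uses the real chrome.debugger extension API, which proxies the Chrome DevTools Protocol's Input.dispatchMouseEvent command and requires the "debugger" permission; treat it as an untested sketch, and note the guard so the helper is only invoked inside an extension context:

```javascript
// Sketch: trusted clicks via chrome.debugger (Chrome DevTools Protocol).
// Requires "debugger" in the extension's manifest permissions.

// Build the CDP Input.dispatchMouseEvent parameters for one phase of a click.
function mouseEventParams(x, y, type) {
  return { type, x, y, button: "left", clickCount: 1 };
}

async function trustedClick(tabId, x, y) {
  const target = { tabId };
  await chrome.debugger.attach(target, "1.3");
  try {
    await chrome.debugger.sendCommand(
      target, "Input.dispatchMouseEvent", mouseEventParams(x, y, "mousePressed"));
    await chrome.debugger.sendCommand(
      target, "Input.dispatchMouseEvent", mouseEventParams(x, y, "mouseReleased"));
  } finally {
    await chrome.debugger.detach(target);
  }
}

// Only meaningful inside an extension service worker; guarded elsewhere.
if (typeof chrome !== "undefined" && chrome.debugger) {
  // trustedClick(someTabId, 100, 200);
}
```

Input dispatched this way comes from the browser itself, which is why pages see it as trusted.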
There are a few ways:
If you plan to publish extensions utilizing these techniques I would recommend being upfront about why you need to use them since these are odd things to do in a normal extension. |
@bershanskiy Although nice to know, they're hardly valid substitutes for the API I'm proposing. I would also disagree that they're trivial. To expand a bit more on my proposal: it doesn't need to offer complex keyboard/mouse capabilities like Selenium's Actions API. More basic actions should be viable, and should get us 95% of the way there.
Cursor API
Keyboard API
Vision and Audio API: essentially the TabCapture/DesktopCapture API without restrictions, and with more granular options. |
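To make the shape of the Cursor/Keyboard proposal concrete, the sketch below invents an assistant surface with cursor and keyboard objects. Every name here is hypothetical (no browser ships any of this); a tiny in-memory stub stands in for an implementation so the interface can be exercised:

```javascript
// Hypothetical AI-assistant surface: a virtual cursor and keyboard that
// belong to the assistant, separate from the user's real input devices.
// All names are invented; this stub only records what would be performed.
function createAssistantStub() {
  const state = { x: 0, y: 0, typed: [], clicks: [] };
  return {
    cursor: {
      // Move the assistant's own virtual cursor (pyautogui-style).
      async moveTo(x, y) { state.x = x; state.y = y; },
      // A real implementation would dispatch a trusted click at (x, y).
      async click() { state.clicks.push({ x: state.x, y: state.y }); },
    },
    keyboard: {
      async type(text) { state.typed.push(text); },
    },
    // Exposed so a driver (or test) can observe the virtual devices.
    state,
  };
}
```

An assistant loop would then call cursor.moveTo(...) and keyboard.type(...) against whatever the Vision API reports, with the browser mediating permissions.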
I tested this out, and it doesn't seem to be the case. If you dispatch a trusted event manually, its trusted status resets back to false. |
Closing this as the request isn't specific enough for us to meaningfully discuss it. |
This powerful new API will be used by AI assistant extensions. Anything the user can do, the assistant should be able to do. This will be a very permissive API, but the benefits will outweigh the risks.
Similar to Python's pyautogui. For example, cursor.moveTo(x, y).
By virtual keyboard and mouse, I don't mean controlling the user's mouse/keyboard; the assistant should have its own virtual cursor and keyboard, and be able to collaborate with the user in real time (like Google Docs).
Later features (not necessary for the initial API).
You can imagine the accessibility and productivity benefits such an API could bring.