Proposal: AI Assistant API #541
What you are describing is already possible using the current extension system, since extensions can run arbitrary code on pages via content scripts, can inspect DOM, can synthesize events or trigger arbitrary functions, and even capture video feed of pages to perform computer vision analysis on videos. If you have a specific request about a particular use case not served by the current APIs, please share it. |
Some of them, but not in a way that's conducive to an AI assistant.
The point is not to parse DOM, but to see and be able to act like a human would.
Most websites ignore synthesized events by checking isTrusted. In addition, synthesized events aren't equivalent to a virtual cursor, touch, and keyboard. Anything a person can do, an AI should be able to do through this proposed API. In theory, you should even be able to play two-player games with the AI assistant.
I haven't used the DesktopCapture API much, but the TabCapture API is heavily restricted: it requires invoking the extension for every tab. If DesktopCapture is similar, it's not conducive to a seamless AI assistant experience. The AI should be able to navigate between tabs, and pretty much have the flexibility to do anything the user can. |
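The isTrusted point is easy to demonstrate: any event constructed from script reports isTrusted === false, so a page can reject synthetic input with a one-line check. A minimal sketch using the standard EventTarget/Event interfaces (which also exist in Node):

```javascript
// Any script-constructed event has isTrusted === false; only events
// generated by the user agent in response to real input are trusted.
const target = new EventTarget();

let observedTrusted = null;
target.addEventListener("click", (event) => {
  observedTrusted = event.isTrusted;
});

// Synthesize a click the way a content script might.
target.dispatchEvent(new Event("click"));

console.log(observedTrusted); // false — the page can detect and ignore it
```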
Could you please be more specific? What exact issue are you facing?
The "act like a human would" part is very abstract and sounds like you are proposing some kind of AGI. In practice, most AI is much simpler and actually has very well-defined input and output. The DOM tree is a great input for almost any AI.
Extensions can trivially programmatically forge the isTrusted attribute on the event. If you show me a game, I can trivially construct an API/adapter for the AI to use.
Extensions already can do this. If you have a specific question, I might be able to answer it. |
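To make the "DOM tree is a great input" point concrete, here is a toy sketch that flattens a DOM-like node tree into a text outline an extension could feed to a model. All names here are invented for illustration, and plain objects stand in for real DOM nodes so the example is self-contained:

```javascript
// Toy serializer: turn a DOM-like tree into an indented text outline.
// In a real content script you would walk document.body instead of
// this hand-built object tree.
function outline(node, depth = 0) {
  const label = node.text ? `${node.tag}: ${node.text}` : node.tag;
  const line = "  ".repeat(depth) + label;
  const children = (node.children || []).map((c) => outline(c, depth + 1));
  return [line, ...children].join("\n");
}

const page = {
  tag: "main",
  children: [
    { tag: "h1", text: "Checkout" },
    { tag: "button", text: "Pay now" },
  ],
};

console.log(outline(page));
// main
//   h1: Checkout
//   button: Pay now
```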
Yes, technically you might be able to code around this, and code around that. But, I'm proposing a flexible API where you wouldn't need to. The AI will figure it out, the browsers just need to provide the tools. |
Could you be more specific about the API you are proposing? |
|
@polywock all of your suggestions are neat! None of them are actual proposals, though. They are rough use cases for APIs that don't exist, and that no one has researched or developed. They are all built on a vague concept of "AI", when none of the vendors in this group ship a browser with any sort of AI APIs today. Your suggestions, while they are neat ideas, are missing the entire fundamental building block of how they would even work. What AI? How are the models provided?
To illustrate my point: I would find it cool if there were a simple API I could use to interact with my home automation. But that isn't a proposal. A proposal would be something like "create browser.mqtt to expose simplified IoT device pairing", in which I would go over the suggested structure of the API, the interfaces, methods, and events. |
Well, there is the Web Neural Network API, which has decent polyfills and is fairly close to shipping... but it's a bit off-topic. It is a generic web API to execute (and train) ML models on-device via platform-specific native ML APIs. Should be really cool once it actually ships. The larger point that "None of them are actual proposals" is spot-on. |
I'm confused by this response. A proposal is a suggestion; what makes the first one not a proposal?
The more powerful or proprietary AI assistants will capture the browser's screen and send it to a server over WebRTC, sending mouse and keyboard instructions back the same way. Some weaker models, like 7B LLMs, will be shipped to the client and run locally. |
A proposal is a suggestion, what makes the first one not a suggestion?
No, not in standards parlance. A proposal would look like this -
https://github.com/w3c/webextensions/blob/main/proposals/secure-storage.md
|
@bershanskiy This is not true; could you show some examples? |
Input events sent via the debugger API are trusted
|
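For reference, the debugger-API route mentioned above can be sketched as follows. This uses the real chrome.debugger extension API, which proxies the Chrome DevTools Protocol's Input.dispatchMouseEvent command and requires the "debugger" permission; treat it as an untested sketch, and note the guard so the helper is only invoked inside an extension context:

```javascript
// Sketch: trusted clicks via chrome.debugger (Chrome DevTools Protocol).
// Requires "debugger" in the extension's manifest permissions.

// Build the CDP Input.dispatchMouseEvent parameters for one phase of a click.
function mouseEventParams(x, y, type) {
  return { type, x, y, button: "left", clickCount: 1 };
}

async function trustedClick(tabId, x, y) {
  const target = { tabId };
  await chrome.debugger.attach(target, "1.3");
  try {
    await chrome.debugger.sendCommand(
      target, "Input.dispatchMouseEvent", mouseEventParams(x, y, "mousePressed"));
    await chrome.debugger.sendCommand(
      target, "Input.dispatchMouseEvent", mouseEventParams(x, y, "mouseReleased"));
  } finally {
    await chrome.debugger.detach(target);
  }
}

// Only meaningful inside an extension service worker; guarded elsewhere.
if (typeof chrome !== "undefined" && chrome.debugger) {
  // trustedClick(someTabId, 100, 200);
}
```

Input dispatched this way comes from the browser itself, which is why pages see it as trusted.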
There are a few ways:
If you plan to publish extensions utilizing these techniques I would recommend being upfront about why you need to use them since these are odd things to do in a normal extension. |
@bershanskiy Although nice to know, they're hardly valid substitutes for the API I'm proposing. I would also disagree that they're trivial. To expand a bit more on my proposal: it doesn't need to offer complex keyboard/mouse capabilities like Selenium's Actions API. More basic actions should be viable, and should get us 95% of the way there.
Cursor API
Keyboard API
Vision and Audio API: essentially the TabCapture/DesktopCapture API without restrictions, and with more granular options. |
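To make the shape of the Cursor/Keyboard proposal concrete, the sketch below invents an assistant surface with cursor and keyboard objects. Every name here is hypothetical (no browser ships any of this); a tiny in-memory stub stands in for an implementation so the interface can be exercised:

```javascript
// Hypothetical AI-assistant surface: a virtual cursor and keyboard that
// belong to the assistant, separate from the user's real input devices.
// All names are invented; this stub only records what would be performed.
function createAssistantStub() {
  const state = { x: 0, y: 0, typed: [], clicks: [] };
  return {
    cursor: {
      // Move the assistant's own virtual cursor (pyautogui-style).
      async moveTo(x, y) { state.x = x; state.y = y; },
      // A real implementation would dispatch a trusted click at (x, y).
      async click() { state.clicks.push({ x: state.x, y: state.y }); },
    },
    keyboard: {
      async type(text) { state.typed.push(text); },
    },
    // Exposed so a driver (or test) can observe the virtual devices.
    state,
  };
}
```

An assistant loop would then call cursor.moveTo(...) and keyboard.type(...) against whatever the Vision API reports, with the browser mediating permissions.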
I tested this out, and it doesn't seem to be the case. If you dispatch a trusted event manually, its trusted status resets back to false. |
Closing this as the request isn't specific enough for us to meaningfully discuss it. |
This powerful new API will be used by AI assistant extensions. Anything the user can do, the assistant should be able to do. This will be a very permissive API, but the benefits will outweigh the risks.
Similar to Python's pyautogui. For example, cursor.moveTo(x, y).
By virtual keyboard and mouse, I don't mean controlling the user's mouse/keyboard; the assistant should have its own virtual cursor and keyboard, and be able to collaborate with the user in real time (like Google Docs).
Later features (not necessary for the initial API).
You can imagine the accessibility and productivity benefits such an API could bring.