doc: update the instructions to configure the model service (#274)

Co-authored-by: zhouxiao.shaw <[email protected]>
yuyutaotao and zhoushaw authored Jan 15, 2025
1 parent beb74f1 commit afd0934
Showing 10 changed files with 189 additions and 114 deletions.
4 changes: 2 additions & 2 deletions apps/site/docs/en/automate-with-scripts-in-yaml.mdx
@@ -145,8 +145,8 @@ target:
# string, the path to save the aiQuery result, optional
output: <path-to-output-file>

# string, the bridge mode to use, optional, default is 'currentTab', can be 'newTabWithUrl' or 'currentTab'
bridgeMode: <mode>
# string, the bridge mode to use, optional, default is false, can be 'newTabWithUrl' or 'currentTab'. See the following section for more details
bridgeMode: false | 'newTabWithUrl' | 'currentTab'
```
The `tasks` part is an array that defines the tasks to perform. Remember to write a `-` before each item to mark it as an array item.
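For illustration, a minimal sketch of a full YAML file that combines the `target` configs above with a `tasks` array (the URL, prompts, and task name below are hypothetical placeholders, not part of the original document):

```yaml
# sketch of a config using bridge mode; all values are illustrative
target:
  url: "https://www.example.com"
  bridgeMode: currentTab
  output: ./output/result.json

tasks:
  - name: search-and-check
    flow:
      - ai: type "Headphones" in the search box, hit Enter
      - aiAssert: there are search results on the page
```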
49 changes: 26 additions & 23 deletions apps/site/docs/en/index.mdx
@@ -1,11 +1,5 @@
# Midscene.js - Joyful Automation by AI

UI automation testing is often difficult to maintain, as it involves a maze of *#ids*, *data-test* attributes, and *.selectors*. Refactoring can become a nightmare, even though this is precisely the situation where UI automation should be useful.

Introducing Midscene.js, an innovative SDK designed to bring joy back to automation scripts by simplifying the commands.

Midscene.js leverages a multimodal Large Language Model (LLM) to intuitively “understand” your user interface and carry out the necessary actions. You can simply describe the interaction steps or expected data formats, and the AI will handle the execution for you.

<div style={{"width": "100%", "display": "flex", justifyContent: "center"}}>
<iframe
style={{"maxWidth": "100%", "width": "800px", "height": "450px"}}
@@ -19,15 +13,25 @@ Midscene.js leverages a multimodal Large Language Model (LLM) to intuitively “

## Interact, query and assert by natural language

There are three main capabilities: action (`.ai`, `.aiAction`), query (`.aiQuery`), assert(`.aiAssert`).
There are three main capabilities: **action**, **query**, **assert**.

* Use `.ai` to execute a series of actions by describing the steps
* Use `.aiQuery` to extract customized data from the UI. Just describe the JSON format you want, and AI will give the answer based on its "understanding" of the page
* Use `.aiAssert` to perform assertions on the page.
* Use **action (`.ai`, `.aiAction`)** to execute a series of actions by describing the steps
* Use **query (`.aiQuery`)** to extract customized data from the UI. Describe the JSON format you want, and AI will give the answer based on its "understanding" of the page
* Use **assert (`.aiAssert`)** to perform assertions on the page.

All these methods accept a natural-language prompt as the parameter, so the cost of script maintenance is greatly decreased.

For example
## Start with Chrome extension

To quickly experience the main features of Midscene, you can use the Midscene Chrome extension. It allows you to use Midscene on any webpage without writing any code.

Click [here](https://chromewebstore.google.com/detail/midscene/gbldofcpkknbggpkmbdaefngejllnief) to install Midscene extension from Chrome Web Store.

For instructions, please refer to [Quick Experience](./quick-experience).

## Multiple ways to integrate

Maintaining automation scripts with Midscene is a brand-new experience. For example, to search for headphones on a website, you can do this:

```typescript
// 👀 type keywords, perform a search
@@ -44,11 +48,7 @@ console.log("headphones in stock", items);
await aiAssert("There is a category filter on the left");
```

## Multiple ways to integrate

To start experiencing the core feature of Midscene, we recommend you use [the Chrome Extension](./quick-experience). You can call Action / Query / Assert by natural language on any webpage, without needing to set up a code project.

Also, there are several ways to integrate Midscene into your code project:
There are several ways to integrate Midscene into your code project:

* [Automate with Scripts in YAML](./automate-with-scripts-in-yaml), use this if you prefer to write YAML file instead of code
* [Bridge Mode by Chrome Extension](./bridge-mode-by-chrome-extension), use this to control the desktop Chrome by scripts
@@ -57,20 +57,23 @@ Also, there are several ways to integrate Midscene into your code project:

## Visualized report

Midscene will provide a visual report after each run. With this report, you can review the animated replay and view the details of each step in the process. What's more, there is a playground in the report file for you to adjust your prompt without re-running all your scripts.
Midscene provides a visual report after each run. With this report, you can review the animated replay and view the details of each step in the process. What's more, there is a playground in the report file for you to adjust your prompt without re-running all your scripts.

<p align="center">
<img src="/report.gif" alt="visualized report" />
</p>

## Just you and model provider, no third-party services
## Customize model

Midscene.js is an open-source project (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)) under the MIT license. You can run it in your own environment. All data gathered from pages will be sent directly to OpenAI or the custom model provider according to your configuration. Therefore, only you and the model provider will have access to the data. No third-party platform will access the data.
By default, Midscene uses the OpenAI GPT-4o model, while you can [customize it to a different multimodal model](./model-provider) if needed.

## Customize Model
## Just you and model provider, no third-party services

By default, Midscene uses the OpenAI GPT-4o model, while you can [customize it to a different multimodal model](./model-provider) if needed.
Midscene.js is an open-source project (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)) under the MIT license. You can run it in your own environment. All data gathered from pages will be sent directly to OpenAI or the custom model provider according to your configuration. Therefore, no third-party platform will access the data.

## Start with Chrome Extension
## Follow us

To quickly experience the main features of Midscene, you can use the [Chrome Extension](./quick-experience). It allows you to use Midscene on any webpage without writing any code.
* [GitHub - give us a star](https://github.com/web-infra-dev/midscene)
* [Twitter](https://x.com/midscene_ai)
* [Discord](https://discord.gg/AFHJBdnn)
* [Lark](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=291q2b25-e913-411a-8c51-191e59aab14d)
100 changes: 64 additions & 36 deletions apps/site/docs/en/model-provider.md
@@ -1,60 +1,60 @@
# Customize Model and Provider

Midscene uses the OpenAI SDK to call AI services. You can customize the configuration using environment variables. All the configs can also be used in the [Chrome Extension](./quick-experience).
Midscene uses the OpenAI SDK to call AI services. Using this SDK constrains the input and output format of the AI service, but it doesn't mean you can only use OpenAI's models. You can use any model service that supports the same interface (most platforms and tools do).

These are the main configs, in which `OPENAI_API_KEY` is required.
In this article, we will show you how to configure the AI service provider and how to choose a different model.

Required:
## Configs

```bash
# replace by your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```
These are the most common configs, in which `OPENAI_API_KEY` is required.

Optional configs:
| Name | Description |
|------|-------------|
| `OPENAI_API_KEY` | Required. Your OpenAI API key (e.g. "sk-abcdefghijklmnopqrstuvwxyz") |
| `OPENAI_BASE_URL` | Optional. Custom base URL for the API endpoint. Often used to switch to a provider other than OpenAI. |
| `MIDSCENE_MODEL_NAME` | Optional. Specify a different model name (default is gpt-4o). Often used to switch to a different model. |
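Putting the common configs together, here is a hedged example of pointing Midscene at a non-OpenAI provider. The base URL and model name below are placeholders for illustration, not values from this document — replace all three with those of your own provider:

```shell
# replace all three values with your own provider's details
export OPENAI_API_KEY="sk-replace-by-your-own"
export OPENAI_BASE_URL="https://your-provider.example.com/v1"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```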

```bash
# if you want to use a customized endpoint
export OPENAI_BASE_URL="https://..."
Some advanced configs are also supported. Usually you don't need to use them.

# if you want to use Azure OpenAI Service. See more details in the next section.
export OPENAI_USE_AZURE="true"
| Name | Description |
|------|-------------|
| `OPENAI_USE_AZURE` | Optional. Set to "true" to use Azure OpenAI Service. See more details in the following section. |
| `MIDSCENE_OPENAI_INIT_CONFIG_JSON` | Optional. Custom JSON config for OpenAI SDK initialization |
| `MIDSCENE_OPENAI_SOCKS_PROXY` | Optional. Proxy configuration (e.g. "socks5://127.0.0.1:1080") |
| `OPENAI_MAX_TOKENS` | Optional. Maximum tokens for model response |

# if you want to specify a model name other than gpt-4o
export MIDSCENE_MODEL_NAME='qwen-vl-max-latest';
## Two ways to config environment variables

# if you want to pass customized JSON data to the `init` process of OpenAI SDK
export MIDSCENE_OPENAI_INIT_CONFIG_JSON='{"baseURL":"....","defaultHeaders":{"key": "value"}}'
Pick one of the following ways to configure environment variables.

# if you want to use proxy. Midscene uses `socks-proxy-agent` under the hood.
export MIDSCENE_OPENAI_SOCKS_PROXY="socks5://127.0.0.1:1080"
### 1. Set environment variables in your system

# if you want to specify the max tokens for the model
export OPENAI_MAX_TOKENS=2048
```bash
# replace by your own
export OPENAI_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
```

## Using Azure OpenAI Service
### 2. Set environment variables using dotenv

This is what we used in our [demo project](https://github.com/web-infra-dev/midscene-example).

Use ADT token provider
[Dotenv](https://www.npmjs.com/package/dotenv) is a zero-dependency module that loads environment variables from a `.env` file into `process.env`.

```bash
# this is always true when using Azure OpenAI Service
export MIDSCENE_USE_AZURE_OPENAI=1
# install dotenv
npm install dotenv --save
```

export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_API_VERSION="2024-05-01-preview"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
Create a `.env` file in your project root directory, and add the following content. There is no need to add `export` before each line.

```
OPENAI_API_KEY=sk-abcdefghijklmnopqrstuvwxyz
```

Or use keyless authentication
Import the dotenv module in your script. It will automatically read the environment variables from the `.env` file.

```bash
export MIDSCENE_USE_AZURE_OPENAI=1
export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_KEY="..."
export AZURE_OPENAI_API_VERSION="2024-05-01-preview"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
```typescript
import 'dotenv/config';
```
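Under the hood, dotenv simply parses `KEY=VALUE` lines and puts them on `process.env` without overwriting variables that are already set. A minimal sketch of that behavior (illustration only — the temp-file path and `DEMO_API_KEY` variable are hypothetical; use the real `dotenv` package in practice):

```typescript
// minimal re-implementation of dotenv's core behavior:
// read KEY=VALUE lines from a .env file into process.env
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// write a throwaway .env file so the sketch is self-contained
const envFile = path.join(os.tmpdir(), '.env');
fs.writeFileSync(envFile, 'DEMO_API_KEY=sk-abcdefghijklmnopqrstuvwxyz\n');

for (const line of fs.readFileSync(envFile, 'utf8').split('\n')) {
  const match = line.match(/^(\w+)\s*=\s*(.*)$/);
  // like dotenv, do not overwrite variables that are already set
  if (match && process.env[match[1]] === undefined) {
    process.env[match[1]] = match[2];
  }
}

console.log(process.env.DEMO_API_KEY); // prints "sk-abcdefghijklmnopqrstuvwxyz"
```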

## Choose a model other than `gpt-4o`
@@ -80,6 +80,32 @@ export ANTHROPIC_API_KEY="....."
export MIDSCENE_MODEL_NAME="claude-3-opus-20240229"
```

## Using Azure OpenAI Service

There are some extra configs when using Azure OpenAI Service.

### Use ADT token provider

```bash
# this is always true when using Azure OpenAI Service
export MIDSCENE_USE_AZURE_OPENAI=1

export MIDSCENE_AZURE_OPENAI_SCOPE="https://cognitiveservices.azure.com/.default"
export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_API_VERSION="2024-05-01-preview"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
```

### Use keyless authentication

```bash
export MIDSCENE_USE_AZURE_OPENAI=1
export AZURE_OPENAI_ENDPOINT="..."
export AZURE_OPENAI_KEY="..."
export AZURE_OPENAI_API_VERSION="2024-05-01-preview"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
```

## Example: Using `gemini-1.5-pro` from Google

Configure the environment variables:
@@ -104,6 +130,8 @@ export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"

Create an inference endpoint first: https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint

In the inference endpoint page, find an ID like `ep-202...` and use it as the model name.

Configure the environment variables:

```bash
3 changes: 2 additions & 1 deletion apps/site/docs/en/quick-experience.mdx
@@ -14,7 +14,7 @@ Prepare an OpenAI API key; we will use it soon.

Install the Midscene extension from the Chrome Web Store: [Midscene](https://chromewebstore.google.com/detail/midscene/gbldofcpkknbggpkmbdaefngejllnief)

Start the extension (may be folded by default), setup the config by pasting the config in the K=V format:
Start the extension (it may be collapsed under the Chrome extensions icon), and set up the config by pasting it in K=V format:

```shell
OPENAI_API_KEY="sk-replace-by-your-own"
@@ -35,6 +35,7 @@ Enjoy!
After trying it out, you may want to write some code to integrate Midscene. There are multiple ways to do that; please refer to the documents below:

* [Automate with Scripts in YAML](./automate-with-scripts-in-yaml)
* [Bridge Mode by Chrome Extension](./bridge-mode-by-chrome-extension)
* [Integrate with Puppeteer](./integrate-with-puppeteer)
* [Integrate with Playwright](./integrate-with-playwright)

3 changes: 3 additions & 0 deletions apps/site/docs/zh/automate-with-scripts-in-yaml.mdx
@@ -144,6 +144,9 @@ target:

# path to the JSON file for saving the aiQuery result, optional
output: <path-to-output-file>

# the bridge mode, optional, default is false, can be 'newTabWithUrl' or 'currentTab'. See the following section for more details
bridgeMode: false | 'newTabWithUrl' | 'currentTab'
```
The `tasks` part is an array that defines the steps the script performs. Remember to add a `-` before each step.
49 changes: 26 additions & 23 deletions apps/site/docs/zh/index.mdx
@@ -1,26 +1,29 @@
# Midscene.js - Joyful UI Automation, Empowered by AI
# Midscene.js - Joyful UI Automation, Driven by AI

Traditional UI automation is too hard to maintain. Automation scripts are often littered with selectors such as `#ids`, `data-test`, and `.selectors`. When refactoring is needed, this can be a real headache, even though this is precisely the situation where UI automation should be able to help.
<video src="/introduction/Midscene.mp4" controls/>

We present Midscene.js to help you rediscover the joy of coding.
## Interact, extract data, and assert with AI

Midscene.js uses a multimodal large language model (LLM) that can intuitively "understand" your user interface and perform the necessary operations. You only need to describe the interaction steps or the expected data format, and the AI completes the task for you
Midscene provides three key methods: interaction (`.ai`, `.aiAction`), extraction (`.aiQuery`), and assertion (`.aiAssert`)

<video src="/introduction/Midscene.mp4" controls/>
* Interaction - use the `.ai` method to describe the steps and perform interactions
* Extraction - use `.aiQuery` to "understand" the UI and extract data; the return value is in JSON format, and you can freely describe the data structure you want
* Assertion - use `.aiAssert` to perform assertions

## Interact, extract data, and assert with AI
## Quick experience with the Chrome extension

There are three key methods in total: interaction (`.ai`, `.aiAction`), extraction (`.aiQuery`), and assertion (`.aiAssert`)
With the Midscene.js Chrome extension, you can quickly experience the main features of Midscene on any webpage, without writing any code

* Use the `.ai` method to describe the steps and perform interactions
* Use `.aiQuery` to "understand" the UI and extract data; the return value is in JSON format, and you can freely describe the data structure you want
* Use `.aiAssert` to perform assertions
Click [here](https://chromewebstore.google.com/detail/midscene/gbldofcpkknbggpkmbdaefngejllnief) to install the Midscene extension from the Chrome Web Store.

For example:
Please refer to the doc [Quick Experience with the Chrome Extension](./quick-experience) for installation and configuration.

## Multiple ways to integrate with code

Maintaining Midscene automation scripts is a brand-new coding experience. For example, to search for headphones on a website, you can do this:

```typescript
// 👀 enter keywords, perform a search
// although this is an English page, you can also control it with instructions in Chinese
await ai('Type "Headphones" in the search box, hit Enter');

// 👀 find the headphone-related information in the list
@@ -31,13 +34,9 @@ const items = await aiQuery(
console.log("headphones in stock", items);
```

## Multiple ways to integrate

If you want to try Midscene's core capabilities, we recommend starting with the [Chrome extension](./quick-experience) for a quick experience. With the extension, you can use natural language on any webpage to call the interaction, extraction, and assertion interfaces, without setting up a code project.
There are several ways to integrate Midscene into a code project:

In addition, there are several ways to integrate Midscene into your code:

* [Automate with Scripts in YAML](./automate-with-scripts-in-yaml), if you prefer writing YAML files instead of code
* [Automate with Scripts in YAML](./automate-with-scripts-in-yaml), if you prefer writing YAML files instead of JavaScript code
* [Bridge Mode by Chrome Extension](./bridge-mode-by-chrome-extension), use it to control the desktop Chrome with scripts
* [Integrate with Puppeteer](./integrate-with-puppeteer)
* [Integrate with Playwright](./integrate-with-playwright)
@@ -52,14 +51,18 @@ console.log("headphones in stock", items);
<img src="/report.gif" alt="visualized report" />
</p>

## Direct connection to the model provider, no third-party services

Midscene.js is an open-source project under the MIT license (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)). The code runs in your own environment, and all data collected from pages is sent directly to OpenAI or the specified custom model according to your configuration. Therefore, only you and the specified model provider can access the data; no third-party platform can obtain it.

## Customize model

Currently we use OpenAI GPT-4o as the default model, but you can also [customize it to another multimodal model](./model-provider)

## Quick experience with the Chrome extension
## Direct connection to the model provider, no third-party services

Midscene.js is an open-source project under the MIT license (GitHub: [Midscene](https://github.com/web-infra-dev/midscene/)). The code runs in your own environment, and all data collected from pages is sent directly to OpenAI or the specified custom model according to your configuration. Therefore, no third-party platform can obtain the data.

## Follow us

With the Midscene.js Chrome extension, you can quickly experience Midscene's main features on any webpage without writing any code. Please refer to the doc [Quick Experience with the Chrome Extension](./quick-experience) for installation and configuration.
* [GitHub - please give us a star](https://github.com/web-infra-dev/midscene)
* [Twitter](https://x.com/midscene_ai)
* [Discord](https://discord.gg/AFHJBdnn)
* [Lark group](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=291q2b25-e913-411a-8c51-191e59aab14d)