Create speech-recognition-context.md #140 (open, 1 commit into `main`)
**`.gitignore`** (2 additions, 0 deletions)

```diff
@@ -1 +1,3 @@
 index.html
+.DS_Store
+.idea/
```
**`explainers/speech-recognition-context.md`** (new file, 168 additions)
# Explainer: Speech Recognition Context for the Web Speech API


## Introduction


The Web Speech API provides speech recognition with generalized language models, but currently lacks explicit mechanisms to support contextual biasing. This means it is difficult to prioritize certain words or phrases, or adapt to user-specific or domain-specific vocabularies.


We therefore propose adding **recognition context** to the Web Speech API, along with a method for updating it during an active speech recognition session.


## Why Add Speech Recognition Context?


### 1. **Improved Accuracy**
Recognition context allows the speech recognition models to recognize words and phrases that are relevant to a specific domain, user, or situation, even if those terms are rare or absent from the general training data. This can significantly decrease the number of errors in transcriptions, especially in scenarios where ambiguity exists.


### 2. **Enhanced Relevance**
By incorporating contextual information, the speech recognition models can produce transcriptions that are more meaningful and better aligned with the user's expectations of the output. This leads to improved understanding of the spoken content and more reliable execution of voice commands.


### 3. **Increased Efficiency**
By focusing on relevant vocabulary, the speech recognition models can streamline the recognition process, leading to faster and more efficient transcriptions.


## New Interfaces


### 1. **SpeechRecognitionPhrase**
This interface holds the fundamental information about a biasing phrase: a text string containing the phrase, and a float boost indicating how likely the phrase is to appear. The phrase string must be non-empty, and a valid boost must be within the range [0.0, 10.0], with a default value of 1.0. A boost of 0.0 means the phrase is not boosted at all, and a higher boost means the phrase is more likely to appear.


#### Definition
```webidl
[Exposed=Window]
interface SpeechRecognitionPhrase {
  constructor(DOMString phrase, optional float boost = 1.0);
  readonly attribute DOMString phrase;
  readonly attribute float boost;
};
```
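
#### Example Usage
A brief sketch of how this interface could be used; the phrase strings and boost values below are illustrative placeholders.

```javascript
// A domain-specific term that should be strongly favored by the recognizer.
const boosted = new SpeechRecognitionPhrase("contextual biasing", 5.0);

// The boost defaults to 1.0 when omitted.
const plain = new SpeechRecognitionPhrase("Web Speech API");

console.log(boosted.phrase); // "contextual biasing"
console.log(plain.boost);    // 1
```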


### 2. **SpeechRecognitionPhraseList**
This interface holds a list of `SpeechRecognitionPhrase` objects and supports appending additional phrases to the list.
> **Review comment (Member):** Is this just `sequence<SpeechRecognitionPhrase>`?
>
> **Reply (@yrw-google, Feb 21, 2025):** I experimented with using `sequence` a bit. It is feasible to put `sequence<SpeechRecognitionPhrase>` inside `SpeechRecognitionContext` and get rid of `SpeechRecognitionPhraseList`, but that means definitions such as `length` (which Blink/V8 requires when it detects an array) and `addItem` would have to move from `SpeechRecognitionPhraseList` into `SpeechRecognitionContext`. If more types of data are added to `SpeechRecognitionContext` in the future, that could become confusing, and a second array inside `SpeechRecognitionContext` could not be supported because the `length` definition would have to be duplicated.
>
> From the IDL examples I can find, it seems to be common practice to create a new `ObjectList` interface to support a new `Object` interface (e.g. `DataTransferItemList`), while `sequence` is used more often in a dictionary. I'm not sure whether `SpeechRecognitionPhrase` and `SpeechRecognitionContext` should become dictionaries instead, so that their relationship is as simple as `SpeechRecognitionContext` containing a sequence of `SpeechRecognitionPhrase`. We want to perform data validation on each `SpeechRecognitionPhrase`, so using a dictionary may oversimplify things. Let me know what you think!



#### Definition
```webidl
[Exposed=Window]
interface SpeechRecognitionPhraseList {
  constructor();
  readonly attribute unsigned long length;
  SpeechRecognitionPhrase item(unsigned long index);
  undefined addItem(SpeechRecognitionPhrase item);
};
```
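
#### Example Usage
A brief sketch of how this interface could be used; the phrases are illustrative placeholders.

```javascript
// Collect a few domain-specific terms into a phrase list.
const phrases = new SpeechRecognitionPhraseList();
phrases.addItem(new SpeechRecognitionPhrase("on-device recognition", 2.0));
phrases.addItem(new SpeechRecognitionPhrase("Web Speech API"));

// The list exposes its size and indexed access to its items.
console.log(phrases.length);         // 2
console.log(phrases.item(0).phrase); // "on-device recognition"
```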


### 3. **SpeechRecognitionContext**
This interface holds a `SpeechRecognitionPhraseList` attribute providing contextual information to the speech recognition models. It can hold more types of contextual data if needed in the future.


#### Definition
```webidl
[Exposed=Window]
interface SpeechRecognitionContext {
  constructor(SpeechRecognitionPhraseList phrases);
  readonly attribute SpeechRecognitionPhraseList phrases;
};
```
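
#### Example Usage
A brief sketch of constructing a context from a phrase list; the phrases are illustrative placeholders.

```javascript
// Wrap a phrase list in a context object. The list is readable back
// through the read-only `phrases` attribute.
const phrases = new SpeechRecognitionPhraseList();
phrases.addItem(new SpeechRecognitionPhrase("speech recognition context", 3.0));

const context = new SpeechRecognitionContext(phrases);
console.log(context.phrases.length);        // 1
console.log(context.phrases.item(0).boost); // 3
```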


## New Attribute


### 1. `context` attribute in the `SpeechRecognition` interface
The `context` attribute of type `SpeechRecognitionContext` in the `SpeechRecognition` interface provides the initial contextual information with which to start the speech recognition session.


#### Example Usage
```javascript
// Set up the initial recognition context before starting the session.
const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));
const context = new SpeechRecognitionContext(list);

const recognition = new SpeechRecognition();
recognition.context = context;
recognition.start();
```


## New Method


### 1. `undefined updateContext(SpeechRecognitionContext context)`
This method in the `SpeechRecognition` interface updates the speech recognition context after the speech recognition session has started. If the session has not started yet, the `context` attribute should be set instead of calling this method.


#### Example Usage
```javascript
const recognition = new SpeechRecognition();
recognition.start();

// Bias the active session towards a new set of phrases.
const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("updated text", 2.0));
const context = new SpeechRecognitionContext(list);
recognition.updateContext(context);
```


## New Error Code


### 1. `context-not-supported` in the `SpeechRecognitionErrorCode` enum
This error code is returned when speech recognition context is set during `SpeechRecognition` initialization, or `updateContext()` is called, but the speech recognition model does not support biasing. For example, Chrome requires on-device speech recognition to be used in order to support recognition context.


#### Example Scenario 1
```javascript
const recognition = new SpeechRecognition();
recognition.mode = "cloud-only";

const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));
const context = new SpeechRecognitionContext(list);
recognition.context = context;

// If the speech recognition model in the cloud does not support biasing,
// an error event will occur when the recognition starts.
recognition.onerror = function(event) {
  if (event.error == "context-not-supported") {
    console.log("Speech recognition context is not supported: ", event);
  }
};
recognition.start();
```


#### Example Scenario 2
```javascript
const recognition = new SpeechRecognition();
recognition.mode = "cloud-only";
recognition.start();

const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));
const context = new SpeechRecognitionContext(list);

// If the speech recognition model in the cloud does not support biasing,
// an error event will occur when calling updateContext().
recognition.onerror = function(event) {
  if (event.error == "context-not-supported") {
    console.log("Speech recognition context is not supported: ", event);
  }
};

recognition.updateContext(context);
```

## Conclusion


Adding `SpeechRecognitionContext` to the Web Speech API is a critical step towards supporting contextual biasing for speech recognition. It enables developers to improve accuracy and relevance for domain-specific and personalized speech recognition, and allows dynamic adaptation during active speech recognition sessions.