Create speech-recognition-context.md #140
# Explainer: Speech Recognition Context for the Web Speech API

## Introduction

The Web Speech API provides speech recognition with generalized language models, but it currently lacks an explicit mechanism to support contextual biasing. This makes it difficult to prioritize certain words or phrases, or to adapt to user-specific or domain-specific vocabularies.

We therefore introduce **recognition context** as a new part of the Web Speech API, along with an update method for keeping it current during an active speech recognition session.

## Why Add Speech Recognition Context?

### 1. **Improved Accuracy**
Recognition context allows the speech recognition models to recognize words and phrases that are relevant to a specific domain, user, or situation, even if those terms are rare or absent from the general training data. This can significantly reduce transcription errors, especially in scenarios where ambiguity exists.

### 2. **Enhanced Relevance**
By incorporating contextual information, the speech recognition models can produce transcriptions that are more meaningful and better aligned with the user's expectations. This leads to improved understanding of the spoken content and more reliable execution of voice commands.

### 3. **Increased Efficiency**
By focusing on relevant vocabulary, the speech recognition models can streamline the recognition process, leading to faster and more efficient transcriptions.

## New Interfaces

### 1. **SpeechRecognitionPhrase**
This interface holds the fundamental information about a biasing phrase: a text string containing the phrase and a float boost indicating how likely the phrase is to appear. The phrase string should be non-empty, and a valid boost should be within the range [0.0, 10.0], with a default value of 1.0. A boost of 0.0 means the phrase is not boosted at all, and a higher boost means the phrase is more likely to appear.

#### Definition
```javascript
[Exposed=Window]
interface SpeechRecognitionPhrase {
  constructor(DOMString phrase, optional float boost = 1.0);
  readonly attribute DOMString phrase;
  readonly attribute float boost;
};
```

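Below is a minimal sketch of constructing phrases with the proposed interface. The phrase strings are only illustrative, and how an implementation reports an out-of-range boost is not specified in this explainer.

```javascript
// A phrase using the default boost of 1.0.
const commonTerm = new SpeechRecognitionPhrase("contextual biasing");

// A rare, domain-specific term that should be strongly favored.
const rareTerm = new SpeechRecognitionPhrase("myocarditis", 5.0);

console.log(rareTerm.phrase); // "myocarditis"
console.log(rareTerm.boost);  // 5.0

// Assumption: a boost outside [0.0, 10.0] would be rejected by the
// implementation, but the exact behavior is not defined here.
```
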
### 2. **SpeechRecognitionPhraseList**
This interface holds a list of `SpeechRecognitionPhrase` objects and supports adding more `SpeechRecognitionPhrase` objects to the list.

#### Definition
```javascript
[Exposed=Window]
interface SpeechRecognitionPhraseList {
  constructor();
  readonly attribute unsigned long length;
  SpeechRecognitionPhrase item(unsigned long index);
  undefined addItem(SpeechRecognitionPhrase item);
};
```

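The sketch below shows how a list might be built and inspected with the proposed `length`, `item()`, and `addItem()` members; the phrases themselves are only examples.

```javascript
const phrases = new SpeechRecognitionPhraseList();
phrases.addItem(new SpeechRecognitionPhrase("latte macchiato", 2.0));
phrases.addItem(new SpeechRecognitionPhrase("cortado", 3.0));

// Iterate over the list using length and item().
for (let i = 0; i < phrases.length; i++) {
  const p = phrases.item(i);
  console.log(`${p.phrase} (boost: ${p.boost})`);
}
```
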
### 3. **SpeechRecognitionContext**
This interface holds a `SpeechRecognitionPhraseList` attribute that provides contextual information to the speech recognition models. It can hold more types of contextual data in the future if needed.

#### Definition
```javascript
[Exposed=Window]
interface SpeechRecognitionContext {
  constructor(SpeechRecognitionPhraseList phrases);
  readonly attribute SpeechRecognitionPhraseList phrases;
};
```

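A page could wrap the three interfaces in a small convenience helper. The function below is purely illustrative and not part of the proposal.

```javascript
// Hypothetical helper: build a SpeechRecognitionContext from an array of
// { phrase, boost } records.
function makeContext(entries) {
  const list = new SpeechRecognitionPhraseList();
  for (const { phrase, boost = 1.0 } of entries) {
    list.addItem(new SpeechRecognitionPhrase(phrase, boost));
  }
  return new SpeechRecognitionContext(list);
}

const context = makeContext([
  { phrase: "WebRTC", boost: 2.0 },
  { phrase: "WebGPU", boost: 2.0 },
]);
```
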
## New Attribute

### 1. `context` attribute in the `SpeechRecognition` interface
The `context` attribute, of type `SpeechRecognitionContext`, on the `SpeechRecognition` interface provides the initial contextual information with which to start the speech recognition session.

#### Example Usage
```javascript
var list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));
var context = new SpeechRecognitionContext(list);

const recognition = new SpeechRecognition();
recognition.context = context;
recognition.start();
```

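For completeness, here is a sketch of how a biased session fits into the usual Web Speech API flow. The `onresult` handling uses the existing API surface, and the boosted phrase is only an example.

```javascript
const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("SpeechRecognitionContext", 4.0));

const recognition = new SpeechRecognition();
recognition.context = new SpeechRecognitionContext(list);

recognition.onresult = (event) => {
  // With the context set, the transcript is more likely to contain the
  // boosted phrase when something similar is spoken.
  console.log(event.results[0][0].transcript);
};

recognition.start();
```
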
## New Method

### 1. `void updateContext(SpeechRecognitionContext context)`
This method in the `SpeechRecognition` interface updates the speech recognition context after the speech recognition session has started. If the session has not started yet, the `context` attribute should be set instead of calling this method.

#### Example Usage
```javascript
const recognition = new SpeechRecognition();
recognition.start();

var list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("updated text", 2.0));
var context = new SpeechRecognitionContext(list);
recognition.updateContext(context);
```

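One intended use is adapting the bias while a long-running session stays open. The sketch below updates the context whenever the application's state changes; `onViewChanged` and the phrase lists are hypothetical application code, not part of the API.

```javascript
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.start();

// Hypothetical application hook that fires when the user switches views.
onViewChanged((viewName) => {
  const list = new SpeechRecognitionPhraseList();
  if (viewName === "contacts") {
    list.addItem(new SpeechRecognitionPhrase("Anneli", 3.0));
    list.addItem(new SpeechRecognitionPhrase("Søren", 3.0));
  } else if (viewName === "playback") {
    list.addItem(new SpeechRecognitionPhrase("pause", 2.0));
    list.addItem(new SpeechRecognitionPhrase("skip intro", 2.0));
  }
  recognition.updateContext(new SpeechRecognitionContext(list));
});
```
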
## New Error Code

### 1. `context-not-supported` in the `SpeechRecognitionErrorCode` enum
This error code is returned when the speech recognition models do not support biasing but a speech recognition context is set during `SpeechRecognition` initialization, or the update context method is called. For example, Chrome requires on-device speech recognition to be used in order to support recognition context.

#### Example Scenario 1
```javascript
const recognition = new SpeechRecognition();
recognition.mode = "cloud-only";

var list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));
var context = new SpeechRecognitionContext(list);
recognition.context = context;

// If the speech recognition model in the cloud does not support biasing,
// an error event will occur when the recognition starts.
recognition.onerror = function(event) {
  if (event.error == "context-not-supported") {
    console.log("Speech recognition context is not supported: ", event);
  }
};

recognition.start();
```

> **Review comment:** Is this example leaving out the actual use of the context that would lead to an error? I'm not sure if I follow.
>
> **Author reply:** I rewrote and added more sample codes to hopefully make it clear. Please take a look again and let me know if additional explanation is needed!

#### Example Scenario 2
```javascript
const recognition = new SpeechRecognition();
recognition.mode = "cloud-only";
recognition.start();

var list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));
var context = new SpeechRecognitionContext(list);

// If the speech recognition model in the cloud does not support biasing,
// an error event will occur when calling updateContext().
recognition.onerror = function(event) {
  if (event.error == "context-not-supported") {
    console.log("Speech recognition context is not supported: ", event);
  }
};

recognition.updateContext(context);
```

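A page that wants to keep working when biasing is unavailable could react to this error by retrying without a context. This is a sketch of one possible fallback strategy, not behavior prescribed by the proposal.

```javascript
const list = new SpeechRecognitionPhraseList();
list.addItem(new SpeechRecognitionPhrase("text", 1.0));

let recognition = new SpeechRecognition();
recognition.context = new SpeechRecognitionContext(list);

recognition.onerror = (event) => {
  if (event.error === "context-not-supported") {
    // Fall back to a plain session with no recognition context.
    recognition = new SpeechRecognition();
    recognition.start();
  }
};

recognition.start();
```
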
## Conclusion

Adding `SpeechRecognitionContext` to the Web Speech API is a critical step towards supporting contextual biasing for speech recognition. It enables developers to improve accuracy and relevance for domain-specific and personalized speech recognition, and it allows for dynamic adaptation during active speech recognition sessions.

> **Review comment:** Is this just `sequence<SpeechRecognitionPhrase>`?
>
> **Author reply:** I experimented with using sequence a bit. I think it is feasible to put `sequence<SpeechRecognitionPhrase>` inside `SpeechRecognitionContext` and get rid of `SpeechRecognitionPhraseList`, but that means we would need to move other members, like the definitions of `length` (which is somehow required by blink/v8 if it detects an array) and `addItem`, from `SpeechRecognitionPhraseList` into `SpeechRecognitionContext` too. In that case, if we add more types of data to `SpeechRecognitionContext` in the future, it might become confusing, and we won't be able to support another array inside `SpeechRecognitionContext` because the definition of `length` would need to be duplicated.
>
> From the IDL examples I can find, it seems to be common practice to create a new `ObjectList` interface to support a new `Object` interface (e.g. `DataTransferItemList`), and sequence is used more often in a dictionary. I'm not sure if we should make `SpeechRecognitionPhrase` and `SpeechRecognitionContext` become dictionaries instead, so that their relationship would be as simple as `SpeechRecognitionContext` containing a sequence of `SpeechRecognitionPhrase`. We want to perform data validation on each `SpeechRecognitionPhrase`, so using a dictionary may oversimplify things? Let me know what you think!