This document explains the technical details behind Chrome accessibility code by starting at a high level and progressively adding more levels of detail.
See Part 1 first.
[TOC]
There are different ways that a web browser can be multi-process, so it's important to first discuss the model Chromium uses.
In Chromium, there's a single browser process. That process is the "main" process that's launched by the user. It owns all of the windows and UI elements, and it handles nearly all of the interaction with operating system APIs. Then there are multiple render processes that handle running each individual web page. Render processes are sandboxed - this means that they're basically forbidden from directly talking to the operating system.
Each renderer handles one web page. For now let's forget about iframes; we'll deal with that added complexity down below. A renderer handles the entire lifecycle of a page - managing the HTTP connection, parsing the HTML, resolving CSS styles, executing the JavaScript, and figuring out what to draw to the screen.
Because the renderer is in a sandboxed process, it doesn't directly get user input like key presses or mouse clicks, and it doesn't directly draw to the screen. These things are all handled by communicating with the browser process. The browser process owns the window; when there's a user input event like a mouse click or key press, it forwards that event to the appropriate renderer. The renderer figures out how to draw the webpage, but because it's sandboxed it doesn't draw directly to the screen. Instead, it either sends pixels (in software rendering mode) or sends drawing commands to Chromium's separate GPU process.
A simplified diagram of Chromium's multi-process architecture is shown here:
In most system diagrams we're only going to show a single web page, because supporting multiple windows and tabs doesn't complicate the accessibility architecture much. But, it's helpful to understand how the system looks when the user has multiple windows and tabs open. Notably:
- The browser process owns all of the windows and the tabs within them.
- Each browser tab maintains a connection to a web page running in a render process.
- There can be multiple render processes, but there's no correspondence between browser windows, tabs, and render processes. The different web page renderers might all be in one process, they might all be in different processes, or they might be split across processes in any arbitrary way, as shown in this diagram:
You can read more about Chromium's multi-process architecture here - note that this document is old, so it's more Windows-centric and some of the details are out of date, but the basic design is still quite similar to today:
https://www.chromium.org/developers/design-documents/multi-process-architecture
The majority of the code inside each web renderer is implemented in a module called Blink (the Blink Rendering Engine). Historically, when Chromium was first released, this module was WebKit, but it was forked and renamed Blink in 2013. As described in How Blink Works, Blink implements everything that renders content inside a browser tab:
- Implement the specs of the web platform (e.g., HTML standard), including DOM, CSS and Web IDL
- Embed V8 and run JavaScript
- Request resources from the underlying network stack
- Build DOM trees
- Calculate style and layout
- Embed Chrome Compositor and draw graphics
There are a few small layers in Chromium's render processes outside of Blink. They include:
- Handling the multi-process communication
- The renderer side of Chrome browser features (like spell check or autofill) that aren't a core part of the web platform
There are other ways that a multi-process web browser could be implemented. Another possible approach is that each tab could be in its own process, but each tab still communicates directly with the operating system. In this model, the operating system can send input events directly to the active tab, and the tab can paint its own contents directly.
Both Apple Safari and Microsoft Edge Legacy (i.e. Edge before Chromium) use variations of this multi-process model. They get some of these multi-process advantages, shared with Chromium:
- Stability: a stuck tab won't hang the whole browser, a crashed tab won't crash the whole browser
- Performance: a slow tab won't prevent other tabs from being responsive
- Isolation: a compromised tab won't have access to user data from other tabs
Chromium chose its multi-process model with sandboxed render processes because it provides much stronger protection against exploits:
- Security: a compromised sandboxed renderer has no access to the operating system, so it can't compromise the user's system
Unfortunately, Chromium's architecture makes accessibility more complex. Accessibility APIs are operating system APIs. In a non-sandboxed multi-process browser like in the diagram above, each tab can directly handle its own accessibility. In Chromium, accessibility APIs need to be handled by the browser process, even though most of the information about the web page lives in one of the sandboxed render processes.
Under Chrome's architecture, operating system accessibility APIs can only talk to the browser process. In fact, assistive technology doesn't even know that the other processes exist. All of the windows are owned by the main browser process, so all of the accessibility APIs get called in that process.
Let's consider the following scenario: assistive technology has found a node in the accessibility tree corresponding to a checkbox, and wants to query it to find out its current state (enabled, checked, focused, etc.).
When we were first building accessibility support in Chromium, one approach we tried was for the browser process to have a lightweight tree of proxy objects, each one corresponding to a node in the accessibility tree in the render process. Upon receiving a call to getState, the browser process would make a blocking IPC to the render process, get the state of that particular node, then reply to the accessibility API call with that result.
We discovered two problems with this approach.
First, making blocking IPCs from the browser to renderer was highly discouraged. It introduced "jankiness" into the browser and introduced the possibility of deadlock. Unfortunately nearly all accessibility APIs are synchronous method calls, so there's no easy way around this.
Second, we discovered that some assistive technology was calling many thousands of accessibility APIs in a row when loading a page. For example, both JAWS and NVDA scanned the entire web page from top to bottom on first load in order to build their virtual buffers. This proxy model was slowing things down dramatically. Even though most calls only took a millisecond to return, when accessing thousands of nodes sequentially that resulted in even medium-sized web pages taking 10 seconds or more to load.
In addition, while most calls would be fast, some blocking calls could take much longer because they'd need to block until not only the render process's main thread was free, but also until the document was in a clean layout state (more on layout below under Blink). And if a renderer was hung (long-running JavaScript or an endless loop), the blocking call might never return or might be forced to time out.
Instead of the proxying approach, Chromium caches the full accessibility tree for every web page in the browser process. When accessibility API calls come in from the operating system or assistive technology, they're handled immediately out of the cache, never blocking on a render process. Separately, renderers send atomic updates to the browser process to keep the accessibility tree up-to-date.
Here's a diagram of this approach:
One advantage to this approach is that handling operating system accessibility API calls is quite fast. In fact, this design leads to even faster performance than a traditional single-process browser, where many API calls would be handled by querying the DOM tree or Layout tree for details.
This approach is also completely free from blocking IPCs or deadlocks.
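To make the caching idea concrete, here's a minimal sketch. It assumes hypothetical names (`CachedNode`, `AccessibilityCache`) that are not Chromium's actual classes, and it stores only two boolean states per node. The point is just the shape: updates from a renderer overwrite local entries, and queries read local data without ever waiting on another process.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical cached state for one node; not a Chromium type.
struct CachedNode {
  int id;
  bool checked;
  bool focused;
};

// Hypothetical browser-side cache. Queries are answered from local data only.
class AccessibilityCache {
 public:
  // Called whenever a batch of changed nodes arrives from a renderer.
  void ApplyUpdate(const std::vector<CachedNode>& changed_nodes) {
    for (const CachedNode& node : changed_nodes) nodes_[node.id] = node;
  }

  // Returns the cached node, or nullptr if the ID is unknown.
  const CachedNode* Find(int id) const {
    auto it = nodes_.find(id);
    return it == nodes_.end() ? nullptr : &it->second;
  }

  // Synchronous query, as a platform accessibility API would issue it.
  // Precondition: the node is present in the cache.
  bool IsChecked(int id) const {
    const CachedNode* node = Find(id);
    assert(node != nullptr);
    return node->checked;
  }

 private:
  std::unordered_map<int, CachedNode> nodes_;
};
```

Note that `IsChecked` may return a value that is one update behind the renderer; the design accepts that in exchange for never blocking.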
There are some drawbacks to this approach, though:
First, memory usage is higher. The cache necessarily duplicates information that was already stored elsewhere, so this is unavoidable. We mitigate this in part by ensuring that the data structure we use to store each accessibility node is sparse and compact.
Second, the accessibility tree can't be computed lazily. Whenever a web page changes, updates have to be pushed to the browser process cache right away, so that the cache is up-to-date as soon as possible. Now, when a large and complex page loads and this page is immediately consumed by assistive technology, then this approach is no worse - we're just shifting the burden from providing the accessibility tree on-demand to precomputing it, but essentially doing the same work. However, when assistive technology is not actually consuming the changes, this approach can be inefficient. More work is needed to mitigate this performance issue.
One question that comes up is: isn't it a problem that the browser cache is potentially "behind", showing a snapshot of the web page as it existed a moment ago? This is true, but in practice it ends up being insignificant. What's most important is that the cache always represents a complete and consistent snapshot of the accessibility tree. Also, note that the visual representation you see of a web page is also delayed slightly from the "source of truth" in the DOM. A typical graphics frame is calculated every ~17 ms (assuming a display refresh rate of 60 fps), and there's some additional latency between when a graphics frame is computed and when it's actually shown on the screen - so in a sense whenever a web page makes a change using JavaScript, what you see on the screen is usually 10 - 20 ms behind that. If you've ever clicked the mouse at the exact instant the web page scrolled out from under you and you clicked on the wrong thing, you've observed this phenomenon.
Chromium's caching approach to multi-process accessibility has led to several advantages or insights that were not immediately apparent in the initial design:
- The cache can be anywhere; it doesn't have to be in the browser process. On Chrome OS we put the accessibility cache in the process running the assistive technology. On Windows we have explored the idea of making a separate accessibility process for assistive technology to talk to, or possibly pushing the cache to the assistive technology's process.
- It's all data. By design the accessibility cache is just data: it's a serialization of what the accessibility tree looks like at any point in time. This view ends up having a lot of nice advantages, described more below.
One way to think about different ways to architect an accessibility system is to explore what triggers data to move in the system.
Most operating system accessibility APIs are based around a "pull" model. When assistive technology wants information from the accessibility tree, it calls a method on the app to request it. In a single-process browser or in the proxy approach described above, the underlying data behind that node is pulled from the source accessibility tree, which pulls from the underlying data model (the DOM and layout trees).
In contrast, Chrome's "push" model tracks changes that happen to the accessibility tree and then pushes those changes from the web page directly to the accessibility tree cache in the browser process. This incurs some upfront cost when a page changes, but makes access to accessibility APIs very fast.
Most accessibility APIs are based around a functional API, where you override methods in order to answer whatever queries assistive technology has about any particular accessibility object.
This approach is consistent with many common design principles, such as DRY (don't repeat yourself) - that there should be a single source of truth for any piece of information (like whether a control is visible or not), rather than two copies of that information (which could get out of sync). (This is harder to achieve in a multi-process app, though.)
However, the functional approach has its downsides.
To query the current state of an object, you end up calling basically all of its methods. Even in a single-process browser where there's no IPC overhead, every single API call might go through several layers of indirection in order to be satisfied.
For example, suppose the operating system calls the isEnabled() method on a checkbox. The implementation might make a series of method calls that end up querying the DOM or the Layout tree, before returning the result. Subsequent calls to isVisible(), isFocused(), and isChecked() might go through the same series of calls, unless the app specifically cached them temporarily.
In comparison, if you ask that same checkbox to just fill in a simple data structure with its accessibility state, it might be able to quickly compute isEnabled, isVisible, isFocused, and isChecked all at the same time, with no additional layers of indirection needed.
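Here's a small sketch of that comparison. Everything in it is hypothetical (`ElementState`, `CheckboxSource`, and the `lookups` counter are invented for illustration); the counter stands in for the cost of walking through layers of indirection to reach the underlying model.

```cpp
#include <cassert>

// Hypothetical stand-in for state that lives behind layers of indirection.
struct ElementState {
  bool enabled;
  bool visible;
  bool focused;
  bool checked;
};

// Hypothetical checkbox wrapper. lookups() counts trips to the model.
class CheckboxSource {
 public:
  explicit CheckboxSource(ElementState state) : state_(state), lookups_(0) {}

  // Functional style: each query resolves the underlying element again.
  bool IsEnabled() { return Resolve().enabled; }
  bool IsVisible() { return Resolve().visible; }
  bool IsFocused() { return Resolve().focused; }
  bool IsChecked() { return Resolve().checked; }

  // Data style: resolve once and copy every field into the caller's struct.
  void FillState(ElementState* out) {
    assert(out != nullptr);
    *out = Resolve();
  }

  int lookups() const { return lookups_; }

 private:
  const ElementState& Resolve() {
    ++lookups_;
    return state_;
  }

  ElementState state_;
  int lookups_;
};
```

Asking for all four states through the functional methods costs four lookups in this sketch, while `FillState` costs one.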
Another advantage of this approach is that you can take advantage of default values using a sparse data structure - think of a node in the accessibility cache as a key/value store like a hash map. If the default value of isChecked is false, then for any accessible object that isn't currently checked you don't have to do anything. Only if it is checked do you add a "checked" attribute to the map.
Another advantage of the accessibility tree being data is that you can save the complete state of the accessibility tree - either making a copy in memory, or dumping a human-readable text version to a log file. We use this concept extensively in accessibility browser tests.
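As an illustration of dumping a tree to text, here's a minimal sketch. The `NodeData` struct, the `DumpTree` function, and the output format are all assumptions made for this example; Chromium's real test dumps use their own formats.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical pure-data node; field names are illustrative only.
struct NodeData {
  int id;
  std::string role;
  std::vector<int> child_ids;
};

// Appends one line per node, indented two spaces per tree level.
void DumpTree(const std::map<int, NodeData>& nodes, int id, int depth,
              std::string* out) {
  assert(out != nullptr);
  std::map<int, NodeData>::const_iterator it = nodes.find(id);
  if (it == nodes.end()) return;  // Unknown ID: nothing to print.
  out->append(static_cast<std::size_t>(depth) * 2, ' ');
  *out += it->second.role + " id=" + std::to_string(id) + "\n";
  for (int child_id : it->second.child_ids)
    DumpTree(nodes, child_id, depth + 1, out);
}
```

Because the result is plain text, a test can compare it against an expected string without needing a live web page.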
One powerful consequence of this approach is that an accessibility tree doesn't need a backing web page in order to function. It's possible to save an accessibility tree and a series of atomic mutations, and then "replay" them later and get identical results, without any backing web page. Chromium currently has some experimental support for recording changes to a web page in the chrome://accessibility page, and we also take advantage of this snapshotting in order to implement support for the Android "freeze-dried tabs" feature where a frozen snapshot of the page is displayed (with accessibility support) while the real page is being fetched from a slow connection.
In this section we'll cover the data structures used in order to implement the accessibility cache. For the moment, we'll ignore the render processes and how this data is generated, in order to focus on what data is received and how it's used.
The key data structures used throughout accessibility code are found in the ui/accessibility directory.
The underlying data from one node is stored in AXNodeData. A simplified version of this struct is:
struct AXNodeData {
  int id;
  Role role;
  vector<pair<StringAttribute, string>> string_attributes;
  vector<pair<IntAttribute, int>> int_attributes;
  vector<pair<FloatAttribute, float>> float_attributes;
  ...
  vector<int> child_ids;
  AXRelativeBounds relative_bounds;
};
Every node has an ID, but IDs are only required to be unique within the same web frame. The IDs are used to express the tree structure - each node has a vector of its child IDs.
Every node has a role, since that's a fundamental concept in accessibility and every node needs one. Every node also has a bounding box (we'll go into why it's a relative bounding box later). Nearly all of the other attributes are stored as sparse vectors of (attribute type, attribute value) pairs. There are currently over 100 different attributes that Chromium can associate with a single accessible node, but most nodes only have 5 - 10 of them set. Anything unset is treated as having the default value.
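Here's a sketch of how a sparse attribute vector with defaults can be read and written. It's modeled loosely on the simplified struct above, but the enum values, the struct name `SparseNodeData`, the method names, and the default of 0 are assumptions for illustration rather than a description of the real class.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical attribute keys; a real list would be much longer.
enum class IntAttribute { kScrollX, kScrollY, kSetSize };

// Hypothetical pared-down node holding only sparse int attributes.
struct SparseNodeData {
  int id;
  std::vector<std::pair<IntAttribute, int>> int_attributes;

  bool HasIntAttribute(IntAttribute attribute) const {
    for (const auto& entry : int_attributes) {
      if (entry.first == attribute) return true;
    }
    return false;
  }

  // Returns the stored value, or 0 (the assumed default) when unset.
  int GetIntAttribute(IntAttribute attribute) const {
    for (const auto& entry : int_attributes) {
      if (entry.first == attribute) return entry.second;
    }
    return 0;
  }

  // Precondition: the attribute has not been added already.
  void AddIntAttribute(IntAttribute attribute, int value) {
    assert(!HasIntAttribute(attribute));
    int_attributes.push_back(std::make_pair(attribute, value));
  }
};
```

A linear scan is a reasonable choice here because, as noted above, most nodes only carry a handful of attributes.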
An AXTreeUpdate is a pure-data snapshot of an accessibility tree or an update to an existing tree. A slightly simplified version of that struct is:
struct AXTreeUpdate {
  AXTreeData tree_data;
  int root_id;
  vector<AXNodeData> nodes;
};
AXTreeData just has a few attributes that apply to the entire tree rather than just one particular node.
A valid AXTreeUpdate corresponding to a complete accessibility tree just has to contain every node exactly once with no duplicates or redundant nodes. In other words, root_id must contain the ID of one node; that node must have the IDs of its children, and every node must either be the root or the child of some other node. The nodes can come in any order.
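The validity rules for a complete tree can be checked mechanically. Here's a minimal sketch, using hypothetical pared-down structs (`NodeData`, `TreeUpdate`) and a hypothetical function `IsCompleteTree`; it isn't how Chromium validates updates, just one way to express the rules above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical pared-down versions of the structs above.
struct NodeData {
  int id;
  std::vector<int> child_ids;
};
struct TreeUpdate {
  int root_id;
  std::vector<NodeData> nodes;
};

// Returns true if `update` describes exactly one complete tree.
bool IsCompleteTree(const TreeUpdate& update) {
  std::map<int, const NodeData*> by_id;
  for (const NodeData& node : update.nodes) {
    // insert() reports whether the ID was new; a repeat means a duplicate.
    if (!by_id.insert(std::make_pair(node.id, &node)).second) return false;
  }
  if (by_id.count(update.root_id) == 0) return false;

  std::set<int> visited;
  std::vector<int> stack;
  stack.push_back(update.root_id);
  while (!stack.empty()) {
    int id = stack.back();
    stack.pop_back();
    if (by_id.count(id) == 0) return false;        // Child ID without data.
    if (!visited.insert(id).second) return false;  // Two parents or a cycle.
    const NodeData* node = by_id[id];
    assert(node != nullptr);
    for (int child_id : node->child_ids) stack.push_back(child_id);
  }
  // Any node that was not reached from the root is an orphan.
  return visited.size() == update.nodes.size();
}
```

The order of `nodes` doesn't matter to this check, which matches the rule that nodes can come in any order.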
A valid AXTreeUpdate can also represent the changes needed to change an existing accessibility tree. Any node that's unchanged can be completely left out; only nodes that change need to be included. To insert a node, just add its AXNodeData and be sure to update the child_ids in its parent. To remove a node, remove it from its parent's child_ids.
An AXTreeUpdate is stateful - if you're using it to update an existing tree then the code that generated the update needs to understand the state that the tree was in previously.
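Here's a sketch of applying incremental updates to an existing tree. The names (`Tree`, `DeleteSubtree`, `ApplyUpdate`) are hypothetical, and this is a deliberate simplification: it assumes a node never moves from one parent to another within a single update, which a real implementation would need to handle.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical pared-down node: just an ID and its children.
struct NodeData {
  int id;
  std::vector<int> child_ids;
};

// Hypothetical flat tree store: node ID -> latest data for that node.
typedef std::map<int, NodeData> Tree;

// Removes `id` and all of its descendants from the tree.
void DeleteSubtree(Tree* tree, int id) {
  Tree::iterator it = tree->find(id);
  if (it == tree->end()) return;
  std::vector<int> children = it->second.child_ids;
  tree->erase(it);
  for (int child_id : children) DeleteSubtree(tree, child_id);
}

// Applies changed nodes to `tree`. Returns false, leaving the tree
// untouched, if a node lists a child that is neither already in the
// tree nor included in the update.
bool ApplyUpdate(Tree* tree, const std::vector<NodeData>& changed) {
  assert(tree != nullptr);
  std::set<int> incoming;
  for (const NodeData& node : changed) incoming.insert(node.id);
  for (const NodeData& node : changed) {
    for (int child_id : node.child_ids) {
      if (tree->count(child_id) == 0 && incoming.count(child_id) == 0)
        return false;
    }
  }
  for (const NodeData& node : changed) {
    Tree::iterator old = tree->find(node.id);
    if (old != tree->end()) {
      // A child dropped from child_ids is removed along with its subtree.
      std::set<int> kept(node.child_ids.begin(), node.child_ids.end());
      std::vector<int> old_children = old->second.child_ids;
      for (int child_id : old_children) {
        if (kept.count(child_id) == 0) DeleteSubtree(tree, child_id);
      }
    }
    (*tree)[node.id] = node;
  }
  return true;
}
```

The statefulness shows up in the validation step: whether an update is acceptable depends on what the tree already contains.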
There are two classes that represent a live node and a live tree. Here are the important details:
class AXNode {
 public:
  ...
  const AXNodeData& data();
  AXNode* GetParent() const;
  vector<AXNode*>& GetAllChildren() const;
  ...
};
class AXTree {
 public:
  AXTree();
  explicit AXTree(const AXTreeUpdate& initial_state);
  bool Unserialize(const AXTreeUpdate& update);
  AXNode* root() const;
  AXNode* GetFromId(int id) const;
  ...
};
You can construct an AXTree from an AXTreeUpdate directly, or create an empty tree first. Then call Unserialize to take an AXTreeUpdate and apply those changes to the current tree. It will return false if the AXTreeUpdate is malformed or can't be applied to the current tree.
Once you have an AXTree, you can get its root node, or look up nodes by ID. Each node is just a thin wrapper around its underlying AXNodeData plus some convenience functions to walk the tree.
AXTree is essentially the underlying data model used to implement accessibility in the browser process. As described in Part 1, there's a platform-specific accessibility object for each node in the tree. When a platform accessibility API is called on one of those nodes, it uses the corresponding AXNode's AXNodeData to get the details of the node to satisfy that query.
Nearly all accessibility APIs can be satisfied directly from AXNodeData: for example a node's name, role, state, value, bounding box, allowed actions, and relationships with other nodes. There are a few exceptions that will be covered in Part 3.
BrowserAccessibilityManager is a layer on top of AXTree. It hooks up the cross-platform accessibility data structures to the platform-specific tree of native accessibility objects. That will be discussed in more detail in Part 3.
The other half of the puzzle here is what happens on the renderer side. Given a web page that's constantly changing, how do we serialize a representation of the accessibility tree and send small, atomic updates to the browser process to keep the cache in sync?
See above for a reminder of Blink and how it fits into the render process. Render accessibility consists of two main pieces:
- A tree of lightweight accessibility nodes inside Blink that represent the current state of the accessibility tree
- Code outside Blink that keeps track of nodes that need to be updated, and periodically serializes updates to send to the browser process.
Inside Blink, we use the following classes.
AXObject is the base class representing one node in the accessibility tree. Each AXObject is a wrapper: it wraps either a DOM Node (a blink::Node) or a Blink layout object (a blink::LayoutObject). The AXObject contains very little state; it caches a few attributes that are expensive to recompute but otherwise doesn't store its full serialization.
AXObjectCache is the class that represents all of the AXObjects for one web page. It's owned by the Document class, only if accessibility is enabled.
AXObjects are built lazily, on-demand. When an AXObject is added or deleted, its parent marks its list of child objects as dirty so that the next time it's queried it knows to compute them.
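The dirty-flag pattern described here can be sketched as follows. `LazyNode` is a hypothetical class invented for this example, and the plain `std::vector<int>` it reads from is a stand-in for the underlying DOM or layout tree.

```cpp
#include <cassert>
#include <vector>

// Hypothetical wrapper node that recomputes its children only when asked.
class LazyNode {
 public:
  // `source_children` stands in for the underlying tree and is not owned.
  explicit LazyNode(const std::vector<int>* source_children)
      : source_children_(source_children),
        children_dirty_(true),
        recompute_count_(0) {
    assert(source_children != nullptr);
  }

  // Called when a child is added or removed in the underlying tree.
  void MarkChildrenDirty() { children_dirty_ = true; }

  // Recomputes the child list only if it was marked dirty since last time.
  const std::vector<int>& Children() {
    if (children_dirty_) {
      cached_children_ = *source_children_;
      children_dirty_ = false;
      ++recompute_count_;
    }
    return cached_children_;
  }

  int recompute_count() const { return recompute_count_; }

 private:
  const std::vector<int>* source_children_;
  std::vector<int> cached_children_;
  bool children_dirty_;
  int recompute_count_;
};
```

Marking a node dirty several times in a row costs almost nothing in this sketch; the recomputation happens once, on the next query.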
The vast majority of the web-specific accessibility logic is in AXObject and AXObjectCache. Besides support for ARIA attributes and getting accessibility information from DOM elements, this is where you'll find the code that implements the ACCNAME spec, computing the accessible name and description for any HTML element by checking aria-labelledby, aria-label, and other relevant attributes.
In addition, the code in AXObject and AXObjectCache builds the structure of the accessibility tree, especially in cases where it differs from the DOM tree, such as CSS generated content, aria-hidden, or aria-owns.
When the accessibility tree changes, AXObjectCache sends an event notification to the render accessibility code outside of Blink indicating that a node has changed and needs to be re-serialized.
Blink does not currently have listener interfaces for all of the changes that accessibility cares about. Rather, Blink code is generally specifically instrumented for accessibility. So when a new DOM node is inserted, or when the value of a form control changes, you'll see explicit code in Blink to update the AXObjectCache.
In earlier versions, Blink accessibility code posted a lot of specific event notifications indicating exactly what changed, for example NameChanged, ValueChanged, StateChanged, and ChildrenChanged, but over time we've moved away from that model. Now we usually just mark a node as dirty. In some places the event notification is still there for historical reasons, but it's never consumed and only marking the node as dirty ends up mattering.
The reason was that we started adding code to automatically generate events from tree mutations, which helped avoid entire classes of problems like duplicate events.
So in a nutshell, Blink is just responsible for notifying which nodes have changed. It's more efficient to just re-serialize nodes that probably changed (occasionally doing a bit of extra work) than it is to write very careful logic to only update nodes that actually changed.
The main render process accessibility code outside of Blink is in RenderAccessibilityImpl. That code maintains the connection with the browser accessibility code and handles serializing updates to the accessibility tree.
Updates to the accessibility tree are always batched.
One important reason is that the accessibility tree can only be serialized when the document's lifecycle state is clean. In a nutshell, every time a change happens to a web page that could conceivably affect how it appears on-screen, the document is dirty until Blink has had a chance to do CSS style resolution and layout. Accessibility code will fail assertions if you try to query certain properties when the document is in a dirty state, because it could lead to inconsistent results or even crashes.
So, accessibility changes are always queued up and then sent periodically only after first ensuring that layout is complete.
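A minimal sketch of that queueing behavior is shown here. `UpdateQueue` is a hypothetical class, and the `layout_clean` flag is a stand-in for the document lifecycle check; the real code is driven by Blink's lifecycle rather than a boolean parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical queue of node IDs waiting to be serialized.
class UpdateQueue {
 public:
  // Marking the same node twice still yields one entry in the batch.
  void MarkDirty(int node_id) { dirty_ids_.insert(node_id); }

  std::size_t pending() const { return dirty_ids_.size(); }

  // Moves the queued IDs into `batch` only if layout is clean. Otherwise
  // returns false and leaves the queue untouched for a later attempt.
  bool TakeBatchIfClean(bool layout_clean, std::vector<int>* batch) {
    assert(batch != nullptr);
    if (!layout_clean) return false;
    batch->assign(dirty_ids_.begin(), dirty_ids_.end());
    dirty_ids_.clear();
    return true;
  }

 private:
  std::set<int> dirty_ids_;
};
```

Using a set means repeated changes to one node between two sends collapse into a single serialization of that node.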
When it is time to send accessibility updates, we make use of an abstraction called AXTreeSerializer. AXTreeSerializer is a class that knows how to walk a tree of nodes and generate valid AXTreeUpdates that incrementally update a remote AXTree.
AXTreeSerializer is designed so that it doesn't know anything about Blink and doesn't interpret any accessibility logic; it just knows how to work with the AXNodeData and AXTreeUpdate data structures. In fact, we're using AXTreeSerializer for other accessibility trees in Chromium outside of Blink, and we have extensive unit tests for AXTreeSerializer that serialize from one AXTree into another AXTree to test the logic in isolation.
AXTreeSerializer is stateful; it keeps track of what nodes have been sent to its counterpart. When walking the tree, if it encounters a node that it hasn't serialized before, it automatically serializes it. If it encounters a node that was previously serialized and wasn't marked as dirty, it automatically skips it.
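The bookkeeping described above can be sketched like this. `SketchSerializer` is a hypothetical class, not AXTreeSerializer itself, and it simplifies things by always walking from the node it's given and by assuming the caller marks a parent dirty whenever its list of children changes.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical pared-down node: just an ID and its children.
struct NodeData {
  int id;
  std::vector<int> child_ids;
};

// Hypothetical serializer that remembers which node IDs the other side
// has already received.
class SketchSerializer {
 public:
  void MarkDirty(int id) { dirty_.insert(id); }

  // Walks the subtree rooted at `id` in pre-order and appends every node
  // that is either new to the other side or marked dirty.
  void Serialize(const std::map<int, NodeData>& source, int id,
                 std::vector<NodeData>* out) {
    assert(out != nullptr);
    std::map<int, NodeData>::const_iterator it = source.find(id);
    if (it == source.end()) return;
    bool is_new = sent_.insert(id).second;
    bool is_dirty = dirty_.erase(id) > 0;
    if (is_new || is_dirty) out->push_back(it->second);
    for (int child_id : it->second.child_ids)
      Serialize(source, child_id, out);
  }

 private:
  std::set<int> sent_;   // IDs already sent at least once.
  std::set<int> dirty_;  // IDs that changed since they were last sent.
};
```

Serializing an unchanged tree a second time produces an empty update in this sketch, which is the property that keeps updates small.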
AXTreeSerializer uses an interface called AXTreeSource to enable it to walk any tree-like object without being tightly coupled to Blink or any other specific tree. We use an implementation BlinkAXTreeSource that maps all of the tree-walking and serialization calls into calls to Blink's AXObject class.
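To show what decoupling through a tree-source interface buys, here's a sketch. The interface `TreeSource`, the walker `CollectIds`, and the array-backed `ArrayTreeSource` are all hypothetical; the method names are assumptions and aren't meant to match AXTreeSource's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tree-walking interface, templated on the node handle type.
template <typename Node>
class TreeSource {
 public:
  virtual ~TreeSource() {}
  virtual Node GetRoot() const = 0;
  virtual int GetId(Node node) const = 0;
  virtual std::vector<Node> GetChildren(Node node) const = 0;
};

// Generic walker: collects IDs in pre-order without knowing the tree type.
template <typename Node>
void CollectIds(const TreeSource<Node>& source, Node node,
                std::vector<int>* out) {
  assert(out != nullptr);
  out->push_back(source.GetId(node));
  for (Node child : source.GetChildren(node)) CollectIds(source, child, out);
}

// One possible backing tree: parents[i] is the index of node i's parent,
// and node 0 is the root. The node handle is simply an index.
class ArrayTreeSource : public TreeSource<int> {
 public:
  explicit ArrayTreeSource(const std::vector<int>& parents)
      : parents_(parents) {}

  int GetRoot() const override { return 0; }
  int GetId(int node) const override { return node + 100; }
  std::vector<int> GetChildren(int node) const override {
    std::vector<int> result;
    for (std::size_t i = 1; i < parents_.size(); ++i) {
      if (parents_[i] == node) result.push_back(static_cast<int>(i));
    }
    return result;
  }

 private:
  std::vector<int> parents_;
};
```

`CollectIds` never mentions `ArrayTreeSource`, so the same walker would work with any other implementation of the interface, which is the kind of reuse described above.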
This figure shows the overall system diagram covered so far:
In the next section we'll explore some of the details that were glossed over, including:
- Abstracting platform-specific APIs
- Relative coordinates
- Text bounding boxes
- Hit testing
- Views and other non-web custom-drawn UI