@aheejin
Member

@aheejin aheejin commented Sep 15, 2025

This adds a script, `tools/empath-split.py`, a wrapper for Binaryen's `wasm-split`. `wasm-split` has a `--multi-split` mode, which takes a manifest file that lists the names of the functions in each module. (Example:
https://github.com/WebAssembly/binaryen/blob/main/test/lit/wasm-split/multi-split.wast.manifest)

Listing every function that belongs to each module by hand is tedious. `empath-split` takes a wasm file and a text file containing a list of paths, which can be either directories or functions, uses the source map information to generate the manifest file, and then runs `wasm-split`.
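
As a rough illustration of the grouping step described above, here is a minimal sketch. It is not the code in `tools/empath-split.py`: the function name `build_manifest`, the `func_to_source` input (standing in for whatever is derived from the source map), the use of each path as the module name, and the exact manifest layout are all assumptions, so check the linked manifest example before relying on the output format.

```python
from collections import defaultdict


def build_manifest(func_to_source, paths):
  """Group functions into modules by the first listed path that matches.

  func_to_source: hypothetical dict of function name -> source file path.
  paths: list of directory or file paths, one per output module.
  """
  modules = defaultdict(list)
  for func, source in sorted(func_to_source.items()):
    for path in paths:
      # Match an exact file path or a directory prefix.
      if source == path or source.startswith(path.rstrip('/') + '/'):
        modules[path].append(func)
        break
  # Assumed layout: a "name:" header, then one function per line,
  # with a blank line between modules.
  blocks = []
  for path in paths:
    if modules[path]:
      blocks.append(path + ':\n' + '\n'.join(modules[path]))
  return '\n\n'.join(blocks) + '\n'
```

Functions whose source file matches none of the listed paths are left out of the manifest in this sketch.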

This also makes a small drive-by fix for `emsymbolizer`. Currently, when it is given address 0, it returns the location info associated with `offsets[-1]`, which is the largest offset. This PR fixes that and adds an optional `lower_bound` argument to `find_offset`, so that when we look up a source info entry we don't go below the start offset of the current function.
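
To make the failure mode concrete, here is a minimal sketch of a lookup with the behavior described above. It is not the code in the PR: it assumes `offsets` is a sorted list of integers and uses `bisect`, and only the names `find_offset` and `lower_bound` come from the description.

```python
import bisect


def find_offset(offsets, address, lower_bound=None):
  """Return the largest offset <= address, or None if there is none.

  offsets must be sorted in ascending order.
  """
  idx = bisect.bisect_right(offsets, address)
  if idx == 0:
    # No entry at or below the address. Indexing offsets[idx - 1] here
    # would wrap around to offsets[-1], the largest offset.
    return None
  offset = offsets[idx - 1]
  # Reject an entry that starts before the lower bound, for example one
  # that belongs to the previous function.
  if lower_bound is not None and offset < lower_bound:
    return None
  return offset
```

With `offsets = [10, 20, 30]`, looking up address 0 yields `None` instead of the entry for 30, and looking up 25 with `lower_bound=22` also yields `None` because the nearest entry (20) starts before the bound.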

@aheejin aheejin requested review from dschuff and tlively September 15, 2025 17:08
return None
# If lower bound is given, return the offset only if the offset is equal to
# or greater than the lower bound
if lower_bound:
Member

Given that there's only one caller of this (and of lookup) and we don't anticipate any different use cases, maybe we should just simplify this by requiring lower_bound.

Member Author

Another place is here:

return sm.lookup(address)

What do we give for lower_bound? It doesn't have the current function offset.

Member

Oh yeah, ok.
Maybe for the original "just symbolize a random address" emsymbolizer use case we can eventually do better than we are now (e.g. give some kind of warning if we end up finding a location that corresponds to a different function from the given address, because odds are good it's not what the user actually wanted). But that doesn't have to be for this PR.

assert module.read_string() == 'sourceMappingURL'
# TODO: support stripping/replacing a prefix from the URL
URL = module.read_string()
URL = module.get_sourceMappingURL()
Member

no need to add to this PR if things are working for you, but last time I tried to actually use emsymbolizer, I had to add something like

if not os.path.isfile(URL):
  URL = os.path.join(os.path.dirname(module.filename), URL)

probably because I was using relative paths everywhere.
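
A self-contained version of that fallback might look like the sketch below. The helper name `resolve_source_map_path` and its parameters are hypothetical and not part of emsymbolizer; it assumes the URL is a plain file path rather than an `http(s)` URL.

```python
import os


def resolve_source_map_path(wasm_path, url):
  """Return url if it names an existing file; otherwise interpret it
  relative to the directory that contains the wasm file."""
  if os.path.isfile(url):
    return url
  return os.path.join(os.path.dirname(wasm_path), url)
```

This only matters when the tool is run from a directory other than the one containing the wasm file, which would fit the relative-path situation mentioned above.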

Member Author

@aheejin aheejin Sep 15, 2025

This doesn't change anything for emsymbolizer (I just moved sourceMappingURL-getting code from emsymbolizer.py to webassembly.py) and there was no os.path.join(os.path.dirname, ...) in emsymbolizer.py. Where am I supposed to add it?

Member

Right I had to add it locally (I added it right here in emsymbolizer because that's where the code was until now). Again, this was just an FYI. Maybe I'll just try to reproduce the behavior and add a proper test.

Member Author

Hmm, emsymbolizer has worked for me with no change so far. Yeah, please let me know if you find the condition under which it becomes a problem.

@sbc100
Collaborator

sbc100 commented Sep 15, 2025

Are we sure empath-split is the best name for this tool? Are we free to change the name after this test lands?

@aheejin
Member Author

aheejin commented Sep 15, 2025

> Are we sure empath-split is the best name for this tool? Are we free to change the name after this test lands?

I'm all for different suggestions. What do you prefer? I started with path-split, and then noticed that all scripts meant to be used by outside users had the prefix em, hence empath-split. Not that I particularly like the name.

And yeah, I think we can change the tool name even after landing, because this is currently an experimental tool for a few partners to try out, and I don't intend to broadcast it to all users just yet.

@aheejin aheejin merged commit d801296 into emscripten-core:main Sep 22, 2025
32 checks passed
@aheejin aheejin deleted the path_split branch September 22, 2025 22:17
@dschuff
Member

dschuff commented Oct 16, 2025

@aheejin it just occurred to me that this functionality (in whatever form it ultimately gets integrated into emcc) should probably allow having multiple path specifications per module, rather than just one module per file or directory. I'd say let's figure out the JS glue and the integration into emcc first though.

@aheejin
Member Author

aheejin commented Oct 17, 2025

Done in #25577.
