infra: perf flame graph #5995

Open · wants to merge 14 commits into master
Conversation

@catenacyber (Contributor)

cc @jonathanmetzman @inferno-chromium

This allows generating a perf flame graph along with the coverage reports for C/C++/Rust.

Tested with

python infra/helper.py build_fuzzers --sanitizer coverage suricata
python3 infra/helper.py build_image --no-pull base-runner
python3 infra/helper.py coverage --fuzz-target fuzz_sigpcap_aware --corpus-dir /path/to/corpus-libFuzzer-suricata_fuzz_sigpcap_aware-latest/ suricata

after having downloaded the corpus directory

Result is available here https://catenacyber.fr/perf-fuzz_sigpcap_aware.svg

In this case, the flame graph shows, for instance, that we spend 17% of the time in JsonAnomalyLogger.
Even if this function may have bugs of its own, that is clearly far too much time not spent exercising anything else...
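
For context, the SVG is produced from the recorded perf data with the FlameGraph scripts this PR clones into $PERF_DIR. A minimal sketch of that pipeline, assuming the $DUMPS_DIR/$target.perf file name used in the coverage script below (the output file name is an assumption based on the URL above):

# collapse perf samples into folded stacks, then render the interactive SVG
perf script -i $DUMPS_DIR/$target.perf \
  | $PERF_DIR/stackcollapse-perf.pl \
  | $PERF_DIR/flamegraph.pl > $OUT/perf-$target.svg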

@catenacyber (Contributor, Author)

This works locally, where uname -r gives 4.15.0-142-generic, but fails in CI, where uname -r gives 5.8.0-1036-azure.
Any idea how to solve this?

@catenacyber (Contributor, Author)

Interesting: the fuzz target spends quite a bit of time resetting/clearing hash tables, which is not what they are optimized for, but which is needed for the fuzz target to be stateless...

@jonathanmetzman (Contributor) left a comment

Nice. Thanks!

cd linux-stable/tools/perf/
apt-get install -y flex bison make
# clang finds errors such as tautological-bitwise-compare
CC=gcc DESTDIR=/usr/ make install
Contributor:

Please uninstall flex and bison. (There's a bit of an issue with my assumption that we can just uninstall these packages: what if they were installed by a previous step because a later step needs them? They would no longer be available.)

Contributor:

Please delete the source checkout and the non-installed build.

Contributor Author:

Done...
I am not sure about flex and bison; I guess a later step should reinstall them if needed.

Contributor:

They aren't already installed so they should be safe to remove here.
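
A minimal sketch of the cleanup requested in this thread, appended after the make install above (assuming the build runs from the parent directory of the linux-stable checkout, and that flex and bison are not needed by any later step):

# remove packages that were only installed to build perf
apt-get remove -y flex bison
# delete the source checkout, including the non-installed build artifacts
rm -rf linux-stable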

@jonathanmetzman (Contributor) left a comment

I might have something different in mind for profiling than just profiling on the corpus (which I'm not sure illustrates very useful info), like using this for timeouts on ClusterFuzz, for example.

@@ -82,7 +82,7 @@ function run_fuzz_target {
   local args="-merge=1 -timeout=100 -close_fd_mask=3 $corpus_dummy $corpus_real"
 
   export LLVM_PROFILE_FILE=$profraw_file
-  timeout $TIMEOUT $OUT/$target $args &> $LOGS_DIR/$target.log
+  timeout $TIMEOUT perf record -g -o $DUMPS_DIR/$target.perf $OUT/$target $args &> $LOGS_DIR/$target.log
Contributor:

I'm not sure we should be doing this in the coverage script. It seems different from coverage, IMO.

Contributor Author:

This would surely be useful for timeouts :-)

I also think profiling on the corpus still shows relevant information, since it covers all kinds of inputs...

I did it in the coverage script, as it was easy : get both reports in one shot...

Let me know what you prefer

Contributor:

Profiling on the corpus seems like it will produce a bit of a skewed result because the corpus is a subset of the inputs that the target runs on during fuzzing. Let's say there is an edge that is hit very frequently during fuzzing and takes a really long time, but only one file in the corpus covers it. The profiling won't show that fuzzing is spending a ton of time on this edge.
@oliverchang What do you think about this?

Contributor:

Another reason this is problematic: in this run the target is built with source-based coverage instrumentation and dumps coverage results. So not only will profiling include time wasted on that, it also will not be able to accurately show the amount of time spent on sancov instrumentation, which can itself be useful to know.

Contributor Author:

Indeed, so what is the best way to run it?
We could run it for some fixed amount of time with classic ASAN instrumentation.
But then, should it be a new subcommand of infra/helper.py, or an option to the run_fuzzer subcommand?

Collaborator:

How useful is sancov info? It seems like pretty niche information in most cases?

That said, it does seem like something that could be built more into ClusterFuzz instead, as we may want this data for timeouts, and there's the potential corpus-vs-actual-fuzzing skew issue. Not requiring a specialised build would also make it easier.

I think this is fine to include with coverage as is for now though, so we can evaluate this before devoting more time to properly integrate this into ClusterFuzz. How long do these runs typically take?


@@ -105,6 +105,11 @@ RUN wget https://repo1.maven.org/maven2/org/jacoco/org.jacoco.cli/0.8.7/org.jaco
     echo "37df187b76888101ecd745282e9cd1ad4ea508d6 /opt/jacoco-agent.jar" | shasum --check && \
     echo "c1814e7bba5fd8786224b09b43c84fd6156db690 /opt/jacoco-cli.jar" | shasum --check
 
+ENV PERF_DIR=/root/perf
+RUN git clone --depth 1 https://github.com/brendangregg/FlameGraph $PERF_DIR
Collaborator:

Can we pin this by checking out a git hash?

Contributor Author:

Done. Why do we need this?

@oliverchang (Collaborator) · Jul 12, 2021

It's good to pin dependencies like these going forward for more reproducibility and to avoid potential supply chain compromises.
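
A pinned checkout might look like the following sketch (the commit hash is a placeholder, not the one actually used in this PR; --depth 1 is dropped so that an arbitrary commit can be checked out):

ENV PERF_DIR=/root/perf
# Pin FlameGraph to a known commit for reproducibility and supply-chain safety.
RUN git clone https://github.com/brendangregg/FlameGraph $PERF_DIR && \
    git -C $PERF_DIR checkout <pinned-commit-hash>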


@catenacyber (Contributor, Author)

How useful is sancov info? It seems like pretty niche information in most cases?

In my main test case, suricata's fuzz_sigpcap_aware, __sanitizer_cov_trace_cmp8 accounts for a significant share of the time spent resetting hash tables between inputs: around 3% of the total runtime.

How long do these runs typically take?

In my main test case, suricata's fuzz_sigpcap_aware, with a corpus of over 10k elements, it takes about 10 seconds.

@jonathanmetzman (Contributor)

Interesting: the fuzz target spends quite a bit of time resetting/clearing hash tables, which is not what they are optimized for, but which is needed for the fuzz target to be stateless...

Which hash tables are you talking about?

@jonathanmetzman (Contributor)

This works locally, where uname -r gives 4.15.0-142-generic, but fails in CI, where uname -r gives 5.8.0-1036-azure.
Any idea how to solve this?

Not sure, to be honest. And I'm wondering if this is going to be a problem on Google Cloud (i.e. production), since I would bet the kernels there are also non-standard. Googling "install perf on azure" doesn't come up with much that is helpful: https://www.google.com/search?q=install+perf+on+azure+linux+linux-tools+-perfinsights

@catenacyber (Contributor, Author)

Which hash tables are you talking about?

The ones in the software being fuzzed (suricata)

And I'm wondering if this is going to be a problem on google cloud (i.e. production) since I would bet the kernels there are also non-standard.

At least I solved the CI problem with install_perf.sh, which tries multiple alternative ways to install perf.
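
install_perf.sh itself is not shown here; a rough sketch of what such a fallback chain can look like on Ubuntu-based images (build_perf_from_source is a hypothetical helper wrapping the source build shown earlier, not a real script):

#!/bin/bash
# Try the perf build matching the running kernel, then a generic package,
# then fall back to building perf from the kernel sources.
apt-get install -y linux-tools-"$(uname -r)" || \
  apt-get install -y linux-tools-generic || \
  build_perf_from_source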

@jonathanmetzman (Contributor)

Which hash tables are you talking about?

The ones in the software being fuzzed (suricata)

You mean the ones reset by __sanitizer_cov_trace_cmp8 right?

@catenacyber (Contributor, Author)

The ones in the software being fuzzed (suricata)
You mean the ones reset by __sanitizer_cov_trace_cmp8, right?

I do not think so; I mean the hash tables from util-hashlist.c in Suricata.
So that part is only relevant to the Suricata project; it is not generic.
The generic point is that profiling helps find spots where we can improve the fuzzer's speed.

@jonathanmetzman (Contributor)

The ones in the software being fuzzed (suricata)
You mean the ones reset by __sanitizer_cov_trace_cmp8, right?

I do not think so; I mean the hash tables from util-hashlist.c in Suricata.
So that part is only relevant to the Suricata project; it is not generic.
The generic point is that profiling helps find spots where we can improve the fuzzer's speed.

Ah got it.

@catenacyber (Contributor, Author)

Other UIs are also available for getting flame graphs; cf. https://www.markhansen.co.nz/profiler-uis/

@DavidKorczynski (Collaborator)

Other UIs are also available for getting flame graphs; cf. https://www.markhansen.co.nz/profiler-uis/

FYI, a few months ago we used Prodfiler to debug some of the Envoy fuzzers (see page 10 of the report https://github.com/envoyproxy/envoy/blob/main/docs/security/audit_fuzzer_adalogics_2021.pdf), and the flame graphs were very useful for debugging performance issues in the fuzzers.

@catenacyber (Contributor, Author)

Thanks @DavidKorczynski for this interesting read, which I had missed.

Does this mean I should drop the | grep LLVMFuzzerTestOneInput, so that the perf report also covers code outside the fuzz target?

Side note for me later: we can try catenacyber@0ec259f to see if we get a difference in the perf report.
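
For context, the grep in question filters the collapsed stacks so that only samples passing through the fuzz target remain; dropping it would also keep stacks that never reach LLVMFuzzerTestOneInput (e.g. libFuzzer's own work). A sketch of the filtered pipeline, using the same assumed file names as above:

perf script -i $DUMPS_DIR/$target.perf \
  | $PERF_DIR/stackcollapse-perf.pl \
  | grep LLVMFuzzerTestOneInput \
  | $PERF_DIR/flamegraph.pl > $OUT/perf-$target.svg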

@catenacyber (Contributor, Author)

Friendly ping on this.
I rebased it and made some needed changes to keep it working.

@jonathanmetzman would you want to use perf for other use cases?

@DavidKorczynski (Collaborator)

I think it would be really nice to get this added and add the flame graphs to Fuzz Introspector

@catenacyber (Contributor, Author)

I think it would be really nice to get this added and add the flame graphs to Fuzz Introspector

Thanks @DavidKorczynski. I still want to get this in somehow ;-)

What would it take to get it into Fuzz Introspector?
Does Fuzz Introspector support Rust? (perf works for it)

@DavidKorczynski (Collaborator)

I think the easiest approach from Fuzz Introspector's perspective would be to accept the .svg file, and perhaps a textual representation of what the flame graph represents, rather than having Fuzz Introspector run the flame graph generation itself.

@catenacyber (Contributor, Author)

@jonathanmetzman how would you like to proceed on this?
