[C++20][Modules] The search complexity for header file info is too expensive with a lot of modules. #140860

mpark · 2025-05-21T07:16:17Z

Currently, for each header file #include or import, there is a look-up of this header file on each of the loaded PCMs. This results in a search complexity of O(N*M) where N is the # of includes, and M is the # of loaded PCMs. For large numbers, e.g. N = 10,000 and M = 1,000, this becomes very costly.

The synthetic test case I put together takes 14s user time, compared to 0.06s without modules.
In a real world scenario, the difference for us was 18s with modules and 3s without modules.

This problem was encountered previously inside ASTWriter (see #87848). It was solved there by only considering the local information rather than making the query out to ASTReader. I'm not sure that such a solution can be used in general. (CC @jansvoboda11)

Test Case

I wrote a bash script to generate the test case. It produces 1,000 empty header units, and 10,000 empty header files. The main file imports the 1,000 header units, then #includes 10,000 header files. We can measure the preprocessing time on this file. On my machine, this takes 14s, compared to 0.06s without modules.

The script for the synthetic repro:

#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-1000}
NUM_HEADERS=${NUM_HEADERS:-10000}

mkdir build && cd build

for i in $(seq 1 $NUM_MODULES); do > m${i}.h; done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 -fmodule-header m{}.h"

for i in $(seq 1 $NUM_HEADERS); do > h${i}.h; done

> a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" >> a.cpp; done
for i in $(seq 1 $NUM_HEADERS); do echo "#include \"h${i}.h\"" >> a.cpp; done
echo "int main() {}" >> a.cpp

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
    $(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; done)

Note: I am aware that -fmodule-file=<pcm> eagerly loads modules and that it's not encouraged. In reality we actually use the -fmodule-file=<name>=<pcm> form with header units.

Code Path

We more or less start here:

  // Do a standard file entry lookup.
  OptionalFileEntryRef FE = HeaderInfo.LookupFile(...);

and proceed through the following:

It->LookupFile(...);: It is an iterator over the search directories.
HS.findUsableModuleForHeader(...);
suggestModule(...);
HS.findModuleForHeader(...);

Here we encounter this:

ModuleMap::KnownHeader
HeaderSearch::findModuleForHeader(FileEntryRef File, bool AllowTextual,
                                  bool AllowExcluded) const {
  if (ExternalSource) {
    // Make sure the external source has handled header info about this file,
    // which includes whether the file is part of a module.
    (void)getExistingFileInfo(File);
  }
  return ModMap.findModuleForHeader(File, AllowTextual, AllowExcluded);
}

where there is an invocation to (void)getExistingFileInfo(File);, which calls ExternalSource->GetHeaderFileInfo(FE);.

We finally arrive at:

HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
  HeaderFileInfoVisitor Visitor(FE);
  ModuleMgr.visit(Visitor);
  if (std::optional<HeaderFileInfo> HFI = Visitor.getHeaderFileInfo())
      return *HFI;

  return HeaderFileInfo();
}

where ModuleMgr.visit(Visitor); will visit every currently loaded PCM.

Every PCM file contains a HEADER_SEARCH_TABLE record, which is an on-disk hash table. For header units, it has at least one entry in that table which is the header file itself. For named modules, we shouldn't need this, but it's still present today (see #78495). This means that every header unit has a HEADER_SEARCH_TABLE with at least one entry in it.

The text was updated successfully, but these errors were encountered:

llvmbot · 2025-05-21T07:21:44Z

@llvm/issue-subscribers-clang-modules

Author: Michael Park (mpark)

Currently, for each header file `#include` or `import`, there is a look-up of this header file on each of the loaded PCMs. This results in a search complexity of `O(N*M)` where `N` is the # of includes, and `M` is the # of loaded PCMs. For large numbers, e.g. `N` = 10,000 and `M` = 1,000, this becomes very costly.

This problem was encountered previously inside ASTWriter (see #87848). It was solved there by only considering the local information rather than making the query out to ASTReader. I'm not sure that such a solution can be used in general. (CC @jansvoboda11)

Test Case

I wrote a bash script to generate the test case. It produces 1,000 empty header units, and 10,000 empty header files. The main file imports the 1,000 header units, then #includes 10,000 header files. We can measure the preprocessing time on this file. On my machine, it takes about 14s user time, compared to 0.06s without modules. In a real world scenario, the difference for us was 18s with modules and 3s without modules.

The script for the synthetic repro:

#/usr/bin/env bash

CLANG=${CLANG:-clang++}
NUM_MODULES=${NUM_MODULES:-1000}
NUM_HEADERS=${NUM_HEADERS:-10000}

mkdir build &amp;&amp; cd build

for i in $(seq 1 $NUM_MODULES); do &gt; m${i}.h; done
seq 1 $NUM_MODULES | xargs -I {} -P $(nproc) bash -c "$CLANG -std=c++20 -fmodule-header m{}.h"

for i in $(seq 1 $NUM_HEADERS); do &gt; h${i}.h; done

&gt; a.cpp
for i in $(seq 1 $NUM_MODULES); do echo "import \"m${i}.h\";" &gt;&gt; a.cpp; done
for i in $(seq 1 $NUM_HEADERS); do echo "#include \"h${i}.h\"" &gt;&gt; a.cpp; done
echo "int main() {}" &gt;&gt; a.cpp

time $CLANG -std=c++20 -Wno-experimental-header-units -E a.cpp -o /dev/null \
    $(for i in $(seq 1 $NUM_MODULES); do echo -n "-fmodule-file=m${i}.pcm "; done)

> Note: I am aware that -fmodule-file=<pcm> eagerly loads modules and that it's not encouraged. In reality we actually use the -fmodule-file=<name>=<pcm> form with header units.

Code Path

We more or less start here:

  // Do a standard file entry lookup.
  OptionalFileEntryRef FE = HeaderInfo.LookupFile(...);

and proceed through the following:

It->LookupFile(...);: It is an iterator over the search directories.
HS.findUsableModuleForHeader(...);
suggestModule(...);
HS.findModuleForHeader(...);

Here we encounter this:

ModuleMap::KnownHeader
HeaderSearch::findModuleForHeader(FileEntryRef File, bool AllowTextual,
                                  bool AllowExcluded) const {
  if (ExternalSource) {
    // Make sure the external source has handled header info about this file,
    // which includes whether the file is part of a module.
    (void)getExistingFileInfo(File);
  }
  return ModMap.findModuleForHeader(File, AllowTextual, AllowExcluded);
}

where there is an invocation to (void)getExistingFileInfo(File);, which calls ExternalSource->GetHeaderFileInfo(FE);.

We finally arrive at:

HeaderFileInfo ASTReader::GetHeaderFileInfo(FileEntryRef FE) {
  HeaderFileInfoVisitor Visitor(FE);
  ModuleMgr.visit(Visitor);
  if (std::optional&lt;HeaderFileInfo&gt; HFI = Visitor.getHeaderFileInfo())
      return *HFI;

  return HeaderFileInfo();
}

where ModuleMgr.visit(Visitor); will visit every currently loaded PCM.

Every PCM file contains a HEADER_SEARCH_TABLE record, which is an on-disk hash table. For header units, it has at least one entry in that table which is the header file itself. For named modules, we shouldn't need this, but it's still present today (see #78495). This means that every header unit has a HEADER_SEARCH_TABLE with at least one entry in it.

ChuanqiXu9 · 2025-05-21T07:29:21Z

BTW, for this,

For named modules, we shouldn't need this, but it's still present today (see #78495).

I remember I tried to do some of such optimizations in the downstream. But I didn't find any significant impacts, so I stopped. It is welcomed to optimize this if any one want to give it a try.

Bigcheese · 2025-05-21T21:02:40Z

Every PCM file contains a HEADER_SEARCH_TABLE record, which is an on-disk hash table. For header units, it has at least one entry in that table which is the header file itself. For named modules, we shouldn't need this, but it's still present today (see #78495).

I'm pretty sure we do still need it. I would expect:

//--- A.h
#pragma once
#define FOO

//--- B.h
#include "A.h"

//--- C.cpp
import "B.h";
#include "A.h" // this #include should get skipped.

where A.h is textual and B.h is a header unit.

The actual issue with #78495 is that Clang does not do visibility tracking of HeaderFileInfo which causes other problems with submodules too. I think the right fix for this part of it is to fix this part of Clang. Named modules should always block HeaderFileInfo propagation because they don't export macros, but header units should always propagate it. Basically they should follow the exact same rules as the other preprocessor state does.

Of course this doesn't fix the performance issue. There I have a preference for making chained hash tables work.

mpark · 2025-05-21T21:13:04Z

There I have a preference for making chained hash tables work.

This means using MultiOnDiskHashTable, is that right? I did try this (mentioned in #140867) and I'm willing to make that work, but it'll be more involved. Specifically because HeaderFileInfoTrait isn't readily usable with it, and because I think we want to store FileEntryRef instead of the "internal key". I think all of this isn't too hard, like, MultiOnDiskHashTable could use Info::InMemoryKey as the MergedTable's key, and the traits can decide which key they want to be used as the in-memory key or something. If that's the direction we want to head, I'm happy to update #140867 with that implementation. The current approach I think is no-go anyway based on @jansvoboda11's report.

llvmbot added the new issue label May 21, 2025

frederick-vs-ja added clang:modules C++20 modules and Clang Header Modules slow-compile and removed new issue labels May 21, 2025

mpark mentioned this issue May 21, 2025

[WIP][C++20][Modules] Lazily, but fully load 'HeaderFileInfo' table into memory. #140867

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[C++20][Modules] The search complexity for header file info is too expensive with a lot of modules. #140860

[C++20][Modules] The search complexity for header file info is too expensive with a lot of modules. #140860

mpark commented May 21, 2025 •

edited

Loading

llvmbot commented May 21, 2025

Test Case

Code Path

Uh oh!

ChuanqiXu9 commented May 21, 2025

Uh oh!

Bigcheese commented May 21, 2025

Uh oh!

mpark commented May 21, 2025

Uh oh!

[C++20][Modules] The search complexity for header file info is too expensive with a lot of modules. #140860

[C++20][Modules] The search complexity for header file info is too expensive with a lot of modules. #140860

Comments

mpark commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Case

Code Path

llvmbot commented May 21, 2025

Test Case

Code Path

Uh oh!

ChuanqiXu9 commented May 21, 2025

Uh oh!

Bigcheese commented May 21, 2025

Uh oh!

mpark commented May 21, 2025

Uh oh!

mpark commented May 21, 2025 •

edited

Loading