Exploring large codebase (unity build) compilation time



Intro

Previously I wrote a post about compiling inkscape’s codebase using “unity build” methods where all files that need to be compiled are joined during preprosessing phase and then compiled as single compilation unit. Compiling all that code in one go took from 3 to 6 minutes depending on optimization parameters. I got some free time and want to explore where time could be spent.

In the previous post there were several compilation units (instead of just one). This was done becase there was main compilation unit related to inkscape and others were external libraries. Since I don’t assume external libraries will be changing during development I decided to give them their own space. In this post I will be looking only into main compilation unit because it is unit that takes most of the time. All other comilation units in total take roughly 10 seconds while main one takes several minutes so it logical to spend time looking into the most “problematic” place.

Currently time distribution for compilation (on my current computer) looks like this.

Copy data files: .185105678 seconds
Compile lib2geom.so: 17.437606971 seconds
Compile liblivarot_LIB.a: 6.259619460 seconds
Compile libutil_LIB.a: 7.790196950 seconds
Compile libuemf_LIB.a: 1.071249511 seconds
Compile libcroco_LIB.a: .678505152 seconds
Compile libdepixelize_LIB.a: 2.989161327 seconds
Compile libavoid_LIB.a: 10.573620461 seconds
Compile libcola_LIB.a: 1.921713723 seconds
Compile libvpsc_LIB.a: .964423906 seconds
Compile libautotrace_LIB.a: .294723941 seconds
Compile libgc_LIB.a: .537479040 seconds
Compile inkscape binary: 385.648089621 seconds

We will be looking only at inkscape final compilation unit.

First of all let’s look at raw source code numbers. I ran “cloc *” in a project root directory and got these numbers:

------------------------------------------------------------------------
Language                        code     blank        comment     files 
------------------------------------------------------------------------
PO File                      2557321   1042936        1523404        88 
SVG                           471582      3022           1822      3606 
C++                           464864     85253          91898      1407 
C/C++ Header                   96830     27853          54205      1308 
Python                         61830     13612          16414       532 
C                              52514      9493          17897        80 
make                           36992     13415          10271        99 
Text                           29221      8452              0       238 
CMake                          24270      3626           3778       774 
Gencat NLS                     21614       286              0        64 
XML                            16051      1174            993       246 
CSS                            14890      2846           1502        46 
Glade                          13834        24             62        41 
XML (Qt/GTK)                   11630         0              4        34 
Cython                          4729      1726           1532        15 
JavaScript                      3855       745            797         7 
XSLT                            3540       351            638        11 
reStructuredText                2070       977            696        24 
YAML                            2067       128             76        10 
Bourne Shell                    1453       324            336        39 
Markdown                        1365       498              8        35 
CSV                             1116         0              0         3 
TypeScript                       692         0              0       346 
HTML                             658        45             18         6 
Perl                             618       160            162         4 
TeX                              586       194              9         2 
m4                               380        68             62         2 
Visual Studio Solution           289         3              3         3 
MSBuild script                   288         6             32         2 
Bourne Again Shell               221        64             48         3 
diff                             209        26            121         2 
TOML                              83         7              7         3 
DOS Batch                         72        18              2         6 
INI                               66        16              0         3 
Expect                            56        18              0         1 
Tcl/Tk                            52        22             32         1 
WiX source                        34         8              0         1 
Windows Module Definition         29         1              5         2 
Windows Resource File             29         1              0         1 
JSON                              20         0              0         1 
XAML                               2         0              0         1 
------------------------------------------------------------------------
SUM:                         3898022   1217398        1726834      9097 

So we have 464K lines of C++ code, 96K line headers and 52K likes of C code. These are totals and includes code for external libraries. I will try to remove extra lines and get actual lines of code for the main compilation unit.

Aside: Why do they have 36K lines of makefiles and 24k lines of CMake files?

For unity build I created single all.cpp file that included all needed files into it. Instead of compiling it we can use “-E” flag to just preprocess it and actually combine all included files into one. For this I used this command:

c++ -E $CFLAGS_MAIN \
    $SRC/all.cpp \
    -o all_preprocessed.cc \
    $LDFLAGS 

This produces one 37MB file. Looking at raw file gives us 1_031_397 lines of “code”. Since it is preprocessed result it contains a bunch preproccer lines indicating files and location. Removing all those lines (starting with “# “) reduces size to 976K lines. Running “cloc all_preprocessed.cc” gives us 695K lines of code.

If we assume that we hvae 695K lines of code compiled in 385 then we get 1800 lines/second compilation speed. This is roughly similar to what we got in the last article.

Syscalls

Hoping to get some low hanging fruits I tried to look into syscalls. For this I used the same command with some extra params for strace.

strace -T -t -f -o strace_out.txt \
    c++ $CFLAGS_MAIN \
        $SRC/all.cpp \
        -o inkscape \
        $LDFLAGS 

strace start

First of all we are interested to look into all “subprocesses” and used “-f” for that. This will show starting process with all process that it will start (if any). -T and -t are used to show real time and individual timing per syscall. Then dump everything to a file for later processing.

As a result this creates a file with about 850K syscall spread over 375 seconds. Looking over those lines does not give any additional insight as there are no apparent outliers. I wrote a small program to extract and explore some basic stats from this file. First of all “c++” executable start another 4 processes during execution. Here are process ids with how many syscalls they called.

'3192020': 227,
'3192021': 330440,
'3192780': 70127,
'3192791': 218,
'3192792': 447009

Next is to look into syscall themselves. Here is the list:

command         |   amount | total time
----------------------------------------------------
execve          |       13 |   0.000799
brk             |     9569 |   0.047949
access          |       62 |   0.000310
openat          |   155016 |   0.488216
fstat           |    22577 |   0.056053
mmap            |    13981 |   0.192727
close           |     8337 |   0.021549
read            |   203240 |   0.604515
pread64         |       10 |   0.000037
arch_prctl      |        5 |   0.000021
set_tid_address |        5 |   0.000018
set_robust_list |        5 |   0.000016
rseq            |        5 |   0.000018
mprotect        |       33 |   0.000234
prlimit64       |       14 |   0.000066
getrandom       |        8 |   0.000034
munmap          |      368 |   0.011382
futex           |        5 |   0.000020
ioctl           |       26 |   0.000103
rt_sigaction    |       36 |   0.000131
newfstatat      |      349 |   0.001744
readlink        |    97439 |   0.281958
getcwd          |      103 |   0.000269
rt_sigprocmask  |       24 |   0.000088
clone3          |        4 |   0.000729
sysinfo         |        3 |   0.000019
faccessat2      |     5671 |   0.017385
madvise         |       36 |   0.000478
mremap          |        9 |   0.000310
write           |    93497 |   0.614183
wait4           |        5 | 416.315802
exit_group      |        5 |   0.000000
getrusage       |       82 |   0.000246
fcntl           |      498 |   0.001223
lseek           |   236963 |   0.526468
dup             |       11 |   0.000025
getdents64      |        4 |   0.000020
umask           |        2 |   0.000005
chmod           |        1 |   0.000006
unlink          |        5 |   0.010854

Looking at this table we clearly see that there are no obvious offenders. Even though we do a lot of syscalls they don’t take a lot of time. From the table the biggest time user looks like wait4 syscall. But it just waits for another process.

Some notable syscalls:

readlink(…): There were 97439 calls of readline. 86596 of this were EINVAL and rest is ENOENT. So all calls to readlink resulted in error.

read(…): There 203240 calls to read. If we sum up all read amounts we get 1GB (1019936061 bytes) read in total. Largest read was 3.8MB (3865472 bytes). Actually there were 10 reads more than 1 MB and the rest were smaller reads. The big 3MB read was a /tmp/ccU1fgMR.o file.

write: There were 93497 write calls. In total it wrote 231 MB (231457895 bytes). With largest write being 9.5 MB (9_506_816 bytes). This write was for ‘inkscape’ binary itself.

execve: it shows 13 calls but in reality it was 5 calls. Othere were ENOENT. So we can see that we started with c++ and it the called other 4.

1) /usr/bin/c++
2) /usr/lib/gcc/x86_64-pc-linux-gnu/15.2.1/cc1plus
3) /usr/bin/as
4) /usr/lib/gcc/x86_64-pc-linux-gnu/15.2.1/collect2
5) /usr/bin/ld

openat: The program tried opening 155016 files of those 146635 were ENOENT and only 8381 were successfull. From these opened files there were a lot of duplicates. Removing dublicates leaves us with 5967 uniqe file paths. I removed not related files and that left us with these:

 3423 .h
 1275 .hpp
  875 .cpp
  195 .so
   32 .tcc
   22 .ipp
   13 .a
    8 .conf
    ...

I tried cating togetter all h, hpp, cpp, tcc, ipp files and resulting file was 54MB with about 1.5 million raw lines of code. This is significantly larger that the file I calculated using “gcc -E”. But in this case I decided not to spend more time on this for now.

We can see that there were quite a few header files. These are probably boost and other headers from standart library.

unlink: Since in the end it removes just 4 files I assume it created only 5 temporary files during process of compilation. The files were: /tmp/ccJ8g4Os.cdtor.c, /tmp/ccNQCMUo.cdtor.o, /tmp/cc7K6Det.res, /tmp/ccU1fgMR.o, /tmp/cckTy2lE.s.

wait4: it is syscall to wait for pid and in here it uses most of the time. In reality it does not take time. It just starts waiting and counts times from start to the time when when child returns. Here are timing for this run:

3192020 <... wait4 resumed>, ...           <313.367923>
3192020 <... wait4 resumed>, ... = 3192021 <82.373998>
3192020 <... wait4 resumed>, ... = 3192780 <11.422819>
3192791 <... wait4 resumed>, ... = 3192792 <4.574100>
3192020 <... wait4 resumed>, ... = 3192791 <4.576962>

Clearly the biggest time block is first wait4. Looking into process start list it it is cc1plus. So the most of the time is spent actually reading files processing them and creating assembly code from it.

strace openat first
child

We can see that it start with “all.cpp” catch all file. After that it opens a temporary file where it will be writing create output. It is in this process that it will open 7941 files. Here we see that it start with “all.cpp” and ends with “inkscape-main.cpp” file.

This whole process of opening all this almost 8k files and reading them took slightly over 4 seconds.

Next part is about 2 minutes and filled with 14K syscalls of brk and mmap. It looks like it read all needed files and then started processing. During the processing it allocated a lot of small blocks of memory. At the start of processing of this block it used 3-9 microseconds per syscall. At the end of this block the same syscall time grow to 60-90 microseconds. Since we it is memory calls I can assume we might be stressing memory to find blocks of available memory.

Next block of actions is writing. It is 36k syscalls to write. Each write is 4096 or one page (buffering?) and it is all writing and assembly output (135 MB). This whole block of writing take 4 and a half minutes. Writing 135 MB should not take 4+ minutes (which makes it barely 0.5MB/s). I have a fast SSD. So this means that most of the time is spent either converting AST to assembly or even more compilation in between. After writing it closes assembly file and closed process.

So we can say that roughly speaking this process of compilation took 4 seconds to load all files (with probably minor processing), about 2 minutes of aggresive small memory allocations (probably parsing and compiling into AST) and then 4 minutes converting AST to assembly and writing it to disk.

Template processing

At the beginning when we created a single preporcessed file I counted all template definitions: “template<”. It was 14588 different definitions. My next assumption was too look at code size exploding due to templates. But how do we look at it? Preprocessor does not expand tempates. To check it I had to switch from GCC to clang and use cppinsights.

cppinsights is a plugin to clang that can convert ast back to code and print it out. Unfortunately my experiments with direct “insights” binary didn’t work out as it was crashing clang. I decided to try do it directly though clang command using these options:

clang++ -Xclang -ast-print -fsyntax-only \
    $CFLAGS_MAIN \
    $SRC/all.cpp \
    $LDFLAGS \
    > out.cc

So it is pretty much the same compilatio but now we use clang and instead of creating binary we dump out code. The resulting out.cc is 700 MB file of raw C++ code (Remember that our initial assumption was about 35-40 MB). So it is 20x increase in raw byte size. Counting lines using “cloc out.cc” tells us that this file is 5 million lines of code (5037380 to be precise). At least 7x increase in size.

So previous 2 minute processing for file was probably expanding templates and size code before starting compilation. And if our calculation was correct then the speed is not 1800 lines per second but full 12000 lines per second (5mil / 386 seconds).

Conclusion

This is rought draft of (back of the envelope) exploring what was happening during unity build of inkscape code and why it (might have) took this much time to compile. The file expands from 30-40 MB source code to 700+ MB source code. And source lines expand from 700k to 5m.

All this process of expanding and the compiling this massive amoutn of code takes some time.

This is all guesstimates and proper analysis would take actual clang/gcc code and profile using those. This is rough exploration and maybe will be expanded to more pricise methods.