ARM PAuth patch slows down Linux boot 2x, from 2 minutes to 4 minutes

Description

Edit: https://gem5-review.googlesource.com/c/public/gem5/+/25964 made it fall to 2 minutes 40 seconds.

I noticed the boot was slower than normal, and so I bisected it down to:
3cea7d9ce49bda49c50e756339ff1287fd55df77 https://gem5-review.googlesource.com/c/public/gem5/+/25024

Linux image, at upstream Linux 5.4.3:

The kernel feature seems to have been added a while back, in v4.21 (released as Linux 5.0): https://www.phoronix.com/scan.php?page=news_item&px=ARM64-Pointer-Authentication, and CONFIG_ARM64_PTR_AUTH=y is set in that kernel image.

Full gem5 CLI:

I tried to disable the PAuth feature from the software point of view with:

but it did not change the execution time. The same happens if I revert the feature register setup by reverting the commit only in src/arch/arm/ArmISA.py.

gprof2dot graph before the commit:

gprof2dot graph at the commit, with the --param above:

Minimized kernel boot

To make the optimization test loop shorter, here is a hacked vmlinux identical to the previous one, except that it adds an m5 exit much earlier in the boot:

That image exits just after the “ftrace: allocating 22067 entries in 87 pages” printk message. This point was found by trial and error; now that I think about it, it would have been smarter to add an m5 dumpstats call to printk instead.
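For reference, a minimal sketch of what such a hack can look like, assuming gem5's m5ops code (util/m5) is built into the kernel; the include path and the placement in ftrace_init() are my guesses, not the actual patch in the attached image:

    /* Hypothetical sketch, not the actual patch: one m5_exit() call in
     * early init ends the gem5 simulation. m5_exit(ns_delay) is the real
     * m5ops function; everything else here is illustrative. */
    #include <gem5/m5ops.h>   /* assumed include path; declares m5_exit() */

    /* kernel/trace/ftrace.c */
    void __init ftrace_init(void)
    {
            /* ... existing setup that prints
               "ftrace: allocating 22067 entries in 87 pages" ... */
            m5_exit(0);   /* 0 ns delay: exit gem5 immediately at this point */
    }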

It takes 10s before 3cea7d9ce49bda49c50e756339ff1287fd55df77 and 20s after:

When using this minimized kernel, it is also worth measuring gem5's own startup time with -m 5000 to ensure that it does not account for too much of the total; the result is 1.2 s.

The gprof proportions of the full boot and the short one appear similar, so the minimized run is probably representative. From the graphs below we see, for example, that in both cases ArmISA::TLB::translateAtomic accounts for about 39% of execution time. Obtained on a Lenovo ThinkPad P51 at gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55:

readMiscReg

This section was tested on a Lenovo ThinkPad P51 at gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55.

From the profiling, one big culprit could be readMiscReg, which you would expect to be a fast call, but it is not.

Edit: editing both readMiscRegNoEffect and readMiscReg to just return 0, and limiting simulation time with -m to the same tick as the kernel run, did not speed up the simulation at all… It is possible that this is because the code then goes down a weird path and is no longer representative, but I don’t think that is likely; I haven’t investigated further to confirm. If so, maybe readMiscReg is not such a big cost after all.
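For concreteness, this is roughly what that throwaway experiment looks like in src/arch/arm/isa.cc, assuming the signatures of that era (a sketch of the hack described above, not a proposed change):

    // Throwaway experiment: make every misc register read cost only the
    // call itself, to bound how much readMiscReg can be contributing.
    RegVal
    ISA::readMiscRegNoEffect(int misc_reg) const
    {
        return 0;  // skip the register flattening and table lookups
    }

    RegVal
    ISA::readMiscReg(int misc_reg, ThreadContext *tc)
    {
        return 0;  // skip all side effects (counters, timers, checks)
    }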

The large majority of calls come from:

on this call stack:

The other important callers are:

and:

Then:

shows that the calls being made are:

with very few calls after that, which suggests that caching the top three might give big gains.

To find out more precisely which registers are being read, I hack:
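The actual patch is attached to the issue rather than inlined here; a minimal sketch of that kind of instrumentation in readMiscReg could be (miscRegName[] is the real name table in the Arm code; the placement is my guess):

    RegVal
    ISA::readMiscReg(int misc_reg, ThreadContext *tc)
    {
        // Print one line per read; needs <iostream>. The resulting log
        // can then be counted offline to rank the hottest registers.
        std::cerr << "readMiscReg " << miscRegName[misc_reg] << std::endl;
        // ... original function body unchanged ...
    }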

and then process the log with:

and the top calls are:

Some ideas:

SCR is about Secure-world state, so we likely don’t care about it when there is no EL3.

id_aa64mmfr1_el1 is a feature register fixed at startup and could easily be cached; a sketch of the idea follows.
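A minimal standalone sketch of that caching idea; slowReadMiscReg() is a stand-in for gem5's real slow path, not the actual API:

    #include <cstdint>
    #include <iostream>

    // Stand-in for the slow path: imagine the full switch plus
    // side-effect handling of ISA::readMiscReg here.
    uint64_t slowReadMiscReg(int misc_reg)
    {
        (void)misc_reg;   // the real function dispatches on this index
        return 0xf00;     // dummy value for the sketch
    }

    // id_aa64mmfr1_el1 never changes after startup, so the slow path
    // only needs to run once; every later read hits the cached copy.
    uint64_t readIdAa64mmfr1El1()
    {
        static const uint64_t cached = slowReadMiscReg(/* ID_AA64MMFR1_EL1 */ 0);
        return cached;
    }

    int main()
    {
        std::cout << std::hex << readIdAa64mmfr1El1() << "\n";  // slow path runs once
        std::cout << std::hex << readIdAa64mmfr1El1() << "\n";  // served from cache
    }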

By adding --debug-flags MiscRegs,ExecAll,FmtFlags we also see how many calls are made per typical instruction:

Environment

None

Attachments

Assignee

Giacomo Travaglini

Reporter

Ciro Santilli

Priority

Medium

Affects versions

None

Fix versions

None

Epic Link

None

Components

Labels

None