RISCV Compilation failing with "relocation truncated to fit: R_X86_64_32 against `.debug_loc'" when LTO disabled.
Description
Environment
Activity
Jason Lowe-Power June 11, 2021 at 4:21 PM
No, it’s not obvious with objdump! This is the really weird thing… Objdump/nm/size don’t even see the whole size of the file! It seems to be mostly in the debug symbols, but I can’t really tell.
Size says the total sum of all symbols in the file is like 400K, but du shows that it’s 20 MB… Something is definitely going wrong!
I have two hypotheses, but I don’t know how to dig much further:
1. There are some header files with code, and that code is being duplicated many times. I don’t think it’s pybind because that header isn’t included in some of the large files (as far as I can tell).
2. We’re using recursive and other crazy template magic. This could be causing great confusion in the compiler.
At a high level, it concerns me that there is so much code generation that is difficult to understand. Trying to trace back these kinds of bugs is incredibly difficult and time-consuming. I’m starting to think that the benefits we get from these complex code generation implementations (both templates and Python generation) are not worth the maintenance headaches.
Andreas Sandberg June 11, 2021 at 3:43 PM
That is truly weird. This is what it looks like on my system:
$ find build/ARM -name \*.do -exec du -h \{\} \; | sort -h
...
236K build/ARM/params/BIPRP.do
236K build/ARM/params/BloomFilterBlock.do
236K build/ARM/params/BloomFilterBulk.do
...
3.7M build/ARM/enums/TerminalDump.do
3.7M build/ARM/enums/ThreadPolicy.do
3.7M build/ARM/enums/TimingExprOp.do
3.7M build/ARM/enums/VecRegRenameMode.do
...
4.9M build/ARM/python/_m5/param_BaseCPU.do
4.9M build/ARM/python/pybind11/stats.do
5.3M build/ARM/arch/arm/generated/generic_cpu_exec_4.do
5.5M build/ARM/arch/arm/generated/generic_cpu_exec_5.do
5.9M build/ARM/arch/arm/generated/generic_cpu_exec_3.do
8.2M build/ARM/python/pybind11/core.do
11M build/ARM/arch/arm/generated/generic_cpu_exec_6.do
13M build/ARM/arch/arm/generated/generic_cpu_exec_1.do
13M build/ARM/arch/arm/linux/se_workload.do
32M build/ARM/arch/arm/generated/inst-constrs-3.do
<EOF>
The fact that fs9p.o is a whopping 71MiB on your machine is really strange. There is nothing special about that file other than the fact that it includes 3 param headers. There are other files including multiple param headers, for example thermal_model.cc, which includes 4 of them, and they don’t show up on your list, so it must be something else. Does objdump tell you anything useful about the files? For example, which sections are bloated?
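One way to see which sections are bloated, sketched here in Python rather than with objdump: read the ELF64 section header table directly and total the on-disk size per section name. This is a minimal sketch that assumes a little-endian ELF64 object (as produced on x86-64); the function name `section_sizes` is made up for illustration. Since the build passes `-gz`, debug sections would show their compressed sizes. Note also that `size` in its default format only reports text, data and bss, so debug sections are not part of its total; `size -A` lists every section.

```python
import struct

# ELF64 file header and section header layouts (little-endian).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
ELF64_SECTION = struct.Struct("<IIQQQQIIQQ")
SHN_XINDEX = 0xFFFF  # real string-table index is stored in section 0
SHT_NOBITS = 8       # e.g. .bss: occupies no bytes in the file


def section_sizes(data):
    """Map section name -> bytes occupied in the file, summed per name."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    fields = ELF64_HEADER.unpack_from(data, 0)
    shoff, shentsize, shnum, shstrndx = (
        fields[6], fields[11], fields[12], fields[13])
    if shoff == 0:
        return {}

    def header(index):
        return ELF64_SECTION.unpack_from(data, shoff + index * shentsize)

    # Objects with very many sections store the real counts in section 0.
    first = header(0)
    if shnum == 0:
        shnum = first[5]      # sh_size of section 0
    if shstrndx == SHN_XINDEX:
        shstrndx = first[6]   # sh_link of section 0

    names = header(shstrndx)
    strtab = data[names[4]:names[4] + names[5]]

    sizes = {}
    for i in range(1, shnum):
        name_off, sh_type, _flags, _addr, _offset, size = header(i)[:6]
        if sh_type == SHT_NOBITS:
            continue
        end = strtab.index(b"\0", name_off)
        name = strtab[name_off:end].decode("ascii", "replace")
        sizes[name] = sizes.get(name, 0) + size
    return sizes
```

For example, `sorted(section_sizes(open(path, "rb").read()).items(), key=lambda kv: kv[1])[-10:]` would list the ten largest section names for the object file at `path`.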
Jason Lowe-Power June 11, 2021 at 3:23 PM
Last comment before I start to try to implement your idea… After building with the docker command in the other comment, if you run find build -type f -exec du -h {} \; | sort -h you’ll find over 300 files that are > 20MB. I can’t find any pattern to these files. I’ll paste the biggest ones below.
26M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_Sinic.o
26M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_TAGEBase.o
26M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_TraceCPU.o
27M /fasthome/jlp/gem5-build/default/RISCV/cpu/minor/execute.o
27M /fasthome/jlp/gem5-build/default/RISCV/dev/net/ns_gige.o
27M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_BaseCPU.o
27M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_DerivO3CPU.o
27M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_Process.o
27M /fasthome/jlp/gem5-build/default/RISCV/python/_m5/param_System.o
28M /fasthome/jlp/gem5-build/default/RISCV/cpu/o3/cpu.o
29M /fasthome/jlp/gem5-build/default/RISCV/dev/net/i8254xGBe.o
29M /fasthome/jlp/gem5-build/default/RISCV/mem/cache/base.o
30M /fasthome/jlp/gem5-build/default/RISCV/arch/riscv/generated/generic_cpu_exec.o
39M /fasthome/jlp/gem5-build/default/RISCV/arch/riscv/linux/se_workload.o
71M /fasthome/jlp/gem5-build/default/RISCV/dev/virtio/fs9p.o
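For anyone who wants to reproduce a listing like the one above without shell quoting problems, here is a rough Python sketch of the same idea. The function name `files_over` and the threshold are illustrative assumptions; also note that it reports apparent file size in bytes, whereas `du` reports disk usage, so the numbers can differ slightly.

```python
import os


def files_over(root, min_bytes):
    """Return (size, path) pairs for regular files under root that are
    at least min_bytes long, sorted from smallest to largest."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            size = os.path.getsize(path)
            if size >= min_bytes:
                found.append((size, path))
    return sorted(found)
```

Calling `files_over("build", 20 * 1024 * 1024)` would return every file of 20 MiB or more under `build`, with the largest last.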
Jason Lowe-Power June 11, 2021 at 3:16 PM (edited)
You can try using docker to see the same output that I’m seeing.
This is the largest file that I’m seeing. It’s over 70MB!
docker run -u $UID:$GID --volume $(pwd):$(pwd) -w $(pwd) --rm gcr.io/gem5-test/ubuntu-20.04_all-dependencies scons build/RISCV/dev/virtio/fs9p.o
Note that I’m running using the uncommitted minor release staging branch (https://gem5-review.googlesource.com/c/public/gem5/+/45829/3):
git fetch https://gem5.googlesource.com/public/gem5 refs/changes/29/45829/3 && git checkout -b change-45829 FETCH_HEAD
I tried passing -M to gcc to see the headers that are included. I passed this through grep -v /usr to filter out the system headers (I didn’t do -MM just in case that filtered out the -I headers…). This is what I got. Note that pybind does not appear??
docker run -u $UID:$GID --volume $(pwd):$(pwd) -v /fasthome/jlp/gem5-build/default:$(pwd)/build -w $(pwd) --rm gcr.io/gem5-test/ubuntu-20.04_all-dependencies g++ -M -std=c++14 -gz -pipe -fno-strict-aliasing -Wall -Wundef -Wextra -Wno-sign-compare -Wno-unused-parameter -Werror -Wno-error=deprecated-declarations -Wno-error=deprecated -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -pthread -DPROTOBUF_INLINE_NOT_IN_HEADERS=0 -g -O3 -DNUMBER_BITS_PER_SET=64 -DPROTOCOL_MI_example -DTRACING_ON=1 -Ibuild/drampower/src -Ibuild/libfdt -Ibuild/libelf -Ibuild/softfloat -Ibuild/iostream3 -Ibuild/nomali/include -Ibuild/fputils/include -Iext/pybind11/include -Iinclude -Iext -I/usr/include/python3.8 -I/usr/include/hdf5/serial -Iext/googletest/googletest/include -Iext/googletest/googlemock/include -Ibuild/RISCV -Ibuild/RISCV/systemc/ext build/RISCV/dev/virtio/fs9p.cc | grep -v /usr
build/RISCV/base/cprintf.hh build/RISCV/base/cprintf_formats.hh \
build/RISCV/sim/serialize_handlers.hh build/RISCV/base/str.hh \
build/RISCV/base/bitunion.hh build/RISCV/base/bitfield.hh \
build/RISCV/mem/port_proxy.hh build/RISCV/mem/port.hh \
build/RISCV/base/addr_range.hh build/RISCV/mem/packet.hh \
build/RISCV/base/flags.hh build/RISCV/base/printable.hh \
build/RISCV/mem/htm.hh build/RISCV/mem/request.hh \
build/RISCV/base/amo.hh build/RISCV/cpu/inst_seq.hh \
build/RISCV/sim/byteswap.hh build/RISCV/enums/ByteOrder.hh \
build/RISCV/mem/backdoor.hh build/RISCV/base/callback.hh \
build/RISCV/mem/protocol/functional.hh \
build/RISCV/mem/protocol/timing.hh build/RISCV/sim/port.hh \
build/RISCV/sim/sim_object.hh build/RISCV/base/stats/group.hh \
build/RISCV/base/stats/units.hh build/RISCV/params/SimObject.hh \
build/RISCV/base/uncontended_mutex.hh \
build/RISCV/debug/VIO9P.hh build/RISCV/debug/VIO9PData.hh \
build/RISCV/params/VirtIO9PBase.hh \
build/RISCV/params/VirtIODeviceBase.hh build/RISCV/params/System.hh \
build/RISCV/enums/MemoryMode.hh build/RISCV/params/AbstractMemory.hh \
build/RISCV/params/ClockedObject.hh build/RISCV/params/ClockDomain.hh \
build/RISCV/params/PowerModel.hh build/RISCV/base/temperature.hh \
build/RISCV/params/PowerModelState.hh build/RISCV/enums/PMType.hh \
build/RISCV/params/SubSystem.hh build/RISCV/params/ThermalDomain.hh \
build/RISCV/params/PowerState.hh build/RISCV/enums/PwrState.hh \
build/RISCV/params/RedirectPath.hh build/RISCV/params/ThermalModel.hh \
build/RISCV/params/Workload.hh build/RISCV/params/VirtIO9PDiod.hh \
build/RISCV/params/VirtIO9PProxy.hh build/RISCV/params/VirtIO9PSocket.hh \
build/RISCV/sim/system.hh build/RISCV/arch/isa_traits.hh \
build/RISCV/arch/riscv/isa_traits.hh \
build/RISCV/base/loader/memory_image.hh \
build/RISCV/base/loader/image_file_data.hh \
build/RISCV/base/loader/symtab.hh build/RISCV/base/statistics.hh \
build/RISCV/base/intmath.hh build/RISCV/base/stats/info.hh \
build/RISCV/base/stats/types.hh build/RISCV/base/stats/output.hh \
build/RISCV/base/stats/storage.hh build/RISCV/config/the_isa.hh \
build/RISCV/cpu/pc_event.hh build/RISCV/mem/mem_requestor.hh \
build/RISCV/mem/physical.hh build/RISCV/base/addr_range_map.hh \
build/RISCV/cpu/thread_context.hh build/RISCV/arch/generic/htm.hh \
build/RISCV/arch/generic/isa.hh build/RISCV/arch/registers.hh \
build/RISCV/arch/riscv/registers.hh build/softfloat/softfloat.h \
build/softfloat/softfloat_types.h build/softfloat/specialize.h \
build/softfloat/primitiveTypes.h build/softfloat/platform.h \
build/softfloat/softfloat.h build/RISCV/arch/generic/types.hh \
build/RISCV/base/trace.hh build/RISCV/base/match.hh \
build/RISCV/arch/generic/vec_pred_reg.hh \
build/RISCV/arch/generic/vec_reg.hh build/RISCV/arch/types.hh \
build/RISCV/arch/riscv/types.hh build/RISCV/cpu/reg_class.hh \
build/RISCV/sim/redirect_path.hh build/RISCV/sim/se_signal.hh \
build/RISCV/sim/workload.hh build/RISCV/base/loader/object_file.hh \
build/RISCV/base/loader/image_file.hh build/RISCV/sim/stats.hh
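One caveat with the filtering above: `grep -v /usr` drops whole lines, and `g++ -M` normally puts several paths on each line, so any header that happens to share a line with a system path is dropped as well. That may or may not be why pybind is missing from the list. A per-path filter avoids the ambiguity. The sketch below is a minimal Python version that assumes the raw, unfiltered `-M` output is available as text and that paths contain no escaped spaces; all function names are made up for illustration.

```python
def parse_make_deps(text):
    """Split make-rule output from `g++ -M` into individual paths."""
    # Join backslash-continued lines, then split on whitespace.
    tokens = text.replace("\\\n", " ").split()
    # Drop rule targets ("foo.o:") and any stray continuation markers.
    return [t for t in tokens if t != "\\" and not t.endswith(":")]


def project_headers(deps, system_prefixes=("/usr/",)):
    """Keep only paths that do not start with a system prefix."""
    return [d for d in deps if not d.startswith(system_prefixes)]


def mentions(deps, needle):
    """Return the paths that contain the given substring."""
    return [d for d in deps if needle in d]
```

With the raw output in `raw`, `mentions(project_headers(parse_make_deps(raw)), "pybind11")` would show directly whether any pybind11 header is pulled in.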
Andreas Sandberg June 11, 2021 at 2:59 PM
This is a really weird issue by the way. I’m really not seeing the same issues when I build locally. My local builds strongly suggest that PyBind is one of the biggest culprits on my machine. It smells a lot like a compiler bug that is triggered by something unusual in the gem5 codebase.
At present, the RISCV ISA will fail to compile in certain environments when compiling without LTO. This has not been observed for other ISA targets. This bug can be re-created on the stable branch (v20.1.0) with:
docker run -u $UID:$GID --volume $(pwd):$(pwd) -w $(pwd) --rm gcr.io/gem5-test/ubuntu-20.04_all-dependencies scons build/RISCV/gem5.opt -j6 --no-lto
The build fails with the following error at link time:
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x144808): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x14480c): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x144839): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x14483d): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x1448e3): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x1448e7): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x1448f5): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x1448f9): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x14491d): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x144921): relocation truncated to fit: R_X86_64_32 against `.debug_loc'
build/RISCV/python/_m5/param_RiscvInterrupts.o:(.debug_info+0x14492a): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
scons: *** [build/RISCV/gem5.opt] Error 1
Info on this error can be found here:
https://www.ibm.com/support/pages/intel-compiler-error-relocation-truncated-fit-rx8664pc32
https://stackoverflow.com/questions/10486116/what-does-this-gcc-error-relocation-truncated-to-fit-mean
https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html
This is essentially a problem of addressing being limited to 32 bits: R_X86_64_32 is a 32-bit relocation, so the value it refers to has to fit in a 32-bit field. We believe the debug symbols are "too far" away in the binary to be addressed with 32 bits.
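To make the 32-bit limit concrete: in the x86-64 psABI, R_X86_64_32 writes a value into a 32-bit field and requires that zero-extending the field gives back the original 64-bit value, and the linker reports "relocation truncated to fit" when it does not. The Python sketch below only illustrates that arithmetic. It assumes the overflowing value is an offset into the merged `.debug_loc` section, which matches the error message but has not been confirmed here, and the offsets used are hypothetical.

```python
U32_MAX = 2**32 - 1


def fits_r_x86_64_32(value):
    """True if the value survives being stored in a 32-bit field and
    zero-extended back to 64 bits."""
    return 0 <= value <= U32_MAX


def truncated(value):
    """The low 32 bits that would actually end up in the field."""
    return value & U32_MAX


# Hypothetical section offsets, for illustration only.
small_offset = 3 * 2**30   # 3 GiB: still addressable with 32 bits
large_offset = 5 * 2**30   # 5 GiB: past the 4 GiB limit

for offset in (small_offset, large_offset):
    status = "fits" if fits_r_x86_64_32(offset) else "truncated"
    print(hex(offset), status, hex(truncated(offset)))
```

Under that assumption, the failure depends on the total size of the debug data being linked, which would fit with the observation that it only shows up when more objects are linked in.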
The go-to solution for this problem is to pass -mcmodel=medium or -mcmodel=large, though this has been found not to work in our case.
This bug only occurs when all dependencies are present on the system, which makes some sense as linking more will increase the address space needed.
This bug does not occur on the develop branch. All compile/linking flags are the same. Doing a git bisect, we found that the following patch fixed the issue on develop: https://gem5-review.googlesource.com/c/public/gem5/+/41736/. It is not currently known why this patch fixes things, but it may just be that this patch "accidentally" keeps all objects compatible with GCC's 32-bit addressing restrictions.
This is currently blocking our minor release branch as disabling LTO by default is part of this: https://gem5-review.googlesource.com/q/branch:minor-release-staging-v21-0-1
We are currently looking into fixing this by re-ordering the linking of files, though this is not a long-term solution.