Elastic Trace replay extremely inaccurate


I ran elastic-trace-driven simulations expecting the inaccuracy to be under 5% relative to a full O3 simulation, as reported in the elastic trace paper. What I got instead was a difference in ticks/sim_seconds of 100 to 150%. I tried many different approaches, but I will give the steps for the one that needs the least explanation.

1) Boot a system with fs.py:
build/ARM/gem5.opt --outdir=m5out/clean_boot ./configs/example/fs.py --checkpoint-dir=m5out/clean_boot --cpu-type=AtomicSimpleCPU --disk-image=/home/sherif/thesis/m5_binaries/disks/linaro-minimal-aarch64.img --disk-image=/home/sherif/thesis/m5_binaries/disks/parsec3.img --kernel=/home/sherif/thesis/m5_binaries/binaries/vmlinux.arm64 --caches

2) Create a checkpoint with m5 checkpoint and exit.

3) Resume with O3CPU:
build/ARM/gem5.opt --outdir=m5out/recording ./configs/example/fs.py --checkpoint-dir=m5out/clean_boot -r 1 --cpu-type=DerivO3CPU --disk-image=/home/sherif/thesis/m5_binaries/disks/linaro-minimal-aarch64.img --disk-image=/home/sherif/thesis/m5_binaries/disks/parsec3.img --kernel=/home/sherif/thesis/m5_binaries/binaries/vmlinux.arm64 --caches

4) Create a checkpoint for the trace recording:
- m5 checkpoint; m5 resetstats; ./program; m5 exit (exit the simulation as soon as the checkpoint is created)
- move the new checkpoint to m5out/recording
- in the m5.cpt file in the checkpoint, replace all instances of 'switch_cpus' with 'cpu'; otherwise, checkpoints made after restoring from a checkpoint cause a serialization error (that's another bug, though I think someone already reported it; it happens with fs.py but not with starter_fs.py)
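The switch_cpus-to-cpu rename in the last bullet can be scripted instead of done by hand; a sketch with sed, assuming the usual cpt.<tick> directory layout under m5out/recording (adjust the glob to your checkpoint's tick):

```shell
# Rename every 'switch_cpus' section/key back to 'cpu' in place.
# This edits the checkpoint file directly, so keep a backup copy first.
sed -i 's/switch_cpus/cpu/g' m5out/recording/cpt.*/m5.cpt
```
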

5) Record a trace:
build/ARM/gem5.opt --outdir=m5out/recording ./configs/example/fs.py --checkpoint-dir=m5out/recording -r 1 --cpu-type=DerivO3CPU --disk-image=/home/sherif/thesis/m5_binaries/disks/linaro-minimal-aarch64.img --disk-image=/home/sherif/thesis/m5_binaries/disks/parsec3.img --kernel=/home/sherif/thesis/m5_binaries/binaries/vmlinux.arm64 --caches --elastic-trace-en --data-trace-file=deptrace.proto.gz --inst-trace-file=fetchtrace.proto.gz --mem-type=SimpleMemory

6) Run a full O3 simulation for reference:
build/ARM/gem5.opt --outdir=m5out/full_o3 ./configs/example/fs.py --checkpoint-dir=m5out/recording -r 1 --cpu-type=DerivO3CPU --disk-image=/home/sherif/thesis/m5_binaries/disks/linaro-minimal-aarch64.img --disk-image=/home/sherif/thesis/m5_binaries/disks/parsec3.img --kernel=/home/sherif/thesis/m5_binaries/binaries/vmlinux.arm64 --caches

7) Run a replay of the trace:
build/ARM/gem5.opt --outdir=m5out/replay ./configs/example/etrace_replay.py --cpu-type=TraceCPU --caches --data-trace-file=m5out/recording/system.switch_cpus.traceListener.deptrace.proto.gz --inst-trace-file=m5out/recording/system.switch_cpus.traceListener.fetchtrace.proto.gz --mem-type=SimpleMemory --mem-size=3GB

Note: I've tried different programs/benchmarks for the trace recording, from just running ls or echo to the STREAM benchmark. They all cause an assertion error for an invalid context ID. This is an issue in and of itself that makes elastic traces unusable out of the box. To work around it, I tried setting all context IDs to 0, and also forcing the LL/SC flag to 0 in TraceCPU::ElasticDataGen::executeMemReq() in trace_cpu.cc:


// Mask out the LL/SC flag before issuing the request:
// req->setFlags(node_ptr->flags & (~0x00200000));


I can attach simulation stats if needed, but since these stats depend on many factors, I don't feel it is necessary. What I consistently see is a relative difference in sim_seconds of over 100%, and a nearly identical relative difference in simulated CPU cycles. Sometimes the number of D-cache accesses in the replay is also much larger than in the recording and the reference run, though the accuracy is terrible whether or not this happens.
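For reference, the relative difference I'm quoting is computed from the sim_seconds lines in each run's stats.txt; a minimal sketch (the output directories match the commands above, but this helper itself is mine, not part of gem5):

```shell
# Extract sim_seconds from the reference O3 run and the trace replay,
# then print the replay's relative difference in percent.
ref=$(awk '/^sim_seconds/ {print $2; exit}' m5out/full_o3/stats.txt)
rep=$(awk '/^sim_seconds/ {print $2; exit}' m5out/replay/stats.txt)
awk -v a="$ref" -v b="$rep" 'BEGIN {printf "%.1f%%\n", (b - a) / a * 100}'
```
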


Ubuntu 20.0 running in WSL2 on Windows 10 64-bit




Sherif AbdelFadil