The exact shape of C++ code emitted by CXXRTL has a critical effect
on performance, both compile-time and runtime. CXXRTL's performance
greatly improved when it started localizing and inlining wires, not
only because this assists the optimizer and register allocator, but
also because inlining code into edge-triggered regions cuts the time
spent in eval() by at least a factor of two.
However, the logic of netlist layout has always been ad-hoc, fragile,
and very hard to understand and modify. After commit ece25a45, which
introduced outlining, the same logic started being applied to two
distinct netlists at once instead of one, which barely worked.
This commit does four major changes:
* There is now a single unambiguous source of truth (per subgraph)
for the layout of any emitted wire.
* Netlist layout is now done entirely during analysis using well
known graph algorithms; no graph operations happen when emitting.
* Netlist layout now happens completely separately for eval() and
debug_eval() subgraphs.
* Unreachable (within subgraph scope) netlist nodes are now neither
emitted nor considered for wire inlining decisions.
The netlist layout code should also now closely match the described
semantics.
As a part of this large cleanup, it includes many miscellaneous
improvements:
* The "bare minimum" debug level introduced in commit dd6a761d was
split into two levels; -g1 now emits debug information *only* for
inputs and state wires, and -g2 now emits debug information for
all public members. The old behavior matches -g2. This is done
to avoid bloat on low optimization levels.
* Debug aliases and inlined connections are now handled separately,
and complex RHS never interferes with inlined connections.
* Aliases to outlined wires now carry a pointer to the outline.
* Cell sync outputs can now be emitted in debug_eval().
* Black box debug information now includes comb/sync driver flags.
* The comment emitted for inlined cells is now accurate.
* Debug information statistics now has less noise.
* Netlist layout code is now much better documented.
Due to more precise inlining decisions, unmodified (i.e. with no
Yosys script being used) netlists now have much more logic inlined
into edge-triggered regions. On Minerva SoC SRAM, this improves
runtime by 20-25% across compilers and optimization levels.
Due to more precise reachability analysis, much less C++ code is now
emitted, especially at the maximum debug level. On Minerva SoC SRAM,
this improves clang compile time by 30-50% depending on options.
gcc is not affected.
On Minerva SoC SRAM compiled with clang-11, this change cuts commit
time in half (!) and overall time by 20%. When compiled with gcc-10,
there is no difference.
Implementing outlining has greatly increased the amount of debug
information in a typical build, and consequently exposed performance
issues in C++ compilers, which are similar for both GCC and Clang;
the compile time of Minerva SoC SRAM increased almost twofold.
Although one would expect the slowdown to be caused by the increased
use of templates in `debug_eval()`, it is actually almost entirely
attributable to optimizations and codegen for `debug_items()`.
Fortunately, it is neither possible nor desirable to optimize
`debug_items()`: in most cases it is called exactly once, and its
body is a linear sequence of calls with unique arguments.
This commit turns off optimizations for `debug_items()` on GCC and
Clang, improving -Os compile time of Minerva SoC SRAM by ~40% (!)
Before this commit, if a sequence of wires assigned in a chain would
terminate on a cell, none of the wires would get marked as aliases,
and typically all of the public wires would get outlined. The reason
for this behavior is that alias analysis predates outlining and in
fact runs before it.
After this commit, alias analysis runs after outlining and considers
outlined wires valid aliasees. More importantly, if the chained wires
contain any valid aliasees, then all of the wires are aliased to
the one that is topologically deepest.
Aliased wires incur virtually no overhead for the VCD writer, unlike
outlined wires that would otherwise take their place. On Minerva SoC
SRAM, size of the full VCD dump is reduced by ~65%, and throughput
is increased by ~55%.
Aggressive wire localization and inlining is necessary for CXXRTL to
achieve high performance. However, that comes with a cost: reduced
debug information coverage. Previously, as a workaround, the `-Og`
option could have been used to guarantee complete coverage, at a cost
of a significant performance penalty.
This commit introduces debug information outlining. The main eval()
function is compiled with the user-specified optimization settings.
In tandem, an auxiliary debug_eval() function, compiled from the same
netlist, can be used to reconstruct the values of localized/inlined
signals on demand. To the extent that it is possible, debug_eval()
reuses the results of computations performed by eval(), only filling
in the missing values.
Benchmarking a representative design (Minerva SoC SRAM) shows that:
* Switching from `-O4`/`-Og` to `-O6` reduces runtime by ~40%.
* Switching from `-g1` to `-g2`, both used with `-O6`, increases
compile time by ~25%.
* Although `-g2` increases the resident size of generated modules,
this has no effect on runtime.
Because the impact of `-g2` is minimal and the benefits of having
unconditional 100% debug information coverage (and the performance
improvement as well) are major, this commit removes `-Og` and changes
the defaults to `-O6 -g2`.
We'll have our cake and eat it too!
"Elision" in this context is an unusual and not very descriptive term
whereas "inlining" is common and straightforward. Also, introducing
"inlining" makes it easier to introduce its dual under the obvious
name "outlining".
Before this commit, a cell's input was always assigned like:
p_cell.p_input = (value...);
If `p_input` is buffered (e.g. if the design is built at -O0), this
is not correct. (In practice, this breaks clocking.) Unfortunately,
the incorrect design was compiled without diagnostics because wire<>
was move-assignable and also implicitly constructible from value<>.
After this commit, cell inputs are no longer incorrectly assumed to
always be unbuffered, and wires are not assignable from values.
RTL contract violations and C++ contract violations are different:
the former depend on the netlist and will never violate memory safety
whereas the latter may. When loading a CXXRTL simulation into another
process, RTL contract violations should generally not crash it, while
C++ contract violations should.
Although it is always possible to destroy and recreate the design to
simulate a power-on reset, this has two drawbacks:
* Black boxes are also destroyed and recreated, which causes them
to reacquire their resources, which might be costly and/or erase
important state.
* Pointers into the design are invalidated and have to be acquired
again, which is costly and might be very inconvenient if they are
captured elsewhere (especially through the C API).
In most cases, a CXXRTL simulation would use a top module, either
because this module serves as an entry point to the CXXRTL C API,
or because the outputs of a top module are unbuffered, improving
performance. Taking this into account, the CXXRTL backend now runs
`hierarchy -auto-top` if there is no top module. For the few cases
where this behavior is unwanted, it now accepts a `-nohierarchy`
option.
Fixes#2373.
This can be useful to determine whether the wire should be a part of
a design checkpoint, whether it can be used to override design state,
and whether driving it may cause a conflict.
Before this commit, the meaning of "sync def" included some flip-flop
cells but not others. There was no actual reason for this; it was
just poorly defined.
After this commit, a "sync def" means that a wire holds design state
because it is connected directly to a flip-flop output, and may never
be unbuffered. This is not affected by presence of async inputs.
This can be useful to distinguish e.g. a combinatorially driven wire
with type `CXXRTL_VALUE` from a module input with the same type, as
well as general introspection.
Without unbuffering output wires of, at least, toplevel modules, it
is not possible to have most designs that rely on IO via toplevel
ports (as opposed to using exclusively blackboxes) converge within
one delta cycle. That seriously impairs the performance of CXXRTL.
This commit avoids unbuffering outputs of all modules solely so that
in future, CXXRTL could gain fully separate compilation, and not for
any present technical reason.
With this change, it is easier to see which signals carry state (only
wire<>s appear as `reg` in VCD files) and to construct a minimal
checkpoint (CXXRTL_WIRE debug items represent the canonical smallest
set of state required to fully reconstruct the simulation).
Although logically two separate steps, these were treated as one for
historic reasons. Splitting the two makes it possible to have designs
that are only 2× slower than fastest possible (and are without extra
delta cycles) that allow probing all public wires.
Historically, elision was implemented before localization, so levels
with elision are lower than corresponding levels with localization.
This is unfortunate for two reasons:
1. Elision is a logical subset of localization, since it equals to
not giving a name to a temporary.
2. "Localize" currently actually means "unbuffer and localize",
and it would be useful to split those steps (at least for
public wires) for improved design visibility.
Although these options can be thought of as optimizations, they are
essentially orthogonal to the core of -O, which is managing signal
buffering and scope. Going from -O4 to -O2 means going from limited
to complete design visibility, yet in both cases proc and flatten
are desirable.