This can result in a massive reduction in runtime, up to 50% depending
on the workload. Currently, people work around this with
`-mllvm -inline-threshold=` (when using clang++), but this solution is
more portable.
Without unbuffering the output wires of at least the toplevel module,
most designs that perform I/O through toplevel ports (as opposed to
exclusively through black boxes) cannot converge within one delta
cycle, which seriously impairs the performance of CXXRTL.
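As a rough illustration (a sketch only; the struct and port names below are made up, not verbatim backend output), a buffered output is a `wire<>` whose `next` part is written by eval() and only becomes observable as `curr` after commit(), whereas an unbuffered output is a plain `value<>` updated in place:

```cpp
// Sketch only; names are illustrative, and the include path has differed
// between Yosys versions.
#include <cxxrtl/cxxrtl.h>

struct top_buffered {
    /*output*/ cxxrtl::wire<8> p_led;   // double-buffered: eval() writes p_led.next,
                                        // and the testbench sees p_led.curr only after
                                        // commit(), i.e. after another delta cycle
};

struct top_unbuffered {
    /*output*/ cxxrtl::value<8> p_led;  // single copy: the value written during eval()
                                        // is immediately visible through the port
};
```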
This commit avoids unbuffering the outputs of all modules solely so
that CXXRTL could gain fully separate compilation in the future, and
not for any present technical reason.
With this change, it is easier to see which signals carry state (only
wire<>s appear as `reg` in VCD files) and to construct a minimal
checkpoint (CXXRTL_WIRE debug items represent the canonical smallest
set of state required to fully reconstruct the simulation).
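For instance, a minimal checkpoint could be built along these lines (a sketch assuming the C++ debug API of this era, where `debug_items::table` maps names to item parts and `curr` points to 32-bit chunks; exact member names and signatures may differ between versions):

```cpp
#include <cxxrtl/cxxrtl.h>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Sketch: save only WIRE items, which hold the canonical state; VALUE and
// ALIAS items are derived from it. (Memories would be handled similarly,
// using `depth` to account for all rows.)
std::map<std::string, std::vector<uint32_t>>
checkpoint(const cxxrtl::debug_items &items)
{
    std::map<std::string, std::vector<uint32_t>> state;
    for (auto &entry : items.table)
        for (auto &part : entry.second) {
            if (part.type != cxxrtl::debug_item::WIRE)
                continue;
            size_t chunks = (part.width + 31) / 32; // curr points at 32-bit chunks
            auto &bits = state[entry.first];
            bits.insert(bits.end(), part.curr, part.curr + chunks);
        }
    return state;
}
```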
Although logically two separate steps, these were treated as one for
historic reasons. Splitting them makes it possible to have designs
that allow probing all public wires while running only 2× slower than
the fastest possible configuration and adding no extra delta cycles.
Historically, elision was implemented before localization, so levels
with elision are lower than corresponding levels with localization.
This is unfortunate for two reasons:
1. Elision is a logical subset of localization, since it amounts to
not giving a name to a temporary.
2. "Localize" currently actually means "unbuffer and localize",
and it would be useful to split those steps (at least for
public wires) for improved design visibility.
Although these options can be thought of as optimizations, they are
essentially orthogonal to the core of -O, which is managing signal
buffering and scope. Going from -O4 to -O2 means going from limited
to complete design visibility, yet in both cases proc and flatten
are desirable.
Before this commit, Verilog expressions like `x && 1` would result in
references to `logic_and_us` in the generated CXXRTL code, which would
not compile. After this commit, since such cells behave the same
regardless of their signedness attributes, the signedness is ignored,
which also reduces template instantiation pressure.
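The idea can be sketched as follows (a hypothetical helper, not the actual backend code; the name `binop_suffix` is made up):

```cpp
#include <string>

// For cells such as $logic_and, $logic_or and $logic_not, the result depends
// only on whether the operands are nonzero, so the unsigned-unsigned helper
// (e.g. logic_and_uu) is always emitted and no _us/_su/_ss variants need to
// be instantiated.
std::string binop_suffix(bool a_signed, bool b_signed, bool signedness_matters)
{
    if (!signedness_matters)
        return "uu";
    return std::string(a_signed ? "s" : "u") + (b_signed ? "s" : "u");
}
```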
This commit changes the VCD writer so that any signal with
`debug_item.type == VALUE && debug_item.next == nullptr` is sampled
only once.
Commit f2d7a187 added more debug information by including constant
wires, and decreased the performance of the VCD writer proportionally
because the constant wires were still being sampled repeatedly; this
commit eliminates that performance hit.
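The check itself is simple; roughly (a sketch using the debug item fields mentioned above):

```cpp
#include <cxxrtl/cxxrtl.h>

// A VALUE item without a `next` part can never change after initialization,
// so the VCD writer samples it exactly once.
static bool is_constant(const cxxrtl::debug_item &item)
{
    return item.type == cxxrtl::debug_item::VALUE && item.next == nullptr;
}
```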
Constant wires can represent a significant chunk of the design in
generic designs or after optimization. Emitting them in VCD files
significantly improves usability, because gtkwave removes any traces
that are no longer present in the VCD file after a reload, and
iterative development suffers if turning a varying signal into a
constant disrupts the workflow.
This commit changes the VCD writer so that all signals sharing the
same `debug_item.curr` are emitted under a single VCD identifier and
sampled only once.
Commit 9b39c6f7 added redundancy to debug information by including
alias wires, and increased the size of VCD files proportionally; this
commit eliminates the redundancy from VCD files so that their size
is the same as before.
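The deduplication can be sketched as keying emitted VCD identifiers by the shared `curr` pointer (illustrative only; the real writer's bookkeeping differs, and `vcd_ident_cache`/`encode` are made-up names):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Sketch: debug items that alias the same storage share their `curr` pointer,
// so keying identifiers by that pointer lets every alias reuse one VCD code
// and lets the writer sample the underlying value only once.
struct vcd_ident_cache {
    std::map<const uint32_t *, std::string> by_storage;
    size_t next_index = 0;

    std::string identifier_for(const uint32_t *curr) {
        auto it = by_storage.find(curr);
        if (it != by_storage.end())
            return it->second;            // alias: reuse the existing identifier
        std::string ident = encode(next_index++);
        by_storage[curr] = ident;
        return ident;
    }

    // Minimal base-94 encoding over the printable ASCII range allowed by VCD.
    static std::string encode(size_t index) {
        std::string ident;
        do {
            ident += char(33 + index % 94);
            index /= 94;
        } while (index != 0);
        return ident;
    }
};
```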
Alias wires can represent a significant chunk of the design in highly
hierarchical designs; in Minerva SRAM, there are 273 member wires and
527 alias wires. Showing them in every hierarchy level significantly
improves usability.
Compared to the C++ API, the C API currently has two limitations:
1. Memories cannot be updated in a race-free way.
2. Black boxes cannot be implemented in C.
Debug information describes values, wires, and memories with a simple
C-compatible layout. It can be emitted on demand into a map, which has
no runtime cost when unused and allows late-bound designs.
The `hdlname` attribute is used as the lookup key such that original
names, as emitted by the frontend, can be used for debugging and
introspection.
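A sketch of how this looks from the C++ side (the `debug_info()` call shown assumes the signature of this era, which has changed since; the printing is only there to show the shape of the map):

```cpp
#include <cxxrtl/cxxrtl.h>
#include <cstdio>

// Sketch: the debug map is emitted on demand, so a design that never calls
// debug_info() pays nothing for it. Items are keyed by the names the frontend
// emitted (via hdlname), so lookups use original RTL names.
void dump_debug_names(cxxrtl::module &top)
{
    cxxrtl::debug_items items;
    top.debug_info(items);
    for (auto &entry : items.table)
        std::printf("%s (%zu parts)\n", entry.first.c_str(), entry.second.size());
}
```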
This isn't actually necessary anymore after scheduling was improved,
and `clean -purge` disrupts the mapping between wires in the input
RTLIL netlist and the output CXXRTL code.
log_assert(false) never returns and thus can't fall through, but gcc
doesn't seem to think that far. Making it the last case avoids the
problem entirely.
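For illustration (hypothetical names; plain `assert` stands in for Yosys's `log_assert`):

```cpp
#include <cassert>

enum class Kind { Value, Unsupported };

static int emit_value() { return 0; }

int emit(Kind kind)
{
    switch (kind) {
    case Kind::Value:
        return emit_value();
    case Kind::Unsupported:
        // Placed last: nothing follows this case, so gcc has no fall-through
        // to warn about, even though it cannot tell the assertion never returns.
        assert(false);
    }
    return 0; // keeps -Wreturn-type quiet (and handles NDEBUG builds)
}
```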
Strategically inserting the pending memory write in memory::update to keep the
queue sorted allows us to skip the queue sort in memory::commit.
The Minerva SRAM SoC runs ~7% faster as a result.
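A sketch of the technique with standalone types (names and the simplified payload are illustrative; the real `memory<>` write queue stores more per entry):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of keeping the pending-write queue sorted as writes are enqueued,
// so that applying them in commit() needs no sort.
struct pending_write {
    size_t index;     // memory row being written
    uint64_t data;    // simplified payload
    size_t priority;  // later (higher-priority) ports override earlier ones
};

struct memory_sketch {
    std::vector<uint64_t> rows;
    std::vector<pending_write> queue;

    void update(size_t index, uint64_t data, size_t priority) {
        pending_write write{index, data, priority};
        // Insert at the position that keeps the queue ordered by priority
        // instead of sorting the whole queue once per commit().
        auto pos = std::upper_bound(queue.begin(), queue.end(), write,
            [](const pending_write &a, const pending_write &b) {
                return a.priority < b.priority;
            });
        queue.insert(pos, write);
    }

    bool commit() {
        bool changed = !queue.empty();
        for (const pending_write &write : queue) // already in priority order
            rows[write.index] = write.data;
        queue.clear();
        return changed;
    }
};
```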
This is quite possibly the worst way to implement this, but it does
work for a subset of well-behaved designs, and can be used to measure
how much performance is lost simulating the inactive edge of a clock.
It should be replaced with a clock tree analyzer generating safe
code once it is clear what such a thing should look like.
If the annotations are not used, this commit does not alter semantics
at all, other than removing elision of outputs of black box cells.
(Elision of such outputs is expected to be too rare to have any
noticeable benefit, and the implementation was somewhat of a hack.)
The (* cxxrtl.comb *) annotation alters the semantics of the output
of the black box it is applied to such that, if the black box
converges immediately, no additional delta cycle is necessary to
propagate the computed combinatorial value upwards in the hierarchy.
The (* cxxrtl.sync *) annotation alters the semantics of the output
of the black box it is applied to such that the wires connected to
this output no longer count as users of the black box, which breaks
false feedback arcs arising from conservative modeling of the black
box's dependencies.
Although these attributes are currently only recognized on black
boxes, a future separate compilation mode could also emit and
consume them.
The attribute for this is called (* cxxrtl.edge *), and there is
a planned attribute (* cxxrtl.sync *) that would cause blackbox
cell outputs to be added to sync defs rather than comb defs.
Rename the edge-detector-related code to avoid confusion.
If it is statically known that eval() will converge in one delta
cycle (that is, the second commit() will always return `false`)
because the design contains no feedback or buffered wires, then
there is no need to run the second delta cycle at all.
After this commit, the case where eval() always converges immediately
is detected and the second delta cycle is omitted. As a result,
Minerva SRAM SoC runs ~25% faster.
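A generic sketch of the delta-cycle loop this affects (not the verbatim generated code; `step` and `Module` are placeholders):

```cpp
#include <cstddef>

// Generic sketch: eval() recomputes the design, commit() makes buffered
// updates visible and reports whether anything changed, i.e. whether another
// delta cycle is required.
template<class Module>
size_t step(Module &module)
{
    size_t deltas = 0;
    do {
        module.eval();
        deltas++;
    } while (module.commit());   // loop until a commit changes nothing
    return deltas;
}

// When the design contains no feedback arcs and no buffered wires, the second
// commit() is statically known to return false, so the generated code can
// call eval() and commit() exactly once instead of looping.
```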