Repository to build firmware for a demonstrator of the Correlator Layer 1 in the endcap region (1.5 < |η| < 2.5 for the moment, 2.5 < |η| < 3.0 to be added later)
This uses the git submodule feature to get the dependencies, e.g. the PF & regionizer code.
- you should provide a
buildToolSetup.sh
to setup the environment for Vivado - a project can be created with
./setupProject.sh project_name
(see below for the list of projects) - then, one can compile the firmware with
./buildFirmware.sh project_name
If you have problems with setting up ipbb with python 2.7, do the following once:
rm -r ipbb-master/venv/ipbb
- log a lxplus7 node, with a clean environment
source ipbb-master/env.sh
after that, the environment should work also on other machines.
For all designs, PF runs with 9 phi regions with 0.25 rad overlap
This is the simplest full regionizer algorithm
- all inputs are at TM6, and are already in the 64 bit format (but muons are in local coordinates)
- the tracker has 9 sectors, with 2 "fibers"/sector giving 1 track/clock, with sector-local η,φ coordinates
- HGCal has 3 sectors with 4 fibers/sector giving 1 track/clock, with sector-local η,φ coordinates
- the muon system sends muons globally with 2 muons / clock, in global coordinates
- the regionizer waits for 54 clocks to read all inputs, then outputs the sorted list of the best 30 tracks, 20 calo and muons for each region, sending out all objects of the region in parallel and keeping them stable for 6 clocks before moving on with the next region.
Resource usage (emp framework, payload only):
Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|
52209(4.42%) | 51912(4.39%) | 0(0.00%) | 297(0.05%) | 111220(4.70%) | 132(6.11%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
This differs from regionizer_mux
in only one simple aspect: for each region, instead of outputing the sorted list of 30/20/4 tracks/calo/muons and keeping them constant for 6 clocks, it will stream the objects in those 6 clocks.
So, the output is 5 tracks, 4 calo, 1 muon per clock cycle.
- At the 1st clock cycle of a region it will output: tracks 0, 6, 12, 18, 24; calo 0, 6, 12, 18; muon 0
- At the 2nd clock cycle it will output: tracks 1, 7, 13, 19, 25; calo 1, 7, 13, 19; muon 1
- At the 3rd clock cycle it will output: tracks 2, 8, 14, 20, 26; calo 2, 8, 14 plus one null calo; muon 2
- At the 6th and last clock cycle it will output: tracks 5, 11, 17, 23, 29; calo 5, 11, 17 plus one null; a null muon
Resource usage (emp framework, payload only):
Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|
53822(4.55%) | 53525(4.53%) | 0(0.00%) | 297(0.05%) | 106097(4.49%) | 132(6.11%) | 0(0.00%) | 0(0.00%) | 0(0.00%) |
These are used to test the routing and timing for a single HLS IP core
This setup runs the PF block at full 360 MHz.
- the IP core for PF can be build with
make_hls_cores.sh pfHGCal_2p2ns_ii6
(which runsl1pf_hls/run_hls_pfalgo2hgc_2p2ns_II6.tcl
)
Resource usage (emp framework)
Instance | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|---|
top | 89199(7.54%) | 81938(6.93%) | 1754(0.30%) | 5507(0.93%) | 113139(4.78%) | 96(4.44%) | 896(20.74%) | 0(0.00%) | 440(6.43%) |
payload | 39331(3.33%) | 34268(2.90%) | 0(0.00%) | 5063(0.86%) | 60076(2.54%) | 0(0.00%) | 0(0.00%) | 0(0.00%) | 440(6.43%) |
TODO:
- the design has routing & timing issues, even when running with reduced inputs (20 tracks, 12 calo) and disabling the reset signal (which is a major offender in the timing)
- the design is importing the IP core VHDL files directly instead of importing the core itself
This setup runs the PF block at 240 MHz by using dual-clock BRAM FIFOs to bridge the 360 to 240 MHz clock domains
- the IP core for PF can be build with
make_hls_cores.sh pfHGCal_240MHz_ii4
(which runsl1pf_hls/run_hls_pfalgo2hgc_3ns_II4.tcl
) - the design has routing & timing challenges, but it does succeeed when tying the reset signal to zero.
- IP core latency: 41 clock
Resource usage (emp framework)
Instance | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|---|
top | 136370(11.53%) | 123866(10.48%) | 1754(0.30%) | 10750(1.82%) | 165943(7.02%) | 124(5.74%) | 896(20.74%) | 0(0.00%) | 750(10.96%) |
payload | 86489(7.32%) | 76183(6.44%) | 0(0.00%) | 10306(1.74%) | 112880(4.77%) | 28(1.30%) | 0(0.00%) | 0(0.00%) | 750(10.96%) |
TODO:
- the design is importing the IP core VHDL files directly instead of importing the core itself
This setup runs the mux regionizer + the PF and Puppi at 360 MHz with II=6 (same clock as the regionizer)
- the EMP input pattern files can be generated with
l1pf_hls/multififo_regionizer/run_hls_csim_pf_puppi.tcl
- the IP core for PF can be build with
make_hls_cores.sh pfHGCal_360MHz_ii6
andmake_hls_cores.sh puppiHGCal_360MHz_ii6
, which runl1pf_hls/run_hls_pfalgo2hgc_360MHz_II6.tcl
andl1pf_hls/puppi/run_hls_linpuppi_hgcal_360MHz_II6.tcl
A vhdl testbench simulation in vivado can be run with test/run_vhdltb.sh
run with mux-pf-puppi
as argument.
- The first PF & Puppi outputs arrive at frames 111 and 169 in the testbench output, compared to 54 in the reference from HLS (HLS has an ideal 54 clock cycle latency for the regionizer, to stream in the inputs, and zero latency for PF & Puppi)
- For a reduced set of inputs (20 tracks, 12 calo) the frames become 106 and 153 for PF and puppi
TODO:
- Implementation in the EMP framework has routing and timing problems
- The design is somewhat wasteful in terms of resources for delaying the tracks & PV for puppi: it's using one BRAM36 for each track while in principle one could just use NTRACKS / II BRAMs, and uses a full BRAM36 for the PV Z where a BRAM18 would have been sufficient
- The VHDL testbench uses the VHDL output files from the IP core synthesis directly instead of importing the IP core
This setup runs the streaming regionizer at 360 MHz, transfers the data to the 240 MHz clock domain and runs the PF and Puppi with II=4, and then crosses the data back
- the EMP input pattern files can be generated with
l1pf_hls/multififo_regionizer/run_hls_csim_pf_puppi.tcl
- the IP core for PF can be build with
make_hls_cores.sh pfHGCal_240MHz_ii4
andmake_hls_cores.sh puppiHGCal_240MHz_ii4
, which runl1pf_hls/run_hls_pfalgo2hgc_240MHz_II4.tcl
andl1pf_hls/puppi/run_hls_linpuppi_hgcal_240MHz_II4.tcl
- clock domain crossings are done with dual-clock BRAM36 used in native FIFO mode.
- the reading of the output in the 360 MHz domain is synchronized to a delayed version of the start of writing inputs from the 360 MHz domain, so that the latency in the 360 MHz domain is fixed irrespectively of the phase between the two clocks and the time it takes to make the two clock domain crossings. The price for this is that the design is conservative on the latency, potentially waits a bit more before starting to read the outputs.
A vhdl testbench simulation in vivado can be run with test/run_vhdltb.sh
run with stream-cdc-pf-puppi
as argument.
- The first PF & Puppi outputs arrive at frames 185 and 233 in the testbench output, compared to 54 in the reference from HLS (HLS has an ideal 54 clock cycle latency for the regionizer, to stream in the inputs, and zero latency for PF & Puppi)
Resource usage from emp framework (might be slightly out of date):
Instance | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|---|
top | 252272(21.34%) | 238213(20.15%) | 1754(0.30%) | 12305(2.08%) | 404120(17.09%) | 300(13.89%) | 906(20.97%) | 0(0.00%) | 1245(18.20%) |
payload | 202393(17.12%) | 190532(16.12%) | 0(0.00%) | 11861(2.00%) | 351057(14.85%) | 204(9.44%) | 10(0.23%) | 0(0.00%) | 1245(18.20%) |
TODO:
- Implementation in the EMP framework does not meet timing, but we know that e.g. the input link assignment to quads is not good (this is fixed in the later designs).
- The design is somewhat wasteful in terms of resources for delaying objects
- The parallel FIFOs importing data in the 240 MHz are assumed to be all in phase, and so the 240 MHz processing starts as soon as one FIFO becomes readable. It may be safer to wait until all FIFOs are non-empty before staring to read.
- There's a lot of code duplication for the various CDCs and serial to parallel transitions (this is somewhat reduced in the later designs)
- The VHDL testbench uses the VHDL output files from the IP core synthesis directly instead of importing the IP core.
tdemux_regionizer_cdc_pf_puppi
: time demultiplexer + trivial decoder + regionizer_stream
+ PF@240 + Puppi@240
This setup starts from regionizer_stream_cdc_pf_puppi
and makes the inputs more realistic:
- tracks come at TMUX 18 with 1 fiber per sector per time slice, and 3 input times slices (T0, T0+6, T0+12). Track objects are 96 bit long.
- hgcal comes at TMUX 18 with 4 fiber per sector per time slice, and 3 input times slices (T0, T0+6, T0+12). Calo objects are 128 bit long, and arrive at 16 G so with 1 null word after 2 valid ones.
- muons come at TMUX 18 with 1 fiber per time slice and 3 input time slices (T0, T0+6, T0+12). Muons are 128 bit long. In all cases the "longer" objects are made just by padding our existing 64 bit objecs with zeros so the decoding to 64 bit is trivial, but the decoding is anyway performed with an HLS IP core so a more complex version of it should be straightfoward to use.
Compared to regionizer_stream_cdc_pf_puppi
:
- the EMP input pattern files can be generated with
l1pf_hls/multififo_regionizer/run_hls_csim_pf_puppi_tm18.tcl
- this needs additional IP cores that can be build with
make_hls_cores.sh tdemux
andmake_hls_cores.sh unpackers
.
A vhdl testbench simulation in vivado can be run with test/run_vhdltb.sh
run with tdemux-stream-cdc-pf-puppi
as argument.
- For the full set of inputs (30 tracks, 20 calo), the first PF & Puppi outputs arrive at frames 297 and 345 in the testbench output, compared to 54 in the reference from HLS (HLS has an ideal 54 clock cycle latency for the regionizer, to stream in the inputs, and zero latency for PF & Puppi). This estimate of the latency is conservative, as the code in the 360 MHz domain waits more before starting to read from the FIFOs.
Resource usage from emp framework (might be slightly out of date):
Instance | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|---|
top | 264567(22.38%) | 250508(21.19%) | 1754(0.30%) | 12305(2.08%) | 435270(18.41%) | 366(16.94%) | 906(20.97%) | 0(0.00%) | 1245(18.20%) |
payload | 214672(18.16%) | 202811(17.15%) | 0(0.00%) | 11861(2.00%) | 382207(16.16%) | 270(12.50%) | 10(0.23%) | 0(0.00%) | 1245(18.20%) |
TODO:
- Implementation in the EMP framework does not meet timing
tdemux_regionizer_cdc_pf_puppi_stream
: time demultiplexer + trivial decoder + regionizer_stream
+ PF@240 + Puppi stream@240
The change wrt tdemux_regionizer_cdc_pf_puppi
is that it uses a streaming implementation of Puppi, with 3 components:
- a chs component that takes a single PF charged candidate and the PV, computes the compatibility and returns in output the PF candidate charged or zero
- a prepare component that takes as input a track and the PV, and saves an object with pt, eta and a weight equal to pt^2 if the track is PV-compatible and zero otherwise
- a neutral component that takes as input a single PF neutral candidate and a list of objects from the prepare above, and outputs the puppi candidate (or a null candidate) all the components are pipelined at II=1.
The whole Puppi logic is implemented with the above 3:
- PF Charged particles serialized into NTrack/II streams and are then processed by a set of parallell instances of the "chs" component before being inputed in the CDC logic
- A copy of the input tracks is made before the serial-to-parallel conversion, and they delayed and then processed by the prepare component, and then converted from serial to parallel
- PF neutral particles are serialized, and then processed together with the prepared objects before going into the CDC logic
The implementation requires to compile the new cores for the streaming puppi with make_hls_cores.sh puppiHGCal_240MHz_stream
.
The full set of cores for this design is pfHGCal_240MHz_ii4 puppiHGCal_240MHz_stream tdemux unpackers
The input pattern files are produced with l1pf_hls/multififo_regionizer/run_hls_csim_pf_puppi_tm18.tcl
, as before.
A VHDL testbench implementation can be run with run_vhdltb.sh tdemux-stream2-cdc-pf-puppi
:
- For the full set of inputs (30 tracks, 20 calo), the first PF & Puppi outputs arrive at frames 297 and 345 in the testbench output, compared to 54 in the reference from HLS (HLS has an ideal 54 clock cycle latency for the regionizer, to stream in the inputs, and zero latency for PF & Puppi). This estimate of the latency is largely conservative, more than the previous design, as it's using the latency of the traditional Puppi algorithm which is longer.
Resource usage from emp framework (might be slightly out of date):
Instance | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|---|
top | 216875(18.34%) | 202242(17.11%) | 1754(0.30%) | 12879(2.18%) | 378775(16.02%) | 344(15.93%) | 981(22.71%) | 0(0.00%) | 1223(17.88%) |
payload | 166943(14.12%) | 154508(13.07%) | 0(0.00%) | 12435(2.10%) | 325713(13.78%) | 248(11.48%) | 85(1.97%) | 0(0.00%) | 1223(17.88%) |
TODO:
- In order to complete the implementation successfully, one has to modify emp-fwk/boards/vcu118/firmware/ucf/xdma_constraints.tcl changing
LOC PCIE40E4_X1Y0
toLOC PCIE40E4_X1Y2
and commenting out the threeset_property USER_CLOCK_ROOT X5Y0
. This is being followed up with EMP framework authors - possibly recover some clock cycles
tdemux_regionizer_cdc_pf_puppi_stream_sort
: time demultiplexer + trivial decoder + regionizer_stream
+ PF@240 + Puppi stream@240 + final sort
The change wrt tdemux_regionizer_cdc_pf_puppi_stream
is to add a final sorting of the Puppi candidates to select the best 18 ones. The sorting uses BitonicSort from the Rufl library.
The full set of cores for this design is the same as before, pfHGCal_240MHz_ii4 puppiHGCal_240MHz_stream tdemux unpackers
.
The input pattern files are produced with l1pf_hls/multififo_regionizer/run_hls_csim_pf_puppi_tm18.tcl
, as before.
A VHDL testbench implementation can be run using Modelsim with run_vhdltb.sh -vsim tdemux-stream2-cdc-pf-puppi-sort
, :
- For the full set of inputs (30 tracks, 20 calo), the first PF & Puppi outputs arrive at frames 297 and 366 in the testbench output, compared to 54 in the reference from HLS (HLS has an ideal 54 clock cycle latency for the regionizer, to stream in the inputs, and zero latency for PF & Puppi).
Resource usage from emp framework:
Instance | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | URAM | DSP48 Blocks |
---|---|---|---|---|---|---|---|---|---|
top | 242042(20.47%) | 227361(19.23%) | 1754(0.30%) | 12927(2.18%) | 419909(17.76%) | 344(15.93%) | 981(22.71%) | 0(0.00%) | 1223(17.88%) |
payload | 187162(15.83%) | 174679(14.78%) | 0(0.00%) | 12483(2.11%) | 363038(15.35%) | 248(11.48%) | 85(1.97%) | 0(0.00%) | 1223(17.88%) |
TODO:
- As for the previous design, in order to complete the implementation successfully, one has to modify emp-fwk/boards/vcu118/firmware/ucf/xdma_constraints.tcl changing
LOC PCIE40E4_X1Y0
toLOC PCIE40E4_X1Y2
and commenting out the threeset_property USER_CLOCK_ROOT X5Y0
. This is being followed up with EMP framework authors on p2-xware gitlab.