Error happens when two threads co-execute together #47
Comments
I'm not sure I follow your question sorry. SMT threads have independent GPRs so they won't affect each other. |
Different SMT threads share the same GPR write port. Normal ALU instructions from different threads do not contend, since only one instruction can issue per cycle and all writeback (WB) happens after EX6. However, a slowspr read returns its data later, which causes contention when an mfspr and an ALU instruction want to use the write port in the same cycle. |
Sorry, I'm still a bit lost. Are you concerned about a performance or a functional issue? Certainly SMT threads can affect each other's performance. They should not affect each other functionally. |
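To make the contention concrete, here is a minimal toy model of a single shared GPR write port. It is only a sketch: the latency values and the names `ALU_WB_LATENCY`, `SLOWSPR_EXTRA_LATENCY`, `writeback_cycle`, and `find_collisions` are assumptions for illustration, not taken from the A2 RTL or manual.

```python
from collections import defaultdict

ALU_WB_LATENCY = 6         # assumed: cycles from issue to GPR writeback for ALU ops
SLOWSPR_EXTRA_LATENCY = 4  # assumed: extra cycles a slow SPR read needs (made-up value)


def writeback_cycle(issue_cycle, kind):
    """Cycle in which an op wants the shared GPR write port."""
    if kind == "alu":
        return issue_cycle + ALU_WB_LATENCY
    if kind == "mfspr_slow":
        return issue_cycle + ALU_WB_LATENCY + SLOWSPR_EXTRA_LATENCY
    raise ValueError(f"unknown op kind: {kind}")


def find_collisions(issued):
    """issued: list of (issue_cycle, thread, kind), at most one op per cycle.

    Returns {writeback_cycle: [(thread, kind), ...]} for every cycle in which
    more than one op wants the write port.
    """
    port = defaultdict(list)
    for issue_cycle, thread, kind in issued:
        port[writeback_cycle(issue_cycle, kind)].append((thread, kind))
    return {cycle: ops for cycle, ops in port.items() if len(ops) > 1}


# T0 issues a slow mfspr, then T1 issues one ALU op per cycle.
trace = [
    (0, "T0", "mfspr_slow"),
    (1, "T1", "alu"),
    (2, "T1", "alu"),
    (3, "T1", "alu"),
    (4, "T1", "alu"),
]
print(find_collisions(trace))
```

With these assumed latencies, the mfspr issued in cycle 0 and the ALU op issued in cycle 4 both want the write port in cycle 10, even though only one instruction issued per cycle.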
Are you saying T0 mfspr issues before T1 issues ALU op, but the T0/T1 GPR writes are concurrent because of extra cycles in the mfspr pipe? I think that would either be dealt with at issue time or more likely as a resource collision T1 flush when it is detected. |
Why is SMT required? Isn't a single thread susceptible to the same condition? |
I didn't see the case in Table D-5, but D.4.6 (p.843) appears to cover it as a stall:
|
Hi mikey and openpowerwtf, thanks for the reply! As mentioned in D.4.6 (p.843), a slow SPR read only blocks instructions following it within the same thread. That's why a single thread does not have the problem. However, other threads can still issue ALU instructions, which causes contention between different threads. |
Where does it say that? I assumed 'any' did not imply 'any on this thread'. There are clearly interthread resource collisions for other ops like mul/div and they are described in Appendix D also. |
Hi openpowerwtf, I verified this through simulation. In fact, A2 uses a "hole" in the issue unit for the div instruction to avoid the interthread resource collisions. Note that A2 also tries to use a "hole" here for the slowspr read. Please see 14.5.130 XUCR0 - Execution Unit Configuration Register 0, 43:47 SSDLY. Unfortunately, using the default 7 cycles does not avoid the collision. My question is: where do the 7 cycles come from? I guess you changed the pipeline a bit when releasing the A2I code, so 7 cycles no longer works. |
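The following sketch illustrates why a hole only helps when its delay matches the real latency gap. It is a toy model under stated assumptions, not the A2 implementation: it assumes the hole is a single issue bubble placed a configurable number of cycles after the slow SPR read issues, and the latency numbers and the names `collides` and `required_hole_delay` are hypothetical.

```python
ALU_LATENCY = 6  # assumed: cycles from issue to GPR writeback for ALU ops


def required_hole_delay(slow_latency, alu_latency=ALU_LATENCY):
    """Issue cycle (relative to the mfspr) that must stay empty so that no
    ALU op reaches the write port together with the slow SPR data."""
    return slow_latency - alu_latency


def collides(hole_delay, slow_latency, alu_latency=ALU_LATENCY):
    """True if an ALU op from another thread shares a writeback cycle with
    the slow SPR read.

    Model: the mfspr issues in cycle 0 and writes back in cycle slow_latency.
    Another thread issues one ALU op in every later cycle, except in cycle
    hole_delay, which is the hole.
    """
    for issue in range(1, slow_latency + 1):
        if issue == hole_delay:
            continue  # nothing issues in the hole
        if issue + alu_latency == slow_latency:
            return True
    return False


SLOW_LATENCY = 14  # made-up value, chosen only to show a mismatch
print(required_hole_delay(SLOW_LATENCY))  # 8
print(collides(7, SLOW_LATENCY))          # True: hole is one cycle too early
print(collides(8, SLOW_LATENCY))          # False: hole lines up with the writeback
```

In this model a hole that is off by even one cycle leaves the collision in place, which is the kind of mismatch that would show up if the configured delay and the actual pipeline depth disagree.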
No @Grubby-CPU , I did not change any functional source code except to get it to compile with Vivado (not intentionally 😨 ). But it IS possible there were configuration bits changed by firmware/boot code/etc. in real systems. Usually the default is 'good', but not guaranteed (late bug discovery, different core/system modes, etc.). It is interesting there are config bits for this exact case. The RO/config ring note means the setting could only be changed 'externally' - not by processor code directly. In xuq_dec_sspr.vhdl, it looks like each thread does the slowspr check and then it becomes a global
|
Hi Openpowerwtf,
|
Good info! So I get from your analysis:
The solution(?): I don't know why the stall and hole are implemented slightly differently, but would guess it's possible the single-thread case was designed in from the start and the multithread case 'evolved' during sim verification. There could also have been timing issues. And there are no performance implications for a normal stream. |
Cool!
|
Following up on Q1: I guess this is because A2 is an in-order core, so while a slowspr read has not yet returned its data, the later instructions are stalled to preserve in-order execution? |
The 'slow spr' distinction is because reading those regs isn't important for performance, and the regs are likely placed throughout the core since they are often specific to certain macros/functions. Adding latency simplifies the wiring/timing requirements to funnel the data to the GPR; but this now adds some extra control complexity to handle the special case. The stall is to prevent the hardware resource (GPR write port) collision, like you are seeing. You could handle some cases by finishing out-of-order and adding control complexity. But this case could also happen in out-of-order pipes. If ops have different latencies to a shared resource like the writeback bus, they have to be scheduled/stalled to avoid collisions. |
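The scheduling idea in the comment above can be sketched with a small reservation table: an op is held back whenever its writeback cycle on the shared bus is already taken. This is a generic illustration, not how A2 does it; the `schedule` function and the `LATENCY` values are assumptions.

```python
def schedule(ops, latency):
    """In-order issue with a reservation table for one shared writeback bus.

    ops: list of (thread, kind) in requested issue order.
    latency: dict mapping kind -> cycles from issue to writeback.
    Returns a list of (issue_cycle, thread, kind, writeback_cycle).
    """
    reserved = set()  # writeback cycles already claimed
    plan = []
    cycle = 0
    for thread, kind in ops:
        # Stall until this op's writeback slot is free; each stalled cycle
        # is a hole in the issue stream.
        while cycle + latency[kind] in reserved:
            cycle += 1
        wb = cycle + latency[kind]
        reserved.add(wb)
        plan.append((cycle, thread, kind, wb))
        cycle += 1
    return plan


LATENCY = {"alu": 6, "mfspr_slow": 10}  # assumed values, for illustration only

ops = [("T0", "mfspr_slow")] + [("T1", "alu")] * 5
for entry in schedule(ops, LATENCY):
    print(entry)
```

With these numbers the ALU op that would have issued in cycle 4 is delayed to cycle 5, so the slow SPR read keeps writeback cycle 10 to itself and every other op gets a distinct slot.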
Thanks for the detailed explanation of slow SPRs! For the single-thread slow SPR read collision, I now think it can also be avoided by using the 'hole' created by the slowspr_need_hole signal. In that case, we would not necessarily need the original barrier logic, i.e., D.4.6 (p.843), which blocks the following instructions within the thread. Is it OK to do this? Any comments? :) |
Hi Guys,
I have two threads executing together. One thread, T0, executes mfspr to read values from a slowspr into a GPR, and the other thread, T1, executes normal ALU instructions. I found this can cause errors, since the slowspr can return its value late while mfspr and ALU instructions share the same datapath to the GPR. In other words, T0 and T1 can write the GPR at the same time over the same datapath and control path. Any comments about this?
Many thanks