
\chapter{RISC-V 5-Stage Pipeline/Hazards}

\section{RISC-V Pipeline}
Pipelining does not reduce the \emph{latency} of a single instruction (the time from fetch to completion); it improves the \emph{throughput} of the whole workload (the number of instructions completed per unit of time). All pipeline stages take the same amount of time, one clock cycle, so the clock period is set by the slowest stage.
\begin{enumerate}
    \item Instruction Fetch: Send address to the instruction memory (IMEM) and read IMEM at that address.
    \item Register Decode: Generate control signals from the instruction bits, generate the immediate, and read registers from the RegFile.
    \item Execute: Perform ALU operations, and do branch comparison.
    \item Memory: Read from or write to the data memory (DMEM).
    \item Writeback: Write back either PC + 4, the result of the ALU operation, or data from memory to the RegFile.
\end{enumerate}
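
As a rough worked example of latency versus throughput, assume (purely for illustration) stage delays of 200, 100, 200, 200 and 100\,ps for the five stages above. A single-cycle design would need $200+100+200+200+100 = 800$\,ps per instruction. The pipelined clock period is set by the slowest stage, 200\,ps, so
\[
	t_{\mathrm{latency}} = 5 \times 200\,\mathrm{ps} = 1000\,\mathrm{ps},
	\qquad
	t_{\mathrm{total}}(N) = (N + 4) \times 200\,\mathrm{ps}.
\]
The latency of one instruction gets worse (1000\,ps instead of 800\,ps), but for $N = 1000$ instructions with no stalls the total time is $1004 \times 200\,\mathrm{ps} = 200{,}800$\,ps instead of $800{,}000$\,ps, a speedup of about $3.98$. The speedup approaches $800/200 = 4$ rather than 5 because the assumed stages are not perfectly balanced.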

\section{Hazards}
\subsection{Structural}
\begin{description}
	\item[Problem:] Two or more instructions in the pipeline compete for access to a single physical resource.
	\item[Solution 1:] Instructions take turns using the resource, so some instructions have to stall.
	\item[Solution 2:] Add more hardware to machine (e.g. support simultaneous memory accesses).
\end{description}
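
A sketch of where such a conflict would arise, assuming (hypothetically) a single memory with one port shared by instruction fetch and data access:
\begin{center}
\begin{tabular}{l|cccccccc}
	Cycle & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
	\hline
	\texttt{lw} & IF & ID & EX & \textbf{MEM} & WB & & & \\
	inst 2 & & IF & ID & EX & MEM & WB & & \\
	inst 3 & & & IF & ID & EX & MEM & WB & \\
	inst 4 & & & & \textbf{IF} & ID & EX & MEM & WB \\
\end{tabular}
\end{center}
In cycle 4 the load reads data while inst 4 is being fetched, so both need the memory at once. With separate IMEM and DMEM, as in the stage list of the previous section, the two accesses go to different memories and this particular conflict does not occur.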

\subsection{Data}
\begin{itemize}
\item Register:
\begin{description}
	\item[Problem:] What happens if a register is written and read in the same clock cycle by different instructions?
	\item[Solution:] The RegFile can be \emph{written} in the first half of the clock cycle and \emph{read} in the second half, so the read returns the newly written value and there is no hazard.
\end{description}

\item ALU Result:
\begin{description}
	\item[Problem:] An instruction depends on the result of the previous instruction.
	\item[Solution 1:] Stall by filling the pipeline with NOPs until the first instruction's ALU output is ready for the second instruction.
	\begin{itemize}
		\item A compiler that is aware of the pipeline structure can reorder code to avoid hazards and stalls.
	\end{itemize}
	\item[Solution 2:] Forwarding: grab the operand from a pipeline stage rather than from the register file.
	The ALU output is routed straight back into the ALU, without waiting for the register write.
	\begin{enumerate}
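		\item Detect the need for forwarding: compare \texttt{inst1.rd} with \texttt{inst2.rs1} and \texttt{inst2.rs2}; if either matches (and \texttt{inst1} actually writes a register other than \texttt{x0}), forward the value.
		\item The ALU output held in the pipeline register and the register write-back data are muxed into the ALU inputs.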
	\end{enumerate}
\end{description}

\item Load:
\begin{description}
	\item[Problem:] The loaded value is not available until the end of the load's memory stage, but the next instruction needs it at the start of its execute stage, i.e. we cannot forward backwards in time.
	\item[Solution:] Stall the dependent instruction for one cycle, then forward.
		The slot after a load is called the \emph{load delay slot}. If the instruction in that slot uses the result of the load, the hardware stalls for one cycle.
		\begin{itemize}
		    \item Put an unrelated instruction into the load delay slot $\implies$ no performance loss.
		\end{itemize}
\end{description}
\end{itemize}
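
The forwarding check and the load stall can be sketched as follows. The signal name \texttt{RegWEn} and the registers \texttt{t0} to \texttt{t3} are illustrative assumptions, not taken from a specific datapath. The first ALU operand of \texttt{inst2} would be forwarded when
\[
	(\mbox{\texttt{inst1.rd}} = \mbox{\texttt{inst2.rs1}})
	\;\land\; (\mbox{\texttt{inst1.rd}} \neq \mbox{\texttt{x0}})
	\;\land\; \mbox{\texttt{inst1.RegWEn}},
\]
with the same test on \texttt{inst2.rs2} for the second operand. The check against \texttt{x0} matters because \texttt{x0} always reads as zero, so a value ``written'' to it must never be forwarded. For a load followed immediately by a dependent instruction, the one-cycle stall would look like this:
\begin{center}
\begin{tabular}{l|ccccccc}
	Cycle & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
	\hline
	\texttt{lw t0, 0(t1)} & IF & ID & EX & MEM & WB & & \\
	\texttt{add t2, t0, t3} & & IF & ID & stall & EX & MEM & WB \\
\end{tabular}
\end{center}
The loaded value leaves DMEM at the end of cycle 4 and is forwarded to the ALU for cycle 5. Without any forwarding, a dependent instruction would instead have to wait until the producer's writeback cycle to read the RegFile, which costs two stall cycles for back-to-back instructions.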

\subsection{Control}
\begin{description}
	\item[Problem:] Instruction 1 is a branch. The next two instructions are fetched into the pipeline before the branch outcome is known.
	\item[Solution:] If the branch is not taken, the instructions fetched sequentially after the branch are correct. If the branch or jump is taken, we must flush (kill) the incorrect instructions from the pipeline by converting them to NOPs.
\end{description}
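
As a back-of-the-envelope estimate of the cost of flushing, let $f_b$ be the fraction of instructions that are branches and $p_t$ the fraction of those that are taken (both symbols and the numbers below are assumptions for illustration). With two instructions flushed per taken branch and no other stalls,
\[
	\mathrm{CPI} = 1 + 2\, f_b\, p_t .
\]
For example, $f_b = 0.2$ and $p_t = 0.5$ would give $\mathrm{CPI} = 1 + 2 \times 0.2 \times 0.5 = 1.2$, i.e. about 20\% more cycles than an ideal pipeline.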

\section{Superscalar Processors}
How can we increase the performance of a single processor core?
\begin{enumerate}
    \item Clock rate: Limited by technology and power dissipation
    \item Pipelining: ``Overlap'' instruction execution
    \item Multiple issue superscalar: Multiple pipelines; out-of-order execution (reorder instructions dynamically in hardware)
\end{enumerate}
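
One way to relate the three options above is the standard decomposition of execution time:
\[
	\frac{\mbox{time}}{\mbox{program}}
	= \frac{\mbox{instructions}}{\mbox{program}}
	\times \frac{\mbox{cycles}}{\mbox{instruction}}
	\times \frac{\mbox{time}}{\mbox{cycle}} .
\]
A higher clock rate shrinks the last factor. Pipelining also shrinks it, since each cycle only has to cover one stage, while keeping the cycles per instruction close to 1. Multiple issue targets the middle factor: as an idealized example, a 2-wide superscalar core with no hazards could complete two instructions per cycle, i.e. $\mathrm{CPI} = 0.5$.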