Author | Editor | Title |
---|---|---|
Jason Lowe-Power |
Maryam Babaie, Mahyar Samani, Kaustav Goswami |
ECS 201A Assignment 3 |
Due on 02/17 1:59 pm (PST): See Submission for details
- Introduction
- Workload
- Experimental setup
- Analysis and simulation
- Submission
- Grading
- Academic misconduct reminder
- Hints
In this assignment, you'll be investigating the performance impacts of different out-of-order core designs on a set of RISC-V benchmarks. The goals of this assignment are:
- Show how applications have different behaviors as the microarchitecture changes.
- Give you experience investigating the bottleneck in a particular architecture.
- Improve your understanding of out-of-order processor architecture.
In this assignment, you are going to use four workloads for the evaluation of the performance of your systems. Read more about each workload and how to use those workloads for your experiments below.
You are going to use the same matrix multiplication program from [assignment 1]({{'modules/gem5/assignment1' | relative_url}}) as the first workload.
Please note that matrix size is hardcoded for this assignment.
You do not have to pass matrix size as an input argument to the __init__
function for your workload.
Below you can find the C++ implementation of the matrix multiplication workload.
#include <iostream>
#include "matrix.h"
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
void multiply(double* A, double* B, double* C, int size)
{
for (int i = 0; i < size; i++) {
for (int k = 0; k < size; k++) {
for (int j = 0; j < size; j++) {
C[i * size + j] += A[i * size + k] * B[k * size + j];
}
}
}
}
int main()
{
std::cout << "Beginning matrix multiply ..." << std::endl;
#ifdef GEM5
m5_work_begin(0,0);
#endif
multiply(A, B, C, SIZE);
#ifdef GEM5
m5_work_end(0,0);
#endif
std::cout << "Finished matrix multiply." << std::endl;
return 0;
}
Take a look at workloads/matmul_workload.py
to learn more about instantiating a matrix multiplication workload.
BFS is an algorithm for traversing a graph in breadth order.
The algorithm traverses vertices in a graph in an order corresponding to each vertex's relative depth to a specific vertex.
I.e. the algorithm starts traversing vertices in the graph starting from a specific vertex called root
.
During the traversal, the algorithm makes sure to visit vertices that are only 1 edge/hop
away from root
before it visits those vertices that are 2 edges/hops or more
away.
The BFS algorithm is one of the building blocks of graph analytics workloads.
Graph analytics workloads have common use cases such as modeling relationships and processes in physical, biological, social, and information systems.
Beamer et al. discuss the BFS algorithm and how it could be parallelized in their paper.
Below you can find a basic C++ implementation of the algorithm discussed in the paper.
#include <iostream>
#include <vector>
#include "graph.h"
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
int main()
{
std::vector<int> frontier;
std::vector<int> next;
frontier.clear();
next.clear();
frontier.push_back(0);
std::cout << "Beginning BFS ..." << std::endl;
#ifdef GEM5
m5_work_begin(0,0);
#endif
while (!frontier.empty()) {
for (auto vertex: frontier) {
int start = columns[vertex];
int end = columns[vertex + 1];
for (int i = start; i < end; i++){
int neighbor = edges[i];
if (visited[neighbor] == 0) {
visited[neighbor] = 1;
next.push_back(neighbor);
}
}
}
frontier = next;
next.clear();
}
#ifdef GEM5
m5_work_end(0,0);
#endif
std::cout << "Finished BFS." << std::endl;
return 0;
}
Take a look at workloads/bfs_workload.py
to learn more about instantiating a BFS workload.
Bubble sort is the starter algorithm for sorting arrays. This program sorts an array by replacing each element with the smallest element to its right (if sorting ascendingly). Below you can find the C++ implementation for bubble sort.
#include <iostream>
#include "array.h"
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
int main()
{
std::cout << "Beginning bubble sort ... " << std::endl;
#ifdef GEM5
m5_work_begin(0,0);
#endif
for (int i = 0; i < ARRAY_SIZE - 1; i++) {
for (int j = i + 1; j < ARRAY_SIZE; j++) {
if (data[i] > data[j]) {
int temp = data[i];
data[i] = data[j];
data[j] = temp;
}
}
}
#ifdef GEM5
m5_work_end(0,0);
#endif
std::cout << "Finished bubble sort." << std::endl;
return 0;
}
The N-Queens problem is a classic computer science problem. It involves placing N chess queens on an N×N chessboard so that no two queens threaten each other. This means that no two queens can share the same row, column, or diagonal. The most common method to solve the N-Queens problem is to use backtracking. Below you can find the C++ implementation for the same:
#include <bits/stdc++.h>
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
void printSolution(int **board, int N) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++)
printf(" %d ", board[i][j]);
printf("\n");
}
}
bool isSafe(int **board, int row, int col, int N) {
for (int i = 0; i < col; i++)
if (board[row][i])
return false;
for (int i=row, j=col; i>=0 && j>=0; i--, j--)
if (board[i][j])
return false;
for (int i=row, j=col; j>=0 && i<N; i++, j--)
if (board[i][j])
return false;
return true;
}
bool solveNQUtil(int **board, int col, int N) {
if (col >= N)
return true;
for (int i = 0; i < N; i++) {
if (isSafe(board, i, col, N)) {
board[i][col] = 1;
if (solveNQUtil(board, col + 1, N))
return true;
board[i][col] = 0;
}
}
return false;
}
bool solveNQ(int **board, int N) {
if (solveNQUtil(board, 0, N) == false) {
printf("Solution does not exist");
return false;
}
printSolution(board, N);
return true;
}
int main(int argc, char *argv[]) {
// the user needs to input the size of the chessboard
if (argc == 2) {
int size = atoi(argv[1]);
// initialize the board
int **board = new int*[size];
for (int i = 0 ; i < size; i++)
board[i] = new int[size];
for (int i = 0 ; i < size ; i++) {
for (int j = 0 ; j < size ; j++) {
board[i][j] = 0;
}
}
#ifdef GEM5
m5_work_begin(0,0);
#endif
solveNQ(board, size);
#ifdef GEM5
m5_work_end(0,0);
#endif
}
else {
printf("N-Queens program. Usage \n $ ./queens <chess-board-size>\n");
}
return 0;
}
In this assignment, you are asked to design your own out of order processor
models for your experiments.
In regards to the rest of the components in your system:
- You will be using
HW3RISCVBoard
as your mainboard
for your computer system. You can find the model for the board incomponents/boards.py
. - You will be using
HW3MESICache
as your cache hierarchy for your computer system. You can find its model incomponents/cache_hierarchies.py
. - You will be using
HW3DDR4
as yourmemory
in your computer system. You can find its model incomponents/memories.py
. - You will be using
2 GHz
as you clock frequencyclk_freq
in your system.
For your processor, you are going to use HW3O3CPU
to model a high-performance and an efficient processor core.
HW3O3CPU
is based on O3CPU
which is an internal model of gem5.
Read up on the O3CPU
in gem5's documentation.
Below, you can find details about HWO3CPU
and its parameters.
For the purposes of this assignment, you need to configure a processor with the same width in all stages.
In the constructor of HW3O3CPU
this attribute is named as width
.
This is the number of entries in the reorder buffer (ROB).
This will constrain the number of instructions which can be "in flight" in the processor.
In the constructor of HW3O3CPU
this attribute is named as rob_size
.
This is the number of registers in the physical register file. A processor renames architecture registers to physical registers to resolve false dependences. It also tracks true dependences (read-after-write) in the register file. To learn more about register renaming, read up on Tomasulo's algorithm.
HW3O3CPU
has two physical register files.
One register file for integer registers and one for floating point registers.
In the constructor of HW3O3CPU
, num_int_regs
refers to the number of integer physical registers and num_fp_regs
refers to the number of floating point physical registers.
NOTE: Both register files must be larger than the 32 entries or gem5 will hang. This is becuase the number of physical registers must be bigger than or equal to the number of logical registers. RISC-V ISA defines 32 logical registers.
In this assignment, you are required to design your own high-performance and efficient cores.
You need to add two core designs to components/processors.py
.
In componets/processors.py
, create a model based on HW3O3CPU
and name it HW3BigCore
.
This core will be your high-performance core for the assignment.
Create another model in components/processors.py
and name it HW3LittleCore
.
This core will be your efficient core for the assignment.
You can use information available on the internet on the different core microarchitectures to configure your HW3BigCore
and HW3LittleCore
.
As a starting point, take a look at WikiChip and AnandTech.
Your instructor has loosely modeled their HW3BigCore
on Intel Sunny Cove and their HW3LittleCore
on Intel Gracemont.
NOTE: You are not required to match the specifications of any core.
Use information online as a guideline for your design.
You might find it impossible to match the specifications found online.
E.g. the base model HW3O3CPU
assumes the same width for all the stages of the pipeline which is not usually the case with modern processor designs.
Tips and To dos: When designing your cores and caches, I recommend taking note of the following:
- When designing your cores, I strongly recommend not beefing up your cores, especially
HW3LittleCore
. Remember that in computer design, there are almost always diminishing returns. A beefyHW3LittleCore
will result in aHW3BigCore
that is not much more performant thanHW3LittleCore
. I recommend dividing the values you see on specifications by in half forHW3LittleCore
. - Setting the
width
below 4 would result some unstability. However, you might want to setwidth
to 3 or less forHW3LittleCore
. I recommend starting with very little numbers for the rest of the parameters, especiallyrob_size
, and gradually increasing them to get around this issue. - Make sure you register files have more than 32 entries each.
Before running any simulations answer the following questions in your report.
1- What will be the average speed up of HW3BigCore
over HW3LittleCore
? Can you predict an upper bound using your pipeline parameters?
Hint: You can predict and upper bound for the speed up using Amdahl's law with optimistic values.
2- Do you think all the workloads will experience the same speed up between HW3BigCore
and HW3LittleCore
?
Now that you have completed the design process of HW3BigCore
and HW3LittleCore
, let's compare their performances using our four workloads as benchmarks.
Simulate each workload with each core.
For each workload compare the performance of HW3BigCore
with the performance of HW3LittleCore
.
Answer the following questions in your report. Use relevant and correct reasoning in your answer. Use simulation data with proper simulation data to strengthen your reasoning.
- What is the speed up of
HW3BigCore
overHW3LittleCore
for each workload? - What is the average improvement in IPC of
HW3BigCore
compared toHW3LittleCore
over all the workloads? CAUTION: Make sure to use the correct mean to report average IPC improvement. - Some workloads show more speedup that others. Which workloads show high speedup, which show low speedup? Look at the benchmark code (both the
.c
and.s
files may be useful) and speculate the algorithm characteristics which influence the IPC difference betweenHW3BigCore
andHW3LittleCore
. What characteristics do applications have that lead to low performance improvement and what characteristics lead to high performance improvement? - Which workload has the highest IPC for
HW3BigCore
? What is unique about this workload?
Hints: Take a look at the assembly code of the ROI for inspiration.
In this step, you are tasked with finding a middle ground between HW3BigCore
and HW3LittleCore
.
This core needs to perform as closely as possible to HW3BigCore
while using as little resources as HW3LittleCore
.
We will refer to this core as HW3MediumCore
.
To pick our sweet spot for the design of HW3MediumCore
, we need to develop a methodology.
First, we need to define a cost function for increasing hardware resources.
We will use area as the cost of making a hardware.
Not all parameters of the pipeline have the same effect on the area.
E.g. Increasing the width of the pipeline has a quadratic effect on the area of the hardware while increasing register file entries has a linear effect.
Morever the cost of increasing two of these resources at the same time should be bigger than the sum of increasing each resource.
We will use an equation like below to score the area of a pipeline using the 4 parameters of width
, rob_size
, num_int_regs
, and num_fp_regs
.
You can also get the area score for a pipeline design by calling the method get_area_score
on the processor.
Now that we have our cost function, let's devise a method for measuring our gains. In you report answer the following question.
- If you were to use the speed up under only one workload from the four workloads you used before, which workload would you choose? Why?
Now that we have devised functions to measure costs and gains, configure 4 middle ground designs for the pipeline. Not all of these designs will have the "best" cost-gain tradeoff. In your report include the following figure.
- Create a pareto frontier plot with cost on the y-axis and performance on the x-axis. This will be a scatter plot with 6 points: the two "big" and "LITTLE" cores as well as your 4 middle ground designs. Then, "connet the dots" on the "best" designs.
Answer this question in your report.
- Assume you are an engineer working to design this middle ground core. Given this early analysis, which designs, if any, would you recommend your team to pursue developing? Explain why. (Note: you may want to annotate the above plot.)
Based on your experiments and insights, can you answer the following qeustion?
- Many phone chips (e.g. Arm Cortex-A series processors) and Intel's Alder Lake chips employ architectures that contain both big cores and little cores in a single system. The operating system (or Intel's "thread directory") can choose which core to use to run each thread. Assume that there is not context switching overhead, do you think that this system will be more or less efficient than have an equal number of medium cores? You may want to read the paper Amdahl's Law in the Multicore Era by Hill and Marty for inspiration. Note that you do not need to run any further experiments to answer this question. Back up your answer with logic.
As mentioned before, you are allowed to submit your assignments in pairs and in PDF format. You should submit your report on gradescope. In your report answer the questions presented in Analysis and simulation, Analysis and simulation: Step I, Analysis and simulation: Step II, and, Analysis and simulation: Step III. Use clear reasoning and visualization to drive your conclusions. Submit all your code through your assignment repository. Please make sure to include code/scripts for the following.
Instruction.md
: should include instruction on how to run your simulations.- Automation: code/scripts to run your simulations.
- Configuration: python file configuring the systems you need to simulate.
You should add your final core designs to
components/processors.py
. There should be six core definitions incomponents/processor.py
. They should include one design forHW3BigCore
, one design forHW3LittleCore
, and four designs forHW3MediumCore
. You can add numbers to designs forHW3MediumCore
to distinguish their design. E.g.HW3MediumCore0
,HW3MediumCore1
,HW3MediumCore2
, andHW3MediumCore3
.
Like your submission, your grade is split into two parts.
-
Reproducibility Package (50 points):50
- Instruction and automation to run simulations for different section and dump statistics (20 points)
- Instructions (5 points)
- Automation (5 points)
- Configuration scripts and correct simulation setup (40 points): 10 points for configuration script(s) used to run your simulations and 5 points for implementing each of the 6 processor models as described in Analysis and simulation: Step I, and Analysis and simulation: Step II.
- Instruction and automation to run simulations for different section and dump statistics (20 points)
-
Report (50 points): 5 points for each question presented in Analysis and simualtion, Analysis and simulation: Step I, Analysis and simulation: Step II, and, Analysis and simulation: Step III.
You are required to work on this assignment in teams. You are only allowed to share you scripts and code with your teammate(s). You may discuss high level concepts with others in the class but all the work must be completed by your team and your team only.
Remember, DO NOT POST YOUR CODE PUBLICLY ON GITHUB! Any code found on GitHub that is not the base template you are given will be reported to SJA. If you want to sidestep this problem entirely, don’t create a public fork and instead create a private repository to store your work.
- Start early and ask questions on Piazza and in discussion.
- If you need help, come to office hours for the TA, or post your questions on Piazza.