15618 Multi-Core Cache Simulator

Links

See here for the proposal.
See here for the milestone report.
See here for the final report.
This is the link to the project website.
This is the link to the project repository.

Summary

We are going to implement a trace-driven multi-core cache simulator supporting both snooping-based and directory-based cache coherence protocols. We also want to perform workload analysis for programs with different access patterns, locality, and sharing, and to study the effect of different interconnect topologies on cache performance.

Background

We studied multiple cache coherence protocols during the lectures, such as MSI, MESI, and MOESI. We also studied a couple of different implementation styles, namely snooping-based and directory-based. We are curious about their practical implications and their effect on the performance of a multi-core cache system.

We're also excited about studying the effect of cache line size and replacement policy on the performance of the multi-core cache system.

State diagram for MESI:
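
As a textual companion to the state diagram, the standard MESI transitions can be sketched as a small C++ function. This is a minimal sketch of the textbook protocol, not the simulator's actual code: the names `State`, `Event`, `next_state`, and `others_have_copy` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// The four MESI line states.
enum class State { Modified, Exclusive, Shared, Invalid };

// Events seen by one cache line: requests from the local processor (PrRd,
// PrWr) and transactions snooped from other caches (BusRd, BusRdX).
enum class Event { PrRd, PrWr, BusRd, BusRdX };

// Returns the next state of a line. `others_have_copy` (hypothetical
// parameter) only matters on a read miss: the line is loaded as Shared if
// another cache holds it, otherwise as Exclusive.
State next_state(State s, Event e, bool others_have_copy = false) {
    switch (s) {
    case State::Invalid:
        if (e == Event::PrRd) return others_have_copy ? State::Shared : State::Exclusive;
        if (e == Event::PrWr) return State::Modified;   // issues BusRdX
        return State::Invalid;                          // snoops do not affect an invalid line
    case State::Shared:
        if (e == Event::PrWr) return State::Modified;   // must invalidate other copies
        if (e == Event::BusRdX) return State::Invalid;
        return State::Shared;
    case State::Exclusive:
        if (e == Event::PrWr) return State::Modified;   // silent upgrade, no bus transaction
        if (e == Event::BusRd) return State::Shared;
        if (e == Event::BusRdX) return State::Invalid;
        return State::Exclusive;
    case State::Modified:
        if (e == Event::BusRd) return State::Shared;    // dirty data is flushed
        if (e == Event::BusRdX) return State::Invalid;  // dirty data is flushed
        return State::Modified;
    }
    assert(false && "unreachable: all states are handled above");
    return s;
}
```

The Exclusive state is what distinguishes MESI from MSI: a write to a line that no other cache holds needs no bus transaction.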

Design of snooping-based cache coherence:

Design of directory-based cache coherence:

The challenges

Correctly reflecting what we learnt about cache coherence protocols from lecture in the actual implementation requires a firm understanding of the protocol implications.
Understanding the core APIs of [SST](https://github.com/sstsimulator), which has limited documentation and a limited active community.
Recording traces from program execution on a multi-core machine and feeding those traces to our implementation is something neither of us has done before.
Devising appropriate test plans, programs, and workloads in order to stress the simulator and extract valuable insights is also a challenge.
Measuring the performance of directory-based protocols depends highly on accurate modeling of the interconnect and arbitration.

Resources

We'll implement the cache system from scratch, building upon the core APIs exposed by [SST](https://github.com/sstsimulator).
We plan to do development on local machines, and then gather traces and run tests and benchmarks on PSC machines. Another reason for using the PSC machines is that we want to study the effect of the number of cores on the scalability of different cache coherence implementations.
We'll also use Intel Pin to record memory access traces on PSC machines.

Goals and deliverables

PLAN TO ACHIEVE

We plan to achieve a fully-functional multi-core cache coherency simulator capable of being configured with

the number of cores
cache block size
cache replacement policy
coherency protocol
- MESI
implementation style
- snoop-based

Building our own cache coherence simulator on top of the core APIs exposed by SST lets us:

Gain hands-on experience in using and extending an industrial toolkit
Simplify and abstract the behaviour and workloads we want to study in a controlled environment
Enable trace-based analysis in SST in addition to the artificial address generators available in SST

HOPE TO ACHIEVE

If time permits, we also aim to implement a directory-based coherency protocol on top of the SST APIs. This will allow us to perform scalability studies and compare directory-based and snooping-based protocols. We aim to analyse and present concrete data on which kinds of programs, access patterns, and sharing behaviour benefit from and scale with directory-based protocols.

ANALYSIS

We aim to analyze programs with different memory access patterns, sharing, and locality, and to back our understanding with concrete numbers and statistics, in order to study the following:

Performance and traffic generated by different lock implementations (Test and Set, Test and Test and Set)
Effect of artifactual communication on performance
Directory based vs Snooping based scalability (125% Goal)
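
To make the lock comparison concrete, here is a minimal sketch of the two lock styles using `std::atomic`. The class names `TasLock` and `TtasLock` are hypothetical and are not taken from our test programs; the sketch only illustrates why the two locks produce different coherence traffic.

```cpp
#include <atomic>
#include <cassert>

// Test-and-set lock: every spin iteration is an atomic read-modify-write, so
// each waiting core keeps requesting the line for writing and invalidates the
// copies held by the other cores.
class TasLock {
    std::atomic<bool> locked{false};
public:
    bool try_lock() { return !locked.exchange(true, std::memory_order_acquire); }
    void lock() { while (!try_lock()) { /* spin */ } }
    void unlock() {
        assert(locked.load(std::memory_order_relaxed) && "unlock of a free lock");
        locked.store(false, std::memory_order_release);
    }
};

// Test-and-test-and-set lock: waiters first spin on a plain load, which hits
// in their own cache while the line stays shared, and only attempt the atomic
// exchange once the lock looks free.
class TtasLock {
    std::atomic<bool> locked{false};
public:
    bool try_lock() {
        return !locked.load(std::memory_order_relaxed) &&
               !locked.exchange(true, std::memory_order_acquire);
    }
    void lock() { while (!try_lock()) { /* spin */ } }
    void unlock() {
        assert(locked.load(std::memory_order_relaxed) && "unlock of a free lock");
        locked.store(false, std::memory_order_release);
    }
};
```

Under contention, the first lock generates a write request on the interconnect per spin iteration, while the second generates traffic mainly when the lock is released.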

DEMO

We aim to have an interactive demo showing the capabilities of our simulator. We plan to present insights by reporting the following statistics -

Different types of cache statistics -

Miss rate
Number of invalidations due to the coherency protocol

Bus Traffic Classification -

Memory traffic: Request served directly from memory
Coherency traffic: Request served from one of the caches due to sharing
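
The cache statistics and the bus traffic classification above could be collected with a counter structure along these lines. This is a sketch under our own naming assumptions (`CacheStats`, `Source`, and the `record_*` methods are hypothetical), not the statistics interface of SST or of the simulator.

```cpp
#include <cassert>
#include <cstdint>

// Where the data for a bus request came from.
enum class Source { Memory, PeerCache };

struct CacheStats {
    std::uint64_t accesses = 0;
    std::uint64_t misses = 0;
    std::uint64_t invalidations = 0;      // lines invalidated by the coherence protocol
    std::uint64_t memory_traffic = 0;     // requests served directly from memory
    std::uint64_t coherence_traffic = 0;  // requests served by another cache

    void record_access(bool hit) {
        ++accesses;
        if (!hit) ++misses;
    }
    void record_invalidation() { ++invalidations; }
    void record_bus_response(Source src) {
        if (src == Source::Memory) ++memory_traffic;
        else ++coherence_traffic;
    }
    // Fraction of accesses that missed; 0 when nothing has been recorded yet.
    double miss_rate() const {
        assert(misses <= accesses);
        return accesses == 0 ? 0.0 : static_cast<double>(misses) / accesses;
    }
};
```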

Latency of different cache events

Read
Write

Effect of cache block size on performance producing a plot of miss rate vs cache block size
Study the effect of different types of memory access patterns and sharing (such as Ocean simulation, stencil etc) to gather insights from the simulator
Scalability of cache coherency implementation styles (125%)

snoop-based
directory based

PLATFORM CHOICE

We will be doing most of our development and execution locally or on the GHC machines. However, for the scalability study of directory-based vs snoop-based coherency implementations, we will use PSC machines to gather traces on a larger number of cores.
We will develop our cache component on top of the SST architecture. Multi-core communication will be supported by the built-in OpenMPI APIs of SST.
We will use C++ as our development language.

SCHEDULE

Week Number	Checkpoint
1	Study SST API and start building a cache component
2	Complete implementation of cache
3	Gather traces using PIN tool
4	Perform analysis and gather data using simulator
5	Work on report and extending simulator to directory based protocol

Assumptions

As we began the development process, we made the following assumptions for our multi-core cache coherence simulator:

At any time, each processor has only one outstanding request
The bus supports only atomic transactions
We don't support reads and writes of actual data. We are only concerned with the addresses issued by each processor

As part of this project we plan to remove assumption 1 by enhancing the cache implementation to incorporate non-blocking semantics.
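
Assumptions 1 and 3 can be illustrated with a small trace-driven core model. This is a sketch only: `TraceEntry` and `BlockingCore` are hypothetical names, and the sketch does not reflect how the simulator's load/store generator is actually structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption 3: a trace entry carries only the access type and the address,
// never the data being read or written.
struct TraceEntry {
    bool is_write;
    std::uint64_t addr;
};

// Assumption 1: the core may issue a new request only after the response to
// its previous request has arrived.
class BlockingCore {
    std::vector<TraceEntry> trace_;
    std::size_t next_ = 0;
    bool waiting_ = false;
public:
    explicit BlockingCore(std::vector<TraceEntry> trace) : trace_(std::move(trace)) {}

    bool can_issue() const { return !waiting_ && next_ < trace_.size(); }

    // Hands the next trace entry to the cache and blocks the core.
    TraceEntry issue() {
        assert(can_issue());
        waiting_ = true;
        return trace_[next_++];
    }

    // Called when the cache responds; unblocks the core.
    void on_response() {
        assert(waiting_);
        waiting_ = false;
    }

    bool done() const { return !waiting_ && next_ == trace_.size(); }
};
```

A non-blocking design would replace the single `waiting_` flag with a bounded set of outstanding requests.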

Updated Schedule for Project Milestone

We've been working diligently to keep to the schedule. So far we've completed the following portions:

Study SST Core API

We went through the SST Core documentation explaining the basic primitives, components, and APIs
We also checked out the tutorials online and the simple examples given in the SST-Elements repository
The next step was building the SST-Core and SST-Elements repositories locally and running a few examples to get hands-on experience and understand the process of building our own components on top of SST-Core.

Completed development of the simulated CPU load/store generator
Completed building a cache component

We have already completed the basic development process of a multi-core cache.
The cache has three ports
- One to receive requests and transmit responses back to the processor
- Second to submit requests to the bus arbitrator to access the bus
- Third to transmit the actual request on the bus and receive the response
We have a working implementation with a broadcast-based MSI cache coherency protocol

Tested the complete implementation on a trace computing the sum of an array in parallel

Broadcast based interconnect
Arbiter with round-robin and FIFO policies
Multi-core cache with MSI coherence protocol
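
The two arbitration policies can be sketched as follows. This is an illustration with hypothetical class names (`FifoArbiter`, `RoundRobinArbiter`) and a simple integer core id; it does not use SST's link or event APIs and is not the simulator's arbiter component.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// FIFO policy: grant the bus in the order in which requests arrived.
class FifoArbiter {
    std::deque<int> queue_;
public:
    void request(int core) { queue_.push_back(core); }
    // Returns the granted core id, or -1 if no request is pending.
    int grant() {
        if (queue_.empty()) return -1;
        int core = queue_.front();
        queue_.pop_front();
        return core;
    }
};

// Round-robin policy: scan the cores starting just after the last grantee, so
// no core can be starved by a neighbour that requests continuously.
class RoundRobinArbiter {
    std::vector<bool> pending_;
    std::size_t last_;
public:
    explicit RoundRobinArbiter(std::size_t num_cores)
        : pending_(num_cores, false), last_(num_cores - 1) {
        assert(num_cores > 0);
    }
    void request(int core) { pending_.at(static_cast<std::size_t>(core)) = true; }
    // Returns the granted core id, or -1 if no request is pending.
    int grant() {
        const std::size_t n = pending_.size();
        for (std::size_t i = 1; i <= n; ++i) {
            const std::size_t core = (last_ + i) % n;
            if (pending_[core]) {
                pending_[core] = false;
                last_ = core;
                return static_cast<int>(core);
            }
        }
        return -1;
    }
};
```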

Understood how to gather multi-threaded memory traces using the Pin tool

Below is the updated schedule for the coming 2 weeks in half-week granularity (between milestone and project deadline).

Week Number	Checkpoint	Assignee	Status
0.5	Complete implementation of cache component	Tanay	Done
1.0	Complete implementation of bus and arbitrator	Xuan	Done
1.5	Enhance implementation of cache component for MESI protocol and additional statistics	Tanay	Done
1.75	Enhance implementation to incorporate non-blocking cache semantics and additional statistics	Both	Done
2.0	Devise characteristic multi-threaded programs to stress test simulator and study workload patterns	Both	Done
2.0	Generate characteristic cache traces using Pintool	Both	Done
2.25	Perform analysis and gather data using our simulator	Both	Done
2.5	Complete the extended implementation of directory component	Xuan	Cancelled
2.75	Incorporate changes to cache and bus for directory based protocol	Tanay	Cancelled
3.0	Work on report and poster session prep	Both	Done

Updated Goals and Deliverables

We have been sticking fairly well to the planned schedule and targeted development goals. We already have a working multi-core cache coherency simulator with the following features:

Variable cache block size
Variable total cache size
Configurable associativity
Configurable replacement policy

Round robin
LRU
MRU

Configurable cache coherence protocol

MSI
MESI (in progress)

Configurable arbitration policy:

Round robin
FIFO
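
As an illustration of the LRU replacement option listed above, one cache set can be modelled with a recency-ordered list. This is a minimal sketch with a hypothetical `LruSet` class that tracks tags only; it is not the simulator's replacement-policy code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// One set of a set-associative cache with LRU replacement.
class LruSet {
    std::size_t ways_;
    std::list<std::uint64_t> order_;  // front = most recently used tag
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos_;
public:
    explicit LruSet(std::size_t ways) : ways_(ways) { assert(ways > 0); }

    // Returns true on a hit. On a miss the tag is inserted, evicting the
    // least recently used tag if the set is already full.
    bool access(std::uint64_t tag) {
        auto it = pos_.find(tag);
        if (it != pos_.end()) {
            // Hit: move the tag to the front; splice keeps iterators valid.
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (order_.size() == ways_) {
            pos_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(tag);
        pos_[tag] = order_.begin();
        return false;
    }
};
```

An MRU policy would differ only in the victim choice: it would evict the tag at the front of the list instead of the back.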

In the upcoming weeks, we plan to enhance the implementation to incorporate non-blocking cache semantics after discussing it with Professor Skarlatos and taking his feedback. After that, we will carry out performance studies using our simulator to gain insight into the behaviour of shared-memory parallel programs with different communication patterns. We also plan to reproduce what we learned in class about artifactual communication, using actual data and statistics collected with our simulator (XTSim).

Since our poster session is on December 9th, leaving us with less than 10 days, we are skeptical of achieving our 125% goal, but we will try our best to stay in line with the original goals and deliverables.

We aim to present our deliverables in terms of graphs

Miss rate vs Programs with different access patterns
Number of invalidations vs Programs with different access patterns
Miss rate vs cache block size
Miss rate vs total cache size
Coherence Traffic vs Programs with different access patterns
Memory Traffic vs Programs with different access patterns
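
To show how a "miss rate vs cache block size" data point could be produced from an address trace, here is a deliberately simplified sketch: a single direct-mapped cache with no coherence. The functions `miss_rate` and `sequential_trace` are hypothetical helpers written for this example and are not part of the simulator.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Miss rate of a direct-mapped cache of `cache_bytes` bytes with
// `block_bytes`-byte lines over a trace of byte addresses.
double miss_rate(const std::vector<std::uint64_t>& trace,
                 std::uint64_t cache_bytes, std::uint64_t block_bytes) {
    assert(block_bytes > 0 && cache_bytes >= block_bytes);
    if (trace.empty()) return 0.0;
    const std::uint64_t num_lines = cache_bytes / block_bytes;
    std::vector<std::uint64_t> block_in_line(num_lines, 0);
    std::vector<bool> valid(num_lines, false);
    std::uint64_t misses = 0;
    for (std::uint64_t addr : trace) {
        const std::uint64_t block = addr / block_bytes;  // block number
        const std::uint64_t index = block % num_lines;   // line it maps to
        if (!valid[index] || block_in_line[index] != block) {
            ++misses;
            valid[index] = true;
            block_in_line[index] = block;
        }
    }
    return static_cast<double>(misses) / trace.size();
}

// Synthetic trace: `count` accesses starting at address 0, `stride` bytes apart.
std::vector<std::uint64_t> sequential_trace(std::uint64_t count, std::uint64_t stride) {
    std::vector<std::uint64_t> trace;
    trace.reserve(count);
    for (std::uint64_t i = 0; i < count; ++i) trace.push_back(i * stride);
    return trace;
}
```

For a sequential walk over 4-byte elements, every block is missed exactly once, so the miss rate is 4 divided by the block size: larger blocks exploit more spatial locality. Sweeping `block_bytes` over powers of two and plotting the returned values gives one curve of the graph.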

Outstanding Concerns

We primarily have the following tasks remaining

Enhance cache implementation to incorporate non-blocking semantics
Generate test plan for carrying out performance study using our simulator (XTSim)

We are mainly concerned about the time remaining, since we're doing an early poster session. We are confident of achieving the 100% goal but are not sure we can complete the 125% goal of extending the implementation with a directory-based coherency protocol.
