Wave Dec 2024 Release #250

Open · 5 of 31 tasks

harsh-nod opened this issue Nov 4, 2024 · 0 comments

harsh-nod (Contributor) commented Nov 4, 2024

This issue lists all feature requests and improvements slated for the Dec 2024 TKW release.

Flash Attention performance is the highest priority.

  • FP8 Functionality & FP16 Performance Improvement
  • IGEMM optimizations & kernel check-in to iree-kernel-benchmark
  • Broadcasting dynamic offset for paged attention
  • Gather optimization
  • Kernel Caching
  • Hoist Q
  • Packed Shuffles
  • Schedule Discovery
  • Adjusting k-width to maximize reads from shared memory and align layouts between 2 matmuls
  • Scheduling
  • Packed Shuffles
  • Implement FP8 Attention Kernel
  • Scaling of Q has to happen after Q @ K
  • Linear offset has to be added (linear offset = 1.0 / max representable number in fp format)
  • Causal mask (addition of triangular matrix 0s and -infinity)
  • Dynamic dimensions for sequence length
  • Paged Attention using vector.gathers
  • Extend Attention (split-k vs warp reduction)
  • Prefill Attention
  • Decode Attention (M = 1, with dynamic)
  • Update Paper
  • Unaligned shapes for GEMMs
  • Debugger support (add breakpoints and inspect stack on GPU)
  • Profiling support
  • Ensure that mappings modify the index sequence
  • IGEMM Performance Results
  • GEMM Non-temporal loads
  • GEMM + SiLU fusion kernel
  • MoE Kernel
  • Buffer loads to load K directly to shared memory
  • Buffer loads for masking
  • Understand scheduling + multi-buffering in tensile to be able to implement it in wave
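Two of the FP8-attention items above have simple numerics worth pinning down: the causal mask is an additive matrix with 0s on and below the diagonal and -infinity above it, and the linear offset is 1.0 divided by the format's maximum representable value (448 for e4m3, 57344 for e5m2, per the OCP FP8 spec). A minimal NumPy sketch of both, purely illustrative and not the Wave kernel code:

```python
import numpy as np

def causal_mask(rows, cols):
    """Additive causal mask: 0 on/below the diagonal, -inf above it.
    Added to the attention scores before the softmax."""
    mask = np.zeros((rows, cols), dtype=np.float32)
    mask[np.triu_indices(rows, k=1, m=cols)] = -np.inf
    return mask

# Linear offset = 1.0 / max representable number in the FP8 format.
FP8_E4M3_MAX = 448.0   # OCP FP8 e4m3 max finite value
linear_offset = 1.0 / FP8_E4M3_MAX

print(causal_mask(3, 3))
print(linear_offset)
```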

================================================

Week 1 (Nov 8th)

  • Scheduling

Week 2 (Nov 15)

Ivan

  • Adding support for using tensors from the kernel in mapping for reads and writes

Harsh

  • Create a FA page table dataset for Ivan to test his PR on
  • Create a harness for SGLang grok/llama where we can test baseline perf, add our kernels, and compare perf (with Sai)
  • Write a decode attention kernel
  • Unaligned sequence length & unaligned head dim

Stan

  • Adjusting k-width to maximize reads from shared memory and align layouts between the two matmuls
  • Scheduling a meeting with Giuseppe to show the kernel and help him iterate
  • Meeting on the 15th with the quantization team to show the FP8 kernel

Unassigned

  • Getting kernels with hipBLASLt where we can turn knobs and relate knobs to output kernels
  • Packed Shuffles
  • Dynamic & aligned attention fp16 (M & K2 not specified)
Week 3 (Nov 22)

  • Identifying which knobs represent multi-buffering and investigating a strategy for multi-buffering
  • Assembly generation
  • Which knobs map to which instructions
  • Harsh: Differences between Tensile and Tensile Lite?
  • Both: How does PGR2 fit into the big picture?
  • Wave performance numbers
  • Documentation on scheduling in Tensile

Ivan

  • PR for dynamic symbol in read/write
  • Add conv kernel to iree-kernel-benchmark
  • Contiguous IGEMM loads PR

Stan

  • Land F8 Attention
  • F8 FA performance
  • Detect the ROCm device instead of manually specifying gfx942
  • Support for scalar constants
  • Kernel caching
  • Packed Shuffles
  • I32 MFMA intrinsics

Harsh

  • Land the dynamic GEMM PR
  • Dynamic attention PR
  • Flash Decoding, but without PagedAttention (highest priority)
  • Hoist Q
  • Backward attention
Week 4 (Nov 29)

  • Performance evaluation across 5 sizes with the best tile sizes and other parameters enabled
  • Insights on whether we need LLVM inline assembly and, if so, how we could integrate it with the rest of Wave
  • Start drafting an implementation strategy for mimicking multi-buffering

Ivan

  • Performance optimizations of IGEMM

Stan

  • Ramp up on scheduling and the importance of set_prio
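The multi-buffering items above (Weeks 3 and 4, and the Tensile study) come down to rotating between buffers so the load of tile i+1 is issued before the compute on tile i consumes its buffer. A structural Python sketch of the double-buffered loop; the overlap is real only with async copies on the GPU, and `load`/`compute` here are hypothetical stand-ins:

```python
def double_buffered(tiles, load, compute, acc):
    """Rotate two buffers: prefetch tile i+1 while tile i is consumed.
    On a GPU the load would be an async shared-memory copy; here it runs
    sequentially, so only the buffer-rotation structure is shown."""
    if not tiles:
        return acc
    buf = [load(tiles[0]), None]    # prologue: fill the first buffer
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            buf[(i + 1) % 2] = load(tiles[i + 1])  # prefetch the next tile
        acc = compute(acc, buf[i % 2])             # consume the current tile
    return acc

# Toy usage: "load" doubles the tile, "compute" accumulates a sum.
result = double_buffered([1, 2, 3], lambda t: 2 * t, lambda a, x: a + x, 0)
print(result)  # 12
```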
harsh-nod changed the title from "Dec 2024 Release" to "Wave Dec 2024 Release" on Nov 4, 2024.