This issue lists all feature requests and improvements slated for the Nov 2024 Tkw release.
Flash Attention performance is the highest priority
FP8 Functionality & FP16 Performance Improvement
IGEMM optimizations & kernel check-in to iree-kernel-benchmark
Broadcasting dynamic offset for paged attention
Gather optimization
Kernel Caching
Hoist Q
Packed Shuffles
Schedule Discovery
Adjusting k-width to maximize reads from shared memory and align layouts between 2 matmuls
Scheduling
Packed Shuffles
Implement FP8 Attention Kernel (see the FP8 sketch after this list)
Scaling of Q has to happen after Q @ K
A linear offset has to be added (linear offset = 1.0 / max representable number in the FP8 format)
Causal mask (addition of a triangular matrix of 0s and -infinity)
Dynamic dimensions for sequence length
Paged Attention using vector.gathers (see the gather sketch after this list)
Extend Attention (split-k vs warp reduction)
Prefill Attention
Decode Attention (M = 1, with dynamic)
Update Paper
Unaligned shapes for GEMMs
Debugger support (add breakpoints and inspect stack on GPU)
Profiling support
Ensure that mappings modify the index sequence
IGEMM Performance Results
GEMM Non-temporal loads
GEMM + SiLU fusion kernel
MoE Kernel
Buffer loads to load K directly to shared memory
Buffer loads for masking
Understand scheduling + multi-buffering in Tensile to be able to implement it in Wave
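Below is a minimal NumPy sketch (fp32 emulation, not an actual FP8 kernel) of the FP8 attention recipe above: the 1/sqrt(d) scale is applied after Q @ K^T rather than to Q, a linear offset of 1.0 / FP8_MAX is added, and a triangular 0 / -infinity causal mask is applied. The FP8_MAX value and the exact point where the linear offset is applied are assumptions for illustration, not the kernel's final design.

```python
import numpy as np

# Assumed constant: max representable value of the target FP8 format
# (e.g. 448.0 for e4m3fn); adjust for the format actually used.
FP8_MAX = 448.0
LINEAR_OFFSET = 1.0 / FP8_MAX


def fp8_attention_reference(q, k, v, causal=True):
    """fp32 reference for the operation ordering of the FP8 attention kernel.

    q: [M, D], k: [N, D], v: [N, D]. Quantization itself is not modeled;
    only the ordering of scale / offset / mask is illustrated.
    """
    d = q.shape[-1]

    # Scaling of Q happens after Q @ K: apply 1/sqrt(d) to the score matrix,
    # not to Q itself, so Q stays in an FP8-friendly range.
    s = (q @ k.T) * (1.0 / np.sqrt(d))

    # Linear offset (assumption: added to the scores; a constant shift
    # cancels out in the softmax).
    s = s + LINEAR_OFFSET

    # Causal mask: triangular matrix of 0s and -infinity added to the scores.
    if causal:
        m, n = s.shape
        s = s + np.triu(np.full((m, n), -np.inf), k=1)

    # Numerically stable softmax followed by the second matmul.
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p = p / p.sum(axis=-1, keepdims=True)
    return p @ v
```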
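And a small NumPy sketch of what a vector.gather-based paged-attention load does: K (or V) rows for one sequence are gathered out of a block-pool KV cache through a block table. All names, shapes, and sizes here are hypothetical, chosen only to illustrate the indirection.

```python
import numpy as np

def gather_paged_kv(kv_cache, block_table, seq_len, block_size):
    """Gather one sequence's K (or V) rows from a paged KV cache.

    kv_cache:    [num_blocks, block_size, head_dim] pool of physical blocks.
    block_table: [max_blocks_per_seq] map from logical to physical block ids.
    Returns a contiguous [seq_len, head_dim] array, i.e. what a
    vector.gather-style load with a dynamic per-element offset produces.
    """
    positions = np.arange(seq_len)
    logical_block = positions // block_size      # which logical block each token lives in
    within_block = positions % block_size        # offset inside that block
    physical_block = block_table[logical_block]  # indirection through the page table
    # Fancy indexing below is the NumPy analogue of the gather:
    # row i comes from kv_cache[physical_block[i], within_block[i], :].
    return kv_cache[physical_block, within_block]


# Tiny usage example with made-up sizes.
cache = np.random.randn(8, 4, 16)   # 8 physical blocks, block_size 4, head_dim 16
table = np.array([5, 2, 7])         # logical blocks 0,1,2 -> physical 5,2,7
k_seq = gather_paged_kv(cache, table, seq_len=10, block_size=4)
print(k_seq.shape)                  # (10, 16)
```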
================================================
Week 1 (Nov 8th)
Scheduling
Week 2 (Nov 15)
Ivan
Add support for using tensors from the kernel in mappings for reads and writes
Harsh
Create an FA page-table dataset for Ivan to test his PR on
Create a harness for SGLang Grok / Llama where we can measure baseline perf, then add our kernels and compare perf (with Sai)
Write a decode attention kernel
Unaligned sequence length & Unaligned head dim
Stan
Adjusting k-width to maximize reads from shared memory and align layouts between 2 matmuls
Schedule a meeting with Giuseppe to show the kernel and help him iterate
Meeting with the quantization team on the 15th to show the FP8 kernel
=========================================================================================
Unassigned
Get kernels from hipBLASLt where we can turn knobs and relate the knobs to the output kernels
Packed Shuffles
Dynamic & aligned attention fp16 (M & K2 not specified)
Week 3 (Nov 22)
Identify which knobs represent multi-buffering and investigate a strategy for implementing it
Assembly generation
Which knobs map to which instructions
Harsh: Differences between Tensile and TensileLite?
Both: How does PGR2 fit into the big picture?
Wave performance numbers
Documentation on scheduling in Tensile
Ivan
PR for dynamic symbol in read/write
Add conv kernel to iree-kernel-benchmark
Contiguous IGEMM loads PR
Stan
Land F8 Attention
F8 FA Performance
Detect the ROCm device instead of manually specifying gfx942
Support for scalar constants
Kernel caching
Packed Shuffle
I32 MFMA intrinsics
Harsh
Land dynamic GEMM PR
Dynamic attention PR
Flash Decoding but without PagedAttention (highest priority)
Hoist Q
Backward Attention
Week 4 (Nov 29)
Performance evaluation across 5 sizes with the best tile sizes and other parameters enabled
Insights on whether we need LLVM inline assembly and, if so, how we could integrate it with the rest of Wave
Start drafting an implementation strategy for mimicking multi-buffering
Ivan
Performance optimizations of IGEMM
Stan
Ramp up on scheduling and the importance of set_prio