My implementation of the Gemma LLM.

Training data:
a) All English Wikipedia pages (6.5 million pages)
b) ~2 billion tokens

Key insights from this implementation (a short sketch of each follows below):
a) RMS Normalization
b) RoPE Embeddings
c) Multi-Query Attention
d) GeGLU Activations
e) Pre-Norm Transformers

Training details:
a) 2 million parameters
b) Context length of 64 tokens
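RMS Normalization drops the mean-centering and bias of LayerNorm and only rescales by the root-mean-square of the activations, which is cheaper and works well in practice. A minimal PyTorch sketch, assuming a learnable per-feature scale and an eps of 1e-6 (both illustrative choices, not necessarily what this repo uses):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by 1/RMS(x), no mean-centering or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the feature dimension only.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms
```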
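RoPE encodes position by rotating each pair of query/key dimensions by a position-dependent angle, so attention scores depend on relative offsets rather than absolute positions. A sketch of one common half-split formulation (the pairing convention and base of 10000 are assumptions; implementations also differ on interleaved vs. split pairing):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim)."""
    _, seq, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per dimension pair, geometrically spaced.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```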
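Multi-Query Attention keeps many query heads but shares a single key/value head across all of them, shrinking the KV cache and the K/V projections. A minimal sketch under those assumptions (module and projection names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Causal attention with n_heads query heads and one shared K/V head."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        # Single K and V head, shared by every query head.
        self.k_proj = nn.Linear(dim, self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        # K/V have one head; expand it across the query heads without copying weights.
        k = self.k_proj(x).view(b, s, 1, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, 1, self.head_dim).transpose(1, 2)
        k = k.expand(-1, self.n_heads, -1, -1)
        v = v.expand(-1, self.n_heads, -1, -1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))
```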
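GeGLU replaces the plain MLP with a gated one: a GELU-activated gate multiplicatively modulates a parallel up-projection before projecting back down. A sketch assuming three bias-free linear layers (hidden size and naming are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """Gated feed-forward: down(GELU(x @ W_gate) * (x @ W_up))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated gate scales the up projection element-wise.
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```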
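Pre-norm transformers apply normalization *before* each sublayer and add the result to an untouched residual stream, which stabilizes training compared to post-norm. A sketch that composes the pieces above (it reuses the RMSNorm module from the first sketch; the block structure is the standard pre-norm pattern, not a verbatim copy of this repo):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: x + sublayer(norm(x)) for attention and MLP."""
    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn_norm = RMSNorm(dim)  # RMSNorm from the sketch above
        self.mlp_norm = RMSNorm(dim)
        self.attn = attn               # e.g. MultiQueryAttention(dim, n_heads)
        self.mlp = mlp                 # e.g. GeGLU(dim, hidden)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))  # normalize before attention
        x = x + self.mlp(self.mlp_norm(x))    # normalize before the MLP
        return x
```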