General guidelines for scaling the HDFS NameNode, based on lessons learnt in the field.
This repo is companion material for my ApacheCon 2021 presentation, "Scaling the Namenode - Lessons Learnt".
The slide deck from the talk is available here.
Commonly reported problems
- Performance
  - High RPC processing time
  - GC pauses
  - Poor read/write performance
  - NN takes too long to start
- Stability
  - Frequent failovers
  - Frequent crashes
Various causes
- Small files
- Suboptimal heap settings
- Missing RPC improvements
- Badly behaved applications / mistuned components
- Degraded AD
- Too frequent or delayed checkpointing
- Heavy services co-located with the NN / insufficient disk throughput
- Too much logging
- Degraded JournalNode (JN), or degraded communication between NN, JN, and ZooKeeper (ZK)
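
To illustrate why small files appear in the list of causes, here is a minimal back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb that each file, directory, and block object costs roughly 150 bytes of NameNode heap. That figure, the function name `estimate_namenode_heap_gib`, and its parameters are assumptions made for illustration; they do not come from the talk, and real usage varies with Hadoop version and namespace shape.

```python
import math

# Assumed rule of thumb (not from this repo): ~150 bytes of NameNode heap
# per namespace object (file, directory, or block).
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_gib(num_files, avg_file_size_mb,
                               block_size_mb=128, num_dirs=0):
    """Roughly estimate the heap (GiB) needed to hold namespace objects only."""
    # A file always occupies at least one block, however small it is.
    blocks_per_file = max(1, math.ceil(avg_file_size_mb / block_size_mb))
    objects = num_files + num_dirs + num_files * blocks_per_file
    return objects * BYTES_PER_OBJECT / 1024 ** 3


# The same 100 TiB of data, stored as 1 MiB files versus 1 GiB files.
total_mb = 100 * 1024 * 1024
small = estimate_namenode_heap_gib(total_mb // 1, avg_file_size_mb=1)
large = estimate_namenode_heap_gib(total_mb // 1024, avg_file_size_mb=1024)
print(f"1 MiB files: {small:.2f} GiB, 1 GiB files: {large:.2f} GiB")
```

Under these assumptions the 1 MiB layout needs roughly 29 GiB of heap for namespace objects alone, while the 1 GiB layout needs about 0.13 GiB. Treat the numbers as an order-of-magnitude guide for comparing layouts, not as a sizing formula.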