With the rapid increase in data generation, there is a growing need for sophisticated systems to classify and label large datasets for downstream tasks. Deep learning has seen significant success in recent years, particularly in computer vision, where Convolutional Neural Networks (CNNs) have consistently demonstrated their effectiveness. While transformer-based models with attention mechanisms are emerging as powerful tools, this project focuses on CNN-based methods for document classification.
This project focuses on document classification, a subset of image classification. Unlike object classification tasks, such as those seen in the Fashion-MNIST dataset, document classification requires understanding both:
- 📐 Spatial Structure of the document
- 📝 Contextual (Textual) Information within the document
- OCR-Based Solutions: Extract text from the document to leverage its contextual information.
- CNN-Based Solutions: Classify documents based on spatial layout alone.
⚡ While ensemble approaches that combine both methods exist, this project focuses on the faster and more efficient CNN-based classification using spatial information only. This approach is motivated by the fact that humans can accurately classify documents based on spatial layout alone, without needing to read the text. Additionally, CNNs are faster and avoid the limitations of OCR, such as unreliable text extraction.
The models are trained and evaluated on the RVL-CDIP dataset, a well-known and extensive dataset for document classification tasks.
The objective of this project is to:
- 🛠️ Develop and evaluate CNN architectures for document classification using spatial features.
- 📊 Analyze the performance of each model to determine the most effective approach.
Feel free to explore the code, data, and results within this repository!