
An ultimately comprehensive paper list of Vision Transformer/Attention, including papers, codes, and related websites


good-repos/Awesome-Transformer-Attention

 
 


Ultimate-Awesome-Transformer-Attention

This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, codes, and related websites.
This list is maintained by Min-Hung Chen. (Actively updated)

If you find any papers missing from this list, feel free to create a pull request, open an issue, or email me.
Contributions in any form to make this list more comprehensive are welcome.

If you find this repository useful, please consider citing and ★STARing this list.
Feel free to share this list with others!

[Update: September, 2022] Added the Transformer tutorial slides made by Lucas Beyer!
[Update: July, 2022] Added all the related papers from ICML 2022!
[Update: June, 2022] Added all the related papers from CVPR 2022!


Overview


Survey

  • "A Survey on Visual Transformer", TPAMI, 2022 (Huawei). [Paper]
  • "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (The University of Sydney). [Paper]
  • "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft). [Paper]
  • "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago). [Paper]
  • "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia). [Paper]
  • "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS). [Paper]
  • "Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Medical image analysis based on transformer: A Review", arXiv, 2022 (NUS, Singapore). [Paper]
  • "3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU). [Paper]
  • "Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (Oxford). [Paper]
  • "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (CAS). [Paper]
  • "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo). [Paper]
  • "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (INESC TEC and University of Porto, Portugal). [Paper]
  • "Efficient Transformers: A Survey", arXiv, 2022 (Google). [Paper]
  • "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (Tsinghua). [Paper]
  • "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan). [Paper]
  • "Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain). [Paper]
  • "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (Nanjing University). [Paper]
  • "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (?). [Paper]
  • "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University). [Paper]
  • "Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba). [Paper][GitHub]
  • "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt). [Paper]
  • "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI). [Paper]
  • "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (Renmin University of China). [Paper]
  • "A Survey of Transformers", arXiv, 2021 (Fudan). [Paper]
  • "A Survey of Visual Transformers", arXiv, 2021 (CAS). [Paper]
  • "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (University of Kashmir, India). [Paper]

[Back to Overview]

Image Classification / Backbone

Replace Conv w/ Attention

Pure Attention

Conv-stem + Attention

  • GSA: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (lucidrains)]
  • HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper][PyTorch (lucidrains)]
  • CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
  • HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][PyTorch (in construction)]

Conv + Attention

[Back to Overview]

Vision Transformer

General Vision Transformer

  • ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (Google). [Paper][Tensorflow][PyTorch (lucidrains)]
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • PiT: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (NAVER). [Paper][PyTorch]
  • VT: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (Facebook). [Paper][PyTorch (tahmid0007)]
  • PVT: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • iRPE: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • CaiT: "Going deeper with Image Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • Swin-Transformer: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (Microsoft). [Paper][PyTorch][PyTorch (berniwal)]
  • T2T-ViT: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (Yitu). [Paper][PyTorch]
  • FFNBN: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (Microsoft). [Paper]
  • DPT: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (CAS). [Paper][PyTorch]
  • Focal: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • XCiT: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (Facebook). [Paper]
  • Twins: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (Meituan). [Paper][PyTorch]
  • ARM: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (Amazon). [Paper][GitHub (in construction)]
  • DVT: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch]
  • Aug-S: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (Huawei). [Paper]
  • TNT: "Transformer in Transformer", NeurIPS, 2021 (Huawei). [Paper][PyTorch][PyTorch (lucidrains)]
  • ViTAE: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (The University of Sydney). [Paper][PyTorch]
  • DeepViT: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (NUS + ByteDance). [Paper][Code]
  • So-ViT: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (Dalian University of Technology). [Paper][PyTorch]
  • LV-ViT: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (ByteDance). [Paper][PyTorch]
  • NesT: "Aggregating Nested Transformers", arXiv, 2021 (Google). [Paper][Tensorflow]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (Alibaba). [Paper]
  • Refined-ViT: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • Shuffle-Transformer: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (Tencent). [Paper]
  • CAT: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (KuaiShou). [Paper][PyTorch]
  • V-MoE: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (Google). [Paper]
  • P2T: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (Nankai University). [Paper]
  • PVTv2: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (Nanjing University). [Paper][PyTorch]
  • LG-Transformer: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (IIAI, UAE). [Paper]
  • ViP: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (Oxford). [Paper]
  • Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
  • LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][PyTorch]
  • DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][PyTorch (in construction)]
  • RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][PyTorch]
  • CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][PyTorch]
  • ?: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
  • ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
  • CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][PyTorch]
  • DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][PyTorch (in construction)]
  • MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
  • DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
  • Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][PyTorch]
  • NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][PyTorch]
  • PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][PyTorch]
  • X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
  • ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][PyTorch]
  • UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code (in construction)]
  • Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][PyTorch]
  • DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
  • MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
  • VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
  • ?: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]
  • BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
  • O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
  • MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][PyTorch]
  • BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
  • ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
  • DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
  • NAT: "Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][PyTorch (in construction)]
  • LITv2: "Fast Vision Transformers with HiLo Attention", arXiv, 2022 (Monash University). [Paper][Code (in construction)]
  • PerViT: "Peripheral Vision Transformer", arXiv, 2022 (POSTECH). [Paper]
  • SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
  • EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
  • GC-ViT: "Global Context Vision Transformers", arXiv, 2022 (NVIDIA). [Paper][PyTorch]
  • LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
  • Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][PyTorch]
  • MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
  • MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
  • AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • MAGNETO: "Foundation Transformers", arXiv, 2022 (Microsoft). [Paper]

Efficient Vision Transformer

  • DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
  • ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
  • ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
  • PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
  • HVT: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
  • ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
  • MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
  • GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
  • DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
  • ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
  • Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
  • IA-RED2: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
  • LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
  • SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
  • ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
  • DeiT-Manifold: "Efficient Vision Transformers via Fine-Grained Manifold Distillation", arXiv, 2021 (Huawei). [Paper]
  • ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
  • Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
  • IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
  • DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
  • UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
  • Token-Pooling: "Token Pooling in Visual Transformers", arXiv, 2021 (Apple). [Paper]
  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][Jax]
  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch]
  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
  • ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", arXiv, 2022 (Nanjing University). [Paper][PyTorch]
  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", arXiv, 2022 (MIT). [Paper]
  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", arXiv, 2022 (Snap). [Paper][Code (in construction)]
  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]

Conv + Transformer

  • LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
  • Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
  • CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
  • CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
  • ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
  • ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
  • SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
  • ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
  • iFormer: "Inception Transformer", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]

Training + Transformer

  • iGPT: "Generative Pretraining From Pixels", ICML, 2020 (OpenAI). [Paper][Tensorflow]
  • MoCo-V3: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper]
  • DINO: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • drloc: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (University of Trento). [Paper][PyTorch]
  • CARE: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (Tencent). [Paper][PyTorch]
  • MST: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (SenseTime). [Paper]
  • SiT: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (University of Surrey). [Paper][PyTorch]
  • MoBY: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • ?: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (Pune Institute of Computer Technology, India). [Paper]
  • Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
  • BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][PyTorch]
  • EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
  • iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][PyTorch]
  • MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
  • AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code (in construction)]
  • MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][PyTorch][PyTorch (pengzhiliang)]
  • SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
  • SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
  • MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
  • RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
  • data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][PyTorch]
  • SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code (in construction)]
  • MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper]
  • CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
  • BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][PyTorch]
  • ?: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
  • IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
  • AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][PyTorch]
  • TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][Code (in construction)]
  • ?: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][PyTorch (rwightman)]
  • PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code (in construction)]
  • Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
  • DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
  • ?: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
  • ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][PyTorch (in construction)]
  • ViT-Adapter: "Vision Transformer Adapter for Dense Predictions", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)]
  • UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", arXiv, 2022 (Westlake University, China). [Paper][PyTorch]
  • GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][PyTorch]
  • HiViT: "HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling", arXiv, 2022 (CAS). [Paper]
  • ?: "A Closer Look at Self-supervised Lightweight Vision Transformers", arXiv, 2022 (Megvii). [Paper]
  • SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
  • SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
  • SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
  • ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
  • ?: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
  • ?: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][PyTorch][Website]
  • DropKey: "DropKey", arXiv, 2022 (Meitu). [Paper]
  • BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][PyTorch (in construction)]
  • PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][PyTorch]
  • MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", arXiv, 2022 (Microsoft). [Paper]
  • DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", arXiv, 2022 (JHU + UC Santa Cruz). [Paper][Code (in construction)]
  • dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
  • PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
  • Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
  • TL-Align: "Token-Label Alignment for Vision Transformers", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code (in construction)]

Robustness + Transformer

  • ViT-Robustness: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (Google). [Paper]
  • SAGA: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (University of Connecticut). [Paper]
  • ?: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (KAIST). [Paper][PyTorch]
  • ViTs-vs-CNNs: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • T-CNN: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (Facebook). [Paper]
  • Transformer-Attack: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (Xi'an Jiaotong). [Paper]
  • ?: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (University of Rennes). [Paper]
  • ?: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (ANU). [Paper][PyTorch]
  • ?: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (University of Pittsburgh). [Paper]
  • Token-Attack: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (New York University). [Paper]
  • ?: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (Google). [Paper]
  • ?: "Vision Transformers are Robust Learners", AAAI, 2022 (PyImageSearch + IBM). [Paper][Tensorflow]
  • PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][PyTorch]
  • MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
  • Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][PyTorch]
  • Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
  • ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
  • Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
  • Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
  • APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
  • Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][PyTorch]
  • RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
  • VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][PyTorch]
  • FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][PyTorch]
  • CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][PyTorch]
  • ?: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][PyTorch]
  • ?: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
  • AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
  • ?: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
  • ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][PyTorch]
  • ?: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][Code (in construction)]
  • ?: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
  • MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • ?: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • ?: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", arXiv, 2022 (Tel-Aviv). [Paper][PyTorch]
  • FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
  • RobustCNN: "Can CNNs Be More Robust Than Transformers?", arXiv, 2022 (UC Santa Cruz + JHU). [Paper][PyTorch]
  • Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code (in construction)]
  • ?: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
  • ?: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • ?: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
  • CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
  • ?: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
  • ?: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
  • C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
  • ?: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
  • RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
  • MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]

Model Compression + Transformer

  • ViT-quant: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • VTP: "Visual Transformer Pruning", arXiv, 2021 (Huawei). [Paper]
  • NViT: "NViT: Vision Transformer Compression and Parameter Redistribution", arXiv, 2021 (NVIDIA). [Paper]
  • MD-ViT: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (Princeton). [Paper]
  • FQ-ViT: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • UVC: "Unified Visual Transformer Compression", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • MiniViT: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Auto-ViT-Acc: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (Northeastern University). [Paper]
  • SPViT: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (Northeastern University). [Paper][PyTorch]
  • PSAQ-ViT: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (CAS). [Paper][PyTorch]
  • PTQ4ViT: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (Peking University). [Paper]
  • EAPruning: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (Meituan). [Paper]
  • Q-ViT: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (Beihang University). [Paper][PyTorch]
  • Q-ViT: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (Megvii). [Paper]
  • VAQF: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (Northeastern University). [Paper]
  • VTP: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (UCLA). [Paper]
  • SiDT: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (UC Irvine). [Paper]
  • I-ViT: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", arXiv, 2022 (CAS). [Paper]
  • PSAQ-ViT-V2: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (CAS). [Paper][PyTorch]
  • AS: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (Baidu). [Paper]
  • SaiT: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (Samsung). [Paper]
  • oViT: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (IST Austria). [Paper]

[Back to Overview]

Attention-Free

MLP-Series

  • RepMLP: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • EAMLP: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua University). [Paper]
  • Forward-Only: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (Oxford). [Paper][PyTorch]
  • ResMLP: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (Facebook). [Paper]
  • ?: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (Tsinghua). [Paper]
  • ViP: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • CCS: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (Baidu). [Paper]
  • S2-MLPv2: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (Baidu). [Paper]
  • RaftMLP: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (Rikkyo University, Japan). [Paper][PyTorch]
  • Hire-MLP: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (Huawei). [Paper]
  • Sparse-MLP: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (NUS). [Paper]
  • ConvMLP: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • sMLP: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (Microsoft). [Paper]
  • MLP-Mixer: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (Google). [Paper][Tensorflow][PyTorch-1 (lucidrains)][PyTorch-2 (rishikksh20)]
  • gMLP: "Pay Attention to MLPs", NeurIPS, 2021 (Google). [Paper][PyTorch (antonyvigouret)]
  • S2-MLP: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (Baidu). [Paper]
  • CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][PyTorch]
  • AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][PyTorch]
  • Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][PyTorch]
  • DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][PyTorch]
  • STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]
  • MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]
  • ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
  • MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][PyTorch]
  • PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][PyTorch]
  • SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][PyTorch]
  • gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHA Technology, Japan). [Paper]
  • ?: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]

Other Attention-Free

  • PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][PyTorch]
  • FocalNet: "Focal Modulation Networks", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]

[Back to Overview]

Analysis for Transformer

  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • Transformer-Explainability: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (Tel Aviv). [Paper][PyTorch]
  • ?: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (Princeton). [Paper]
  • ?: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (HKU). [Paper]
  • ?: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (Google). [Paper]
  • ?: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (MBZUAI). [Paper][PyTorch]
  • FoveaTer: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (UCSB). [Paper]
  • ?: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (Microsoft). [Paper]
  • ?: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (Google). [Paper]
  • ?: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (Horizon Robotics). [Paper]
  • ?: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (Temple University). [Paper][PyTorch]
  • FDSL: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (AIST). [Paper][PyTorch][Website]
  • AlterNet: "How Do Vision Transformers Work?", ICLR, 2022 (Yonsei University). [Paper][PyTorch]
  • ?: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (Google). [Paper][Tensorflow]
  • ?: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Microsoft). [Paper]
  • ?: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (Stanford). [Paper]
  • ?: "Three things everyone should know about Vision Transformers", ECCV, 2022 (Meta). [Paper]
  • ?: "Vision Transformers learn patch association", NeurIPS, 2022 (Princeton). [Paper]
  • AWD-ViT: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (JD). [Paper]
  • ?: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (Quintic AI, CA). [Paper][Code]
  • MJP: "Breaking the Chain of Gradient Leakage in Vision Transformers", arXiv, 2022 (Tencent). [Paper]
  • ViT-Shapley: "Learning to Estimate Shapley Values with Vision Transformers", arXiv, 2022 (UW). [Paper][PyTorch]
  • ?: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ?: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
  • ?: "Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems", arXiv, 2022 (Technion Israel Institute Of Technology). [Paper]
  • ProtoPFormer: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • ICLIP: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (HKUST). [Paper][Code (in construction)]
  • ?: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (Google). [Paper]
  • ?: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (Monash University). [Paper][PyTorch]

[Back to Overview]

Detection

Object Detection

  • CNN-based backbone:
    • DETR: "End-to-End Object Detection with Transformers", ECCV, 2020 (Facebook). [Paper][PyTorch]
    • Deformable DETR: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (SenseTime). [Paper][PyTorch]
    • UP-DETR: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch]
    • SMCA: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (CUHK). [Paper][PyTorch]
    • Conditional-DETR: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (Microsoft). [Paper]
    • PnP-DETR: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (Yitu). [Paper][Code (in construction)]
    • TSP: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (CMU). [Paper]
    • Dynamic-DETR: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (Microsoft). [Paper]
    • ViT-YOLO: "ViT-YOLO: Transformer-Based YOLO for Object Detection", ICCVW, 2021 (Xidian University). [Paper]
    • ACT: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (Peking + CUHK). [Paper][PyTorch]
    • DIL-ViT: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (Monash University Malaysia). [Paper]
    • Efficient-DETR: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (Megvii). [Paper]
    • CA-FPN: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (CAS). [Paper]
    • DETReg: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (Tel-Aviv + Berkeley). [Paper][Website]
    • GQPos: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (Megvii). [Paper]
    • Anchor-DETR: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (Megvii). [Paper][PyTorch]
    • Sparse-DETR: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (Kakao). [Paper][PyTorch]
    • DAB-DETR: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (IDEA, China). [Paper][PyTorch]
    • DN-DETR: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (International Digital Economy Academy (IDEA), China). [Paper][PyTorch]
    • SAM-DETR: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • AdaMixer: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (Nanjing University). [Paper][Code (in construction)]
    • DESTR: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (Oregon State). [Paper]
    • REGO: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • ?: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (Ant Group). [Paper]
    • DE-DETR: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (JD). [Paper][PyTorch]
    • DFFT: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (Tencent). [Paper]
    • KA: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (Zhejiang University). [Paper]
    • MIMDet: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", arXiv, 2022 (Tencent). [Paper][PyTorch]
    • imTED: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", arXiv, 2022 (CAS). [Paper]
    • AO2-DETR: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • MaskDINO: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", arXiv, 2022 (IDEA, China). [Paper][Code (in construction)]
    • TCC: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (The University of Sydney). [Paper]
    • Conditional-DETR-V2: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (Peking University). [Paper]
    • Group-DETR: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", arXiv, 2022 (Baidu). [Paper]
    • H-DETR: "DETRs with Hybrid Matching", arXiv, 2022 (Microsoft). [Paper]
    • SAM-DETR++: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • IMFA: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
    • ComplETR: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (Amazon). [Paper]
    • Obj2Seq: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", arXiv, 2022 (CAS). [Paper][PyTorch]
  • Transformer-based backbone:
    • ViT-FRCNN: "Toward Transformer-Based Object Detection", arXiv, 2020 (Pinterest). [Paper]
    • WB-DETR: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (CAS). [Paper]
    • YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (Horizon Robotics). [Paper][PyTorch]
    • ?: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (Facebook). [Paper]
    • ViDT: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (NAVER). [Paper][PyTorch]
    • FP-DETR: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (USTC). [Paper]
    • DETR++: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (Google). [Paper]
    • ViTDet: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (Meta). [Paper]
    • UViT: "A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation", ECCV, 2022 (Google). [Paper]
    • D2ETR: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (Alibaba). [Paper][PyTorch]
    • DINO: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", arXiv, 2022 (IDEA, China). [Paper][Code (in construction)]

[Back to Overview]

3D Object Detection

  • AST-GRU: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (Baidu). [Paper][Code (in construction)]
  • Pointformer: "3D Object Detection with Pointformer", arXiv, 2020 (Tsinghua). [Paper]
  • CT3D: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (Alibaba). [Paper][Code (in construction)]
  • Group-Free-3D: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • VoTr: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (CUHK + NUS). [Paper]
  • 3DETR: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • DETR3D: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (MIT). [Paper]
  • M3DeTR: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (University of Maryland). [Paper][PyTorch]
  • SST: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (CAS). [Paper][PyTorch]
  • MonoDTR: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (NTU). [Paper][Code (in construction)]
  • VoxSeT: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TransFusion: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (HKUST). [Paper][PyTorch]
  • CAT-Det: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (Beihang University). [Paper]
  • TokenFusion: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (Tsinghua). [Paper]
  • LIFT: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (Shanghai Jiao Tong University). [Paper]
  • BoxeR: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (University of Amsterdam). [Paper][PyTorch]
  • BrT: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (Tsinghua). [Paper]
  • VISTA: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
  • STRL: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (Bosch). [Paper]
  • MTrans: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (HKU). [Paper][PyTorch]
  • CenterFormer: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (TuSimple). [Paper][Code (in construction)]
  • BUTD-DETR: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (CMU). [Paper][PyTorch][Website]
  • SpatialDETR: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (Mercedes-Benz). [Paper][PyTorch]
  • CramNet: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (Waymo). [Paper]
  • PETR: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (Megvii). [Paper]
  • MonoDETR: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code (in construction)]
  • Graph-DETR3D: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (University of Science and Technology of China). [Paper]
  • UVTR: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", arXiv, 2022 (CUHK). [Paper][PyTorch]
  • PETRv2: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", arXiv, 2022 (Megvii). [Paper]
  • PolarFormer: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
  • AST-GRU: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • SEFormer: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (Tsinghua University). [Paper]
  • CRAFT: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (KAIST). [Paper]
  • CrossDTR: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (NTU). [Paper][Code (in construction)]
  • SWFormer: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", arXiv, 2022 (Waymo). [Paper]

[Back to Overview]

Multi-Modal Detection

  • OVR-CNN: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (Snap). [Paper][PyTorch]
  • MDETR: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (NYU). [Paper][PyTorch][Website]
  • FETNet: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (Tsinghua). [Paper]
  • MEDUSA: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (Google). [Paper][PyTorch]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (Baidu). [Paper]
  • MAVL: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • OWL-ViT: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (Google). [Paper]
  • X-DETR: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (Amazon). [Paper]
  • simCrossTrans: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (The City University of New York). [Paper][PyTorch]
  • ?: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (USC). [Paper]
  • YONOD: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (CUNY). [Paper][PyTorch]
  • OmDet: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (Binjiang Institute of Zhejiang University). [Paper]
  • Detection-Hub: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • F-VLM: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", arXiv, 2022 (Google). [Paper]
  • ContFormer: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (Peking University). [Paper]

[Back to Overview]

HOI Detection

  • HOI-Transformer: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (Megvii). [Paper][PyTorch]
  • HOTR: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (Kakao + Korea University). [Paper][PyTorch]
  • MSTR: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (Kakao). [Paper]
  • SSRT: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (Amazon). [Paper]
  • CPC: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (Korea University). [Paper][PyTorch (in construction)]
  • DisTR: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (Baidu). [Paper]
  • STIP: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (JD). [Paper][PyTorch]
  • DOQ: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (South China University of Technology). [Paper]
  • UPT: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (Australian Centre for Robotic Vision). [Paper][PyTorch][Website]
  • CATN: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (Huazhong University of Science and Technology). [Paper]
  • HQM: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (South China University of Technology). [Paper][PyTorch]
  • Iwin: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (Shanghai Jiao Tong). [Paper]
  • ?: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]

[Back to Overview]

Salient Object Detection

  • VST: "Visual Saliency Transformer", ICCV, 2021 (Northwestern Polytechnical University). [Paper]
  • ?: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (Baidu). [Paper]
  • SwinNet: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (Anhui University). [Paper][Code]
  • SOD-Transformer: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
  • GLSTR: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (South China University of Technology). [Paper]
  • TriTransNet: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (Anhui University). [Paper]
  • AbiU-Net: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (Nankai University). [Paper]
  • TranSalNet: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (Cardiff University, UK). [Paper]
  • DFTR: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (Tencent). [Paper]
  • GroupTransNet: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (Nankai University). [Paper]
  • SelfReformer: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (NTU, Singapore). [Paper]
  • DTMINet: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (CUHK). [Paper]
  • MCNet: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • SiaTrans: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (Shandong University of Science and Technology). [Paper]

[Back to Overview]

Other Detection Tasks

  • X-supervised:
    • LOST: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (Valeo.ai). [Paper][PyTorch]
    • Omni-DETR: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (Amazon). [Paper][PyTorch]
    • TokenCut: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
    • WS-DETR: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (Microsoft). [Paper]
    • TRT: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • TokenCut: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
  • X-Shot Object Detection:
    • AIT: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (Academia Sinica). [Paper]
    • Meta-DETR: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (NTU Singapore). [Paper][PyTorch]
    • CAT: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
    • FCT: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (Columbia). [Paper]
    • SaFT: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (Microsoft). [Paper]
    • Meta-DETR: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (NTU, Singapore). [Paper]
    • Incremental-DETR: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (NUS). [Paper]
    • FS-DETR: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", arXiv, 2022 (Samsung). [Paper]
  • Open-World/Vocabulary:
    • OW-DETR: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (IIAI). [Paper][PyTorch]
    • DetPro: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • PromptDet: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (Meituan). [Paper][PyTorch][Website]
    • OV-DETR: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (NTU, Singapore). [Paper]
    • DetCLIP: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (HKUST). [Paper]
  • Pedestrian Detection:
    • PED: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (Tsinghua). [Paper][PyTorch]
    • Pedestron: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (IIAI). [Paper][PyTorch]
  • Lane Detection:
    • LSTR: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (Xi'an Jiaotong). [Paper][PyTorch]
    • LETR: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Laneformer: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (Huawei). [Paper]
    • TLC: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (Peking University). [Paper]
    • PersFormer: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (Shanghai AI Laboratory). [Paper][PyTorch]
    • PriorLane: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (Zhejiang Lab). [Paper][PyTorch]
    • CurveFormer: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (NullMax, China). [Paper]
  • Object Localization:
    • TS-CAM: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (CAS). [Paper]
    • LCTR: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (Xiamen University). [Paper]
    • ViTOL: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (Mercedes-Benz). [Paper][PyTorch]
    • SCM: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (CUHK). [Paper][PyTorch]
    • CaFT: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper]
  • Relation Detection:
    • PST: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (Amazon). [Paper]
    • PST: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (Amazon). [Paper]
    • TROI: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (NUS, Singapore). [Paper]
    • RelTransformer: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • VReBERT: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (ANU). [Paper]
  • Anomaly Detection:
    • VT-ADL: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (University of Udine, Italy). [Paper]
    • InTra: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (Fujitsu). [Paper]
    • AnoViT: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (Korea University). [Paper]
    • ?: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (Korea University). [Paper]
  • Cross-Domain:
    • SSTN: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • DA-DETR: "DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention", arXiv, 2021 (NTU Singapore). [Paper]
    • MTTrans: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (Beihang University). [Paper]
    • OAA-OTA: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
    • SSTA: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • Co-Salient Object Detection:
    • CoSformer: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (Nanjing University). [Paper]
  • Oriented Object Detection:
    • O2DETR: "Oriented Object Detection with Transformer", arXiv, 2021 (Baidu). [Paper]
  • Multiview Detection:
    • MVDeTr: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (ANU). [Paper]
  • Polygon Detection:
    • ?: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (Delft University of Technology, Netherlands). [Paper]
  • Drone-view:
    • TPH: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (Beihang University). [Paper]
    • TransVisDrone: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Infrared:
    • ?: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (Chongqing University of Posts and Telecommunications). [Paper]
  • Text:
    • SwinTextSpotter: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
    • TESTR: "Text Spotting Transformers", CVPR, 2022 (UCSD). [Paper][PyTorch]
    • TTS: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (Amazon). [Paper]
    • TransDETR: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
    • ?: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (University of Science and Technology Beijing). [Paper][Code (in construction)]
    • DPText-DETR: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", arXiv, 2022 (JD). [Paper][Code (in construction)]
    • DPTNet: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (Xiamen University). [Paper]
  • Change Detection:
    • ChangeFormer: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (JHU). [Paper][PyTorch]
    • IDET: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (Civil Aviation University of China). [Paper]
  • Edge Detection:
    • EDTER: "EDTER: Edge Detection with Transformer", CVPR, 2022 (Beijing Jiaotong University). [Paper][Code (in construction)]
    • HEAT: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (Simon Fraser). [Paper][PyTorch][Website]
  • Person Search:
    • COAT: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (Kitware). [Paper][PyTorch]
    • PSTR: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (Tianjin University). [Paper][PyTorch]
  • Manipulation Detection:
    • ObjectFormer: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (Fudan University). [Paper]
  • Grounded Situation Recognition:
    • CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
  • Mirror Detection:
    • SATNet: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]

[Back to Overview]

Segmentation

Semantic Segmentation

  • SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch][Website]
  • TrSeg: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (Korea University). [Paper][PyTorch]
  • CWT: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • Segmenter: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (INRIA). [Paper][PyTorch]
  • UN-EPT: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (Amazon). [Paper][PyTorch]
  • SegFormer: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • FTN: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (Baidu). [Paper]
  • OffRoadTranSeg: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (IISER, India). [Paper]
  • MaskFormer: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", arXiv, 2021 (UIUC + Facebook). [Paper][Website]
  • TRFS: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (ETHZ). [Paper]
  • Flying-Guide-Dog: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (KIT, Germany). [Paper][Code (in construction)]
  • VSPW: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (Xiaomi). [Paper]
  • SDTP: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (?). [Paper]
  • TopFormer: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GroupViT: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (NVIDIA). [Paper][Website][PyTorch]
  • HRViT: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch]
  • GReaT: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (HKUST). [Paper]
  • SegViT: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (The University of Adelaide, Australia). [Paper]
  • RTFormer: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (Baidu). [Paper][Paddle]
  • Lawin: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • PFT: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (CUHK + SenseTime). [Paper]
  • DFlatFormer: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (OPPO). [Paper]
  • FeSeFormer: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (Baidu). [Paper]
  • StructToken: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (Shanghai AI Lab). [Paper]
  • TSG: "Transformer Scale Gate for Semantic Segmentation", arXiv, 2022 (Monash University, Australia). [Paper]
  • HILA: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (University of Toronto). [Paper][Website][PyTorch]
  • HLG: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (Fudan University). [Paper][PyTorch]
  • SSformer: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • NamedMask: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (Oxford). [Paper][PyTorch][Website]

[Back to Overview]

Depth Estimation

  • DPT: "Vision Transformers for Dense Prediction", ICCV, 2021 (Intel). [Paper][PyTorch]
  • TransDepth: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (Harbin Institute of Technology + University of Trento). [Paper][PyTorch]
  • ASTransformer: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (USTC). [Paper][PyTorch]
  • MT-SfMLearner: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • DepthFormer: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (Toyota). [Paper]
  • GuideFormer: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (Agency for Defense Development, Korea). [Paper]
  • SparseFormer: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (Meta). [Paper]
  • DEST: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (NVIDIA). [Paper]
  • MonoViT: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (University of Bologna, Italy). [Paper][PyTorch]
  • Spike-Transformer: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • GLPanoDepth: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (Nanjing University). [Paper]
  • DepthFormer: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • BinsFormer: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • SideRT: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (Meituan). [Paper]
  • MonoFormer: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (DGIST, Korea). [Paper]
  • Depthformer: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (Indian Institute of Technology Delhi). [Paper]
  • TODE-Trans: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (USTC). [Paper][Code (in construction)]

[Back to Overview]

Object Segmentation

  • SOTR: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (China Agricultural University). [Paper][PyTorch]
  • Trans4Trans: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • Trans2Seg: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (HKU + SenseTime). [Paper][PyTorch]
  • SOIT: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (Hikvision). [Paper][PyTorch]
  • CAST: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (Berkeley). [Paper]

[Back to Overview]

Other Segmentation Tasks

  • Vision-Language:
    • LSeg: "Language-driven Semantic Segmentation", ICLR, 2022 (Cornell). [Paper][PyTorch]
    • ZegFormer: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • CLIPSeg: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (University of Göttingen, Germany). [Paper][PyTorch]
    • DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (Tsinghua University). [Paper][PyTorch][Website]
    • MaskCLIP: "Extract Free Dense Labels from CLIP", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
  • Multi-Modal:
    • CMX: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
  • Panoptic Segmentation:
    • MaX-DeepLab: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (Google). [Paper][PyTorch (conradry)]
    • SIAin: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (SI Analytics, South Korea). [Paper]
    • VPS-Transformer: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CMT-DeepLab: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (Google). [Paper]
    • Panoptic-SegFormer: "Panoptic SegFormer", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
    • Mask2Former: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • kMaX-DeepLab: "k-means Mask Transformer", ECCV, 2022 (Google). [Paper][TensorFlow]
    • Panoptic-PartFormer: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (Peking). [Paper][PyTorch]
  • Instance Segmentation:
    • ISTR: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (Xiamen University). [Paper][PyTorch]
    • Mask-Transfiner: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch][Website]
    • BoundaryFormer: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (UCSD). [Paper]
    • PPT: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (ByteDance). [Paper]
    • OSFormer: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • AISFormer: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (University of Arkansas). [Paper][Code (in construction)]
  • Optical Flow:
    • CRAFT: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (A*STAR, Singapore). [Paper][PyTorch]
    • KPA-Flow: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (Megvii). [Paper][PyTorch (in construction)]
    • GMFlowNet: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (Rutgers). [Paper][PyTorch]
    • FlowFormer: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (CUHK). [Paper][Website]
  • Panoramic Semantic Segmentation:
    • Trans4PASS: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
  • X-Shot:
    • CyCTR: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (University of Technology Sydney). [Paper]
    • CATrans: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (Baidu). [Paper]
    • VAT: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (Korea University). [Paper][PyTorch][Website]
    • DCAMA: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (Tencent). [Paper]
    • IPMT: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (Northwestern Polytechnical University). [Paper][PyTorch]
    • TAFT: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (KAIST). [Paper]
    • MSANet: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (AiV Research Group, Korea). [Paper][PyTorch]
  • X-Supervised:
    • MCTformer: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (The University of Western Australia). [Paper][Code (in construction)]
    • AFA: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • HSG: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (Berkeley). [Paper][PyTorch]
    • ?: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (Université Paris-Saclay, France). [Paper]
    • SegSwap: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (École des Ponts ParisTech). [Paper][PyTorch][Website]
    • ViT-PCM: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (Sapienza University, Italy). [Paper][TensorFlow]
    • TransFGU: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • TransCAM: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (University of Toronto). [Paper][PyTorch]
    • WegFormer: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Tongji University, China). [Paper]
    • MaskDistill: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (KU Leuven). [Paper][PyTorch]
    • eX-ViT: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (La Trobe University, Australia). [Paper]
    • TCC: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (Alibaba). [Paper]
  • Cross-Domain:
    • DAFormer: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch]
  • Crack Detection:
    • CrackFormer: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • Camouflaged Object Detection:
    • UGTR: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (Group42, Abu Dhabi). [Paper][PyTorch]
    • COD: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (Anhui University, China). [Paper][Code (in construction)]
  • Background Separation:
    • TransBlast: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (University of British Columbia). [Paper]
  • Scene Understanding:
    • BANet: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (Wuhan University). [Paper]
    • Cerberus-Transformer: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • IRISformer: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (UCSD). [Paper][Code (in construction)]
    • InvPT: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (HKUST). [Paper][PyTorch]
  • 3D Segmentation:
    • Stratified-Transformer: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • CodedVTR: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (Tsinghua). [Paper]
    • M2F3D: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper][Website]
  • Multi-Task:
    • MQTransformer: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (Wuhan University). [Paper]
  • Forecasting:
    • DiffAttn: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (UIUC). [Paper][Code (in construction)]
  • LiDAR:
    • HelixNet: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (CNRS, France). [Paper][Website][PyTorch]
  • Co-Segmentation:
    • DINO-ViT-feature: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
  • Top-Down Semantic Segmentation:
    • Trans4Map: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
  • Open-World/Vocabulary:
    • ViL-Seg: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (CUHK). [Paper]
    • OVSeg: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", arXiv, 2022 (Meta). [Paper][Website]
  • Applications:
    • FloodTransformer: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (BITS Pilani, India). [Paper]

[Back to Overview]

Video (High-level)

Action Recognition

  • RGB mainly
    • Action Transformer: "Video Action Transformer Network", CVPR, 2019 (DeepMind). [Paper][Code (ppriyank)]
    • ViViT-Ensemble: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (Alibaba). [Paper]
    • TimeSformer: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (Facebook). [Paper][PyTorch (lucidrains)]
    • MViT: "Multiscale Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VidTr: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (Amazon). [Paper][PyTorch]
    • ViViT: "ViViT: A Video Vision Transformer", ICCV, 2021 (Google). [Paper][PyTorch (rishikksh20)]
    • VTN: "Video Transformer Network", ICCVW, 2021 (Theator). [Paper][PyTorch]
    • TokShift: "Token Shift Transformer for Video Classification", ACMMM, 2021 (CUHK). [Paper][PyTorch]
    • Motionformer: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (Facebook). [Paper][PyTorch][Website]
    • X-ViT: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (Samsung). [Paper][PyTorch]
    • SCT: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (Kuaishou). [Paper]
    • RSANet: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (POSTECH). [Paper][PyTorch][Website]
    • STAM: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (Alibaba). [Paper][Code]
    • GAT: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (Samsung). [Paper]
    • TokenLearner: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (Google). [Paper]
    • VLF: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (The University of Sheffield). [Paper]
    • UniFormer: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (CAS + SenseTime). [Paper][PyTorch]
    • Video-Swin: "Video Swin Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • DirecFormer: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (University of Arkansas). [Paper][Code (in construction)]
    • DVT: "Deformable Video Transformer", CVPR, 2022 (Meta). [Paper]
    • MeMViT: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (Meta). [Paper]
    • MLP-3D: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (JD). [Paper][PyTorch (in construction)]
    • RViT: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (TCL Corporate Research, HK). [Paper]
    • SIFA: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (JD). [Paper][PyTorch]
    • MViTv2: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (Meta). [Paper][PyTorch]
    • MTV: "Multiview Transformers for Video Recognition", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ORViT: "Object-Region Video Transformers", CVPR, 2022 (Tel Aviv). [Paper][Website]
    • TIME: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (KAIST). [Paper][PyTorch]
    • TPS: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • DualFormer: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • STTS: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • Turbo: "Turbo Training with Token Dropout", BMVC, 2022 (Oxford). [Paper]
    • MultiTrain: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
    • AIA: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (University of Science and Technology of China). [Paper][PyTorch]
    • MSCA: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (Nagoya Institute of Technology). [Paper]
    • SViT: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", arXiv, 2022 (Tel Aviv). [Paper][Website]
    • VAST: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (Samsung). [Paper]
    • Video-MobileFormer: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (Microsoft). [Paper]
    • MAM2: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (Baidu). [Paper]
    • ?: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (SenseTime). [Paper]
  • Depth:
    • Trear: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (Tianjin University). [Paper]
  • Pose:
    • ST-TR: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (Polytechnic University of Milan). [Paper]
    • AcT: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
    • STAR: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (UCLA). [Paper]
    • GCsT: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (CAS). [Paper]
    • GL-Transformer: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • ?: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (University of Delaware). [Paper]
    • FG-STFormer: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (Zhengzhou University). [Paper]
    • STTFormer: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (Xidian University). [Paper][Code (in construction)]
    • ProFormer: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • ?: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (Harbin Institute of Technology). [Paper]
    • STAN: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (The University of Surrey, UK). [Paper]
    • STAR-Transformer: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (Keimyung University, Korea). [Paper]
  • Multi-modal:
    • MBT: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (Google). [Paper]
    • MM-ViT: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (OPPO). [Paper]
    • MMT-NCRC: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (UCF). [Paper][Code (in construction)]
    • M&M: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (Google). [Paper]
    • VT-CE: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (A*STAR). [Paper]
    • Hi-TRS: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (Rutgers). [Paper][PyTorch]
    • MVFT: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (Alibaba). [Paper]
    • MOV: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (Google). [Paper]
    • MotionBERT: "MotionBERT: Unified Pretraining for Human Motion Analysis", arXiv, 2022 (Peking University). [Paper][Code (in construction)][Website]
  • Group Activity:
    • GroupFormer: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (SenseTime). [Paper]
    • ?: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (Hitachi). [Paper]

[Back to Overview]

Action Detection/Localization

  • OadTR: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • RTD-Net: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • FS-TAL: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • LSTR: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (Amazon). [Paper][PyTorch][Website]
  • ATAG: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (Alibaba). [Paper]
  • TAPG-Transformer: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • TadTR: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (Alibaba). [Paper][Code (in construction)]
  • Vidpress-Soccer: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (Baidu). [Paper][GitHub]
  • MS-TCT: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (INRIA). [Paper][PyTorch]
  • UGPT: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • TubeR: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (Amazon). [Paper]
  • DDM-Net: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
  • ?: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (ByteDance). [Paper][PyTorch]
  • ?: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (Renmin University of China). [Paper]
  • EAMAT: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (Beijing Institute of Technology). [Paper][Code (in construction)]
  • STPT: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (Monash University, Australia). [Paper]
  • TeSTra: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TALLFormer: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (UNC). [Paper][PyTorch]
  • CoOadTR: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (Aarhus University, Denmark). [Paper][PyTorch]
  • ActionFormer: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (UW-Madison). [Paper][PyTorch]
  • Temporal-Perceiver: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (Nanjing University). [Paper]
  • LocATe: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (Stanford). [Paper]
  • HTNet: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (Korea University). [Paper]
  • AdaPerFormer: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (Tianjin University). [Paper][PyTorch]
  • CWC-Trans: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (Meituan). [Paper]

[Back to Overview]

Action Prediction/Anticipation

  • AVT: "Anticipative Video Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • HORST: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (NVIDIA). [Paper][PyTorch]
  • ?: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (A*STAR). [Paper]
  • FUTR: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (POSTECH). [Paper]
  • TTPP: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", arXiv, 2022 (CAS). [Paper]
  • VPTR: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • Earthformer: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", arXiv, 2022 (Amazon). [Paper]

[Back to Overview]

Video Object Segmentation

  • SSTVOS: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (Modiface). [Paper][Code (in construction)]
  • JOINT: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
  • AOT: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (University of Technology Sydney). [Paper][PyTorch (yoxu515)][Code (in construction)]
  • TransVOS: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (Zhejiang University). [Paper]
  • SITVOS: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (JD). [Paper]
  • MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
  • HODOR: "Differentiable Soft-Masked Attention", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper]
  • BATMAN: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (Microsoft). [Paper]
  • AOST: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (University of Technology Sydney). [Paper][Code (in construction)]

[Back to Overview]

Video Instance Segmentation

  • VisTR: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (Meituan). [Paper][PyTorch]
  • IFC: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (Yonsei University). [Paper][PyTorch]
  • Deformable-VisTR: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (University at Buffalo). [Paper][Code (in construction)]
  • TeViT: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GMP-VIS: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (Shandong University). [Paper]
  • VMT: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (ETHZ). [Paper][GitHub][Website]
  • SeqFormer: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • MS-STS: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • VITA: "VITA: Video Instance Segmentation via Object Token Association", arXiv, 2022 (Yonsei University). [Paper][Code (in construction)]
  • IFR: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (Microsoft). [Paper]
  • DeVIS: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (TUM). [Paper][PyTorch]
  • MinVIS: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", arXiv, 2022 (NVIDIA). [Paper][PyTorch]
  • InstanceFormer: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (Ludwig Maximilian University of Munich). [Paper][Code (in construction)]

[Back to Overview]

Other Video Tasks

  • Action Segmentation:
    • ASFormer: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (Peking University). [Paper][PyTorch]
    • Bridge-Prompt: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • SC-Transformer++: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (CAS). [Paper][Code (in construction)]
    • LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • ?: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (TUM). [Paper]
    • CETNet: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (Shijiazhuang Tiedao University). [Paper]
    • EUT: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (CAS). [Paper]
    • SC-Transformer: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (CAS). [Paper]
  • Video X Segmentation:
    • STT: "Video Semantic Segmentation via Sparse Temporal Transformer", MM, 2021 (Shanghai Jiao Tong). [Paper]
    • CFFM: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (ETH Zurich). [Paper][PyTorch]
    • TF-DL: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (Google). [Paper]
    • MRCFA: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (ETH Zurich). [Paper][PyTorch]
    • PolyphonicFormer: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (Wuhan University). [Paper][Code (in construction)]
    • ?: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
  • Video Object Detection:
    • TransVOD: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (Shanghai Jiao Tong + SenseTime). [Paper][Code (in construction)]
    • MODETR: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-MTL: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-DETR: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (Valeo, Egypt). [Paper]
    • PTSEFormer: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
    • TransVOD: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (Shanghai Jiao Tong + SenseTime). [Paper]
    • ?: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (Zenseact, Sweden). [Paper]
  • Video Retrieval:
    • SVRTN: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (Alibaba). [Paper]
  • Video Hashing:
    • BTH: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (Tsinghua). [Paper][PyTorch]
  • Video-Language:
    • ?: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (Shanghai Jiao Tong + Oxford). [Paper][PyTorch][Website]
    • X-CLIP: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • EVL: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (CUHK). [Paper][PyTorch (in construction)]
    • STALE: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (University of Surrey, UK). [Paper][Code (in construction)]
    • FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
  • X-supervised Learning:
    • LSTCL: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (Facebook). [Paper]
    • SVT: "Self-supervised Video Transformer", CVPR, 2022 (Stony Brook). [Paper][PyTorch][Website]
    • BEVT: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • SCVRL: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (Amazon). [Paper]
    • VideoMAE: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", CVPRW, 2022 (Tencent). [Paper][Code (in construction)]
    • VIMPAC: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (UNC). [Paper][PyTorch]
    • ?: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (CUHK). [Paper]
    • MAE: "Masked Autoencoders As Spatiotemporal Learners", arXiv, 2022 (Meta). [Paper]
    • OmniMAE: "OmniMAE: Single Model Masked Pretraining on Images and Videos", arXiv, 2022 (Meta). [Paper][PyTorch]
    • MaskViT: "MaskViT: Masked Visual Pre-Training for Video Prediction", arXiv, 2022 (Stanford). [Paper][Website]
    • ?: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (Georgia Tech). [Paper]
  • X-shot:
    • ResT: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (Microsoft). [Paper]
    • ViSET: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (University of South Florida). [Paper]
    • REST: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (Samsung). [Paper]
  • Anomaly Detection:
    • CT-D2GAN: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (NEC). [Paper]
    • ADTR: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 (Shanghai Jiao Tong University). [Paper]
    • SSMCTB: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Relation Detection:
    • VidVRD: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (Zhejiang University). [Paper][PyTorch]
    • VRDFormer: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (Renmin University of China). [Paper][Code (in construction)]
    • VidSGG-BIG: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • Saliency Prediction:
    • STSANet: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (Shanghai University). [Paper]
    • UFO: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (South China University of Technology). [Paper][PyTorch]
  • Video Inpainting Detection:
    • FAST: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (Tsinghua University). [Paper]
  • Driver Activity:
    • TransDARC: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • ?: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (Jericho High School, NY). [Paper]
    • ViT-DD: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (Purdue). [Paper][PyTorch (in construction)]
  • Video Alignment:
    • DGWT: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (University of New South Wales, Australia). [Paper]
  • Sport-related:
    • Skating-Mixer: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (Southern University of Science and Technology). [Paper]
  • Action Counting:
    • TransRAC: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (ShanghaiTech). [Paper][PyTorch][Website]
  • Action Quality Assessment:
    • ?: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (Baidu). [Paper]
    • ?: "Action Quality Assessment using Transformers", arXiv, 2022 (USC). [Paper]
  • Human Interaction:
    • IGFormer: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (The University of Melbourne). [Paper]
  • Domain Adaptation:
    • UDAVT: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (University of Trento). [Paper][Code (in construction)]
  • Multi-Camera Editing:
    • TC-Transformer: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (CUHK). [Paper]

[Back to Overview]

Multi-Modality

Visual Captioning

  • Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
  • ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
  • M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
  • BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
  • ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
  • MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
  • SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
  • DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
  • CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
  • ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
  • LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
  • LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
  • GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
  • GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
  • PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
  • VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
  • ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
  • CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
  • ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
  • SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
  • RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-based Multimedia Indexing (CBMI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
  • VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
  • GRIT: "GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
  • CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
  • ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
  • D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
  • SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
  • VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tubingen, Germany). [Paper]
  • ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]

[Back to Overview]

Visual Question Answering

  • MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
  • M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
  • SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
  • ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
  • TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
  • UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
  • TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
  • ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
  • VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
  • ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
  • TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
  • Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong). [Paper]
  • RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper][PyTorch]
  • Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
  • X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
  • SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
  • LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
  • QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
  • WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
  • ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
  • cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
  • WildQA: "WildQA: In-the-Wild Video Question Answering", International Conference on Computational Linguistics (COLING), 2022 (University of Michigan). [Paper][Website]
  • Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
  • ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
  • VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
  • ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
  • MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
  • DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
  • MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
  • MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
  • PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
  • TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
  • ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
  • DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
  • PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
  • REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", arXiv, 2022 (Microsoft). [Paper]
  • TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
  • UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
  • CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
  • WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
  • mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
  • CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
  • LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]

[Back to Overview]

Visual Grounding

  • Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
  • TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
  • ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
  • GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
  • MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
  • STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
  • TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
  • GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
  • DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
  • Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
  • VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
  • UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
  • Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
  • TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
  • MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
  • GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
  • QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
  • STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
  • SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
  • VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
  • SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
  • BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", arXiv, 2022 (Microsoft). [Paper]
  • GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
  • HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
  • Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]

[Back to Overview]

Multi-Modal Representation Learning

  • LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
  • ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
  • Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
  • UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
  • COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
  • Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
  • VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
  • CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
  • CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (OpenAI). [Paper][PyTorch]
  • ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
  • VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper]
  • VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
  • SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
  • CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
  • Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
  • UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
  • TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
  • LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
  • UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
  • LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
  • METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
  • ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
  • ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
  • CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
  • VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
  • VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
  • X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
  • BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
  • MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
  • UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
  • TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
  • VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", arXiv, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • Omnivore: "Omnivore: A Single Model for Many Visual Modalities", arXiv, 2022 (Meta). [Paper][PyTorch]
  • MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
  • Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", arXiv, 2022 (DeepMind). [Paper]
  • PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", arXiv, 2022 (Tencent). [Paper]
  • CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", arXiv, 2022 (Google). [Paper]
  • VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", arXiv, 2022 (Google). [Paper]
  • GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", arXiv, 2022 (Microsoft). [Paper]
  • CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", arXiv, 2022 (UCLA). [Paper]
  • CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
  • VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
  • EgoVLP: "Egocentric Video-Language Pretraining", arXiv, 2022 (NUS). [Paper][Code (in construction)]
  • Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
  • Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", arXiv, 2022 (SenseTime). [Paper]
  • MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • DaVinci: "Prefix Language Models are Unified Modal Learners", arXiv, 2022 (ByteDance). [Paper]
  • FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
  • LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
  • LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
  • MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", arXiv, 2022 (Amazon). [Paper]
  • LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", arXiv, 2022 (Huawei). [Paper]
  • Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
  • VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
  • BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
  • DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
  • ?: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", arXiv, 2022 (Microsoft). [Paper]
  • ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
  • PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", arXiv, 2022 (Google). [Paper]
  • CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • ERNIE: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
  • Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", arXiv, 2022 (Google). [Paper]
  • VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", arXiv, 2022 (JHU). [Paper]
  • MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", arXiv, 2022 (Tsinghua + Waseda). [Paper][PyTorch]
  • ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
  • MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
  • EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
  • xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", arXiv, 2022 (Microsoft). [Paper]

[Back to Overview]

Multi-Modal Retrieval

  • General:
    • Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
    • HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
    • TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
    • VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
    • CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
    • MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
    • TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
    • SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
    • LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
    • TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
    • HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
    • ?: "Revising Image-Text Retrieval via Multi-Modal Entailment", arXiv, 2022 (Soochow University, China). [Paper]
    • TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
  • Video:
    • MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
    • ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
    • AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
    • HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
    • WebVid-2M: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper]
    • UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
    • MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
    • X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
    • MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
    • CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
    • X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
    • HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
    • TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
    • LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
    • ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
    • ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
    • RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • BridgeFormer: "BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions", arXiv, 2022 (HKU). [Paper][Website]
    • MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
    • MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", arXiv, 2022 (HKU). [Paper]
    • M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
    • FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]

[Back to Overview]

Multi-Modal Generation

  • General:
    • DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
    • CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
    • Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
    • Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
    • DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
    • DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
    • CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
    • ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
    • Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", arXiv, 2022 (Google). [Paper][Website]
    • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
    • ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
    • Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
    • ?: "Prompt-to-Prompt Image Editing with Cross Attention Control", arXiv, 2022 (Google). [Paper]
    • Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
    • PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
    • FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
  • Video:
    • CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", arXiv, 2022 (Tsinghua University). [Paper][GitHub (in construction)]
    • Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", arXiv, 2022 (Meta). [Paper]
    • Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
    • Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]

[Back to Overview]

Visual Document Understanding

  • LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
  • DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
  • LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
  • TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
  • ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
  • Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
  • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
  • DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
  • DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
  • DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
  • VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
  • Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
  • TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]

[Back to Overview]

Scene Graph

  • BGT-Net: "BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation", CVPRW, 2021 (ETHZ). [Paper]
  • STTran: "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", ICCV, 2021 (Leibniz University Hannover, Germany). [Paper][PyTorch]
  • SGG-NLS: "Learning to Generate Scene Graph from Natural Language Supervision", ICCV, 2021 (University of Wisconsin-Madison). [Paper][PyTorch]
  • SGG-Seq2Seq: "Context-Aware Scene Graph Generation With Seq2Seq Transformers", ICCV, 2021 (Layer 6 AI, Canada). [Paper][PyTorch]
  • RELAX: "Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs", BMVC, 2021 (Samsung). [Paper]
  • Relation-Transformer: "Scenes and Surroundings: Scene Graph Generation using Relation Transformer", arXiv, 2021 (LMU Munich). [Paper]
  • SGTR: "SGTR: End-to-end Scene Graph Generation with Transformer", CVPR, 2022 (ShanghaiTech). [Paper][Code (in construction)]
  • GCL: "Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation", CVPR, 2022 (Shandong University). [Paper][PyTorch]
  • Relationformer: "Relationformer: A Unified Framework for Image-to-Graph Generation", ECCV, 2022 (TUM). [Paper][Code (in construction)]
  • SVRP: "Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning", ECCV, 2022 (Monash University). [Paper]
  • RelTR: "RelTR: Relation Transformer for Scene Graph Generation", arXiv, 2022 (Leibniz University Hannover, Germany). [Paper][PyTorch]

[Back to Overview]

Other Multi-Modal Tasks

  • Prompt Learning:
    • CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
    • VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
    • CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch]
    • LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", arXiv, 2022 (Samsung). [Paper]
    • PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", arXiv, 2022 (CMU). [Paper]
    • VPT: "Variational prompt tuning improves generalization of vision-language models", arXiv, 2022 (Samsung). [Paper]
    • MaPLe: "MaPLe: Multi-modal Prompt Learning", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
    • CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
    • PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
    • UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
    • ?: "Visual Classification via Description from Large Language Models", arXiv, 2022 (Columbia). [Paper]
  • X-Shot:
    • VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
    • LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", arXiv, 2022 (Google). [Paper]
  • Segmentation:
    • VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
    • LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
    • ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
  • Tracking:
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", arXiv, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
  • Analysis:
    • MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
    • ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
    • VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
    • ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
    • VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
  • Speaker Localization:
    • ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
  • Multi-task:
    • UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
    • Pix2Seq: "A Unified Sequence Interface for Vision Tasks", arXiv, 2022 (Google). [Paper]
    • Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", arXiv, 2022 (AI2). [Paper][Website]
    • LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
  • Language-based Video Editing:
    • M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
  • Video Summarization:
    • GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
    • QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
    • HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
    • ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
    • IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
  • Robotics:
    • CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
    • TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
  • Multi-modal Fusion:
    • MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
    • IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
    • TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
    • SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
    • ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
  • Human Interaction:
    • Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
  • Sign Language Translation:
    • LWTA: "Stochastic Transformer Networks with Linear Competing Units: Application to end-to-end SL Translation", ICCV, 2021 (Cyprus University of Technology). [Paper]
  • 3D:
    • 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
  • Speech Recognition:
    • AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
  • Emotion Recognition:
    • ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
    • MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
  • Voice Separation:
    • VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
  • Language-guided Video Segmentation:
    • Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", arXiv, 2022 (Zhejiang). [Paper][PyTorch]
  • Audio-Visual:
    • AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
  • Sentiment Analysis:
    • CubeMLP: "CubeMLP: A MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation", ACMMM, 2022 (Zhejiang University). [Paper]
    • MCMulT: "Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos", arXiv, 2022 (Tencent). [Paper]
  • Named Entity Recognition:
    • FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
  • Localization via Embodied Dialog:
    • LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]

[Back to Overview]

Other High-level Vision Tasks

Point Cloud / 3D

  • PCT: "PCT: Point Cloud Transformer", arXiv, 2020 (Tsinghua). [Paper][Jittor][PyTorch (uyzhang)]
  • Point-Transformer: "Point Transformer", arXiv, 2020 (Ulm University). [Paper]
  • NDT-Transformer: "NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation", ICRA, 2021 (University of Sheffield). [Paper][PyTorch]
  • P4Transformer: "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos", CVPR, 2021 (NUS). [Paper]
  • PTT: "PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds", IROS, 2021 (Northeastern University). [Paper][PyTorch (in construction)]
  • SnowflakeNet: "SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer", ICCV, 2021 (Tsinghua). [Paper][PyTorch]
  • PoinTr: "PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers", ICCV, 2021 (Tsinghua). [Paper][PyTorch]
  • Point-Transformer: "Point Transformer", ICCV, 2021 (Oxford + CUHK). [Paper][PyTorch (lucidrains)]
  • CT: "Cloud Transformers: A Universal Approach To Point Cloud Processing Tasks", ICCV, 2021 (Samsung). [Paper]
  • 3DVG-Transformer: "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds", ICCV, 2021 (Beihang University). [Paper]
  • PPT-Net: "Pyramid Point Cloud Transformer for Large-Scale Place Recognition", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • LTTR: "3D Object Tracking with Transformer", BMVC, 2021 (Northeastern University, China). [Paper][Code (in construction)]
  • ?: "Shape registration in the time of transformers", NeurIPS, 2021 (Sapienza University of Rome). [Paper]
  • YOGO: "You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module", arXiv, 2021 (Berkeley). [Paper][PyTorch]
  • DTNet: "Dual Transformer for Point Cloud Analysis", arXiv, 2021 (Southwest University). [Paper]
  • MLMSPT: "Point Cloud Learning with Transformer", arXiv, 2021 (Southwest University). [Paper]
  • PQ-Transformer: "PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds", arXiv, 2021 (Tsinghua). [Paper][PyTorch]
  • PST2: "Spatial-Temporal Transformer for 3D Point Cloud Sequences", WACV, 2022 (Sun Yat-sen University). [Paper]
  • SCTN: "SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation", AAAI, 2022 (KAUST). [Paper]
  • AWT-Net: "Adaptive Wavelet Transformer Network for 3D Shape Representation Learning", ICLR, 2022 (NYU). [Paper]
  • ?: "Deep Point Cloud Reconstruction", ICLR, 2022 (KAIST). [Paper]
  • HiTPR: "HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud", ICRA, 2022 (Nanjing University of Science and Technology). [Paper]
  • FastPointTransformer: "Fast Point Transformer", CVPR, 2022 (POSTECH). [Paper]
  • REGTR: "REGTR: End-to-end Point Cloud Correspondences with Transformers", CVPR, 2022 (NUS, Singapore). [Paper][PyTorch]
  • ShapeFormer: "ShapeFormer: Transformer-based Shape Completion via Sparse Representation", CVPR, 2022 (Shenzhen University). [Paper][Website]
  • PatchFormer: "PatchFormer: An Efficient Point Transformer with Patch Attention", CVPR, 2022 (Hangzhou Dianzi University). [Paper]
  • ?: "An MIL-Derived Transformer for Weakly Supervised Point Cloud Segmentation", CVPR, 2022 (NTU + NYCU). [Paper][Code (in construction)]
  • Point-BERT: "Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling", CVPR, 2022 (Tsinghua). [Paper][PyTorch][Website]
  • PTTR: "PTTR: Relational 3D Point Cloud Object Tracking with Transformer", CVPR, 2022 (Sensetime). [Paper][PyTorch]
  • GeoTransformer: "Geometric Transformer for Fast and Robust Point Cloud Registration", CVPR, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
  • ?: "3D Part Assembly Generation with Instance Encoded Transformer", IROS, 2022 (Tongji University). [Paper]
  • SeedFormer: "SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer", ECCV, 2022 (Tencent). [Paper][PyTorch]
  • MeshMAE: "MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis", ECCV, 2022 (JD). [Paper]
  • PPTr: "Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding", ECCV, 2022 (Tsinghua University). [Paper]
  • Geodesic-Former: "Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter", ECCV, 2022 (VinAI Research, Vietnam). [Paper]
  • PTT: "Real-time 3D Single Object Tracking with Transformer", TMM, 2022 (Northeastern University, China). [Paper][PyTorch]
  • Point-Transformer-V2: "Point Transformer V2: Grouped Vector Attention and Partition-based Pooling", NeurIPS, 2022 (HKU). [Paper][PyTorch (in construction)]
  • LighTN: "LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling", arXiv, 2022 (Beijing Jiaotong University). [Paper]
  • PMP-Net++: "PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths", arXiv, 2022 (Tsinghua). [Paper]
  • SnowflakeNet: "Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • 3DCTN: "3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification", arXiv, 2022 (University of Waterloo, Canada). [Paper]
  • VNT-Net: "VNT-Net: Rotational Invariant Vector Neuron Transformers", arXiv, 2022 (Ben-Gurion University of the Negev, Israel). [Paper]
  • CompleteDT: "CompleteDT: Point Cloud Completion with Dense Augment Inference Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • VN-Transformer: "VN-Transformer: Rotation-Equivariant Attention for Vector Neurons", arXiv, 2022 (Waymo). [Paper]
  • Voxel-MAE: "Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds", arXiv, 2022 (Chalmers University of Technology, Sweden). [Paper]
  • MAE3D: "Masked Autoencoders in 3D Point Cloud Representation Learning", arXiv, 2022 (Northwest A&F University, China). [Paper]
  • PointConvFormer: "PointConvFormer: Revenge of the Point-based Convolution", arXiv, 2022 (Apple). [Paper]
  • PTTR++: "Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
  • Pix4Point: "Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding", arXiv, 2022 (KAUST). [Paper][Code (in construction)]
  • MVP: "Multiple View Performers for Shape Completion", arXiv, 2022 (Columbia University). [Paper]
  • Simple3D-Former: "Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • 3DPCT: "3DPCT: 3D Point Cloud Transformer with Dual Self-attention", arXiv, 2022 (University of Waterloo, Canada). [Paper]
  • PS-Former: "Point Cloud Recognition with Position-to-Structure Attention Transformers", arXiv, 2022 (UCSD). [Paper]

[Back to Overview]

Pose Estimation

  • Human-related:
    • Hand-Transformer: "Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation", ECCV, 2020 (Kwai). [Paper]
    • HOT-Net: "HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation", ACMMM, 2020 (Kwai). [Paper]
    • TransPose: "TransPose: Towards Explainable Human Pose Estimation by Transformer", arXiv, 2020 (Southeast University). [Paper][PyTorch]
    • PTF: "Locally Aware Piecewise Transformation Fields for 3D Human Mesh Registration", CVPR, 2021 (ETHZ). [Paper][Code (in construction)][Website]
    • METRO: "End-to-End Human Pose and Mesh Reconstruction with Transformers", CVPR, 2021 (Microsoft). [Paper][PyTorch]
    • PRTR: "Pose Recognition with Cascade Transformers", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Mesh-Graphormer: "Mesh Graphormer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
    • THUNDR: "THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers", ICCV, 2021 (Google). [Paper]
    • PoseFormer: "3D Human Pose Estimation with Spatial and Temporal Transformers", ICCV, 2021 (UNC). [Paper][PyTorch]
    • TransPose: "TransPose: Keypoint Localization via Transformer", ICCV, 2021 (Southeast University, China). [Paper][PyTorch]
    • SCAT: "SCAT: Stride Consistency With Auto-Regressive Regressor and Transformer for Hand Pose Estimation", ICCVW, 2021 (Alibaba). [Paper]
    • POTR: "Pose Transformers (POTR): Human Motion Prediction With Non-Autoregressive Transformers", ICCVW, 2021 (Idiap). [Paper]
    • TransFusion: "TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation", BMVC, 2021 (UC Irvine). [Paper][PyTorch]
    • HRT: "HRFormer: High-Resolution Transformer for Dense Prediction", NeurIPS, 2021 (CAS). [Paper][PyTorch]
    • POET: "End-to-End Trainable Multi-Instance Pose Estimation with Transformers", arXiv, 2021 (EPFL). [Paper]
    • Lifting-Transformer: "Lifting Transformer for 3D Human Pose Estimation in Video", arXiv, 2021 (Peking). [Paper]
    • TFPose: "TFPose: Direct Human Pose Estimation with Transformers", arXiv, 2021 (The University of Adelaide). [Paper][PyTorch]
    • Skeletor: "Skeletor: Skeletal Transformers for Robust Body-Pose Estimation", arXiv, 2021 (University of Surrey). [Paper]
    • HandsFormer: "HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction", arXiv, 2021 (Graz University of Technology). [Paper]
    • TTP: "Test-Time Personalization with a Transformer for Human Pose Estimation", NeurIPS, 2021 (UCSD). [Paper][PyTorch][Website]
    • GraFormer: "GraFormer: Graph Convolution Transformer for 3D Pose Estimation", arXiv, 2021 (CAS). [Paper]
    • GCT: "Geometry-Contrastive Transformer for Generalized 3D Pose Transfer", AAAI, 2022 (University of Oulu). [Paper][PyTorch]
    • MHFormer: "MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation", CVPR, 2022 (Peking). [Paper][PyTorch]
    • PAHMT: "Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation", CVPR, 2022 (NetEase). [Paper]
    • TCFormer: "Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • PETR: "End-to-End Multi-Person Pose Estimation With Transformers", CVPR, 2022 (Hikvision). [Paper][PyTorch]
    • GraFormer: "GraFormer: Graph-Oriented Transformer for 3D Pose Estimation", CVPR, 2022 (CAS). [Paper]
    • Keypoint-Transformer: "Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation", CVPR, 2022 (Graz University of Technology, Austria). [Paper][PyTorch][Website]
    • MPS-Net: "Capturing Humans in Motion: Temporal-Attentive 3D Human Pose and Shape Estimation from Monocular Video", CVPR, 2022 (Academia Sinica). [Paper][Website]
    • Ego-STAN: "Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation", CVPRW, 2022 (University of Waterloo, Canada). [Paper]
    • AggPose: "AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation", IJCAI, 2022 (Shenzhen Baoan Women’s and Children’s Hospital). [Paper][Code (in construction)]
    • MotionMixer: "MotionMixer: MLP-based 3D Human Body Pose Forecasting", IJCAI, 2022 (Ulm University, Germany). [Paper][Code (in construction)]
    • Jointformer: "Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation", ICPR, 2022 (Trinity College Dublin, Ireland). [Paper]
    • IVT: "IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation", ACMMM, 2022 (Baidu). [Paper]
    • FastMETRO: "Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers", ECCV, 2022 (POSTECH). [Paper][PyTorch][Website]
    • PPT: "PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation", ECCV, 2022 (UC Irvine). [Paper][PyTorch]
    • Poseur: "Poseur: Direct Human Pose Regression with Transformers", ECCV, 2022 (The University of Adelaide, Australia). [Paper]
    • Swin-Pose: "Swin-Pose: Swin Transformer Based Human Pose Estimation", arXiv, 2022 (UMass Lowell). [Paper]
    • HeadPosr: "HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders", arXiv, 2022 (ETHZ). [Paper]
    • CrossFormer: "CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation", arXiv, 2022 (Canberra University, Australia). [Paper]
    • ViTPose: "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation", arXiv, 2022 (The University of Sydney). [Paper][PyTorch]
    • VTP: "VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • HeatER: "HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER", arXiv, 2022 (UCF). [Paper]
    • SeTHPose: "Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation", arXiv, 2022 (Queen's University, Canada). [Paper]
    • GraphMLP: "GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation", arXiv, 2022 (Peking University). [Paper]
    • siMLPe: "Back to MLP: A Simple Baseline for Human Motion Prediction", arXiv, 2022 (INRIA). [Paper][PyTorch]
    • Snipper: "Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet", arXiv, 2022 (University of Alberta, Canada). [Paper][PyTorch]
    • OTPose: "OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos", arXiv, 2022 (Korea University). [Paper]
    • PoseBERT: "PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling", arXiv, 2022 (NAVER). [Paper][PyTorch]
    • KOG-Transformer: "K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation", arXiv, 2022 (CAS). [Paper]
    • SoMoFormer: "SoMoFormer: Multi-Person Pose Forecasting with Transformers", arXiv, 2022 (Stanford). [Paper]
    • DPIT: "DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation", arXiv, 2022 (Shanghai University). [Paper]
    • HTT: "Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos", arXiv, 2022 (HKU). [Paper]
    • Uplift-Upsample: "Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers", WACV, 2023 (University of Augsburg, Germany). [Paper][TensorFlow]
  • Others:
    • TAPE: "Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry", arXiv, 2020 (Tianjin University). [Paper]
    • T6D-Direct: "T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression", GCPR, 2021 (University of Bonn). [Paper]
    • 6D-ViT: "6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning", arXiv, 2021 (University of Science and Technology of China). [Paper]
    • RayTran: "RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers", ECCV, 2022 (Google). [Paper]
    • DProST: "DProST: Dynamic Projective Spatial Transformer Network for 6D Pose Estimation", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • AFT-VO: "AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation", arXiv, 2022 (University of Surrey, UK). [Paper]
    • DPT-VO: "Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry", arXiv, 2022 (Aeronautics Institute of Technology, Brazil). [Paper]

[Back to Overview]

Tracking

  • TransTrack: "TransTrack: Multiple-Object Tracking with Transformer", arXiv, 2020 (HKU + ByteDance). [Paper][PyTorch]
  • TransformerTrack: "Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking", CVPR, 2021 (USTC). [Paper][PyTorch]
  • TransT: "Transformer Tracking", CVPR, 2021 (Dalian University of Technology). [Paper][PyTorch]
  • STARK: "Learning Spatio-Temporal Transformer for Visual Tracking", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • HiFT: "HiFT: Hierarchical Feature Transformer for Aerial Tracking", ICCV, 2021 (Tongji University). [Paper][PyTorch]
  • DTT: "High-Performance Discriminative Tracking With Transformers", ICCV, 2021 (CAS). [Paper]
  • DualTFR: "Learning Tracking Representations via Dual-Branch Fully Transformer Networks", ICCVW, 2021 (Microsoft). [Paper][PyTorch (in construction)]
  • TransCenter: "TransCenter: Transformers with Dense Queries for Multiple-Object Tracking", arXiv, 2021 (INRIA + MIT). [Paper]
  • TransMOT: "TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking", arXiv, 2021 (Microsoft). [Paper]
  • TREG: "Target Transformed Regression for Accurate Tracking", arXiv, 2021 (Nanjing University). [Paper][Code (in construction)]
  • TrTr: "TrTr: Visual Tracking with Transformer", arXiv, 2021 (University of Tokyo). [Paper][PyTorch]
  • RelationTrack: "RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation", arXiv, 2021 (Huazhong University of Science and Technology). [Paper]
  • SiamTPN: "Siamese Transformer Pyramid Networks for Real-Time UAV Tracking", WACV, 2022 (New York University). [Paper]
  • MixFormer: "MixFormer: End-to-End Tracking with Iterative Mixed Attention", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
  • ToMP: "Transforming Model Prediction for Tracking", CVPR, 2022 (ETHZ). [Paper][PyTorch]
  • GTR: "Global Tracking Transformers", CVPR, 2022 (UT Austin). [Paper][PyTorch]
  • UTT: "Unified Transformer Tracker for Object Tracking", CVPR, 2022 (Meta). [Paper][Code (in construction)]
  • MeMOT: "MeMOT: Multi-Object Tracking with Memory", CVPR, 2022 (Amazon). [Paper]
  • CSwinTT: "Transformer Tracking with Cyclic Shifting Window Attention", CVPR, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • STNet: "Spiking Transformers for Event-Based Single Object Tracking", CVPR, 2022 (Dalian University of Technology). [Paper]
  • TrackFormer: "TrackFormer: Multi-Object Tracking with Transformers", CVPR, 2022 (Facebook). [Paper][PyTorch]
  • SparseTT: "SparseTT: Visual Tracking with Sparse Transformers", IJCAI, 2022 (Beihang University). [Paper][Code (in construction)]
  • AiATrack: "AiATrack: Attention in Attention for Transformer Visual Tracking", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • STNet: "3D Siamese Transformer Network for Single Object Tracking on Point Clouds", ECCV, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • MOTR: "MOTR: End-to-End Multiple-Object Tracking with TRansformer", ECCV, 2022 (Megvii). [Paper][PyTorch]
  • SwinTrack: "SwinTrack: A Simple and Strong Baseline for Transformer Tracking", NeurIPS, 2022 (South China University of Technology). [Paper][PyTorch]
  • TransMOT: "Transformers for Multi-Object Tracking on Point Clouds", IV, 2022 (Bosch). [Paper]
  • TransT-M: "High-Performance Transformer Tracking", arXiv, 2022 (Dalian University of Technology). [Paper]
  • HCAT: "Efficient Visual Tracking via Hierarchical Cross-Attention Transformer", arXiv, 2022 (Dalian University of Technology). [Paper]
  • ?: "Keypoints Tracking via Transformer Networks", arXiv, 2022 (KAIST). [Paper][PyTorch]
  • TranSTAM: "Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • TransFiner: "TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking", arXiv, 2022 (China University of Geosciences). [Paper]
  • LPAT: "Local Perception-Aware Transformer for Aerial Tracking", arXiv, 2022 (Tongji University). [Paper][PyTorch]
  • TADN: "Transformer-based assignment decision network for multiple object tracking", arXiv, 2022 (National Technical University of Athens, Greece). [Paper][Code (in construction)]
  • InterTrack: "InterTrack: Interaction Transformer for 3D Multi-Object Tracking", arXiv, 2022 (University of Toronto). [Paper]

[Back to Overview]

Re-ID

  • PAT: "Diverse Part Discovery: Occluded Person Re-Identification With Part-Aware Transformer", CVPR, 2021 (University of Science and Technology of China). [Paper]
  • HAT: "HAT: Hierarchical Aggregation Transformers for Person Re-identification", ACMMM, 2021 (Dalian University of Technology). [Paper]
  • TransReID: "TransReID: Transformer-based Object Re-Identification", ICCV, 2021 (Alibaba). [Paper][PyTorch]
  • APD: "Transformer Meets Part Model: Adaptive Part Division for Person Re-Identification", ICCVW, 2021 (Meituan). [Paper]
  • Pirt: "Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification", ACMMM, 2021 (Beihang University). [Paper]
  • TransMatcher: "Transformer-Based Deep Image Matching for Generalizable Person Re-identification", NeurIPS, 2021 (IIAI). [Paper][PyTorch]
  • STT: "Spatiotemporal Transformer for Video-based Person Re-identification", arXiv, 2021 (Beihang University). [Paper]
  • AAformer: "AAformer: Auto-Aligned Transformer for Person Re-Identification", arXiv, 2021 (CAS). [Paper]
  • TMT: "A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification", arXiv, 2021 (Dalian University of Technology). [Paper]
  • LA-Transformer: "Person Re-Identification with a Locally Aware Transformer", arXiv, 2021 (University of Maryland Baltimore County). [Paper]
  • DRL-Net: "Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification", arXiv, 2021 (Peking University). [Paper]
  • GiT: "GiT: Graph Interactive Transformer for Vehicle Re-identification", arXiv, 2021 (Huaqiao University). [Paper]
  • OH-Former: "OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification", arXiv, 2021 (Shanghaitech University). [Paper]
  • CMTR: "CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification", arXiv, 2021 (Beijing Jiaotong University). [Paper]
  • PFD: "Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer", AAAI, 2022 (Peking). [Paper][PyTorch]
  • NFormer: "NFormer: Robust Person Re-identification with Neighbor Transformer", CVPR, 2022 (University of Amsterdam, Netherlands). [Paper][Code (in construction)]
  • DCAL: "Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification", CVPR, 2022 (Advanced Micro Devices, China). [Paper]
  • PiT: "Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval", IEEE Transactions on Industrial Informatics, 2022 (Peking). [Paper]
  • ?: "Motion-Aware Transformer For Occluded Person Re-identification", arXiv, 2022 (NetEase, China). [Paper]
  • PFT: "Short Range Correlation Transformer for Occluded Person Re-Identification", arXiv, 2022 (Nanjing University of Posts and Telecommunications). [Paper]

[Back to Overview]

Face

  • General:
    • FAU-Transformer: "Facial Action Unit Detection With Transformers", CVPR, 2021 (Rakuten Institute of Technology). [Paper]
    • TADeT: "Mitigating Bias in Visual Transformers via Targeted Alignment", BMVC, 2021 (Georgia Tech). [Paper]
    • ViT-Face: "Face Transformer for Recognition", arXiv, 2021 (Beijing University of Posts and Telecommunications). [Paper]
    • FaceT: "Learning to Cluster Faces via Transformer", arXiv, 2021 (Alibaba). [Paper]
    • VidFace: "VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots", arXiv, 2021 (Zhejiang University). [Paper]
    • FAA: "Shuffle Transformer with Feature Alignment for Video Face Parsing", arXiv, 2021 (Tencent). [Paper]
    • FaRL: "General Facial Representation Learning in a Visual-Linguistic Manner", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FaceFormer: "FaceFormer: Speech-Driven 3D Facial Animation with Transformers", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
    • PhysFormer: "PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer", CVPR, 2022 (University of Oulu, Finland). [Paper][PyTorch]
    • VTP: "Sub-word Level Lip Reading With Visual Attention", CVPR, 2022 (Oxford). [Paper]
    • EventFormer: "EventFormer: AU Event Transformer for Facial Action Unit Event Detection", arXiv, 2022 (Peking). [Paper]
    • MFT: "Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers", arXiv, 2022 (SUNY Binghamton). [Paper]
    • VC-TRSF: "Self-supervised Video-centralised Transformer for Video Face Clustering", arXiv, 2022 (ICL). [Paper]
  • Facial Landmark:
    • Clusformer: "Clusformer: A Transformer Based Clustering Approach to Unsupervised Large-Scale Face and Visual Landmark Recognition", CVPR, 2021 (VinAI Research, Vietnam). [Paper]
    • LOTR: "LOTR: Face Landmark Localization Using Localization Transformer", arXiv, 2021 (Sertis, Thailand). [Paper]
    • SLPT: "Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning", CVPR, 2022 (University of Technology Sydney). [Paper][PyTorch]
    • DTLD: "Towards Accurate Facial Landmark Detection via Cascaded Transformers", CVPR, 2022 (Samsung). [Paper]
  • Face Low-Level Vision:
    • Latent-Transformer: "A Latent Transformer for Disentangled Face Editing in Images and Videos", ICCV, 2021 (Institut Polytechnique de Paris). [Paper][PyTorch]
    • FAT: "Facial Attribute Transformers for Precise and Robust Makeup Transfer", WACV, 2022 (University of Rochester). [Paper]
    • SSAT: "SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal", AAAI, 2022 (Wuhan University). [Paper][PyTorch]
    • TransEditor: "TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing", CVPR, 2022 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • RestoreFormer: "RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs", CVPR, 2022 (HKU). [Paper]
    • HairCLIP: "HairCLIP: Design Your Hair by Text and Reference Image", CVPR, 2022 (USTC). [Paper][PyTorch]
    • Cycle-Text2Face: "Cycle Text2Face: Cycle Text-to-face GAN via Transformers", arXiv, 2022 (Shahed University, Iran). [Paper]
    • CodeFormer: "Towards Robust Blind Face Restoration with Codebook Lookup Transformer", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch (in construction)][Website]
    • FaceFormer: "FaceFormer: Scale-aware Blind Face Restoration with Transformers", arXiv, 2022 (Tencent). [Paper]
    • text2StyleGAN: "Text-Free Learning of a Natural Language Interface for Pretrained Face Generators", arXiv, 2022 (Toyota Technological Institute, Chicago). [Paper][PyTorch]
    • ManiCLIP: "ManiCLIP: Multi-Attribute Face Manipulation from Text", arXiv, 2022 (NTU, Singapore). [Paper]
    • FEAT: "FEAT: Face Editing with Attention", arXiv, 2022 (Shenzhen University). [Paper]
  • Facial Expression:
    • TransFER: "TransFER: Learning Relation-aware Facial Expression Representations with Transformers", ICCV, 2021 (CAS). [Paper]
    • CVT-Face: "Robust Facial Expression Recognition with Convolutional Visual Transformers", arXiv, 2021 (Hunan University). [Paper]
    • MViT: "MViT: Mask Vision Transformer for Facial Expression Recognition in the wild", arXiv, 2021 (University of Science and Technology of China). [Paper]
    • ViT-SE: "Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition", arXiv, 2021 (CentraleSupélec, France). [Paper]
    • EST: "Expression Snippet Transformer for Robust Video-based Facial Expression Recognition", arXiv, 2021 (China University of Geosciences). [Paper][PyTorch]
    • MFEViT: "MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition", arXiv, 2021 (University of Science and Technology of China). [Paper]
    • F-PDLS: "Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task", ICASSP, 2022 (KAIST). [Paper]
    • ?: "Transformer-based Multimodal Information Fusion for Facial Expression Analysis", arXiv, 2022 (Netease, China). [Paper]
    • ?: "Facial Expression Recognition with Swin Transformer", arXiv, 2022 (Dongguk University, Korea). [Paper]
    • POSTER: "POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition", arXiv, 2022 (UCF). [Paper]
    • STT: "Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild", arXiv, 2022 (Hunan University). [Paper]
    • FaceMAE: "FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • RePFormer: "RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection", arXiv, 2022 (CUHK). [Paper]
    • TransFA: "TransFA: Transformer-based Representation for Face Attribute Evaluation", arXiv, 2022 (Xidian University). [Paper]
    • AU-CVT: "AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition", arXiv, 2022 (Shenzhen Technology University). [Paper][PyTorch]
    • ?: "Multi-Task Transformer with uncertainty modelling for Face Based Affective Computing", arXiv, 2022 (Datakalab, France). [Paper]
  • Attack-related:
    • ?: "Video Transformer for Deepfake Detection with Incremental Learning", ACMMM, 2021 (MBZUAI). [Paper]
    • ViTranZFAS: "On the Effectiveness of Vision Transformers for Zero-shot Face Anti-Spoofing", International Joint Conference on Biometrics (IJCB), 2021 (Idiap). [Paper]
    • MTSS: "Multi-Teacher Single-Student Visual Transformer with Multi-Level Attention for Face Spoofing Detection", BMVC, 2021 (National Taiwan Ocean University). [Paper]
    • TransRPPG: "TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection", arXiv, 2021 (University of Oulu). [Paper]
    • CViT: "Deepfake Video Detection Using Convolutional Vision Transformer", arXiv, 2021 (Jimma University). [Paper]
    • ViT-Distill: "Deepfake Detection Scheme Based on Vision Transformer and Distillation", arXiv, 2021 (Sookmyung Women’s University). [Paper]
    • M2TR: "M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection", arXiv, 2021 (Fudan University). [Paper]
    • Cross-ViT: "Combining EfficientNet and Vision Transformers for Video Deepfake Detection", arXiv, 2021 (University of Pisa). [Paper][PyTorch]
    • ICT: "Protecting Celebrities from DeepFake with Identity Consistency Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • GGViT: "GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection", ICPR, 2022 (CAS). [Paper]
    • ?: "Hybrid Transformer Network for Deepfake Detection", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (MediaFutures, Norway). [Paper]
    • ViTAF: "Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing", ECCV, 2022 (Google). [Paper]
    • ?: "Multi-Scale Wavelet Transformer for Face Forgery Detection", ACCV, 2022 (Hikvision). [Paper]
    • ?: "Self-supervised Transformer for Deepfake Detection", arXiv, 2022 (USTC, China). [Paper]
    • ViTransPAD: "ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection", arXiv, 2022 (University of La Rochelle, France). [Paper]
    • ?: "Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection", arXiv, 2022 (National Research Council, Italy). [Paper]
    • STDT: "Deepfake Video Detection with Spatiotemporal Dropout Transformer", arXiv, 2022 (CAS). [Paper]
    • ?: "Deep Convolutional Pooling Transformer for Deepfake Detection", arXiv, 2022 (HKU). [Paper]

[Back to Overview]

Neural Architecture Search

  • HR-NAS: "HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers", CVPR, 2021 (HKU). [Paper][PyTorch]
  • CATE: "CATE: Computation-aware Neural Architecture Encoding with Transformers", ICML, 2021 (Michigan State University). [Paper]
  • AutoFormer: "AutoFormer: Searching Transformers for Visual Recognition", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • GLiT: "GLiT: Neural Architecture Search for Global and Local Image Transformer", ICCV, 2021 (The University of Sydney + SenseTime). [Paper]
  • BossNAS: "BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • ViT-ResNAS: "Searching for Efficient Multi-Stage Vision Transformers", ICCVW, 2021 (MIT). [Paper][PyTorch]
  • AutoformerV2: "Searching the Search Space of Vision Transformer", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • TNASP: "TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework", NeurIPS, 2021 (CAS + Kuaishou). [Paper]
  • PSViT: "PSViT: Better Vision Transformer via Token Pooling and Attention Sharing", arXiv, 2021 (The University of Sydney + SenseTime). [Paper]
  • As-ViT: "Auto-scaling Vision Transformers without Training", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • NASViT: "NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training", ICLR, 2022 (Facebook). [Paper]
  • TF-TAS: "Training-free Transformer Architecture Search", CVPR, 2022 (Tencent). [Paper]
  • ViT-Slim: "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space", CVPR, 2022 (MBZUAI). [Paper][PyTorch]
  • BurgerFormer: "Searching for BurgerFormer with Micro-Meso-Macro Space Design", ICML, 2022 (CAS). [Paper][Code (in construction)]
  • UniNet: "UniNet: Unified Architecture Search with Convolution, Transformer, and MLP", ECCV, 2022 (CUHK + SenseTime). [Paper]
  • ViTAS: "Vision Transformer Architecture Search", ECCV, 2022 (The University of Sydney + SenseTime). [Paper]
  • VTCAS: "Vision Transformer with Convolutions Architecture Search", arXiv, 2022 (Donghua University). [Paper]
  • NOAH: "Neural Prompt Search", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
  • FocusFormer: "FocusFormer: Focusing on What We Need via Architecture Sampler", arXiv, 2022 (Monash University, Australia). [Paper]

[Back to Overview]

Transfer / X-Supervised / X-Shot / Continual Learning

  • Transfer Learning:
    • AdaptFormer: "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition", arXiv, 2022 (HKU). [Paper][Website]
    • Convpass: "Convolutional Bypasses Are Better Vision Transformer Adapters", arXiv, 2022 (Peking University). [Paper]
  • Domain Adaptation/Generalization:
    • TransDA: "Transformer-Based Source-Free Domain Adaptation", arXiv, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
    • TVT: "TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation", arXiv, 2021 (UT Arlington + Kuaishou). [Paper]
    • ResTran: "Discovering Spatial Relationships by Transformers for Domain Generalization", arXiv, 2021 (MBZUAI). [Paper]
    • WinTR: "Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation", arXiv, 2021 (Beijing Institute of Technology). [Paper]
    • CDTrans: "CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation", ICLR, 2022 (Alibaba). [Paper][PyTorch]
    • SSRT: "Safe Self-Refinement for Transformer-based Domain Adaptation", CVPR, 2022 (Stony Brook). [Paper]
    • DOT: "Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation", ACMMM, 2022 (Beijing Institute of Technology). [Paper]
    • BCAT: "Domain Adaptation via Bidirectional Cross-Attention Transformer", arXiv, 2022 (Southern University of Science and Technology). [Paper]
    • DoTNet: "Towards Unsupervised Domain Adaptation via Domain-Transformer", arXiv, 2022 (Sun Yat-Sen University). [Paper]
    • TransDA: "Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation", arXiv, 2022 (Tsinghua). [Paper][Code (in construction)]
    • FAMLP: "FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization", arXiv, 2022 (University of Science and Technology of China). [Paper]
    • PACMAC: "Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency", arXiv, 2022 (Georgia Tech). [Paper][PyTorch]
    • ERM-ViT: "Self-Distilled Vision Transformer for Domain Generalization", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
    • MPA: "Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation", arXiv, 2022 (Fudan University). [Paper]
    • DePT: "Visual Prompt Tuning for Test-time Domain Adaptation", arXiv, 2022 (Amazon). [Paper]
  • X-Supervised:
    • Semiformer: "Semi-Supervised Vision Transformers", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • SVL-Adapter: "SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models", BMVC, 2022 (UCL). [Paper][Code (in construction)]
    • Semi-ViT: "Semi-supervised Vision Transformers at Scale", arXiv, 2022 (Amazon). [Paper]
  • Zero-Shot:
    • ViT-ZSL: "Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning", IMVIP, 2021 (University of Exeter, UK). [Paper]
    • TransZero: "TransZero: Attribute-guided Transformer for Zero-Shot Learning", AAAI, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • TPT: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models", NeurIPS, 2022 (NVIDIA). [Paper][Code (in construction)][Website]
    • HRT: "Hybrid Routing Transformer for Zero-Shot Learning", arXiv, 2022 (Xidian University). [Paper]
    • MUST: "Masked Unsupervised Self-training for Zero-shot Image Classification", arXiv, 2022 (Salesforce). [Paper]
    • CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", arXiv, 2022 (University of Washington). [Paper][PyTorch]
    • VL-Taboo: "VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models", arXiv, 2022 (Goethe University Frankfurt, Germany). [Paper][Code (in construction)]
    • CALIP: "CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention", arXiv, 2022 (Peking University). [Paper]
  • X-Shot:
    • CrossTransformer: "CrossTransformers: spatially-aware few-shot transfer", NeurIPS, 2020 (DeepMind). [Paper][Tensorflow]
    • URT: "A Universal Representation Transformer Layer for Few-Shot Image Classification", ICLR, 2021 (Mila). [Paper][PyTorch]
    • TRX: "Temporal-Relational CrossTransformers for Few-Shot Action Recognition", CVPR, 2021 (University of Bristol). [Paper][PyTorch]
    • Few-shot-Transformer: "Few-Shot Transformation of Common Actions into Time and Space", arXiv, 2021 (University of Amsterdam). [Paper]
    • HCTransformers: "Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning", CVPR, 2022 (Fudan University). [Paper][PyTorch]
    • HyperTransformer: "HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning", CVPR, 2022 (Google). [Paper][PyTorch][Website]
    • STRM: "Spatio-temporal Relation Modeling for Few-shot Action Recognition", CVPR, 2022 (MBZUAI). [Paper][PyTorch][Website]
    • HyperTransformer: "HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning", ICML, 2022 (Google). [Paper]
    • CPM: "Compound Prototype Matching for Few-shot Action Recognition", ECCV, 2022 (The University of Tokyo). [Paper]
    • SUN: "Self-Promoted Supervision for Few-Shot Transformer", ECCV, 2022 (Harbin Institute of Technology + NUS). [Paper][PyTorch]
    • Tip-Adapter: "Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • BaseTransformers: "BaseTransformers: Attention over base data-points for One Shot Learning", BMVC, 2022 (Dublin City University, Ireland). [Paper][PyTorch]
    • FPTrans: "Feature-Proxy Transformer for Few-Shot Segmentation", NeurIPS, 2022 (Baidu). [Paper][Code (in construction)]
    • MG-ViT: "Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
    • QSFormer: "Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification", arXiv, 2022 (Anhui University). [Paper]
  • Continual Learning:
    • MEAT: "Meta-attention for ViT-backed Continual Learning", CVPR, 2022 (Zhejiang University). [Paper][Code (in construction)]
    • DyTox: "DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion", CVPR, 2022 (Sorbonne Universite, France). [Paper][PyTorch]
    • LVT: "Continual Learning With Lifelong Vision Transformer", CVPR, 2022 (The University of Sydney). [Paper]
    • L2P: "Learning to Prompt for Continual Learning", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ?: "Simpler is Better: off-the-shelf Continual Learning Through Pretrained Backbones", CVPRW, 2022 (Ca' Foscari University, Italy). [Paper][PyTorch]
    • ADA: "Continual Learning with Transformers for Image Classification", CVPRW, 2022 (Amazon). [Paper]
    • ?: "Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization", CVPRW, 2022 (Ca' Foscari University, Italy). [Paper]
    • DualPrompt: "DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning", ECCV, 2022 (Google). [Paper][Tensorflow]
    • CVT: "Online Continual Learning with Contrastive Vision Transformer", ECCV, 2022 (The University of Sydney). [Paper]
    • COLT: "Transformers Are Better Continual Learners", arXiv, 2022 (Hikvision). [Paper]
    • S-Prompts: "S-Prompts Learning with Pre-trained Transformers: An Occam's Razor for Domain Incremental Learning", arXiv, 2022 (Singapore Management University). [Paper]
    • D3Former: "D3Former: Debiased Dual Distilled Transformer for Incremental Learning", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
    • Continual-CLIP: "CLIP model is an Efficient Continual Learner", arXiv, 2022 (MBZUAI). [Paper][Code (in construction)]
  • Knowledge Distillation:
    • ?: "Knowledge Distillation via the Target-aware Transformer", CVPR, 2022 (Alibaba). [Paper]
    • DearKD: "DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers", CVPR, 2022 (JD). [Paper]
    • AttnDistill: "Attention Distillation: self-supervised vision transformer students need more guidance", BMVC, 2022 (UAB, Spain). [Paper][PyTorch]
    • ViTKD: "ViTKD: Practical Guidelines for ViT feature knowledge distillation", arXiv, 2022 (IDEA). [Paper][PyTorch (in construction)]
  • Clustering:
    • VTCC: "Vision Transformer for Contrastive Clustering", arXiv, 2022 (Sun Yat-sen University, China). [Paper]

[Back to Overview]

Low-level Vision Tasks

Image Restoration

(e.g., super-resolution, image denoising, demosaicing, compression artifact reduction)

  • NLRN: "Non-Local Recurrent Network for Image Restoration", NeurIPS, 2018 (UIUC). [Paper][Tensorflow]
  • RNAN: "Residual Non-local Attention Networks for Image Restoration", ICLR, 2019 (Northeastern University). [Paper][PyTorch]
  • SAN: "Second-Order Attention Network for Single Image Super-Resolution", CVPR, 2019 (Tsinghua). [Paper][PyTorch]
  • CS-NL: "Image Super-Resolution with Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining", CVPR, 2020 (UIUC). [Paper][PyTorch]
  • TTSR: "Learning Texture Transformer Network for Image Super-Resolution", CVPR, 2020 (Microsoft). [Paper][PyTorch]
  • HAN: "Single Image Super-Resolution via a Holistic Attention Network", ECCV, 2020 (Northeastern University). [Paper][PyTorch]
  • PANet: "Pyramid Attention Networks for Image Restoration", arXiv, 2020 (UIUC). [Paper][PyTorch]
  • IPT: "Pre-Trained Image Processing Transformer", CVPR, 2021 (Huawei). [Paper][PyTorch (in construction)]
  • NLSN: "Image Super-Resolution With Non-Local Sparse Attention", CVPR, 2021 (UIUC). [Paper]
  • SwinIR: "SwinIR: Image Restoration Using Swin Transformer", ICCVW, 2021 (ETHZ). [Paper][PyTorch]
  • ITSRN: "Implicit Transformer Network for Screen Content Image Continuous Super-Resolution", NeurIPS, 2021 (Tianjin University). [Paper][PyTorch]
  • SDNet: "SDNet: multi-branch for single image deraining using swin", arXiv, 2021 (Xinjiang University). [Paper][Code (in construction)]
  • FPAN: "Feedback Pyramid Attention Networks for Single Image Super-Resolution", arXiv, 2021 (Nanjing University of Science and Technology). [Paper]
  • ATTSF: "Attention! Stay Focus!", arXiv, 2021 (BridgeAI, Seoul). [Paper][Tensorflow]
  • ESRT: "Efficient Transformer for Single Image Super-Resolution", arXiv, 2021 (Peking University). [Paper]
  • Fusformer: "Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
  • HyLoG-ViT: "Hybrid Local-Global Transformer for Image Dehazing", arXiv, 2021 (Beihang University). [Paper]
  • TANet: "TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network", arXiv, 2021 (Wuhan Institute of Technology). [Paper]
  • DPT: "Detail-Preserving Transformer for Light Field Image Super-Resolution", AAAI, 2022 (Beijing Institute of Technology). [Paper][PyTorch]
  • SiamTrans: "SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers", AAAI, 2022 (Huawei). [Paper]
  • Uformer: "Uformer: A General U-Shaped Transformer for Image Restoration", CVPR, 2022 (University of Science and Technology of China). [Paper][PyTorch]
  • MAXIM: "MAXIM: Multi-Axis MLP for Image Processing", CVPR, 2022 (Google). [Paper][Tensorflow]
  • HyperTransformer: "HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening", CVPR, 2022 (JHU). [Paper][PyTorch]
  • DeHamer: "Image Dehazing Transformer With Transmission-Aware 3D Position Embedding", CVPR, 2022 (Nankai University). [Paper][Website]
  • Restormer: "Restormer: Efficient Transformer for High-Resolution Image Restoration", CVPR, 2022 (IIAI, UAE). [Paper][PyTorch]
  • TransWeather: "TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions", CVPR, 2022 (JHU). [Paper][PyTorch][Website]
  • BSRT: "BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment", CVPRW, 2022 (Megvii). [Paper][PyTorch]
  • TATT: "A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • KiT: "KNN Local Attention for Image Restoration", CVPR, 2022 (Yonsei University). [Paper]
  • LBNet: "Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer", IJCAI, 2022 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch (in construction)]
  • PTNet: "Learning Parallax Transformer Network for Stereo Image JPEG Artifacts Removal", ACMMM, 2022 (Fudan University). [Paper]
  • CharFormer: "CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising", ACMMM, 2022 (Jilin University). [Paper][PyTorch (in construction)]
  • ELMformer: "ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer", ACMMM, 2022 (Horizon Robotics). [Paper][Code (in construction)]
  • DATSR: "Reference-based Image Super-Resolution with Deformable Attention Transformer", ECCV, 2022 (ETHZ). [Paper][Code (in construction)]
  • TurbNet: "Single Frame Atmospheric Turbulence Mitigation: A Benchmark Study and A New Physics-Inspired Transformer Model", ECCV, 2022 (Purdue + UT Austin). [Paper][PyTorch]
  • Stripformer: "Stripformer: Strip Transformer for Fast Image Deblurring", ECCV, 2022 (NTHU). [Paper]
  • ELAN: "Efficient Long-Range Attention Network for Image Super-resolution", ECCV, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • Swin2SR: "Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration", ECCVW, 2022 (University of Wurzburg, Germany). [Paper]
  • LFT: "Light Field Image Super-Resolution with Transformers", IEEE Signal Processing Letters, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
  • EDT: "On Efficient Transformer-Based Image Pre-training for Low-Level Vision", arXiv, 2022 (CUHK). [Paper][PyTorch]
  • ACT: "Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution", arXiv, 2022 (LG). [Paper]
  • ?: "Transform your Smartphone into a DSLR Camera: Learning the ISP in the Wild", arXiv, 2022 (ETHZ). [Paper]
  • HIPA: "HIPA: Hierarchical Patch Transformer for Single Image Super Resolution", arXiv, 2022 (CUHK). [Paper]
  • DehazeFormer: "Vision Transformers for Single Image Dehazing", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • RSTCANet: "Residual Swin Transformer Channel Attention Network for Image Demosaicing", arXiv, 2022 (Tampere University, Finland). [Paper]
  • CTCNet: "CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution", arXiv, 2022 (Nanjing University of Posts and Telecommunications). [Paper]
  • DRT: "DRT: A Lightweight Single Image Deraining Recursive Transformer", arXiv, 2022 (ANU, Australia). [Paper][PyTorch (in construction)]
  • HAT: "Activating More Pixels in Image Super-Resolution Transformer", arXiv, 2022 (University of Macau). [Paper][Code (in construction)]
  • DenSformer: "Dense residual Transformer for image denoising", arXiv, 2022 (University of Science and Technology Beijing). [Paper]
  • ShuffleMixer: "ShuffleMixer: An Efficient ConvNet for Image Super-Resolution", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • Cubic-Mixer: "UHD Image Deblurring via Multi-scale Cubic-Mixer", arXiv, 2022 (Nanjing University of Science and Technology). [Paper]
  • PoCoformer: "Polarized Color Image Denoising using Pocoformer", arXiv, 2022 (The University of Tokyo). [Paper]
  • MSP-Former: "MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing", arXiv, 2022 (Jimei University). [Paper]
  • TMT: "Imaging through the Atmosphere using Turbulence Mitigation Transformer", arXiv, 2022 (Purdue). [Paper][Code (in construction)][Website]
  • ELF: "Magic ELF: Image Deraining Meets Association Learning and Transformer", arXiv, 2022 (Wuhan University). [Paper][PyTorch (in construction)]
  • DnSwin: "DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer", arXiv, 2022 (Guangdong University of Technology). [Paper]
  • HST: "HST: Hierarchical Swin Transformer for Compressed Image Super-resolution", ECCVW, 2022 (USTC). [Paper]
  • SnowFormer: "SnowFormer: Scale-aware Transformer via Context Interaction for Single Image Desnowing", arXiv, 2022 (Jimei University, China). [Paper]
  • SwinFIR: "SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution", arXiv, 2022 (Samsung). [Paper]
  • LRT: "LRT: An Efficient Low-Light Restoration Transformer for Dark Light Field Images", arXiv, 2022 (HKU). [Paper]
  • DMTNet: "DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer", arXiv, 2022 (Samsung). [Paper]
  • ART: "Accurate Image Restoration with Attention Retractable Transformer", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
  • LMQFormer: "LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for Lightweight Snow Removal", arXiv, 2022 (Fuzhou University). [Paper]
  • ITSRN++: "ITSRN++: Stronger and Better Implicit Transformer Network for Continuous Screen Content Image Super-Resolution", arXiv, 2022 (Tianjin University). [Paper]

[Back to Overview]

Video Restoration

  • VSR-Transformer: "Video Super-Resolution Transformer", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • MANA: "Memory-Augmented Non-Local Attention for Video Super-Resolution", CVPR, 2022 (JD). [Paper]
  • ?: "Bringing Old Films Back to Life", CVPR, 2022 (Microsoft). [Paper][Code (in construction)]
  • TTVSR: "Learning Trajectory-Aware Transformer for Video Super-Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Trans-SVSR: "A New Dataset and Transformer for Stereoscopic Video Super-Resolution", CVPR, 2022 (Bahcesehir University, Turkey). [Paper][PyTorch]
  • STDAN: "STDAN: Deformable Attention Network for Space-Time Video Super-Resolution", CVPRW, 2022 (Tsinghua). [Paper]
  • VRT: "VRT: A Video Restoration Transformer", arXiv, 2022 (ETHZ). [Paper][PyTorch]
  • FGST: "Flow-Guided Sparse Transformer for Video Deblurring", ICML, 2022 (Tsinghua). [Paper][Code (in construction)]
  • RSTT: "RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • FTVSR: "Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • EFNet: "Event-Based Fusion for Motion Deblurring with Cross-modal Attention", ECCV, 2022 (ETHZ). [Paper]
  • VDTR: "VDTR: Video Deblurring with Transformer", arXiv, 2022 (Tsinghua). [Paper][Code (in construction)]
  • DSCT: "Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • RVRT: "Recurrent Video Restoration Transformer with Guided Deformable Attention", arXiv, 2022 (ETHZ). [Paper][Code (in construction)]
  • Group-ShiftNet: "No Attention is Needed: Grouped Spatial-temporal Shift for Simple and Efficient Video Restorers", arXiv, 2022 (CUHK). [Paper][Code (in construction)][Website]
  • ?: "Rethinking Alignment in Video Super-Resolution Transformers", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)]

[Back to Overview]

Inpainting / Completion / Outpainting

  • Contextual-Attention: "Generative Image Inpainting with Contextual Attention", CVPR, 2018 (UIUC). [Paper][Tensorflow]
  • PEN-Net: "Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting", CVPR, 2019 (Microsoft). [Paper][PyTorch]
  • Copy-Paste: "Copy-and-Paste Networks for Deep Video Inpainting", ICCV, 2019 (Yonsei University). [Paper][PyTorch]
  • Onion-Peel: "Onion-Peel Networks for Deep Video Completion", ICCV, 2019 (Yonsei University). [Paper][PyTorch]
  • STTN: "Learning Joint Spatial-Temporal Transformations for Video Inpainting", ECCV, 2020 (Microsoft). [Paper][PyTorch]
  • FuseFormer: "FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting", ICCV, 2021 (CUHK + SenseTime). [Paper][PyTorch]
  • ICT: "High-Fidelity Pluralistic Image Completion with Transformers", ICCV, 2021 (CUHK). [Paper][PyTorch][Website]
  • DSTT: "Decoupled Spatial-Temporal Transformer for Video Inpainting", arXiv, 2021 (CUHK + SenseTime). [Paper][Code (in construction)]
  • TFill: "TFill: Image Completion via a Transformer-Based Architecture", arXiv, 2021 (NTU, Singapore). [Paper][Code (in construction)]
  • BAT-Fill: "Diverse Image Inpainting with Bidirectional and Autoregressive Transformers", arXiv, 2021 (NTU, Singapore). [Paper]
  • ?: "Image-Adaptive Hint Generation via Vision Transformer for Outpainting", WACV, 2022 (Sogang University, Korea). [Paper]
  • ZITS: "Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding", CVPR, 2022 (Fudan). [Paper][PyTorch][Website]
  • MAT: "MAT: Mask-Aware Transformer for Large Hole Image Inpainting", CVPR, 2022 (CUHK). [Paper][PyTorch]
  • PUT: "Reduce Information Loss in Transformers for Pluralistic Image Inpainting", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • DLFormer: "DLFormer: Discrete Latent Transformer for Video Inpainting", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
  • QueryOTR: "Outpainting by Queries", ECCV, 2022 (University of Liverpool, UK). [Paper][PyTorch (in construction)]
  • FGT: "Flow-Guided Transformer for Video Inpainting", ECCV, 2022 (USTC). [Paper][PyTorch]
  • MAE-FAR: "Learning Prior Feature and Attention Enhanced Image Inpainting", ECCV, 2022 (Fudan University). [Paper][PyTorch (in construction)][Website]
  • U-Transformer: "Generalised Image Outpainting with U-Transformer", arXiv, 2022 (Xi'an Jiaotong-Liverpool University). [Paper]
  • SpA-Former: "SpA-Former: Transformer image shadow detection and removal via spatial attention", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
  • CRFormer: "CRFormer: A Cross-Region Transformer for Shadow Removal", arXiv, 2022 (Beijing Jiaotong University). [Paper]
  • ?: "Visual Prompting via Image Inpainting", arXiv, 2022 (Berkeley). [Paper][Website]
  • DeViT: "DeViT: Deformed Vision Transformers in Video Inpainting", arXiv, 2022 (Kuaishou). [Paper]
  • ZITS++: "ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors", arXiv, 2022 (Fudan). [Paper]

[Back to Overview]

Image Generation

  • IT: "Image Transformer", ICML, 2018 (Google). [Paper][Tensorflow]
  • PixelSNAIL: "PixelSNAIL: An Improved Autoregressive Generative Model", ICML, 2018 (Berkeley). [Paper][Tensorflow]
  • BigGAN: "Large Scale GAN Training for High Fidelity Natural Image Synthesis", ICLR, 2019 (DeepMind). [Paper][PyTorch]
  • SAGAN: "Self-Attention Generative Adversarial Networks", ICML, 2019 (Google). [Paper][Tensorflow]
  • VQGAN: "Taming Transformers for High-Resolution Image Synthesis", CVPR, 2021 (Heidelberg University). [Paper][PyTorch][Website]
  • ?: "High-Resolution Complex Scene Synthesis with Transformers", CVPRW, 2021 (Heidelberg University). [Paper]
  • GANsformer: "Generative Adversarial Transformers", ICML, 2021 (Stanford + Facebook). [Paper][Tensorflow]
  • PixelTransformer: "PixelTransformer: Sample Conditioned Signal Generation", ICML, 2021 (Facebook). [Paper][Website]
  • HWT: "Handwriting Transformers", ICCV, 2021 (MBZUAI). [Paper][Code (in construction)]
  • Paint-Transformer: "Paint Transformer: Feed Forward Neural Painting with Stroke Prediction", ICCV, 2021 (Baidu). [Paper][Paddle][PyTorch]
  • Geometry-Free: "Geometry-Free View Synthesis: Transformers and no 3D Priors", ICCV, 2021 (Heidelberg University). [Paper][PyTorch]
  • VTGAN: "VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers", ICCVW, 2021 (University of Nevada, Reno). [Paper]
  • ATISS: "ATISS: Autoregressive Transformers for Indoor Scene Synthesis", NeurIPS, 2021 (NVIDIA). [Paper][Website]
  • GANsformer2: "Compositional Transformers for Scene Generation", NeurIPS, 2021 (Stanford + Facebook). [Paper][Tensorflow]
  • TransGAN: "TransGAN: Two Transformers Can Make One Strong GAN", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • HiT: "Improved Transformer for High-Resolution GANs", NeurIPS, 2021 (Google). [Paper][Tensorflow]
  • iLAT: "The Image Local Autoregressive Transformer", NeurIPS, 2021 (Fudan). [Paper]
  • TokenGAN: "Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers", NeurIPS, 2021 (Microsoft). [Paper]
  • SceneFormer: "SceneFormer: Indoor Scene Generation with Transformers", arXiv, 2021 (TUM). [Paper]
  • SNGAN: "Combining Transformer Generators with Convolutional Discriminators", arXiv, 2021 (Fraunhofer ITWM). [Paper]
  • Invertible-Attention: "Invertible Attention", arXiv, 2021 (ANU). [Paper]
  • GPA: "Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation", arXiv, 2021 (Zalando Research, Germany). [Paper][PyTorch (in construction)]
  • ViTGAN: "ViTGAN: Training GANs with Vision Transformers", ICLR, 2022 (Google). [Paper][PyTorch][PyTorch (wilile26811249)]
  • ViT-VQGAN: "Vector-quantized Image Modeling with Improved VQGAN", ICLR, 2022 (Google). [Paper]
  • Style-Transformer: "Style Transformer for Image Inversion and Editing", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
  • StyleSwin: "StyleSwin: Transformer-based GAN for High-resolution Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Styleformer: "Styleformer: Transformer based Generative Adversarial Networks with Style Vector", CVPR, 2022 (Seoul National University). [Paper][PyTorch]
  • ?: "User-Controllable Latent Transformer for StyleGAN Image Layout Editing", Pacific Graphics, 2022 (University of Tsukuba). [Paper][Website]
  • DynaST: "DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation", ECCV, 2022 (NUS). [Paper][PyTorch]
  • DoodleFormer: "DoodleFormer: Creative Sketch Drawing with Transformers", ECCV, 2022 (MBZUAI). [Paper][PyTorch][Website]
  • U-Attention: "Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis", arXiv, 2022 (Adobe). [Paper]
  • MaskGIT: "MaskGIT: Masked Generative Image Transformer", CVPR, 2022 (Google). [Paper][PyTorch (dome272)]
  • AttnFlow: "Generative Flows with Invertible Attentions", CVPR, 2022 (ETHZ). [Paper]
  • NÜWA: "NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion", ECCV, 2022 (Microsoft). [Paper][GitHub]
  • Trans-INR: "Transformers as Meta-Learners for Implicit Neural Representations", ECCV, 2022 (UCSD). [Paper][PyTorch][Website]
  • ViewFormer: "ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers", ECCV, 2022 (Czech Technical University in Prague). [Paper][Tensorflow]
  • Unleashing-Transformer: "Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes", ECCV, 2022 (Durham University, UK). [Paper][PyTorch]
  • CASD: "Cross Attention Based Style Distribution for Controllable Person Image Synthesis", ECCV, 2022 (East China Normal University). [Paper]
  • VQGAN-CLIP: "VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance", ECCV, 2022 (EleutherAI). [Paper][PyTorch]
  • PromptGen: "Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models", NeurIPS, 2022 (CMU). [Paper][PyTorch]
  • ViT-Patch: "A Robust Framework of Chromosome Straightening with ViT-Patch GAN", arXiv, 2022 (Xi'an Jiaotong-Liverpool University). [Paper]
  • TransNeRF: "Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer", arXiv, 2022 (UBC). [Paper]
  • ?: "Transforming Image Generation from Scene Graphs", arXiv, 2022 (University of Catania, Italy). [Paper]
  • VisionNeRF: "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image", arXiv, 2022 (Google). [Paper][Website]
  • NUWA-Infinity: "NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis", arXiv, 2022 (Microsoft). [Paper][GitHub][Website]
  • Diffusion-ViT: "Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model", arXiv, 2022 (Etsy, NY). [Paper]
  • Token-Critic: "Improved Masked Image Generation with Token-Critic", arXiv, 2022 (Google). [Paper]
  • ?: "Visual Prompt Tuning for Generative Transfer Learning", arXiv, 2022 (Google). [Paper]
  • Frido: "Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • ?: "Style-Guided Inference of Transformer for High-resolution Image Synthesis", WACV, 2023 (NCSOFT, Korea). [Paper]

[Back to Overview]

Video Generation

  • Subscale: "Scaling Autoregressive Video Models", ICLR, 2020 (Google). [Paper][Website]
  • ConvTransformer: "ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis", arXiv, 2020 (Southeast University). [Paper]
  • OCVT: "Generative Video Transformer: Can Objects be the Words?", ICML, 2021 (Rutgers University). [Paper]
  • AIST++: "Learn to Dance with AIST++: Music Conditioned 3D Dance Generation", arXiv, 2021 (Google). [Paper][Code][Website]
  • VideoGPT: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv, 2021 (Berkeley). [Paper][PyTorch][Website]
  • DanceFormer: "DanceFormer: Music Conditioned 3D Dance Generation with Parametric Motion Transformer", AAAI, 2022 (Huiye Technology, China). [Paper]
  • VFIformer: "Video Frame Interpolation with Transformer", CVPR, 2022 (CUHK). [Paper][PyTorch]
  • VFIT: "Video Frame Interpolation Transformer", CVPR, 2022 (McMaster University, Canada). [Paper][PyTorch]
  • MoTrans: "Motion Transformer for Unsupervised Image Animation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
  • Transframer: "Transframer: Arbitrary Frame Prediction with Generative Models", arXiv, 2022 (DeepMind). [Paper]
  • TATS: "Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer", ECCV, 2022 (Maryland). [Paper][Website]
  • POVT: "Patch-based Object-centric Transformers for Efficient Video Generation", arXiv, 2022 (Berkeley). [Paper][PyTorch][Website]
  • TAIN: "Cross-Attention Transformer for Video Interpolation", arXiv, 2022 (Duke). [Paper]
  • TTVFI: "TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation", arXiv, 2022 (Microsoft). [Paper]
  • TECO: "Temporally Consistent Video Transformer for Long-Term Video Prediction", arXiv, 2022 (Berkeley). [Paper][Jax][Website]
  • SlotFormer: "SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models", arXiv, 2022 (University of Toronto). [Paper][Website]

[Back to Overview]

Transfer / Translation / Manipulation

  • AdaAttN: "AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer", ICCV, 2021 (Baidu). [Paper][Paddle][PyTorch]
  • StyleCLIP: "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery", ICCV, 2021 (Hebrew University of Jerusalem). [Paper][PyTorch]
  • StyTr2: "StyTr^2: Unbiased Image Style Transfer with Transformers", CVPR, 2022 (CAS). [Paper][PyTorch]
  • InstaFormer: "InstaFormer: Instance-Aware Image-to-Image Translation with Transformer", CVPR, 2022 (Korea University). [Paper]
  • ManiTrans: "ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation", CVPR, 2022 (Huawei). [Paper][Website]
  • QS-Attn: "QS-Attn: Query-Selected Attention for Contrastive Learning in I2I Translation", CVPR, 2022 (Shanghai Key Laboratory). [Paper][PyTorch]
  • ASSET: "ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions", SIGGRAPH, 2022 (Adobe). [Paper][PyTorch][Website]
  • SCAM: "SCAM! Transferring humans between images with Semantic Cross Attention Modulation", ECCV, 2022 (Univ Gustave Eiffel, France). [Paper][PyTorch][Website]
  • TargetCLIP: "Image-Based CLIP-Guided Essence Transfer", ECCV, 2022 (Tel Aviv). [Paper][PyTorch]
  • STTR: "Fine-Grained Image Style Transfer with Visual Transformers", ACCV, 2022 (The University of Tokyo). [Paper][PyTorch (in construction)]
  • Splice: "Splicing ViT Features for Semantic Appearance Transfer", arXiv, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
  • UVCGAN: "UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation", arXiv, 2022 (Brookhaven National Laboratory, NY). [Paper]
  • ITTR: "ITTR: Unpaired Image-to-Image Translation with Transformers", arXiv, 2022 (Kuaishou). [Paper]
  • CLIPasso: "CLIPasso: Semantically-Aware Object Sketching", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
  • CTrGAN: "CTrGAN: Cycle Transformers GAN for Gait Transfer", arXiv, 2022 (Ariel University, Israel). [Paper]
  • PI-Trans: "PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation", arXiv, 2022 (University of Trento, Italy). [Paper][PyTorch (in construction)]
  • CSLA: "Bridging CLIP and StyleGAN through Latent Alignment for Image Editing", arXiv, 2022 (Kuaishou). [Paper]
  • CLIP-PAE: "CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Image Manipulation", arXiv, 2022 (University of Cambridge). [Paper]
  • FFCLIP: "One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]

[Back to Overview]

Other Low-Level Tasks

  • Colorization:
    • ColTran: "Colorization Transformer", ICLR, 2021 (Google). [Paper][Tensorflow]
    • ViT-I-GAN: "ViT-Inception-GAN for Image Colourising", arXiv, 2021 (D.Y Patil College of Engineering, India). [Paper]
    • CT2: "CT2: Colorization Transformer via Color Tokens", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • L-CoDer: "L-CoDer: Language-based Colorization with Color-object Decoupling Transformer", ECCV, 2022 (Beijing University of Posts and Telecommunications). [Paper]
    • ColorFormer: "ColorFormer: Image Colorization via Color Memory assisted Hybrid-attention Transformer", ECCV, 2022 (Tencent). [Paper]
    • UniColor: "UniColor: A Unified Framework for Multi-Modal Colorization with Transformer", SIGGRAPH Asia, 2022 (CUHK). [Paper][Website]
    • iColoriT: "iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer", arXiv, 2022 (KAIST). [Paper]
  • Enhancement:
    • PanFormer: "PanFormer: a Transformer Based Model for Pan-sharpening", ICME, 2022 (Beihang University). [Paper][PyTorch]
    • URSCT-UIE: "Reinforced Swin-Convs Transformer for Underwater Image Enhancement", arXiv, 2022 (Ningbo University). [Paper]
    • IAT: "Illumination Adaptive Transformer", arXiv, 2022 (The University of Tokyo). [Paper][PyTorch]
    • SPGAT: "Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
  • HDR:
    • CA-ViT: "Ghost-free High Dynamic Range Imaging with Context-aware Transformer", ECCV, 2022 (Megvii). [Paper][PyTorch]
    • Text2Light: "Text2Light: Zero-Shot Text-Driven HDR Panorama Generation", SIGGRAPH Asia, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
  • Harmonization:
    • HT: "Image Harmonization With Transformer", ICCV, 2021 (Ocean University of China). [Paper]
  • Compression:
    • ?: "Towards End-to-End Image Compression and Analysis with Transformers", AAAI, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
    • Entroformer: "Entroformer: A Transformer-based Entropy Model for Learned Image Compression", ICLR, 2022 (Alibaba). [Paper]
    • STF: "The Devil Is in the Details: Window-based Attention for Image Compression", CVPR, 2022 (CAS). [Paper][PyTorch]
    • Contextformer: "Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression", ECCV, 2022 (TUM). [Paper]
    • VCT: "VCT: A Video Compression Transformer", arXiv, 2022 (Google). [Paper]
  • Matting:
    • MatteFormer: "MatteFormer: Transformer-Based Image Matting via Prior-Tokens", CVPR, 2022 (SNU + NAVER). [Paper][PyTorch]
    • TransMatting: "TransMatting: Enhancing Transparent Objects Matting with Transformers", ECCV, 2022 (CAS). [Paper][Code (in construction)]
    • VMFormer: "VMFormer: End-to-End Video Matting with Transformer", arXiv, 2022 (PicsArt). [Paper][PyTorch][Website]
  • Reconstruction:
    • ET-Net: "Event-Based Video Reconstruction Using Transformer", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
    • GradViT: "GradViT: Gradient Inversion of Vision Transformers", CVPR, 2022 (NVIDIA). [Paper][Website]
    • MST: "Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
    • MST++: "MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction", CVPRW, 2022 (Tsinghua). [Paper][PyTorch]
    • CST: "Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
    • DAUHST: "Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
    • S2-Transformer: "S2-Transformer for Mask-Aware Hyperspectral Image Reconstruction", arXiv, 2022 (Rochester Institute of Technology). [Paper]
  • 3D:
    • MNSRNet: "MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution", CVPR, 2022 (Shenzhen University). [Paper]
  • Others:
    • TransMEF: "TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning", AAAI, 2022 (Fudan). [Paper]
    • MS-Unet: "Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer", CVPR, 2022 (Megvii). [Paper][Code (in construction)]
    • TransCL: "TransCL: Transformer Makes Strong and Flexible Compressive Learning", TPAMI, 2022 (Peking University). [Paper][Code (in construction)]
    • GAP-CSCoT: "Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer", arXiv, 2022 (CAS). [Paper]
    • MatFormer: "MatFormer: A Generative Model for Procedural Materials", arXiv, 2022 (Adobe). [Paper]
    • FishFormer: "FishFormer: Annulus Slicing-based Transformer for Fisheye Rectification with Efficacy Domain Exploration", arXiv, 2022 (Beijing Jiaotong University). [Paper]
    • STFormer: "Spatial-Temporal Transformer for Video Snapshot Compressive Imaging", arXiv, 2022 (CAS). [Paper][PyTorch]

[Back to Overview]

Reinforcement Learning

Navigation

  • VTNet: "VTNet: Visual Transformer Network for Object Goal Navigation", ICLR, 2021 (ANU). [Paper]
  • MaAST: "MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation", ICRA, 2021 (SRI). [Paper]
  • TransFuser: "Multi-Modal Fusion Transformer for End-to-End Autonomous Driving", CVPR, 2021 (MPI). [Paper][PyTorch]
  • CMTP: "Topological Planning With Transformers for Vision-and-Language Navigation", CVPR, 2021 (Stanford). [Paper]
  • VLN-BERT: "VLN-BERT: A Recurrent Vision-and-Language BERT for Navigation", CVPR, 2021 (ANU). [Paper][PyTorch]
  • E.T.: "Episodic Transformer for Vision-and-Language Navigation", ICCV, 2021 (Google). [Paper][PyTorch]
  • HAMT: "History Aware Multimodal Transformer for Vision-and-Language Navigation", NeurIPS, 2021 (INRIA). [Paper][PyTorch][Website]
  • SOAT: "SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation", NeurIPS, 2021 (Georgia Tech). [Paper]
  • OMT: "Object Memory Transformer for Object Goal Navigation", ICRA, 2022 (AIST, Japan). [Paper]
  • ADAPT: "ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts", CVPR, 2022 (Huawei). [Paper]
  • DUET: "Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation", CVPR, 2022 (INRIA). [Paper][Website]
  • LSA: "Local Slot Attention for Vision-and-Language Navigation", ICMR, 2022 (Fudan). [Paper]
  • ?: "Learning from Unlabeled 3D Environments for Vision-and-Language Navigation", ECCV, 2022 (INRIA). [Paper][Website]
  • MTVM: "Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • AVLEN: "AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments", NeurIPS, 2022 (UC Riverside). [Paper]
  • TransFuser: "TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving", arXiv, 2022 (MPI). [Paper]
  • TD-STP: "Target-Driven Structured Transformer Planner for Vision-Language Navigation", arXiv, 2022 (Beihang University). [Paper][Code (in construction)]
  • DAVIS: "Anticipating the Unseen Discrepancy for Vision and Language Navigation", arXiv, 2022 (UCSB). [Paper]
  • LOViS: "LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation", arXiv, 2022 (Michigan State). [Paper]
  • IVLN: "Iterative Vision-and-Language Navigation", arXiv, 2022 (Oregon State University). [Paper]

[Back to Overview]

Other RL Tasks

  • SVEA: "Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation", arXiv, 2021 (UCSD). [Paper][GitHub][Website]
  • LocoTransformer: "Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers", ICLR, 2022 (UCSD). [Paper][Website]
  • STAM: "Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes", CVPR, 2022 (McGill University, Canada). [Paper][PyTorch]
  • CtrlFormer: "CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer", ICML, 2022 (HKU). [Paper][PyTorch][Website]
  • PromptDT: "Prompting Decision Transformer for Few-Shot Policy Generalization", ICML, 2022 (CMU). [Paper][Website]
  • StARformer: "StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning", ECCV, 2022 (Stony Brook). [Paper][PyTorch]
  • RAD: "Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels", arXiv, 2022 (UBC, Canada). [Paper]
  • MWM: "Masked World Models for Visual Control", arXiv, 2022 (Berkeley). [Paper][Tensorflow][Website]
  • IRIS: "Transformers are Sample Efficient World Models", arXiv, 2022 (University of Geneva, Switzerland). [Paper][PyTorch]

[Back to Overview]

Medical

Medical Segmentation

  • Cross-Transformer: "The entire network structure of Crossmodal Transformer", ICBSIP, 2021 (Capital Medical University). [Paper]
  • Segtran: "Medical Image Segmentation using Squeeze-and-Expansion Transformers", IJCAI, 2021 (A*STAR). [Paper]
  • i-ViT: "Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image", MICCAI, 2021 (Xi'an Jiaotong University). [Paper][PyTorch][Website]
  • UTNet: "UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation", MICCAI, 2021 (Rutgers). [Paper]
  • MCTrans: "Multi-Compound Transformer for Accurate Biomedical Image Segmentation", MICCAI, 2021 (HKU + CUHK). [Paper][Code (in construction)]
  • Polyformer: "Few-Shot Domain Adaptation with Polymorphic Transformers", MICCAI, 2021 (A*STAR). [Paper][PyTorch]
  • BA-Transformer: "Boundary-aware Transformers for Skin Lesion Segmentation", MICCAI, 2021 (Xiamen University). [Paper][PyTorch]
  • GT-U-Net: "GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation", MICCAIW, 2021 (Hangzhou Dianzi University). [Paper][PyTorch]
  • STN: "Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation", ISBI, 2021 (Institut Polytechnique de Paris). [Paper]
  • T-AutoML: "T-AutoML: Automated Machine Learning for Lesion Segmentation Using Transformers in 3D Medical Imaging", ICCV, 2021 (NVIDIA). [Paper]
  • MedT: "Medical Transformer: Gated Axial-Attention for Medical Image Segmentation", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
  • Convolution-Free: "Convolution-Free Medical Image Segmentation using Transformers", arXiv, 2021 (Harvard). [Paper]
  • CoTr: "CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation", arXiv, 2021 (Northwestern Polytechnical University). [Paper][PyTorch]
  • TransBTS: "TransBTS: Multimodal Brain Tumor Segmentation Using Transformer", arXiv, 2021 (University of Science and Technology Beijing). [Paper][PyTorch]
  • SpecTr: "SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation", arXiv, 2021 (East China Normal University). [Paper][Code (in construction)]
  • U-Transformer: "U-Net Transformer: Self and Cross Attention for Medical Image Segmentation", arXiv, 2021 (CEDRIC). [Paper]
  • TransUNet: "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
  • PMTrans: "Pyramid Medical Transformer for Medical Image Segmentation", arXiv, 2021 (Washington University in St. Louis). [Paper]
  • PBT-Net: "Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy", arXiv, 2021 (Hangzhou Dianzi University). [Paper]
  • Swin-Unet: "Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation", arXiv, 2021 (Huawei). [Paper][Code (in construction)]
  • MBT-Net: "A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation", arXiv, 2021 (Southern University of Science and Technology). [Paper]
  • WAD: "More than Encoder: Introducing Transformer Decoder to Upsample", arXiv, 2021 (South China University of Technology). [Paper]
  • LeViT-UNet: "LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation", arXiv, 2021 (Wuhan Institute of Technology). [Paper]
  • ?: "Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation", arXiv, 2021 (Vanderbilt University). [Paper]
  • nnFormer: "nnFormer: Interleaved Transformer for Volumetric Segmentation", arXiv, 2021 (HKU + Xiamen University). [Paper][PyTorch]
  • MISSFormer: "MISSFormer: An Effective Medical Image Segmentation Transformer", arXiv, 2021 (Beijing University of Posts and Telecommunications). [Paper]
  • TUnet: "Transformer-Unet: Raw Image Processing with Unet", arXiv, 2021 (Beijing Zoezen Robot + Beihang University). [Paper]
  • BiTr-Unet: "BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation", arXiv, 2021 (New York University). [Paper]
  • ?: "Transformer Assisted Convolutional Network for Cell Instance Segmentation", arXiv, 2021 (IIT Dhanbad). [Paper]
  • ?: "Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining", arXiv, 2021 (Ukrainian Catholic University). [Paper]
  • UNETR: "UNETR: Transformers for 3D Medical Image Segmentation", WACV, 2022 (NVIDIA). [Paper][PyTorch]
  • AFTer-UNet: "AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation", WACV, 2022 (UC Irvine). [Paper]
  • UCTransNet: "UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer", AAAI, 2022 (Northeastern University, China). [Paper][PyTorch]
  • Swin-UNETR: "Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis", CVPR, 2022 (NVIDIA). [Paper][PyTorch]
  • ?: "Transformer-based out-of-distribution detection for clinically safe segmentation", Medical Imaging with Deep Learning (MIDL), 2022 (King’s College London). [Paper]
  • ScaleFormer: "ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation", IJCAI, 2022 (Zhejiang University). [Paper][Code (in construction)]
  • FCBFormer: "FCN-Transformer Feature Fusion for Polyp Segmentation", Annual Conference on Medical Image Understanding and Analysis (MIUA), 2022 (University of Central Lancashire, UK). [Paper][PyTorch]
  • VDFormer: "View-Disentangled Transformer for Brain Lesion Detection", ISBI, 2022 (CUHK). [Paper][PyTorch]
  • TFCNs: "TFCNs: A CNN-Transformer Hybrid Network for Medical Image Segmentation", International Conference on Artificial Neural Networks (ICANN), 2022 (Xiamen University). [Paper][PyTorch (in construction)]
  • MIL: "Transformer based multiple instance learning for weakly supervised histopathology image segmentation", MICCAI, 2022 (Beihang University). [Paper]
  • mmFormer: "mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation", MICCAI, 2022 (CAS). [Paper][PyTorch]
  • Patcher: "Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation", MICCAI, 2022 (Pennsylvania State University). [Paper]
  • NestedFormer: "NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation", MICCAI, 2022 (Tianjin University). [Paper][Code (in construction)]
  • TransDeepLab: "TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation", MICCAIW, 2022 (RWTH Aachen University, Germany). [Paper][PyTorch]
  • Video-TransUNet: "Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation", International Conference on Machine Vision (ICMV), 2022 (University of Bristol, UK). [Paper]
  • Tempera: "Tempera: Spatial Transformer Feature Pyramid Network for Cardiac MRI Segmentation", arXiv, 2022 (ICL). [Paper]
  • UTNetV2: "A Multi-scale Transformer for Medical Image Segmentation: Architectures, Model Efficiency, and Benchmarks", arXiv, 2022 (Rutgers). [Paper]
  • UNesT: "Characterizing Renal Structures with 3D Block Aggregate Transformers", arXiv, 2022 (Vanderbilt University, Tennessee). [Paper]
  • PHTrans: "PHTrans: Parallelly Aggregating Global and Local Representations for Medical Image Segmentation", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • UNeXt: "UNeXt: MLP-based Rapid Medical Image Segmentation Network", arXiv, 2022 (JHU). [Paper][PyTorch]
  • TransFusion: "TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers", arXiv, 2022 (Rutgers). [Paper]
  • UNetFormer: "UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation", arXiv, 2022 (NVIDIA). [Paper][GitHub]
  • 3D-Shuffle-Mixer: "3D Shuffle-Mixer: An Efficient Context-Aware Vision Learner of Transformer-MLP Paradigm for Dense Prediction in Medical Volume", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
  • ?: "Continual Hippocampus Segmentation with Transformers", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
  • TranSiam: "TranSiam: Fusing Multimodal Visual Features Using Transformer for Medical Image Segmentation", arXiv, 2022 (Tianjin University). [Paper]
  • ColonFormer: "ColonFormer: An Efficient Transformer based Method for Colon Polyp Segmentation", arXiv, 2022 (Hanoi University of Science and Technology). [Paper]
  • ?: "Transformer based Generative Adversarial Network for Liver Segmentation", arXiv, 2022 (Northwestern University). [Paper]
  • FCT: "The Fully Convolutional Transformer for Medical Image Segmentation", arXiv, 2022 (University of Glasgow, UK). [Paper]
  • XBound-Former: "XBound-Former: Toward Cross-scale Boundary Modeling in Transformers", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • Polyp-PVT: "Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers", arXiv, 2022 (IIAI). [Paper][PyTorch]
  • SeATrans: "SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer", arXiv, 2022 (Baidu). [Paper]
  • TransResU-Net: "TransResU-Net: Transformer based ResU-Net for Real-Time Colonoscopy Polyp Segmentation", arXiv, 2022 (Indira Gandhi National Open University). [Paper][Code (in construction)]
  • LViT: "LViT: Language meets Vision Transformer in Medical Image Segmentation", arXiv, 2022 (Alibaba). [Paper][Code (in construction)]
  • APFormer: "The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive Pruning", arXiv, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • ?: "Transformer based Models for Unsupervised Anomaly Segmentation in Brain MR Images", arXiv, 2022 (University of Rennes, France). [Paper][Tensorflow]
  • CKD-TransBTS: "CKD-TransBTS: Clinical Knowledge-Driven Hybrid Transformer with Modality-Correlated Cross-Attention for Brain Tumor Segmentation", arXiv, 2022 (South China University of Technology). [Paper]
  • HiFormer: "HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation", arXiv, 2022 (Iran University of Science and Technology). [Paper][PyTorch]
  • ?: "Contextual Attention Network: Transformer Meets U-Net", arXiv, 2022 (RWTH Aachen University). [Paper][PyTorch]
  • HRSTNet: "High-Resolution Swin Transformer for Automatic Medical Image Segmentation", arXiv, 2022 (Xi'an University of Posts and Telecommunications). [Paper][Code (in construction)]
  • TransNorm: "TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model", arXiv, 2022 (Aachen University, Germany). [Paper][PyTorch]
  • ?: "When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation", arXiv, 2022 (Oxford). [Paper][Code (in construction)]
  • CM-MLP: "CM-MLP: Cascade Multi-scale MLP with Axial Context Relation Encoder for Edge Segmentation of Medical Image", arXiv, 2022 (Zhengzhou University). [Paper]
  • CATS: "Cats: Complementary CNN and Transformer Encoders for Segmentation", arXiv, 2022 (Vanderbilt University, Nashville). [Paper]
  • TFusion: "TFusion: Transformer based N-to-One Multimodal Fusion Block", arXiv, 2022 (South China University of Technology). [Paper]
  • AutoPET: "AutoPET Challenge: Combining nn-Unet with Swin UNETR Augmented by Maximum Intensity Projection Classifier", arXiv, 2022 (University Hospital Essen, Germany). [Paper]
  • SPAN: "Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers", arXiv, 2022 (Berkeley). [Paper]
  • TMSS: "TMSS: An End-to-End Transformer-based Multimodal Network for Segmentation and Survival Prediction", arXiv, 2022 (MBZUAI). [Paper]
  • CR-Swin2-VT: "Hybrid Window Attention Based Transformer Architecture for Brain Tumor Segmentation", arXiv, 2022 (Monash University). [Paper][PyTorch]
  • 3DUX-Net: "3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation", arXiv, 2022 (Vanderbilt University). [Paper][PyTorch]
  • FocalUNETR: "FocalUNETR: A Focal Transformer for Boundary-aware Segmentation of CT Images", arXiv, 2022 (Wayne State University, Detroit). [Paper]
  • LAPFormer: "LAPFormer: A Light and Accurate Polyp Segmentation Transformer", arXiv, 2022 (Sun*, Hanoi). [Paper]
  • FINE: "Memory transformers for full context and high-resolution 3D Medical Segmentation", arXiv, 2022 (National Conservatory of Arts and Crafts, France). [Paper]
  • ConvTransSeg: "ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation", arXiv, 2022 (University of Nottingham, UK). [Paper]
  • CS-Unet: "Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation", arXiv, 2022 (University of Glasgow, UK). [Paper]

[Back to Overview]

Medical Classification

  • COVID19T: "A Transformer-Based Framework for Automatic COVID19 Diagnosis in Chest CTs", ICCVW, 2021 (?). [Paper][PyTorch]
  • TransMIL: "TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classication", NeurIPS, 2021 (Tsinghua University). [Paper][PyTorch]
  • TransMed: "TransMed: Transformers Advance Multi-modal Medical Image Classification", arXiv, 2021 (Northeastern University). [Paper]
  • CXR-ViT: "Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification", arXiv, 2021 (KAIST). [Paper]
  • ViT-TSA: "Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer", arXiv, 2021 (Queen’s University). [Paper]
  • GasHis-Transformer: "GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification", arXiv, 2021 (Northeastern University). [Paper]
  • POCFormer: "POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound", arXiv, 2021 (The Ohio State University). [Paper]
  • COVID-ViT: "COVID-VIT: Classification of COVID-19 from CT chest images based on vision transformer models", arXiv, 2021 (Middlesex University, UK). [Paper][PyTorch]
  • EEG-ConvTransformer: "EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification", arXiv, 2021 (IIT Ropar). [Paper]
  • CCAT: "Visual Transformer with Statistical Test for COVID-19 Classification", arXiv, 2021 (NCKU). [Paper]
  • M3T: "M3T: Three-Dimensional Medical Image Classifier Using Multi-Plane and Multi-Slice Transformer", CVPR, 2022 (Yonsei University). [Paper]
  • ?: "A comparative study between vision transformers and CNNs in digital pathology", CVPRW, 2022 (Roche, Switzerland). [Paper]
  • SCT: "Context-Aware Transformers For Spinal Cancer Detection and Radiological Grading", MICCAI, 2022 (Oxford). [Paper]
  • KAT: "Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification", MICCAI, 2022 (Beihang University). [Paper][PyTorch]
  • SEViT: "Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification", MICCAI, 2022 (MBZUAI). [Paper][PyTorch]
  • MF-ViT: "Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis", MICCAIW, 2022 (Rutgers University). [Paper][PyTorch]
  • SB-SSL: "SB-SSL: Slice-Based Self-Supervised Transformers for Knee Abnormality Classification from MRI", MICCAIW, 2022 (University of Surrey, UK). [Paper]
  • RadioTransformer: "RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification", ECCV, 2022 (Stony Brook). [Paper][Tensorflow (in construction)]
  • ScoreNet: "ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification", arXiv, 2022 (EPFL). [Paper]
  • LA-MIL: "Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction", arXiv, 2022 (TUM). [Paper]
  • HoVer-Trans: "HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free Breast Cancer Diagnosis in Ultrasound Images", arXiv, 2022 (South China University of Technology). [Paper]
  • GTP: "A graph-transformer for whole slide image classification", arXiv, 2022 (Boston University). [Paper]
  • ?: "Zero-Shot and Few-Shot Learning for Lung Cancer Multi-Label Classification using Vision Transformer", arXiv, 2022 (Harvard). [Paper]
  • SwinCheX: "SwinCheX: Multi-label classification on chest X-ray images with transformers", arXiv, 2022 (Sharif University of Technology, Iran). [Paper]
  • SGT: "Rectify ViT Shortcut Learning by Visual Saliency", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • IPMN-ViT: "Neural Transformers for Intraductal Papillary Mucosal Neoplasms (IPMN) Classification in MRI images", arXiv, 2022 (University of Catania, Italy). [Paper]
  • ?: "Multi-Label Retinal Disease Classification using Transformers", arXiv, 2022 (Khalifa University, UAE). [Paper][PyTorch]
  • TractoFormer: "TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers", arXiv, 2022 (Harvard). [Paper]
  • BrainFormer: "BrainFormer: A Hybrid CNN-Transformer Model for Brain fMRI Data Classification", arXiv, 2022 (Chinese PLA General Hospital). [Paper]
  • SI-ViT: "Shuffle Instances-based Vision Transformer for Pancreatic Cancer ROSE Image Classification", arXiv, 2022 (Beihang University). [Paper][PyTorch]

[Back to Overview]

Medical Detection

  • COTR: "COTR: Convolution in Transformer Network for End to End Polyp Detection", arXiv, 2021 (Fuzhou University). [Paper]
  • TR-Net: "Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • CAE-Transformer: "CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans", arXiv, 2021 (Concordia University, Canada). [Paper]
  • DATR: "DATR: Domain-adaptive transformer for multi-domain landmark detection", arXiv, 2022 (CAS). [Paper]
  • SATr: "SATr: Slice Attention with Transformer for Universal Lesion Detection", arXiv, 2022 (CAS). [Paper]
  • Focused-Decoder: "Focused Decoding Enables 3D Anatomical Detection by Transformers", arXiv, 2022 (TUM). [Paper][PyTorch]

[Back to Overview]

Medical Reconstruction

  • T2Net: "Task Transformer Network for Joint MRI Reconstruction and Super-Resolution", MICCAI, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
  • FIT: "Fourier Image Transformer", arXiv, 2021 (MPI). [Paper][PyTorch]
  • SLATER: "Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers", arXiv, 2021 (Bilkent University). [Paper]
  • MTrans: "MTrans: Multi-Modal Transformer for Accelerated MR Imaging", arXiv, 2021 (Harbin Institute of Technology). [Paper][PyTorch]
  • SDAUT: "Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI", MICCAI, 2022 (ICL). [Paper]
  • ?: "Adaptively Re-weighting Multi-Loss Untrained Transformer for Sparse-View Cone-Beam CT Reconstruction", arXiv, 2022 (Zhejiang Lab). [Paper]
  • K-Space-Transformer: "K-Space Transformer for Fast MRI Reconstruction with Implicit Representation", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][Code (in construction)][Website]
  • McSTRA: "Multi-head Cascaded Swin Transformers with Attention to k-space Sampling Pattern for Accelerated MRI Reconstruction", arXiv, 2022 (Monash University, Australia). [Paper]
  • ?: "Colonoscopy Landmark Detection using Vision Transformers", arXiv, 2022 (Intuitive Surgical, CA). [Paper]

[Back to Overview]

Medical Low-Level Vision

  • Eformer: "Eformer: Edge Enhancement based Transformer for Medical Image Denoising", ICCV, 2021 (BITS Pilani, India). [Paper]
  • PTNet: "PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer", arXiv, 2021 (Columbia). [Paper]
  • ResViT: "ResViT: Residual vision transformers for multi-modal medical image synthesis", arXiv, 2021 (Bilkent University, Turkey). [Paper]
  • CyTran: "CyTran: Cycle-Consistent Transformers for Non-Contrast to Contrast CT Translation", arXiv, 2021 (University Politehnica of Bucharest, Romania). [Paper][PyTorch]
  • McMRSR: "Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution", CVPR, 2022 (Yantai University, China). [Paper][PyTorch]
  • RPLHR-CT: "RPLHR-CT Dataset and Transformer Baseline for Volumetric Super-Resolution from CT Scans", MICCAI, 2022 (Infervision Medical Technology, China). [Paper][Code (in construction)]
  • W-G2L-ART: "Wide Range MRI Artifact Removal with Transformers", BMVC, 2022 (KTH). [Paper]
  • RFormer: "RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark", arXiv, 2022 (Tsinghua). [Paper]
  • CTformer: "CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising", arXiv, 2022 (UMass Lowell). [Paper][PyTorch]
  • Cohf-T: "Cross-Modality High-Frequency Transformer for MR Image Super-Resolution", arXiv, 2022 (Xidian University). [Paper]
  • SIST: "Low-Dose CT Denoising via Sinogram Inner-Structure Transformer", arXiv, 2022 (?). [Paper]
  • Spach-Transformer: "Spach Transformer: Spatial and Channel-wise Transformer Based on Local and Global Self-attentions for PET Image Denoising", arXiv, 2022 (Harvard). [Paper]

[Back to Overview]

Medical Others

  • LAT: "Lesion-Aware Transformers for Diabetic Retinopathy Grading", CVPR, 2021 (USTC). [Paper]
  • UVT: "Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation", MICCAI, 2021 (ICL). [Paper][PyTorch]
  • ?: "Surgical Instruction Generation with Transformers", MICCAI, 2021 (Bournemouth University, UK). [Paper]
  • AlignTransformer: "AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation", MICCAI, 2021 (Peking University). [Paper]
  • MCAT: "Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images", ICCV, 2021 (Harvard). [Paper][PyTorch]
  • ?: "Is it Time to Replace CNNs with Transformers for Medical Images?", ICCVW, 2021 (KTH, Sweden). [Paper]
  • HAT-Net: "HAT-Net: A Hierarchical Transformer Graph Neural Network for Grading of Colorectal Cancer Histology Images", BMVC, 2021 (Beijing University of Posts and Telecommunications). [Paper]
  • ?: "Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training", NeurIPS, 2021 (KAIST). [Paper]
  • ViT-Path: "Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology", NeurIPSW, 2021 (Microsoft). [Paper]
  • Global-Local-Transformer: "Global-Local Transformer for Brain Age Estimation", IEEE Transactions on Medical Imaging, 2021 (Harvard). [Paper][PyTorch]
  • CE-TFE: "Deep Transformers for Fast Small Intestine Grounding in Capsule Endoscope Video", arXiv, 2021 (Sun Yat-Sen University). [Paper]
  • DeepProg: "DeepProg: A Transformer-based Framework for Predicting Disease Prognosis", arXiv, 2021 (University of Oulu). [Paper]
  • Medical-Transformer: "Medical Transformer: Universal Brain Encoder for 3D MRI Analysis", arXiv, 2021 (Korea University). [Paper]
  • RATCHET: "RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting", arXiv, 2021 (ICL). [Paper]
  • C2FViT: "Affine Medical Image Registration with Coarse-to-Fine Vision Transformer", CVPR, 2022 (HKUST). [Paper][Code (in construction)]
  • HIPT: "Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning", CVPR, 2022 (Harvard). [Paper]
  • CGT: "Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation", CVPR, 2022 (University of Technology Sydney). [Paper]
  • SiT: "Surface Analysis with Vision Transformers", CVPRW, 2022 (King’s College London, UK). [Paper][PyTorch]
  • SiT: "Surface Vision Transformers: Attention-Based Modelling applied to Cortical Analysis", Medical Imaging with Deep Learning (MIDL), 2022 (King’s College London, UK). [Paper]
  • ViT-V-Net: "ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration", ICML, 2022 (JHU). [Paper][PyTorch]
  • HybridStereoNet: "Deep Laparoscopic Stereo Matching with Transformers", MICCAI, 2022 (Monash University, Australia). [Paper][PyTorch]
  • BabyNet: "BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video", MICCAI, 2022 (Sano Centre for Computational Medicine, Poland). [Paper][PyTorch]
  • TLT: "Transformer Lesion Tracker", MICCAI, 2022 (InferVision Medical Technology, China). [Paper]
  • XMorpher: "XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention", MICCAI, 2022 (Southeast University, China). [Paper][PyTorch]
  • SVoRT: "SVoRT: Iterative Transformer for Slice-to-Volume Registration in Fetal Brain MRI", MICCAI, 2022 (MIT). [Paper]
  • GaitForeMer: "GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation", MICCAI, 2022 (Stanford). [Paper][PyTorch]
  • MCGN: "A Medical Semantic-Assisted Transformer for Radiographic Report Generation", MICCAI, 2022 (University of Sydney). [Paper]
  • M3AE: "Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training", MICCAI, 2022 (CUHK). [Paper][PyTorch]
  • LKU-Net: "U-Net vs Transformer: Is U-Net Outdated in Medical Image Registration?", MICCAIW, 2022 (University of Birmingham, UK). [Paper]
  • LVOT: "Shifted Windows Transformers for Medical Image Quality Assessment", MICCAIW, 2022 (Istanbul Technical University, Turkey). [Paper]
  • MINiT: "Multiple Instance Neuroimage Transformer", MICCAIW, 2022 (Stanford). [Paper][Code (in construction)]
  • BrainNetTF: "Brain Network Transformer", NeurIPS, 2022 (Emory University). [Paper][PyTorch]
  • SiT: "Surface Vision Transformers: Flexible Attention-Based Modelling of Biomedical Surfaces", arXiv, 2022 (King’s College London, UK). [Paper][PyTorch]
  • TransMorph: "TransMorph: Transformer for unsupervised medical image registration", arXiv, 2022 (JHU). [Paper]
  • MDBERT: "Hierarchical BERT for Medical Document Understanding", arXiv, 2022 (IQVIA, NC). [Paper]
  • SymTrans: "Symmetric Transformer-based Network for Unsupervised Image Registration", arXiv, 2022 (Jilin University). [Paper]
  • MMT: "One Model to Synthesize Them All: Multi-contrast Multi-scale Transformer for Missing Data Imputation", arXiv, 2022 (JHU). [Paper]
  • EG-ViT: "Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning", arXiv, 2022 (Northwestern Polytechnical University). [Paper]
  • CSM: "Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection", arXiv, 2022 (University of Adelaide, Australia). [Paper]
  • Surgical-VQA: "Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer", arXiv, 2022 (NUS). [Paper][PyTorch (in construction)]
  • SwinMLP-TranCAP: "Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches", arXiv, 2022 (CUHK). [Paper][PyTorch]
  • CASHformer: "CASHformer: Cognition Aware SHape Transformer for Longitudinal Analysis", arXiv, 2022 (TUM). [Paper]
  • ARST: "ARST: Auto-Regressive Surgical Transformer for Phase Recognition from Laparoscopic Videos", arXiv, 2022 (Shanghai Jiao Tong University). [Paper]
  • SAT: "Medical Image Captioning via Generative Pretrained Transformers", arXiv, 2022 (Philips Innovation Labs Rus, Russia). [Paper]
  • RepsNet: "RepsNet: Combining Vision with Language for Automated Medical Reports", arXiv, 2022 (Google). [Paper][Website]

[Back to Overview]

Other Tasks

  • Active Learning:
    • TJLS: "Visual Transformer for Task-aware Active Learning", arXiv, 2021 (ICL). [Paper][PyTorch]
  • Agriculture:
    • PlantXViT: "Explainable vision transformer enabled convolutional neural network for plant disease identification: PlantXViT", arXiv, 2022 (Indian Institute of Information Technology). [Paper]
  • Animation-related:
    • AnT: "The Animation Transformer: Visual Correspondence via Segment Matching", ICCV, 2021 (Cadmium). [Paper]
    • AniFormer: "AniFormer: Data-driven 3D Animation with Transformer", BMVC, 2021 (University of Oulu, Finland). [Paper][PyTorch]
  • Biology:
    • ?: "A State-of-the-art Survey of Object Detection Techniques in Microorganism Image Analysis: from Traditional Image Processing and Classical Machine Learning to Current Deep Convolutional Neural Networks and Potential Visual Transformers", arXiv, 2021 (Northeastern University). [Paper]
  • Brain Score:
    • CrossViT: "Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4", CVPRW, 2022 (MIT). [Paper][PyTorch]
  • Camera-related:
    • CTRL-C: "CTRL-C: Camera calibration TRansformer with Line-Classification", ICCV, 2021 (Kakao + Kookmin University). [Paper][PyTorch]
    • MS-Transformer: "Learning Multi-Scene Absolute Pose Regression with Transformers", ICCV, 2021 (Bar-Ilan University, Israel). [Paper][PyTorch]
    • GTCaR: "GTCaR: Graph Transformer for Camera Re-localization", ECCV, 2022 (Magic Leap). [Paper]
  • Character Recognition:
    • BTTR: "Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer", arXiv, 2021 (Peking). [Paper]
    • TrOCR: "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models", arXiv, 2021 (Microsoft). [Paper][PyTorch]
    • ?: "Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks", arXiv, 2021 (Salesforce). [Paper]
    • T3: "TrueType Transformer: Character and Font Style Recognition in Outline Format", Document Analysis Systems (DAS), 2022 (Kyushu University). [Paper]
    • ?: "Transformer-based HTR for Historical Documents", ComHum, 2022 (University of Zurich, Switzerland). [Paper]
    • ?: "SVG Vector Font Generation for Chinese Characters with Transformer", ICIP, 2022 (The University of Tokyo). [Paper]
    • LP-Transformer: "Forensic License Plate Recognition with Compression-Informed Transformers", ICIP, 2022 (University of Erlangen-Nurnberg, Germany). [Paper]
    • CoMER: "CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • CONSENT: "CONSENT: Context Sensitive Transformer for Bold Words Classification", arXiv, 2022 (Amazon). [Paper]
  • Curriculum Learning:
    • SSTN: "Spatial Transformer Networks for Curriculum Learning", arXiv, 2021 (TU Kaiserslautern, Germany). [Paper]
  • Defect Classification:
    • MSHViT: "Multi-Scale Hybrid Vision Transformer and Sinkhorn Tokenizer for Sewer Defect Classification", CVPRW, 2022 (Aalborg University, Denmark). [Paper]
    • DefT: "Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
  • Digital Holography:
    • ?: "Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography", ICCCR, 2022 (UBFC, France). [Paper]
  • Disentangled representation:
    • VCT: "Visual Concepts Tokenization", arXiv, 2022 (Microsoft). [Paper]
  • Event data:
    • EvT: "Event Transformer: A sparse-aware solution for efficient event data processing", arXiv, 2022 (Universidad de Zaragoza, Spain). [Paper][PyTorch]
    • ETB: "Event Transformer", arXiv, 2022 (Nanjing University). [Paper]
  • Fashion:
    • Kaleido-BERT: "Kaleido-BERT: Vision-Language Pre-training on Fashion Domain", CVPR, 2021 (Alibaba). [Paper][Tensorflow]
    • CIT: "Cloth Interactive Transformer for Virtual Try-On", arXiv, 2021 (University of Trento). [Paper][Code (in construction)]
    • ClothFormer: "ClothFormer: Taming Video Virtual Try-on in All Module", CVPR, 2022 (iQIYI). [Paper][Website]
    • FashionVLP: "FashionVLP: Vision Language Transformer for Fashion Retrieval With Feedback", CVPR, 2022 (Amazon). [Paper]
    • FashionViL: "FashionViL: Fashion-Focused Vision-and-Language Representation Learning", ECCV, 2022 (University of Surrey, UK). [Paper][PyTorch]
    • OutfitTransformer: "OutfitTransformer: Learning Outfit Representations for Fashion Recommendation", arXiv, 2022 (Amazon). [Paper]
    • Fashionformer: "Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition", ECCV, 2022 (Peking). [Paper][PyTorch]
  • Feature Matching:
    • SuperGlue: "SuperGlue: Learning Feature Matching with Graph Neural Networks", CVPR, 2020 (Magic Leap). [Paper][PyTorch]
    • LoFTR: "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR, 2021 (Zhejiang University). [Paper][PyTorch][Website]
    • COTR: "COTR: Correspondence Transformer for Matching Across Images", ICCV, 2021 (UBC). [Paper]
    • CATs: "CATs: Cost Aggregation Transformers for Visual Correspondence", NeurIPS, 2021 (Yonsei University + Korea University). [Paper][PyTorch][Website]
    • TransforMatcher: "TransforMatcher: Match-to-Match Attention for Semantic Correspondence", CVPR, 2022 (POSTECH). [Paper]
    • ASpanFormer: "ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer", ECCV, 2022 (HKUST). [Paper][Website]
    • CATs++: "CATs++: Boosting Cost Aggregation with Convolutions and Transformers", arXiv, 2022 (Korea University). [Paper]
    • LoFTR-TensorRT: "Local Feature Matching with Transformers for low-end devices", arXiv, 2022 (?). [Paper][PyTorch]
    • MatchFormer: "MatchFormer: Interleaving Attention in Transformers for Feature Matching", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • OpenGlue: "OpenGlue: Open Source Graph Neural Net Based Pipeline for Image Matching", arXiv, 2022 (Ukrainian Catholic University). [Paper][PyTorch]
  • Fine-grained:
    • ViT-FGVC: "Exploring Vision Transformers for Fine-grained Classification", CVPRW, 2021 (Universidad de Valladolid). [Paper]
    • FFVT: "Feature Fusion Vision Transformer for Fine-Grained Visual Categorization", BMVC, 2021 (Griffith University, Australia). [Paper][PyTorch]
    • TPSKG: "Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition", arXiv, 2021 (Beihang University). [Paper]
    • AFTrans: "A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition", arXiv, 2021 (Peking University). [Paper]
    • TransFG: "TransFG: A Transformer Architecture for Fine-grained Recognition", AAAI, 2022 (Johns Hopkins). [Paper][PyTorch]
    • DynamicMLP: "Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information", CVPR, 2022 (Megvii). [Paper][PyTorch]
    • SIM-Trans: "SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization", ACMMM, 2022 (Peking University). [Paper][PyTorch]
    • MetaFormer: "MetaFormer: A Unified Meta Framework for Fine-Grained Recognition", arXiv, 2022 (ByteDance). [Paper][PyTorch]
    • ViT-FOD: "ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator", arXiv, 2022 (Shandong University). [Paper]
  • Gait:
    • Gait-TR: "Spatial Transformer Network on Skeleton-based Gait Recognition", arXiv, 2022 (South China University of Technology). [Paper]
  • Gaze:
    • GazeTR: "Gaze Estimation using Transformer", arXiv, 2021 (Beihang University). [Paper][PyTorch]
    • HGTTR: "End-to-End Human-Gaze-Target Detection with Transformers", CVPR, 2022 (Shanghai Jiao Tong). [Paper]
    • MGTR: "MGTR: End-to-End Mutual Gaze Detection with Transformer", ACCV, 2022 (Nankai University). [Paper][PyTorch]
    • GLC: "In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation", arXiv, 2022 (Georgia Tech). [Paper][Website]
  • Geo-Localization:
    • EgoTR: "Cross-view Geo-localization with Evolving Transformer", arXiv, 2021 (Shenzhen University). [Paper]
    • TransGeo: "TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization", CVPR, 2022 (UCF). [Paper][PyTorch]
    • GAMa: "GAMa: Cross-view Video Geo-localization", ECCV, 2022 (UCF). [Paper][Code (in construction)]
    • TransLocator: "Where in the World is this Image? Transformer-based Geo-localization in the Wild", ECCV, 2022 (JHU). [Paper]
    • TransGCNN: "Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization", arXiv, 2022 (Southeast University, China). [Paper]
    • MGTL: "Mutual Generative Transformer Learning for Cross-view Geo-localization", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • Homography Estimation:
    • LocalTrans: "LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation", ICCV, 2021 (Tsinghua). [Paper]
  • Image Registration:
    • AiR: "Attention for Image Registration (AiR): an unsupervised Transformer approach", arXiv, 2021 (INRIA). [Paper]
  • Image Retrieval:
    • RRT: "Instance-level Image Retrieval using Reranking Transformers", ICCV, 2021 (University of Virginia). [Paper][PyTorch]
    • SwinFGHash: "SwinFGHash: Fine-grained Image Retrieval via Transformer-based Hashing Network", BMVC, 2021 (Tsinghua). [Paper]
    • ViT-Retrieval: "Investigating the Vision Transformer Model for Image Retrieval Tasks", arXiv, 2021 (Democritus University of Thrace). [Paper]
    • IRT: "Training Vision Transformers for Image Retrieval", arXiv, 2021 (Facebook + INRIA). [Paper]
    • TransHash: "TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval", arXiv, 2021 (Shanghai Jiao Tong University). [Paper]
    • VTS: "Vision Transformer Hashing for Image Retrieval", arXiv, 2021 (IIIT-Allahabad). [Paper]
    • GTZSR: "Zero-Shot Sketch Based Image Retrieval using Graph Transformer", arXiv, 2022 (IIT Bombay). [Paper]
    • EViT: "EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing", arXiv, 2022 (Jinan University). [Paper][PyTorch (in construction)]
    • ?: "Transformers and CNNs both Beat Humans on SBIR", arXiv, 2022 (University of Mons, Belgium). [Paper]
  • Layout Generation:
    • VTN: "Variational Transformer Networks for Layout Generation", CVPR, 2021 (Google). [Paper]
    • LayoutTransformer: "LayoutTransformer: Scene Layout Generation With Conceptual and Spatial Diversity", CVPR, 2021 (NTU). [Paper][PyTorch]
    • LayoutTransformer: "LayoutTransformer: Layout Generation and Completion with Self-attention", ICCV, 2021 (Amazon). [Paper][Website]
    • LGT-Net: "LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • CADTransformer: "CADTransformer: Panoptic Symbol Spotting Transformer for CAD Drawings", CVPR, 2022 (UT Austin). [Paper]
    • GAT-CADNet: "GAT-CADNet: Graph Attention Network for Panoptic Symbol Spotting in CAD Drawings", CVPR, 2022 (TUM + Alibaba). [Paper]
    • LayoutBERT: "LayoutBERT: Masked Language Layout Model for Object Insertion", CVPRW, 2022 (Adobe). [Paper]
    • ICVT: "Geometry Aligned Variational Transformer for Image-conditioned Layout Generation", ACMMM, 2022 (Alibaba). [Paper]
    • BLT: "BLT: Bidirectional Layout Transformer for Controllable Layout Generation", ECCV, 2022 (Google). [Paper][Tensorflow][Website]
    • ATEK: "ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis", arXiv, 2022 (New Jersey Institute of Technology). [Paper]
    • ?: "Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades", arXiv, 2022 (Simon Fraser). [Paper]
    • UniLayout: "UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation", arXiv, 2022 (Microsoft). [Paper]
  • Livestock Monitoring:
    • STARFormer: "Livestock Monitoring with Transformer", BMVC, 2021 (IIT Dhanbad). [Paper]
  • Long-tail:
    • BatchFormer: "BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • BatchFormerV2: "BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning", arXiv, 2022 (The University of Sydney). [Paper]
    • LPT: "LPT: Long-tailed Prompt Tuning for Image Classification", arXiv, 2022 (Harbin Institute of Technology). [Paper]
  • Metric Learning:
    • Hyp-ViT: "Hyperbolic Vision Transformers: Combining Improvements in Metric Learning", CVPR, 2022 (University of Trento, Italy). [Paper][PyTorch]
  • Multi-Input:
    • MixViT: "Adapting Multi-Input Multi-Output schemes to Vision Transformers", CVPRW, 2022 (Sorbonne Universite, France). [Paper]
  • Multi-label:
    • C-Tran: "General Multi-label Image Classification with Transformers", CVPR, 2021 (University of Virginia). [Paper]
    • TDRG: "Transformer-Based Dual Relation Graph for Multi-Label Image Recognition", ICCV, 2021 (Tencent). [Paper]
    • MlTr: "MlTr: Multi-label Classification with Transformer", arXiv, 2021 (KuaiShou). [Paper]
    • GATN: "Graph Attention Transformer Network for Multi-Label Image Classification", arXiv, 2022 (Southeast University, China). [Paper]
  • Multi-task:
    • MulT: "MulT: An End-to-End Multitask Learning Transformer", CVPR, 2022 (EPFL). [Paper]
  • Open Set:
    • OSR-ViT: "Open Set Recognition using Vision Transformer with an Additional Detection Head", arXiv, 2022 (Vanderbilt University, Tennessee). [Paper]
  • Out-Of-Distribution:
    • OODformer: "OODformer: Out-Of-Distribution Detection Transformer", BMVC, 2021 (LMU Munich). [Paper][PyTorch]
  • Pedestrian Intention:
    • IntFormer: "IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture", arXiv, 2021 (Universidad de Alcala). [Paper]
  • Physics Simulation:
    • TIE: "Transformer with Implicit Edges for Particle-based Physics Simulation", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
  • Place Recognition:
    • SVT-Net: "SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers", AAAI, 2022 (Renmin University of China). [Paper]
    • TransVPR: "TransVPR: Transformer-based place recognition with multi-level attention aggregation", CVPR, 2022 (Xi'an Jiaotong). [Paper]
    • OverlapTransformer: "OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition", IROS, 2022 (HAOMO.AI, China). [Paper][PyTorch]
    • SeqOT: "SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data", arXiv, 2022 (National University of Defense Technology, China). [Paper][PyTorch]
  • Remote Sensing/Hyperspectral/Satellite:
    • DCFAM: "Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images", arXiv, 2021 (Wuhan University). [Paper]
    • WiCNet: "Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images", arXiv, 2021 (University of Trento). [Paper]
    • ?: "Vision Transformers For Weeds and Crops Classification Of High Resolution UAV Images", arXiv, 2021 (University of Orleans, France). [Paper]
    • Satellite-ViT: "Manipulation Detection in Satellite Images Using Vision Transformer", arXiv, 2021 (Purdue). [Paper]
    • ?: "Self-supervised Vision Transformers for Joint SAR-optical Representation Learning", IGARSS, 2022 (German Aerospace Center). [Paper]
    • VBFusion: "Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing", SPIE Remote Sensing, 2022 (Technische Universitat Berlin, Germany). [Paper][PyTorch]
    • ANDT: "Anomaly Detection in Aerial Videos with Transformers", IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2022 (TUM). [Paper]
    • RNGDet: "RNGDet: Road Network Graph Detection by Transformer in Aerial Images", arXiv, 2022 (HKUST). [Paper]
    • FSRA: "A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization", arXiv, 2022 (China Jiliang University). [Paper][PyTorch]
    • ?: "Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification", arXiv, 2022 (Shenzhen University). [Paper]
    • ?: "Deep Hyperspectral Unmixing using Transformer Network", arXiv, 2022 (Jalpaiguri Engineering College, India). [Paper]
    • SiamixFormer: "SiamixFormer: A Siamese Transformer Network For Building Detection And Change Detection From Bi-Temporal Remote Sensing Images", arXiv, 2022 (Tarbiat Modares University, Iran). [Paper]
    • SatMAE: "SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery", arXiv, 2022 (Stanford). [Paper]
    • DAHiTrA: "DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture", arXiv, 2022 (Simon Fraser University, Canada). [Paper]
    • RVSA: "Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model", arXiv, 2022 (Wuhan University + The University of Sydney). [Paper]
    • SatViT: "Transfer Learning with Pretrained Remote Sensing Transformers", arXiv, 2022 (?). [Paper][PyTorch]
    • FTN: "Fully Transformer Network for Change Detection of Remote Sensing Images", arXiv, 2022 (Dalian University of Technology). [Paper]
    • MCTNet: "MCTNet: A Multi-Scale CNN-Transformer Network for Change Detection in Optical Remote Sensing Images", arXiv, 2022 (Tsinghua University). [Paper]
  • Robotics:
    • TF-Grasp: "When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection", arXiv, 2022 (University of Science and Technology of China). [Paper][Code (in construction)]
    • BeT: "Behavior Transformers: Cloning k modes with one stone", arXiv, 2022 (NYU). [Paper][PyTorch]
    • Perceiver-Actor: "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation", Conference on Robot Learning (CoRL), 2022 (NVIDIA). [Paper][Website]
    • PACT: "PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training", arXiv, 2022 (Microsoft). [Paper]
    • ?: "A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
    • ?: "Grounding Language with Visual Affordances over Unstructured Data", arXiv, 2022 (University of Freiburg, Germany). [Paper][Website]
    • VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", arXiv, 2022 (NVIDIA). [Paper][PyTorch][Website]
  • Scene Decomposition:
    • SRT: "Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations", CVPR, 2022 (Google). [Paper][PyTorch (stelzner)][Website]
    • OSRT: "Object Scene Representation Transformer", arXiv, 2022 (Google). [Paper][Website]
  • Scene Text Recognition:
    • ViTSTR: "Vision Transformer for Fast and Efficient Scene Text Recognition", ICDAR, 2021 (University of the Philippines). [Paper]
    • STKM: "Self-attention based Text Knowledge Mining for Text Detection", CVPR, 2021 (?). [Paper][Code (in construction)]
    • I2C2W: "I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition", arXiv, 2021 (NTU Singapore). [Paper]
    • CornerTransformer: "Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • VLAMD: "Vision-Language Adaptive Mutual Decoder for OOV-STR", ECCVW, 2022 (iFLYTEK, China). [Paper]
  • Spike:
    • Spikformer: "Spikformer: When Spiking Neural Network Meets Transformer", arXiv, 2022 (Peking). [Paper]
  • Stereo:
    • STTR: "Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers", ICCV, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PS-Transformer: "PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism", BMVC, 2021 (National Institute of Informatics, Japan). [Paper]
    • ChiTransformer: "ChiTransformer: Towards Reliable Stereo from Cues", CVPR, 2022 (GSU). [Paper]
    • TransMVSNet: "TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers", CVPR, 2022 (Megvii). [Paper][Code (in construction)]
    • MVSTER: "MVSTER: Epipolar Transformer for Efficient Multi-View Stereo", ECCV, 2022 (CAS). [Paper][PyTorch]
    • WT-MVSNet: "WT-MVSNet: Window-based Transformers for Multi-view Stereo", arXiv, 2022 (Tsinghua University). [Paper]
    • MVSFormer: "MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo", arXiv, 2022 (Fudan University). [Paper]
  • Time Series:
    • MissFormer: "MissFormer: (In-)attention-based handling of missing observations for trajectory filtering and prediction", arXiv, 2021 (Fraunhofer IOSB, Germany). [Paper]
  • Traffic:
    • NEAT: "NEAT: Neural Attention Fields for End-to-End Autonomous Driving", ICCV, 2021 (MPI). [Paper][PyTorch]
    • ViTAL: "Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder", IV, 2021 (Technische Hochschule Ingolstadt). [Paper]
    • ?: "Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information", IVS, 2021 (Universidad de Alcala). [Paper]
    • ?: "Translating Images into Maps", ICRA, 2022 (University of Surrey, UK). [Paper][PyTorch (in construction)]
    • Crossview-Transformer: "Cross-view Transformers for real-time Map-view Semantic Segmentation", CVPR, 2022 (UT Austin). [Paper][PyTorch]
    • ViT-BEVSeg: "ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation", IJCNN, 2022 (Maynooth University, Ireland). [Paper][Code (in construction)]
    • TransLPC: "Transformers for Object Detection in Large Point Clouds", ITSC, 2022 (Bosch). [Paper]
    • PicT: "PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification", ACMMM, 2022 (Chongqing University). [Paper][PyTorch (in construction)]
    • BEVFormer: "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • JPerceiver: "JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
    • V2X-ViT: "V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer", ECCV, 2022 (UCLA). [Paper]
    • MTR: "Motion Transformer with Global Intention Localization and Local Movement Refinement", NeurIPS, 2022 (MPI). [Paper][Code (in construction)]
    • BEVSegFormer: "BEVSegFormer: Bird's Eye View Semantic Segmentation From Arbitrary Camera Rigs", arXiv, 2022 (Nullmax, China). [Paper]
    • ParkPredict+: "ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer", arXiv, 2022 (Berkeley). [Paper]
    • GKT: "Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer", arXiv, 2022 (Huazhong University of Science and Technology). [Paper][Code (in construction)]
    • CoBEVT: "CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers", arXiv, 2022 (UCLA). [Paper]
    • ?: "Pyramid Transformer for Traffic Sign Detection", arXiv, 2022 (Iran University of Science and Technology). [Paper]
    • UniFormer: "UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View", arXiv, 2022 (Zhejiang University). [Paper]
    • STrajNet: "STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer", arXiv, 2022 (NTU, Singapore). [Paper]
    • MTPP: "Multi-modal Transformer Path Prediction for Autonomous Vehicle", arXiv, 2022 (National Central University). [Paper]
    • MapTR: "MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction", arXiv, 2022 (Horizon Robotics). [Paper][Code (in construction)]
    • DCT: "A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View", arXiv, 2022 (Gwangju Institute of Science and Technology). [Paper]
    • C-ViT: "Traffic Accident Risk Forecasting using Contextual Vision Transformers", arXiv, 2022 (University of Technology Sydney). [Paper]
  • Trajectory Prediction:
    • mmTransformer: "Multimodal Motion Prediction with Stacked Transformers", CVPR, 2021 (CUHK + SenseTime). [Paper][Code (in construction)][Website]
    • AgentFormer: "AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting", ICCV, 2021 (CMU). [Paper][PyTorch][Website]
    • S2TNet: "S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving", ACML, 2021 (Xi'an Jiaotong University). [Paper][PyTorch]
    • MRT: "Multi-Person 3D Motion Prediction with Multi-Range Transformers", NeurIPS, 2021 (UCSD + Berkeley). [Paper][PyTorch][Website]
    • ?: "Latent Variable Sequential Set Transformers for Joint Multi-Agent Motion Prediction", ICLR, 2022 (MILA). [Paper]
    • Scene-Transformer: "Scene Transformer: A unified architecture for predicting multiple agent trajectories", ICLR, 2022 (Google). [Paper]
    • ST-MR: "Graph-based Spatial Transformer with Memory Replay for Multi-Future Pedestrian Trajectory Prediction", CVPR, 2022 (University of New South Wales, Australia). [Paper][Tensorflow]
    • HiVT: "HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction", CVPR, 2022 (CUHK). [Paper]
    • EF-Transformer: "Entry-Flipped Transformer for Inference and Prediction of Participant Behavior", ECCV, 2022 (NTU, Singapore). [Paper]
    • Social-SSL: "Social-SSL: Self-Supervised Cross-Sequence Representation Learning Based on Transformers for Multi-Agent Trajectory Prediction", ECCV, 2022 (NYCU). [Paper][PyTorch]
    • LatentFormer: "LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction", arXiv, 2022 (Huawei). [Paper]
    • PreTR: "PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer", arXiv, 2022 (Stellantis, France). [Paper]
    • Wayformer: "Wayformer: Motion Forecasting via Simple & Efficient Attention Networks", arXiv, 2022 (Waymo). [Paper]
    • LaTTe: "LaTTe: Language Trajectory TransformEr", arXiv, 2022 (TUM). [Paper][Tensorflow]
    • SoMoFormer: "SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • ViewBirdiformer: "ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view", arXiv, 2022 (Kyoto University). [Paper]
    • PedFormer: "PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning", arXiv, 2022 (Huawei). [Paper]
  • Visual Counting:
    • CC-AV: "Audio-Visual Transformer Based Crowd Counting", ICCVW, 2021 (University of Kansas). [Paper]
    • TransCrowd: "TransCrowd: Weakly-Supervised Crowd Counting with Transformer", arXiv, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • TAM-RTM: "Boosting Crowd Counting with Transformers", arXiv, 2021 (ETHZ). [Paper]
    • CCTrans: "CCTrans: Simplifying and Improving Crowd Counting with Transformer", arXiv, 2021 (Meituan). [Paper]
    • MAN: "Boosting Crowd Counting via Multifaceted Attention", CVPR, 2022 (Xi'an Jiaotong). [Paper][PyTorch]
    • CLTR: "An End-to-End Transformer Model for Crowd Localization", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch][Website]
    • SAANet: "Scene-Adaptive Attention Network for Crowd Counting", arXiv, 2022 (Xi'an Jiaotong). [Paper]
    • JCTNet: "Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting", arXiv, 2022 (Chongqing University). [Paper]
    • CrowdMLP: "CrowdMLP: Weakly-Supervised Crowd Counting via Multi-Granularity MLP", arXiv, 2022 (University of Guelph, Canada). [Paper]
    • CounTR: "CounTR: Transformer-based Generalised Visual Counting", arXiv, 2022 (Shanghai Jiao Tong University). [Paper][Website]
  • Visual Quality Assessment:
    • TRIQ: "Transformer for Image Quality Assessment", arXiv, 2020 (NORCE). [Paper][Tensorflow-Keras]
    • IQT: "Perceptual Image Quality Assessment with Transformers", CVPRW, 2021 (LG). [Paper][Code (in construction)]
    • MUSIQ: "MUSIQ: Multi-scale Image Quality Transformer", ICCV, 2021 (Google). [Paper]
    • TranSLA: "Saliency-Guided Transformer Network Combined With Local Embedding for No-Reference Image Quality Assessment", ICCVW, 2021 (Hikvision). [Paper]
    • TReS: "No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency", WACV, 2022 (CMU). [Paper]
    • IQA-Conformer: "Conformer and Blind Noisy Students for Improved Image Quality Assessment", CVPRW, 2022 (University of Wurzburg, Germany). [Paper][PyTorch]
    • SwinIQA: "SwinIQA: Learned Swin Distance for Compressed Image Quality Assessment", CVPRW, 2022 (USTC, China). [Paper]
    • DCVQE: "DCVQE: A Hierarchical Transformer for Video Quality Assessment", ACCV, 2022 (Weibo). [Paper]
    • MCAS-IQA: "Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment", arXiv, 2022 (Norwegian Research Centre, Norway). [Paper]
    • MSTRIQ: "MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion", arXiv, 2022 (ByteDance). [Paper]
    • DisCoVQA: "DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment", arXiv, 2022 (NTU, Singapore). [Paper]
  • Visual Reasoning:
    • SAViR-T: "SAViR-T: Spatially Attentive Visual Reasoning with Transformers", arXiv, 2022 (Rutgers University). [Paper]
  • 3D Human Texture Estimation:
    • Texformer: "3D Human Texture Estimation from a Single Image with Transformers", ICCV, 2021 (NTU, Singapore). [Paper][PyTorch][Website]
  • 3D Motion Synthesis:
    • ACTOR: "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV, 2021 (Univ Gustave Eiffel). [Paper][PyTorch][Website]
    • RTVAE: "Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis", CVPRW, 2022 (Amazon). [Paper]
    • MotionCLIP: "MotionCLIP: Exposing Human Motion Generation to CLIP Space", ECCV, 2022 (Tel Aviv). [Paper]
    • CLIP-Actor: "CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes", ECCV, 2022 (POSTECH). [Paper][PyTorch][Website]
    • ActFormer: "ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation", arXiv, 2022 (SenseTime). [Paper]
    • ?: "Diverse Dance Synthesis via Keyframes with Transformer Controllers", arXiv, 2022 (Beihang University). [Paper]
    • MARIONET: "NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System", arXiv, 2022 (Wuhan University). [Paper]
  • 3D Object Recognition:
    • MVT: "MVT: Multi-view Vision Transformer for 3D Object Recognition", BMVC, 2021 (Baidu). [Paper]
  • 3D Reconstruction:
    • PlaneTR: "PlaneTR: Structure-Guided Transformers for 3D Plane Recovery", ICCV, 2021 (Wuhan University). [Paper][PyTorch]
    • CO3D: "Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VolT: "Multi-view 3D Reconstruction with Transformer", ICCV, 2021 (University of British Columbia). [Paper]
    • 3D-RETR: "3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers", BMVC, 2021 (ETHZ). [Paper][PyTorch]
    • TransformerFusion: "TransformerFusion: Monocular RGB Scene Reconstruction using Transformers", NeurIPS, 2021 (TUM). [Paper][Website]
    • LegoFormer: "LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction", arXiv, 2021 (TUM + Google). [Paper]
    • PlaneFormers: "PlaneFormers: From Sparse View Planes to 3D Reconstruction", ECCV, 2022 (Michigan). [Paper][PyTorch][Website]
    • 3D-C2FT: "3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction", arXiv, 2022 (Korea Institute of Science and Technology). [Paper]
  • 360 Scene:
    • ?: "Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning", AAAI, 2022 (Seoul National University). [Paper][PyTorch]
    • PAVER: "Panoramic Vision Transformer for Saliency Detection in 360° Videos", ECCV, 2022 (Seoul National University). [Paper]
    • PanoFormer: "PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation", ECCV, 2022 (Beijing Jiaotong University). [Paper]
    • SPH: "Spherical Transformer", arXiv, 2022 (Chung-Ang University, Korea). [Paper]
  • Others:
    • ?: "Connecting Compression Spaces with Transformer for Approximate Nearest Neighbor Search", ECCV, 2022 (Intellifusion, China). [Paper]
    • ?: "Strong Gravitational Lensing Parameter Estimation with Vision Transformer", ECCVW, 2022 (CMU). [Paper][PyTorch]
    • Transformer-DR: "Transformer-based dimensionality reduction", arXiv, 2022 (Chongqing Normal University, China). [Paper]

[Back to Overview]


Attention Mechanisms in Vision/NLP

Attention for Vision

  • AA: "Attention Augmented Convolutional Networks", ICCV, 2019 (Google). [Paper][PyTorch (Unofficial)][Tensorflow (Unofficial)]
  • LR-Net: "Local Relation Networks for Image Recognition", ICCV, 2019 (Microsoft). [Paper][PyTorch (Unofficial)]
  • CCNet: "CCNet: Criss-Cross Attention for Semantic Segmentation", ICCV, 2019 (& TPAMI 2020) (Horizon). [Paper][PyTorch]
  • GCNet: "Global Context Networks", ICCVW, 2019 (& TPAMI 2020) (Microsoft). [Paper][PyTorch]
  • SASA: "Stand-Alone Self-Attention in Vision Models", NeurIPS, 2019 (Google). [Paper][PyTorch-1 (Unofficial)][PyTorch-2 (Unofficial)]
    • key message: attention modules are more efficient than convolutions and provide comparable accuracy
  • Axial-Transformer: "Axial Attention in Multidimensional Transformers", arXiv, 2019 (Google). [Paper][PyTorch (Unofficial)]
  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • SAN: "Exploring Self-attention for Image Recognition", CVPR, 2020 (CUHK + Intel). [Paper][PyTorch]
  • BA-Transform: "Non-Local Neural Networks With Grouped Bilinear Attentional Transforms", CVPR, 2020 (ByteDance). [Paper]
  • Axial-DeepLab: "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation", ECCV, 2020 (Google). [Paper][PyTorch]
  • GSA: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (Unofficial)]
  • EA: "Efficient Attention: Attention with Linear Complexities", WACV, 2021 (SenseTime). [Paper][PyTorch]
  • LambdaNetworks: "LambdaNetworks: Modeling Long-Range Interactions Without Attention", ICLR, 2021 (Google). [Paper][PyTorch-1 (Unofficial)][PyTorch-2 (Unofficial)]
  • GSA-Nets: "Group Equivariant Stand-Alone Self-Attention For Vision", ICLR, 2021 (EPFL). [Paper]
  • Hamburger: "Is Attention Better Than Matrix Decomposition?", ICLR, 2021 (Peking). [Paper][PyTorch (Unofficial)]
  • HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper]
  • BoTNet: "Bottleneck Transformers for Visual Recognition", CVPR, 2021 (Google). [Paper]
  • SSAN: "SSAN: Separable Self-Attention Network for Video Representation Learning", CVPR, 2021 (Microsoft). [Paper]
  • CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
  • Involution: "Involution: Inverting the Inherence of Convolution for Visual Recognition", CVPR, 2021 (HKUST). [Paper][PyTorch]
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • SNL: "Unifying Nonlocal Blocks for Neural Networks", ICCV, 2021 (Peking + ByteDance). [Paper]
  • External-Attention: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua). [Paper]
  • Container: "Container: Context Aggregation Network", arXiv, 2021 (AI2). [Paper]
  • X-volution: "X-volution: On the unification of convolution and self-attention", arXiv, 2021 (Huawei Hisilicon). [Paper]
  • Invertible-Attention: "Invertible Attention", arXiv, 2021 (ANU). [Paper]
  • VOLO: "VOLO: Vision Outlooker for Visual Recognition", arXiv, 2021 (Sea AI Lab + NUS, Singapore). [Paper][PyTorch]
  • LESA: "Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms", arXiv, 2021 (Johns Hopkins). [Paper]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][Jax]
  • ?: "Fair Comparison between Efficient Attentions", CVPRW, 2022 (Kyungpook National University, Korea). [Paper][PyTorch]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", ECCV, 2022 (Alibaba). [Paper][PyTorch]
  • Hydra: "Hydra Attention: Efficient Attention with Many Heads", ECCVW, 2022 (Meta). [Paper]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • AttendNeXt: "Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers", arXiv, 2022 (University of Waterloo, Canada). [Paper]

[Back to Overview]

Attention for NLP

  • T-DMCA: "Generating Wikipedia by Summarizing Long Sequences", ICLR, 2018 (Google). [Paper]
  • LSRA: "Lite Transformer with Long-Short Range Attention", ICLR, 2020 (MIT). [Paper][PyTorch]
  • ETC: "ETC: Encoding Long and Structured Inputs in Transformers", EMNLP, 2020 (Google). [Paper][Tensorflow]
  • BlockBERT: "Blockwise Self-Attention for Long Document Understanding", EMNLP Findings, 2020 (Facebook). [Paper][GitHub]
  • Clustered-Attention: "Fast Transformers with Clustered Attention", NeurIPS, 2020 (Idiap). [Paper][PyTorch][Website]
  • BigBird: "Big Bird: Transformers for Longer Sequences", NeurIPS, 2020 (Google). [Paper][Tensorflow]
  • Longformer: "Longformer: The Long-Document Transformer", arXiv, 2020 (AI2). [Paper][PyTorch]
  • Linformer: "Linformer: Self-Attention with Linear Complexity", arXiv, 2020 (Facebook). [Paper][PyTorch (Unofficial)]
  • Nystromformer: "Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention", AAAI, 2021 (UW-Madison). [Paper][PyTorch]
  • RFA: "Random Feature Attention", ICLR, 2021 (DeepMind). [Paper]
  • Performer: "Rethinking Attention with Performers", ICLR, 2021 (Google). [Paper][Code][Blog]
  • DeLight: "DeLighT: Deep and Light-weight Transformer", ICLR, 2021 (UW). [Paper]
  • Synthesizer: "Synthesizer: Rethinking Self-Attention for Transformer Models", ICML, 2021 (Google). [Paper][Tensorflow][PyTorch (leaderj1001)]
  • Poolingformer: "Poolingformer: Long Document Modeling with Pooling Attention", ICML, 2021 (Microsoft). [Paper]
  • Hi-Transformer: "Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling", ACL, 2021 (Tsinghua). [Paper]
  • Smart-Bird: "Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer", arXiv, 2021 (Tsinghua). [Paper]
  • Fastformer: "Fastformer: Additive Attention is All You Need", arXiv, 2021 (Tsinghua). [Paper]
  • ∞-former: "∞-former: Infinite Memory Transformer", arXiv, 2021 (Instituto de Telecomunicações, Portugal). [Paper]
  • cosFormer: "cosFormer: Rethinking Softmax In Attention", ICLR, 2022 (SenseTime). [Paper][PyTorch (davidsvy)]
  • MGK: "Improving Transformers with Probabilistic Attention Keys", ICML, 2022 (UCLA). [Paper]

[Back to Overview]

Attention for Both

  • Sparse-Transformer: "Generating Long Sequences with Sparse Transformers", arXiv, 2019 (OpenAI). [Paper][Tensorflow][Blog]
  • Reformer: "Reformer: The Efficient Transformer", ICLR, 2020 (Google). [Paper][Tensorflow][Blog]
  • Sinkhorn-Transformer: "Sparse Sinkhorn Attention", ICML, 2020 (Google). [Paper][PyTorch (Unofficial)]
  • Linear-Transformer: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", ICML, 2020 (Idiap). [Paper][PyTorch][Website]
  • SMYRF: "SMYRF: Efficient Attention using Asymmetric Clustering", NeurIPS, 2020 (UT Austin + Google). [Paper][PyTorch]
  • Routing-Transformer: "Efficient Content-Based Sparse Attention with Routing Transformers", TACL, 2021 (Google). [Paper][Tensorflow][PyTorch (Unofficial)][Slides]
  • LRA: "Long Range Arena: A Benchmark for Efficient Transformers", ICLR, 2021 (Google). [Paper][Tensorflow]
  • OmniNet: "OmniNet: Omnidirectional Representations from Transformers", ICML, 2021 (Google). [Paper]
  • Evolving-Attention: "Evolving Attention with Residual Convolutions", ICML, 2021 (Peking + Microsoft). [Paper]
  • H-Transformer-1D: "H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences", ACL, 2021 (Google). [Paper]
  • Combiner: "Combiner: Full Attention Transformer with Sparse Computation Cost", NeurIPS, 2021 (Google). [Paper]
  • Centroid-Transformer: "Centroid Transformers: Learning to Abstract with Attention", arXiv, 2021 (UT Austin). [Paper]
  • AFT: "An Attention Free Transformer", arXiv, 2021 (Apple). [Paper]
  • Luna: "Luna: Linear Unified Nested Attention", arXiv, 2021 (USC + CMU + Facebook). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", arXiv, 2021 (NVIDIA). [Paper]
  • PoNet: "PoNet: Pooling Network for Efficient Token Mixing in Long Sequences", ICLR, 2022 (Alibaba). [Paper]
  • Paramixer: "Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention", CVPR, 2022 (Norwegian University of Science and Technology, Norway). [Paper]
  • ContextPool: "Efficient Representation Learning via Adaptive Context Pooling", ICML, 2022 (Apple). [Paper]
  • LARA: "Linear Complexity Randomized Self-attention Mechanism", ICML, 2022 (ByteDance). [Paper]
  • Flowformer: "Flowformer: Linearizing Transformers with Conservation Flows", ICML, 2022 (Tsinghua University). [Paper][PyTorch]
  • MRA: "Multi Resolution Analysis (MRA) for Approximate Self-Attention", ICML, 2022 (University of Wisconsin, Madison). [Paper][PyTorch]
  • EcoFormer: "EcoFormer: Energy-Saving Attention with Linear Complexity", NeurIPS, 2022 (Monash University). [Paper][Code (in construction)]
  • ?: "Horizontal and Vertical Attention in Transformers", arXiv, 2022 (University of Technology Sydney). [Paper]
  • MRL: "MRL: Learning to Mix with Attention and Convolutions", arXiv, 2022 (Sony). [Paper]

[Back to Overview]

Attention for Others

  • Informer: "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", AAAI, 2021 (Beihang University). [Paper][PyTorch]
  • Attention-Rank-Collapse: "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth", ICML, 2021 (Google + EPFL). [Paper][PyTorch]
  • ?: "Choose a Transformer: Fourier or Galerkin", NeurIPS, 2021 (Washington University, St. Louis). [Paper]
  • NPT: "Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning", arXiv, 2021 (Oxford). [Paper]
  • FEDformer: "FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting", ICML, 2022 (Alibaba). [Paper][PyTorch]
  • ?: "Generalizable Memory-driven Transformer for Multivariate Long Sequence Time-series Forecasting", arXiv, 2022 (University of Technology Sydney). [Paper]

[Back to Overview]


Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

References
