- 2D Scene Understanding
- 3D Scene Understanding
- NeRF/Gaussian
- 2D Generation
- 3D Generation
- Human
- Video
- LLM/MLLM/VLM
- Transformer
- Diffusion
Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
- Homepage : https://www.robots.ox.ac.uk/~vgg/research/ovdiff/
- Paper : https://arxiv.org/abs/2306.09316
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
- Homepage : https://github.com/gpt4vision/OvSGTR/
- Paper : https://arxiv.org/abs/2311.10988
Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
- Homepage : https://github.com/xiuqhou/Relation-DETR
- Paper : https://arxiv.org/abs/2407.11699
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
- Homepage : https://github.com/xjwu1024/WPS-SAM
- Paper : https://arxiv.org/abs/2407.10131
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
- Homepage : https://vlislab22.github.io/Any2Seg/
- Paper : https://arxiv.org/abs/2407.11351
OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
- Paper : https://arxiv.org/abs/2404.10312
Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
- Homepage : https://github.com/jiaosiyu1999/MAFT-Plus
- Paper : https://arxiv.org/abs/2408.00744
ESA: Annotation-Efficient Active Learning for Semantic Segmentation
- Homepage : https://github.com/jinchaogjc/ESA
- Paper : https://arxiv.org/abs/2408.13491
Towards Scene Graph Anticipation
- Homepage : https://github.com/rohithpeddi/SceneSayer/tree/main
- Paper : https://arxiv.org/abs/2403.04899
Dataset Enhancement with Instance-Level Augmentations
- Homepage : https://www.robots.ox.ac.uk/~vgg/research/instance-augmentation/
- Paper : https://arxiv.org/abs/2406.08249
An Adaptive Correspondence Scoring Framework for Unsupervised Image Registration of Medical Images
- Homepage : https://voldemort108x.github.io/AdaCS/
- Paper : https://arxiv.org/abs/2312.00837
HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
- Homepage : https://github.com/XiangZ-0/HiT-SR
- Paper : https://arxiv.org/abs/2407.05878
Towards Open-ended Visual Quality Comparison
- Homepage : https://co-instruct.github.io/
- Paper : https://arxiv.org/abs/2402.16641
CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model
- Homepage : https://github.com/weihao1115/cat-sam
- Paper : https://arxiv.org/abs/2402.03631
A Fair Ranking and New Model for Panoptic Scene Graph Generation
- Paper : https://arxiv.org/abs/2407.09216
Parrot Captions Teach CLIP to Spot Text
- Homepage : https://linyq17.github.io/CLIP-Parrot-Bias/
- Paper : https://arxiv.org/abs/2312.14232
On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
- Homepage : https://github.com/fiveai/detection_calibration
- Paper : https://arxiv.org/abs/2405.20459
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
- Homepage : https://github.com/mqraitem/From-Fake-to-Real
- Paper : https://arxiv.org/abs/2308.04553
SINDER: Repairing the Singular Defects of DINOv2
- Homepage : https://github.com/haoqiwang/sinder
- Paper : https://arxiv.org/abs/2407.16826
Emergent Visual-Semantic Hierarchies in Image-Text Representations
- Homepage : https://tau-vailab.github.io/hierarcaps/
- Paper : https://arxiv.org/abs/2407.08521
AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
- Homepage : https://github.com/RogerQi/AlignDiff
- Paper : https://motion.cs.illinois.edu/papers/ECCV2024-Qiu-AlignDiff.pdf
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
- Homepage : https://omninocs.github.io/
- Paper : https://arxiv.org/abs/2407.08711
PointLLM: Empowering Large Language Models to Understand Point Clouds
- Homepage : https://runsenxu.com/projects/PointLLM/
- Paper : https://arxiv.org/abs/2308.16911
Bi-directional Contextual Attention for 3D Dense Captioning
- Paper : https://arxiv.org/abs/2408.06662
Watch Your Steps: Local Image and Scene Editing by Text Instructions
- Homepage : https://ashmrz.github.io/WatchYourSteps/
- Paper : https://arxiv.org/abs/2308.08947
Scene Coordinate Reconstruction
- Homepage : https://nianticlabs.github.io/acezero/
- Paper : https://arxiv.org/abs/2404.14351
HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
- Homepage : https://github.com/tpzou/HGL
- Paper : https://arxiv.org/abs/2407.12387
RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
- Homepage : https://github.com/cszyzhang/RISurConv
- Paper : https://arxiv.org/abs/2408.06110
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
- Homepage : https://github.com/l1997i/Rapid_Seg
- Paper : https://arxiv.org/abs/2407.10159
Grounding Image Matching in 3D with MASt3R
- Homepage : https://github.com/naver/mast3r
- Paper : https://arxiv.org/abs/2312.14132
Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
- Paper : https://arxiv.org/abs/2407.08729
SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
- Homepage : https://fraunhoferhhi.github.io/spvloc/
- Paper : https://arxiv.org/abs/2404.10527
Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
- Homepage : https://anttwo.github.io/frosting/
- Paper : https://arxiv.org/abs/2403.14554
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
- Homepage : https://donydchen.github.io/mvsplat/
- Paper : https://arxiv.org/abs/2403.14627
Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
- Homepage : https://github.com/GATECH-EIC/Omni-Recon
- Paper : https://arxiv.org/abs/2403.11131
RaFE: Generative Radiance Fields Restoration
- Homepage : https://zkaiwu.github.io/RaFE/
- Paper : https://arxiv.org/abs/2404.03654
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
- Homepage : https://lzhnb.github.io/project-pages/analytic-splatting/
- Paper : https://arxiv.org/abs/2403.11056
FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information
- Homepage : https://jiangwenpl.github.io/FisherRF/
- Paper : https://arxiv.org/abs/2311.17874
Adversarial Diffusion Distillation
- Homepage : https://github.com/Stability-AI/generative-models
- Paper : https://arxiv.org/abs/2311.17042
Adversarial Robustification via Text-to-Image Diffusion Models
- Homepage : https://github.com/ChoiDae1/robustify-T2I
- Paper : https://arxiv.org/abs/2407.18658
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
- Homepage : https://jingyechen.github.io/textdiffuser2/
- Paper : https://arxiv.org/abs/2311.16465
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Homepage : https://doubiiu.github.io/projects/DynamiCrafter/
- Paper : https://arxiv.org/abs/2310.12190
Accelerating Image Generation with Sub-path Linear Approximation Model
- Homepage : https://subpath-linear-approx-model.github.io/
- Paper : https://arxiv.org/abs/2404.13903
LLMGA: Multimodal Large Language Model based Generation Assistant
- Homepage : https://llmga.github.io/
- Paper : https://arxiv.org/abs/2311.16500
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
- Homepage : https://bolinlai.github.io/Lego_EgoActGen/
- Paper : https://arxiv.org/abs/2312.03849
ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
- Homepage : https://gen-l-2.github.io/
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
- Homepage : https://vision.cs.utexas.edu/projects/action2sound/
- Paper : https://arxiv.org/abs/2406.09272
Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
- Homepage : https://huggingface.co/papers/2401.05675
- Paper : https://arxiv.org/abs/2401.05675
R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
- Homepage : https://github.com/chkimmmmm/R.A.C.E
- Paper : https://arxiv.org/abs/2405.16341
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
- Homepage : https://kailinli.github.io/SemGrasp/
- Paper : https://arxiv.org/abs/2404.03590
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
- Homepage : https://me.kiui.moe/lgm/
- Paper : https://arxiv.org/abs/2402.05054
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
- Homepage : https://flashtex.github.io/
- Paper : https://arxiv.org/abs/2402.13251
Pyramid Diffusion for Fine 3D Large Scene Generation
- Homepage : https://github.com/yuhengliu02/pyramid-discrete-diffusion
- Paper : https://arxiv.org/abs/2311.12085
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
- Homepage : https://arking1995.github.io/ContextLayout/
- Paper : https://arxiv.org/abs/2407.11294
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
- Homepage : https://ggxxii.github.io/texdreamer/
- Paper : https://arxiv.org/abs/2403.12906
Controllable Human-Object Interaction Synthesis
- Homepage : https://lijiaman.github.io/projects/chois/
- Paper : https://arxiv.org/abs/2312.03913
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
- Homepage : https://zikaihuangscut.github.io/Beat-It/
- Paper : https://arxiv.org/abs/2407.07554
Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
- Homepage : https://snuvclab.github.io/coma/
- Paper : https://arxiv.org/abs/2401.12978
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
- Homepage : https://github.com/FangyunWei/SLRT/tree/main/Spoken2Sign
- Paper : https://arxiv.org/abs/2401.04730
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
- Homepage : https://guanjz20.github.io/projects/ReSyncer/
- Paper : https://arxiv.org/abs/2408.03284
Sapiens: Foundation for Human Vision Models
- Homepage : https://about.meta.com/realitylabs/codecavatars/sapiens/
- Paper : https://arxiv.org/abs/2408.12569
Arc2Face: A Foundation Model for ID-Consistent Human Faces
- Homepage : https://arc2face.github.io/
- Paper : https://arxiv.org/abs/2403.11641
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
- Homepage : https://physdreamer.github.io/
- Paper : https://arxiv.org/abs/2404.13026
Audio-Synchronized Visual Animation
- Homepage : https://lzhangbj.github.io/projects/asva/asva.html
- Paper : https://arxiv.org/abs/2403.05659
LongVLM: Efficient Long Video Understanding via Large Language Models
- Homepage : https://github.com/ziplab/LongVLM
- Paper : https://arxiv.org/abs/2404.03384
ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
- Homepage : https://vislearn.github.io/ControlNet-XS/
- Paper : https://arxiv.org/abs/2312.06573
Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
- Homepage : https://remysabathier.github.io/animalavatar.github.io/
- Paper : https://arxiv.org/abs/2403.17103
E3M: Zero-Shot Spatio-Temporal Video Grounding
- Homepage : https://github.com/baopj/E3M
- Paper : https://baopj.github.io/files/ECCV24_E3M_ZeroSTVG.pdf
Classification Matters: Improving Video Action Detection with Class-Specific Attention
- Homepage : https://jinsingsangsung.github.io/ClassificationMatters/
- Paper : https://arxiv.org/abs/2407.19698
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
- Homepage : https://daveishan.github.io/avr-webpage/
- Paper : https://www.crcv.ucf.edu/wp-content/uploads/2018/11/avr_eccv24_dave.pdf
ActionVOS: Actions as Prompts for Video Object Segmentation
- Homepage : https://github.com/ut-vision/ActionVOS
- Paper : https://arxiv.org/abs/2407.07402
DEVIAS: Learning Disentangled Video Representations of Action and Scene
- Homepage : https://github.com/KHU-VLL/DEVIAS
- Paper : https://arxiv.org/abs/2312.00826
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
- Homepage : https://showlab.github.io/MotionDirector/
- Paper : https://arxiv.org/abs/2310.08465
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
- Homepage : https://github.com/charigyang/made2order
- Paper : https://arxiv.org/abs/2404.16828
SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
- Homepage : https://sv3d.github.io/
- Paper : https://arxiv.org/abs/2403.12008
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
- Homepage : https://zzh-tech.github.io/InterpAny-Clearer/
- Paper : https://arxiv.org/abs/2311.08007
Video Editing via Factorized Diffusion Distillation
- Homepage : https://fdd-video-edit.github.io/
- Paper : https://arxiv.org/abs/2403.09334
Towards Neuro-Symbolic Video Understanding
- Homepage : https://utaustin-swarmlab.github.io/nsvs-project-page.github.io/
- Paper : https://arxiv.org/abs/2403.11021
MMBench: Is Your Multi-modal Model an All-around Player?
- Homepage : https://github.com/open-compass/MMBench
- Paper : https://arxiv.org/abs/2307.06281
BRAVE: Broadening the visual encoding of vision-language models
- Homepage : https://brave-vlms.epfl.ch/
- Paper : https://arxiv.org/abs/2404.07204
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
- Homepage : https://omniview-tuning.github.io/
- Paper : https://arxiv.org/abs/2404.12139
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- Homepage : https://github.com/pkunlp-icler/FastV
- Paper : https://arxiv.org/abs/2403.06764
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- Homepage : https://github.com/AoiDragon/HADES
- Paper : https://arxiv.org/abs/2403.09792
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
- Paper : https://arxiv.org/abs/2403.08730
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models
- Paper : https://arxiv.org/abs/2312.07408
Towards Goal-oriented Large Language Model Prompting: A Survey
Denoising Vision Transformers
- Homepage : https://jiawei-yang.github.io/DenoisingViT/
- Paper : https://arxiv.org/abs/2401.02957
Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
- Homepage : https://cs-people.bu.edu/vpetsiuk/arc/
- Paper : https://arxiv.org/abs/2404.13706