- Equations break in GitHub's markdown preview, so refer to the PDF versions instead.
- vision-basic
- u-net : U-Net(Convolutional Networks for Biomedical Image Segmentation)
- vision-transformer : VisionTransformer(AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE); a patch-embedding sketch follows this list
- swin-transformer : SwinTransformer(Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
- distillation : Distilling the Knowledge in a Neural Network
- maxvit : MaxViT: Multi-Axis Vision Transformer
- mae : Masked Autoencoders Are Scalable Vision Learners
- simmim : SimMIM: a Simple Framework for Masked Image Modeling
- revnet : The Reversible Residual Network: Backpropagation Without Storing Activations
- rev-vit : Reversible Vision Transformers
- dino: Emerging Properties in Self-Supervised Vision Transformers
- dinov2: DINOv2: Learning Robust Visual Features without Supervision
- dinov3: DINOv3
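For the `vision-transformer` entry above, a minimal sketch of the 16x16 patch embedding the paper describes: a convolution whose kernel and stride equal the patch size is the standard equivalent of splitting the image into patches and linearly projecting each one. The sizes follow ViT-Base; the variable names are illustrative.

```python
import torch
import torch.nn as nn

# ViT patch embedding: cut the image into 16x16 patches and project each
# patch to a token. Conv2d with kernel = stride = patch size does exactly
# the "flatten each patch and apply a shared linear layer" step.
patch, dim = 16, 768  # ViT-Base settings
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

x = torch.randn(1, 3, 224, 224)                  # one 224x224 RGB image
tokens = to_tokens(x).flatten(2).transpose(1, 2)
print(tokens.shape)                              # torch.Size([1, 196, 768]); (224/16)^2 = 196 tokens
```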
- generation (diffusion, flow)
- ddpm : Denoising Diffusion Probabilistic Models; a forward-process sketch follows this list
- palette : Palette: Image-to-Image Diffusion Models
- ddim : Denoising Diffusion Implicit Models
- improved-ddpm : Improved Denoising Diffusion Probabilistic Models
- adm : Diffusion Models Beat GANs on Image Synthesis
- glide : GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
- ldm : Latent Diffusion Model(High-Resolution Image Synthesis with Latent Diffusion Models; a.k.a. Stable Diffusion)
- cdm : Cascaded Diffusion Model(Cascaded Diffusion Models for High Fidelity Image Generation)
- inpaint-survey : Deep Learning-based Image and Video Inpainting: A Survey
- dit: Scalable Diffusion Models with Transformers
- qwen-image: Qwen-Image Technical Report
- flow-matching: FLOW MATCHING FOR GENERATIVE MODELING
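Pointing back to the `ddpm` entry, a minimal sketch of its closed-form forward process and simplified training loss, assuming the paper's linear beta schedule; `model` stands in for any noise-prediction network, not a specific architecture.

```python
import torch

# DDPM forward process in closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
# trained with the simplified objective || eps - eps_theta(x_t, t) ||^2.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule from the paper
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
    eps = torch.randn_like(x0)
    abar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return torch.mean((eps - model(x_t, t)) ** 2)

# dummy predictor that always outputs zeros, just to show the call shape
print(ddpm_loss(lambda x, t: torch.zeros_like(x), torch.randn(4, 3, 32, 32)))
```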
- super-resolution
- srcnn-vdsr-fsrcnn-fspcn : History of super-resolution (from the early CNN models up to the arrival of GANs)
- swinir : SwinIR(SwinIR: Image Restoration Using Swin Transformer)
- hat : HAT-L(Hybrid Attention Transformer)
- drct : DRCT(Dense Residual Connected Transformer)
- sr3 : Image Super-Resolution via Iterative Refinement
- ipg : Image Processing GNN: Breaking Rigidity in Super-Resolution
- yonos-sr : You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- hmanet : HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution
- diffusion-sr-survey : Diffusion Models, Image Super-Resolution And Everything: A Survey
- blip-diffusion: BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
- controlnet: Adding Conditional Control to Text-to-Image Diffusion Models
- prompt-to-prompt: Prompt-to-Prompt Image Editing with Cross Attention Control
- tr-misr : TR-MISR: Multiimage Super-Resolution Based on Feature Fusion With Transformers
- div2k : NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study
- lsdir : LSDIR: A Large Scale Dataset for Image Restoration
- df2k : DF2K(DIV2K + Flickr2K combined training dataset)
- ntire-challenge-on-lfsr : NTIRE 2024 Challenge on Light Field Image Super-Resolution: Methods and Results
- epit : (EPIT)Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution
- pixel-shuffle : Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network; a pixel-shuffle sketch follows this list
- datsr : Reference-based Image Super-Resolution with Deformable Attention Transformer
- ais2024challenge-survey : Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
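For the `pixel-shuffle` entry, a minimal sketch of ESPCN-style sub-pixel upsampling: the last convolution emits r^2 times the output channels, and `PixelShuffle` rearranges them into an r-times larger image. The channel counts are illustrative.

```python
import torch
import torch.nn as nn

# Sub-pixel convolution: produce 3 * r^2 channels at low resolution,
# then rearrange (B, 3*r^2, H, W) -> (B, 3, H*r, W*r).
r = 2
upsample = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)
feat = torch.randn(1, 64, 48, 48)  # low-resolution feature map
print(upsample(feat).shape)        # torch.Size([1, 3, 96, 96])
```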
- image-restoration
- aioir-survey : A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends
- ram : Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration
- deblurring
- image-deblurring-survey : Deep Image Deblurring: A Survey
- adarevd : AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
- object-detection
- faster-r-cnn: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; an IoU sketch follows this list
- yolo: You Only Look Once: Unified, Real-Time Object Detection
- yolov2: YOLO9000: Better, Faster, Stronger
- yolov3: YOLOv3: An Incremental Improvement
- yolov4: YOLOv4: Optimal Speed and Accuracy of Object Detection
- yolox: YOLOX: Exceeding YOLO Series in 2021
- detr: End-to-End Object Detection with Transformers
- ovd-survey: A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
- ovr-cnn: Open-Vocabulary Object Detection Using Captions
- vild: OPEN-VOCABULARY OBJECT DETECTION VIA VISION AND LANGUAGE KNOWLEDGE DISTILLATION
- mrvg: Multimodal Reference Visual Grounding
- nids: Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation
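Most detectors in this list lean on pairwise IoU somewhere (anchor assignment in Faster R-CNN, NMS in the YOLO line, matching costs in DETR), so here is a minimal sketch for boxes in (x1, y1, x2, y2) format; the example boxes are made up.

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), both (x1, y1, x2, y2)."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])  # intersection top-left
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)                     # zero if boxes do not overlap
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0], [5.0, 5.0, 15.0, 15.0]])
print(box_iou(boxes, boxes))  # diagonal 1.0, off-diagonal 25 / 175 ≈ 0.143
```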
- segmentation
- sam: Segment Anything
- sam2: SAM 2: Segment Anything in Images and Videos
- sam3: SEGMENT ANYTHING WITH CONCEPTS
- 3dgs
- 3dgs : 3D Gaussian Splatting for Real-Time Radiance Field Rendering; a compositing sketch follows this list
- srgs: SRGS: Super-Resolution 3D Gaussian Splatting
- gaussiansr : GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors
- supergaussian : SuperGaussian: Repurposing Video Models for 3D Super Resolution Diffusion Priors
- supergs : SuperGS: Super-Resolution 3D Gaussian Splatting via Latent Feature Field and Gradient-guided Splitting
- e-3dgs : Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting
- deblurring-3dgs : Deblurring 3D Gaussian Splatting
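For the `3dgs` entry, a minimal sketch of the per-pixel front-to-back alpha compositing the rasterizer performs, C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j), assuming the Gaussians covering the pixel are already depth-sorted; the colors and alphas are made up.

```python
import torch

def composite(colors: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Front-to-back compositing for one pixel; colors (N, 3), alphas (N,), front first."""
    # transmittance before each Gaussian: product of (1 - alpha) over everything in front
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alphas[:-1]]), dim=0)
    return (colors * (alphas * trans)[:, None]).sum(dim=0)

colors = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # red in front of green
alphas = torch.tensor([0.6, 0.5])
print(composite(colors, alphas))  # tensor([0.6000, 0.2000, 0.0000])
```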
- nerf
- nerf : NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis; a volume-rendering sketch follows this list
- nerf-sr : NeRF-SR: High Quality Neural Radiance Fields using Supersampling
- mip-nerf : Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
- crop : Cross-Guided Optimization of Radiance Fields with Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis
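And the NeRF counterpart for the `nerf` entry: the quadrature form of the volume rendering integral, with per-sample weights w_i = T_i * (1 - exp(-sigma_i * delta_i)); the densities, colors, and spacings below are made up.

```python
import torch

def render_ray(sigmas, colors, deltas):
    """Composite N samples along one ray; sigmas (N,), colors (N, 3), deltas (N,)."""
    alpha = 1 - torch.exp(-sigmas * deltas)
    # T_i: probability the ray reaches sample i without being absorbed earlier
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    weights = alpha * trans
    return (weights[:, None] * colors).sum(dim=0)

sigmas = torch.tensor([0.5, 2.0, 4.0])  # densities at three samples
deltas = torch.full((3,), 0.1)          # distance between consecutive samples
colors = torch.rand(3, 3)               # RGB at each sample
print(render_ray(sigmas, colors, deltas))
```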
- video
- adatad : End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
- iaw : Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
- vtune: On the Consistency of Video Large Language Models in Temporal Comprehension
- longvale: LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
- video-3d-llm: Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
- m-llm-based-video-frame-selection: M-LLM Based Video Frame Selection for Efficient Video Understanding
- video-comp: VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
- seq2time: Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
- video-models-are-zero-shot-learners-and-reasoners: Video models are zero-shot learners and reasoners
- vision-and-language
- clip : CLIP(Learning Transferable Visual Models From Natural Language Supervision); a contrastive-loss sketch follows this list
- lit : LiT : Zero-Shot Transfer with Locked-image text Tuning
- blip : BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- blip2 : BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- siglip : Sigmoid Loss for Language Image Pre-Training
- flamingo : Flamingo: a Visual Language Model for Few-Shot Learning
- video-llm-survey : Video Understanding with Large Language Models: A Survey (in progress)
- llava : Visual Instruction Tuning
- llava-next-video : LLaVA-NeXT: A Strong Zero-shot Video Understanding Model (blog)
- llava-next-stronger : LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild (blog)
- llava-video : VIDEO INSTRUCTION TUNING WITH SYNTHETIC DATA
- llava-next-interleave : LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
- long-vlm : LongVLM: Efficient Long Video Understanding via Large Language Models(ECCV2024)
- tcr : Text-Conditioned Resampler For Long Form Video Understanding(ECCV2024)
- qwen-vl : Qwen2.5-VL Technical Report
- vtg-llm : VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
- internvl : InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
- internvl-1_5 : How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- internvl-3 : InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
- video-xl : Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
- video-xl-pro : Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
- phi3-tech-report : Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
- phi4-mini : Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
- deepseek-vl : DeepSeek-VL: Towards Real-World Vision-Language Understanding
- deepseek-vl2 : DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
- vidi: Vidi: Large Multimodal Models for Video Understanding and Editing
- ref-l4: Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
- molmo-and-pixmo: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- llava-st: LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
- internvl-3_5: InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
- vlm-vg: Learning Visual Grounding from Generative Vision and Language Model
- streaming-vlm: STREAMINGVLM: REAL-TIME UNDERSTANDING FOR INFINITE VIDEO STREAMS
- piza: Referring Expression Comprehension for Small Objects
- paddle-ocr: PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
- deepseek-ocr: DeepSeek-OCR: Contexts Optical Compression
- idefic: Building and better understanding vision-language models: insights and future directions
- covt: Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
- mcot: Multimodal Chain-of-Thought Reasoning in Language Models
- viscot: Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
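Since the `clip` entry underpins much of this section, a minimal sketch of its symmetric contrastive loss (the paper itself gives numpy-style pseudocode for this): matched image/text pairs sit on the diagonal of the similarity matrix and cross-entropy is taken along both axes. The embedding width and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embedding pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # scaled cosine similarities
    labels = torch.arange(logits.shape[0])        # image i matches text i
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```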
- nlp
- keyword : Glossary of LLM terminology
- transformer : Transformer(Attention is all you need)
- perceiver: Perceiver: General Perception with Iterative Attention
- perceiver-io: PERCEIVER IO: A GENERAL ARCHITECTURE FOR STRUCTURED INPUTS & OUTPUTS
- lora : LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS; a LoRA sketch follows this list
- auxiliary-loss-free : Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
- deepseek-v3 : DeepSeek-V3 Technical Report
- rag: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- prompt-tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
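Finally, a minimal sketch of the `lora` entry's low-rank update, h = Wx + (alpha / r) * BAx, with B zero-initialized so training starts exactly at the frozen base model; the rank, scale, and wrapper class here are illustrative defaults, not a library API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank residual B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                    # freeze W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```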