Paper Summaries

  • Equations get mangled in GitHub's markdown preview, so refer to the PDFs instead.

Papers

  • vision-basic
    • u-net : U-Net (Convolutional Networks for Biomedical Image Segmentation)
    • vision-transformer : Vision Transformer (AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE)
    • swin-transformer : Swin Transformer (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows)
    • distillation : Distilling the Knowledge in a Neural Network
    • maxvit : MaxViT: Multi-Axis Vision Transformer
    • mae : Masked Autoencoders Are Scalable Vision Learners
    • simmim : SimMIM: a Simple Framework for Masked Image Modeling
    • revnet : The Reversible Residual Network: Backpropagation Without Storing Activations
    • rev-vit : Reversible Vision Transformers
    • dino: Emerging Properties in Self-Supervised Vision Transformers
    • dinov2: DINOv2: Learning Robust Visual Features without Supervision
    • dinov3: DINOv3
  • generation (diffusion, flow)
    • ddpm : Denoising Diffusion Probabilistic Models
    • palette : Palette: Image-to-Image Diffusion Models
    • ddim : Denoising Diffusion Implicit Models
    • improved-ddpm : Improved Denoising Diffusion Probabilistic Models
    • adm : Diffusion Models Beat GANs on Image Synthesis
    • glide : Guided Language to Image Diffusion for Generation and Editing
    • ldm : Latent Diffusion Model (Stable Diffusion)
    • cdm : Cascaded Diffusion Model
    • inpaint-survey : Deep Learning-based Image and Video Inpainting: A Survey
    • dit: Scalable Diffusion Models with Transformers
    • qwen-image: Qwen-Image Technical Report
    • flow-matching: FLOW MATCHING FOR GENERATIVE MODELING
  • super-resolution
    • srcnn-vdsr-fsrcnn-fspcn : History of super-resolution (from early CNN methods up to the arrival of GANs)
    • swinir : SwinIR (SwinIR: Image Restoration Using Swin Transformer)
    • hat : HAT-L (Hybrid Attention Transformer)
    • drct : DRCT (Dense Residual Connected Transformer)
    • sr3 : Image Super-Resolution via Iterative Refinement
    • ipg : Image Processing GNN: Breaking Rigidity in Super-Resolution
    • yonos-sr : You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
    • hmanet : HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution
    • diffusion-sr-survey : Diffusion Models, Image Super-Resolution And Everything: A Survey
    • blip-diffusion: BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
    • controlnet: Adding Conditional Control to Text-to-Image Diffusion Models
    • prompt-to-prompt: Prompt-to-Prompt Image Editing with Cross Attention Control
    • tr-misr : TR-MISR: Multiimage Super-Resolution Based on Feature Fusion With Transformers
    • div2k : NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study
    • lsdir : LSDIR: A Large Scale Dataset for Image Restoration
    • df2k : DF2K
    • ntire-challenge-on-lfsr : NTIRE 2024 Challenge on Light Field Image Super-Resolution: Methods and Results
    • epit : (EPIT) Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution
    • pixel-shuffle : Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
    • datsr : Reference-based Image Super-Resolution with Deformable Attention Transformer
    • ais2024challenge-survey : Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
  • image-restoration
    • aioir-survey : A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends
    • ram : Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration
  • deblurring
    • image-deblurring-survey : Deep Image Deblurring: A Survey
    • adarevd : AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring
  • object-detection
    • faster-r-cnn: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
    • yolo: You Only Look Once: Unified, Real-Time Object Detection
    • yolov2: YOLO9000: Better, Faster, Stronger
    • yolov3: YOLOv3: An Incremental Improvement
    • yolov4: YOLOv4: Optimal Speed and Accuracy of Object Detection
    • yolox: YOLOX: Exceeding YOLO Series in 2021
    • detr: End-to-End Object Detection with Transformers
    • ovd-survey: A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
    • ovr-cnn: Open-Vocabulary Object Detection Using Captions
    • vild: OPEN-VOCABULARY OBJECT DETECTION VIA VISION AND LANGUAGE KNOWLEDGE DISTILLATION
    • mrvg: Multimodal Reference Visual Grounding
    • nids: Adapting Pre-Trained Vision Models for Novel Instance Detection and Segmentation
  • segmentation
    • sam: Segment Anything
    • sam2: SAM 2: Segment Anything in Images and Videos
    • sam3: SEGMENT ANYTHING WITH CONCEPTS
  • 3dgs
    • 3dgs : 3D Gaussian Splatting for Real-Time Radiance Field Rendering
    • srgs: SRGS: Super-Resolution 3D Gaussian Splatting
    • gaussiansr : GaussianSR: 3D Gaussian Super-Resolution with 2D Diffusion Priors
    • supergaussian : SuperGaussian: Repurposing Video Models for 3D Super Resolution Diffusion Priors
    • supergs : SuperGS: Super-Resolution 3D Gaussian Splatting via Latent Feature Field and Gradient-guided Splitting
    • e-3dgs : Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting
    • deblurring-3dgs : Deblurring 3D Gaussian Splatting
  • nerf
    • nerf : NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
    • nerf-sr : NeRF-SR: High Quality Neural Radiance Fields using Supersampling
    • mip-nerf : Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields
    • crop : Cross-Guided Optimization of Radiance Fields with Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis
  • video
    • adatad : End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
    • iaw : Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
    • vtune: On the Consistency of Video Large Language Models in Temporal Comprehension
    • longvale: LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
    • video-3d-llm: Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
    • m-llm-based-video-frame-selection: M-LLM Based Video Frame Selection for Efficient Video Understanding
    • video-comp: VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
    • seq2time: Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
    • video-models-are-zero-shot-learners-and-reasoners: Video models are zero-shot learners and reasoners
  • vision-and-language
    • clip : CLIP (Learning Transferable Visual Models From Natural Language Supervision)
    • lit : LiT: Zero-Shot Transfer with Locked-image text Tuning
    • blip : BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
    • blip2 : BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
    • siglip : Sigmoid Loss for Language Image Pre-Training
    • flamingo : Flamingo: a Visual Language Model for Few-Shot Learning
    • video-llm-survey : Video Understanding with Large Language Models: A Survey (in progress)
    • llava : Visual Instruction Tuning
    • llava-next-video : blog
    • llava-next-stronger : blog
    • llava-video : VIDEO INSTRUCTION TUNING WITH SYNTHETIC DATA
    • llava-next-interleave : LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
    • long-vlm : LongVLM: Efficient Long Video Understanding via Large Language Models(ECCV2024)
    • tcr : Text-Conditioned Resampler For Long Form Video Understanding(ECCV2024)
    • qwen-vl : Qwen2.5-VL Technical Report
    • vtg-llm : VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
    • internvl : InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
    • internvl-1_5 : How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
    • internvl-3 : InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
    • video-xl : Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
    • video-xl-pro : Reconstructive Token Compression for Extremely Long Video Understanding
    • phi3-tech-report : Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
    • phi4-mini : Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
    • deepseek-vl : DeepSeek-VL: Towards Real-World Vision-Language Understanding
    • deepseek-vl2 : DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
    • vidi: Vidi: Large Multimodal Models for Video Understanding and Editing
    • ref-l4: Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
    • molmo-and-pixmo: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
    • llava-st: LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
    • internvl-3_5: InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
    • vlm-vg: Learning Visual Grounding from Generative Vision and Language Model
    • streaming-vlm: STREAMINGVLM: REAL-TIME UNDERSTANDING FOR INFINITE VIDEO STREAMS
    • piza: Referring Expression Comprehension for Small Objects
    • paddle-ocr: PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
    • deepseek-ocr: DeepSeek-OCR: Contexts Optical Compression
    • idefic: Building and better understanding vision-language models: insights and future directions
    • covt: Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
    • mcot: Multimodal Chain-of-Thought Reasoning in Language Models
    • viscot: Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
  • nlp
    • keyword : Glossary of LLM terms
    • transformer : Transformer (Attention Is All You Need)
    • perceiver: Perceiver: General Perception with Iterative Attention
    • perceiver-io: PERCEIVER IO: A GENERAL ARCHITECTURE FOR STRUCTURED INPUTS & OUTPUTS
    • lora : LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
    • auxiliary-loss-free : Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
    • deepseek-v3 : DeepSeek-V3 Technical Report
    • rag: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    • prompt-tuning: The Power of Scale for Parameter-Efficient Prompt Tuning
