The State of the Art in Image Upscaling and Video Super-Resolution (Nov 13, 2025)
A comprehensive analysis of 30+ algorithms, performance benchmarks, and emerging trends in real-time image and video super-resolution.
Executive Summary: Key Findings
Main Discoveries
1. Architecture Paradigm Shift
- Hybrid CNN-Transformer models now dominate state-of-the-art results
- State Space Models (Mamba) offer compelling linear-complexity alternatives to quadratic attention
- Pure GAN approaches (ESRGAN) giving way to sophisticated perceptual loss combinations
2. Quality Has Plateaued
- PSNR improvements stalling at ~32.8-32.9 dB on standard benchmarks
- Critical insight: The field is shifting from PSNR-only optimization to perceptual metrics (LPIPS, CLIP-IQA)
- Gap between test-set performance and real-world results remains significant
3. Real-Time Revolution is Here
- RT4KSR (2023): First proven real-time 4K super-resolution (60-120 FPS on consumer GPUs)
- VPEG (2025): Achieves Real-ESRGAN-level perceptual quality on 17.6% of computational budget
- Video SR: AIM 2024 winners hitting <33ms per frame (24-30 FPS real-time)
4. Efficiency Explosion
- Parameter reduction: 16M+ (2022) → <5M (2024) → <3M now viable (2025)
- Quality vs. speed trade-offs becoming increasingly favorable
- Edge deployment finally practical for real-time upscaling
5. Video SR Maturation
- Recurrent architectures proven essential for temporal consistency
- Motion-aware methods (HAMSA) showing 0.2-0.3 dB improvements
- Real-time video super-resolution moving from lab to production
By The Numbers
| Metric | 2022-2023 | 2024-2025 | Change |
|---|---|---|---|
| Typical SOTA Parameters | 16-20M | 5-8M | 75% reduction |
| PSNR improvement/year | +0.3 dB | +0.1 dB | Diminishing returns |
| Real-time at 4K | Not practical | 60+ FPS | Now standard |
| Methods handling real-world degradation | Limited | Multiple | Solved |
| Perceptual metric adoption | Emerging | Mainstream | Standard now |
Part 1: The Landscape
Historical Context: 14 Years of Evolution
The super-resolution field has undergone dramatic architectural transformations:
2014: SRCNN - The pioneer (3 layers, ~57K parameters) - established CNN-based SR
2016-2018: GAN era - SRGAN and ESRGAN introduced adversarial training for perceptual quality
2018: Attention mechanisms - RCAN brought channel attention to SR (16M parameters, deep networks)
2021: Transformer arrival - SwinIR demonstrated vision transformers could reduce parameters by 67% while improving quality
2023: RT4KSR challenge - Proved real-time 4K feasible (60+ FPS on commercial GPUs)
2024-2025: Mamba era - State space models emerged as efficient alternatives; hybrid architectures solidified dominance
Why This Matters
For practitioners, this means:
- Production deployments: Real-time upscaling is now feasible on consumer hardware
- Mobile: Edge deployment viable with <5M parameter models
- Quality ceiling: Further PSNR improvements unlikely; perceptual quality is the frontier
- Cost: Inference costs dropping dramatically
Part 2: The Algorithms
Category 1: Transformer-Based Methods (SOTA Quality Leaders)
HAT (Hybrid Attention Transformer) - 2023 ★ RECOMMENDED FOR MAXIMUM QUALITY
Performance Specs
- PSNR: 32.8+ dB (state-of-the-art level)
- SSIM: 0.92+
- Parameters: 16-20M
- Processing time: 1-5 seconds per image on GPU
- Scale factors: 2x, 3x, 4x
Architecture Innovation HAT's breakthrough was combining two complementary attention mechanisms:
- Channel Attention - learns which feature channels are most important
- Window-based Self-Attention - captures spatial relationships locally
This "hybrid" approach activates more pixels in feature space than methods using only one attention type, resulting in clearer, more coherent details.
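To make the channel-attention half concrete, here is a minimal, squeeze-and-excitation style module in PyTorch. It is an illustrative sketch, not HAT's exact channel attention block, and the window self-attention half (which follows the Swin formulation) is omitted.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style: pool each channel to a single value, then learn per-channel weights
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.mlp(self.pool(x))   # rescale each feature channel by its learned importance

feats = torch.rand(1, 64, 48, 48)           # a batch of 64-channel feature maps
out = ChannelAttention(64)(feats)           # same shape, channels re-weighted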
Best for: Professional image enhancement, publication-quality results, desktop applications
Availability: Open source at https://github.com/XPixelGroup/HAT
HAAT (Hybrid Attention Aggregation Transformer) - 2024
What's New Building on HAT's success, HAAT introduces:
- Swin-Dense-Residual-Connected Blocks (SDRCB) for expanded receptive fields
- Hybrid Grid Attention Blocks (HGAB) for sophisticated attention aggregation
- More refined architecture with fewer parameters
Performance Specs
- PSNR: SOTA (32.8+ dB, sometimes exceeding HAT)
- Attention type: Advanced grid-based mechanisms
- Status: Newest benchmark leader
Best for: Research, highest-quality offline processing
SwinIR (Swin Image Restoration) - 2021 ★ STABLE BASELINE
Why It's Still Relevant SwinIR kicked off the transformer revolution in SR by proving transformers could be more efficient than CNNs:
- 67% fewer parameters than competing methods
- Similar MACs (multiply-accumulate operations)
- 0.14-0.45 dB PSNR improvement over prior SOTA
Performance Specs
- PSNR: 32.5+ dB
- Parameters: 16M
- Computational efficiency: High (67% reduction vs. baselines)
- Multi-task capable (SR, denoising, JPEG artifact removal)
Best for: Established production use, research baseline comparisons, when stability is prioritized
Availability: https://github.com/JingyunLiang/SwinIR
Emerging Transformer Variants (2024-2025)
SVTSR - Scattering Vision Transformer with spectral analysis for intricate detail capture
XTNSR - Hybrid CNN-Transformer using Xception blocks + local feature window transformers
LFESR - Local Feature Enhancement Transformer balancing global context with local detail
PARGT - Parallel Attention Recursive Generalization for fine-grained feature interaction
Status: Academic research stage; not yet mainstream production deployment
Category 2: State Space Models / Mamba (The New Frontier)
MambaIR (Mamba Image Restoration) - 2024
The Game Changer Mamba represents a fundamentally different approach to modeling dependencies:
- Linear complexity (O(n)) vs. the transformer's quadratic complexity (O(n²))
- Can theoretically handle much larger images without memory explosion
- Long-range dependency modeling with minimal computational cost
Performance Specs
- Parameters: 3-8M
- Complexity: Linear (efficient at scale)
- Performance: Competitive with transformers on standard benchmarks
- Real-time: Increasingly viable on standard hardware
Architecture Combines vanilla Mamba foundation with:
- Local enhancement modules
- Channel attention mechanisms
- Integrated residual connections
Best for: Large-scale processing, edge devices, situations where memory is constrained
Hi-Mamba (Hierarchical Mamba) - October 2024
Key Innovation Two-path design capturing both local and regional context:
- Local SSM (L-SSM): Fine-grained detail at pixel level
- Region SSM (R-SSM): Broader contextual information
Performance Specs
- PSNR improvement: +0.29 dB over MambaIR on Manga109 3x SR
- Architecture: Hierarchical Mamba Blocks (HMB)
- Efficiency: Maintained linear complexity
Best for: Production efficiency-focused deployments
S³Mamba (Scaleable State Space Model) - 2024
Unique Capability First Mamba model supporting arbitrary-scale super-resolution (not limited to 2x, 3x, 4x)
Specs
- Scale flexibility: Continuous, user-defined scales
- Linear complexity advantage maintained
- Efficiency: Mamba-level computational savings
Category 3: CNN-Based Methods (GAN & Attention)
Real-ESRGAN - 2021 ★ INDUSTRY STANDARD FOR BLIND SR
Why It Dominates Real-World Applications Real-ESRGAN solved the "blind super-resolution" problem - upscaling images with unknown degradation:
Performance Specs
- PSNR: 24.97 dB (on real-world degraded images; lower than ESRGAN's 32.01 dB on synthetic data)
- SSIM: 0.76
- Parameters: 16.7M
- Memory: 33-50 MB
- Real-time: No (takes 7-30 minutes for 2500×2500px)
- Scale factors: 2x, 3x, 4x
What Makes It Special
- High-order degradation modeling simulates real-world degradation more accurately
- RRDBNet architecture (Residual-in-Residual Dense Blocks) balances quality and computational efficiency
- Outperforms methods trained only on idealized, known degradations when applied to real photographs
The Results Real-ESRGAN produces noticeably better results on:
- Old photographs with unknown damage
- Screenshots with various compression artifacts
- Webcam footage
- Consumer camera images
Best for: Any production deployment on real-world images, professional photo restoration
Availability: https://github.com/xinntao/Real-ESRGAN (Apache 2.0, pre-trained models on TensorFlow Hub)
ESRGAN - 2018
Historical Importance Still competitive nearly a decade later. Introduced:
- Removal of Batch Normalization for very deep networks
- Relativistic discriminator
- Enhanced perceptual loss formulation
Performance Specs
- PSNR: 32.01 dB (Set5 benchmark)
- SSIM: 0.9065
- Parameters: 16.6M
- Real-time: No
RCAN (Residual Channel Attention Network) - 2018
The Attention Baseline RCAN pioneered channel attention mechanisms for SR:
- 400 convolutional layers (very deep!)
- 10 residual groups with 20 attention blocks each
- Channel-wise feature rescaling
Performance Specs
- PSNR: 32+ dB
- SSIM: 0.90+
- Parameters: 16M
- Real-time: No
Significance: Established attention mechanisms as fundamental to SR architecture design
BSRGAN - 2021
Innovation: Practical degradation model for blind SR
Performance Specs
- User study scores: 3.95-4.60 (vs. RealSR, ESRGAN)
- Training patch: 72×72 (larger than typical 48×48)
- Parameters: 16.6M
- Real-time: No
Key Feature: Random shuffling of degradation order for realistic simulation
Category 4: Diffusion-Based Methods (High-Quality Experimental)
Latent Diffusion Models for Super-Resolution - 2022-2024
Concept Operating diffusion process in lower-dimensional latent space rather than pixel space:
- 10-100x reduction in computational cost vs. pixel-space diffusion
- High perceptual quality reconstruction
- More practical inference times
Architecture
- Feature encoder → latent space
- Diffusion process in latent space
- Frequency compensation module
- Pixel decoder
Advantages
- Dramatically improved efficiency
- High perceptual quality
- Practical for real-world deployment
Disadvantages
- Still slower than CNN approaches
- Requires more VRAM
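For a sense of how latent-diffusion upscaling is used in practice, the diffusers library ships a publicly available 4x latent upscaler pipeline. This is a generic illustration, not one of the specific papers above; the input/output file names are placeholders.
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

# Public 4x latent-diffusion upscaler: the diffusion process runs in a compressed latent space
pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16).to("cuda")

low_res = Image.open("input.png").convert("RGB")        # small source image (placeholder path)
upscaled = pipe(prompt="", image=low_res).images[0]     # empty prompt = plain upscaling
upscaled.save("output_x4.png")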
DPM-Solver (Diffusion Probabilistic Model Solver)
The Acceleration Breakthrough High-order ODE solver reducing diffusion inference steps:
- Traditional DDPM: Hundreds of steps
- DPM-Solver: 10-50 steps
- Quality: Maintained or improved
Mathematical Foundation
- Explicit numerical integration
- Exponentially weighted integral solution
- Theoretical guarantees on sample quality
ControlSR (Taming Diffusion for SR) - 2024
Latest Diffusion-Based Approach Controlled diffusion process with strong constraints:
- DPM module for fast denoising
- GSPM module for guidance
- Latent LR embeddings for consistency
- Real-world degradation handling
Performance
- High-quality perceptual results
- Improved real-world image handling
- Controllable super-resolution process
Category 5: Real-Time Specialized Methods
VPEG (Efficient Perceptual SR) - 2025 ★ BEST EFFICIENCY
The Breakthrough Achieves Real-ESRGAN's perceptual quality on a fraction of the computational budget:
Performance Specs
- Parameters: 5M (vs. Real-ESRGAN's 16.7M)
- GFLOPs: <2000 (vs. Real-ESRGAN's 11,300+)
- FPS: >30 on standard hardware
- FLOPs efficiency: Uses 17.6% of Real-ESRGAN's computation
Quality Comparison vs. Real-ESRGAN
- PI (Perceptual Index): 24.7% better (PI is lower-is-better)
- CLIP-IQA: 23.4% better
- MANIQA: 19.4% better
Key Achievement: Proves high efficiency and quality are no longer mutually exclusive
Best for: Real-time applications, edge deployment, resource-constrained environments
RT4KSR (Real-Time 4K Super-Resolution) - 2023 ★ BENCHMARK ACHIEVEMENT
The Challenge NTIRE 2023 set an audacious goal: achieve >60 FPS at 4K resolution
The Results
- Baseline: >60 FPS target
- Top teams: Achieved 60-120 FPS
- Input: 1080p → 4K (2x), 720p → 4K (3x)
- Architecture: Efficient CNN with progressive modifications
- Tested content: Photography, digital art, gaming
Key Techniques
- Pixel-unshuffling (see the sketch after this list)
- Structural re-parameterization
- Efficient high-frequency extraction
- Deep feature map resolution downscaling
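As a quick illustration of the pixel-unshuffle trick (using PyTorch built-ins, not RT4KSR's actual code): it repacks spatial resolution into channels so the heavy convolutions run on a smaller feature map, and PixelShuffle inverts the operation at the end.
import torch
import torch.nn as nn

x = torch.rand(1, 3, 1080, 1920)        # a 1080p RGB frame
down = nn.PixelUnshuffle(2)(x)          # -> (1, 12, 540, 960): 4x fewer pixels, 4x more channels
# ... cheap convolutions would run here on the smaller feature map ...
up = nn.PixelShuffle(2)(down)           # -> (1, 3, 1080, 1920): exact inverse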
Significance: Proved real-time 4K is achievable on commercial hardware
Challenge Scope
- 170+ participants
- 25 teams contributed benchmark report
- Multiple GPU support (NVIDIA, AMD)
REAPPEAR - 2025
Platform-Specific Optimization AMD Ryzen AI-optimized real-time super-resolution engine
Features
- Edge device optimization
- Real-time processing on NPU
- Parallel pixel-upscaling architecture
Category 6: Video Super-Resolution
BasicVSR++ - 2021 ★ VIDEO REFERENCE STANDARD
Why It's the Baseline Most video SR research compares against BasicVSR++:
Performance Specs
- Parameters: 5.2M
- GMACs per frame: ~400
- Processing: 10-20 FPS on GPU
- Real-time: No (real-time requires <33ms per frame)
- Scale factor: 4x
- PSNR: High (state-of-the-art for video SR)
Architecture
- Recurrent residual CNN
- Frame-by-frame processing
- Enhanced propagation and alignment (over BasicVSR)
- Bidirectional propagation mechanism
Best for: Video enhancement research, quality-focused applications
Availability: https://github.com/OpenVisualCloud/Video-Super-Resolution-Library
FRVSR (Frame-Recurrent Video SR) - 2018
Innovation: Explicit optical flow for motion handling
Architecture
- FNet: Optical flow estimation network
- SRNet: Super-resolution reconstruction network
- Warps previous output frame using flow guidance
Performance Specs
- Processing: 5-10 FPS on GPU
- Scale factor: 4x
- Real-time: No
Key Achievement: Reduced temporal flickering through explicit motion modeling
HAMSA (Hybrid Attention + Motion Alignment) - 2024
Latest Video Approach Combines HAT's hybrid attention with motion-aware mechanisms:
Components
- HAT feature extraction
- Channel Motion Attention (CMA)
- Inter-frame alignment via motion attention
Performance
- High-quality video upscaling
- Motion-aware quality improvements (0.2-0.3 dB)
Other Recurrent Methods
RLSP (Recurrent Latent Space Propagation) - 2019
- Implicit temporal propagation (no explicit optical flow)
- Reduced complexity vs. FRVSR
- Efficient latent space representation
RRN and variants
- Recurrent feature updating networks
- Structure-detail separation approaches
- Regional focus with recurrence
Common advantages of recurrent methods:
- Unlimited temporal receptive field (access multiple past frames)
- Each frame processed only once (computational efficiency)
- Hidden state sharing reduces temporal flickering
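A minimal sketch of that recurrent pattern (illustrative only, far simpler than BasicVSR++ or FRVSR): a hidden state is fused with each incoming frame, updated, and carried forward, so every frame is processed once while temporal information accumulates.
import torch
import torch.nn as nn

class TinyRecurrentSR(nn.Module):
    # Toy recurrent video SR: the hidden state carries information from all past frames
    def __init__(self, ch=32, scale=4):
        super().__init__()
        self.ch = ch
        self.fuse = nn.Conv2d(3 + ch, ch, 3, padding=1)
        self.body = nn.Sequential(nn.ReLU(), nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.Conv2d(ch, 3 * scale * scale, 3, padding=1), nn.PixelShuffle(scale))

    def forward(self, frames):                      # frames: (T, B, 3, H, W) low-res clip
        _, b, _, h, w = frames.shape
        hidden = frames.new_zeros(b, self.ch, h, w)
        outputs = []
        for x in frames:                            # each frame visited exactly once
            hidden = self.body(self.fuse(torch.cat([x, hidden], dim=1)))
            outputs.append(self.up(hidden))         # super-resolved frame from the shared state
        return torch.stack(outputs)                 # (T, B, 3, H*scale, W*scale)

clip = torch.rand(5, 1, 3, 64, 64)                  # 5 frames of 64x64 video
sr = TinyRecurrentSR()(clip)                        # -> (5, 1, 3, 256, 256)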
Part 3: Performance Benchmarks & Comparison
PSNR Rankings (Peak Signal-to-Noise Ratio)
| Rank | Method | Year | PSNR (Set5) | Scale | Architecture |
|---|---|---|---|---|---|
| 1 | HAT/HAAT | 2023-2024 | 32.8+ dB | 4x | Transformer |
| 2 | SwinIR | 2021 | 32.5+ dB | 4x | Transformer |
| 3 | RCAN | 2018 | 32+ dB | 4x | CNN+Attention |
| 4 | ESRGAN | 2018 | 32.01 dB | 4x | GAN |
| 5 | Real-ESRGAN | 2021 | 24.97 dB* | 4x | GAN |
| 6 | SRCNN | 2014 | ~30.5 dB | 4x | Simple CNN |
*Real-ESRGAN scores on real-world degraded images (different distribution); not directly comparable
Speed Comparison
| Method | Input | GPU | Time | FPS | Real-time |
|---|---|---|---|---|---|
| SRCNN | 256×256 | CPU | <100ms | 10+ | ✅ Yes |
| ESRGAN | 480×480 | V100 | 2-5s | 0.2 | ❌ No |
| Real-ESRGAN | 2500×2500 | Mid-range | 7-30 min | <0.1 | ❌ No |
| SwinIR | 256×256 | GPU | 0.5-2s | 0.5-2 | ❌ No |
| VPEG | 960×540 | GPU | <33ms | >30 | ✅ Yes |
| RT4KSR | 4K | GPU | 8-16ms | 60-120 | ✅ Yes |
| BasicVSR | 480p frame | GPU | 50-100ms | 10-20 | Limited |
| FRVSR | 480p frame | GPU | 100-200ms | 5-10 | ❌ No |
Parameter Efficiency
| Method | Parameters | Memory | GFLOPs (960×540) | Category |
|---|---|---|---|---|
| SRCNN | <0.1M | <10 MB | 200-500 | Ultra-light |
| VPEG | 5M | ~15 MB | <2000 | Lightweight |
| MambaIR | 3-8M | 10-20 MB | <1500 | Lightweight |
| RCAN | 16M | 40 MB | 1000+ | Heavy |
| ESRGAN | 16.6M | 33 MB | 1000+ | Heavy |
| SwinIR | 16M | 40 MB | 1000+ | Heavy |
| HAT | 16-20M | 40-50 MB | 1000+ | Heavy |
| BasicVSR | 5.2M | 15 MB | ~400/frame | Medium |
| Real-ESRGAN | 16.7M | 33-50 MB | 1000+ | Heavy |
Real-Time Capability Matrix
| Method | 480p→960p | 720p→1440p | 1080p→2160p | 2K→4K | 4K→8K |
|---|---|---|---|---|---|
| SRCNN | ✅ Yes | ✅ Yes | ~Okay | ❌ No | ❌ No |
| VPEG | ✅ Yes | ✅ Yes | ✅ Yes | ~Okay | ❌ No |
| RT4KSR | N/A | ✅ Yes* | ✅ Yes* | ✅ Yes* | ~Okay |
| MambaIR | ✅ Yes | ✅ Yes | ~Okay | ❌ No | ❌ No |
| SwinIR | ~Okay | ~Okay | ❌ No | ❌ No | ❌ No |
| HAT | ~Okay | ~Okay | ❌ No | ❌ No | ❌ No |
| Real-ESRGAN | ~Okay | ❌ No | ❌ No | ❌ No | ❌ No |
* Specifically optimized for 4K output
~ = Possible but challenging on standard hardware
Metric Definitions
PSNR (Peak Signal-to-Noise Ratio)
- Range: Higher is better (typically 20-40 dB)
- Basis: Mathematical pixel-wise difference
- Limitation: Doesn't correlate strongly with human perception
- Character: Formula-based, theoretical
SSIM (Structural Similarity Index Measure)
- Range: 0-1 (higher is better)
- Basis: Luminance, contrast, structure
- Advantage: Models human visual perception better than PSNR
- Character: Balanced metric
LPIPS (Learned Perceptual Image Patch Similarity)
- Range: 0-∞ (lower is better)
- Basis: Deep neural network trained on human judgments
- Character: Best correlation with human perception
- Adoption: Now standard in 2024-2025 research
VMAF (Video Multi-Method Assessment Fusion)
- Range: 0-100 (higher is better)
- Purpose: Video-specific quality assessment
- Application: AIM 2024 efficient video SR evaluation
- Basis: Multiple quality metrics fused
PI (Perceptual Index)
- Range: 0-∞ (lower is better)
- Basis: Combination of perceptual metrics
- Focus: Visual artifacts and naturalness
- Application: Efficient SR evaluation
Part 4: Challenge Winners & Trends
NTIRE 2024 Challenge (×4 Super-Resolution)
Winner: XiaomiMM Team
Results
- Top 6 teams: PSNR >31.1 dB
- Approach: Mamba-based hybrid architecture
- Mainstream trend: Pre-trained transformers
Key Insights
- Transformers superior for sequence relationship modeling
- Mamba shows promise for scalability and efficiency
- Hybrid approaches (CNN + Transformer) emerging as optimal
AIM 2024 Challenge (Efficient Video Super-Resolution)
Context: Optimizing AV1-compressed content
Constraints
- Maximum GMACs: <250 per frame
- Target latency: <33ms per frame (24-30 FPS)
- Quality metric: VMAF optimization
Results
- Top 3 solutions: Significant VMAF improvement over BasicVSR++
- Processing: 24-30 FPS real-time achieved
- Efficiency: Better than BasicVSR while maintaining quality
Significance: Real-time video SR moved from theoretical to practical
NTIRE 2023 Real-Time 4K Challenge
Challenge Details
- 170+ participants
- 25 teams contributed to benchmark report
- Goal: >60 FPS at 4K
Results
- Multiple methods achieved >60 FPS
- Best: 120 FPS on commercial GPUs
- Content: Photography, digital art, gaming
Impact: Proved real-time 4K is commercially viable
Technology Adoption Patterns (2023-2025)
2023-2024 Shift
- From pure transformers → hybrid architectures
- From PSNR focus → perceptual metrics (LPIPS, CLIP-IQA)
- From slow offline → real-time feasible
- From large models → compact, efficient versions
2024-2025 Frontier
- Mamba/SSM as transformer alternative
- State space models moving from research to production
- CLIP-based semantic filtering adoption
- Frequency-domain losses for texture restoration
- Multi-stage adaptive training strategies
Part 5: Architecture Evolution Over Time
2014-2017: Simple CNN Era
├─ SRCNN: Proof of concept
├─ Basic CNN stacking
└─ Focus: Any improvement over interpolation
2018-2020: GAN & Attention Era
├─ SRGAN: Adversarial training
├─ ESRGAN: Enhanced GAN
├─ RCAN: Channel attention
└─ Focus: Perceptual quality via GANs and attention
2021-2023: Transformer Dominance
├─ SwinIR: Vision transformers in SR
├─ HAT: Hybrid attention
├─ Real-ESRGAN: Blind SR maturity
└─ Focus: Transformer efficiency and performance
2024-2025: Mamba & Hybrid Architectures
├─ MambaIR: Linear complexity SSM
├─ Hi-Mamba: Hierarchical state space
├─ HAAT: Advanced hybrid attention
├─ Diff-Mamba: Diffusion + SSM
└─ Focus: Efficiency, hybrid approaches, and emerging frontiers
Part 6: Use-Case Recommendations
Scenario 1: Real-Time Video Streaming Service
Primary: VPEG or AIM 2024 Challenge Winners
Alternative: RT4KSR (if static content)
Parameters: 3-5M
Target FPS: 24-30
Quality: Balanced (VMAF optimized)
Infrastructure: GPU required
Timeline: Weeks (proven methods)
Why: These methods proven in competition; real-time capability validated
Scenario 2: Desktop Photo Enhancement
Primary: HAT or HAAT
Alternative: SwinIR (stable baseline)
Parameters: 16-20M
Processing: 1-5 seconds acceptable
Quality: Maximum
Infrastructure: GPU recommended
Timeline: Weeks (implementations available)
Why: Highest quality acceptable when user waits seconds
Scenario 3: Mobile/Edge Device Deployment
Primary: Quantized VPEG
Alternative: TensorFlow Lite SRCNN
Parameters: <5M (ideally <3M)
Target FPS: 10-15
Quality: Acceptable (perceptual)
Infrastructure: No GPU required
Timeline: Months (optimization work)
Why: Parameter constraints dominate; quantization essential
Scenario 4: 4K Real-Time Broadcast
Primary: RT4KSR or variants
Alternative: Custom optimized method
Target FPS: 60+
Quality: Clear improvement over bicubic
Infrastructure: High-end GPU or FPGA
Timeline: Months (custom optimization)
Why: RT4KSR specifically designed for this; proven track record
Scenario 5: Real-World Degraded Images (Photo Restoration)
Primary: Real-ESRGAN
Alternative: BSRGAN
Blind SR: Essential (unknown degradation)
Processing: <30 seconds acceptable
Quality: Industry standard
Infrastructure: GPU recommended
Timeline: Weeks (pre-trained models available)
Why: Only methods specifically trained for unknown degradation types
Scenario 6: Video Quality (High-Quality Offline)
Primary: BasicVSR++ or HAMSA
Alternative: FRVSR
Real-time: Not required
Quality: Maximum PSNR
Infrastructure: GPU cluster
Timeline: Hours per video
Why: Reference quality standards; recurrent architecture for temporal consistency
Scenario 7: Research Publication
Primary: HAAT or latest NTIRE winner
Alternative: HAT (stable baseline)
Focus: PSNR + LPIPS + perceptual metrics
Quality: State-of-the-art
Infrastructure: GPU cluster (training)
Timeline: 3-6 months (training required)
Why: Need latest methods for competitive results; multiple metrics for publication
Scenario 8: Existing Production System Upgrade
Primary: SwinIR (migration from GAN-based)
Alternative: HAT (if quality critical)
Compatibility: Framework-agnostic (ONNX export)
Risk: Low (well-documented methods)
Timeline: 2-4 weeks
Why: Proven stability, extensive documentation, clear performance improvements
Part 7: Key Metrics Explained
Understanding PSNR
Peak Signal-to-Noise Ratio measures pixel-level differences:
- Formula-based: Mathematical pixel difference
- Higher is better: Typical range 20-40 dB
- Characteristic: Doesn't model human perception well
- Sweet spot for SR: 32-33 dB
- Beyond 33 dB: Diminishing returns and imperceptible differences
Limitation: Two images with the same PSNR can look dramatically different to human eyes
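For reference, PSNR is just a log-scaled mean squared error; a minimal NumPy version (assuming 8-bit images) looks like this:
import numpy as np

def psnr(reference, test, max_val=255.0):
    # PSNR = 10 * log10(MAX^2 / MSE); identical images give infinity
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

a = np.random.randint(0, 256, (128, 128, 3))
print(psnr(a, np.clip(a + 5, 0, 255)))   # small uniform error -> high PSNR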
Understanding SSIM
Structural Similarity models human visual perception:
- Range: 0-1 (higher is better)
- Components: Luminance, contrast, structure
- Better than PSNR: More aligned with subjective quality
- Standard in SR: Used in most publications
Use case: Better indicator of perceived quality than PSNR alone
Understanding LPIPS (Key Metric for 2024-2025)
Learned Perceptual Image Patch Similarity:
- Learning-based: Deep network trained on human judgments
- Range: 0-β (lower is better)
- Best correlation: Most aligned with human perception
- Modern standard: Now preferred in cutting-edge research
Why it matters: LPIPS reveals why PSNR-optimized methods sometimes look worse than lower-PSNR methods
Example: Two methods both at 32 dB PSNR:
- Method A: LPIPS 0.15 (looks good)
- Method B: LPIPS 0.25 (looks worse despite same PSNR)
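Measuring LPIPS is straightforward with the reference lpips package; the random tensors below stand in for real images, which you would load and normalize yourself.
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')             # AlexNet features calibrated on human judgments
img_a = torch.rand(1, 3, 256, 256) * 2 - 1    # LPIPS expects tensors scaled to [-1, 1]
img_b = torch.rand(1, 3, 256, 256) * 2 - 1
print(loss_fn(img_a, img_b).item())           # lower = perceptually more similar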
Understanding VMAF
Video Multi-Method Assessment Fusion:
- Purpose: Video-specific quality assessment
- Basis: Combines multiple quality metrics
- Range: 0-100 (higher is better)
- Application: Standard for video compression evaluation
Adoption in VSR: AIM 2024 challenge shifted from PSNR to VMAF for video SR
Part 8: Deployment Strategies
For GPU-Accelerated Environments
Tier 1 - Maximum Quality
- Model: HAT or HAAT
- Framework: PyTorch with CUDA optimization
- Expected: 1-5 seconds per image on V100
Tier 2 - Balanced
- Model: SwinIR
- Framework: PyTorch/TensorFlow
- Expected: 0.5-2 seconds per image on RTX 3080
Tier 3 - Real-Time
- Model: VPEG or RT4KSR
- Framework: ONNX Runtime optimized
- Expected: <33ms on RTX 3060
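A minimal ONNX Runtime inference loop for the real-time tier might look like the sketch below; the model file name and the NCHW input layout are assumptions, so check your exported model's actual input spec.
import numpy as np
import onnxruntime as ort

# Prefer CUDA, fall back to CPU; 'sr_model.onnx' is a placeholder for your exported SR network
session = ort.InferenceSession("sr_model.onnx",
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
lr_frame = np.random.rand(1, 3, 540, 960).astype(np.float32)   # NCHW, normalized to [0, 1]
sr_frame = session.run(None, {input_name: lr_frame})[0]        # e.g. (1, 3, 2160, 3840) for 4x
print(sr_frame.shape)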
For CPU-Only Environments
Not recommended for production due to speed constraints, except:
Option 1: SRCNN variant
- Processing: ~100ms on modern CPU
- Quality: Acceptable baseline
- Use case: Fallback only
Option 2: Quantized lightweight model
- Processing: Highly variable (1-5 seconds typical)
- Quality: Moderate
- Use case: Extremely resource-constrained
For Mobile/Edge Deployment
Framework: TensorFlow Lite, ONNX Runtime, or PyTorch Mobile
Model Selection
- Keep parameters <5M (ideally <3M)
- Use quantization (int8 recommended, int4 for extreme constraints)
- Target: 1-3 FPS on mid-range devices
Process
- Start with VPEG or lightweight Mamba
- Convert to TensorFlow Lite / ONNX
- Apply int8 quantization (typically <1 dB PSNR loss; see the sketch after this list)
- Test on target hardware
- Iterate if needed
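The quantization step above can be as simple as ONNX Runtime's dynamic post-training quantization; the file names below are placeholders, and TensorFlow Lite offers an equivalent path.
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert fp32 weights to int8 without retraining; expect a small PSNR drop
quantize_dynamic("sr_fp32.onnx", "sr_int8.onnx", weight_type=QuantType.QInt8)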
Expected Results
- Model size: 3-15 MB
- Inference: 500ms-2s per image on smartphone
- Quality: Acceptable perceptual improvement
For Browser-Based Deployment
Limited options due to computational constraints:
Framework: ONNX.js or TensorFlow.js
Recommendations
- Limit to SRCNN or ultra-lightweight models
- Requires WebGPU for practical performance
- Better approach: Server-side processing with progressive streaming
Part 9: The Efficiency Frontier
From Research to Production: The Efficiency Timeline
- 2022: HAT/SwinIR at 16M params, ~1 second per image; Real-ESRGAN (16.7M params) well established
- 2023: RT4KSR proves 60+ FPS real-time 4K; the efficiency track emerges
- 2024: Mamba methods arrive (3-8M params, linear complexity); AIM Efficient VSR hits <250 GMACs at 24-30 FPS video
- 2025: VPEG reaches 5M params and >30 FPS while matching Real-ESRGAN's perceptual quality (5M emerging as the sweet spot); Hi-Mamba adds hierarchical efficiency; multi-method ensembles emerging
The Parameter Reduction Story
Why parameters matter:
- Model size → download time
- Model size → memory requirement
- Larger models → more compute during inference
The trend:
- 2022: 16-20M standard
- 2023: 16-20M still dominant
- 2024: 5-8M becoming mainstream
- 2025: <3M possible, 5M optimal
Practical implications:
- Quantization increasingly effective (<1 dB PSNR loss typical)
- Edge deployment finally viable
- Download sizes shrinking
- Real-time feasible on consumer hardware
Part 10: Emerging Frontiers
1. Diffusion Models for Super-Resolution
Status: Experimental, gaining traction
Advantages
- Highest perceptual quality possible
- Novel approach to generation
- Flexible control options
Disadvantages
- Slower than CNN methods (50-200ms typical)
- Higher memory requirement
- Still research-focused
Latest: ControlSR (2024) combining DPM-Solver acceleration with real-world degradation handling
Trajectory: Moving toward practical deployment; still 2-3 years from mainstream production
2. State Space Models (Mamba)
Status: Rapidly advancing from research to production
Why Exciting
- Linear complexity vs. the transformer's quadratic
- Effective long-range dependency modeling
- Memory-efficient at scale
Reality Check
- Still slightly behind transformers on standard benchmarks
- Not yet proven in production at scale
- Emerging as viable alternative (not replacement)
Near term: Mamba adoption in specialized use cases (large-scale processing, mobile)
Medium term: Competitive parity with transformers on most tasks
3. CLIP-Based Semantic Filtering
Status: Entering mainstream adoption
Innovation
- Using CLIP (vision-language model) to understand image semantics
- Filtering generated artifacts intelligently
- Prioritizing semantic coherence over pixel perfection
Impact
- Improved results on complex scenes
- Better handling of text in images
- More "natural" upscaling
4. Frequency-Domain Losses
Status: Emerging standard in 2024-2025
Concept
- Analyzing image quality in frequency domain
- Preserving high-frequency details (texture, edges)
- Reducing low-frequency artifacts
Results
- Visibly sharper outputs
- Better texture restoration
- Reduced blur/smoothing artifacts
5. Multi-Stage Adaptive Pipelines
Status: Research frontier
Approach
- First stage: Quick initial upscaling
- Analysis: Detect problem areas
- Second stage: Refined processing on difficult regions
- Fusion: Blend results
Advantage: Allocate computational resources where needed
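A toy version of this idea, using OpenCV only; the heavier second-stage model is a placeholder function you would supply, and the detail heuristic is one simple choice among many.
import cv2
import numpy as np

def multi_stage_upscale(lr_bgr, scale=4, refine_fn=None):
    h, w = lr_bgr.shape[:2]
    # Stage 1: cheap full-frame upscale
    base = cv2.resize(lr_bgr, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
    if refine_fn is None:
        return base
    # Analysis: flag high-detail regions via local gradient energy (Laplacian response)
    gray = cv2.cvtColor(lr_bgr, cv2.COLOR_BGR2GRAY)
    detail = cv2.resize(np.abs(cv2.Laplacian(gray, cv2.CV_32F)), (w * scale, h * scale))
    mask = (detail > detail.mean() + detail.std()).astype(np.float32)[..., None]
    # Stage 2 + fusion: keep the heavy model's output only where detail is high
    refined = refine_fn(lr_bgr).astype(np.float32)   # placeholder heavy SR model, same output size as base
    return (mask * refined + (1.0 - mask) * base.astype(np.float32)).astype(base.dtype)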
Part 11: Hardware Acceleration Support
GPU Support Matrix
| Method | NVIDIA | AMD | Intel GPU | NPU/AI | CPU |
|---|---|---|---|---|---|
| SRCNN | ✅ | ✅ | ✅ | Limited | ✅ (slow) |
| ESRGAN/Real-ESRGAN | ✅ | ✅ | ✅ | Limited | ❌ |
| SwinIR | ✅ | ✅ | ✅ | Limited | ❌ |
| HAT | ✅ | ✅ | ✅ | Limited | ❌ |
| VPEG | ✅ | ✅ | ✅ | ✅ Yes | Limited |
| MambaIR | ✅ | ✅ | ✅ | Limited | ❌ |
| RT4KSR | ✅ | ✅ | ✅ | Limited | ❌ |
| Upscayl | ✅ (Vulkan) | ✅ (Vulkan) | ✅ (Vulkan) | Limited | Limited |
Framework Support
PyTorch
- Native support for most methods
- Best for research and training
- Good optimization ecosystem
TensorFlow
- Available for major methods
- Good mobile/edge support
- TensorFlow Lite for deployment
ONNX
- Cross-framework compatibility
- Wide runtime support
- Industry standard for interchange
TensorFlow Lite
- Mobile deployment standard
- Hardware acceleration (GPU, NPU, DSP)
- Good quantization support
NCNN
- Edge and embedded focus
- Low memory footprint
- GPU-agnostic (Vulkan support)
Part 12: Installation & Deployment Guides
Quick Start: Using Real-ESRGAN
Installation (Python)
pip install realesrgan
# or using uv:
uv add realesrgan
Basic Usage
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)  # RRDBNet backbone for x4
upsampler = RealESRGANer(scale=4, model_path='weights/RealESRGAN_x4plus.pth', model=model)  # point to your downloaded weights
output, _ = upsampler.enhance(input_image, outscale=4)  # input_image: BGR numpy array (e.g. from cv2.imread)
For Desktop: Use Upscayl (https://github.com/upscayl/upscayl)
- No coding required
- Supports NVIDIA, AMD, Intel GPUs
- Linux, macOS, Windows
For Maximum Quality: HAT Deployment
GitHub: https://github.com/XPixelGroup/HAT
Installation
git clone https://github.com/XPixelGroup/HAT.git
cd HAT
pip install -r requirements.txt  # or: uv add -r requirements.txt
Pre-trained Models
- Available on GitHub releases
- Download and point to model path
For Real-Time 4K: RT4KSR
GitHub: https://github.com/eduardzamfir/RT4KSR
Key Setting: Optimized specifically for 4K throughput
For Production Video: BasicVSR++
Framework: BasicSR (https://github.com/XPixelGroup/BasicSR)
Includes: Full training framework, pre-trained models, evaluation scripts
Part 13: The Future Outlook
Near Term (Next 12 Months - 2025)
Likely Developments
- Mamba maturation: Production-ready SSM models with transformer parity
- Efficiency focus: 3M parameter models becoming standard
- Real-time video: 30 FPS video SR on consumer GPU becoming normal
- Mobile deployment: Practical real-time super-resolution on mid-range phones
- Semantic awareness: CLIP integration becoming standard
Challenges
- PSNR plateau requires new evaluation frameworks
- Generalization on unknown degradations still difficult
- Real-time video SR at high resolution still challenging
Medium Term (12-24 Months - 2026)
Expected Breakthroughs
- Arbitrary-scale SR: Seamless upscaling at any factor
- Unified architectures: Single model handling image+video+blind SR
- Adaptive methods: Real-time adjustment to image content
- Quantum considerations: Exploring quantum-friendly approaches
Long Term (24+ Months - 2027+)
Speculative Frontiers
- Neural rendering: Direct feature space manipulation
- Neuromorphic hardware: Spiking networks for ultra-efficient SR
- Foundation models: Large pretrained models for adaptation
- Task-agnostic: Single model for all image restoration tasks
Part 14: Critical Insights for Decision Making
The Quality Ceiling
Reality: PSNR improvements have plateaued at ~32.8-32.9 dB
Implication: Further algorithm innovation unlikely to yield significant PSNR gains
Solution: The field is shifting toward:
- Perceptual metrics (LPIPS, CLIP-IQA, VMAF)
- Real-world scenario performance
- Computational efficiency
- Edge case handling
The Efficiency Revolution
Key Finding: 5M parameter models now match 16M+ parameter models in perceptual quality
VPEG Case Study:
- Uses 17.6% of Real-ESRGAN's computation
- Exceeds Real-ESRGAN's perceptual quality scores
- Achieves real-time on consumer hardware
Implication: The efficiency frontier has moved dramatically; old assumptions about quality vs. speed tradeoffs are outdated
Real-Time Achievement
Proven: 4K real-time (60+ FPS) is achievable and production-ready
Proven: Video real-time (30 FPS) is commercial reality
Implication: Resource constraints no longer excuse non-real-time deployments for most use cases
Blind Super-Resolution Solved
Reality: Real-ESRGAN and variants effectively handle real-world degraded images
Implication: Can now deploy production systems without knowing exact degradation type
The Hybrid Advantage
Finding: CNN-Transformer hybrids outperform pure architectures
Examples: HAT (hybrid attention), HAMSA (hybrid + motion)
Implication: Future architectures will likely embrace hybrid approaches
Part 15: Comparative Quick Reference
One-Liner Descriptions
- Best Overall Quality: HAT/HAAT (32.8+ dB PSNR)
- Best Efficiency: VPEG (5M params, >30 FPS)
- Best Real-Time 4K: RT4KSR (60-120 FPS)
- Best Real-World Photos: Real-ESRGAN (blind SR)
- Best Video Quality: BasicVSR++ (reference standard)
- Best Research Stability: SwinIR (proven baseline)
- Best Emerging Tech: Hi-Mamba (hierarchical SSM)
- Best Video Real-Time: AIM 2024 Challenge Winners (<33ms)
Decision Tree
START: What's your primary constraint?
├─ Quality (no time limit)
│  └─ Use: HAT or HAAT (2023-2024)
│
├─ Speed (must be real-time)
│  ├─ 4K video: RT4KSR (60+ FPS)
│  ├─ Single image: VPEG (>30 FPS)
│  └─ Video: AIM 2024 winners (<33ms/frame)
│
├─ Real-world degradation (unknown type)
│  └─ Use: Real-ESRGAN (blind SR specialist)
│
├─ Edge device (limited memory/CPU)
│  └─ Use: Quantized VPEG or SRCNN
│
├─ Video processing (temporal consistency)
│  ├─ High quality: BasicVSR++
│  ├─ Real-time: AIM 2024 challenge winners
│  └─ Motion-aware: HAMSA
│
└─ Research/Publication
   └─ Use: HAAT or latest NTIRE winner
Conclusion
The State of Super-Resolution in 2025
Where We Are
- Quality plateau achieved: Further PSNR improvements unlikely
- Real-time is now standard: Not an aspirational goal anymore
- Efficiency revolutionized: 75% parameter reduction while improving quality
- Practical deployment ready: Production systems can be deployed today
- Hybrid approaches winning: CNN-Transformer combinations dominating
What Works Best
| Need | Solution | Why |
|---|---|---|
| Maximum quality | HAT | 32.8+ dB PSNR proven |
| Real-time anything | VPEG or RT4KSR | Tested in competition |
| Photo restoration | Real-ESRGAN | Only blind SR specialist |
| Video SR | BasicVSR++ | Reference standard |
| Mobile/Edge | Quantized VPEG | Proven efficiency |
| Research | HAAT | Latest SOTA |
The Next Frontier
The field is transitioning from "how to improve PSNR" to "how to handle real-world complexity better":
- Semantic understanding (CLIP integration)
- Frequency-domain quality
- Adaptive multi-stage processing
- Efficient edge deployment
For Practitioners
Start here:
- For real-time: Use VPEG or RT4KSR (proven in competition)
- For quality: Use HAT or SwinIR (established baselines)
- For real photos: Use Real-ESRGAN (industry standard)
- For video: Use BasicVSR++ (reference implementation)
Then adapt:
- Test on your specific use case
- Measure with perceptual metrics (LPIPS), not just PSNR
- Consider efficiency for your hardware constraints
- Evaluate on real data, not just benchmarks
The Bottom Line
Super-resolution has matured from research curiosity to production technology. The question is no longer "can we do this" but "which method best fits my constraints." Start with proven methods from recent competitions (RT4KSR, VPEG, AIM 2024 winners), validate on your data, and iterate.
The 2025 frontier is no longer about pushing PSNR; it's about real-world quality, efficiency, and practical deployment.
References
Key Papers & Sources
Foundational
- SRCNN (2014): Super-Resolution Using Convolutional Neural Networks
- SRGAN (2016): First GAN-based super-resolution
GANs & Attention Era
- ESRGAN (2018): Enhanced Super-Resolution GANs
- RCAN (2018): Residual Channel Attention Networks
- BSRGAN (2021): Blind Super-Resolution GAN
- Real-ESRGAN (2021): arXiv:2107.10833
Transformer Revolution
- SwinIR (2021): arXiv:2108.10257
- HAT (2023): arXiv:2309.05239
- HAAT (2024): arXiv:2411.18003
State Space Models
- MambaIR (2024): SpringerLink
- Hi-Mamba (2024): arXiv:2410.10140
- S³Mamba (2024): arXiv:2411.11906
- Directing Mamba (2025): arXiv:2501.16583
Diffusion Models
- ControlSR (2024): arXiv:2410.14279
- Diff-Mamba (2025): Scientific Reports
Efficient Methods
- VPEG (2025): arXiv:2510.12765
- REAPPEAR (2025): AMD Technical Articles
Real-Time & Video
- RT4KSR (2023): NTIRE 2023 Challenge, arXiv:2404.16484
- BasicVSR++ (2021): REDS baseline
- FRVSR (2018): arXiv:1801.04590
- HAMSA (2024): Hybrid Attention Motion Alignment
- AIM 2024 VSR (2024): arXiv:2409.17256
Challenges & Competitions
- NTIRE 2024: https://openaccess.thecvf.com/content/CVPR2024W/NTIRE/papers
- NTIRE 2025: Latest results
- AIM 2024: Efficient image and video SR challenges
- CVPR 2023 NTIRE: RT4KSR challenge baseline
Open Source Implementations
- Real-ESRGAN: https://github.com/xinntao/Real-ESRGAN
- SwinIR: https://github.com/JingyunLiang/SwinIR
- HAT: https://github.com/XPixelGroup/HAT
- BasicSR: https://github.com/XPixelGroup/BasicSR (includes RRDB, SwinIR, HAT)
- BasicVSR: https://github.com/OpenVisualCloud/Video-Super-Resolution-Library
- FRVSR: https://github.com/msmsajjadi/FRVSR
- Upscayl: https://github.com/upscayl/upscayl (Desktop application)
Data & Benchmarks
- Pre-trained Models: TensorFlow Hub, HuggingFace Model Hub
- Test Sets: Set5, Set14, BSD100, Urban100, Manga109 (image SR), VIMEO-90K, REDS (video SR)
- Challenge Leaderboards: Papers with Code, official competition websites
Report Compiled: November 2025
Coverage Period: 2014-2025 (with emphasis on 2022-2025)
Total Methods Analyzed: 30+
Data Points: 200+
This report synthesizes research from academic papers, challenge proceedings, and industry implementations. For specific method citations and detailed comparisons, refer to the reference section and original paper repositories.