ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

Keyframe interpolation results using our ViBiDSampler (25 frames, 1024x576).


Abstract

Recent advancements in text-to-video (T2V) and image-to-video (I2V) diffusion models have improved video generation, particularly for keyframe interpolation. However, adapting current I2V models for two-frame conditioning (start & end) remains challenging, often leading to artifacts and off-manifold issues. We introduce a novel bidirectional sampling strategy that sequentially samples along forward and backward paths without extensive re-noising or model fine-tuning, ensuring more coherent and on-manifold generation of intermediate frames. Advanced guidance techniques, CFG++ and DDS, are also employed to enhance interpolation.
Ours 😁
High-quality interpolation: generates high-quality videos between keyframes.
Computational efficiency: requires no extensive re-noising or model fine-tuning.
Accessibility: uses an open-source diffusion model (SVD).
By integrating these, our method achieves state-of-the-art performance, efficiently generating high-quality, smooth videos between keyframes. On a single 3090 GPU, our method can interpolate 25 frames at 1024x576 resolution in just 195 seconds, establishing it as a leading solution for keyframe interpolation.

Method

Comparison of the denoising processes of (a) the Time Reversal Fusion (TRF) method and (b) bidirectional sampling (Ours). Our key innovation is to sample the temporal forward path and the temporal backward path sequentially, with a single re-noising step between them.
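The sequential structure in (b) can be sketched in a few lines of Python. This is a minimal toy, not the released implementation: `toy_denoiser` is a hypothetical stand-in for the I2V model (it is not SVD), each frame is reduced to a single float, and the Euler update, the noise schedule `sigmas`, and the exact form of the re-noising step are assumptions chosen for illustration.

```python
import math
import random


def toy_denoiser(video, sigma, cond_frame):
    """Hypothetical stand-in for an image-to-video model (NOT SVD).

    Returns a clean-video estimate: every frame is shrunk toward zero and
    the first frame is pinned to the conditioning frame.
    """
    estimate = [frame / (1.0 + sigma ** 2) for frame in video]
    estimate[0] = cond_frame
    return estimate


def euler_step(video, estimate, sigma, sigma_next):
    """Move from noise level sigma to sigma_next toward the clean estimate."""
    ratio = sigma_next / sigma
    return [e + ratio * (v - e) for v, e in zip(video, estimate)]


def bidirectional_sample(start, end, num_frames, sigmas, rng):
    """Toy bidirectional sampler; sigmas must be strictly decreasing."""
    video = [rng.gauss(0.0, sigmas[0]) for _ in range(num_frames)]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # 1) Temporal forward path, conditioned on the start frame.
        estimate = toy_denoiser(video, sigma, start)
        video = euler_step(video, estimate, sigma, sigma_next)
        # 2) Single re-noising step back up to noise level sigma
        #    (assumed form: add noise with the variance that was removed).
        extra = math.sqrt(sigma ** 2 - sigma_next ** 2)
        video = [v + rng.gauss(0.0, extra) for v in video]
        # 3) Temporal backward path: flip time, condition on the end frame,
        #    denoise, then flip back to the original frame order.
        flipped = video[::-1]
        estimate = toy_denoiser(flipped, sigma, end)
        flipped = euler_step(flipped, estimate, sigma, sigma_next)
        video = flipped[::-1]
    return video


if __name__ == "__main__":
    rng = random.Random(0)
    frames = bidirectional_sample(0.0, 1.0, 5, [10.0, 5.0, 2.0, 1.0, 0.5, 0.0], rng)
    print(frames)
```

Each noise level uses two denoiser calls, one per temporal direction, and no averaging of two separately denoised videos. In this toy, the final backward pass pins the last frame to `end` exactly; how well the start frame is preserved depends on the real model, which the stand-in does not capture.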

On-manifold sampling

In the geometric view of diffusion models, the sampling process is typically described as iterative transitions moving from the noisy manifold to the clean manifold. From this perspective, (a) fusing two intermediate sample points through linear interpolation on a noisy manifold can lead to an undesirable off-manifold issue, where the generated samples deviate from the learned data distribution. (b) Bidirectional sampling effectively addresses this issue by sequentially sampling both the temporal forward and backward paths, with a single re-noising step in between. This approach enables on-manifold sampling, ensuring that the generated samples stay close to the learned data distribution.
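A toy picture of the off-manifold issue in (a): take the unit circle as a stand-in for the noisy manifold (an assumption made purely for illustration; the real manifold is high-dimensional and not known in closed form). Linearly interpolating two points that lie on the circle produces a point strictly inside it.

```python
import math


def lerp(p, q, w):
    """Linear interpolation between points p and q with weight w on q."""
    return tuple((1.0 - w) * a + w * b for a, b in zip(p, q))


def radius(p):
    """Distance from the origin; points on the toy manifold have radius 1."""
    return math.hypot(*p)


# Two samples on the toy manifold (the unit circle).
forward_sample = (1.0, 0.0)
backward_sample = (0.0, 1.0)

# Fusing them by linear interpolation leaves the circle.
fused = lerp(forward_sample, backward_sample, 0.5)

print(radius(forward_sample), radius(backward_sample), radius(fused))
```

The fused point has radius of about 0.707 rather than 1, so it no longer lies on the toy manifold. Sampling the two paths sequentially avoids this averaging step altogether.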

Baseline comparisons

TRF & Generative Inbetweening

[Video comparison, 7 examples. Columns: Input pairs, TRF, Generative Inbetweening, Ours]

Baseline comparisons

FILM & DynamiCrafter

[Video comparison, 5 examples. Columns: Input pairs, FILM, DynamiCrafter, Ours]

Additional visualizations

We provide visualizations of the ablation studies in video format. Please click the link.