DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion

1RECON Labs Inc.
2Department of Artificial Intelligence, Yonsei University
3Department of Artificial Intelligence, Sungkyunkwan University
4Department of Electrical and Computer Engineering, Sungkyunkwan University

Abstract

Recent advancements in diffusion models have revolutionized video generation, enabling the creation of high-quality, temporally consistent videos. However, generating high frame-rate (FPS) videos remains a significant challenge due to issues such as flickering and degradation in long sequences, particularly in fast-motion scenarios. Existing methods often suffer from computational inefficiencies and limitations in maintaining video quality over extended frames. In this paper, we present a novel, training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising, to achieve smooth, consistent video outputs without the need for additional fine-tuning. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity. The proposed method is not only computationally efficient but also adaptable to various video generation tasks, making it ideal for applications such as virtual reality, video games, and high-quality content creation.

Overall Pipeline

DiffuseSlide is a training-free, high frame-rate video generation pipeline based on pre-trained image-to-video diffusion models. It consists of three stages: latent interpolation, noise re-injection, and sliding window denoising. This pipeline enables the production of smooth and temporally coherent videos without additional training.
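As a rough illustration of the first stage, the sketch below linearly blends the latents of consecutive key frames to build the extended high-FPS latent sequence (the "naive averaging" that the later stages refine). The function name interpolate_latents, the tensor shapes, and the factor parameter are illustrative assumptions, not the exact implementation.

import torch

def interpolate_latents(key_latents: torch.Tensor, factor: int) -> torch.Tensor:
    # key_latents: (T, C, H, W) latents of the low-FPS key frames.
    # factor:      number of output frames per key-frame interval (e.g. 4 for 4x FPS).
    # Returns a ((T - 1) * factor + 1)-frame latent sequence.
    frames = []
    for i in range(key_latents.shape[0] - 1):
        a, b = key_latents[i], key_latents[i + 1]
        for j in range(factor):
            w = j / factor
            frames.append((1.0 - w) * a + w * b)  # naive averaging between key-frame latents
    frames.append(key_latents[-1])
    return torch.stack(frames, dim=0)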


Noise Re-Injection

We apply a multi-step noise re-injection process to refine the interpolated latent frames. Interpolated latents often contain artifacts due to naive averaging. To address this, we inject Gaussian noise at intermediate diffusion steps and then denoise the result. This alternation of noising and denoising recovers fine details and pulls the latents toward the smooth-video manifold, preventing the model from getting stuck in suboptimal reconstructions and enhancing both spatial detail and temporal consistency.
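The sketch below illustrates the idea, assuming a diffusers-style scheduler exposing an add_noise method and a denoise_fn callable standing in for the model's sampling loop; the function name, the number of rounds, and the re-injection timestep are illustrative choices rather than the paper's exact settings.

import torch

@torch.no_grad()
def noise_reinjection(latents, denoise_fn, scheduler, t_inject, num_rounds=3):
    # latents:    (T, C, H, W) interpolated latent frames to be refined.
    # denoise_fn: callable running the model's denoising loop from timestep
    #             t_inject back to 0 (a stand-in for the real sampler).
    # scheduler:  diffusers-style noise scheduler exposing add_noise(x, noise, t).
    # t_inject:   intermediate timestep (tensor) at which noise is re-injected.
    for _ in range(num_rounds):
        noise = torch.randn_like(latents)
        # Push the refined latents back to an intermediate noise level ...
        noisy = scheduler.add_noise(latents, noise, t_inject)
        # ... then denoise again, pulling them toward the smooth-video manifold.
        latents = denoise_fn(noisy, t_inject)
    return latents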

Sliding Window for Multi-Image Conditioning

Pre-trained image-to-video models can only generate a limited number of frames in a single pass. To overcome this, we use a sliding-window denoising method: the extended latent sequence is divided into overlapping subsequences, each denoised independently and conditioned on its corresponding key frame. This enables multi-image conditioning throughout the sequence, reducing flicker and keeping longer videos consistently aligned with their key frames.
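A minimal sketch of this stage follows, assuming a denoise_window_fn callable standing in for the pre-trained image-to-video model. The window length, stride, key-frame selection rule, and the averaging of overlapping windows are illustrative assumptions, not the paper's exact configuration.

import torch

@torch.no_grad()
def sliding_window_denoise(latents, denoise_window_fn, key_latents,
                           factor=4, window=16, stride=8):
    # latents:           (T, C, H, W) extended latent sequence after interpolation.
    # denoise_window_fn: callable denoising one window conditioned on a key-frame
    #                    latent (a stand-in for the image-to-video model).
    # key_latents:       (K, C, H, W) latents of the low-FPS key frames.
    # factor:            number of interpolated frames per key-frame interval.
    # window, stride:    window length and hop size; overlap = window - stride frames.
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    weight = torch.zeros(T, 1, 1, 1, device=latents.device)
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if starts[-1] + window < T:            # make sure the tail of the sequence is covered
        starts.append(T - window)
    for start in starts:
        end = min(start + window, T)
        # Condition each window on the key frame nearest its first frame.
        key_idx = min(start // factor, key_latents.shape[0] - 1)
        denoised = denoise_window_fn(latents[start:end], key_latents[key_idx])
        out[start:end] += denoised         # accumulate overlapping windows ...
        weight[start:end] += 1.0
    return out / weight.clamp(min=1.0)     # ... and average them where they overlap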

Comparison with Original Video

Comparison with Linear Interpolation

BibTeX

@misc{hwang2025diffuseslide,
  title={DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion}, 
  author={Geunmin Hwang and Hyun-kyu Ko and Younghyun Kim and Seungryong Lee and Eunbyung Park},
  year={2025},
  eprint={2506.01454},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.01454}
}