Turkish Journal of Electrical Engineering and Computer Sciences
Abstract
Video prediction is a significant and actively researched area within the data science community. Its primary objective is to generate future video frames from historical frames, with applications in diverse domains such as human motion prediction, climate change analysis, and traffic flow forecasting. Traditional methods combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to capture complex correlations in spatial-temporal signals. Recent methods improve video prediction accuracy by introducing external information such as optical flow, semantic maps, and human pose data. However, these methods have limitations: they do not fully explore the intermediate states of learned representations, they overlook the interactions between multiple pretext tasks and various contrastive learning schemes, and they are inefficient at processing high-resolution videos. To address these issues, we propose a new model with an encoder-processor-decoder architecture. The encoder projects the input data into a latent space, capturing the spatial-temporal information in video sequences. The processor module, inspired by the Vector Quantized-Variational AutoEncoder (VQ-VAE) framework, enhances the modeling of complex spatial-temporal dynamics through advanced techniques. The decoder module maps the processed latent representations back to the original data dimensions. By combining these three modules, our model achieves accurate and efficient spatial-temporal video prediction. The contributions of this paper are summarized as follows: (i) leveraging advanced techniques in the processor module, our model enhances the learning of spatial-temporal representations, capturing physical principles and complex dynamics in video sequences for more accurate and realistic predictions; (ii) our model uses a carefully designed architecture that balances accuracy and computational efficiency, making it suitable for practical scenarios; (iii) extensive experiments on multiple datasets and comparisons with ten baseline methods demonstrate the competitive performance of the proposed TS-VQ-VAE model in terms of both effectiveness and efficiency.
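To give a concrete picture of the three-stage design outlined above, the following is a minimal PyTorch sketch of an encoder-processor-decoder video predictor with a VQ-VAE-style quantized latent. The layer shapes, codebook size, frame counts, and the straight-through quantization step are illustrative assumptions, not the exact TS-VQ-VAE configuration reported in the paper.

```python
# Minimal sketch of an encoder -> quantized processor -> decoder video predictor.
# All sizes and the quantization scheme are illustrative assumptions.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W) -> one codebook entry per spatial position
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        quantized = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through estimator: gradients flow to the encoder output.
        return z + (quantized - z).detach()


class TSVQVAESketch(nn.Module):
    """Hypothetical encoder-processor-decoder pipeline for frame prediction."""

    def __init__(self, in_frames: int = 10, out_frames: int = 10, hidden: int = 64):
        super().__init__()
        # Encoder: stack input frames along channels, downsample spatially.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Processor: quantize the latent map, then model dynamics on it.
        self.quantizer = VectorQuantizer(code_dim=hidden)
        self.processor = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to the input resolution, emit future frames.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, out_frames, 4, stride=2, padding=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T_in, H, W) grayscale clip -> (B, T_out, H, W) prediction
        z = self.encoder(frames)
        z = self.processor(self.quantizer(z))
        return self.decoder(z)


if __name__ == "__main__":
    clip = torch.randn(2, 10, 64, 64)        # toy batch of 10-frame clips
    pred = TSVQVAESketch()(clip)
    print(pred.shape)                         # torch.Size([2, 10, 64, 64])
```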
DOI
10.55730/1300-0632.4121
Keywords
Spatio-temporal video prediction, VQ-VAE, processor, latent embedding
First Page
185
Last Page
202
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
FENG, MEI and LI, FAN (2025) "Enhancing spatial-temporal video prediction with TS-VQ-VAE: A novel encoder-processor-decoder approach," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 33: No. 2, Article 7.
https://doi.org/10.55730/1300-0632.4121
Available at: https://journals.tubitak.gov.tr/elektrik/vol33/iss2/7
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons