Turkish Journal of Electrical Engineering and Computer Sciences
Abstract
Video prediction is a significant and actively researched area within the data science community. Its primary objective is to generate future video frames from historical frames, with applications in diverse domains such as human motion prediction, climate change analysis, and traffic flow forecasting. Traditional methods combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to capture complex correlations in spatial-temporal signals. Recent methods improve video prediction accuracy by introducing external information such as optical flow, semantic maps, and human pose data. However, these methods have limitations: they do not fully explore the intermediate states of learned representations, they overlook the interactions between multiple pretext tasks and various contrastive learning schemes, and they are inefficient at processing high-resolution videos. To address these issues, we propose a new model with an encoder-processor-decoder architecture. The encoder projects the input data into a latent space, capturing the spatial-temporal information in video sequences. The processor module, inspired by the Vector Quantized-Variational AutoEncoder (VQ-VAE) framework, enhances the modeling of complex spatial-temporal dynamics through advanced techniques. The decoder module maps the processed latent representations back to the original data dimensions. By combining these three modules, our model achieves accurate and efficient spatial-temporal video prediction. The contributions of this paper are summarized as follows: (i) leveraging advanced techniques in the processor module, our model enhances the learning of spatial-temporal representations, capturing physical principles and complex dynamics in video sequences for more accurate and realistic predictions; (ii) our model uses a carefully designed architecture that balances accuracy and computational efficiency, making it suitable for practical scenarios; (iii) extensive experiments on multiple datasets and comparisons with ten baseline methods demonstrate the competitive performance of the proposed TS-VQ-VAE model in terms of both effectiveness and efficiency.
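To give a concrete picture of the three-stage design outlined above, the following is a minimal PyTorch sketch of an encoder-processor-decoder video predictor with a VQ-VAE-style quantized latent. The layer shapes, codebook size, frame counts, and the straight-through quantization step are illustrative assumptions, not the exact TS-VQ-VAE configuration reported in the paper.

```python
# Minimal sketch of an encoder -> quantized processor -> decoder video predictor.
# All sizes and the quantization scheme are illustrative assumptions.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W) -> one codebook entry per spatial position
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        quantized = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through estimator: gradients flow to the encoder output.
        return z + (quantized - z).detach()


class TSVQVAESketch(nn.Module):
    """Hypothetical encoder-processor-decoder pipeline for frame prediction."""

    def __init__(self, in_frames: int = 10, out_frames: int = 10, hidden: int = 64):
        super().__init__()
        # Encoder: stack input frames along channels, downsample spatially.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Processor: quantize the latent map, then model dynamics on it.
        self.quantizer = VectorQuantizer(code_dim=hidden)
        self.processor = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to the input resolution, emit future frames.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, out_frames, 4, stride=2, padding=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T_in, H, W) grayscale clip -> (B, T_out, H, W) prediction
        z = self.encoder(frames)
        z = self.processor(self.quantizer(z))
        return self.decoder(z)


if __name__ == "__main__":
    clip = torch.randn(2, 10, 64, 64)        # toy batch of 10-frame clips
    pred = TSVQVAESketch()(clip)
    print(pred.shape)                         # torch.Size([2, 10, 64, 64])
```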
DOI
10.55730/1300-0632.4121
Keywords
Spatio-temporal video prediction, VQ-VAE, processor, latent embedding
First Page
185
Last Page
202
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
FENG, MEI and LI, FAN (2025) "Enhancing spatial-temporal video prediction with TS-VQ-VAE: A novel encoder-processor-decoder approach," Turkish Journal of Electrical Engineering and Computer Sciences: Vol. 33: No. 2, Article 7.
https://doi.org/10.55730/1300-0632.4121
Available at: https://journals.tubitak.gov.tr/elektrik/vol33/iss2/7
Included in
Computer Engineering Commons, Computer Sciences Commons, Electrical and Computer Engineering Commons