Turkish Journal of Electrical Engineering and Computer Sciences

Author ORCID Identifier

FAN LI: 0009-0004-1376-2677

MEI FENG: 0009-0003-2529-8528

Abstract

Video prediction is a significant and actively researched area in the data science community. Its objective is to generate future video frames from historical frames, with applications in diverse domains such as human motion prediction, climate change analysis, and traffic flow forecasting. Traditional methods combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to capture the complex correlations in spatial-temporal signals, and more recent methods improve prediction accuracy by introducing external information such as optical flow, semantic maps, and human pose data. However, these methods have limitations: they do not fully explore the intermediate states of learned representations, they overlook the interactions between multiple pretext tasks and various contrastive learning schemes, and they are inefficient at processing high-resolution videos. To address these issues, we propose TS-VQ-VAE, a new model built on an encoder-processor-decoder architecture. The encoder projects the input into a latent space that captures the spatial-temporal information in video sequences. The processor, inspired by the Vector Quantized-Variational AutoEncoder (VQ-VAE) framework, models complex spatial-temporal dynamics in this latent space. The decoder maps the processed latent representations back to the original data dimensions. By combining these three modules, the model achieves accurate and efficient spatial-temporal video prediction. The contributions of this paper are as follows: (1) the processor module enhances the learning of spatial-temporal representations, capturing the physical principles and complex dynamics in video sequences for more accurate and realistic predictions; (2) the carefully designed architecture balances accuracy and computational efficiency, making the model suitable for practical scenarios; and (3) extensive experiments on multiple datasets against ten baseline methods demonstrate the competitive effectiveness and efficiency of the proposed TS-VQ-VAE.
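The abstract describes an encoder-processor-decoder pipeline whose processor follows the VQ-VAE framework. The sketch below is a minimal, hypothetical PyTorch rendering of that layout, not the paper's actual TS-VQ-VAE: every module name, layer shape, codebook size, and the per-frame (rather than explicitly temporal) processor stack are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Standard VQ-VAE codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                      # z: (B, D, H, W)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.size(1))
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=1)              # nearest codebook entry
        q = self.codebook(idx).view(z.size(0), z.size(2), z.size(3), -1)
        q = q.permute(0, 3, 1, 2)
        # Codebook + commitment losses from the original VQ-VAE objective.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()               # straight-through estimator
        return q, loss

class TSVQVAESketch(nn.Module):
    """Illustrative encoder-processor-decoder layout (not the published model)."""
    def __init__(self, in_ch=1, hid=64):
        super().__init__()
        # Encoder: project frames into a downsampled latent grid.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, hid, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hid, hid, 4, stride=2, padding=1),
        )
        # Processor: quantize, then model dynamics in latent space
        # (a plain conv stack stands in for the paper's techniques).
        self.quant = VectorQuantizer(code_dim=hid)
        self.processor = nn.Sequential(
            nn.Conv2d(hid, hid, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid, hid, 3, padding=1),
        )
        # Decoder: map processed latents back to frame resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hid, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        z = self.encoder(frames.reshape(b * t, c, h, w))
        q, vq_loss = self.quant(z)
        out = self.decoder(self.processor(q))
        return out.view(b, t, c, h, w), vq_loss

Training such a sketch would minimize a frame reconstruction loss plus the returned vq_loss; the paper's processor presumably adds explicit temporal modeling that this per-frame placeholder omits.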

DOI

10.55730/1300-0632.4121

Keywords

Spatio-temporal Video Prediction, VQ-VAE, Processor, Latent Embedding

First Page

185

Last Page

202

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.
