Deep Unfolding Transformers for Sparse Recovery of Video

Deep unfolding models are designed by unrolling an optimization algorithm into a deep learning network. By incorporating domain knowledge from the optimization algorithm, they have shown faster convergence and higher performance than the original algorithm. We design an optimization problem for sequential signal recovery, which encodes two priors: the signals admit a sparse representation in a dictionary, and they are correlated over time. A corresponding optimization algorithm is derived and unfolded into a deep unfolding Transformer encoder architecture, coined DUST. To demonstrate its improved reconstruction quality and its flexibility in handling sequences of different lengths, we perform extensive experiments on video frame reconstruction from low-dimensional and/or noisy measurements, using several video datasets. We evaluate extensions of the base DUST model that incorporate token normalization and multi-head attention, and compare our proposed networks with several deep unfolding recurrent neural networks (RNNs), generic unfolded and vanilla Transformers, and several video denoising models. The results show that our proposed Transformer architecture improves reconstruction quality over state-of-the-art deep unfolding RNNs, existing Transformer networks, and state-of-the-art video denoising models, while significantly reducing the model size and the computational cost of training and inference.
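To make the deep unfolding idea concrete, the sketch below unrolls plain ISTA (iterative soft-thresholding) for sparse recovery, a classic starting point for unfolded networks such as LISTA. This is a generic, hedged illustration, not the DUST architecture itself: the problem sizes, step size, and threshold are illustrative assumptions, and a deep unfolding network would replace the fixed per-iteration parameters with learned ones (and, in DUST, add temporal correlation across frames via attention).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: recover a sparse vector x from compressed measurements y = A @ x.
# Dimensions are illustrative assumptions, not taken from the paper.
m, n, k = 15, 30, 3                      # measurements, signal dim, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

def soft(v, thr):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def unrolled_ista(y, A, n_layers=200, lam=0.01):
    """Each 'layer' is one ISTA iteration. In a deep unfolding network,
    the step size, threshold, and linear operators would become learnable
    per-layer parameters trained end to end."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const. of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):
        x = soft(x - step * A.T @ (A @ x - y), step * lam)
    return x

x_hat = unrolled_ista(y, A)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Because ISTA monotonically decreases its objective, the data residual after the unrolled layers is smaller than at the zero initialization; training the per-layer parameters is what lets unfolded networks reach a comparable accuracy in far fewer layers than the original algorithm needs iterations.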
Source: IEEE Transactions on Signal Processing