Rembert Daems, Jeroen Taets, Francis wyffels, Guillaume Crevecoeur
We present KeyCLD, a framework to learn Lagrangian dynamics from images. Learned keypoints represent semantic landmarks in images and can directly represent state dynamics. Interpreting this state as Cartesian coordinates coupled with explicit holonomic constraints, allows expressing the dynamics with a constrained Lagrangian. Our method explicitly models kinetic and potential energy, thus allowing energy based control. We are the first to demonstrate learning of Lagrangian dynamics from images on the dm_control pendulum, cartpole and acrobot environments. This is a step forward towards learning Lagrangian dynamics from real-world images, since previous work in literature was only applied to minimalistic images with monochromatic shapes on empty backgrounds.
KeyCLD learns Lagrangian dynamics from images. (a) An observation of a dynamical system is processed by a keypoint estimator model. (b) The model represents the positions of the keypoints with a set of spatial probability heatmaps. (c) Cartesian coordinates are extracted using spatial softmax and used as state representations to learn Lagrangian dynamics. (d) The information in the keypoint coordinates bottleneck suffices for a learned renderer model to reconstruct the original observation, including background, reflections and shadows. The keypoint estimator model, Lagrangian dynamics models and renderer model are jointly learned unsupervised on sequences of images.
We investigate our model in an ablation study, see the paper for more details. KeyCLD (column 2) predicts future frames for the pendulum, cartpole and acrobot environments. Every predicted sequence is based on the first two frames of the ground truth sequence (column 1), since at least two frames are necessary to estimate the velocities. KeyCLD is capable of making accurate long-term predictions, including reflections and shadow. We compare these results with ablated models (columns 3 to 5).