Research line leader: Jan van Gemert (TUD)

R2 targets the following challenges:

  1. How to add temporal and motion modalities to DNNs efficiently?
  2. How to combine 2D nets with 3D geometric reasoning?
  3. How to efficiently fuse information from generic N-D modalities, including images, text, and audio, deep within the network?

The spectacular revolution of deep learning is powered by representation learning. Representations of images, video, text, and speech can now be learned with great success, instead of being painstakingly hand-crafted from physical models. Nonetheless, there are several strongholds where deep learning has not yet delivered its expected breakthrough. For example, in 3D reconstruction deep nets have gained some foothold, but geometry is still king. Another example is motion modeling: for tasks such as action recognition, object tracking, and motion analysis, deep feed-forward and recurrent LSTM (Long Short-Term Memory) nets have added a few percentage points, but they have yet to redeem their paradigm-shifting promise. Furthermore, an area that has been little explored is the (early) fusion of different sensor modalities (e.g. video, radar, LIDAR). Apart from the increased computational complexity of aggregating the data, challenges include automatically coping with differences in resolution, synchronization, and calibration.
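To make the fusion challenge concrete, the following minimal sketch (assuming PyTorch; the modality names, shapes, and layer sizes are illustrative assumptions, not the project's architecture) fuses a camera image with a lower-resolution range map at the input side of a network, resampling the coarser modality to cope with the resolution mismatch:

```python
# Illustrative sketch of early fusion of two modalities with different
# spatial resolutions; modality names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusionNet(nn.Module):
    """Fuses an RGB image and a lower-resolution radar/LiDAR range map
    at the input side of the network (early fusion)."""
    def __init__(self, img_channels=3, range_channels=1, hidden=32, num_classes=10):
        super().__init__()
        # One shared backbone operates on the channel-wise concatenation.
        self.backbone = nn.Sequential(
            nn.Conv2d(img_channels + range_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, image, range_map):
        # Cope with differing resolutions by resampling the coarser modality
        # onto the image grid before concatenation (the simplest alignment;
        # real sensors also need synchronization and calibration).
        range_up = F.interpolate(range_map, size=image.shape[-2:],
                                 mode='bilinear', align_corners=False)
        x = torch.cat([image, range_up], dim=1)
        features = self.backbone(x).flatten(1)
        return self.classifier(features)

# Example: a 224x224 RGB image fused with a 56x56 single-channel range map.
net = EarlyFusionNet()
logits = net(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 56, 56))
```

Synchronization and calibration are glossed over here by assuming the two inputs are already aligned in time and space; handling them automatically is part of the challenge stated above.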
The time has come for synergy between learning and knowledge. The most powerful deep nets for images arguably already rely on a ‘hand-crafted’ operation, the convolution; we aim to explore further marriages of deep nets with other forms of domain knowledge. While aiming to learn everything from samples is elegant, it is inefficient in its use of data and computation: the learning starts from scratch and has to relearn the same lessons every single time. Instead, we take a batteries-included approach and learn efficiently by exploiting existing knowledge to arrive at domain-specific solutions.
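As a hedged illustration of such a marriage between hand-crafted knowledge and learning (again in PyTorch; the choice of a Sobel/box filter basis is an assumption made for exposition, not the project's method), one can fix the convolution filters to a knowledge-based basis and learn only how to combine them:

```python
# Sketch: a convolution over a fixed, hand-crafted filter basis where only
# the per-channel mixing weights are trainable. Basis choice is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedBasisConv(nn.Module):
    """Convolution whose 3x3 filters are learned linear combinations of a
    fixed, knowledge-based basis (here: Sobel edges and a box filter)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        box = torch.ones(3, 3) / 9.0
        basis = torch.stack([sobel_x, sobel_y, box])  # (num_basis, 3, 3)
        # Registered as a buffer: part of the model, but never updated by SGD.
        self.register_buffer('basis', basis)
        self.num_basis = basis.shape[0]
        # Only the mixing coefficients are learned.
        self.mix = nn.Parameter(
            torch.randn(out_channels, in_channels, self.num_basis) * 0.1)

    def forward(self, x):
        # Effective filters: weight[o, i] = sum_b mix[o, i, b] * basis[b]
        weight = torch.einsum('oib,bhw->oihw', self.mix, self.basis)
        return F.conv2d(x, weight, padding=1)

layer = FixedBasisConv(in_channels=3, out_channels=16)
out = layer(torch.randn(2, 3, 64, 64))  # -> (2, 16, 64, 64)
```

Because the basis is fixed, the layer has far fewer free parameters than a standard convolution, which is one way domain knowledge can reduce the data and computation a model needs to learn.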