Title: Bridging Isolated Islands in Human Activity Understanding
Abstract: As a vital step toward intelligent agents, action understanding has attracted much attention and achieved notable success in recent years. The task can be formulated as a mapping from the physical space of human actions to a semantic space. However, due to the complexity of action patterns and semantic ambiguity, great challenges remain. On the physical side, multi-modal methods have been proposed to extract representative features that facilitate recognition. Far less effort, however, has gone into the design of the semantic space. Researchers typically build datasets according to idiosyncratic choices of action classes and then develop methods to push the envelope on each dataset individually. As a result, these datasets are incompatible with one another owing to the semantic gap and differing class granularities, e.g., ``do housework'' in dataset A versus ``wash plate'' in dataset B. We call this the ``isolated islands'' (I^2) problem; it poses a great challenge to general action understanding, since datasets separated by semantic gaps cannot support unified training. We argue that a more principled and complete semantic space is urgently needed to concentrate the community's efforts and to let us exploit all existing multi-modal datasets in pursuit of general and effective action understanding. To this end, we propose a novel path that reshapes the action-understanding paradigm. Concretely, we design a structured semantic space based on a verb taxonomy hierarchy that covers a large vocabulary of verbs. By aligning the classes of existing datasets to this structured space, we merge all image, video, skeleton, and MoCap action datasets into the largest database to date under a unified semantic label system. On top of it, we propose a bidirectional mapping framework, Sandwich, that uses multi-modal data with unified labels to bridge the physical and semantic spaces of action.
In extensive experiments, our framework shows great potential for future action study and a significant advantage over the canonical paradigm, especially on few- and zero-shot action learning tasks with semantic analogy, thanks to the verb-structure knowledge and our data coverage.
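The label alignment described above can be illustrated with a toy sketch. This is not the paper's actual code or taxonomy: the hierarchy, dataset labels, and mappings below are invented for illustration. The idea is that once every dataset's classes are attached to nodes of a shared verb taxonomy, labels of different granularities (e.g., ``do housework'' vs. ``wash plate'') become comparable through ancestor relations instead of being incompatible strings.

```python
# Illustrative sketch only: aligning dataset-specific action labels to a
# shared verb taxonomy so classes from different datasets become comparable.
# The taxonomy and dataset label maps below are hypothetical examples.

# Parent links in a tiny verb hierarchy: child -> parent.
TAXONOMY = {
    "wash plate": "do housework",
    "sweep floor": "do housework",
    "do housework": "act",
    "wave hand": "gesture",
    "gesture": "act",
}

# Each dataset maps its own class names onto taxonomy nodes.
DATASET_A = {"housework": "do housework"}     # coarse-grained labels
DATASET_B = {"washing dishes": "wash plate"}  # fine-grained labels

def ancestors(node):
    """Return the path from a taxonomy node up to the root."""
    path = [node]
    while node in TAXONOMY:
        node = TAXONOMY[node]
        path.append(node)
    return path

def compatible(label_a, label_b, map_a, map_b):
    """Two labels are compatible if one mapped node is an ancestor of the
    other (or they coincide), despite different class granularities."""
    node_a, node_b = map_a[label_a], map_b[label_b]
    return node_a in ancestors(node_b) or node_b in ancestors(node_a)

# "housework" (dataset A) subsumes "washing dishes" (dataset B):
print(compatible("housework", "washing dishes", DATASET_A, DATASET_B))  # True
```

With such an alignment, samples from both datasets can be trained under one unified label system, which is the premise of the Sandwich framework.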
Title: Learning to Anticipate Human Actions from Videos
Abstract: Our ability to anticipate the behavior of others comes naturally to us. An experienced driver, for example, can often predict the behavior of other road users, and a good table-tennis player can estimate the direction of the ball just by observing the opponent's movements. This phenomenon is called action anticipation: the ability to recognize the actions of others before they happen in the immediate future. It is effortless for humans, but developing a computational approach that does the same remains a challenge. Transferring this ability to computers is critical so that robots can react quickly by anticipating human actions as humans do. A robot's ability to understand what humans might do in the immediate future is important for assistive robotics in domains such as manufacturing and healthcare. Current methods are primarily deterministic and predictive; there are not many successful stochastic approaches for handling the uncertainty of the future. We propose two solutions to these challenges. In one approach, we develop a model that correlates past information with what might happen in the future. In the other, we develop methods that model the uncertainty with latent variable models.
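The latent-variable idea can be sketched in a few lines. This is not the authors' model; the decoder, action list, and feature here are hypothetical stand-ins. The point is the shape of the approach: decode (past observation, latent code z) into a future action, and sample several z's to obtain a distribution of plausible futures rather than a single deterministic prediction.

```python
# Hypothetical sketch of latent-variable action anticipation: sampling the
# latent code yields multiple plausible futures, capturing uncertainty.
import random

ACTIONS = ["reach", "grasp", "pour", "place"]

def decode(past_feature, z):
    """Toy decoder combining the observed-past feature with a latent code
    to pick a future action. A real model would be a learned network."""
    score = past_feature + z
    return ACTIONS[int(abs(score) * 10) % len(ACTIONS)]

def anticipate(past_feature, n_samples=5, seed=0):
    """Draw several latent codes z ~ N(0, 1) and decode each one,
    producing a set of plausible future actions instead of one answer."""
    rng = random.Random(seed)
    return [decode(past_feature, rng.gauss(0.0, 1.0)) for _ in range(n_samples)]

futures = anticipate(0.3)
print(futures)  # several sampled futures for the same observed past
```

A deterministic predictive model would collapse this set to a single output; keeping the latent variable is what lets the model represent the uncertainty of the future.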
Prof. Yebin Liu is a tenured professor in the Department of Automation, Tsinghua University. His research areas include computer vision, computer graphics, and computational photography, with a focus on digital human representation, reconstruction, and rendering. He has published more than 70 papers in top conferences and journals, including CVPR, ICCV, ECCV, TPAMI, and TOG. He received the First Prize of the National Technology Invention Award in 2012 and was funded by the NSFC Distinguished Young Scholars program in 2021.
Title: Neural Digital Human Avatar: Motion Capture, Reconstruction and Rendering
Abstract: Digital human avatars will serve as the future interface for human-AI interaction and holographic communication, and they are the core infrastructure of the metaverse. Focusing on its key, cutting-edge components, this talk will introduce neural methods for digital human 3D representation, motion capture, 3D reconstruction, and neural rendering. To achieve high-quality avatar reconstruction and rendering, traditional methods require extremely sophisticated capture setups, such as dense camera systems with hundreds of cameras. In this talk, we will introduce our recent neural methods that achieve high-quality human avatar reconstruction and rendering using only six or eight cameras.