Keynote Speakers

Prof. Cewu Lu is a Professor at Shanghai Jiao Tong University (SJTU) and an Adjunct Professor at the Shanghai Institute for Advanced Study of Zhejiang University (SIAS). Before joining SJTU, he was a research fellow at Stanford University, working in the groups of Prof. Fei-Fei Li and Prof. Leonidas J. Guibas. He was previously a Research Assistant Professor at the Hong Kong University of Science and Technology, working with Prof. Chi Keung Tang. He received his PhD from The Chinese University of Hong Kong, supervised by Prof. Jiaya Jia. He was one of the core technical members of the Stanford-Toyota autonomous car project. Some of his proposed algorithms have been adopted as basic tool functions in OpenCV (such as decolor.cpp). He won the Best Paper Award at Non-Photorealistic Animation and Rendering (NPAR) 2012 and authored the most cited paper among all SIGGRAPH papers of the past 5 years. He serves as an Associate Editor for Gate to Computer Vision and Pattern Recognition and as a reviewer for TPAMI and IJCV. His research interests lie mainly in computer vision, deep learning, deep reinforcement learning and robotic vision.

Title: Bridging Isolated Islands in Human Activity Understanding

Abstract: As a vital step toward intelligent agents, action understanding has recently attracted a lot of attention and achieved notable success. The task can be formulated as a mapping from the physical space of human actions to a semantic space. However, due to the complexity of action patterns and semantic ambiguity, great challenges remain. On the physical-space side, multi-modal methods have been proposed to extract representative features that facilitate recognition. But few efforts have been made in the design of the semantic space. Usually, researchers build datasets with action classes defined by idiosyncratic choices and then develop methods to push the envelope on each dataset separately. As a result, these datasets are incompatible with each other because of semantic gaps and differing action class granularities, e.g., "do housework" in dataset A and "wash plate" in dataset B. We call this the "isolated islands" (I^2) problem. It poses a great challenge to general action understanding, because these "isolated islands" with semantic gaps cannot support unified training. We argue that a more principled and complete semantic space is urgently needed to concentrate the efforts of the community and to enable the use of all existing multi-modal datasets in pursuit of general and effective action understanding. To this end, we propose a novel path to reshape the action understanding paradigm. In detail, we redesign a structured semantic space based on a verb taxonomy hierarchy that covers a massive number of verbs. By aligning the classes of existing datasets to our structured space, we can merge all image/video/skeleton/MoCap action datasets into the largest database so far, with a unified semantic label system. Accordingly, we propose a bidirectional mapping framework, Sandwich, which uses multi-modal data with unified labels to bridge the physical and semantic spaces of actions. In extensive experiments, our framework shows great potential for future action studies and significant superiority over the canonical paradigm, especially on few/zero-shot action learning tasks with semantic analogy, thanks to the verb structure knowledge and our data coverage.

Dr. Basura Fernando is a research scientist at the Institute of High Performance Computing (IHPC) and the Centre for Frontier AI Research (CFAR) of the Agency for Science, Technology and Research (A*STAR), Singapore. He is a recipient of the National Research Foundation (Singapore) Fellowship (NRF-F-2022) and a visiting PhD supervisor at The University of Edinburgh. He was an honorary lecturer at the Australian National University (ANU) and a research fellow at the Australian Centre for Robotic Vision (ACRV) at ANU. He obtained his PhD from the VISICS group of KU Leuven, Belgium, in 2015. His research interests are in computer vision and machine learning.

Title: Learning to Anticipate Human Actions from Videos

Abstract: Our ability to anticipate the behavior of others comes naturally to us. For example, an experienced driver can often predict the behavior of other road users. Similarly, a good table tennis player can estimate the direction of the ball just by observing the movements of the opponent. This phenomenon is called Action Anticipation: the ability to recognize the actions of others before they happen in the immediate future. This is natural to us, but developing a computational approach that does the same remains a challenge. It is critical to transfer this ability to computers so that robots can react quickly by anticipating human actions, as humans do. A robot's ability to understand what humans might do in the immediate future is important for the development of assistive robotics in domains such as manufacturing and healthcare. Current methods are primarily predictive, and there are not many successful stochastic approaches that handle the uncertainty of the future. We propose two solutions to address these challenges. In one approach, we develop a model that correlates past information with what might happen in the future. In the other, we develop methods that model the uncertainty using latent variable models.

Prof. Yebin Liu is a tenured professor in the Department of Automation, Tsinghua University. His research areas include computer vision, computer graphics and computational photography, with a focus on digital human representation, reconstruction and rendering. He has published more than 70 papers in top conferences and journals including CVPR, ICCV, ECCV, TPAMI, and TOG. He received the First Prize of the National Technology Invention Award in 2012. He was funded by the Distinguished Young Scholars program of NSFC in 2021.

Title: Neural Digital Human Avatar: Motion Capture, Reconstruction and Rendering

Abstract: Digital human avatars will serve as the future interface for human-AI interaction and holographic communication. They are also core infrastructure for the metaverse. Focusing on the key and cutting-edge components of digital human avatars, this talk will introduce neural methods for digital human 3D representation, motion capture, 3D reconstruction and neural rendering. To achieve high-quality human avatar reconstruction and rendering, traditional methods require extremely sophisticated capture setups, such as dense camera systems with hundreds of cameras. In this talk, we will introduce our recent neural methods for high-quality human avatar reconstruction and rendering using only 6 or 8 cameras.