Dark, Beyond Deep: Computer Vision with Humanlike Common Sense


Recent progress in deep learning is essentially based on a “big data for small tasks” paradigm, under which massive amounts of data are used to train a classifier for a single narrow task. In this talk, we call for a shift that flips this paradigm upside down: a “small data for big tasks” paradigm, wherein a single AI system is challenged to develop “common sense,” enabling it to solve a wide range of tasks with little training data. We illustrate the potential power of this new paradigm by reviewing models of common sense that synthesize recent breakthroughs in both machine and human vision. We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense. Taken as a unified concept, FPICU is concerned with the questions of “why” and “how,” beyond the dominant “what” and “where” framework for understanding vision. These domains are invisible at the level of pixels, yet they drive the creation, maintenance, and development of visual scenes; we therefore call them the “dark matter” of vision. Just as our universe cannot be understood by merely studying observable matter, we argue that vision cannot be understood without studying FPICU. We demonstrate the power of this perspective by showing how cognitive AI systems can observe and apply FPICU with little data to solve a wide range of challenging tasks, including tool use, planning, utility inference, and social learning. In summary, we argue that the next generation of AI must embrace “dark” humanlike common sense to solve novel tasks.


Dr. Yixin Zhu is an Assistant Professor at Peking University. He received his Ph.D. (’18) from UCLA, advised by Prof. Song-Chun Zhu. His research builds interactive AI by integrating high-level common sense (functionality, affordance, physics, causality, and intent) with raw sensory inputs (pixels and haptic signals) to enable richer representations and abstract reasoning about objects, scenes, shapes, numbers, and agents. He is a co-organizer of the Vision Meets Cognition (FPIC) workshops, the 3D Scene Understanding for Vision, Graphics, and Robotics workshops, and the Virtual Reality Meets Physical Reality workshops. During his Ph.D. and postdoctoral studies, his work was supported by DARPA MSEE, DARPA SIMPLEX, DARPA XAI, ONR MURI, and ONR Cognitive Systems for Human-Machine Teaming.