Contributors:
- Chao Xu (Affordance)
- Tengyu Liu (HOI)
- Zeyu Zhang (Functionality)
Reading list
Legend: survey/review/perspective, paper, book, GitHub
Required - Affordance in General
- Visual Affordance and Function Understanding: A Survey, ACM Computing Surveys 2021
- Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense (Section 5), Engineering 2020
Required - Scene Affordance
- From 3D Scene Geometry to Human Workspace, CVPR 2011
- Inferring Forces and Learning Human Utilities From Videos, CVPR 2016
Required - Object Affordance
- Deep Affordance Foresight: Planning Through What Can Be Done in the Future, ICRA 2021
- Act the Part: Learning Interaction Strategies for Articulated Object Part Discovery, ICCV 2021
Required - HOI
- Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, CVPR 2010
- Holistic++ Scene Understanding: Single-view 3D Holistic Scene Parsing and Human Pose Estimation with Human-Object Interaction and Physical Commonsense, ICCV 2019
- Reconstructing Hand-Object Interactions in the Wild, ICCV 2021
- Compositional Learning for Human Object Interaction, ECCV 2018
- Exploiting Relationship for Complex-scene Image Generation, AAAI 2021
Required - Functionality
- Scene Parsing by Integrating Function, Geometry and Appearance Models, CVPR 2013
- Human-centric Indoor Scene Synthesis Using Stochastic Grammar, CVPR 2018
- Make it Home: Automatic Optimization of Furniture Arrangement, SIGGRAPH 2011
Optional - Affordance in General
- The Ecological Approach to Visual Perception, Boston: Houghton Mifflin (1979)
- Understanding Context: Environment, Language, and Information Architecture (Chapter 4), O’Reilly Media, Inc. (2014)
Optional - Scene Affordance
- Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image, ECCV 2018
- A Multi-Scale CNN for Affordance Segmentation in RGB Images, ECCV 2016
- EGO-TOPO: Environment Affordances from Egocentric Video, CVPR 2020
- People Watching: Human Actions as a Cue for Single View Geometry, ECCV 2012
- Binge Watching: Scaling Affordance Learning from Sitcoms, CVPR 2017
- Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments, CVPR 2019
Optional - Object Affordance
- Reasoning about Object Affordances in a Knowledge Base Representation, ECCV 2014
- O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning, CoRL 2021
- Hallucinated humans: Learning latent factors to model 3D environments, Diss. Cornell University, 2015
- Long-Horizon Manipulation of Unknown Objects via Task and Motion Planning with Estimated Affordances, arXiv preprint arXiv:2108.04145
- 3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding, CVPR 2021
Optional - HOI
- Hand-Object Contact Consistency Reasoning for Human Grasps Generation, ICCV 2021
- Synthesizing Diverse and Physically Stable Grasps With Arbitrary Hand Structures Using Differentiable Force Closure Estimator, RA-L 2021
- Modeling 4D Human-Object Interactions for Event and Object Recognition, CVPR 2013
- Detecting and Recognizing Human-Object Interactions, CVPR 2018
- Learning Human-Object Interactions by Graph Parsing Neural Networks, ECCV 2018
- HAKE: Human Activity Knowledge Engine, arXiv preprint arXiv:1904.06539
- Detailed 2D-3D Joint Representation for Human-Object Interaction, CVPR 2020
- Jointly Recognizing Object Fluents and Tasks in Egocentric Videos, ICCV 2017
- HOI Learning List
Optional - Functionality
- Recognition of natural scenes from global properties: Seeing the forest without representing the trees, Cognitive Psychology 2009
- Shape2Pose: Human-Centric Shape Analysis, SIGGRAPH 2014
- What Can I Do Around Here? Deep Functional Scene Understanding for Cognitive Robots, ICRA 2017
- Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope, IJCV 2001
- Understanding Bayesian rooms using composite 3D object models, CVPR 2013
- Action Genome: Actions as Composition of Spatio-temporal Scene Graphs, CVPR 2020
- ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, AAAI 2017
- Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars, IJCV 2018
Essay
“The picture above is funny. But for me it is also one of those examples that make me sad about the outlook for AI and for Computer Vision. What would it take for a computer to understand this image as you or I do? I challenge you to think explicitly of all the pieces of knowledge that have to fall in place for it to make sense. … I hate to say it but the state of CV and AI is pathetic when we consider the task ahead, and when we think about how we can ever go from here to there. The road ahead is long, uncertain and unclear. … In any case, we are very, very far and this depresses me. What is the way forward?”
The image above was taken in 2010, and the comment was written in 2012. AI has advanced significantly since then, and I wonder whether the comment still holds true today.
Please review the relevant literature and write an essay on how to make AI understand the picture above. Keep in mind that, in the blog post quoted above, Karpathy gives a long (but not exhaustive) list of things an algorithm must understand to get the joke. You may find that list useful for your essay. Your analysis should be holistic and should include both a review of existing work and possible future directions.