Human-robot collaboration is an essential research topic in Artificial Intelligence (AI): it enables researchers to devise cognitive AI systems and affords users an intuitive means of interacting with robots. Communication plays a central role in this collaboration. To date, prior studies in embodied agent navigation have demonstrated only that human language facilitates communication through natural language instructions; many other forms of communication remain unexplored. In fact, human communication originated in gestures and is often delivered through multimodal cues. To fill in this missing dimension of communication in embodied agent navigation, we propose investigating the effects of using gestures, instead of verbal cues, as the communicative interface. Specifically, we develop a VR-based 3D simulation environment, named Gesture-based THOR (GesTHO), based on the AI2-THOR platform. In this virtual environment, a human player is placed in the same virtual scene as the artificial agent and shepherds it using only gestures. The agent is tasked with solving the navigation problem guided by gestures with unknown semantics. We argue that learning the semantics of instructional gestures and learning the navigation task are mutually beneficial. In experiments, we demonstrate that human gesture cues improve object-goal navigation for an embodied agent, outperforming various state-of-the-art methods.