[arXiv22] Understanding Embodied Reference with Touch-Line Transformer