Tsu-Jui is currently a Ph.D. student at UCSB CS, advised by William Wang. His research interests lie in natural language processing (NLP) and computer vision (CV). He works primarily on language grounding and on interacting with environments through language, and he is also interested in information extraction and video analysis. His research goal is to bridge the gap between vision and language through AI systems.
Vision-and-language tasks, which require an agent to work across multi-modal sources (e.g., vision and language), are more practical and closer to people's daily scenarios. Data scarcity is a significant issue for vision-and-language tasks, as it is challenging to collect large-scale multi-modal examples. However, people can still accomplish these tasks even when presented with an unfamiliar situation. This ability results from counterfactual thinking, the ability to think about alternatives to events that have already happened. We incorporate counterfactual thinking into the language-based image editing (LBIE) task and the vision-and-language navigation (VLN) task. We show how to apply counterfactual reasoning under data scarcity and make it more effective for vision-and-language tasks.