Situated Language Grounding for Multimodal AI Assistant Modeling