Joint visual-textual modeling for multimodal content classification, retrieval and generation