Question answering over real-world multi-modal personal collections, e.g., photo albums with visual, textual, time, and location information.