Cross-modal understanding

Artificial intelligence links what it sees with what it reads and hears. By combining vision, text, and sound, it builds a richer understanding of the world than any single modality provides.

Image captioning – Generating text descriptions from images.
Visual question answering – Answering questions about images.
Multimodal embeddings – Shared representations across text, images, and audio.
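
To make the idea of a shared representation concrete, here is a minimal sketch of cross-modal retrieval. It assumes that images and text have already been mapped into the same vector space. The three-dimensional vectors, the names `image_embeddings` and `text_embedding`, and the helper `best_match` are all hypothetical and hand-written for illustration; a real system would obtain its vectors from trained encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings (assumption: not produced by any model).
# In practice these would come from trained image and text encoders.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.0, 0.2, 0.9],
}

# Stands in for the embedding of a caption such as "a dog on grass".
text_embedding = [0.8, 0.2, 0.1]


def best_match(query, candidates):
    """Return the name of the candidate vector most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


print(best_match(text_embedding, image_embeddings))  # prints "dog_photo"
```

Because both modalities live in one space, the same similarity comparison could be applied to an audio vector without changing the retrieval logic.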