Cross-modal understanding
Artificial intelligence links what it sees with what it reads and hears. By combining vision, text, and sound, it builds a richer understanding of the world than any single modality provides on its own.
- Image captioning – Generating text descriptions from images.
- Visual question answering – Answering questions about images.
- Multimodal embeddings – Shared representations across text, images, and audio.
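
To make the idea of a shared representation concrete, here is a minimal sketch of how text and images can be compared once they live in the same vector space. The three-dimensional vectors, the captions, and the `best_caption` helper are all made up for illustration. In a real system the vectors would come from trained text and image encoders, which are not shown here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made text embeddings (assumption: a real system
# would produce these with a trained text encoder).
text_embeddings = {
    "a dog playing fetch": [0.9, 0.1, 0.0],
    "a city street at night": [0.1, 0.9, 0.2],
    "a piano concert": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a dog photo, placed in the same space
# as the text embeddings above.
image_embedding = [0.8, 0.2, 0.1]


def best_caption(query, candidates):
    """Pick the caption whose embedding is most similar to the query."""
    return max(
        candidates,
        key=lambda caption: cosine_similarity(query, candidates[caption]),
    )


print(best_caption(image_embedding, text_embeddings))
```

Because both modalities share one space, the same similarity function serves retrieval in either direction: an image can be matched to candidate captions, or a text query to candidate images.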