Cross-Modal Understanding
An artificial intelligence system links what it sees with what it reads and hears, combining vision, text, and sound to build a deeper understanding of the world.
- Image captioning – Generating text descriptions of images.
- Visual question answering – Answering natural-language questions about images.
- Multimodal embeddings – Joint vector spaces across text, vision, and audio.
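
To make the idea of a joint vector space more concrete, here is a minimal sketch in Python. It assumes each modality has already been encoded into the same three-dimensional space; the vectors, labels, and the helper `nearest_text` are hypothetical and hand-written for illustration, not the output of any real model. The sketch only shows how cosine similarity can match an image or an audio clip to the closest text once they share a space.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical text embeddings (hand-written toy values, not model output).
text_embeddings = {
    "a dog barking": [0.9, 0.1, 0.0],
    "a piano melody": [0.0, 0.2, 0.9],
    "a city skyline": [0.1, 0.9, 0.1],
}

# Hypothetical embeddings from other modalities, assumed to live in the
# same space as the text embeddings above.
image_embedding = [0.8, 0.2, 0.1]   # stands in for a photo of a dog
audio_embedding = [0.1, 0.1, 0.95]  # stands in for a piano recording


def nearest_text(query, candidates):
    """Return the candidate label whose vector is most similar to the query."""
    return max(
        candidates,
        key=lambda label: cosine_similarity(query, candidates[label]),
    )


print(nearest_text(image_embedding, text_embeddings))
print(nearest_text(audio_embedding, text_embeddings))
```

With these toy vectors the image is matched to "a dog barking" and the audio clip to "a piano melody". In a real system the vectors would come from trained encoders for each modality, and the space would have hundreds of dimensions rather than three.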