How to Optimize for Multimodal Search (Image, Audio, and Video SEO)

While most SEOs are focusing on AI Overviews and LLM citations, multimodal AI in search is becoming just as important and shouldn’t be ignored.

Liz Reid's keynote at Google I/O
During Google I/O, Liz Reid, along with engineers who spoke after her keynote, noted that users are increasingly interacting with search through complex, multi-step queries and expecting results that include images, step-by-step visuals, or embedded video.

What is Multimodal AI?

Multimodal AI refers to systems that can process and understand multiple types of input, such as text, images, and audio, within a single framework. Search is rapidly evolving in this direction, and the implications for SEO and content strategy are becoming significant.

History of Multimodal Search

Google image search homepage
Most people are already familiar with traditional multimodal search, which uses images or voice as input for search queries. A common example is image-based product search: a user uploads a photo of a product and asks Google to find the item. Google’s image search algorithm can identify the brand, model, or category without requiring a text-based description.

According to a recent Google press conference, image search alone now accounts for approximately 25 billion searches per month, and around 20% of those are purchase-driven queries.

The Rise of Multimodal AI

Now, multimodal AI has advanced significantly beyond traditional image search.

Last month, Google expanded its Search Live feature, previously available in the U.S., to 200+ countries and territories where AI Mode is already available. Search Live lets users interact with AI using voice input through the mic on their smartphone, enabling real-time conversational search.

The use cases go way beyond simple questions. If someone is assembling IKEA furniture, they can activate the camera, show the unassembled parts, and receive step-by-step instructions through an interactive conversation. They can also ask follow-up questions when something is unclear, making the interaction continuous and adaptive.

Voice-based search explanations are also being developed. Google has already been testing podcast-style search result experiences, where users can listen to explanations of complex topics such as “how déjà vu works and its relationship to memory.” This suggests a future where search results can be consumed as audio narratives.

Image SEO

With these developments, SEO is expanding beyond text-only optimization. In this next section, I will explain how to optimize different content types, including images, video, and audio, for multimodal SEO.

Starting with image SEO, traditional best practices are still highly relevant. These include using descriptive alt text and placing images near relevant body text. Google does not interpret images in isolation. It also analyzes the surrounding content, including nearby text and captions, to better understand an image’s meaning. As a result, placing images within contextually relevant sections is generally more effective than inserting unrelated visuals into a page.
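As a rough illustration of the alt text check described above, the following sketch scans a page's HTML for images whose alt attribute is missing or empty. It uses only Python's standard library; the class name, function name, and sample markup are hypothetical and chosen for this example, not taken from any Google tool.

```python
from html.parser import HTMLParser


class AltTextAudit(HTMLParser):
    """Collect <img> tags whose alt attribute is missing or empty."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []  # src values of images lacking alt text

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # A valueless or absent alt attribute is treated as empty.
        alt = (attributes.get("alt") or "").strip()
        if not alt:
            self.missing_alt.append(attributes.get("src", "(no src)"))


def find_images_missing_alt(html: str) -> list[str]:
    """Return the src of every image without descriptive alt text."""
    audit = AltTextAudit()
    audit.feed(html)
    return audit.missing_alt


# Hypothetical page fragment: one well-described image, two without alt text.
sample_html = """
<h2>Assembling the bookshelf</h2>
<p>Attach the side panels to the base before adding shelves.</p>
<img src="/img/side-panel.jpg" alt="Side panel being screwed onto the bookshelf base">
<img src="/img/IMG_0042.jpg">
<img src="/img/spacer.png" alt="">
"""

print(find_images_missing_alt(sample_html))
```

An empty alt attribute can be a deliberate choice for purely decorative images, so the output is best read as a review list rather than a list of errors.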

Image quality is also becoming increasingly important, not only for user experience but potentially for search visibility as well. Google has indicated that high-quality images are more appealing in search results. You may have already noticed that Google Image Search surfaces far fewer low-quality images today than it did several years ago. This principle extends beyond photographs to diagrams and technical visuals, such as system architecture diagrams and flowcharts, which should be designed for clarity and ease of understanding for both users and AI systems.

From a technical perspective, ensuring that images can be properly crawled and indexed remains important. This may involve implementing image sitemaps, using supported image formats, and verifying that images are accessible to search engines.
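To make the image sitemap idea more concrete, here is a minimal sketch that builds one with Python's standard library. It assumes the sitemap image extension namespace (http://www.google.com/schemas/sitemap-image/1.1) with its image:image and image:loc elements; the function name and example.com URLs are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages: dict[str, list[str]]) -> str:
    """Build sitemap XML listing each page together with its images.

    `pages` maps a page URL to the image URLs that appear on that page.
    """
    # Use the sitemap namespace as the default and "image" as the prefix.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url

    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs for illustration only.
example_pages = {
    "https://example.com/guides/bookshelf": [
        "https://example.com/img/side-panel.jpg",
        "https://example.com/img/finished-bookshelf.jpg",
    ],
}

print(build_image_sitemap(example_pages))
```

The generated file still needs to be published and referenced (for example from robots.txt or Search Console) before a search engine can read it, and images blocked by robots.txt will not be indexed regardless of the sitemap.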

Audio and Video SEO

In a podcast interview, Google VP of Search Liz Reid indicated that LLMs are now capable of deeply understanding audio and video content. Previously, search algorithms relied mainly on metadata such as titles, descriptions, and transcripts. Newer systems, however, can analyze the actual content of audio and video directly, including tone of voice, facial expressions, and the target audience a speaker is addressing. This means that multimedia content is no longer treated as a file accompanied by text metadata, but as fully interpretable content.

Another emerging development is the improvement of key moments in video search. AI is expected to automatically identify meaningful segments in videos without manual chapter tagging. For example, instead of indexing a 30-minute video as a single unit, search systems may surface specific timestamps such as “the exact action demonstrated at 3:15.” This would significantly improve content discoverability and precision in video search.
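Until automatic segment detection is reliable, key moments can still be declared by hand. The sketch below turns a list of timestamps such as "3:15" into schema.org Clip entries nested under a VideoObject via hasPart. The helper names, the example.com URLs, and the `?t=` query parameter are assumptions made for illustration; the parameter that actually seeks to a timestamp depends on the video player in use.

```python
import json


def timestamp_to_seconds(timestamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" into a number of seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def build_clips(video_url: str, moments: list[tuple[str, str]]) -> list[dict]:
    """Create one schema.org Clip per (label, start timestamp) pair."""
    clips = []
    for label, start in moments:
        offset = timestamp_to_seconds(start)
        clips.append({
            "@type": "Clip",
            "name": label,
            "startOffset": offset,
            # Assumed deep-link format; adjust to the player being used.
            "url": f"{video_url}?t={offset}",
        })
    return clips


# Placeholder video details for illustration only.
video_url = "https://example.com/videos/bookshelf-assembly"
video = {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "Bookshelf assembly walkthrough",
    "thumbnailUrl": "https://example.com/img/bookshelf-thumb.jpg",
    "uploadDate": "2025-01-15",
    "hasPart": build_clips(video_url, [
        ("Unboxing the parts", "0:00"),
        ("Attaching the side panels", "3:15"),
    ]),
}

print(json.dumps(video, indent=2))
```

The resulting JSON would typically be embedded in the page inside a script tag of type application/ld+json.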

These advancements do not mean that transcripts, metadata, or structured data are no longer needed. The systems are still evolving, and text-based signals remain essential for training and supporting multimodal understanding. For audio content, written summaries and transcripts continue to play an important role in discoverability. Traditional SEO best practices for video content also still apply, such as clear titles, detailed descriptions, and schema implementation. These help ensure that content remains accessible and properly indexed as multimodal systems evolve.
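As one possible shape for the schema implementation mentioned above, this sketch assembles schema.org VideoObject markup that carries a title, description, and transcript alongside the video file. The property names (name, description, thumbnailUrl, uploadDate, duration, contentUrl, transcript) come from schema.org; the function names and all values are placeholders invented for this example.

```python
import json


def iso8601_duration(total_seconds: int) -> str:
    """Format a length in seconds as an ISO 8601 duration, e.g. PT30M."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    duration = "PT"
    if hours:
        duration += f"{hours}H"
    if minutes:
        duration += f"{minutes}M"
    if seconds or duration == "PT":
        duration += f"{seconds}S"
    return duration


def video_json_ld(title: str, description: str, content_url: str,
                  thumbnail_url: str, upload_date: str,
                  length_seconds: int, transcript: str) -> str:
    """Return a JSON-LD script tag describing one video."""
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": title,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        "uploadDate": upload_date,
        "duration": iso8601_duration(length_seconds),
        "contentUrl": content_url,
        "transcript": transcript,
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values for illustration only.
print(video_json_ld(
    title="Bookshelf assembly walkthrough",
    description="A step-by-step guide to assembling a flat-pack bookshelf.",
    content_url="https://example.com/videos/bookshelf-assembly.mp4",
    thumbnail_url="https://example.com/img/bookshelf-thumb.jpg",
    upload_date="2025-01-15",
    length_seconds=1800,
    transcript="Start by laying out all the parts on the floor...",
))
```

Which properties a given search engine requires or recommends can change, so the output is worth validating against that engine's current structured data documentation.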


Multimodal AI is already here and rapidly reshaping how information is searched and consumed. Audio and video content are becoming significantly more accessible, and language barriers are increasingly reduced through real-time translation and AI-generated interpretation. In the future, users may no longer need to watch entire videos or read full articles. Instead, they will be able to access precisely the segment or information they need.

From an SEO perspective, relying solely on text is no longer sufficient. A more robust strategy should incorporate images, and ideally audio and video content as well, with metadata and structured data properly implemented for each. That said, AI is evolving quickly, and the importance of traditional metadata may change over time. As search continues to change, multimodal understanding will be a trend worth watching closely.

Posted by Rei Wakayama