Working with Audio & Video

Prompting strategies for AI models that process audio and video content.

The newest frontier in AI prompting involves audio and video. Models like Gemini can natively process video, GPT-4o supports voice conversations, and specialized models handle music generation, sound effects, and video creation. While these capabilities are still maturing, understanding how to prompt them effectively gives you a significant head start.

Audio AI falls into several categories: speech-to-text (transcription), text-to-speech (voice generation), audio analysis, and music generation. Each requires different prompting approaches.

Meeting Transcript Analyzer

Extracts structured insights from meeting transcripts.

Here is a transcript from a [MEETING TYPE] meeting:

[PASTE TRANSCRIPT]

Analyze this transcript and provide:
1. Executive summary (3-5 sentences)
2. Key decisions made (bulleted list)
3. Action items with owners and deadlines
4. Open questions or unresolved issues
5. Participant sentiment (were there disagreements? consensus?)
6. Suggested follow-up agenda items

Ignore small talk and focus on substantive content.
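
If you reuse this template across many meetings, filling the placeholders in code keeps the wording identical each time. The following is a minimal sketch in Python; the names `MEETING_TEMPLATE` and `build_meeting_prompt` are illustrative and not part of any library, and the returned string still has to be sent to whichever model you use.

```python
# Hypothetical helper: fills the two placeholders of the template above.
MEETING_TEMPLATE = """Here is a transcript from a {meeting_type} meeting:

{transcript}

Analyze this transcript and provide:
1. Executive summary (3-5 sentences)
2. Key decisions made (bulleted list)
3. Action items with owners and deadlines
4. Open questions or unresolved issues
5. Participant sentiment (were there disagreements? consensus?)
6. Suggested follow-up agenda items

Ignore small talk and focus on substantive content."""


def build_meeting_prompt(meeting_type: str, transcript: str) -> str:
    """Return the filled template; reject empty input so a blank prompt is never sent."""
    if not meeting_type.strip() or not transcript.strip():
        raise ValueError("meeting_type and transcript must be non-empty")
    return MEETING_TEMPLATE.format(
        meeting_type=meeting_type.strip(),
        transcript=transcript.strip(),
    )
```

For example, `build_meeting_prompt("sprint planning", transcript_text)` produces a prompt whose first line reads "Here is a transcript from a sprint planning meeting:".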

Gemini's ability to process video directly opens up powerful use cases: analyzing product demos, reviewing presentations, extracting information from video tutorials, and more. When prompting with video, specificity about time ranges and what to look for is crucial since videos can contain enormous amounts of information.

  • Specify time ranges: "Analyze the section from 2:30 to 5:00 where the speaker discusses pricing."
  • Tell the model what to focus on: visuals, speech, text on screen, body language, or all of the above.
  • Ask for timestamped output: "List each key point with its timestamp."
  • For long videos, ask for a chapter-by-chapter summary first, then drill into specific sections.
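
The tips above can be combined into a small prompt builder. This is a sketch under stated assumptions: `to_seconds`, `build_video_prompt`, and `CHAPTER_PROMPT` are hypothetical names, the wording simply reuses the example phrases from the list, and attaching the video itself is left to the model's own API.

```python
def to_seconds(timestamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" into a number of seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def build_video_prompt(start: str, end: str, topic: str, focus: list[str]) -> str:
    """Compose a prompt naming a time range, the aspects to watch, and timestamped output."""
    if to_seconds(start) >= to_seconds(end):
        raise ValueError("start must come before end")
    if not focus:
        raise ValueError("name at least one aspect to focus on")
    return (
        f"Analyze the section from {start} to {end} where the speaker discusses {topic}.\n"
        f"Focus on: {', '.join(focus)}.\n"
        "List each key point with its timestamp."
    )


# First pass for long videos: get the chapter map, then drill into one chapter
# with build_video_prompt using the timestamps the model returned.
CHAPTER_PROMPT = (
    "Summarize this video chapter by chapter. "
    "Give each chapter a title and its start and end timestamps."
)
```

Calling `build_video_prompt("2:30", "5:00", "pricing", ["speech", "text on screen"])` yields a three-line prompt. The range check catches swapped timestamps before the request is sent.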

When generating speech with services like ElevenLabs or OpenAI's TTS models, the text you provide is the prompt. Formatting matters: punctuation controls pacing, capitalization can affect emphasis, and, where the model supports it, explicit direction about tone and speed improves results.

Flat TTS input: "Welcome to our podcast. Today we're going to talk about artificial intelligence and its impact on business."

Optimized TTS input: "Welcome to our podcast! Today... we're diving into artificial intelligence — and its game-changing impact on business."

Audio and video AI capabilities are evolving quickly. Models that couldn't process video at all in 2023 can now analyze hour-long recordings. Stay current with model capabilities.

Music generation models like Suno and Udio accept text prompts describing the desired music. Effective music prompts specify genre, tempo, mood, instruments, and structure — similar to the descriptive approach used in image generation.
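
One way to make sure a music prompt covers all five elements is to collect them in a small structure before joining them into text. This is a sketch only: `MusicBrief` and its output format are assumptions for illustration, and neither Suno nor Udio requires this particular phrasing.

```python
from dataclasses import dataclass


@dataclass
class MusicBrief:
    """Hypothetical container for the five elements named above."""
    genre: str
    tempo: str
    mood: str
    instruments: list[str]
    structure: str

    def to_prompt(self) -> str:
        """Join the fields into one descriptive line of text."""
        return (
            f"{self.genre}, {self.tempo}, {self.mood} mood, "
            f"featuring {', '.join(self.instruments)}; "
            f"structure: {self.structure}"
        )


brief = MusicBrief(
    genre="lo-fi hip hop",
    tempo="around 80 BPM",
    mood="relaxed",
    instruments=["electric piano", "vinyl crackle"],
    structure="short intro, two verses, fade-out",
)
print(brief.to_prompt())
# lo-fi hip hop, around 80 BPM, relaxed mood, featuring electric piano, vinyl crackle; structure: short intro, two verses, fade-out
```

Because every field is required, leaving out an element such as tempo fails at construction time instead of producing a vague prompt.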

Prompt Templates

Podcast Episode Analyzer

Extracts structured insights and key moments from podcast content.

Analyze this podcast transcript/audio and provide:

1. Episode summary (1 paragraph)
2. Key topics discussed, with timestamps where the source provides them
3. Notable quotes (with speaker attribution)
4. Main arguments or claims made (with supporting points)
5. Actionable takeaways for the listener
6. Related topics the listener might explore next

Transcript:
[PASTE TRANSCRIPT]

Video Tutorial to Notes

Converts video tutorial content into structured, reusable notes.

Convert this video tutorial transcript into structured learning notes:

[PASTE TRANSCRIPT]

Create:
1. Title and overview (what this tutorial teaches)
2. Prerequisites (what the viewer should already know)
3. Step-by-step instructions (numbered, with key details)
4. Tips and warnings mentioned by the instructor
5. Summary of key concepts
6. Practice exercises based on the content

Organize the notes by topic rather than by the order things were said, but keep the steps within each topic in sequence.

Key Takeaways

  • Audio and video AI capabilities are maturing rapidly — invest in learning them now
  • Meeting transcript analysis is one of the highest-value audio AI use cases today
  • When analyzing video, always specify time ranges and what aspects to focus on
  • TTS quality depends heavily on how you format and punctuate your text input
  • Music generation prompts follow the same "be descriptive" principle as image prompts