Module 05 — Audio-Visual

⏱️ 5 minutes

The Reality Check

The big AI platforms promise everything in one place — images, video, audio, all inside the chat. Your Agent works differently. It is text-based at heart and orchestrates specialized tools instead of doing everything itself.

🎬 The Orchestration Model

Your Agent does NOT:

Your Agent DOES:

Why this matters: Your Agent knows your project's context, history, and goals. It crafts specific, informed prompts that generic AI cannot generate.
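To make the idea concrete, here is a minimal sketch of how project context can turn a bare subject into a specific prompt. It is an illustration only, not how your Agent is actually implemented; `ProjectContext`, `build_image_prompt`, and all field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectContext:
    """Hypothetical container for what an agent already knows about a project."""
    audience: str
    visual_style: str
    avoid: List[str] = field(default_factory=list)


def build_image_prompt(subject: str, ctx: ProjectContext) -> str:
    """Combine a bare subject with project context into one specific prompt."""
    parts = [
        subject.strip(),
        f"for {ctx.audience}",
        f"in a {ctx.visual_style} style",
    ]
    prompt = ", ".join(parts)
    # Only mention exclusions when the project actually has some.
    if ctx.avoid:
        prompt += ". Avoid: " + ", ".join(ctx.avoid)
    return prompt


# Sample values are made up for illustration.
ctx = ProjectContext(
    audience="first-time gardeners",
    visual_style="flat pastel illustration",
    avoid=["text overlays", "photorealism"],
)
print(build_image_prompt("a raised garden bed with seedlings", ctx))
```

A generic tool would only see "a raised garden bed with seedlings". The context-aware version adds audience, style, and exclusions without you having to repeat them each time.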

Images: What Actually Works

✅ Your Agent CAN Do This With Images

❌ Your Agent CANNOT Do This

Image Generation Platforms

Your Agent writes prompts; these platforms generate the visuals:

Cost note: These platforms typically charge credits per generation, and prices vary. Ask your agent to research options and talk you through setup.

Audio: Input, Output, and Production

Audio adds a powerful new layer to your agent. It can be used in two primary ways: real-time interaction (talking directly with your agent) and production output (creating polished audio for content like audiobooks, voiceovers, and training materials).

Two Core Audio Use Cases

Voice Input (Speaking to Your Agent):
The Nerve interface currently has a voice option for speaking to your agent. It supports native voice messages that are automatically transcribed into text for your agent. WhatsApp is not currently supported.

For Discord, native speech-to-text (microphone-to-text) support is in development. Until it is released, you could try running Heyron on your iPhone and using the built-in microphone if you want to speak directly to your agent. ElevenLabs is less suited to this use.

Audio Capabilities (via API Integrations)

Multiple providers support these features (e.g., ElevenLabs, Play.ht, Murf, AWS Polly). Your agent can guide you through selecting and connecting the best option for your use case. We know the ElevenLabs integration with Heyron works.

Pricing: Most platforms offer free tiers for testing, with paid plans for higher volume or production use.

Getting Started: Ask your agent something like: "Help me set up ElevenLabs" or "Connect a text-to-speech provider," and it will walk you through the process step-by-step.
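For readers curious what a text-to-speech connection looks like under the hood, here is a rough sketch of a single request to ElevenLabs. The endpoint path, the `xi-api-key` header, and the `eleven_multilingual_v2` model ID reflect ElevenLabs' public REST API as I understand it, so check them against the current documentation. The function names, `YOUR_VOICE_ID`, and `YOUR_API_KEY` are placeholders, and this is not necessarily how the Heyron integration itself is built.

```python
import json
import urllib.request

# Assumed endpoint; verify against the current ElevenLabs API reference.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Return (url, headers, body) for a text-to-speech call without sending it."""
    url = f"{API_BASE}/{voice_id}"
    headers = {
        "xi-api-key": api_key,
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
    }
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return url, headers, body


def synthesize_to_file(text, voice_id, api_key, out_path):
    """Send the request and save the returned audio bytes (makes a network call)."""
    url, headers, body = build_tts_request(text, voice_id, api_key)
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


# Building the request is free and offline; only synthesize_to_file uses credits.
url, headers, body = build_tts_request(
    "Welcome to Module 05.", "YOUR_VOICE_ID", "YOUR_API_KEY"
)
print(url)
```

Splitting "build" from "send" lets you inspect exactly what would be submitted before spending any credits. In practice your agent handles this step for you once a provider is connected.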

Video: The Studio Workflow Option

A common misconception is that your Agent can watch YouTube links or analyze video content natively. It cannot. Video requires a different workflow that relies on outside apps.

ElevenLabs Studio 3.0

For video creation and editing, ElevenLabs offers Studio 3.0 — an AI-powered video editor that integrates with their audio models.

What Studio 3.0 does:

Plus: Access to video generation models (Veo, Sora, Kling) for creating footage from prompts. Studio 3.0 does not generate video natively; it relies on these models.

Typical Video Workflow:

Your Agent writes script → ElevenLabs Studio generates voiceover + adds to timeline → You upload footage or generate video → Your Agent reviews and suggests edits → Platform renders final output

The Cost Reality

What Sometimes Costs Money

Money-saving tip: Your Agent can optimize prompts to reduce waste. Better prompts = fewer attempts = lower cost. Ask: "Make this prompt more specific so we get it right the first time."
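The arithmetic behind that tip can be sketched in a few lines. All numbers here are hypothetical (real credit prices and attempt counts vary by platform), and `generation_cost` is an illustrative helper, not part of any platform.

```python
def generation_cost(final_assets, attempts_per_asset, credits_per_attempt):
    """Total credits spent when every attempt, kept or discarded, is billed."""
    return final_assets * attempts_per_asset * credits_per_attempt


# Hypothetical scenario: 10 final images at 4 credits per generation.
vague = generation_cost(final_assets=10, attempts_per_asset=5, credits_per_attempt=4)
specific = generation_cost(final_assets=10, attempts_per_asset=2, credits_per_attempt=4)

print(vague, specific, vague - specific)  # prints: 200 80 120
```

In this made-up scenario, cutting the average from five attempts to two saves 120 credits on the same ten images.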

Bottom Line

Your Agent is the strategic partner who knows your project. External platforms are the production tools that execute. Your Agent bridges that gap — turning your goals into specific, actionable prompts.

Remember: Your Agent brings context and knowledge; specialized tools bring production capability. Together, they get you better results than either alone.