Audio Transcription: /v1/audio/transcriptions Limitations & Large Payload Handling
You can use the POST /v1/audio/transcriptions endpoint to convert audio files into text. It mirrors the OpenAI Whisper API shape, but performs format validation and file-size checks in the local gateway, then sends the audio as inlineData in the upstream Gemini request.
What You'll Learn
- Use curl / the OpenAI SDK to call `POST /v1/audio/transcriptions` and convert audio to `{"text":"..."}`
- Understand the 6 supported audio formats and the real error behavior of the 15MB hard limit
- Know the default values and pass-through behavior for `model` / `prompt` (no guessing of upstream rules)
- Locate audio requests in Proxy Monitor and understand where `[Binary Request Data]` comes from
Your Current Pain Points
You want to convert meeting recordings, podcasts, or customer service calls into text, but often get stuck on:
- Different tools have different audio formats and interface conventions, making scripts and SDKs hard to reuse
- When uploads fail, you only see "Bad Request/Gateway Error" without knowing if it's a format issue or file too large
- You want to integrate transcription into Antigravity Tools' "local gateway" for unified scheduling and monitoring, but aren't sure about the actual compatibility level
🎒 Prerequisites
- You have completed Start Local Reverse Proxy & Connect First Client and know the reverse proxy port (this page uses `8045` as an example)
- You have completed Add Account and have at least 1 available account
What is the Audio Transcription Endpoint (/v1/audio/transcriptions)?
The audio transcription endpoint is an OpenAI Whisper-compatible route exposed by Antigravity Tools. Clients upload audio files using multipart/form-data, the server validates the extension and size, converts the audio to Base64 inlineData, calls upstream generateContent, and finally returns only a text field.
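The transformation the handler performs can be sketched in a few lines of Python. This is an illustrative reconstruction, not the gateway's actual Rust code; the payload shape follows the public Gemini `generateContent` REST API, with the prompt as the first text part and the audio as a Base64 `inlineData` part.

```python
import base64

def build_generate_content_payload(audio_bytes: bytes, mime_type: str, prompt: str) -> dict:
    """Sketch of how the handler wraps an uploaded file as inlineData for upstream."""
    return {
        "contents": [{
            "parts": [
                # The prompt form field (or its default) goes first.
                {"text": prompt},
                # The raw audio is Base64-encoded and inlined.
                {"inlineData": {
                    "mimeType": mime_type,
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                }},
            ]
        }]
    }

payload = build_generate_content_payload(
    b"\xff\xfb\x90\x00", "audio/mp3", "Generate a transcript of the speech."
)
```

Upstream then returns generated text, from which the gateway keeps only a `text` field for the client.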
Endpoint and Limits Overview
| Item | Conclusion | Code Evidence |
|---|---|---|
| Route | POST /v1/audio/transcriptions | src-tauri/src/proxy/server.rs registers route to handlers::audio::handle_audio_transcription |
| Supported formats | Detected by file extension: mp3/wav/m4a/ogg/flac/aiff(aif) | src-tauri/src/proxy/audio/mod.rs detect_mime_type() |
| File size | 15MB hard limit (returns 413 + text error message if exceeded) | src-tauri/src/proxy/audio/mod.rs exceeds_size_limit(); src-tauri/src/proxy/handlers/audio.rs |
| Reverse proxy overall body limit | Axum layer allows up to 100MB | src-tauri/src/proxy/server.rs DefaultBodyLimit::max(100 * 1024 * 1024) |
| Default parameters | model="gemini-2.0-flash-exp"; prompt="Generate a transcript of the speech." | src-tauri/src/proxy/handlers/audio.rs |
Follow Along
Step 1: Verify Gateway is Running (/healthz)
Why: first rule out basics such as a wrong port or a service that isn't running.
```bash
curl -s http://127.0.0.1:8045/healthz
```

Expected output: JSON like `{"status":"ok"}`.
Step 2: Prepare an Audio File No Larger Than 15MB
Why: the server performs the 15MB validation in the handler; exceeding it returns 413 directly.
```bash
ls -lh audio.mp3
```

```powershell
Get-Item audio.mp3 | Select-Object Length
```

Expected output: a file size of 15MB or less.
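If you script uploads, you can fail fast locally instead of waiting for the server's 413. A minimal sketch; `MAX_SIZE` mirrors the gateway's constant, and the function name is borrowed from the Rust `exceeds_size_limit()`:

```python
import os

MAX_SIZE = 15 * 1024 * 1024  # mirrors the gateway's 15MB hard limit

def exceeds_size_limit(size_bytes: int) -> bool:
    """Same check the handler applies server-side."""
    return size_bytes > MAX_SIZE

def precheck(path: str) -> None:
    """Raise before uploading if the file would be rejected with 413."""
    size = os.path.getsize(path)
    if exceeds_size_limit(size):
        raise ValueError(
            f"{path} is {size / (1024 * 1024):.1f} MB; the gateway accepts at most 15 MB"
        )
```
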
Step 3: Call /v1/audio/transcriptions with curl
Why: curl is the most direct way to verify the protocol shape and error messages first.
```bash
curl -sS -X POST http://127.0.0.1:8045/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=gemini-2.0-flash-exp" \
  -F "prompt=Generate a transcript of the speech."
```

Expected output: JSON containing only a `text` field.

```json
{
  "text": "..."
}
```

Step 4: Call with OpenAI Python SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8045/v1",
    api_key="your-proxy-api-key"  # if authentication is enabled
)

audio_file = open("audio.mp3", "rb")
transcript = client.audio.transcriptions.create(
    model="gemini-2.0-flash-exp",
    file=audio_file
)
print(transcript.text)
```

Expected output: `print(transcript.text)` prints a segment of transcribed text.
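Because error bodies from this endpoint are plain text while the success body is JSON, client code should branch on the status code before parsing. A minimal sketch, based only on the status/body shapes described on this page:

```python
import json

def parse_transcription_response(status: int, body: str) -> str:
    """Return the transcript, or raise with the gateway's plain-text error."""
    if status == 200:
        # Success body is JSON with a single "text" field.
        return json.loads(body)["text"]
    if status in (400, 413):
        # Format/size rejections are plain text, not JSON.
        raise RuntimeError(f"gateway rejected upload ({status}): {body}")
    raise RuntimeError(f"unexpected status {status}: {body}")
```
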
Supported Audio Formats
Antigravity Tools determines MIME type by file extension (not by sniffing file content).
| Format | MIME Type | Extensions |
|---|---|---|
| MP3 | audio/mp3 | .mp3 |
| WAV | audio/wav | .wav |
| AAC (M4A) | audio/aac | .m4a |
| OGG | audio/ogg | .ogg |
| FLAC | audio/flac | .flac |
| AIFF | audio/aiff | .aiff, .aif |
Unsupported Formats
If the extension is not in the table, it returns 400 with a text response body, e.g.: Unsupported audio format: txt.
Checkpoint ✅
- [ ] Response body is `{"text":"..."}` (no extra structure such as `segments` or `verbose_json`)
- [ ] Response headers include `X-Account-Email` (identifies the account actually used)
- [ ] The request record is visible on the "Monitor" page
Handling Large Payloads: Why You See 100MB but Still Hit the 15MB Limit
The server sets the request body limit to 100MB at the Axum layer (preventing some large requests from being directly rejected by the framework), but the audio transcription handler performs an additional 15MB validation.
In other words:

- `15MB < file <= 100MB`: the request reaches the handler but returns `413` + a text error message
- `file > 100MB`: the request may fail at the framework layer (specific error shape not guaranteed)
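The two tiers can be sketched as a single decision. As a bonus, the "approximately 16 minutes of MP3" figure in the 413 message is plain bitrate arithmetic: at 128 kbps, 15 MB holds about 15 × 1024 × 1024 × 8 / 128000 ≈ 983 seconds (≈ 16.4 minutes).

```python
AUDIO_LIMIT = 15 * 1024 * 1024    # handler-level check
BODY_LIMIT = 100 * 1024 * 1024    # Axum DefaultBodyLimit

def classify_upload(size_bytes: int) -> str:
    """Which layer accepts or rejects an upload of this size."""
    if size_bytes > BODY_LIMIT:
        return "rejected by framework"   # error shape not guaranteed
    if size_bytes > AUDIO_LIMIT:
        return "413 from handler"        # plain-text error message
    return "accepted"

# Rough duration a 15 MB MP3 can hold at 128 kbps:
max_seconds = AUDIO_LIMIT * 8 / 128_000  # ~983 seconds, i.e. ~16 minutes
```
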
What You See When Exceeding 15MB
Returns status code 413 Payload Too Large with a text response body (not JSON), content similar to:
```text
Audio file too large (18.5 MB). Maximum supported size is 15 MB (approximately 16 minutes of MP3). Suggestions: 1) Compress audio quality 2) Upload in segments
```

Two Executable Splitting Methods
- Compress audio quality (convert WAV to a smaller MP3)

```bash
ffmpeg -i input.wav -b:a 64k -ac 1 output.mp3
```

- Segment (cut long audio into multiple pieces)

```bash
ffmpeg -i long_audio.mp3 -f segment -segment_time 600 -c copy segment_%03d.mp3
```

Log Collection Notes
Why Real Request Bodies Often Can't Be Seen in Monitor
Monitor middleware reads the POST request body before logging:

- If the body can be parsed as UTF-8 text, the original text is recorded
- Otherwise it is recorded as `[Binary Request Data]`
Audio transcription uses multipart/form-data containing binary audio content, so it almost always falls into the second case.
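The monitor's decision reduces to a single try/except. A sketch of that choice (the real logic lives in the Rust `monitor_middleware()`):

```python
def loggable_body(raw: bytes) -> str:
    """Record the text if the body is valid UTF-8, else a placeholder."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return "[Binary Request Data]"
```

A JSON chat request logs verbatim; a multipart body with MP3 bytes in it fails the decode and logs as the placeholder.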
What You Should See in Monitor
```text
URL: /v1/audio/transcriptions
Request Body: [Binary Request Data]
Response Body: {"text":"..."}
```

Log Limitation Note
The audio itself is not visible in logs, but you can still use the status code, duration, and `X-Account-Email` to quickly determine whether a failure is protocol incompatibility, an oversized file, or an upstream error.
Parameter Description (No "Implicit Completion")
This endpoint only reads 3 form fields:
| Field | Required | Default Value | Handling |
|---|---|---|---|
file | ✅ | None | Must be provided; missing returns 400 + text 缺少音频文件 ("Missing audio file") |
model | ❌ | gemini-2.0-flash-exp | Passed through as string and participates in token acquisition (specific upstream rules subject to actual response) |
prompt | ❌ | Generate a transcript of the speech. | Sent as first text segment to upstream, used to guide transcription |
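The "no implicit completion" behavior amounts to reading exactly three fields and applying two defaults. A sketch under that assumption (the handler's actual parsing is in Rust):

```python
DEFAULT_MODEL = "gemini-2.0-flash-exp"
DEFAULT_PROMPT = "Generate a transcript of the speech."

def resolve_fields(form: dict) -> dict:
    """Only file/model/prompt are read; everything else is ignored."""
    if "file" not in form:
        raise ValueError("400: missing audio file")  # text error, not JSON
    return {
        "file": form["file"],
        "model": form.get("model", DEFAULT_MODEL),
        "prompt": form.get("prompt", DEFAULT_PROMPT),
    }
```
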
Common Pitfalls
❌ Error 1: Wrong curl Parameters, Not multipart
```bash
# Wrong: using -d directly
curl -sS -X POST http://127.0.0.1:8045/v1/audio/transcriptions \
  -d "file=@audio.mp3"
```

Correct approach:

```bash
curl -sS -X POST http://127.0.0.1:8045/v1/audio/transcriptions \
  -F "file=@audio.mp3"
```

❌ Error 2: File Extension Not in Supported List
```bash
curl -sS -X POST http://127.0.0.1:8045/v1/audio/transcriptions \
  -F "file=@audio.txt"
```

Correct approach: only upload supported audio files (`.mp3`, `.wav`, etc.).
❌ Error 3: Treating 413 as "Gateway Broken"
A 413 here usually means the 15MB validation was triggered. Compress or segment the file first; it's faster than retrying blindly.
Lesson Summary
- Core endpoint: `POST /v1/audio/transcriptions` (Whisper-compatible shape)
- Format support: mp3, wav, m4a, ogg, flac, aiff (aif)
- Size limit: 15MB (exceeding returns `413` + a text error message)
- Log behavior: when multipart contains binary content, Monitor shows `[Binary Request Data]`
- Key parameters: `file` / `model` / `prompt` (see the table above for defaults)
Next Lesson Preview
In the next lesson, we'll learn about MCP Endpoint: Exposing Web Search/Reader/Vision as Callable Tools.
You'll learn:
- MCP endpoint routing shape and authentication strategy
- Whether Web Search/Web Reader/Vision use "upstream forwarding" or "built-in tools"
- Which capabilities are experimental—avoid stepping on landmines in production
Appendix: Source Code Reference
Last updated: 2026-01-23
| Feature | File Path | Lines |
|---|---|---|
| Route registration (/v1/audio/transcriptions + body limit) | src-tauri/src/proxy/server.rs | 120-194 |
| Audio transcription handler (multipart/15MB/inlineData) | src-tauri/src/proxy/handlers/audio.rs | 16-162 |
| Monitor middleware (Binary Request Data) | src-tauri/src/proxy/middleware/monitor.rs | 13-337 |
Key constants:
- `MAX_SIZE = 15 * 1024 * 1024`: audio file size limit (15MB)
- `MAX_REQUEST_LOG_SIZE = 100 * 1024 * 1024`: Monitor limit for reading POST request bodies (100MB)
- `MAX_RESPONSE_LOG_SIZE = 100 * 1024 * 1024`: Monitor limit for reading response bodies (100MB)
Key functions:
- `handle_audio_transcription()`: parses multipart, validates extension and size, assembles `inlineData`, and calls upstream
- `AudioProcessor::detect_mime_type()`: extension -> MIME
- `AudioProcessor::exceeds_size_limit()`: 15MB validation
- `monitor_middleware()`: records request/response bodies to Proxy Monitor (fully logged only if UTF-8)