Elevenlabs scribe v1 Speech to Text
POST
Transcribes audio or video files. When use_multi_channel is true and the uploaded audio has multiple channels, returns a ‘transcripts’ object with one transcription per channel. Otherwise returns a single transcription result.
Request Headers
Enum: application/json
Bearer authentication format: Bearer {{API Key}}.
Request Body
If specified, the system will make a best effort to sample deterministically. Repeated requests with the same seed and parameters should return the same result, but determinism is not guaranteed. Must be an integer between 0 and 2147483647.
Range: [0, 2147483647]
Whether to annotate which speaker is currently speaking in the uploaded file.
Input audio format. Options are ‘pcm_s16le_16’ or ‘other’. The pcm_s16le_16 option requires 16-bit little-endian integer PCM audio, mono, at a 16 kHz sample rate; it has lower latency than encoded waveforms.
Possible values: pcm_s16le_16, other
Controls the randomness of the transcription output. Value range is 0.0 to 2.0; higher values produce more diverse and less certain results. If omitted, the default temperature of the selected model will be used (typically 0).
Range: [0, 2]
Maximum number of speakers in the uploaded file. Can be used to help distinguish speakers. Up to 32 speakers supported.
Range: [1, 32]
Specifies the ISO-639-1 or ISO-639-3 language code of the audio file. Specifying it in advance can sometimes improve transcription performance. Defaults to null, which will automatically detect the language.
Whether to tag audio events such as (laughter), (footsteps), etc. in the transcription.
HTTPS URL of the file to transcribe. Either file or cloud_storage_url must be provided. The file must be accessible via HTTPS and smaller than 2GB. Supports any valid HTTPS address, including cloud storage (AWS S3, GCS, Cloudflare R2, etc.), CDNs, or other HTTPS sources. Supports pre-signed URLs with tokens or URL query parameter authentication.
Whether the audio file is multi-channel with each channel containing only a single speaker. When enabled, each channel will be transcribed independently and the results will be combined. Each word in the output will include a channel_index field. Up to 5 channels supported.
Speaker diarization threshold. A higher value means a lower probability of one person being split into multiple speakers, but a higher probability of different people being merged into one speaker (fewer speakers identified). A lower value means a higher probability of one person being split into multiple speakers, but a lower probability of different people being merged (more speakers identified). Can only be set when diarize=True and num_speakers=None. Defaults to None, which selects a threshold based on the model ID (typically 0.22).
Range: [0.1, 0.4]
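The ranges and constraints described so far can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own validation. The names diarize, num_speakers, file and cloud_storage_url appear in the descriptions above; the keys seed, temperature and diarization_threshold are assumed names used here only for illustration.

```python
from typing import Any, Dict, List


def validate_request(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in params; an empty list means none."""
    problems: List[str] = []

    def check_range(name: str, low: float, high: float) -> None:
        # Parameters that are absent or None are treated as "use the default".
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name} must be in [{low}, {high}], got {value}")

    # "seed" and "temperature" are assumed key names (not given in the text).
    seed = params.get("seed")
    if seed is not None and not isinstance(seed, int):
        problems.append("seed must be an integer")
    else:
        check_range("seed", 0, 2147483647)
    check_range("temperature", 0.0, 2.0)
    check_range("num_speakers", 1, 32)

    # "diarization_threshold" is an assumed key name. Per the description,
    # it may only be set when diarize is true and num_speakers is unset.
    check_range("diarization_threshold", 0.1, 0.4)
    if params.get("diarization_threshold") is not None:
        if not params.get("diarize"):
            problems.append("diarization_threshold requires diarize to be true")
        if params.get("num_speakers") is not None:
            problems.append("diarization_threshold requires num_speakers to be unset")

    # Either an uploaded file or an HTTPS URL must be provided.
    url = params.get("cloud_storage_url")
    if not params.get("file") and not url:
        problems.append("either file or cloud_storage_url must be provided")
    if url is not None and not str(url).startswith("https://"):
        problems.append("cloud_storage_url must be an HTTPS URL")

    return problems
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem at once.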
Granularity of timestamps in the transcription. ‘word’ provides word-level timestamps, ‘character’ provides character-level timestamps.
Possible values: none, word, character
Response
The response may be one of the following response types:
Response Type 1
The raw transcribed text.
List of words and their timing information.
Channel index corresponding to this transcription (applicable for multi-channel audio).
Detected language code (e.g., ‘eng’ for English).
Unique transcription ID for this response.
Language detection confidence (between 0 and 1).
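Because the response is either a single transcription or, for multi-channel audio, a ‘transcripts’ object with one transcription per channel, client code may find it convenient to normalize both shapes into a list. The sketch below assumes the parsed JSON body is a dict and that ‘transcripts’ holds a list of per-channel transcriptions; the "text" key in the usage lines is a hypothetical placeholder, not a documented field name.

```python
from typing import Any, Dict, List


def as_transcript_list(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return per-channel transcriptions as a list, whatever the response shape."""
    # Multi-channel responses carry a 'transcripts' object
    # (assumed here to be a list with one entry per channel).
    if "transcripts" in response:
        return list(response["transcripts"])
    # Otherwise the response itself is the single transcription result.
    return [response]


# Usage with made-up payloads ("text" is a hypothetical key).
single = {"text": "hello"}
multi = {"transcripts": [{"text": "left channel"}, {"text": "right channel"}]}
for item in as_transcript_list(single) + as_transcript_list(multi):
    print(item)
```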