
Speech-to-Text API

The Prosa Speech-to-Text (STT) API helps you convert your audio data into text using our Automatic Speech Recognition (ASR) engine. Your meetings, conversations, and interviews can be transcribed with the STT API, and the transcripts can be used for analytics or to improve accessibility. We support common audio/video formats, and you can choose your own recognition method: synchronous, asynchronous, or streaming.

Use Cases

The Prosa STT API supports three groups of use cases:

  • Instantaneous Transcription

    If you record short audio files and want them transcribed as soon as they are created, you can send the audio data to the API and wait for the transcripts to be returned. Our engine generates the transcripts and returns them synchronously. Typical use cases include voice commands and voice search in virtual assistants.

  • Batch Transcription

    If you have many audio recordings (e.g. meeting or interview recordings) that you want transcribed for later use, you can submit them to the API asynchronously. Our system schedules them for transcription, and you can retrieve the transcripts later. Typical use cases include meeting summarization, customer service call analysis, and interview transcription.

  • Live or Online Transcription

    If you want to provide transcripts for an ongoing recording or streaming session in near real time, you can stream the audio data to the API and receive the transcripts while the session is still in progress. Typical use cases include live captioning of a video stream and generating meeting notes in real time.

Recognition Methods

To support these use cases, the Prosa STT API provides three recognition methods:

  • Synchronous Recognition

    Clients send the audio data through the REST API with the wait field in the request body set to true. Clients then wait for the transcription process to finish and receive the transcripts immediately. Each audio file must not exceed 60 seconds in duration, and the maximum data size is 10 MB.

  • Asynchronous Recognition

    Clients send the audio data or a Google Drive URL through the REST API, with the wait field in the request body set to false. After submitting the request, clients receive the STT job details, including the job ID. Using the job ID, clients can check the transcription progress and result. Clients can submit up to 4 hours of audio data for each transcription request.

  • Streaming Recognition

    Clients send audio chunks to the Streaming API over a WebSocket connection. Transcripts are sent back to the client over the same WebSocket connection. In each streaming session, clients may send up to 2 hours of audio data.
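
The limits above can be turned into a small client-side check that runs before any request is sent. The following is a minimal sketch, not part of the Prosa API: the function names `choose_method` and `wait_field` are hypothetical, the selection logic is one possible policy, and it assumes that 1 MB means 1,000,000 bytes (the documentation does not say which definition applies).

```python
# Limits taken from the recognition method descriptions above.
MAX_SYNC_SECONDS = 60
MAX_SYNC_BYTES = 10 * 1000 * 1000   # assumption: 1 MB = 1,000,000 bytes
MAX_ASYNC_SECONDS = 4 * 60 * 60     # 4 hours per transcription request
MAX_STREAM_SECONDS = 2 * 60 * 60    # 2 hours per streaming session


def choose_method(duration_seconds: float, size_bytes: int, live: bool = False) -> str:
    """Pick a recognition method name for the given audio (hypothetical helper)."""
    if live:
        if duration_seconds > MAX_STREAM_SECONDS:
            raise ValueError("streaming sessions are limited to 2 hours of audio")
        return "streaming"
    if duration_seconds <= MAX_SYNC_SECONDS and size_bytes <= MAX_SYNC_BYTES:
        return "synchronous"
    if duration_seconds <= MAX_ASYNC_SECONDS:
        return "asynchronous"
    raise ValueError("audio exceeds the 4-hour limit for a single request")


def wait_field(method: str) -> bool:
    """Value of the wait field in the REST request body for a given method."""
    if method == "synchronous":
        return True
    if method == "asynchronous":
        return False
    raise ValueError("the wait field only applies to the REST API methods")


if __name__ == "__main__":
    method = choose_method(duration_seconds=45, size_bytes=2_000_000)
    print(method, wait_field(method))
```

A short clip that fits both synchronous limits is routed to synchronous recognition; anything longer or larger falls back to asynchronous recognition, where only the 4-hour limit is checked because no size limit is stated for it.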

Multichannel audio

The Prosa STT API can transcribe audio with multiple channels. Each transcription result contains a channel tag indicating which channel it came from.


Multichannel transcription is not supported for streaming recognition.
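
Once results carry a channel tag, a client will often want one transcript per channel. The sketch below illustrates that grouping step only. The result shape is an assumption: this page does not specify the response format, so the `channel` and `transcript` keys and the `group_by_channel` helper are hypothetical stand-ins.

```python
from collections import defaultdict


def group_by_channel(results):
    """Join transcripts per channel, keeping the original order within each channel.

    `results` is assumed to be a list of dicts with hypothetical keys
    "channel" and "transcript"; adapt the key names to the real response.
    """
    grouped = defaultdict(list)
    for item in results:
        grouped[item["channel"]].append(item["transcript"])
    return {channel: " ".join(texts) for channel, texts in grouped.items()}


if __name__ == "__main__":
    sample = [
        {"channel": 0, "transcript": "hello"},
        {"channel": 1, "transcript": "hi"},
        {"channel": 0, "transcript": "there"},
    ]
    print(group_by_channel(sample))
```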

Supported Audio Formats

Prosa STT API supports common audio/video formats.

Audio    Video
.wav     .mp4
.mp3     .webm
.m4a     .mov
.ogg     .avi
.weba    .wmv
.webm    .mpg
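
A client can check the file extension locally before uploading. This is a minimal sketch based only on the extensions listed above; the helper name `is_supported` is hypothetical, and the check looks at the file name only, not at the actual container or codec, which the API may validate separately.

```python
from pathlib import Path

# Extensions copied from the table above (.webm appears in both columns).
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".weba", ".webm"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".mov", ".avi", ".wmv", ".mpg"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported list (case-insensitive)."""
    extension = Path(filename).suffix.lower()
    return extension in AUDIO_EXTENSIONS or extension in VIDEO_EXTENSIONS


if __name__ == "__main__":
    for name in ("meeting.WAV", "interview.mov", "notes.txt"):
        print(name, is_supported(name))
```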