
General information


The ELSA metered API enables developers to build applications that evaluate the pronunciation correctness of a voice recording. It makes advanced machine learning technology available to any developer through an easy-to-use development interface. The metered API lets developers use ELSA and pay per time or number of characters used. Use cases of the ELSA API include:

  • Detect pronunciation errors of users dictating to an app's microphone (for language speaking apps)

  • Evaluate a user's pronunciation accuracy (e.g. you pronounced this sentence with 80% accuracy)

  • Evaluate the pronunciation of recorded audio files (for machine-assisted reviews of exams or applications)

Communication methods

The ELSA API allows communication either via a single file (batch mode) or via streaming. The method you choose for your application will later determine how best to authenticate and secure the connection so that unauthorized API calls are prevented.

Single-file implementation

In single-file implementations, the customer calls the ELSA API with the full audio recording to be evaluated and the expected text. This is by far the easiest method, as it only requires an authenticated REST API call to the ELSA server. On the other hand, it is less efficient than streaming (higher latency before a response is returned) and it does not allow the server to automatically end the user's recording.
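To make the shape of a single-file call concrete, here is a minimal sketch that builds (but does not send) one authenticated multipart POST carrying the recording and the expected text. The endpoint URL, the form field names (`expected_text`, `audio`) and the `Bearer` header scheme are all assumptions made for illustration; the actual values come from ELSA's endpoint reference.

```python
import uuid
import urllib.request

# Hypothetical endpoint, used only for illustration.
API_URL = "https://api.example.invalid/v1/score"


def build_single_file_request(audio: bytes, expected_text: str, api_token: str,
                              url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart POST with the audio and the text."""
    boundary = uuid.uuid4().hex
    parts = [
        # Text part: the sentence the speaker was expected to read.
        # The field name "expected_text" is an assumption.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="expected_text"\r\n\r\n'
         f"{expected_text}\r\n").encode("utf-8"),
        # File part: the complete recording, sent in one piece.
        # The field name "audio" is an assumption.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="audio"; filename="recording.wav"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode("utf-8") + audio + b"\r\n",
        # Closing boundary terminates the multipart body.
        f"--{boundary}--\r\n".encode("utf-8"),
    ]
    headers = {
        # Assumed header scheme; check how the token must actually be passed.
        "Authorization": f"Bearer {api_token}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=b"".join(parts),
                                  headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Because the whole recording travels in one request, the server cannot begin processing until the upload completes, which is where the extra latency of this mode comes from.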

Streaming implementation

A streaming implementation transmits the audio to the ELSA server as it is being recorded by the client’s application. It uses a WebSocket connection with the API server, governed by a communication protocol consisting of JSON messages and raw audio packets, which are sent as soon as they are available from the client’s recording device. This method is much more complicated to implement but has two major advantages over the single-file implementation:

  • Processing starts as soon as the server receives the first audio packet, which slightly reduces the latency of the final response.
  • Either the server or the client can trigger the end of the recording (called acoustic endpoint detection). The server triggers it when it detects that the user has finished speaking the provided text or has paused for a very long time.
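The ordering of frames in such a protocol can be sketched without opening a connection. The generator below yields what a client would send: a JSON start message, the raw audio packets as they arrive, and a JSON end message. The message fields (`type`, `expected_text`) and the `endpoint_detected` server message are hypothetical placeholders, not ELSA's actual schema.

```python
import json
from typing import Iterable, Iterator, Union

# str items stand for JSON text frames, bytes items for binary audio frames.
Message = Union[str, bytes]


def stream_messages(audio_chunks: Iterable[bytes],
                    expected_text: str) -> Iterator[Message]:
    """Yield the frames a client would send, in order.

    The JSON field names are illustrative assumptions.
    """
    # Announce the session and the text the user is expected to read.
    yield json.dumps({"type": "start", "expected_text": expected_text})
    for chunk in audio_chunks:
        if chunk:  # skip empty reads from the recording device
            yield chunk
    # Client-triggered end of recording.
    yield json.dumps({"type": "end"})


def should_stop(server_message: str) -> bool:
    """Return True if a (hypothetical) server message signals the endpoint."""
    return json.loads(server_message).get("type") == "endpoint_detected"
```

In a real client, each yielded item would be written to the WebSocket as soon as it is produced, while a reader checks incoming messages with something like `should_stop` so that the recording can be ended when the server detects the endpoint.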

Spontaneous speech analysis

The ELSA unscripted API allows developers to submit their recorded audio files and obtain a full analysis along multiple dimensions, including pronunciation, prosody/intonation, fluency, vocabulary and grammar. To use the API, you will need the following from ELSA:

  • ELSA <API_token>: An API token that you add to the call header to authenticate your organization. This is the authentication method used in the calls below and the most standard way to integrate with the ELSA API.

  • (optional) ELSA <API_master_key>: If you plan to integrate the API calls directly into your clients, you will be given a master key that lets you create API tokens programmatically for each client. These tokens include an expiration time and can be easily disabled if you ever discover that one is being misused. This key is not provided by default; let’s discuss your specific use case.

This is a REST API that receives the audio to be analyzed and processes it synchronously. If your use case needs a faster response time (e.g. audio recordings are expected to be longer than ~30 seconds), you should try our WebSocket / streaming solution instead.
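One way to apply the ~30 second guidance is to measure a recording before deciding which mode to use. The sketch below assumes the audio is an uncompressed WAV file and treats 30 seconds as a hard cutoff. Both are simplifications, and the function names are illustrative.

```python
import io
import wave

# Approximate cutoff taken from the guidance above; tune it for your use case.
STREAMING_THRESHOLD_SECONDS = 30.0


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV recording in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_mode(wav_bytes: bytes) -> str:
    """Pick "streaming" for long recordings and "rest" otherwise."""
    if wav_duration_seconds(wav_bytes) > STREAMING_THRESHOLD_SECONDS:
        return "streaming"
    return "rest"
```

For live recordings, the expected length is usually known from the prompt rather than from a finished file, so the same threshold would be applied to an estimate instead.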