General information

Overview

The ELSA Metered API enables developers to create applications that assess the correctness of speech pronunciation in voice recordings. It makes advanced machine learning technology available to any developer through an easy-to-use development interface. The metered API lets developers pay by time or by number of characters used. Use cases of the ELSA API include:

Detect pronunciation errors of users dictating to an app's microphone (for language speaking apps)
Evaluate a user's pronunciation accuracy (e.g. you pronounced this sentence with 80% accuracy)
Evaluate the pronunciation of recorded audio files (for machine-assisted reviews of exams or applications)

Communication methods

The ELSA API accepts audio either as a single file (batch mode) or via streaming. The method you choose determines the most effective way to authenticate and secure the connection, and so to prevent unauthorized API calls.

Single-file implementation

In single-file implementations, the customer calls the ELSA API with the full audio recording to be evaluated and the expected text. This is by far the easiest method, as it only requires an authenticated REST API call to the ELSA server. On the other hand, it is less efficient than streaming (higher latency before a response is returned), and it does not allow the server to end the user's recording automatically.
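
As an illustration of the single-file call described above, the following minimal sketch builds (but does not send) an authenticated multipart POST that carries the recording and the expected text. The endpoint URL, the form field names `sentence` and `audio_file`, and the `Authorization: Bearer` header scheme are all assumptions made for the example; use the values given in your ELSA integration material.

```python
import urllib.request
import uuid

# Hypothetical endpoint, used only for illustration.
SCORING_URL = "https://api.example.com/score_audio"


def build_single_file_request(audio: bytes, expected_text: str, api_token: str,
                              url: str = SCORING_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST with the full recording and the expected text."""
    boundary = uuid.uuid4().hex

    # Text field holding the sentence the user was asked to read (field name assumed).
    text_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="sentence"\r\n'
        "\r\n"
        f"{expected_text}\r\n"
    ).encode("utf-8")

    # File field holding the complete recording (field name and WAV format assumed).
    audio_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio_file"; filename="recording.wav"\r\n'
        "Content-Type: audio/wav\r\n"
        "\r\n"
    ).encode("utf-8")

    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = text_part + audio_header + audio + closing

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The API token goes in the call header; the exact scheme is an assumption.
            "Authorization": f"Bearer {api_token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

The returned request can then be sent with `urllib.request.urlopen(request)`. Because the whole file is uploaded before any processing starts, the response time grows with the length of the recording.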

Streaming implementation

A streaming implementation transmits the audio to the ELSA server as it is being recorded by the client's application. It uses a websocket connection with the API server, governed by a communication protocol of JSON messages and raw audio packets that are sent as soon as they are available from the client's recording device. This method is considerably more complicated to implement but has two major advantages over the single-file implementation:

Processing starts as soon as the server receives the first audio packet, so the response arrives with slightly lower latency once the recording ends.
Either the server or the client can trigger the end of the recording (called acoustic endpoint detection). The server triggers it when it detects that the user has finished speaking the provided text or has paused for a very long time.
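
To make the shape of such a session concrete, here is a minimal sketch of the client-side message sequence: one JSON message to open the session, raw audio packets as they become available, and one JSON message to signal a client-triggered end of recording. The message fields (`type`, `sentence`) and the packet size are hypothetical and do not reflect the actual ELSA protocol; the default of 3200 bytes corresponds to 100 ms of audio only if the recording is 16 kHz, 16-bit mono.

```python
import json
from typing import Iterator, Union


def stream_messages(audio: bytes, expected_text: str,
                    chunk_size: int = 3200) -> Iterator[Union[str, bytes]]:
    """Yield the messages a client would send, in order.

    str items stand for JSON text frames, bytes items for raw audio packets.
    """
    # Opening JSON message announcing the text to be scored (field names assumed).
    yield json.dumps({"type": "start", "sentence": expected_text})

    # Raw audio packets, emitted in recording order.
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]

    # Closing JSON message: the client ends the recording itself.
    yield json.dumps({"type": "end"})
```

Each yielded item would be written to the websocket as soon as it is produced. A real client also has to read from the socket while sending, because the server may end the recording first when it detects that the user has finished speaking.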

Spontaneous speech analysis

The ELSA unscripted API lets developers submit recorded audio files and obtain a full analysis along multiple dimensions, including pronunciation, prosody/intonation, fluency, vocabulary and grammar. To use the API, you will need the following from ELSA:

ELSA <API_token>:
- You add an API token to the call header to authenticate your organization. This method is used in the calls below and is the standard way to integrate with the ELSA API.
(Optional) ELSA <API_master_key>:
- If you intend to make the API calls directly from your client applications, you will receive a master key that lets you generate API tokens programmatically for each client. These tokens have an expiration time and can easily be disabled if misuse is detected. The master key is not provided by default, so please discuss your specific use case with us.

This is a REST API that receives the audio to be analyzed and processes it synchronously. If your use case needs a faster response time (e.g. audio recordings are expected to be longer than ~30 seconds), try our websocket/streaming solution instead.
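
The guidance above can be turned into a simple client-side check. The sketch below assumes the recordings are WAV files and treats the "~30 seconds" figure as a tunable threshold rather than a hard limit; the function names are made up for this example.

```python
import io
import wave

# Taken from the "~30 seconds" guidance; an approximate, adjustable value.
STREAMING_THRESHOLD_SECONDS = 30.0


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV recording in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def suggest_method(wav_bytes: bytes) -> str:
    """Suggest "streaming" for long recordings and "rest" for short ones."""
    duration = wav_duration_seconds(wav_bytes)
    return "streaming" if duration > STREAMING_THRESHOLD_SECONDS else "rest"
```

This check only applies when the full file already exists. For live recordings, the choice has to be made up front from the expected length of the user's answer.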
