Developers can use the ELSA Metered API to create applications that evaluate the accuracy of spoken English in a voice recording. The ELSA API offers an easy-to-use development interface for integrating advanced machine learning technology into applications. The API can analyze unscripted English speech recordings without a prior transcript, which makes it well suited to spontaneous speech. The ELSA AI technology evaluates spoken English along five major dimensions of communication: pronunciation, prosody, fluency, grammar, and vocabulary. It then provides feedback to help users improve their communication skills.

API Plans

Plan: Standard
Speech type: Scripted speech
Speaking activities:
  • Word pronunciation
  • Sentence pronunciation
  • Multiple choice pronunciation

Plan: Premium
Speech type: Scripted speech
Speaking activities:
  • all STANDARD activities
  • Fluency assessment

Plan: STANDARD Unscripted
Speech type: Spontaneous speech
Speaking activities:
  • Spontaneous speech assessment

Plan: PREMIUM Unscripted
Speech type: Spontaneous speech
Speaking activities:
  • Spontaneous speech assessment
  • Fluency assessment
  • IELTS assessment

API Regions

The ELSA API is deployed in several data centers to keep it close to your servers and your users. This lowers latency when sending speech to be processed and improves your users' experience. We are currently deployed in 3 different regions using the Amazon AWS network. A single API endpoint is used for all regions, and each call is routed to the closest data center:

Region            Data Center
South-east Asia   Singapore
Americas          East-Coast USA

Code Samples

We will soon provide client code samples in Python, PHP and JavaScript.

Recording Audio requirements

ELSA supports audio in most formats used on the web (e.g. .mp3, .wav, .flac, .mp4, .m4a), but we strongly recommend using FLAC (lossless compression, much lighter on bandwidth) or WAV.

We internally convert all files to single-channel (mono) audio with a 16 kHz sampling rate and 16-bit resolution. Sending audio in any other format only delays processing, because it has to be converted first. Files with a lower resolution or sampling rate will probably underperform (e.g. 8 kHz files are known to lose important information about fricative sounds).
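As an illustration of the target format described above, the following minimal sketch checks a WAV file locally before upload. It is not part of the ELSA API: the function name `wav_format_issues` and the constants are hypothetical, and the check relies only on Python's standard `wave` module, so it applies to uncompressed WAV files only.

```python
import wave

# Target format taken from the text above: mono, 16 kHz, 16-bit samples.
TARGET_CHANNELS = 1
TARGET_RATE_HZ = 16000
TARGET_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def wav_format_issues(path):
    """Return a list of differences between a WAV file and the target format.

    An empty list means the file already matches the format the service
    converts to internally. (Hypothetical helper, not an ELSA function.)
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()

    issues = []
    if channels != TARGET_CHANNELS:
        issues.append(f"{channels} channels (expected {TARGET_CHANNELS})")
    if rate != TARGET_RATE_HZ:
        issues.append(f"{rate} Hz sampling rate (expected {TARGET_RATE_HZ} Hz)")
    if width != TARGET_SAMPLE_WIDTH_BYTES:
        issues.append(f"{8 * width}-bit samples (expected 16-bit)")
    return issues
```

A caller might log the returned issues and resample the recording before sending it, so that no server-side conversion is needed.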

Scripted or unscripted API

We split the ELSA API into two types of calls, depending on the data available for the API to process.

On the one hand, the scripted API assumes that the provided audio is a recording of the provided text being spoken. This API is usually most useful for evaluating read speech.

On the other hand, the unscripted API assumes that only audio is available. This is most useful when analyzing spontaneous speech recordings, but it can also analyze read speech for which you do not have the text available. Note that if the speaker in the recording is at a very beginner level of English, the ASR (automatic speech recognition) module is likely to make some transcription mistakes, which should be taken into account when interpreting the metrics.

API limitations

Certain metrics are only returned when minimum conditions are met. These metrics depend on the content of the audio, so they require a minimum number of spoken words:

Metric             Minimum words
Grammar score      50
Vocabulary score   75

Warning: If you want these results from the unscripted API even when the recording is below the minimum threshold, add the flag:

-F force_grammar_vocab=True

Keep in mind that these results won't be as accurate as results from recordings that meet the minimum threshold.
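The threshold logic above can be sketched client-side as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the word count is whatever estimate the caller has (the API's own word counting may differ), and the returned dictionary simply mirrors the `force_grammar_vocab=True` form field shown above.

```python
# Minimum word counts from the table above.
MIN_WORDS = {"grammar": 50, "vocabulary": 75}


def metrics_below_threshold(word_count):
    """Return the metrics whose minimum word count is not reached.

    `word_count` is the caller's own estimate of the words spoken
    in the recording. (Hypothetical helper, not an ELSA function.)
    """
    return sorted(
        name for name, minimum in MIN_WORDS.items() if word_count < minimum
    )


def extra_form_fields(word_count):
    """Return the form field that forces grammar and vocabulary scores.

    The field is only included when at least one metric would otherwise
    be skipped; forced results are less accurate, as noted above.
    """
    if metrics_below_threshold(word_count):
        return {"force_grammar_vocab": "True"}
    return {}
```

For example, a 60-word recording meets the grammar minimum but not the vocabulary one, so `extra_form_fields(60)` returns the forcing field, while a 100-word recording needs no extra field.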