Frequently Asked Questions
Where can I find API usage and endpoint details?
Please refer to the API endpoint section of our documentation for detailed information.
What network settings and certificates are required?
Authentication is token-based. You can find all details about the hostname and how to use the API in our documentation.
Example:
curl -X POST 'https://api.elsanow.io/api/v1/score_audio_plus' \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-F 'api_plan="premium"' \
-F 'return_json="true"' \
-F 'audio_file=@"/path/to/file"'
Additional examples are available in the API section of the documentation.
How do I get and use the API key or token?
The API token will be generated by us once the NDA is signed. Please inform us when this is completed, and we will provide the token. Usage instructions are outlined in the documentation.
What’s the structure and limits of API requests and responses?
You can find detailed request and response structures, including parameters and limitations, in the API documentation.
Is a sandbox available, and do I need a test account?
We do not currently provide a staging environment.
Is there a dashboard or logs for monitoring and debugging?
A customer-facing dashboard is available to provide visibility into requests and activity. This Retool-based interface allows users to track their usage for both scripted and unscripted API calls.
Does our API currently support, or can it be integrated with, a RAG solution?
While our API doesn’t implement a RAG (Retrieval-Augmented Generation) solution natively, it can certainly be integrated into RAG workflows. Our services focus on audio stream analysis, and the outputs can be used as input for RAG systems or other pipelines that benefit from enriched audio-derived context.
To provide a more tailored answer or suggest an optimal integration approach, please share more details about the customer’s intended use case or architecture.
What bandwidth is needed for smooth AI speaking features?
As of now, the only bandwidth-related requirements we provide are those listed in our API documentation. These apply regardless of how the AI speaking features are used or integrated. If you can share a bit more about your specific use case, we’d be happy to give a more precise recommendation.
Is it possible to identify which IP address each chunk was sent from?
At the moment, we do not retain any information about the origin of API requests, such as IP addresses. This means we are unable to determine which IP each chunk was sent from. Our logs focus on request metadata relevant to the service functionality, but they do not include origin-level network details.
What are the parameters under the "utterance" response section?
The parameters under the utterance response section are:
- nativeness_score: The nativeness score achieved by the user for the entire utterance, on a scale of 0–100.
- nativeness_score_partial: This score reflects the user’s nativeness on the subset of words they actually pronounced with a minimum score of 25%.
- decision: A string indicating the user’s proficiency in pronouncing the utterance, based on nativeness_score. Possible values: correct, almost_correct, or incorrect.
Note: While nativeness_score considers all words in the utterance (including those the user did not say), nativeness_score_partial focuses only on the words spoken by the user, even if they have a low score.
Example: If the utterance is “Hello ELSA” and the user only says “Hello,” nativeness_score_partial will consider only the score of “Hello” (assuming it exceeds 25%), while nativeness_score will account for all words, including those not spoken (e.g., “ELSA,” with a low or zero score).
For more details, see the API documentation.
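The “Hello ELSA” example above can be sketched in a few lines of Python. This is an illustration only, not ELSA’s actual scoring algorithm: the per-word scores are hypothetical, and simple averaging is an assumption made here to show how the full and partial scores can diverge.

```python
# Illustrative sketch only (not ELSA's internal algorithm).
# Hypothetical per-word scores; assume "ELSA" was not spoken (score 0).
word_scores = {"Hello": 90, "ELSA": 0}

# nativeness_score considers every word in the utterance, spoken or not.
nativeness_score = sum(word_scores.values()) / len(word_scores)

# nativeness_score_partial keeps only words pronounced with a score >= 25.
spoken = [s for s in word_scores.values() if s >= 25]
nativeness_score_partial = sum(spoken) / len(spoken)

print(nativeness_score, nativeness_score_partial)  # 45.0 90.0
```

Under this toy averaging, the full-utterance score is dragged down by the unspoken word while the partial score reflects only what was actually said.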
What are the parameters under the "words" response section?
The parameters under the words response section are:
- nativeness_score: The nativeness score (0–100) for the pronunciation of the individual word.
- decision: Proficiency assessment based on nativeness_score, with possible values of correct, almost_correct, or incorrect.
For more details, see the API documentation.
What are the parameters under the "word_stress" response section?
The parameters under the word_stress response section are:
- decision: Indicates whether the user stressed the syllable correctly, with possible values correct or incorrect.
For more details, see the API documentation.
What are the parameters under the "phonemes" response section?
The parameters under the phonemes response section are:
- nativeness_score: The score for the phonemes in this entry, on a 0–100 scale.
- decision: Indicates the accuracy of phoneme pronunciation, with possible values correct, warning, or error.
For more details, see the API documentation.
How Chunk Calculation Works
A chunk is 15 seconds in duration. However, the actual number of chunks is calculated using the formula:
num_chunks = ceil(duration / chunk_size)
So, for example, a 17-second audio input results in 2 chunks (not 1.13), because of the ceiling operation.
In addition:
- If the total duration slightly exceeds the API’s maximum allowed duration, it is truncated to fit within limits. This prevents a second request from being triggered due to just a few extra frames.
- We also trim silence and noise from the beginning or end of audio when applicable, which can further affect the final chunk count.
These factors explain why num_standard_chunks may not always align exactly with num_secs / 15.
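The chunk formula above can be sketched directly in Python. The 15-second chunk size comes from this FAQ; the function name is illustrative.

```python
import math

CHUNK_SIZE = 15  # seconds per chunk, as stated in this FAQ

def num_chunks(duration_secs: float) -> int:
    """Ceiling division: any partial chunk counts as a full chunk."""
    return math.ceil(duration_secs / CHUNK_SIZE)

print(num_chunks(17))  # a 17-second clip -> 2 chunks, not 1.13
```

Note that silence trimming and truncation at the maximum allowed duration happen before this calculation, which is why the observed chunk count can differ from a naive duration / 15.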
How does Speech Analyzer detect grammar mistakes?
We detect grammatical errors but only correct those for which we are highly confident in the context.
Specifically, a proposed correction must have at least an 80% confidence score before it is applied. This approach ensures that we avoid suggesting irrelevant changes in cases of grammatical ambiguity.
What are the Maximum Audio File Upload Limits?
Bytes
The maximum file size allowed for upload is 100MB. If you need to upload files larger than this limit, please contact our support team for assistance.
Minutes
The maximum audio file length allowed for upload for unscripted requests with the sync flag set to True is 15 minutes.
- Scripted: Not limited
- Unscripted:
  - sync = True => 15 minutes
  - sync = False => 12 minutes
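The limits above can be expressed as a simple pre-flight check you might run before uploading. The limit values come from this FAQ; the function name and error strings are illustrative only, and the API enforces these limits server-side regardless.

```python
# Illustrative pre-flight check for the upload limits listed in this FAQ.
MAX_BYTES = 100 * 1024 * 1024  # 100MB file-size limit

def check_limits(size_bytes: int, minutes: float, scripted: bool, sync: bool) -> list:
    """Return a list of limit violations (empty list means OK)."""
    errors = []
    if size_bytes > MAX_BYTES:
        errors.append("file exceeds 100MB; contact support")
    if not scripted:  # only unscripted requests have a duration limit
        limit = 15 if sync else 12
        if minutes > limit:
            errors.append(f"unscripted audio exceeds {limit} minutes")
    return errors

print(check_limits(50 * 1024 * 1024, 14, scripted=False, sync=True))  # []
```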
Are pronunciation and intonation scores related?
There is no direct dependency between the pronunciation_score and intonation_score. However, we typically only compute the intonation_score for longer inputs, such as full sentences. As a result, users who are absolute beginners and tend to produce very short or incomplete utterances may not receive an intonation_score. This may give the impression of a relationship, but it's more a reflection of the input length and quality than a dependency between the scores themselves.
How are pronunciation and intonation decisions evaluated?
These decision attributes are based on the corresponding CEFR-level scores (e.g., pronunciation_cefr, intonation_cefr). A mapping example is as follows:
- Correct: CEFR level C1 or C2
- Warning: CEFR level B1 or B2
- Incorrect: CEFR level A1 or A2
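The mapping above can be sketched as a simple lookup. The level-to-decision pairs mirror the example mapping in this FAQ; the dictionary and function names are illustrative.

```python
# Example CEFR-level -> decision mapping, mirroring the FAQ's table.
CEFR_TO_DECISION = {
    "C2": "correct", "C1": "correct",
    "B2": "warning", "B1": "warning",
    "A2": "incorrect", "A1": "incorrect",
}

def decision_for(cefr_level: str) -> str:
    """Map a CEFR-level score (e.g. pronunciation_cefr) to a decision label."""
    return CEFR_TO_DECISION[cefr_level]

print(decision_for("B2"))  # warning
```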
Why do some unscripted API calls miss EPS scores or transcripts?
To receive results with metrics, make sure that the audio length exceeds 20 seconds.
Is the pronunciation level mapped to a US/UK native accent?
We adhere to global standards (CEFR, IELTS, TOEFL), where we don't penalize individuals for speaking in their native accent but rather focus on speech clarity. While there may be slight variations for users with thicker accents, the differences typically remain within the same band.
Why are grammar and vocabulary metrics missing in unscripted API results?
These metrics have a minimum threshold. If you want to obtain results for grammar and vocabulary metrics in the unscripted API despite this minimum threshold, simply add the flag -F force_grammar_vocab=True. However, it's important to note that these results may not be as accurate as those that meet the minimum threshold.
What standard do we use for explainability?
Our models are proprietary, and we do not disclose the exact architectures or internal mechanisms. While we prioritize transparency in terms of output quality and performance benchmarks, we do not currently follow a public standard for model explainability.
How do we calculate the Overall score?
The overall score is a combination of the five metrics: pronunciation, intonation, fluency, grammar and vocabulary. Sometimes, if we don’t provide vocabulary or grammar scores because the recording was short, we can still provide an overall score based on the other available metrics.
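The fallback behaviour described here can be sketched as follows. This is an illustration only: ELSA's actual combination formula and weights are internal, so a plain average over whichever metrics are present is an assumption made purely to show how a score can still be produced when grammar and vocabulary are missing.

```python
# Illustrative sketch only -- ELSA's real combination formula is internal.
# A missing metric is represented as None and simply excluded.
def overall_score(metrics: dict) -> float:
    available = {k: v for k, v in metrics.items() if v is not None}
    return sum(available.values()) / len(available)

# Short recording: grammar and vocabulary were not produced.
print(overall_score({
    "pronunciation": 80, "intonation": 70, "fluency": 90,
    "grammar": None, "vocabulary": None,
}))  # 80.0
```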
How is the score computed?
The way we compute ELSA Score and map it to IELTS is developed internally. The parameters to calculate ELSA score can change over time, and so can the mapping to IELTS. From time to time we reevaluate it and change it slightly.
How are grammar range and errors weighted in the Grammar score?
The proportion of these aspects depends on the type of the recording. For casual speech, it's roughly 60% grammar errors and 40% grammar range; for exam-like settings it's roughly 50% grammar errors and 50% grammar range. It's important to note that these numbers are adjusted when we have new data.
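The weighting above can be sketched as a weighted combination. The weights come from this FAQ (and, as noted, may be adjusted over time); the 0–100 sub-scores and the function itself are assumptions for illustration.

```python
# Illustrative sketch of the error/range weighting described in this FAQ.
# (errors_weight, range_weight) per recording type.
WEIGHTS = {
    "casual": (0.6, 0.4),
    "exam":   (0.5, 0.5),
}

def grammar_score(error_score: float, range_score: float, setting: str) -> float:
    """Combine hypothetical 0-100 error and range sub-scores."""
    w_err, w_range = WEIGHTS[setting]
    return w_err * error_score + w_range * range_score

print(grammar_score(70, 90, "casual"))  # 0.6*70 + 0.4*90 = 78.0
```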
How do we calculate the pronunciation score?
The pronunciation score is based on how accurately you pronounce the English sounds in each word that was recognized by ELSA in your recording. The number and severity of the highlighted mispronunciations will contribute to the pronunciation score.
How do we calculate the fluency score?
The Fluency score is a combination of your performance in Pace, Pausing and Hesitations. Keeping a good Pace, Pausing only in natural places and reducing the amount of filler words and repetitions will contribute to a good Fluency score.
How do we calculate the intonation score?
The intonation score takes into account rises and falls in your pitch and how well you gave emphasis to words within sentences.
How do we calculate the vocabulary score?
Vocabulary score is primarily based on the estimated CEFR levels of words and expressions in the user’s speech.
Note: We output the raw CEFR distribution as feedback (percentage of words in each A1-C2 level), but the overall score is computed by a statistical algorithm which maps this distribution to a 0-100 value (where 100% corresponds to native-like vocabulary usage).
The vocabulary score is returned only if the text is (currently) 75 words or longer.
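The raw CEFR-distribution feedback described above can be sketched as follows. The word-to-level assignments here are hypothetical, and the statistical algorithm that maps this distribution to the final 0–100 score is internal to ELSA and not reproduced.

```python
from collections import Counter

# Illustrative sketch: percentage of words at each A1-C2 level.
def cefr_distribution(word_levels: list) -> dict:
    counts = Counter(word_levels)
    total = len(word_levels)
    return {level: 100 * counts.get(level, 0) / total
            for level in ("A1", "A2", "B1", "B2", "C1", "C2")}

# Hypothetical per-word CEFR levels for a 4-word utterance.
print(cefr_distribution(["A1", "A1", "B2", "C1"]))
```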
How do we calculate the grammar score?
The grammar score is calculated based on the outputs of the grammatical error detection and correction module combined with the grammatical range identified. Grammatical error detection and correction identifies grammatical errors in the text and outputs the error score. The grammatical range module identifies all the grammatical structures and calculates the range score based on the 5 highest-level structures successfully used in the recording.
The grammar score is returned only if the text is (currently) 50 words or longer.
What is the volume and diversity of the training data?
While we do not disclose implementation details or internal structures of our models, we are committed to ensuring that the outputs are interpretable and aligned with user expectations. Where applicable, we provide score breakdowns or category-level feedback that reflect the model’s evaluation logic in a transparent and actionable way for the end user. Internally, we follow robust validation practices and benchmark performance to ensure consistency and fairness in model decisions.
How accurate are the models, and how often are they retrained?
We have not historically shared detailed accuracy figures publicly. Instead, we’ve typically communicated that our models outperform competitors based on our internal benchmarks. Regarding model retraining frequency, it varies significantly depending on the model. While some models have remained unchanged for extended periods, our overall approach has consistently been to collect user data (where permitted) and iteratively improve model performance over time.
What is the risk of bias or discrimination?
ELSA's AI is purpose-built to support non-native English speakers. By training on thousands of hours of accented English, sourced from real second language (L2) speakers, ELSA is uniquely positioned to recognize and address learners' pronunciation challenges. This inclusive approach helps reduce accent bias and ensures learners feel seen, supported, and empowered from day one.
What is the weight of "intonation" in the model, and how is it addressed ethically?
Whether the content is scripted or unscripted, the detection rate is approximately 20%.
Regarding the question on handling this ethically, it’s possible the concern relates to the potential for detecting individual speech patterns or risks around voice spoofing.
To clarify: Our speech and intonation analysis systems are designed to be agnostic to any identifiable characteristics of a user’s voice. We do not perform speaker identification, nor do we use audio data for that purpose. Our focus remains solely on assessing pronunciation and prosody in the context of language learning.
How is API service usage tracked?
We do offer a usage tracking dashboard via Retool that provides visibility into overall daily API consumption. This includes metrics such as the number of characters processed, request counts (broken down by scripted and unscripted usage), plan tier, processing time, number of chunks, and number of ASR requests. Additionally, we provide a detailed per-request view, which includes the tier, audio length, and transcribed text for each individual request. This dashboard can serve as a reliable way to monitor and manage usage.
Why are there no grammar or vocabulary scores for short audios?
In the case of short audios, it is expected that grammar and vocabulary scores may not be generated. Our system typically requires a minimum of around 50 words for grammar evaluation and 75 words for vocabulary analysis in order to produce reliable and meaningful results.
For best results, we recommend submitting longer audio samples. This allows our scoring engine to analyze patterns and provide more comprehensive feedback.
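The word-count thresholds above can be sketched as a quick client-side check. The function is illustrative only; the API applies these thresholds server-side, and they may change.

```python
# Illustrative check for the minimum word counts stated in this FAQ:
# roughly 50 words for grammar, 75 for vocabulary.
GRAMMAR_MIN_WORDS = 50
VOCAB_MIN_WORDS = 75

def available_metrics(transcript: str) -> dict:
    """Predict which length-gated metrics a transcript should receive."""
    n = len(transcript.split())
    return {
        "grammar": n >= GRAMMAR_MIN_WORDS,
        "vocabulary": n >= VOCAB_MIN_WORDS,
    }

print(available_metrics("short sample"))  # both False for a 2-word input
```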
How are the scores mapped?
CEFR | IELTS | TOEFL Speaking | PTE Range |
---|---|---|---|
A1 | 1.5 | 0-1 | 10-10 |
A1 | 2 | 2-3 | 10-10 |
A1 | 2.5 | 4-5 | 10-10 |
A2 | 3 | 6-7 | 10-11 |
A2 | 3.5 | 8-9 | 12-15 |
B1 | 4 | 10-11 | 16-19 |
B1 | 4.5 | 12-13 | 20-25 |
B1 | 5 | 14-15 | 26-31 |
B2 | 5.5 | 16-17 | 32-40 |
B2 | 6 | 18-19 | 41-50 |
B2 | 6.5 | 20-22 | 51-60 |
C1 | 7 | 23-23 | 61-70 |
C1 | 7.5 | 24-25 | 71-79 |
C1 | 8 | 26-27 | 80-86 |
C2 | 8.5 | 28-29 | 87-89 |
C2 | 9 | 30-30 | 90-90 |
Why was my request blocked (403 Forbidden)?
Your request was likely blocked by Cloudflare due to one or more security checks triggered by their Managed Ruleset.
These rules are designed to detect and prevent potentially malicious or suspicious traffic — even if the request is
not harmful in your specific context. Some common reasons why a request may be blocked include:
- Suspicious file names or extensions: For example, files ending in ".php", ".asp", or other executable formats are often blocked when uploaded or referenced in a request, as they are commonly used in exploit attempts.
- Unusual headers or payloads: If the request body or headers contain unexpected content (e.g., code injection patterns or malformed data), Cloudflare may flag it as suspicious.
- Parameters matching known exploit signatures: For example, the request may match known vulnerabilities like:
- CVE-2018-9206: Exploits in the jQuery File Upload plugin.
- CVE-2019-17132: vBulletin remote code execution vulnerabilities.
Even if your system is not directly affected by these CVEs, a similarity in structure or naming may cause the request to be blocked as a precaution. Cloudflare’s approach is to err on the side of caution — prioritizing security by blocking requests that match known attack patterns or heuristics, even if they turn out to be false positives.
To resolve this:
- You can share the full request details (method, headers, body, URL) with us so we can analyze what specifically triggered the rule.
- If the request is legitimate and expected behavior, we can consider safely creating an exception rule or modifying the security level for your account.