This document explains the different ways customers can integrate the ELSA API into their products, and discusses the difficulty and security implications of each option. We will refer to “client” applications when speaking about the software that serves the application to the end user, usually a mobile app or a website. Conversely, we will refer to the “customer” when speaking about the company that licenses the ELSA API for use in its products (most probably you, the reader of this document). In terms of where the integration happens, the options discussed here are:
- Server-side integration → usually implemented with a stand-alone, non-expiring ELSA token and API calls integrated into the customer's backend servers.
- Client-side integration → can be implemented using either:
  - Single stand-alone token → a non-expiring ELSA token shipped with the client application, with the option to replace it in case of detected abuse
  - Server-supported expiring ELSA token via user authentication → expiring ELSA tokens created on the fly for each client application
If we consider the communication method being used, we can talk about:
- Single-file mode
- Streaming mode
In single-file implementations, the customer calls the ELSA API providing the full audio recording to be evaluated plus the expected text. A single-file implementation is the easiest method by far, as it only requires an authenticated REST API call to the ELSA server. On the other hand, it is less efficient than streaming (higher latency to obtain a response) and it does not allow the server to automatically end the user's recording.
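A single-file call can be sketched as one authenticated POST carrying the recording and the expected text. The endpoint URL, field names, and base64 encoding below are illustrative placeholders, not the real ELSA API contract; consult the API reference for the actual endpoint and payload format (the real API may, for instance, expect multipart/form-data instead of JSON).

```python
import base64
import json
from urllib import request

# Placeholder endpoint; the real ELSA API URL will differ.
API_URL = "https://api.example.com/v1/score"


def build_payload(audio_bytes: bytes, expected_text: str) -> dict:
    """Bundle the full recording and the expected text into one request body.

    The base64 encoding and field names are assumptions for this sketch.
    """
    return {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "text": expected_text,
    }


def score_recording(token: str, audio_bytes: bytes, expected_text: str) -> dict:
    """Single-file evaluation: one authenticated REST call, blocking until
    the full result is ready (hence the higher latency versus streaming)."""
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(audio_bytes, expected_text)).encode(),
        headers={
            # Token auth header; works for both non-expiring and
            # server-supported expiring ELSA tokens.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Because the whole recording is sent at once, the client decides when the recording ends; there is no server-triggered endpoint detection in this mode.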
A streaming implementation transmits the audio to the ELSA server as it is being recorded by the client application. It uses a websocket connection with the API server, governed by a communication protocol consisting of JSON messages and raw audio packets, sent as soon as they are available from the client's recording device. This method is much more complicated to implement but presents two major advantages over the single-file implementation:
- Processing starts as soon as the first audio packet is received by the server, slightly reducing the response latency at the end.
- It allows either the server or the client to trigger the end of the recording (server-side triggering is called acoustic endpoint detection). The server triggers the end when it detects that the user has finished speaking the provided text or has paused for a long time.
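The interleaving of JSON control messages and raw audio frames described above can be sketched as a message generator. The message names (`start`, `end`), the chunk size, and the framing are assumptions for illustration, not the real ELSA protocol; the actual websocket transport is omitted so the sketch focuses on the message sequence.

```python
import json

# Roughly 100 ms of 16 kHz, 16-bit mono audio; the real packet size
# depends on the client's recording device and the protocol spec.
CHUNK_SIZE = 3200


def stream_messages(expected_text: str, audio: bytes):
    """Yield the message sequence for one hypothetical streaming session:
    a JSON start message carrying the expected text, raw binary audio
    frames as they become available, then a JSON end-of-stream message.

    In a real client, audio frames would be yielded live from the
    recording device rather than sliced from a finished buffer, and the
    session may end early if the server signals an acoustic endpoint.
    """
    yield json.dumps({"type": "start", "text": expected_text})
    for i in range(0, len(audio), CHUNK_SIZE):
        yield audio[i:i + CHUNK_SIZE]  # sent as a binary websocket frame
    yield json.dumps({"type": "end"})  # client-triggered end of recording
```

Each yielded item maps to one websocket frame: text frames for the JSON messages, binary frames for the audio packets.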
Choosing the best option
In 90% of cases a single-file implementation will be sufficient, even though it does not offer endpoint detection and has slightly higher response latency. To get an idea of the latency difference between single-file and streaming, the following figure shows the average processing time (in seconds) for both methods, conditioned on the number of words in the sentence. While both grow linearly with sentence length, the difference stays between 0.5 s and 1 s. For short sentences (< 20 words), unless endpointing is a must, we recommend integrating with the ELSA API using a single-file implementation.