Introduction
This document explains the different ways customers can integrate the ELSA API into their products, along with the complexity and security implications of each option. We will refer to the “client” application when speaking about the software that serves the application to the end user, usually a mobile app or a website. Conversely, we will refer to the “customer” when speaking about the company that licenses the ELSA API for use in its products (most likely you, the reader of this document). In terms of where the integration happens, the options discussed here are:
- Server-side integration: usually implemented with a stand-alone, non-expiring ELSA token and API calls made from the customer’s backend servers.
- Client-side integration: it can be implemented using either:
  - A single stand-alone token: This is a non-expiring ELSA token included with the client application. It can be replaced if abuse is detected.
  - A server-supported expiring ELSA token via user authentication: This involves expiring ELSA tokens that are dynamically created for each client application during user authentication.
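As an illustration of the server-supported option, the sketch below shows a customer backend endpoint that hands a short-lived ELSA token to an already-authenticated client. The endpoint path, token-exchange URL, expiry parameter, and field names are placeholders and assumptions for illustration, not the actual ELSA API contract.

```python
# Hypothetical sketch: a customer backend endpoint that issues an expiring
# ELSA token to an authenticated client. All URLs, paths, and field names
# below are placeholders, not the actual ELSA API contract.
import os
import requests
from flask import Flask, jsonify

app = Flask(__name__)

# Long-lived customer credential, kept server-side only (never shipped to clients).
ELSA_MASTER_TOKEN = os.environ["ELSA_MASTER_TOKEN"]
ELSA_TOKEN_ENDPOINT = "https://api.example-elsa.com/v1/tokens"  # placeholder URL

@app.route("/api/speech-token", methods=["POST"])
def issue_speech_token():
    # In a real deployment, verify the end user's session here before
    # requesting a token on their behalf.
    resp = requests.post(
        ELSA_TOKEN_ENDPOINT,
        headers={"Authorization": f"Bearer {ELSA_MASTER_TOKEN}"},
        json={"expires_in": 3600},  # hypothetical expiry parameter
        timeout=10,
    )
    resp.raise_for_status()
    # Return only the short-lived token to the client application.
    return jsonify({"elsa_token": resp.json().get("token")})
```

The key point of this pattern is that the non-expiring credential never leaves the customer’s servers; the client only ever holds a token that expires on its own.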
If we consider the communication method being used, we can talk about:
- Single-file mode
- Streaming mode
Single-file implementation
In a single-file implementation, the customer submits the full audio recording to be evaluated, along with the expected text, to the ELSA API. This approach is notably straightforward, requiring only an authenticated REST API call to the ELSA server. However, it is less efficient than streaming, resulting in higher latency before a response is obtained. Additionally, it does not support server-triggered termination of user recordings.
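To make the single-file flow concrete, the sketch below performs one authenticated HTTP call using Python’s requests library. The endpoint URL, form-field names, and response shape are assumptions for illustration only; consult the ELSA API reference for the actual contract.

```python
# Hypothetical sketch of a single-file evaluation request.
# The endpoint URL and field names are placeholders, not the actual ELSA API contract.
import requests

ELSA_API_URL = "https://api.example-elsa.com/v1/score"  # placeholder URL
ELSA_TOKEN = "YOUR_ELSA_TOKEN"  # stand-alone or expiring token

def score_recording(audio_path: str, expected_text: str) -> dict:
    """Send the full recording plus the expected text and return the result."""
    with open(audio_path, "rb") as audio_file:
        response = requests.post(
            ELSA_API_URL,
            headers={"Authorization": f"Bearer {ELSA_TOKEN}"},
            data={"expected_text": expected_text},
            files={"audio": audio_file},
            timeout=30,
        )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = score_recording("recording.wav", "The quick brown fox jumps over the lazy dog")
    print(result)
```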
Streaming implementation
A streaming implementation transmits the audio to the ELSA server as it is being recorded by the client application. This method uses a websocket connection to the API server, governed by a communication protocol consisting of JSON messages and raw audio packets sent as soon as they become available from the client’s recording device (a protocol sketch follows the list below). Although more complex to implement, streaming offers two significant advantages over a single-file implementation:
- Processing begins upon receipt of the first audio packet at the server, resulting in slightly reduced response latency.
- It enables either the server or the client to initiate recording termination (referred to as acoustic endpoint detection). If triggered by the server, the recording ends when the server detects that the user has finished speaking the provided text or has paused for a prolonged period.
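The sketch below illustrates the general shape of such a streaming session using the Python websockets library: open a websocket, send a JSON message describing the session, stream raw audio chunks, then wait for the final result. The URL, message fields, authentication scheme, and end-of-stream marker are assumptions for illustration, not the actual ELSA protocol.

```python
# Hypothetical sketch of a streaming session. The URL, message fields,
# auth scheme, and end-of-stream marker are placeholders, not the actual
# ELSA protocol.
import asyncio
import json
import websockets

ELSA_WS_URL = "wss://api.example-elsa.com/v1/stream"  # placeholder URL
ELSA_TOKEN = "YOUR_ELSA_TOKEN"

async def stream_recording(audio_path: str, expected_text: str) -> dict:
    async with websockets.connect(ELSA_WS_URL) as ws:
        # First JSON message: token, expected text, and audio format
        # (hypothetical field names).
        await ws.send(json.dumps({
            "token": ELSA_TOKEN,
            "expected_text": expected_text,
            "sample_rate": 16000,
        }))

        # Send raw audio packets as they become available from the recorder;
        # here we read a file in small chunks to simulate a live microphone.
        with open(audio_path, "rb") as audio_file:
            while chunk := audio_file.read(3200):  # ~100 ms of 16 kHz 16-bit mono
                await ws.send(chunk)

        # Hypothetical end-of-stream marker, then wait for the result message.
        await ws.send(json.dumps({"event": "end_of_stream"}))
        return json.loads(await ws.recv())

if __name__ == "__main__":
    print(asyncio.run(stream_recording("recording.wav", "The quick brown fox")))
```

In a real client the chunks would come from the microphone rather than a file, and the loop would also listen for a server message signalling that the acoustic endpoint was detected so recording can stop early.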
Choosing the best option
In 90% of cases a single-file implementation will be sufficient, although it does not offer endpoint detection and has slightly higher response latency. To get an idea of the difference in latency between single-file and streaming, the following figure shows the average processing time (in seconds) for both methods as a function of the number of words in the sentence. While both grow linearly with sentence length, the difference stays between 0.5s and 1s. For short sentences (< 20 words), unless endpointing is a must, we recommend integrating with the ELSA API using a single-file implementation.