Why automating clinical phone calls is so challenging

At Tucuvi we have developed a voice assistant that automates clinical calls, from pre-consultation questionnaires to post-discharge follow-up and the home monitoring of chronic patients. What challenges have we faced?

In the last decade there has been a huge increase in the use of voice assistants, with products such as Siri, Alexa and Google Assistant achieving very high penetration in a very short time. Most of these virtual assistants are built to understand and perform specific tasks that can be completed in one or a few short sentences, such as playing a song or checking the weather.

However, when you want to involve the user in a real conversation, as happens when automating phone calls, the challenge becomes much more complex, and this area of work is still at a very early stage.

In this post we want to share our experience and the challenges we have encountered during these two years of work.

Automatic speech recognition

Automatic Speech Recognition (ASR) is a fundamental part of our solution. Highly accurate ASR is essential for capturing clinically relevant information about the patient's experience, as it results in fewer missed or faulty transcriptions.

Several factors make this process difficult: the acoustic conditions in which the call takes place, ambient noise, each patient's particular vocabulary and the characteristics of their way of speaking. Distinguishing between background noise and the patient's own voice is complex, and failing to do so can lead to the transcription of irrelevant information. This in turn makes it harder for Natural Language Processing (NLP) algorithms to understand the transcription and for the assistant to respond properly to the patient.
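To make the difficulty concrete, here is a naive energy-threshold baseline for deciding whether a frame of audio contains speech. The threshold and sample values are arbitrary illustrations, not our production system; the example exists only to show why this problem is hard: loud ambient noise (a television, street traffic) crosses a fixed energy threshold just as speech does.

```python
ENERGY_THRESHOLD = 0.02  # arbitrary threshold on mean squared amplitude

def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def is_speech_naive(samples):
    """Naive voice-activity decision: anything loud counts as speech."""
    return frame_energy(samples) > ENERGY_THRESHOLD

quiet_room = [0.005] * 480    # low-energy frame: correctly rejected
loud_tv = [0.3] * 480         # loud background noise: wrongly accepted
patient_voice = [0.2] * 480   # actual speech: accepted
```

Because loud noise and speech are indistinguishable on energy alone, real systems need trained acoustic models rather than simple thresholds.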

Most voice assistants are designed to start working with a trigger word, such as the name of the assistant itself. This word provides a clear starting point at which the assistant begins to listen for the task or request to which it should respond.

However, over the telephone there is no such clear boundary between when the patient starts and finishes speaking. In addition, sentences are longer and more elaborate and contain more pauses, which can be mistaken for the end of what the patient wants to communicate, leading to unwanted interruptions or, alternatively, very long latencies.
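A minimal sketch of pause-based endpointing, one common way to decide when a caller has finished speaking: a run of silent frames only counts as end-of-utterance once it exceeds a timeout. The frame interface, the classifier and the threshold values are illustrative assumptions, not Tucuvi's actual implementation.

```python
SILENCE_TIMEOUT_S = 1.2  # must be longer than a typical mid-sentence pause
FRAME_DURATION_S = 0.03  # 30 ms audio frames

def detect_end_of_utterance(frames, is_speech):
    """Return the index of the last speech frame once the caller has been
    silent for SILENCE_TIMEOUT_S, or None if the utterance is still ongoing.

    `frames` is a sequence of audio frames; `is_speech(frame)` is any
    voice-activity classifier returning True for speech frames.
    """
    silence_run = 0.0
    last_speech_idx = None
    for i, frame in enumerate(frames):
        if is_speech(frame):
            silence_run = 0.0
            last_speech_idx = i
        else:
            silence_run += FRAME_DURATION_S
            if last_speech_idx is not None and silence_run >= SILENCE_TIMEOUT_S:
                return last_speech_idx
    return None
```

Raising SILENCE_TIMEOUT_S yields fewer unwanted interruptions at the cost of a slower response, which is precisely the trade-off described above.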

Call length

For a virtual voice assistant to handle a phone call, the system needs to understand the context of previous interactions, and that becomes harder as calls grow longer.

As conversations become longer, relevant information may be spread across a greater number of interactions. The longer the call, the more points there are at which the topic of the conversation may change, and the system must recognise that the context of the previous topic may no longer be relevant to the current one. These aspects demand a more complex dialogue system: a combination of NLP models that detect intents and entities, and logic that handles interactions and context changes based on the responses from previous turns.
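As an illustration, a minimal context store that discards accumulated slots when the detected intent moves the conversation to a new topic might look like the sketch below. The intent and entity names are hypothetical, and the actual intent/entity detection (the NLP models mentioned above) is assumed to happen elsewhere.

```python
class DialogueContext:
    """Keeps the slots gathered for the current topic and drops them
    when the conversation moves on, so stale context cannot leak into
    the interpretation of later answers."""

    def __init__(self):
        self.topic = None
        self.slots = {}

    def update(self, intent, entities):
        if intent != self.topic:
            # Topic change: the previous topic's context may not apply.
            self.topic = intent
            self.slots = {}
        self.slots.update(entities)
        return dict(self.slots)

ctx = DialogueContext()
ctx.update("report_pain", {"pain_level": "7"})
ctx.update("report_pain", {"pain_location": "knee"})   # same topic: slots merge
state = ctx.update("medication_check", {"medication": "ibuprofen"})  # new topic: reset
```

In a real system the reset policy would be far more nuanced (some context, such as the patient's identity, must survive topic changes), but the principle is the same.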

Latency

Most devices that support voice assistants give some kind of visual response, such as a light or an icon, which lets users know that the system has heard them and is processing the request. On the phone there is no such visual confirmation, so latency management is more complex, with a trade-off between interruptions and the speed of the assistant's response. Every part of the speech system must be as efficient as possible, from speech recognition to the NLP models, database queries and API calls.

Because of this, one of the most important challenges we have been working on is to get the latency to a point where the conversations between the assistant and the patients are fluid and natural.
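As a sketch of how that efficiency can be tracked, the snippet below times each stage of a hypothetical pipeline so the end-to-end budget can be monitored. The stage names and stand-in functions are illustrative; the point is that total latency is the sum of every stage, so each one has to be measured before it can be optimised.

```python
import time

def timed(timings, stage, fn, *args):
    """Run `fn`, record its wall-clock duration under `stage`, return its result."""
    start = time.perf_counter()
    result = fn(*args)
    timings[stage] = time.perf_counter() - start
    return result

timings = {}
# Stand-in stages for ASR, NLP and TTS; real systems would also time
# database queries and external API calls.
transcript = timed(timings, "asr", lambda audio: "my head hurts", None)
intent = timed(timings, "nlp", lambda text: "report_symptom", transcript)
reply_audio = timed(timings, "tts", lambda text: b"audio-bytes", "understood")
total_latency = sum(timings.values())
```

With per-stage timings in hand, the slowest links in the chain become visible, which is where optimisation effort pays off first.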

User experience: expectations, trust and confidence

When patients receive a call from our virtual assistant for the first time, there is often a lack of confidence as to whether it can really understand and manage the conversation, and the patient's first answers tend to be very short, with monosyllables predominating. As patients see that they can speak naturally and that the assistant understands them, they begin to open up and express themselves in a more personal way, giving more elaborate answers in their own words.

Moreover, the concern about dehumanising care is a recurring theme whenever conversational AI is discussed in healthcare. The key to avoiding this is to understand that the calls made by our voice assistant do not replace clinical activity but complement it. It is a support tool that makes it possible to reach more patients more quickly and to identify where professional action is most urgent. In this way it allows care to be scaled and improves the efficiency of hospital processes, freeing up professionals' time and allowing them to prioritise the patients in the most serious situations.

As the space matures and more and more patients have positive experiences with virtual assistants over the phone, this concern will begin to fade. Despite the novelty of our solution, the progress we have made has led to a satisfaction rating of 4.7/5 among patients talking to the clinical voice assistant, and their adherence to the calls has reached 98%, a figure unmatched in the market.

Contact us

Do you have any questions?

We’d love to hear from you. Please fill out this form or shoot us an email.
