
Designing NLP Engines for Medical Solutions: how to achieve outstanding performance

An NLP engine is a complex collection of systems that requires a diverse pool of expert knowledge and a great deal of care. From designing the most concise, unambiguous patient follow-up protocols, to the constant optimization and improvement of our systems, every factor is vital in maintaining the high-performance standards we strive for on a daily basis.

Language has been at the center of Artificial Intelligence (AI) research since the very beginning. Indeed, one of the first formally defined tests to prove whether or not a machine exhibits intelligence, the famous Turing Test, is based on natural conversations between a machine and a human evaluator. This focus on language has led to the development of the field of Natural Language Processing (NLP) within the more general fields of Machine Learning and AI. NLP combines knowledge from linguistics, AI, computer science and psychology to enhance interactions between humans and machines through the use of language.

With this in mind, it should come as no surprise that NLP lies at the core of LOLA, our virtual medical assistant. In fact, from the moment a patient picks up the phone, several NLP systems spring into action to make sure the conversation moves along as smoothly as possible: LOLA needs to speak in a natural way in order to ask the patient about their symptoms; she needs to listen carefully and wait for the patient to finish answering each question, transcribing each reply as accurately as possible; and she then needs to parse the information contained in the patient’s replies and extract the clinical data with the utmost precision.

All of these actions imply complex systems that need to work together organically in order for the conversation to run naturally, making sure the patient feels comfortable at all times. Furthermore, given the extremely sensitive nature of LOLA’s job, it’s vital that we ensure that the clinical information gathered during each conversation is correct and that we constantly have a finger on the pulse of the accuracy of our NLP engine.

The core components of LOLA’s brain

We have already hinted at what the core components of our NLP engine are, but we will now give them their formal names and define them in more specific terms.

Text to speech

In AI, Text to Speech (TTS) is the process of converting written text into audio using a synthesized voice. Essentially, this is what gives LOLA her voice and enables her to speak. At Tucuvi we use state-of-the-art third-party text-to-speech services based on Deep Neural Networks. Our emphasis is on making LOLA sound as natural as possible. This is achieved partly by carefully selecting the best-sounding voices for each language we deploy. More importantly, though, careful Conversation Design is the main factor in making LOLA sound personable and empathic while remaining clear and easy to understand, all of which has a big impact on the overall performance of the system. We will talk a little more about this in the next section.
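
The article doesn’t name a specific TTS provider, so purely as an illustration of what such an integration involves, here is a minimal sketch using Google Cloud Text-to-Speech as a stand-in; the provider choice, voice name, speaking rate and wording are all assumptions for the example, not necessarily what powers LOLA.

```python
# Illustrative TTS request, using Google Cloud Text-to-Speech as a stand-in provider.
# The voice name, speaking rate and phrasing below are assumptions for this sketch.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(
        text="Hello, this is LOLA. How have you been feeling since our last call?"
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-GB",
        name="en-GB-Neural2-A",  # an assumed neural voice, chosen for naturalness
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=0.95,  # slightly slower delivery tends to feel calmer on the phone
    ),
)

with open("question.mp3", "wb") as f:
    f.write(response.audio_content)
```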

Speech to text

Conversely, Speech to Text (STT) is the task of transcribing speech accurately. This is what happens in real time when the patient replies to each of LOLA’s questions. This task is of key importance and is greatly complicated by the fact that conversations with LOLA happen in a wide variety of circumstances, with differing levels of background noise, varying accents and voice types, and other complicating factors. Again, we use state-of-the-art third-party tools based on Large Language Models, together with a range of internal tools to ensure that what our patients say is accurately registered. These internal tools include active analysis of background noise to optimize the configuration of the STT engine, active latency adjustment to ensure LOLA adapts her response times to the patient’s speech patterns, and full-conversation post-processing to increase accuracy and improve the readability of the transcriptions.
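
To illustrate the kind of adaptive configuration described above, here is a simplified, hypothetical sketch; the function, model names and thresholds are invented for the example and are not Tucuvi’s internal tooling.

```python
# Hypothetical sketch: choose STT settings from a measured noise level and speech rate.
# Model names and thresholds are illustrative, not Tucuvi's actual implementation.
from dataclasses import dataclass

@dataclass
class SttConfig:
    model: str              # which recognition model to request from the STT provider
    end_of_speech_ms: int   # silence (in ms) before we assume the patient has finished

def configure_stt(noise_dbfs: float, words_per_minute: float) -> SttConfig:
    # Noisier phone lines get a more robust model variant;
    # slower speakers get a longer pause window so LOLA doesn't interrupt them.
    model = "telephony_noisy" if noise_dbfs > -30.0 else "telephony"
    end_of_speech_ms = 1200 if words_per_minute < 90 else 800
    return SttConfig(model=model, end_of_speech_ms=end_of_speech_ms)

print(configure_stt(noise_dbfs=-25.0, words_per_minute=80))
# SttConfig(model='telephony_noisy', end_of_speech_ms=1200)
```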

Natural Language Understanding

Crucially, once we have transcribed the patient’s reply, we need to parse the information contained within and make sure it matches what we’re expecting. This is where Natural Language Understanding (NLU) comes into play. An NLU engine takes a piece of written text, such as the transcription of a patient’s reply, and returns the essential information contained within. To this end, two actions are performed: intent detection and named entity recognition.

Intent detection is concerned with classifying the intention behind a given piece of text. For example, in the context of LOLA asking about a patient’s wellbeing, the phrases “I’ve felt a little under the weather this week”, “I haven’t been feeling too well” or “No, I don’t really feel good at all” all share the same intent: the patient is expressing a deterioration of their wellbeing. The range of possible intents expected in the replies to a given question is predefined and fixed by the Product and Conversation Design teams. This way, for the wellbeing context, we could, for example, define the intents better, same and worse. Defining intents correctly and making sure they’re relevant to the question being asked is extremely important for maintaining the performance of our entire NLP system.

Intent detection example for the “wellbeing” context.
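
As a toy illustration of intent classification for the wellbeing context, the sketch below trains a simple text classifier on a handful of made-up training phrases; Tucuvi’s production NLU engine is a far larger system, so treat this purely as a conceptual example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of invented training phrases for the "wellbeing" context.
training_phrases = [
    ("I feel much better than last week", "better"),
    ("I'm doing great, thank you", "better"),
    ("About the same as usual", "same"),
    ("No real change since the last call", "same"),
    ("I've felt a little under the weather this week", "worse"),
    ("I haven't been feeling too well", "worse"),
    ("No, I don't really feel good at all", "worse"),
]
texts, intents = zip(*training_phrases)

# TF-IDF features plus a linear classifier: a tiny stand-in for a real NLU engine.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(texts, intents)

print(classifier.predict(["I don't feel good at all"])[0])  # expected intent: "worse"
```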

Named entity recognition, on the other hand, is concerned with extracting the value of a specific variable. For example, in the context of LOLA asking about the patient's temperature last night, the phrases “Well, I think it was around 36 degrees” or “Oh, I measured it and it was exactly 36.7” would register values of 36 and 36.7, respectively, for the fever variable.
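
In production this extraction is done by a trained entity recognizer, but a minimal rule-based sketch conveys the idea; the regular expression and the accepted temperature range below are illustrative assumptions.

```python
import re
from typing import Optional

def extract_temperature(text: str) -> Optional[float]:
    """Return a body-temperature reading mentioned in the text, if any."""
    # Accept "36", "36.7" or "36,7" and keep only values in a plausible range.
    for match in re.finditer(r"\b(\d{2}(?:[.,]\d)?)\b", text):
        value = float(match.group(1).replace(",", "."))
        if 34.0 <= value <= 43.0:
            return value
    return None

print(extract_temperature("Well, I think it was around 36 degrees"))     # 36.0
print(extract_temperature("Oh, I measured it and it was exactly 36.7"))  # 36.7
```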

NLU engines work by learning from a collection of training phrases associated with their corresponding intents and entities. These training phrases inform the algorithms so that they can infer the intents and entities contained in the responses encountered during a conversation with a patient. When it comes to maintaining high NLU accuracy, one of the most important factors is keeping a clean and efficient repository of training phrases. At Tucuvi, we have several systems in place that actively update, validate and monitor our training-phrase repositories to ensure that the accuracy of our NLU engine is as high as it can possibly be.
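
As one hypothetical example of the kind of validation such a system might perform, the sketch below flags identical training phrases that have been labelled with conflicting intents; the check and the data are illustrative, not a description of Tucuvi’s actual pipeline.

```python
from collections import defaultdict

def conflicting_phrases(repository):
    """Return phrases that appear in the repository under more than one intent."""
    labels = defaultdict(set)
    for phrase, intent in repository:
        labels[phrase.strip().lower()].add(intent)
    return {phrase: intents for phrase, intents in labels.items() if len(intents) > 1}

repo = [
    ("I feel the same as yesterday", "same"),
    ("I feel the same as yesterday", "worse"),  # conflicting label, should be flagged
    ("I feel much better", "better"),
]
print(conflicting_phrases(repo))  # flags "i feel the same as yesterday" with two intents
```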

Conversation Design and its impact on NLP performance

In the previous section, we stated that Conversation Design is one of the most important factors in maintaining LOLA’s high performance. Our mission is to provide natural and empathic interactions with our patients, ensuring they feel comfortable speaking with LOLA. There are many reasons for this, but from an NLP performance point of view, a patient who feels comfortable is a patient who stays engaged throughout the whole conversation. This means their level of attention will be high and their responses will be more accurate. It also means they will be much more likely to follow through with the entire call, allowing us to gather all the clinical information required by our clients.

In this sense, good Conversation Design means taking great care not only over what is said, but over how it is said. Factors such as the speed at which LOLA speaks, the rhythmic patterns she uses, the specific vocabulary, and how successive questions are linked depending on specific replies all contribute immensely to ensuring that the patient feels comfortable. At the same time, it is very important to define exactly which intents and entities we expect to collect in the responses to each question, and to formulate each question with absolute precision, in a way that is completely free of ambiguity.

This is a complex balance to strike and the real challenge behind Conversation Design: making the conversation fluid, engaging and natural, while retaining accuracy and remaining completely unambiguous.
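
To make the idea of linking successive questions concrete, here is a hypothetical sketch of a follow-up protocol fragment in which the next question depends on the detected intent; the question IDs, wording and transitions are invented for illustration and are not Tucuvi’s actual protocols.

```python
# Hypothetical fragment of a follow-up protocol: which question comes next
# depends on the intent detected in the previous reply.
FLOW = {
    "wellbeing": {
        "question": "How have you been feeling since our last call?",
        "next": {"worse": "symptoms_detail", "same": "medication", "better": "medication"},
    },
    "symptoms_detail": {
        "question": "I'm sorry to hear that. Which symptoms are bothering you the most?",
        "next": {"any": "medication"},
    },
    "medication": {
        "question": "Have you been taking your medication as prescribed?",
        "next": {},  # end of this fragment
    },
}

def next_question(current_id: str, detected_intent: str):
    transitions = FLOW[current_id]["next"]
    next_id = transitions.get(detected_intent, transitions.get("any"))
    return FLOW[next_id]["question"] if next_id else None

print(next_question("wellbeing", "worse"))   # LOLA asks about symptoms before moving on
print(next_question("wellbeing", "better"))  # LOLA moves straight to medication
```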

Measuring Tucuvi’s NLP performance

Now that we understand the main elements of our NLP engine, we’re ready to look at how we can measure its performance. At the end of the day, the most important reference for us when evaluating our system’s accuracy is whether or not it can correctly identify the clinical data being reported by our patients. This implies that we need to look at how accurately LOLA can detect the intents and entities associated with each question. To do this rigorously, we use three key metrics: precision, recall and the F1 score.

Intuitively, precision measures how often the system is right when it claims that a patient’s reply contains the intent or entity under evaluation. Following the wellbeing example from the previous sections, precision for the worse intent is the number of times LOLA correctly identified the worse intent divided by the total number of times she assigned it, which includes the occasions when a better or same reply was misclassified as worse. The fewer times we mistakenly assign the worse intent to a reply, the better the precision.

Recall, on the other hand, measures how often the system catches the intent under evaluation when it is actually present. Again using the wellbeing context and the worse intent, recall is the number of times LOLA correctly classified the worse intent divided by the total number of replies that truly expressed it, which includes the occasions when a worse reply was misclassified as better or same, or no intent was detected at all. The fewer worse replies we miss, the better the recall.

Finally, the F1 score combines precision and recall into a single value using their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall). It therefore reflects precision and recall symmetrically in one metric. Since precision and recall are both ratios, all of these metrics take values between zero and one, where one is the highest possible score.

Visual representation of the precision and recall metrics using the “Wellbeing - Worse” intent as an example
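
To make these definitions concrete, here is a small worked example for the worse intent, treated as a one-vs-rest problem; the true and predicted labels are invented purely to illustrate the arithmetic.

```python
# Worked example of precision, recall and F1 for the "worse" intent.
# True and predicted intents below are invented purely to illustrate the arithmetic.
true_intents      = ["worse", "same", "worse", "better", "worse", "same", "worse"]
predicted_intents = ["worse", "worse", "worse", "better", "same",  "same", "worse"]

pairs = list(zip(true_intents, predicted_intents))
tp = sum(t == "worse" and p == "worse" for t, p in pairs)  # correctly detected "worse"
fp = sum(t != "worse" and p == "worse" for t, p in pairs)  # "better"/"same" misread as "worse"
fn = sum(t == "worse" and p != "worse" for t, p in pairs)  # "worse" replies we missed

precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```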

The combination of all the factors and systems described in the previous sections leads to the following average performance metrics for LOLA’s NLP engine, computed over 10,000 conversations from April and May 2023, sourced from our most active patient follow-up programs:

  • Precision = 0.980
  • Recall = 0.977
  • F1 = 0.978

These metrics are excellent and are the reason why so many healthcare professionals trust LOLA to help them care for their patients.

Unlocking Success: The Team Effort Behind LOLA's Exceptional NLP Engine

An NLP engine is a complex collection of systems that requires a diverse pool of expert knowledge and a great deal of care. Maintaining and improving such systems is above all a team effort. The excellent metrics achieved by LOLA are the result of an enormous amount of work and attention to detail from the entire team at Tucuvi. From designing the most concise, unambiguous patient follow-up protocols, to the constant optimization and improvement of our systems, every factor is vital in maintaining the high-performance standards we strive for on a daily basis.
