Hearing the following question over and over can become a little tiresome: “I’ve heard you have a solution that can read unstructured texts. Can you tell me which keywords you use?” That is why we added a slide to our presentations explaining why keywords are not sufficient for text mining, and why we have now decided to publish it online.

We have spent more than two years building new solutions based on advanced data analytics for healthcare. This means using natural language processing (NLP), information extraction, machine learning and expert systems to process electronic medical records. It involves understanding all kinds of medical information, including structured data such as laboratory results and unstructured texts such as medical histories, daily progress notes, consultation notes and discharge summaries.

Let us reveal the truth about keywords in text mining and information extraction. We will illustrate it with the example of finding patients with pneumonia. Although the task might sound quite easy, you would run into at least some of the following problems.

People may read between the lines but machines cannot. You need to teach them everything from the very beginning.

1. Synonyms

When working with keywords, you need to gather a comprehensive list of all synonyms. This list can be very long, because the writing style differs a lot from one doctor (or nurse, or any other health practitioner) to another. You would probably also be interested in patients with bronchopneumonia, pleuropneumonia, pleurobronchopneumonia and other similar diagnoses, because these terms are often used interchangeably.
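As a minimal sketch of what even this first step involves, the snippet below matches a small synonym list against a note. The list `PNEUMONIA_TERMS` and the helper `find_terms` are hypothetical names used only for illustration; a real list would be far longer.

```python
import re

# Hypothetical synonym list built from the terms mentioned above.
PNEUMONIA_TERMS = [
    "pneumonia",
    "bronchopneumonia",
    "pleuropneumonia",
    "pleurobronchopneumonia",
]

# Longest terms first, so the most specific alternative is tried first.
_alternatives = sorted((re.escape(t) for t in PNEUMONIA_TERMS), key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(_alternatives) + r")\b", re.IGNORECASE)


def find_terms(text):
    """Return every listed term found in the text, lower-cased, in order."""
    return [match.group(1).lower() for match in _pattern.finditer(text)]


print(find_terms("Suspected Bronchopneumonia on the right side."))
```

Note that a whole-word search for “pneumonia” alone would not find “bronchopneumonia” here, which is exactly why the list has to be complete.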

2. Acronyms and abbreviations

Healthcare professionals often use abbreviations in medical documentation, because they shorten long phrases and keep the text brief. The problem with abbreviations is that they are often domain-specific and can mean many different things. For example, HAP in a healthcare text probably stands for hospital-acquired pneumonia, but the same abbreviation could mean host access protocol in a technical text. Therefore, you need to know the context and have expert knowledge of the analyzed field to interpret abbreviations correctly.
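One simple way to picture context-dependent expansion is to score each possible meaning by the words around the abbreviation. The sketch below assumes a hypothetical sense inventory `SENSES` with hand-picked cue words; it illustrates the idea and is not a description of a production approach.

```python
import re

# Hypothetical sense inventory: abbreviation -> expansion -> context cue words.
SENSES = {
    "HAP": {
        "hospital-acquired pneumonia": {"patient", "antibiotics", "fever", "ward", "lungs"},
        "host access protocol": {"server", "network", "client", "packet", "port"},
    }
}


def expand(abbreviation, text):
    """Pick the expansion whose cue words overlap most with the text.

    Returns None when no cue word of any sense occurs in the text.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        sense: len(cues & words)
        for sense, cues in SENSES[abbreviation].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(expand("HAP", "Patient developed HAP on the ward, antibiotics started."))
print(expand("HAP", "The client talks to the server over HAP."))
```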

3. Reasoning

You should not stop at the “direct” synonyms and abbreviations, as many other terms might be used to describe the patients you are interested in: for example, patients with an unspecified lower respiratory tract infection, patients with an inflammation of the lungs, or patients with an infection of the alveoli. We use hierarchical dictionary systems in which we keep the relationships among the terms (together with their synonyms). The hierarchy makes it explicit that the pulmonary alveolar structure is a cavity in the lung parenchyma and that the lungs are the primary organs of the lower respiratory tract. When we connect the hierarchy with all the available synonyms and abbreviations of each of these terms, we can easily find more patients with the required set of diagnoses.
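The reasoning over such a hierarchy can be sketched with a plain child-to-parent mapping. The `PARENT` table below is a tiny hypothetical fragment written for this example, not an excerpt from any real dictionary system.

```python
# Hypothetical fragment of an anatomical hierarchy: child -> parent.
PARENT = {
    "pulmonary alveolar structure": "lung parenchyma",
    "lung parenchyma": "lung",
    "lung": "lower respiratory tract",
    "bronchus": "lower respiratory tract",
    "nasal cavity": "upper respiratory tract",
}


def ancestors(term):
    """Walk up the hierarchy and return all broader terms, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_within(term, region):
    """True if the term equals the region or lies somewhere below it."""
    return term == region or region in ancestors(term)


print(ancestors("pulmonary alveolar structure"))
print(is_within("pulmonary alveolar structure", "lower respiratory tract"))
```

With this in place, a query for infections of the lower respiratory tract can also pick up a note that only mentions an infection of the alveoli.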

4. Typos

If you have ever seen a medical record, you have probably noticed one thing they all have in common: typos. These minor errors usually don’t affect the readability of the texts for humans, but they can cause serious trouble for machines (and for any search based only on keywords). For example, “pneumonia” is often misspelled as “penumonia”. You can find this typo even in some official medical publications.
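A common way to tolerate such typos is approximate string matching. The sketch below uses the optimal string alignment distance (edit distance that also counts a swap of two adjacent letters as one edit); the helper names and the `max_distance` threshold are assumptions chosen for illustration.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions
    and swaps of two adjacent characters as one edit each."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def normalize(word, vocabulary, max_distance=1):
    """Map a possibly misspelled word to the closest vocabulary term, or None."""
    word = word.lower()
    best = min(vocabulary, key=lambda term: osa_distance(word, term))
    return best if osa_distance(word, best) <= max_distance else None


print(osa_distance("penumonia", "pneumonia"))
print(normalize("Penumonia", ["pneumonia", "pleurisy"]))
```

A fixed threshold like this is risky for short words, where a single edit can turn one valid term into another, so a real system would need something more careful.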

5. Working with incomplete information

Sometimes the texts don’t contain all the information known to the physicians: the diagnoses may not be explicitly recorded in the documentation. Therefore, you need to consider the symptoms and other events that occurred during the hospitalization. For example, we also monitor mentions of a cough, increased sputum production, respiratory crackles, relevant microbial laboratory results, etc. We also look at general indicators such as fever or a newly started antibiotic treatment.
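The idea of combining indirect evidence can be sketched as counting how many distinct indicators appear across the notes of one hospitalization. The indicator list, the function names and the threshold below are hypothetical and are not a clinical criterion; the plain substring search also deliberately ignores typos and negation.

```python
# Hypothetical indicators taken from the examples in the text above.
INDICATORS = {
    "cough": ["cough"],
    "sputum": ["sputum"],
    "crackles": ["crackles"],
    "fever": ["fever"],
    "antibiotics": ["antibiotic"],
}


def indicators_found(notes):
    """Return the names of all indicators mentioned in any of the notes."""
    text = " ".join(notes).lower()
    return {
        name
        for name, phrases in INDICATORS.items()
        if any(phrase in text for phrase in phrases)
    }


def needs_review(notes, threshold=3):
    """Flag a hospitalization when enough distinct indicators are present."""
    return len(indicators_found(notes)) >= threshold


notes = [
    "Productive cough with increased sputum.",
    "Fever 38.9 C, new antibiotic started.",
]
print(sorted(indicators_found(notes)))
print(needs_review(notes))
```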

6. Negative statements

Maybe you think that by now you have sorted out every problem you could possibly have with keywords. You are now able to find any occurrence of pneumonia or any of its synonyms, abbreviations or symptoms in your texts. And then, while browsing through the selected documents, you find phrases like these: “patient denies chest pains”, “no crackles” or “pneumonia not confirmed”. Therefore, you need to detect negative statements and exclude them from your results. However, not every negation word indicates a negative statement: for example, the sentence “pneumonia cannot be excluded” is often used for a patient with a high risk of pneumonia.
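A rule-based treatment of the phrases above might look like the following sketch. The cue lists are hypothetical and deliberately short, and the function ignores the scope of a negation (any cue before the finding in the same sentence counts), so it only illustrates the idea.

```python
import re

# Hypothetical cue lists; real ones would be much longer.
PSEUDO_NEGATIONS = ["cannot be excluded", "can not be excluded", "not ruled out"]
PRE_NEGATIONS = ["no", "denies", "without"]                  # cue before the finding
POST_NEGATIONS = ["not confirmed", "ruled out", "excluded"]  # cue after the finding


def _contains_cue(text, cues):
    return any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues)


def is_negated(sentence, finding):
    """Return True if the finding occurs in the sentence and looks negated.

    Returns False when the finding is affirmed or not mentioned at all.
    """
    s = sentence.lower()
    # Phrases that contain a negation word but do not negate the finding
    # are removed first, so they cannot trigger the rules below.
    for phrase in PSEUDO_NEGATIONS:
        s = s.replace(phrase, " ")
    before, found, after = s.partition(finding.lower())
    if not found:
        return False
    return _contains_cue(before, PRE_NEGATIONS) or _contains_cue(after, POST_NEGATIONS)


for sentence, finding in [
    ("Patient denies chest pains", "chest pain"),
    ("No crackles", "crackles"),
    ("Pneumonia not confirmed", "pneumonia"),
    ("Pneumonia cannot be excluded", "pneumonia"),
]:
    print(sentence, "->", is_negated(sentence, finding))
```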

Do you still believe you can build a proper text mining solution using only keywords?