The following essay is an English academic writing project, which I now present to you as a primer for my natural language processing related blogposts.
Human expression of thought and intent, aptly named natural language, carries vast amounts of information. The topic, style, word choice, and several other aspects all contribute to its interpretation and to the information that can be extracted from it. Analysing natural language text data can be a daunting task, for it may contain thousands of different words, which form sentences, paragraphs and eventually complete narratives (Shivhare and Khethawat 2012). Essentially, each word in a sentence indirectly affects the meaning of the others, creating, in a non-linear fashion, a very specific semantic meaning.
Manually extracting information from clinical texts, such as patient queries, is time-consuming. Automating the processing of clinical text is preferable, but a challenging task. If successful, it may make it possible to quickly extract valuable information about a patient's disease, medication, and mental and familial history (Zhu et al. 2019). Thus, the need for automated natural language processing (NLP) has emerged. As text pre-processing and machine learning have made great advances, NLP is emerging as a viable option for automated text analysis. In this essay, we shall review the most common methods NLP utilizes and the challenges it faces, in the context of clinical text analysis.
Clinical text inherently contains a varied array of specialized terms, abbreviations, and professional jargon that can cause many challenges in text processing. To extract the semantic meaning from clinical text, pre-processing is unavoidable. Clinical text pre-processing includes sentence boundary detection, tokenization, lemmatization, and part-of-speech tagging. Sentence boundary detection segments a text into individual sentences, but certain abbreviations such as m.g., m.l., or Dr. present exceptional cases for determining sentence boundaries (Jurafsky and Martin 2008). One way to counter this challenge is to hard-code exception cases into the sentence boundary detector, but clinical text may carry more exceptional cases than one can predict.
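As a minimal sketch of this exception-listing approach, the snippet below uses NLTK's Punkt sentence tokenizer, whose parameters accept a set of known abbreviations. The example text and the abbreviation list are my own illustrations, and the exact handling of multi-period abbreviations like "m.g." may vary between NLTK versions.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Abbreviations that end in a period but do not end a sentence.
# Punkt expects them lowercase and without the final period.
params = PunktParameters()
params.abbrev_types = {"dr", "m.g", "m.l"}

tokenizer = PunktSentenceTokenizer(params)

text = "Dr. Smith prescribed 5 m.g. of warfarin. The patient tolerated it well."
for sentence in tokenizer.tokenize(text):
    print(sentence)
# Without the abbreviation list, 'Dr.' and 'm.g.' could trigger
# false sentence breaks at their periods.
```

The weakness the essay points out is visible here: every abbreviation must be anticipated in advance, and clinical text keeps producing new ones.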
The most basic pre-processing task is tokenization, which separates each word within a sentence into an individual feature; however, special characters, such as forward slashes connecting two words, can cause failures in word separation. Part-of-speech tagging annotates each word within a sentence with a tag describing its grammatical role. Morphological decomposition of words, or lemmatization, removes word suffixes to recover each word's base form (Jurafsky and Martin 2008). Both part-of-speech tagging and lemmatization are prone to error due to human misspelling (Gurusamy and Kamman 2014). These tasks are only a small part of NLP and a stepping-stone for deeper text-analysis processes.
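To make these three steps concrete, here is a small sketch using NLTK; the example sentence and the slash-joined token are my own assumptions, and other toolkits such as spaCy would work equally well.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of NLTK's tokenizer model, tagger, and WordNet data
# (resource names may differ slightly between NLTK versions).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

sentence = "The patient was prescribed 20 mg/day and reported improving symptoms."

# Tokenization: 'mg/day' survives as a single token, illustrating
# how a forward slash can defeat word separation.
tokens = nltk.word_tokenize(sentence)
print(tokens)

# Part-of-speech tagging: e.g. ('reported', 'VBD'), ('symptoms', 'NNS').
print(nltk.pos_tag(tokens))

# Lemmatization strips inflectional suffixes: 'improving' -> 'improve'
# when treated as a verb.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(tok.lower(), pos="v") for tok in tokens])
```

A misspelled word such as "improvng" would pass through the lemmatizer unchanged, which is the kind of error propagation the essay warns about.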
Once the clinical text is pre-processed, it must be turned into a numerical representation, for example an array of matrices in which each word carries a semantic weight. This makes the text data readable to the computer, which, using mathematical algorithms, trains a machine learning model that can predict semantic meaning from text (Zhu et al. 2019). There is a wide array of machine learning algorithms that apply probabilistic formulas to statistical word frequencies, deriving broad rules for the semantic meaning of text. Other approaches count the frequency of each word in a sentence into a feature vector, from which statistical predictions can be made (Quinlan 1993). Nevertheless, in semantic analysis, most machine learning models are developed for one specific task only, for their predictive accuracy is tied to the training data they were developed to analyse.
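As an illustration of such frequency-based feature vectors, the following sketch builds a bag-of-words representation with scikit-learn and fits a probabilistic naive Bayes classifier on a toy labelled dataset. The sentences and labels are invented for demonstration and bear no relation to real clinical data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: each sentence is labelled with a made-up topic.
texts = [
    "patient reports chest pain and shortness of breath",
    "chest pain worsened after exercise",
    "prescription renewed for blood pressure medication",
    "patient asks about medication side effects",
]
labels = ["symptom", "symptom", "medication", "medication"]

# Count the frequency of each word, producing one feature vector
# per sentence (a sparse sentences x vocabulary matrix).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# A probabilistic model over the word-frequency features.
model = MultinomialNB()
model.fit(X, labels)

query = ["shortness of breath during exercise"]
print(model.predict(vectorizer.transform(query)))  # expected: ['symptom']
```

Note how the model only knows the vocabulary of its training data, which mirrors the essay's point that predictive accuracy is tied to the data a model was developed to analyse.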
Creating a reliable NLP prediction is not as simple as it may seem. Hence, many companies have created user-friendly solutions, such as IBM Watson. Watson has attracted much attention within the bioinformatics community, for it integrates several software technologies and has been used to demonstrate medical diagnosis from clinical text data (Fitzgerald 2011). Nevertheless, its discernible limitations in deeper analysis highlight ongoing NLP challenges.
Overall, NLP holds great potential in clinical text analysis and has recently been utilized in word annotation and semantic analysis tasks. However, creating an NLP solution that can reliably predict semantic meaning from clinical text data requires a large, pre-annotated dataset. Producing one demands expertise and many work hours that many hospitals cannot necessarily afford. In the future, technology companies will likely work in closer conjunction with hospitals to apply their NLP solutions to clinical text analysis.
In the following blogposts, I shall present my personal experience with natural language enrichment and pseudonymization using FiNER. My FiNER experiments are already available in Jupyter Notebook format on my GitHub page. After that, I'll present my own machine learning experiments on time-sequential natural language documents.