Introduction

Text extraction is a fundamental part of working with data in gastroenterology. Most of the data we deal with is either unstructured or semi-structured text. Semi-structured means that there is some standardised organisation to the text, but the user has essentially got free rein to fill in free text wherever there is a heading.

So how do we deal with this? R has many possible tools, but we need to devise one over-arching approach that deals with all eventualities:

The main situations to deal with are as follows:

1. Extracting diagnoses eg. ‘Dysplasia is seen’

2. Excluding text where the diagnosis is normal but is still mentioned eg. ‘There is no dysplasia…’

3. Dealing with text where there are subclauses eg. ‘There is no obvious dysplasia, but there may be some at the sample margin’

This is tricky. The methodology I use is as follows:

1. Order the text into sentences with defined borders (eg ensure that every sentence ends with a full stop or a newline).

2. Extract logical sections of the text into their own columns

3. Clean the text (eg get rid of unnecessary characters).

4. Negation extraction.

5. Extraction of terms of interest into their own column.

Of the steps above, step 2 is the most difficult to make generic between reports, especially reports from different hospitals, and step 3 is the most difficult for all units. But where there’s a will…. A rough sketch of how the steps fit together is shown below.
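Very roughly, and purely as an illustration (the miniature report and the patterns here are made up; real reports need the more careful handling described below), the five steps might look like this:

library(stringr)

#A made-up miniature report, just to show the shape of the pipeline
report<-"CLINICAL DETAILS Reflux. HISTOLOGY Barretts oesophagus is present. No dysplasia is seen."

#1. Define sentence borders (full stop -> newline)
report<-gsub("\\.","\n",report)

#2. Extract a logical section into its own variable/column
histol<-str_extract(report,"HISTOLOGY(.|\n)*")

#3. Clean unwanted characters
histol<-trimws(gsub("HISTOLOGY","",histol))

#4. Drop negated sentences (the negation patterns are placeholders)
sentences<-trimws(unlist(strsplit(histol,"\n")))
sentences<-sentences[!grepl("^No |[Nn]either",sentences)]

#5. Extract the terms of interest
str_extract(sentences,"[Bb]arrett")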

This page will describe the use of gsub as a way of ordering and cleaning the text:



The problem: How do I clean my text?

Let’s create a report, let’s say from pathology:

pathRep<-c("CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.Three pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip.  HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.")

At the moment it is fairly unstructured, so let’s create some borders. The structure looks as though each section starts with a header in capitals and ends with a full stop, so let’s replace each full stop with a newline first to see things more clearly:



Using gsub to find and replace

pathRep<-gsub("\\.","\n",pathRep)
pathRep
## [1] "CLINICAL DETAILSOesophageal stricture\nMACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5\nNature of specimen not stated on pot\nThree pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip\n  HISTOLOGYThe appearances are suggestive of Barretts oesophagus\nNeither dysplasia nor malignancy is seen\n DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation\n- Squamo-columnar mucosa with intestinal metaplasia\n- Suggestive of Barretts oesophagus\n- No evidence of dysplasia or malignancy\n"

So gsub is a find-and-replace function in R. At least now we can see what the report looks like more clearly, so let’s clean it up a little more. We want to extract as much as possible here, so firstly let’s extract sections according to their logical separation. It looks as though the headers are in capitals, so let’s separate those into their own columns. We are going to use the stringr library to do this. This allows us to use regular expressions, which are flexible ways of matching text patterns. More information is here: http://www.regular-expressions.info. Because each section starts with a capitalised header, we can use the headers as borders and match everything between them.



Using str_extract to separate text of interest into columns

#Lets start again:
pathRep<-c("CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.Three pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip.  HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.")

library(stringr)
ClinDet<-str_extract(pathRep,"CLINICAL DETAILS.*MACROSCOPICAL DESCRIPTION")
MacDet<-str_extract(pathRep,"MACROSCOPICAL DESCRIPTION.*HISTOLOGY")
HistolDet<-str_extract(pathRep,"HISTOLOGY.*DIAGNOSIS")
DiagDet<-str_extract(pathRep,"DIAGNOSIS.*")
myPath<-data.frame(ClinDet,MacDet,HistolDet,DiagDet,stringsAsFactors=F)
myPath
##                                                           ClinDet
## 1 CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTION
##                                                                                                                                                                                                                                                                        MacDet
## 1 MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.Three pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip.  HISTOLOGY
##                                                                                                            HistolDet
## 1 HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSIS
##                                                                                                                                                                                                         DiagDet
## 1 DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.

Using stringr and gsub to get the size and number of biopsies

OK, that’s quite useful. What if we want to do more? Now I want to be able to calculate the size of each of the biopsy specimens, or at least get the size of the largest biopsy taken here. I also want to find out, as a number, how many biopsies were taken. Again we can use gsub and stringr.

For the size of the biopsy specimen we can use the following:

#Lets extract just the largest biopsy
myPath$BxSize<-str_extract(myPath$MacDet, "the largest.*?mm")
myPath$BxSize
## [1] "the largest measuring 1 x 2 x 2 mm"
#Lets get rid of some text we don't need
  myPath$BxSize<-gsub("the largest measuring ","",myPath$BxSize)
  myPath$BxSize<-gsub("mm","",myPath$BxSize)
  myPath$BxSize<-gsub("less than","",myPath$BxSize)
myPath$BxSize
## [1] "1 x 2 x 2 "
#Now lets use the stringr library to match the three dimensions and multiply them together
  dims<-str_match(myPath$BxSize, "([0-9]+).*?([0-9]+).*?([0-9]+)")
  myPath$BxSize<-as.numeric(dims[,2])*as.numeric(dims[,3])*as.numeric(dims[,4])
myPath$BxSize
## [1] 4

To find out how many biopsies we have taken we’ll have to change some text into numbers:

 myPath$MacDet<-gsub("Three","3",myPath$MacDet)
#But remember there may be more than one pot of biopsies so we have to get all of them:
myPath$NumbOfBx<-str_extract_all(myPath$MacDet, "([A-Za-z]*|[0-9]) piece.*?(([0-9]).*?x.*?([0-9]).*?x.*?([0-9])).*?([a-z]\\.)")
  myPath$NumbOfBx<-sapply(myPath$NumbOfBx, function(x) sum(as.numeric(unlist(str_extract_all(x, "^\\d+")))))
  myPath$NumbOfBxs<-unlist(myPath$NumbOfBx)
  myPath$NumbOfBx<-as.numeric(str_extract(myPath$NumbOfBx,"^.*?\\d"))

So 3 biopsies were taken in all.

The resulting dataset looks like this:

knitr::kable(myPath)
ClinDet: CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTION
MacDet: MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.3 pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip. HISTOLOGY
HistolDet: HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSIS
DiagDet: DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.
BxSize: 4
NumbOfBx: 3
NumbOfBxs: 3



Semi-structured text cleaning issues:


The exclusion of negatives:

So that’s pretty good. But now we want some more data. What about trying to extract a final diagnosis here? It’s tricky, as in fact there are three diagnoses and one non-diagnosis (“No evidence of dysplasia or malignancy”).

Given we have extracted the diagnoses into the DiagDet column, we can focus just on this one. The approach I take is firstly to exclude negative diagnoses. Having explored many of these reports, I know that they normally do not include subclauses, so the complexity of a sentence in a medical report is usually low. This is useful to know. Furthermore, we can define patterns that isolate negative sentences, and these patterns are fairly universal. They are the subject of ongoing work so I won’t give the patterns until they are published, but for this example all we need to do is use gsub again:

myPath$DiagDet<-gsub("No evidence.*","",myPath$DiagDet)
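As a purely illustrative sketch of the same idea applied sentence by sentence (the negation patterns below are placeholders, not the validated list mentioned above), you could split the section into rough sentences and drop any that match a negation pattern:

#Split the diagnosis section into rough sentences
dxSentences<-trimws(unlist(strsplit(myPath$DiagDet,"\\.|- ")))

#A few placeholder negation patterns - illustrative only, not a validated list
negPatterns<-c("^No ","[Nn]o evidence","[Nn]either","[Nn]ot seen")

#Keep only non-empty sentences that do not match any negation pattern
dxSentences<-dxSentences[dxSentences!="" & !grepl(paste(negPatterns,collapse="|"),dxSentences)]
dxSentences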



Using lookup tables to extract terms of interest:

So now what? We have words in there we don’t need, and “Oesophagus, biopsies_” is not a diagnosis, so how do we exclude that? At this point the best way is to define a corpus of diagnoses you are interested in and then look these up. This is a type of lookup list and is a very useful technique: you focus on what you want to assess the text for, rather than endlessly cleaning the text and getting unpredictable results.

The lookup lists can be done as follows:

#This will return any record that has the desired search term
toMatch<-c("[Bb]arrett","[Cc]andida","[Cc]oeliac","[Ee]osinophilic","[Pp]eptic")
myPath$DxFromCorpus <- unique(grep(paste(toMatch,collapse="|"), myPath$DiagDet, value=TRUE))

 

#This will extract the match
myPath$DxFromCorpus<-str_extract_all(myPath$DiagDet, "[Bb]arrett|[Cc]andida|[Cc]oeliac|[Ee]osinophilic|[Pp]eptic")
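If you also want a tidy, canonical label rather than the raw matched text, a simple named vector can act as the lookup table. This is just a sketch; the mappings below are examples, not an agreed dictionary:

#Map each matched term onto a canonical diagnosis label (example mappings only)
corpusLookup<-c("barrett"="Barretts oesophagus",
                "candida"="Candidiasis",
                "coeliac"="Coeliac disease",
                "eosinophilic"="Eosinophilic oesophagitis",
                "peptic"="Peptic ulceration")

matches<-tolower(unlist(myPath$DxFromCorpus))
myPath$DxCanonical<-paste(unique(corpusLookup[matches]),collapse="; ")
myPath$DxCanonical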

Natural language processing basics



Using document term matrices to extract terms of interest:

Text is always messy: it is full of punctuation, bad spelling, word variations such as plurals, and so on. This means that we need a way to clean the text that may be slightly different to the above. Natural language processing is a vast area and contains methods to clean the text. One common starting point is the use of a document term matrix, which does exactly what its name suggests: it counts how often each term appears in each document. Below is a methodology we can use.

We are going to use the pre-prepared dataset for this:

#To get the prepared endoscopy reports we are going to use the pre-prepared dataset here:
EndoHistoMerge<-source('EndoPathMerged_ExternalCode.R')
EndoHistoMerge<-as.data.frame(EndoHistoMerge)
names(EndoHistoMerge)<-gsub("value.","",names(EndoHistoMerge),fixed=T)



Now we create the document term matrix using the packages listed in the code. We will run this on one of the free text columns:

library(dplyr)
library(directlabels)
library(splitstackshape)
library(tm)
library(SnowballC)

#Build a corpus from the free text Diagnoses column and clean it up before creating the document term matrix

  mywords<-head(EndoHistoMerge$Diagnoses,100)
  jeopCorpus <- Corpus(VectorSource(EndoHistoMerge$Diagnoses))
  #jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
  jeopCorpus <- tm_map(jeopCorpus, content_transformer(removeWords), stopwords("english"))
  jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
  jeopCorpus <- tm_map(jeopCorpus, stripWhitespace)
  jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
  jeopCorpus <- tm_map(jeopCorpus, stemDocument)
  
  
  #wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
  
  #Get the frequency table of terms being used over all the reports (ie counts x2 if mentioned twice in the report)
  dtm <- TermDocumentMatrix(jeopCorpus)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)



Inspecting each aspect of the dtm is interesting. The matrix shows the number of times each term is found in each report (each report is a column in the matrix)



# Just show a snapshot of the data
knitr::kable(m[1:10,1:10])
1 2 3 4 5 6 7 8 9 10
activ 1 0 0 2 1 0 2 1 0 1
admix 0 1 0 1 0 0 1 1 1 1
appear 2 2 0 2 0 1 1 1 1 0
atypia 0 0 0 1 0 0 1 0 0 0
atypiabas 0 0 0 0 0 0 0 0 0 0
atypiahigh 0 0 0 0 0 0 0 0 0 0
atypiaintestin 0 0 0 0 0 0 0 0 0 0
atypianeith 0 0 0 0 0 0 0 0 0 0
atypiano 0 0 0 0 0 0 0 0 0 0
atypianumer 0 1 0 0 0 0 0 0 0 0


We can also see the summary of all the terms



knitr::kable(head(d))
word freq
dysplasia dysplasia 24432
consist consist 13946
appear appear 13886
there there 13273
present present 10720
malign malign 10586
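tm also provides a couple of convenience functions for querying the matrix directly; the thresholds below are arbitrary, just to show the idea:

#Terms that appear at least 1000 times across all the reports
findFreqTerms(dtm, lowfreq = 1000)

#Terms whose report-by-report counts correlate with "dysplasia" (correlation of at least 0.2)
findAssocs(dtm, "dysplasia", corlimit = 0.2)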



There are many things that can be done with such a matrix, and more will be written here at another point. Data visualisation is always fun, so to end for now here is a pretty word cloud made from the cleaned corpus.

library(wordcloud)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE,colors=brewer.pal(8, "Dark2"))

More advanced text analysis using Natural Language Processing

One more advanced way to analyse text is by using natural language processing. Various packages exist for this, but one of the most useful, because it shares the philosophy of tidy data, is tidytext. This uses many of the features of dplyr, which is so heavily used on this site, and also allows for the use of piping in the form of %>%.

For example, let’s say we have a lot of pathology reports and we want to see which words come up most often across them, and eventually to pull out all the reports that contain a reference to candida. We can start by tokenising the reports and counting the words:

library(tidytext)
library(ggplot2)
mytidy<-head(EndoHistoMerge,100)%>%
  unnest_tokens(word, Diagnoses) %>%
  anti_join(stop_words)%>%
  count(word, sort = TRUE) %>%
  filter(n > 30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

mytidy



This takes the EndoHistoMerge dataset and splits out the words (a process called tokenisation, which can also be done for sentences, n-grams and paragraphs) so that each word sits on its own row, and it also removes stop words such as “and”, “the” etc. The dataframe is now in tidy format around the words from the pathology report. This is great, as ggplot loves tidy-formatted data, so we can pipe straight into a graph.
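And to actually pull out the reports that mention candida, rather than just count words, a simple filter on the raw Diagnoses column does the job. A minimal sketch, assuming the Diagnoses column holds the free text as above:

library(dplyr)
library(stringr)

#Keep only the reports whose diagnosis text mentions candida
candidaReports<-EndoHistoMerge %>%
  filter(str_detect(Diagnoses, "[Cc]andida"))

nrow(candidaReports)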



Using NLP to determine commonly used phrases.

One particularly interesting use of NLP is to determine the commonly used phrases in a report. I find this useful for studying how doctors express their findings, and especially how negative findings (ie the lack of a diagnosis) are expressed. This matters because we may want to exclude ‘negative’ findings so that they don’t confuse word searches and cause over-estimates in our analysis:

We again start with the EndoHistoMerge dataset, but this time we are going to tokenise into n-grams. Note how we don’t exclude stop words here, because we are interested in phrases containing ‘no’, ‘neither’ and so on. We can then sift through the most common trigrams and, for our negation-detection algorithm, see what the most common negative phrasings are (see the sketch after the code):

library(tidyr)
ngramEndoHistoMerge<-EndoHistoMerge%>%
  unnest_tokens(trigram, Diagnoses, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)
# kable(head(ngramEndoHistoMerge,10))
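For example, to surface the negative phrasings we can keep just the trigrams that start with a negation word. The word list here is illustrative only:

#Trigrams beginning with a negation word - the word list is illustrative, not exhaustive
negTrigrams<-ngramEndoHistoMerge %>%
  filter(word1 %in% c("no", "neither", "not", "without"))

# kable(head(negTrigrams,10))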



So now of course we can graph it because that’s what we love to do:

trigrams_united <- ngramEndoHistoMerge %>%
  unite(trigram, word1, word2, word3, sep = " ")

ggplot(head(trigrams_united,20), aes(trigram, n)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x=element_text(angle = -90, hjust = 0))



Clustering algorithms using NLP

In fact text mining can get pretty interesting very quickly. Let’s say we want to be able to cluster endoscopy reports based on their content. Maybe we think that they will cluster according to endoscopist, or maybe we are interested to see if everyone reports the same type of disease similarly or differently.

To do this we need to take the document term matrix we created above and make it into a formal matrix, as follows. (Note that because the rows of a TermDocumentMatrix are terms, the clustering below groups terms; to cluster the reports themselves you would work on the transpose of the matrix.)

mywords<-head(EndoHistoMerge$Diagnoses,100)
jeopCorpus <- Corpus(VectorSource(mywords))
  #jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
  jeopCorpus <- tm_map(jeopCorpus, content_transformer(removeWords), stopwords("english"))
  jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
  jeopCorpus <- tm_map(jeopCorpus, stripWhitespace)
  jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
  jeopCorpus <- tm_map(jeopCorpus, stemDocument)
  
  
  #wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
  
  #Get the frequency table of terms being used over all the reports (ie counts x2 if mentioned twice in the report)
  dtm <- TermDocumentMatrix(jeopCorpus)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)

distMatrix <- dist(m, method="euclidean")


groups <- hclust(distMatrix,method="ward.D")
{plot(groups, cex=0.5, hang=-1)
rect.hclust(groups, k=15)}



So now we can see how specific terms are related to each other.
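If we want the actual cluster membership for each term, rather than just the dendrogram, cutree() gives it to us. A small sketch, assuming “dysplasia” is one of the stemmed terms in the matrix:

#Assign each term to one of the 15 clusters drawn above
termClusters<-cutree(groups, k=15)

#Terms that ended up in the same cluster as "dysplasia"
names(termClusters[termClusters==termClusters["dysplasia"]])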

Topic modelling

The difference between topic modelling and text clustering is subtle. In clustering you are deciding how to group documents based on how similar they are, commonly using a weighting of the words such as tf-idf (term frequency-inverse document frequency). In topic modelling you are representing a document as a mixture of the topics in it, so topic modelling returns a list of topics within each document, whereas clustering returns the groups that documents belong to. You can use topic modelling to do clustering as well.
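As an aside, tf-idf itself is easy to calculate in the tidytext framework with bind_tf_idf(). A minimal sketch, using the row number as a stand-in document id:

library(dplyr)
library(tidytext)

#Count words per report, using the row number as a stand-in document id
word_counts<-head(EndoHistoMerge,100) %>%
  mutate(report_id = row_number()) %>%
  unnest_tokens(word, Diagnoses) %>%
  count(report_id, word, sort = TRUE)

#Weight each word by tf-idf: words frequent in one report but rare overall score highest
word_tfidf<-word_counts %>%
  bind_tf_idf(word, report_id, n)

head(arrange(word_tfidf, desc(tf_idf)))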



Topic modelling is often done with Latent Dirichlet Allocation (LDA). To quote another source, “LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document”.

We can use it to address questions such as: how can I decide who wrote a given set of reports? Or, what is the combined method of reporting for a certain illness, and what is the variation in that reporting?

We will use the topicmodels package to run the LDA() function. This takes a document term matrix as its input.


library(topicmodels)
#Create the document term matrix
mywords<-head(EndoHistoMerge$Diagnoses,100)
jeopCorpus <- Corpus(VectorSource(mywords))
  #jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
  jeopCorpus <- tm_map(jeopCorpus, content_transformer(removeWords), stopwords("english"))
  jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
  jeopCorpus <- tm_map(jeopCorpus, stripWhitespace)
  jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
  jeopCorpus <- tm_map(jeopCorpus, stemDocument)
  
  
  #wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
  
  #Get the frequency table of terms being used over all the reports (ie counts x2 if mentioned twice in the report)
  dtm <- t(TermDocumentMatrix(jeopCorpus))
  m <- as.matrix(dtm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
ap_lda <- LDA(dtm, k = 2, control = list(seed = 1234))

This then allows us to determine which words are more or less likely to belong to each topic. The beta matrix gives the per-topic probability of each term, so you can see how the topics are similar or dissimilar.

library(tidytext)
head(ap_topics<-tidy(ap_lda, matrix = "beta"),10)
## # A tibble: 10 × 3
##    topic           term         beta
##    <int>          <chr>        <dbl>
## 1      1          activ 0.0305416188
## 2      2          activ 0.0138672701
## 3      1          admix 0.0073713903
## 4      2          admix 0.0108925437
## 5      1         appear 0.0425472213
## 6      2         appear 0.0225511315
## 7      1         atypia 0.0041407381
## 8      2         atypia 0.0007238893
## 9      1 atypiaintestin 0.0003766227
## 10     2 atypiaintestin 0.0008412766


But because we love to visualise things, we will plot this out:

library(ggplot2)
library(dplyr)

ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()


Of course we have to remember this is heavily processed data, but the LDA function has defined two topics and the graph above shows how they have been characterised.
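To see which individual reports lean towards which topic, the per-document topic probabilities come from the gamma matrix, which tidytext can also extract:

#Per-document topic proportions (gamma): the estimated share of each report generated from each topic
ap_documents<-tidy(ap_lda, matrix = "gamma")
head(ap_documents, 10)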

Another method is to look at the greatest differences between topics. The ratio of the two topics’ betas is log-transformed in the mutate step so that the differences are symmetrical in either direction.

library(tidyr)

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread %>%
   mutate(term = reorder(term, log_ratio)) %>%
    filter(topic1 > .01 | topic2 > .01) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col(show.legend = FALSE)  +
  coord_flip()

I will also discuss the other important aspects of text mining, namely text classification… when I get round to it.