## Introduction

Text extraction is a fundamental part of data analysis in gastroenterology. Most of the data we deal with is either unstructured or semi-structured text. Semi-structured text means that there is some standardised organisation to the text, but the user has essentially free rein to fill in free text under each heading.

So how do we deal with this? R has many possible solutions, but we have to devise one over-arching solution to deal with all eventualities.

The main situations to deal with are as follows:

1. Extracting diagnoses eg. ‘Dysplasia is seen’

2. Excluding text where the diagnosis is normal but is still mentioned eg. ‘There is no dysplasia…’

3. Dealing with text where there are subclauses, eg. ‘There is no obvious dysplasia, but there may be some at the sample margin’

This is tricky. The methodology I use is as follows:

1. Order the text into sentences with defined borders (eg. ensure that each sentence ends with a full stop or a newline).

2. Extract logical sections of the text into their own columns.

3. Clean the text (eg. get rid of unnecessary characters).

4. Negation extraction (a small taste of this is sketched below).

5. Extraction of terms of interest into their own column.

Of the steps above, point 2 is the most difficult to make generic between reports, especially from different hospitals, and point 3 is the most difficult for all units. But where there’s a will…
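As a small taste of step 4, negation detection at its simplest can be done with a regular expression that flags negated sentences so they can be excluded. This is just a minimal sketch; the sentences and the pattern here are made up for illustration:

#A minimal sketch of negation exclusion (sentences and pattern made up for illustration)
sentences<-c("Dysplasia is seen", "There is no dysplasia")
#Flag sentences where the finding is negated so they can be excluded
negated<-grepl("no (evidence of )?dysplasia|neither.*dysplasia", sentences, ignore.case = TRUE)
sentences[!negated]
## [1] "Dysplasia is seen"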

This page will describe the use of gsub as a way of ordering and cleaning the text, before moving on to more advanced text mining and natural language processing.

## The problem: How do I clean my text?

Let’s create a report, let’s say from pathology:

pathRep<-c("CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.Three pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip.  HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.")

At the moment it is fairly unstructured, so let’s create some borders. The structure looks as though each section starts with a header in capitals and ends with a full stop, so let’s replace each full stop with a newline first to see things more clearly:

### Using gsub to find and replace

pathRep<-gsub("\\.","\n",pathRep)
pathRep
## [1] "CLINICAL DETAILSOesophageal stricture\nMACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5\nNature of specimen not stated on pot\nThree pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip\n  HISTOLOGYThe appearances are suggestive of Barretts oesophagus\nNeither dysplasia nor malignancy is seen\n DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation\n- Squamo-columnar mucosa with intestinal metaplasia\n- Suggestive of Barretts oesophagus\n- No evidence of dysplasia or malignancy\n"

So gsub is a find-and-replace function in R. At least now we can see what the report looks like more clearly, so let’s clean it up a little more. We want to extract as much as possible here, so firstly let’s extract sections according to their logical separation. It looks as though the headers are in capitals, so let’s separate those into their own columns. We are going to use the stringr library to do this, which allows us to use regular expressions: flexible ways of matching text patterns (more information is at http://www.regular-expressions.info). Because we have used a newline, we could define just the start border of each section and match everything up to the newline; in the example below, though, we start again from the raw report and use the capitalised headers themselves as the borders.
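As an aside, gsub also understands backreferences, which lets us put a newline between each run-of-capitals header and the word fused onto it. A quick sketch; the {4,} threshold is an assumption to avoid splitting ordinary capitalised words:

#Put a newline between each capitalised header and the text fused onto it
#The {4,} threshold is an assumption so that ordinary capitalised words are left alone
gsub("([A-Z]{4,})([A-Z][a-z])", "\\1\n\\2", pathRep)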

### Using str_extract to separate text of interest into columns

#Lets start again:
pathRep<-c("CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.Three pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip.  HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.")

library(stringr)
ClinDet<-str_extract(pathRep,"CLINICAL DETAILS.*MACROSCOPICAL DESCRIPTION")
MacDet<-str_extract(pathRep,"MACROSCOPICAL DESCRIPTION.*HISTOLOGY")
HistolDet<-str_extract(pathRep,"HISTOLOGY.*DIAGNOSIS")
DiagDet<-str_extract(pathRep,"DIAGNOSIS.*")
myPath<-data.frame(ClinDet,MacDet,HistolDet,DiagDet,stringsAsFactors=F)
myPath
##                                                           ClinDet
## 1 CLINICAL DETAILSOesophageal stricture.MACROSCOPICAL DESCRIPTION
##                                                                                                                                                                                                                                                                        MacDet
## 1 MACROSCOPICAL DESCRIPTIONNature of specimen as stated on request form = Lower oesophagus x5.Nature of specimen not stated on pot.Three pieces of tissue, the largest measuring 1 x 2 x 2 mm and the smallest 1x 1 x 1 mm, received on a pointed cellulose strip.  HISTOLOGY
##                                                                                                            HistolDet
## 1 HISTOLOGYThe appearances are suggestive of Barretts oesophagus.Neither dysplasia nor malignancy is seen. DIAGNOSIS
##                                                                                                                                                                                                         DiagDet
## 1 DIAGNOSISOesophagus, biopsies_ - Ulceration and chronic active inflammation.- Squamo-columnar mucosa with intestinal metaplasia.- Suggestive of Barretts oesophagus.- No evidence of dysplasia or malignancy.
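Each column still carries its flanking headers, so if we want we can strip those out with gsub. A quick sketch for the clinical details column:

#Strip the flanking headers out of the clinical details column
myPath$ClinDet<-gsub("CLINICAL DETAILS|MACROSCOPICAL DESCRIPTION","",myPath$ClinDet)
myPath$ClinDet
## [1] "Oesophageal stricture."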

### Using stringr and gsub to get the size and number of biopsies

OK, that’s quite useful. What if we want to do more? Now I want to be able to calculate the size of each of the biopsy specimens, or at least get the size of the largest biopsy taken here. I also want to extract the number of biopsies taken as a number. Again we can use gsub and stringr.

For the size of the biopsy specimen we can use the following:

#Lets extract just the largest biopsy
myPath$BxSize<-str_extract(myPath$MacDet, "the largest.*?mm")
myPath$BxSize
## [1] "the largest measuring 1 x 2 x 2 mm"
#Lets get rid of some text we don't need
myPath$BxSize<-gsub("the largest measuring ","",myPath$BxSize)
myPath$BxSize<-gsub("mm","",myPath$BxSize)
myPath$BxSize<-gsub("less than","",myPath$BxSize)
myPath$BxSize
## [1] "1 x 2 x 2 "
#Now lets use the stringr library to match the three dimensions and multiply them together
biopsyDims<-str_match(myPath$BxSize, "([0-9]+).*?([0-9]+).*?([0-9]+)")
myPath$BxSize<-as.numeric(biopsyDims[, 2])*as.numeric(biopsyDims[, 3])*as.numeric(biopsyDims[, 4])
myPath$BxSize
## [1] 4
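For the number of biopsies, the count is written out as a word (‘Three pieces of tissue’), so one approach is to pull the word out and map it to a digit. A minimal sketch; the numberMap lookup is a made-up helper for illustration:

#Pull out the written count and map it to a digit (numberMap is a made-up lookup for illustration)
myPath$BxNum<-str_extract(myPath$MacDet,"[A-Za-z]+ pieces? of tissue")
myPath$BxNum<-gsub(" pieces? of tissue","",myPath$BxNum)
numberMap<-c("one"=1,"two"=2,"three"=3,"four"=4,"five"=5,"six"=6)
myPath$BxNum<-as.numeric(numberMap[tolower(myPath$BxNum)])
myPath$BxNum
## [1] 3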
### Building a document term matrix and a word cloud

Moving on from a single report, we can analyse a whole set of them at once. Here we use the tm package to build a corpus from the diagnosis text, clean it up and convert it into a document term matrix:

library(tm)
jeopCorpus <- Corpus(VectorSource(EndoHistoMergeDiagnoses))
#jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, content_transformer(removeWords), stopwords("english"))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, stripWhitespace)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)

#wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

#Get the frequency table of terms being used over all the reports (ie counts x2 if mentioned twice in the report)
dtm <- TermDocumentMatrix(jeopCorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

Inspecting each aspect of the dtm is interesting. The matrix shows the number of times each term is found in each report (each report is a column in the matrix):

library(knitr)
#Just show a snapshot of the data
kable(m[1:10,1:10])

|               |  1|  2|  3|  4|  5|  6|  7|  8|  9| 10|
|:--------------|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
|activ          |  1|  0|  0|  2|  1|  0|  2|  1|  0|  1|
|admix          |  0|  1|  0|  1|  0|  0|  1|  1|  1|  1|
|appear         |  2|  2|  0|  2|  0|  1|  1|  1|  1|  0|
|atypia         |  0|  0|  0|  1|  0|  0|  1|  0|  0|  0|
|atypiabas      |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|atypiahigh     |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|atypiaintestin |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|atypianeith    |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|atypiano       |  0|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|atypianumer    |  0|  1|  0|  0|  0|  0|  0|  0|  0|  0|

We can also see a summary of all the terms:

kable(head(d))

|          |word      |  freq|
|:---------|:---------|-----:|
|dysplasia |dysplasia | 24432|
|consist   |consist   | 13946|
|appear    |appear    | 13886|
|there     |there     | 13273|
|present   |present   | 10720|
|malign    |malign    | 10586|

There are many things that can be done with such a matrix, and more will be written here at another point. Data visualisation is always fun, so to end for now here is a pretty word cloud made out of the document term matrix:

library(wordcloud)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE, colors=brewer.pal(8, "Dark2"))

### More advanced text analysis using Natural Language Processing

One more advanced way to analyse text is by using natural language processing. Various packages exist for this, but one of the most useful, because it shares the philosophy of tidy data, is tidytext. This uses many of the features of dplyr, which is so heavily used on this site, and also allows for the use of piping in the form of %>%.

For example, let’s say we have a lot of pathology reports and we want to see which words are used in them (perhaps on the way to finding all the reports that contain a reference to candida). We can do the following:

library(tidytext)
library(dplyr)
library(ggplot2)
mytidy<-head(EndoHistoMerge,100)%>%
  unnest_tokens(word, Diagnoses) %>%
  anti_join(stop_words)%>%
  count(word, sort = TRUE) %>%
  filter(n > 30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
mytidy

This takes the EndoHistoMerge dataset and splits out the words (a process called tokenisation, which can also be done for sentences, n-grams and paragraphs) so that one word is on one row, and it also removes stop words such as “and”, “the” etc. The dataframe is now in tidy format around the words from the pathology report. This is great, as ggplot loves tidy formatted data, so we can pipe straight into a graph.

### Using NLP to determine commonly used phrases

One particularly interesting use of NLP is to determine the commonly used phrases in a report. I find this useful to study how doctors might express their findings, but particularly how negative findings (ie the lack of a diagnosis) are expressed.
This is particularly important, as we may want to exclude ‘negative’ findings so that they don’t confuse word searches and cause over-estimates in our analysis.

We again start with the EndoHistoMerge dataset, but this time we are going to tokenise into n-grams (here trigrams, ie three-word phrases). Note how we don’t exclude stop words here, because we are interested in sentences that contain ‘no’, ‘neither’ etc. The top 10 are shown here, so we can sift through them and see what the most common negatives are for our negation-detection algorithm:

library(tidyr)
ngramEndoHistoMerge<-EndoHistoMerge%>%
  unnest_tokens(trigram, Diagnoses, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)

kable(head(ngramEndoHistoMerge,10))

So now of course we can graph it, because that’s what we love to do:

trigrams_united <- ngramEndoHistoMerge %>%
  unite(trigram, word1, word2, word3, sep = " ")

ggplot(head(trigrams_united,20), aes(trigram, n)) +
  geom_histogram(stat="identity",show.legend = FALSE) +
  theme(axis.text.x=element_text(angle = -90, hjust = 0))

### Clustering algorithms using NLP

In fact, text mining can get pretty interesting very quickly. Let’s say we want to be able to cluster endoscopy reports based on their content. Maybe we think that they will cluster according to endoscopist, or maybe we are interested to see whether everyone reports the same type of disease similarly or differently. To do this we need to take the document term matrix we created above and make it into a formal matrix, as follows:

mywords<-head(EndoHistoMergeDiagnoses,100)
jeopCorpus <- Corpus(VectorSource(mywords))
#jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, content_transformer(removeWords), stopwords("english"))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, stripWhitespace)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)

#wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

#Get the frequency table of terms being used over all the reports (ie counts x2 if mentioned twice in the report)
dtm <- TermDocumentMatrix(jeopCorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

distMatrix <- dist(m, method="euclidean")

groups <- hclust(distMatrix,method="ward.D")
{plot(groups, cex=0.5, hang=-1)
rect.hclust(groups, k=15)}

So now we can see how specific terms are related to each other.
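If we want the actual group membership for each term, we can cut the tree into the same number of groups with base R’s cutree. A quick sketch:

#Get the cluster membership of each term by cutting the tree into the same 15 groups
termGroups<-cutree(groups, k=15)
head(termGroups)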

### Topic modelling

The difference between topic modelling and text clustering is subtle. In clustering you are deciding how to group documents based on how similar they are, which rests on the weighting of the words, itself often based on tf-idf (term frequency-inverse document frequency). In topic modelling you are representing a document as a function of the topics in it, so topic modelling returns a list of topics within a document, whereas clustering returns the groups that documents belong to. You can use topic modelling to do clustering as well.
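As an aside, the tm package can produce such a tf-idf weighted matrix directly. A minimal sketch, reusing the jeopCorpus built above:

#A tf-idf weighted document term matrix, reusing the corpus built above
dtmTfIdf<-DocumentTermMatrix(jeopCorpus, control=list(weighting=weightTfIdf))
inspect(dtmTfIdf[1:5, 1:5])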

Topic modelling is often done with Latent Dirichlet Allocation (LDA). To quote another source, “LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document”.

We can use it to solve the following kinds of problem: how can I decide who wrote a given report, or what is the combined method of reporting for a certain illness and what is the variation in that reporting?

We will use the topicmodels package to run the LDA() function. This takes a document term matrix as its input.

library(topicmodels)
#Create the document term matrix
jeopCorpus <- Corpus(VectorSource(mywords))
#jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, content_transformer(removeWords), stopwords("english"))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, stripWhitespace)
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)

#wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)

#LDA needs documents as rows, so transpose the term document matrix into a document term matrix
dtm <- t(TermDocumentMatrix(jeopCorpus))
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
ap_lda <- LDA(dtm, k = 2, control = list(seed = 1234))

This then allows us to determine which words are more or less likely to belong to each topic (the per-topic word probabilities, known as the beta matrix), so you can see how the topics are similar or dissimilar.

library(tidytext)
head(ap_topics<-tidy(ap_lda, matrix = "beta"),10)
## # A tibble: 10 × 3
##    topic           term         beta
##    <int>          <chr>        <dbl>
## 1      1          activ 0.0305416188
## 2      2          activ 0.0138672701
## 5      1         appear 0.0425472213
## 6      2         appear 0.0225511315
## 7      1         atypia 0.0041407381
## 8      2         atypia 0.0007238893
## 9      1 atypiaintestin 0.0003766227
## 10     2 atypiaintestin 0.0008412766

But because we love to visualise things, we will plot the top terms for each topic:

library(ggplot2)
library(dplyr)

ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Of course we have to remember that this is heavily processed text, but the LDA function has defined two document types and the graph above shows how they have been characterised.
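The beta matrix gives the per-topic word probabilities; the other side of LDA is the gamma matrix, the per-document topic probabilities, which shows how much of each report is estimated to come from each topic. A quick sketch:

#Per-document topic probabilities (the gamma matrix)
ap_documents<-tidy(ap_lda, matrix = "gamma")
head(ap_documents)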

Another method is to look at the greatest differences between topics. To do this we spread the per-topic word probabilities into one column per topic and take the ratio between them; the ratio is logged (log2) to make the result symmetrical around zero. The code below is a minimal sketch of this, following the standard tidytext approach:

library(tidyr)

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

#Plot the terms with the greatest difference between the two topics
beta_spread %>%
  top_n(20, abs(log_ratio)) %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col() +
  coord_flip()