There are many occasions when a column of data needs to be created from an already existing column for ease of data manipulation. For example, perhaps you have a body of text as a pathology report and you want to extract all the reports where the diagnosis is dysplasia. You could just subset the data using grepl so that you only get the reports that mention this word…but what if the data needs to be cleaned prior to subsetting like excluding reports where the diagnosis is normal but the phrase ‘No evidence of dysplasia’ is present. Or perhaps there are other manipulations needed prior to subsetting.

This is where data accordionisation is useful. This simply means the creation of data from (usually) a column into another column in the same dataframe.

The neatest way to do this is with the mutate function from the ‘dplyr’ package which is devoted to this. There are also other ways which I will demonstrate at the end.



The input data here will be an endoscopy data set:

Age<-sample(1:100, 130, replace=TRUE)
Dx<-sample(c("NDBE","LGD","HGD","IMC"), 130, replace = TRUE)
TimeOfEndoscopy<-sample(1:60, 130, replace=TRUE)


EMRdf<-data.frame(Age,Dx,TimeOfEndoscopy,stringsAsFactors=F)

Perhaps you need to calculate the number of hours spent doing each endoscopy rather than the number of minutes

EMRdftbb<-EMRdf%>%mutate(TimeOfEndoscopy/60)
#Just show the top 20 results
kable(head(EMRdftbb,20))
Age Dx TimeOfEndoscopy TimeOfEndoscopy/60
30 IMC 42 0.7000000
75 LGD 7 0.1166667
71 LGD 51 0.8500000
55 HGD 28 0.4666667
67 HGD 8 0.1333333
82 NDBE 34 0.5666667
19 IMC 52 0.8666667
20 LGD 51 0.8500000
31 HGD 6 0.1000000
98 NDBE 46 0.7666667
8 HGD 54 0.9000000
99 NDBE 33 0.5500000
45 LGD 49 0.8166667
21 LGD 59 0.9833333
24 LGD 8 0.1333333
50 HGD 56 0.9333333
15 NDBE 30 0.5000000
60 LGD 39 0.6500000
22 LGD 1 0.0166667
100 NDBE 60 1.0000000

That is useful but what if you want to classify the amount of time spent doing each endoscopy as follows: <0.4 hours is too little time and >0.4 hours is too long.



Using ifelse() with mutate for conditional accordionisation

For this we would use ifelse(). However this can be combined with mutate() so that the result gets put in another column as follows

EMRdf2<-EMRdf%>%mutate(TimeInHours=TimeOfEndoscopy/60)%>%mutate(TimeClassification = ifelse(TimeInHours>0.4, "Too Long", "Too Short"))
#Just show the top 20 results
kable(head(EMRdf2,20))
Age Dx TimeOfEndoscopy TimeInHours TimeClassification
30 IMC 42 0.7000000 Too Long
75 LGD 7 0.1166667 Too Short
71 LGD 51 0.8500000 Too Long
55 HGD 28 0.4666667 Too Long
67 HGD 8 0.1333333 Too Short
82 NDBE 34 0.5666667 Too Long
19 IMC 52 0.8666667 Too Long
20 LGD 51 0.8500000 Too Long
31 HGD 6 0.1000000 Too Short
98 NDBE 46 0.7666667 Too Long
8 HGD 54 0.9000000 Too Long
99 NDBE 33 0.5500000 Too Long
45 LGD 49 0.8166667 Too Long
21 LGD 59 0.9833333 Too Long
24 LGD 8 0.1333333 Too Short
50 HGD 56 0.9333333 Too Long
15 NDBE 30 0.5000000 Too Long
60 LGD 39 0.6500000 Too Long
22 LGD 1 0.0166667 Too Short
100 NDBE 60 1.0000000 Too Long

Note how we can chain the mutate() function together.



Using multiple ifelse()

What if we want to get more complex and put several classifiers in? We just use more ifelse’s:

EMRdf2<-EMRdf%>%mutate(TimeInHours=TimeOfEndoscopy/60)%>%mutate(TimeClassification = ifelse(TimeInHours>0.8, "Too Long", ifelse(TimeInHours<0.5,"Too Short",ifelse(TimeInHours>=0.5&TimeInHours<=0.8,"Just Right","N"))))
#Just show the top 20 results
kable(head(EMRdf2,20))
Age Dx TimeOfEndoscopy TimeInHours TimeClassification
30 IMC 42 0.7000000 Just Right
75 LGD 7 0.1166667 Too Short
71 LGD 51 0.8500000 Too Long
55 HGD 28 0.4666667 Too Short
67 HGD 8 0.1333333 Too Short
82 NDBE 34 0.5666667 Just Right
19 IMC 52 0.8666667 Too Long
20 LGD 51 0.8500000 Too Long
31 HGD 6 0.1000000 Too Short
98 NDBE 46 0.7666667 Just Right
8 HGD 54 0.9000000 Too Long
99 NDBE 33 0.5500000 Just Right
45 LGD 49 0.8166667 Too Long
21 LGD 59 0.9833333 Too Long
24 LGD 8 0.1333333 Too Short
50 HGD 56 0.9333333 Too Long
15 NDBE 30 0.5000000 Just Right
60 LGD 39 0.6500000 Just Right
22 LGD 1 0.0166667 Too Short
100 NDBE 60 1.0000000 Too Long



Using multiple ifelse() with grepl() or string_extract

Of course we need to extract information from text as well as numeric data. We can do this using grepl or string_extract from the library(stringr). We have used this before here so you may want to refresh yourself.

Let’s say we want to extract all the samples that had IMC. We don’t want to subset the data, just extract IMC into a column that says IMC and the rest say ’Non-IMC

Using the dataset above:

library(stringr)
EMRdf$MyIMC_Column<-str_extract(EMRdf$Dx,"IMC")

#to fill the NA's we would do:
EMRdf$MyIMC_Column<-ifelse(grepl("IMC",EMRdf$Dx),"IMC","NoIMC")
#Another way to do this (really should be for more complex examples when you want to extract the entire contents of the cell that has the match)

EMRdf$MyIMC_Column<-ifelse(grepl("IMC",EMRdf$Dx),str_extract(EMRdf$Dx,"IMC"),"NoIMC")

So data can be usefully created from data for further analysis