How to group

Its fairly common to need to group data by an aspect of the date. For example, how many patients underwent a colonoscopy this month, or week etc. To do this we need a date object as part of the data as usual. As long as we can extract the aspect we want to group by this should be a breeze with dplyr

The problem

How do I find out how many endoscopies were done by month for the past calendar year, by endoscopy type. We will use some data that is already created. If you want to know how it was created you can check out this page…..

#Create the data

#Neaten up the names
#Lets just select the columns relevant to this page
###The resulting data looks like this:
EndoHospNumId Date.x
U234252 2013-06-18
U234252 2013-06-18
P029834 2015-10-19
J322909 2014-08-30
J322909 2014-08-30

Using lubridate, we can extract the month very simply as follows: This can then be incorporated into dplyr


kable(GroupDatesExample %>% group_by(month=month(Date.x)) %>% summarise(Number=n()))%>%
  kable_styling(bootstrap_options = "striped", full_width = F)
month Number
1 1454
2 1472
3 1612
4 1577
5 1250
6 1067
7 1237
8 1312
9 1126
10 1296
11 1104
12 1226

Breaking this down, we are using select() to get the columns we are interested in, then we use group_by() to group according to the two aspects we are interested in. We then use summarise to count each of the groups. The output is then as expected.

Simple numbers per year

What if you simply want to plot out the number of procedures done by year. You don’t have to use lubrudate to do this, you can do this is base RThat needs you to extract the year from the date and then summarise as follows:

Tots<-GroupDatesExample %>%
  mutate(year = format(Date.x, "%Y")) %>%
  summarise(n = n())

year n
2013 3662
2014 3552
2015 3457
2016 3699
2017 1363

Get difference between two dates in consecutive rows

Often you need to know the time between consecutive tests for a patient. This is done using the difftime() function. Not we use the following functions a lot in the surveillance page so these are worth understanding:

DateBetween<-GroupDatesExample %>% arrange(EndoHospNumId, Date.x) %>% group_by(EndoHospNumId) %>%
  mutate(diffDate = difftime(Date.x, lag(Date.x,1),units="weeks"))

EndoHospNumId Date.x diffDate
D0739033 2013-01-10 NA
D0739033 2013-01-10 0.0000000 weeks
D0739033 2013-01-10 0.0000000 weeks
D0739033 2013-01-10 0.0000000 weeks
D0739033 2013-01-12 0.2857143 weeks
D0739033 2013-01-12 0.0000000 weeks
D0739033 2013-01-12 0.0000000 weeks
D0739033 2013-01-13 0.1428571 weeks
D0739033 2013-01-19 0.8571429 weeks
D0739033 2013-01-19 0.0000000 weeks

Get the first date or the last date in a group

It may also be that you just need to know the first or last date in the tests for a patient, again using dplyr and the slice() function:

#To get the first
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(1)
#To get the last
GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(n())
#To get the first and the last
kable(head(GroupDatesExample %>% arrange(Date.x) %>% group_by(EndoHospNumId) %>% slice(c(1,n())),10))
EndoHospNumId Date.x
D0739033 2013-01-10
D0739033 2017-04-28
F630230 2013-01-02
F630230 2017-04-24
G244224 2013-01-06
G244224 2017-04-29
I927282 2013-01-01
I927282 2017-04-26
J322909 2013-01-03
J322909 2017-04-27

Selecting rows by date position based on a conditional

There are many occasions when simply grouping by dates is not sufficient for what you need. Perhaps you want to order the number of investigations that a patient has had by date so that you are ordering the dates once the grouping by hospital number has already been done, or perhaps you need to know the time difference between one test and another for a particular patient

As always, dplyr has a solution for this: Let’s use a new data set just to make things more interesting:

#Generate some sample data:

proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
HospNum_Id<-sample(c("P433224","P633443","K522332","G244224","S553322","D0739033","U873352","P223333","Y763634","I927282","P223311","P029834","U22415","U234252","S141141","O349253","T622722","J322909","F630230","T432452"), 100, replace = TRUE)

So now we group the data according to patient number:

Upstage<-df %>%
  group_by(HospNum_Id) %>%
#Only show the first 25 samples
proc dat HospNum_Id
Biopsies 2014-07-29 D0739033
EMR 2015-05-05 D0739033
Biopsies 2013-03-20 F630230
EMR 2013-08-07 F630230
RFA 2013-08-15 F630230
RFA 2013-12-04 F630230
EMR 2013-12-31 F630230
RFA 2014-04-03 F630230
RFA 2016-04-01 F630230
RFA 2016-08-08 F630230
EMR 2015-09-25 G244224
Biopsies 2015-10-03 G244224
Biopsies 2015-10-10 G244224
EMR 2016-08-23 G244224
RFA 2015-04-03 I927282
Biopsies 2016-05-02 I927282
EMR 2016-07-27 I927282
RFA 2016-08-05 I927282
EMR 2016-10-21 I927282
RFA 2016-11-12 I927282
Biopsies 2017-01-30 I927282
EMR 2017-05-01 I927282
EMR 2013-09-15 J322909
Biopsies 2013-12-15 J322909
EMR 2014-08-23 J322909

But actually we want only those patients who have had and EMR followed by RFA. lead() means the leading row ie the row that leads to the next row (which should contain RFA in the proc column).

Upstage<-df %>%
mutate(ind = proc=="RFA" & lead(proc)=="EMR") %>%

#Only show the first 25 samples
proc dat HospNum_Id ind
Biopsies 2014-07-29 D0739033 FALSE
EMR 2015-05-05 D0739033 FALSE
Biopsies 2013-03-20 F630230 FALSE
EMR 2013-08-07 F630230 FALSE
RFA 2013-08-15 F630230 FALSE
RFA 2013-12-04 F630230 NA
EMR 2013-12-31 F630230 FALSE
RFA 2014-04-03 F630230 FALSE
RFA 2016-04-01 F630230 TRUE
RFA 2016-08-08 F630230 FALSE
EMR 2015-09-25 G244224 FALSE
Biopsies 2015-10-03 G244224 FALSE
Biopsies 2015-10-10 G244224 FALSE
EMR 2016-08-23 G244224 FALSE
RFA 2015-04-03 I927282 FALSE
Biopsies 2016-05-02 I927282 FALSE
EMR 2016-07-27 I927282 FALSE
RFA 2016-08-05 I927282 NA
EMR 2016-10-21 I927282 FALSE
RFA 2016-11-12 I927282 TRUE
Biopsies 2017-01-30 I927282 FALSE
EMR 2017-05-01 I927282 FALSE
EMR 2013-09-15 J322909 FALSE
Biopsies 2013-12-15 J322909 FALSE
EMR 2014-08-23 J322909 FALSE

But that simply gives those values where consecutive rows show EMR and then RFA but we want it for patients so we have to do something a little more complex. We use the fact that the mutate column is boolean (so gives us a TRUE or FALSE return value) and we ask to return only those hospital numbers where all those values where this is true (and we also sort it).

Upstage<-df %>%
 mutate(ind = proc=="EMR" & lead(proc)=="RFA") %>%
proc dat HospNum_Id ind
EMR 2013-08-07 F630230 TRUE
RFA 2013-12-04 F630230 FALSE
EMR 2013-12-31 F630230 TRUE
RFA 2014-04-03 F630230 FALSE
RFA 2016-08-05 I927282 FALSE
EMR 2017-05-01 I927282 TRUE
EMR 2013-09-15 J322909 TRUE
EMR 2014-08-23 J322909 TRUE
RFA 2015-03-20 J322909 FALSE
RFA 2015-03-26 J322909 FALSE
RFA 2013-11-03 K522332 FALSE
EMR 2014-10-24 K522332 TRUE
RFA 2015-02-04 O349253 FALSE
EMR 2017-04-05 O349253 TRUE
EMR 2016-12-27 S141141 TRUE
RFA 2017-01-06 S141141 FALSE
EMR 2014-08-12 T622722 TRUE
EMR 2015-04-11 T622722 TRUE
RFA 2015-04-26 T622722 FALSE
RFA 2015-10-26 T622722 FALSE
EMR 2014-01-25 U22415 TRUE
RFA 2017-02-10 U22415 FALSE