An introduction to, an example of, and my (lazy) journey to discovering the `funneljoin` package. [10 min read]
A while back, I was helping[^1] out on one of our new projects: an expansion of CHARTWatch to new units. To learn a bit more about CHARTWatch, you should read (Verma et al. 2021), which explains how to get models into the clinical environment, and (Pou-Prom et al., n.d.) on all the technical parts required.
We were doing some data exploration for the project and we had a fairly straightforward question to answer: What is the first event after a procedure for a patient?
With that question, I dove into the data files we had to start getting a feel for what to do[^2]. I had a few files to work with, lots of timestamps all over the place, and I knew I had some sort of `join`-type problem. I tried—and failed with—many variations of joins (e.g., `fuzzyjoin`) and even tried to get what was needed in more manual[^3] ways. One thought kept ringing in my head: “someone must have solved this problem.” So at that point I spent a few half days[^4] searching. Many fruitless paths later, and on the verge of falling back to the more manual[^5] approaches, I finally found what I was looking for: `funneljoin`.
Though `funneljoin` was what I was looking for, my use of it was also a journey through multiple mistakes and misunderstandings, all of which are my own[^6]. But I eventually got exactly what I wanted: a straightforward way to join some data and find very specific events that occur after other events.
Let’s start out by taking a look at the data. That’ll give us a better sense of the problem and the behavior we’re expecting. We’re going to be working with a tiny example of what the real data could look like. These fake data were created from real data using `uuid`, `dplyr::group_by`, `lubridate`, and some good old-fashioned randomization.
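As a rough illustration of the fuzzing idea (this is a sketch, not the actual code we used, and the shift range is made up):

```r
set.seed(42)  # reproducible sketch

real_ts <- as.POSIXct("2022-06-22 10:27:06", tz = "UTC")

# Shift the timestamp by a random (large) number of whole days, so the
# relative gaps between a patient's events survive but the dates are fake
shift_days <- sample(20000:25000, 1)
fuzzed_ts <- real_ts + shift_days * 86400  # 86400 seconds per day

fuzzed_ts > real_ts  # TRUE: ordering is preserved, dates are far in the future
```

In a real pipeline you’d apply the same shift to every event within an encounter (that’s where `dplyr::group_by` comes in) so the within-encounter gaps stay intact.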
We have two data files which look a lot like our real data:

- `ALL_ADT_EVENTS.csv`: contains all the Admit-Discharge-Transfer (ADT) events for patients while they are in the hospital, and
- `SPECIFIC_PROCEDURES.csv`: contains a very specific set of procedures performed while in the hospital.
It’s very worth noting that all the events in `SPECIFIC_PROCEDURES` are also in `ALL_ADT_EVENTS`. They are kept separate for a few reasons, including (but certainly not limited to): it’s easier to work with a file containing only the procedures, and things like procedures can be (and are) pulled from separate pipelines more specific than ADT pipelines.
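One handy way to convince yourself of that subset relationship is `dplyr::anti_join()`, which returns the rows of its first argument that have no match in its second. A minimal sketch with made-up miniature data (not our real files):

```r
library(dplyr)

# Hypothetical miniature stand-ins for the two files
all_events <- data.frame(id = c("a", "a", "b"), ts = c(1, 2, 3))
procedures <- data.frame(id = "a", ts = 2)

# Rows of procedures with no matching row in all_events; zero rows
# means every procedure really is also an event in the events file
leftovers <- anti_join(procedures, all_events, by = c("id", "ts"))
nrow(leftovers)  # 0
```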
Let’s now look at some of the ADT file and a few (preselected) rows that highlight these data. And while we’re at it, we’ll see the code, too!
```r
library(here)       ## for here::here() and referencing files from this .RProj
library(dplyr)      ## for some processing and those fancy pipes
library(kableExtra) ## for some extra fancy looking tables
library(rmarkdown)  ## for some even more extra fancy tables

ADT_FILE_PATH <- here::here("_posts", "2022-08-12-putting-the-fun-in-funneljoin",
                            "ALL_ADT_EVENTS.csv")

ADT_EVENTS <- read.csv(ADT_FILE_PATH, stringsAsFactors = FALSE)

ADT_EVENTS %>%
  slice(c(1, 2, 3, 12, 13, 16, 20, 21)) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling()
```
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-05-10 06:08:10 | NA | ED TRIAGE |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-05-10 06:13:24 | ED TRIAGE | ED TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-05-10 06:54:10 | ED TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-13 11:32:48 | INTENSIVE CARE TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-18 19:53:13 | INTENSIVE CARE TRAUMA | TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-28 02:14:32 | INTENSIVE CARE TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-29 22:09:10 | INTENSIVE CARE TRAUMA | NA |
In our table we see some (preselected) rows and all of our columns. We’re looking at just one patient (`ENCOUNTER_NUM_ANONYMIZED`), with a snapshot of some of their events (`EVENT_TS_FUZZED`), which service they were coming from (`FROM_SERVICE`), and which service they were going to (`TO_SERVICE`).
Let’s now take a look at the `SPECIFIC_PROCEDURES` data.
```r
PROCEDURE_FILE_PATH <- here::here("_posts", "2022-08-12-putting-the-fun-in-funneljoin",
                                  "SPECIFIC_PROCEDURES.csv")

SPECIFIC_PROCEDURES <- read.csv(PROCEDURE_FILE_PATH, stringsAsFactors = FALSE)

SPECIFIC_PROCEDURES %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling()
```
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 21:34:13 | TRAUMA | INTENSIVE CARE TRAUMA |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-02 10:46:45 | TRAUMA | INTENSIVE CARE TRAUMA |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-22 02:59:57 | TRAUMA | INTENSIVE CARE TRAUMA |
| 87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 | 2090-09-05 15:01:52 | NEUROSURGERY | INTENSIVE CARE NEURO SURGERY |
We’re showing all of `SPECIFIC_PROCEDURES` because it’s much smaller. It has the same structure as the ADT file (and that’s because the procedures are a subset of all the ADT events). Now that we can see the procedures, we can also confirm that each one is an event in `ADT_EVENTS`. Here’s the first row of `SPECIFIC_PROCEDURES`:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
and now the ADT events around it: the event before, the procedure event, and the event after (which is what we eventually want to identify).
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-18 20:02:58 | TRAUMA | TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 19:54:10 | INTENSIVE CARE TRAUMA | TRAUMA |
A fun note before we move on: when you look at these data you’ll see some `NA` values. Those are R’s `NA`, which effectively means missing data. In this case, `NA` is absolutely not NA as in sodium (sodium is not a service). You should really take a look at some of the “fun” with `NA` we’ve had.
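To see why that distinction matters in code, here’s a tiny demonstration of how R’s `NA` behaves (nothing here is specific to our data):

```r
services <- c("ED TRIAGE", NA, "TRAUMA")

# is.na() is the right way to find missing values
is.na(services)    # FALSE TRUE FALSE

# Comparing against the *string* "NA" does not find them, and the comparison
# with the missing element is itself NA
services == "NA"   # FALSE NA FALSE

# Even NA == NA is NA: an unknown compared to an unknown is still unknown
NA == NA           # NA
```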
Our task was to identify the event that happens after specific procedures. From the above, we can see an example of that:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 19:54:10 | INTENSIVE CARE TRAUMA | TRAUMA |
The first row is the procedure, and the second row is the event that happens after that specific procedure. In the end, we want all of those events after procedures, because we needed to understand more about patient movements after procedures for some of our modelling.
## funneljoin
Given that we have two data sets, and we know that we want to match on certain things (`ENCOUNTER_NUM_ANONYMIZED`) but conditional on subsequent time stamps (`EVENT_TS_FUZZED`), we probably have some sort of `join` problem.
I spent a lot of time trying out the standard `join` and `merge` options we find in R: `dplyr`, `base::merge`, and even ventured off into the land of `fuzzyjoin`. I tried a lot of things, and all of those things were wrong or overly complicated.
So instead of just writing some code to find the next line in the ADT events data after a matching line in the procedures data, I spent a few[^7] half days[^8] searching for a package that probably does the thing I was looking for. It took me quite a while and a variety of search terms (e.g., “time series join,” “join events after,” “fuzzyjoin for time,” “why doesn’t this specific thing I want exist and why am I so bad at this?”) until I eventually found `funneljoin`[^9].
The `funneljoin` package includes a lot of `join` options for time series data. In particular, `after_join()` is for when we’re looking for events in one data set that occur after events in another data set. That was exactly what I was looking for. Though `after_join()` was the key to solving my problems, I still had a few more problems[^10], but I did eventually figure it all out and it was magical. Let’s walk through three examples with these data and `after_join()` to show where I failed and where I eventually succeeded.
I dove into `after_join()` with, basically, the default parameters. So let’s break this down:
- `x` is our procedures data, because that’s our reference point: we want ADT events after procedures
- `y` is our ADT data, to get those events after procedures
- `by_user` is the column we use for identifiers (`ENCOUNTER_NUM_ANONYMIZED`); this is a very typical parameter to expect in `join` and `merge`, as we usually want to join data sets based on some identifier
- `by_time` is the column we use to find the events (`EVENT_TS_FUZZED`); this is how `funneljoin` makes use of time series data
- `suffix` appends labels to the newly created columns after the join, respectively for `x` then `y`

Seems straightforward enough! Let’s see what happens:
```r
library(funneljoin)

join_attempt_one <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES", "_ADT")
)

rmarkdown::paged_table(join_attempt_one)
```
Err… it sort of looks like nothing happened. Why is that? Well, it’s because our resulting join is empty[^11]. I clearly missed something[^12]. On a closer look at the parameters, it seemed like I needed a couple more to make things work. So I brought them in and tried:
- `mode` is `inner`, a specific type of join which is basically the intersection
- `type` is one of many options for how to think about the join, and we set it to `first-firstafter`
Before we dive into the updated code, let’s pause on what `type` is and what `first-firstafter` means. Heavily borrowing from the `funneljoin` site:

> first-firstafter: Take the first x, then the first y after that.

For example, we have the first procedure for the first patient in the procedures (`x`) data, and we want the first event from the ADT (`y`) data that occurs afterwards. We don’t want all the events afterward, and we don’t want any before. Just the one! So let’s try it:
```r
join_attempt_two <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES", "_ADT"),
  mode = "inner",
  type = "first-firstafter"
)

rmarkdown::paged_table(join_attempt_two)
```
The above shows the whole table but let’s take a closer look at something…
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED_PROCEDURES | EVENT_TS_FUZZED_ADT |
|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | 2082-06-22 10:27:06 |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-02 10:46:45 | 2083-06-02 10:46:45 |
| 87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 | 2090-09-05 15:01:52 | 2090-09-05 15:01:52 |
Well, that didn’t quite work, because it’s actually finding the same time-stamped events: each procedure just matched itself. So we’re not yet finding the first event after, but we’ve at least got something. So… what are we missing? What we’re missing is the next event. Right now, we’re getting back the same event. Fortunately, there’s a parameter for that:
`min_gap` allows us to specify how much time there must be between the `first-firstafter` events. There are also two companion parameters: `max_gap` and `gap_col`, which give us, respectively, the maximum time between events and a column containing the gap in time itself. For fun, let’s also add in `gap_col` so we can see the amount of time between events.

```r
join_attempt_three <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES", "_ADT"),
  mode = "inner",
  type = "first-firstafter",
  min_gap = base::as.difftime(1, units = "secs"),
  gap_col = TRUE
)

rmarkdown::paged_table(join_attempt_three)
```
Oh, that looks like we did it! Let’s look at just a snapshot of encounter numbers and timestamps with the `.gap` column:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED_PROCEDURES | EVENT_TS_FUZZED_ADT | .gap |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | 2082-06-22 19:54:10 | 34024 |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-02 10:46:45 | 2083-06-16 14:29:44 | 1222979 |
| 87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 | 2090-09-05 15:01:52 | 2090-09-06 11:50:43 | 74931 |
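Those `.gap` values are in seconds, and we can sanity-check the first one by hand with `difftime`:

```r
t_procedure <- as.POSIXct("2082-06-22 10:27:06", tz = "UTC")
t_adt       <- as.POSIXct("2082-06-22 19:54:10", tz = "UTC")

# Difference between the procedure and the next ADT event, in seconds
gap_secs <- as.numeric(difftime(t_adt, t_procedure, units = "secs"))
gap_secs  # 34024, matching the .gap column for the first encounter
```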
OH, I THINK WE REALLY DID DO IT. Let’s just take a quick look back at an earlier chunk of the ADT file and verify for just one example[^13]:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 19:54:10 | INTENSIVE CARE TRAUMA | TRAUMA |
The first conclusion is that `funneljoin` is awesome, and the second conclusion is that I’m lazy. It is also safe to conclude that I’ve used an excessive number of unnecessary footnotes[^14].
When we look back at these data and the problem, we probably could have solved this with some `dplyr::group_by` at the encounter (ID) level, some checks on the timestamps, and some stuff like that. But we had these two data sets—each used separately for different reasons in the same project—so why not make this easy and intuitive? That’s the major advantage of `funneljoin` here.
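For the curious, that `group_by`-style alternative might look roughly like this. This is a sketch on a tiny hypothetical encounter (the names mirror our files, but the data and the single hard-coded procedure timestamp are made up for illustration):

```r
library(dplyr)

# A tiny hypothetical encounter with three ADT events
adt <- data.frame(
  ENCOUNTER_NUM_ANONYMIZED = "enc-1",
  EVENT_TS_FUZZED = as.POSIXct(c("2082-06-18 20:02:58",
                                 "2082-06-22 10:27:06",
                                 "2082-06-22 19:54:10"), tz = "UTC")
)

# The procedure timestamp we want the next event after
proc_ts <- as.POSIXct("2082-06-22 10:27:06", tz = "UTC")

# Group by the encounter, keep only events strictly after the procedure,
# then take the earliest remaining event per encounter
next_event <- adt %>%
  group_by(ENCOUNTER_NUM_ANONYMIZED) %>%
  filter(EVENT_TS_FUZZED > proc_ts) %>%
  slice_min(EVENT_TS_FUZZED, n = 1) %>%
  ungroup()
```

With multiple procedures per encounter you’d have to carry each procedure’s timestamp through a join before filtering, which is exactly where this gets fiddly and `after_join()` earns its keep.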
Maybe the way I solved this problem with `funneljoin` could have been better[^15], but it was really useful for me. It was also a super useful exercise for us to find this package and start to understand it, because we deal with a lot of time stamps. We often have questions about the order of events, or we have to pull multiple pieces of data together from a variety of source systems and ensure it’s all in the right order.
And learning `funneljoin` was fun, and you can tell because it’s literally in the name.
[^1]: “helping” is generous: I mostly asked a million stupid questions and, as we’ve seen, spent many days finding a package to do what I want instead of just doing it.
[^2]: again: not without asking a million stupid questions and getting back a million fantastic answers.
[^3]: and very hacky.
[^4]: a “few” is defined as a week, and “half day” is defined as “definitely more than half a day”.
[^5]: and very hacky.
[^6]: a million stupid questions followed by a million stupid mistakes.
[^7]: again: not a few.
[^8]: again: definitely more than half days.
[^9]: “found” is a generous term; I more so stumbled across it with no recollection of how I found it after a week.
[^10]: Did I mention I’m bad at this?
[^11]: again: I ain’t so good at this.
[^12]: this is a generous way of saying I didn’t read the documentation.
[^13]: we really should verify for all, but have I mentioned I’m lazy?
[^14]: not enough footnotes, and they are totally necessary.
[^15]: almost certainly it could have been better.