Putting the fun in funneljoin

language-R project-chartwatch post-miscellaneous funneljoin

An introduction to, an example of, and my (lazy) journey to discovering the funneljoin package. [10 min read]

Derek Beaton https://twitter.com/derek__beaton (DSAA, Unity Health Toronto)https://chartdatascience.ca
2022-08-12

My lazy journey to discovery

A while back, I was helping1 out on one of our new projects: an expansion of CHARTWatch to new units. To learn a bit more about CHARTWatch you should read (Verma et al. 2021) that explains how to get models in the clinical environment and (Pou-Prom et al., n.d.) on all the technical parts required.

We were doing some data exploration for the project and we had a fairly straight forward question to answer: What is the first event after a procedure for a patient?

With that question, I dove into the data files we had started to get a feel for what to do2. I had a few files to work with, lots of timestamps all over the place, and knew I had some sort of join type problem. I tried—and failed with—many variations of joins (e.g., fuzzyjoin) and even trying to get what was needed in more manual3 ways. One thought kept ringing in my head: “someone must have solved this problem.” So at that point I spent a few half days4 searching. Many fruitless paths later and on the verge of the more manual5 approaches I finally found what I was looking for: funneljoin.

Though funneljoin was what I was looking for, my use of it was also a journey through multiple mistakes and misunderstandings all of which are my own6. But I eventually got exactly what I wanted: a straightforward way to join some data and find very specific events that occur after other events.

A quick view of the data

Let’s start out by taking a look at the data. That’ll give us a better sense of the problem and the behavior we’re expecting. We’re going to be working with a tiny example of what the real data could look like. These fake data have been created from real data and then we used uuid, dplyr::group_by, lubridate and some good old fashioned randomization to make it fake.

We have two data files which look a lot like our real data:

It’s very worth noting that all the events in SPECIFIC_PROCEDURES are in ALL_ADT_EVENTS. These are separate for a few reasons including (but certainly not limited to): it’s easier to work with when we want to know only about the procedures, and things like procedures can (and are) pulled from separate pipelines more specific than ADT pipelines.

A look at the data

Let’s start out by taking a look at some of the ADT file and a few (preselected) rows to highlight these data. And while we’re at it, we’ll see the code, too!

library(here)       ## for here::here() and referencing files from this .RProj
library(dplyr)      ## for some processing and those fancy pipes
library(kableExtra) ## for some extra fancy looking tables
library(rmarkdown)  ## for some even more extra fancy tables

ADT_FILE_PATH <- here::here("_posts","2022-08-12-putting-the-fun-in-funneljoin", "ALL_ADT_EVENTS.csv")
ADT_EVENTS <- read.csv(ADT_FILE_PATH, 
                        stringsAsFactors = FALSE)

ADT_EVENTS %>%
  slice( c(1, 2, 3, 12, 13, 16, 20, 21) ) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling() 
ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED FROM_SERVICE TO_SERVICE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-05-10 06:08:10 NA ED TRIAGE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-05-10 06:13:24 ED TRIAGE ED TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-05-10 06:54:10 ED TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-13 11:32:48 INTENSIVE CARE TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-18 19:53:13 INTENSIVE CARE TRAUMA TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-28 02:14:32 INTENSIVE CARE TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-29 22:09:10 INTENSIVE CARE TRAUMA NA

In our table we see some (preselected) rows and all of our columns. We’re looking at just 1 patient (ENCOUNTER_NUM_ANONYMIZED) with a snapshot of some of their events (EVENT_TS_FUZZED), which service they were coming from (FROM_SERVICE) and which service they were going to (TO_SERVICE).

Let’s now take a look at the SPECIFIC_PROCEDURES data

PROCEDURE_FILE_PATH <- here::here("_posts","2022-08-12-putting-the-fun-in-funneljoin", "SPECIFIC_PROCEDURES.csv")
SPECIFIC_PROCEDURES <- read.csv(PROCEDURE_FILE_PATH,
                                 stringsAsFactors = FALSE)

SPECIFIC_PROCEDURES %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling() 
ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED FROM_SERVICE TO_SERVICE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 21:34:13 TRAUMA INTENSIVE CARE TRAUMA
7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 2083-06-02 10:46:45 TRAUMA INTENSIVE CARE TRAUMA
7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 2083-06-22 02:59:57 TRAUMA INTENSIVE CARE TRAUMA
87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 2090-09-05 15:01:52 NEUROSURGERY INTENSIVE CARE NEURO SURGERY

We’re showing all of SPECIFIC_PROCEDURES because it’s much smaller. It has the same structure as the ADT file (and that’s because the procedures are a subset of all the ADT events). Now that we see the procedures we can also see that it’s an event in ADT_EVENTS. The first SPECIFIC_PROCEDURE:

SPECIFIC_PROCEDURES %>%
  slice(1) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling() 
ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED FROM_SERVICE TO_SERVICE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 TRAUMA INTENSIVE CARE TRAUMA

and now the ADT events with the event before, the procedure event, and the event after (which is what we want to specifically identify; eventually that is).

ADT_EVENTS %>%
  slice( c(15, 16, 17) ) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling() 
ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED FROM_SERVICE TO_SERVICE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-18 20:02:58 TRAUMA TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 19:54:10 INTENSIVE CARE TRAUMA TRAUMA

A fun note before we move on: When you look at these data you’ll see some NA in there. Those NA are Rs NA which is effectively missing data. In this case NA is absolutely not NA which is sodium (sodium is not a service). You should really take a look at some of the “fun” with NA we’ve had.

What we want and the behavior we’d expect

Our task was to identify the event that happens after specific procedures. From the above, we can see an example of that:

ADT_EVENTS %>%
  slice( c(16, 17) ) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling() 
ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED FROM_SERVICE TO_SERVICE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 19:54:10 INTENSIVE CARE TRAUMA TRAUMA

The first row is the procedure and the second row is the event that happens after a specific procedure. At the end of what we do, we want all of those events after procedures because we needed to understand more about patient movements after procedures for some of our modelling.

A journey to find and a journey through funneljoin

Given that we have two data sets and we know that we want to match on certain things (ENCOUNTER_NUM_ANONYMIZED) but conditional on subsequent time stamps (EVENT_TS_FUZZED), we probably have some sort of join problem.

I spent a lot of time trying out a lot of the standard join and merge options we find in R: dplyr, base::merge, and even ventured off into the land of fuzzyjoin. I tried a lot of things and all of those things were wrong or overly complicated.

So instead of just writing some code to find the next line in the ADT events data after a matching line in the procedures data, I spent a few7 half days8 searching for a package that probably does the thing I’m looking for. It took me quite a while and a variety of search terms (e.g., “time series join,” “join events after,” “fuzzyjoin for time,” “why doesn’t this specific thing I want exist and why am I so bad at this?”) until I eventually found funneljoin9.

The funneljoin package includes a lot of join options for time series data. In particular after_join is when we’re looking for events in one data set that occur after events in another data set. That was exactly what I was looking for. Though after_join was the key to solving my problems, I still had a few more problems10 but did eventually figure it all out and it was magical. Let’s walk through three examples with these data and after_join to show where I failed and where I eventually succeeded.

My first failure

I dove into after_join with, basically, the default parameters. So let’s break this down:

Seems straight forward enough! Let’s see what happens:

library(funneljoin)

join_attempt_one <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES","_ADT")
)

rmarkdown::paged_table(join_attempt_one)

Err… it sort of looks like nothing happens. Why is that? Well it’s because our resulting join is empty11

My second failure

I clearly missed something12. On a closer look at the parameters, it seemed like I needed a couple of more to make things work. So I brought them in and tried:

Before we dive into the updated code let’s pause for what type is and what first-firstafter means. Heavily borrowing from the funneljoin site:

first-firstafter: Take the first x, then the first y after that. For example, we have the first procedure for the first patient in the procedures (x) data, and we want the first event from the ADT (y) data that occurs afterwards. We don’t want all afterward, we don’t want any before. Just the one! So let’s try it:

join_attempt_two <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES","_ADT"),
  mode = "inner",
  type = "first-firstafter"
)

rmarkdown::paged_table(join_attempt_two)

The above shows the whole table but let’s take a closer look at something…

ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED_PROCEDURES EVENT_TS_FUZZED_ADT
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 2082-06-22 10:27:06
7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 2083-06-02 10:46:45 2083-06-02 10:46:45
87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 2090-09-05 15:01:52 2090-09-05 15:01:52

Well that didn’t quite work because it’s actually finding the same time stamped events. So we’re not yet finding the first event after but we’ve at least got something. So… what are we missing?

A third successful attempt

What we’re missing is the next event. Right now, we’re getting back the same event. Fortunately there’s a parameter for that:

join_attempt_three <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES","_ADT"),
  mode = "inner",
  type = "first-firstafter",
  min_gap = base::as.difftime(1,units="secs"),
  gap_col = TRUE
)

rmarkdown::paged_table(join_attempt_three)

Oh that looks like we did it! Let’s look at just the snapshot of encounter numbers and timestamps with the .gap column:

ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED_PROCEDURES EVENT_TS_FUZZED_ADT .gap
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 2082-06-22 19:54:10 34024
7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 2083-06-02 10:46:45 2083-06-16 14:29:44 1222979
87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 2090-09-05 15:01:52 2090-09-06 11:50:43 74931

OH I THINK WE REALLY DID DO IT. Let’s just take a quick look back at an earlier chunk of code and verify based on just the ADT file for just one example13

ADT_EVENTS %>%
  slice( c(16, 17) ) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling() 
ENCOUNTER_NUM_ANONYMIZED EVENT_TS_FUZZED FROM_SERVICE TO_SERVICE
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 10:27:06 TRAUMA INTENSIVE CARE TRAUMA
a4461971-22e1-46e1-8454-aa7fbce54205 2082-06-22 19:54:10 INTENSIVE CARE TRAUMA TRAUMA

Conclusions

The first conclusion is that funneljoin is awesome and the second conclusion is that I’m lazy. It is also a safe assumption to conclude that I’ve used an excessive amount of unnecessary footnotes14.

When we look back at these data and the problem, we probably could have solved this with some dplyr::group_by at the encounter (ID) level, do some checks on the timestamps, and some stuff like that. But we had these two data sets—each used separately for different reasons in the same project—so why not make this easy and intuitive? That’s the major advantage of funneljoin here.

Maybe the way I solved this problem with funneljoin could have been better15, but this was really useful for me. It was also a super useful exercise for us to find this package and start to understand it because we deal with a lot of time stamps. We often have frequent questions about order of events, or we have to pull multiple pieces of data together from a variety of source systems and ensure it’s all in the right order.

And learning funneljoin was fun and you can tell because it’s literally in the name.

Pou-Prom, Chloé, Joshua Murray, Sebnem Kuzulugil, Muhammad Mamdani, and Amol Verma. n.d. “From Compute to Care: Lessons Learned from Deploying an Early Warning System into Clinical Practice.” Frontiers in Digital Health, 174. https://www.frontiersin.org/articles/10.3389/fdgth.2022.932123/abstract.
Verma, Amol A., Joshua Murray, Russell Greiner, Joseph Paul Cohen, Kaveh G. Shojania, Marzyeh Ghassemi, Sharon E. Straus, Chloe Pou-Prom, and Muhammad Mamdani. 2021. “Implementing Machine Learning in Medicine.” CMAJ 193 (34): E1351–57. https://doi.org/10.1503/cmaj.202434.

  1. “helping” is generous: I mostly asked a million stupid questions and as we’ve seen, spent many days finding a package to do what I want instead of just doing it↩︎

  2. again: not without asking a million stupid questions and getting back a million fantastic answers↩︎

  3. and very hacky↩︎

  4. a ‘few’ is defined as a week and ‘half day’ is defined as ‘definitely more than half a day’↩︎

  5. and very hacky↩︎

  6. a million stupid questions followed by a million stupid mistakes↩︎

  7. again: not a few↩︎

  8. again: definitely more than half days↩︎

  9. found is a generous term, I more so stumbled across it with no recollection how I found it after a week↩︎

  10. Did I mention I’m bad at this?↩︎

  11. again: I ain’t so good at this↩︎

  12. this is a generous way of saying I didn’t read the documentation↩︎

  13. we really should verify for all but have I mentioned I’m lazy?↩︎

  14. not enough footnotes and they are totally necessary↩︎

  15. almost certainly it could have been better↩︎

References