An introduction to, an example of, and my (lazy) journey to discovering the `funneljoin` package. [10 min read]
A while back, I was helping[^1] out on one of our new projects: an expansion of CHARTWatch to new units. To learn a bit more about CHARTWatch, you should read (Verma et al. 2021), which explains how to get models into the clinical environment, and (Pou-Prom et al., n.d.) on all the technical parts required.
We were doing some data exploration for the project and we had a fairly straightforward question to answer: What is the first event after a procedure for a patient?
With that question, I dove into the data files we had to start getting a feel for what to do[^2]. I had a few files to work with, lots of timestamps all over the place, and I knew I had some sort of `join`-type problem. I tried—and failed with—many variations of joins (e.g., `fuzzyjoin`) and even tried to get what was needed in more manual[^3] ways. One thought kept ringing in my head: “someone must have solved this problem.” So at that point I spent a few half days[^4] searching. Many fruitless paths later, and on the verge of falling back to the more manual[^5] approaches, I finally found what I was looking for: `funneljoin`.
Though `funneljoin` was what I was looking for, my use of it was also a journey through multiple mistakes and misunderstandings, all of which are my own[^6]. But I eventually got exactly what I wanted: a straightforward way to join some data and find very specific events that occur after other events.
Let’s start out by taking a look at the data. That’ll give us a better sense of the problem and the behavior we’re expecting. We’re going to be working with a tiny example of what the real data could look like. These fake data were created from real data using `uuid`, `dplyr::group_by`, `lubridate`, and some good old-fashioned randomization.
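As a rough illustration of the fuzzing idea (this is a sketch, not the actual code we used, and the shift range is made up):

```r
set.seed(42)  # reproducible sketch

real_ts <- as.POSIXct("2022-06-22 10:27:06", tz = "UTC")

# Shift the timestamp by a random (large) number of whole days, so the
# relative gaps between a patient's events survive but the dates are fake
shift_days <- sample(20000:25000, 1)
fuzzed_ts <- real_ts + shift_days * 86400  # 86400 seconds per day

fuzzed_ts > real_ts  # TRUE: ordering is preserved, dates are far in the future
```

In a real pipeline you’d apply the same shift to every event within an encounter (that’s where `dplyr::group_by` comes in) so the within-encounter gaps stay intact.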
We have two data files which look a lot like our real data:

- `ALL_ADT_EVENTS.csv`: contains all the Admit-Discharge-Transfer (ADT) events for patients while they are in the hospital, and
- `SPECIFIC_PROCEDURES.csv`: contains a very specific set of procedures performed while in the hospital.
It’s very worth noting that all the events in `SPECIFIC_PROCEDURES` are also in `ALL_ADT_EVENTS`. They are kept separate for a few reasons, including (but certainly not limited to): it’s easier to work with a file containing only the procedures, and things like procedures can be (and are) pulled from separate pipelines more specific than ADT pipelines.
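One handy way to convince yourself of that subset relationship is `dplyr::anti_join()`, which returns the rows of its first argument that have no match in its second. A minimal sketch with made-up miniature data (not our real files):

```r
library(dplyr)

# Hypothetical miniature stand-ins for the two files
all_events <- data.frame(id = c("a", "a", "b"), ts = c(1, 2, 3))
procedures <- data.frame(id = "a", ts = 2)

# Rows of procedures with no matching row in all_events; zero rows
# means every procedure really is also an event in the events file
leftovers <- anti_join(procedures, all_events, by = c("id", "ts"))
nrow(leftovers)  # 0
```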
Let’s now look at some of the ADT file and a few (preselected) rows that highlight these data. And while we’re at it, we’ll see the code, too!
```r
library(here)       ## for here::here() and referencing files from this .RProj
library(dplyr)      ## for some processing and those fancy pipes
library(kableExtra) ## for some extra fancy looking tables
library(rmarkdown)  ## for some even more extra fancy tables

ADT_FILE_PATH <- here::here("_posts", "2022-08-12-putting-the-fun-in-funneljoin",
                            "ALL_ADT_EVENTS.csv")

ADT_EVENTS <- read.csv(ADT_FILE_PATH, stringsAsFactors = FALSE)

ADT_EVENTS %>%
  slice(c(1, 2, 3, 12, 13, 16, 20, 21)) %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling()
```
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-05-10 06:08:10 | NA | ED TRIAGE |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-05-10 06:13:24 | ED TRIAGE | ED TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-05-10 06:54:10 | ED TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-13 11:32:48 | INTENSIVE CARE TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-18 19:53:13 | INTENSIVE CARE TRAUMA | TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-28 02:14:32 | INTENSIVE CARE TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-29 22:09:10 | INTENSIVE CARE TRAUMA | NA |
In our table we see some (preselected) rows and all of our columns. We’re looking at just one patient (`ENCOUNTER_NUM_ANONYMIZED`), with a snapshot of some of their events (`EVENT_TS_FUZZED`), which service they were coming from (`FROM_SERVICE`), and which service they were going to (`TO_SERVICE`).
Let’s now take a look at the `SPECIFIC_PROCEDURES` data.
```r
PROCEDURE_FILE_PATH <- here::here("_posts", "2022-08-12-putting-the-fun-in-funneljoin",
                                  "SPECIFIC_PROCEDURES.csv")

SPECIFIC_PROCEDURES <- read.csv(PROCEDURE_FILE_PATH, stringsAsFactors = FALSE)

SPECIFIC_PROCEDURES %>%
  kableExtra::kbl() %>%
  kableExtra::kable_styling()
```
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 21:34:13 | TRAUMA | INTENSIVE CARE TRAUMA |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-02 10:46:45 | TRAUMA | INTENSIVE CARE TRAUMA |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-22 02:59:57 | TRAUMA | INTENSIVE CARE TRAUMA |
| 87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 | 2090-09-05 15:01:52 | NEUROSURGERY | INTENSIVE CARE NEURO SURGERY |
We’re showing all of `SPECIFIC_PROCEDURES` because it’s much smaller. It has the same structure as the ADT file (and that’s because the procedures are a subset of all the ADT events). Now that we can see the procedures, we can also confirm that each one is an event in `ADT_EVENTS`. Here’s the first row of `SPECIFIC_PROCEDURES`:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
and now the ADT events around it: the event before, the procedure event, and the event after (which is what we eventually want to identify).
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-18 20:02:58 | TRAUMA | TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 19:54:10 | INTENSIVE CARE TRAUMA | TRAUMA |
A fun note before we move on: when you look at these data you’ll see some `NA` values. Those are R’s `NA`, which effectively means missing data. In this case, `NA` is absolutely not NA as in sodium (sodium is not a service). You should really take a look at some of the “fun” with `NA` we’ve had.
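To see why that distinction matters in code, here’s a tiny demonstration of how R’s `NA` behaves (nothing here is specific to our data):

```r
services <- c("ED TRIAGE", NA, "TRAUMA")

# is.na() is the right way to find missing values
is.na(services)    # FALSE TRUE FALSE

# Comparing against the *string* "NA" does not find them, and the comparison
# with the missing element is itself NA
services == "NA"   # FALSE NA FALSE

# Even NA == NA is NA: an unknown compared to an unknown is still unknown
NA == NA           # NA
```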
Our task was to identify the event that happens after specific procedures. From the above, we can see an example of that:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 19:54:10 | INTENSIVE CARE TRAUMA | TRAUMA |
The first row is the procedure, and the second row is the event that happens after that specific procedure. In the end, we want all of those events after procedures, because we needed to understand more about patient movements after procedures for some of our modelling.
## funneljoin
Given that we have two data sets, and we know that we want to match on certain things (`ENCOUNTER_NUM_ANONYMIZED`) but conditional on subsequent time stamps (`EVENT_TS_FUZZED`), we probably have some sort of `join` problem.
I spent a lot of time trying out the standard `join` and `merge` options we find in R: `dplyr`, `base::merge`, and even ventured off into the land of `fuzzyjoin`. I tried a lot of things, and all of those things were wrong or overly complicated.
So instead of just writing some code to find the next line in the ADT events data after a matching line in the procedures data, I spent a few[^7] half days[^8] searching for a package that probably does the thing I was looking for. It took me quite a while and a variety of search terms (e.g., “time series join,” “join events after,” “fuzzyjoin for time,” “why doesn’t this specific thing I want exist and why am I so bad at this?”) until I eventually found `funneljoin`[^9].
The `funneljoin` package includes a lot of `join` options for time series data. In particular, `after_join()` is for when we’re looking for events in one data set that occur after events in another data set. That was exactly what I was looking for. Though `after_join()` was the key to solving my problems, I still had a few more problems[^10], but I did eventually figure it all out and it was magical. Let’s walk through three examples with these data and `after_join()` to show where I failed and where I eventually succeeded.
I dove into `after_join()` with, basically, the default parameters. So let’s break this down:
- `x` is our procedures data, because that’s our reference point: we want ADT events after procedures
- `y` is our ADT data, to get those events after procedures
- `by_user` is the column we use for identifiers (`ENCOUNTER_NUM_ANONYMIZED`); this is a very typical parameter to expect in `join` and `merge`, as we usually want to join data sets based on some identifier
- `by_time` is the column we use to find the events (`EVENT_TS_FUZZED`); this is how `funneljoin` makes use of time series data
- `suffix` appends labels to the newly created columns after the join, respectively for `x` then `y`

Seems straightforward enough! Let’s see what happens:
```r
library(funneljoin)

join_attempt_one <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES", "_ADT")
)

rmarkdown::paged_table(join_attempt_one)
```
Err… it sort of looks like nothing happened. Why is that? Well, it’s because our resulting join is empty[^11]. I clearly missed something[^12]. On a closer look at the parameters, it seemed like I needed a couple more to make things work. So I brought them in and tried:
- `mode` is `inner`, a specific type of join which is basically the intersection
- `type` is one of many options for how to think about the join, and we set it to `first-firstafter`
Before we dive into the updated code, let’s pause on what `type` is and what `first-firstafter` means. Heavily borrowing from the `funneljoin` site:

> first-firstafter: Take the first x, then the first y after that.

For example, we have the first procedure for the first patient in the procedures (`x`) data, and we want the first event from the ADT (`y`) data that occurs afterwards. We don’t want all the events afterward, and we don’t want any before. Just the one! So let’s try it:
```r
join_attempt_two <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES", "_ADT"),
  mode = "inner",
  type = "first-firstafter"
)

rmarkdown::paged_table(join_attempt_two)
```
The above shows the whole table but let’s take a closer look at something…
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED_PROCEDURES | EVENT_TS_FUZZED_ADT |
|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | 2082-06-22 10:27:06 |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-02 10:46:45 | 2083-06-02 10:46:45 |
| 87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 | 2090-09-05 15:01:52 | 2090-09-05 15:01:52 |
Well, that didn’t quite work, because it’s actually finding the same time-stamped events: each procedure just matched itself. So we’re not yet finding the first event after, but we’ve at least got something. So… what are we missing? What we’re missing is the next event. Right now, we’re getting back the same event. Fortunately, there’s a parameter for that:
`min_gap` allows us to specify how much time there must be between the `first-firstafter` events. There are also two companion parameters: `max_gap` and `gap_col`, which give us, respectively, the maximum time between events and a column containing the gap in time itself. For fun, let’s also add in `gap_col` so we can see the amount of time between events.

```r
join_attempt_three <- after_join(
  x = SPECIFIC_PROCEDURES,
  y = ADT_EVENTS,
  by_user = "ENCOUNTER_NUM_ANONYMIZED",
  by_time = "EVENT_TS_FUZZED",
  suffix = c("_PROCEDURES", "_ADT"),
  mode = "inner",
  type = "first-firstafter",
  min_gap = base::as.difftime(1, units = "secs"),
  gap_col = TRUE
)

rmarkdown::paged_table(join_attempt_three)
```
Oh, that looks like we did it! Let’s look at just a snapshot of encounter numbers and timestamps with the `.gap` column:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED_PROCEDURES | EVENT_TS_FUZZED_ADT | .gap |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | 2082-06-22 19:54:10 | 34024 |
| 7cc7972a-58a1-4c78-a0dc-83021c6dc0c6 | 2083-06-02 10:46:45 | 2083-06-16 14:29:44 | 1222979 |
| 87d439b0-4f26-4b73-8df3-bf5dbbf34ca7 | 2090-09-05 15:01:52 | 2090-09-06 11:50:43 | 74931 |
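Those `.gap` values are in seconds, and we can sanity-check the first one by hand with `difftime`:

```r
t_procedure <- as.POSIXct("2082-06-22 10:27:06", tz = "UTC")
t_adt       <- as.POSIXct("2082-06-22 19:54:10", tz = "UTC")

# Difference between the procedure and the next ADT event, in seconds
gap_secs <- as.numeric(difftime(t_adt, t_procedure, units = "secs"))
gap_secs  # 34024, matching the .gap column for the first encounter
```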
OH, I THINK WE REALLY DID DO IT. Let’s just take a quick look back at an earlier chunk of the ADT file and verify for just one example[^13]:
| ENCOUNTER_NUM_ANONYMIZED | EVENT_TS_FUZZED | FROM_SERVICE | TO_SERVICE |
|---|---|---|---|
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 10:27:06 | TRAUMA | INTENSIVE CARE TRAUMA |
| a4461971-22e1-46e1-8454-aa7fbce54205 | 2082-06-22 19:54:10 | INTENSIVE CARE TRAUMA | TRAUMA |
The first conclusion is that `funneljoin` is awesome, and the second conclusion is that I’m lazy. It is also safe to conclude that I’ve used an excessive number of unnecessary footnotes[^14].
When we look back at these data and the problem, we probably could have solved this with some `dplyr::group_by` at the encounter (ID) level, some checks on the timestamps, and some stuff like that. But we had these two data sets—each used separately for different reasons in the same project—so why not make this easy and intuitive? That’s the major advantage of `funneljoin` here.
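For the curious, that `group_by`-style alternative might look roughly like this. This is a sketch on a tiny hypothetical encounter (the names mirror our files, but the data and the single hard-coded procedure timestamp are made up for illustration):

```r
library(dplyr)

# A tiny hypothetical encounter with three ADT events
adt <- data.frame(
  ENCOUNTER_NUM_ANONYMIZED = "enc-1",
  EVENT_TS_FUZZED = as.POSIXct(c("2082-06-18 20:02:58",
                                 "2082-06-22 10:27:06",
                                 "2082-06-22 19:54:10"), tz = "UTC")
)

# The procedure timestamp we want the next event after
proc_ts <- as.POSIXct("2082-06-22 10:27:06", tz = "UTC")

# Group by the encounter, keep only events strictly after the procedure,
# then take the earliest remaining event per encounter
next_event <- adt %>%
  group_by(ENCOUNTER_NUM_ANONYMIZED) %>%
  filter(EVENT_TS_FUZZED > proc_ts) %>%
  slice_min(EVENT_TS_FUZZED, n = 1) %>%
  ungroup()
```

With multiple procedures per encounter you’d have to carry each procedure’s timestamp through a join before filtering, which is exactly where this gets fiddly and `after_join()` earns its keep.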
Maybe the way I solved this problem with `funneljoin` could have been better[^15], but it was really useful for me. It was also a super useful exercise for us to find this package and start to understand it, because we deal with a lot of time stamps. We often have questions about the order of events, or we have to pull multiple pieces of data together from a variety of source systems and ensure it’s all in the right order.
And learning `funneljoin` was fun, and you can tell because it’s literally in the name.
[^1]: “helping” is generous: I mostly asked a million stupid questions and, as we’ve seen, spent many days finding a package to do what I want instead of just doing it.
[^2]: again: not without asking a million stupid questions and getting back a million fantastic answers.
[^3]: and very hacky.
[^4]: a “few” is defined as a week, and “half day” is defined as “definitely more than half a day”.
[^5]: and very hacky.
[^6]: a million stupid questions followed by a million stupid mistakes.
[^7]: again: not a few.
[^8]: again: definitely more than half days.
[^9]: “found” is a generous term; I more so stumbled across it with no recollection of how I found it after a week.
[^10]: Did I mention I’m bad at this?
[^11]: again: I ain’t so good at this.
[^12]: this is a generous way of saying I didn’t read the documentation.
[^13]: we really should verify for all, but have I mentioned I’m lazy?
[^14]: not enough footnotes, and they are totally necessary.
[^15]: almost certainly it could have been better.