Ooh na na… where are my sodium labs?

language-R project-chartwatch post-miscellaneous

The NA bug, or, what happens when the same word is used in different contexts. [5 min read]

Chloe Pou-Prom (DSAA, Unity Health Toronto) https://chartdatascience.ca
2022-05-09

Silent deployment

Our team had been working actively on developing CHARTwatch, an early warning system for patients in general internal medicine at St. Michael’s Hospital. In November 2019 we were ready to move to a silent deployment phase, which means our entire pipeline was running (from data extraction to data processing to model prediction), but no outputs were going to the end-user.

Typically, the goal of the silent deployment phase is to uncover unexpected behaviors with the data, system, or model. During model development and evaluation, we had only worked with historical extracts of the data. When moving from historical data to live data, there’s the risk of running into data issues (Cohen et al. 2021).

Monitoring labs

We had set up a monitoring dashboard to measure model inputs and model outputs. On close inspection, we made a discovery that was unquestionably odd… no sodium labs had been measured since we had moved to silent testing!

Figure 1: Daily counts of lab measurements: this includes counts for calcium (CA), chloride (CL), glucose (GLPOC), potassium (K), and sodium (NA).

Did this make sense? NA! Sodium is measured in routinely ordered blood tests. It’ll usually get ordered alongside other tests (such as calcium, chloride, glucose, and potassium) as part of a basic metabolic panel. In Figure 1, we look at the daily counts of labs on units in which CHARTwatch was silently deployed. The other labs were regularly measured, but our pipeline had not detected a single sodium lab. There was NA way sodium would be missing!
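A check like this can also be automated rather than spotted by eye on a dashboard. The snippet below is a minimal sketch in base R, not our actual monitoring code: the data frame `labs`, its columns `collected_date` and `test_code`, and the list `expected_codes` are all hypothetical. It counts rows per expected lab code and flags any code that never appears.

```r
# Hypothetical extract of lab results; column names and values are illustrative only.
labs <- data.frame(
  collected_date = as.Date(c("2019-11-01", "2019-11-01", "2019-11-01", "2019-11-01",
                             "2019-11-02", "2019-11-02", "2019-11-02", "2019-11-02")),
  test_code = c("CA", "K", "CL", NA, "CA", "K", "GLPOC", NA),
  stringsAsFactors = FALSE
)

# Lab codes we expect to see on any normal day (the string "NA" is sodium).
expected_codes <- c("CA", "CL", "GLPOC", "K", "NA")

# Count rows per expected code; na.rm = TRUE skips rows whose code is missing.
total_counts <- vapply(
  expected_codes,
  function(code) sum(labs$test_code == code, na.rm = TRUE),
  integer(1)
)

# Expected labs that were never observed in the extract.
missing_codes <- names(total_counts)[total_counts == 0]

# Rows with no lab code at all: a useful clue when an expected lab "disappears".
n_missing_code <- sum(is.na(labs$test_code))

print(total_counts)
print(missing_codes)
print(n_missing_code)
```

Reporting the number of rows with a missing code next to the zero-count labs is a design choice for this sketch: a lab that vanishes while unlabelled rows pile up points to a parsing problem rather than a change in ordering.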

The NA bug

After hours of detective work, we found the issue:

Figure 2: Daily counts of lab measurements after fixing the NA bug

Depending on the context, the symbol “NA” meant something different! Our data extraction pipeline was interpreting the lab code for sodium (from the chemical symbol Na) as “not available!”

The fix was quite straightforward. We updated the parameters of one of our function calls to specify that "" (empty string) should be used to represent “not available,” instead of "NA". From the documentation of the RODBC package:

na.strings: character string(s) to be mapped to NA when reading character data, default “NA”

After deploying this fix, sodium counts were back to normal (as seen in Figure 2).
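The same behaviour can be reproduced without a database. The sketch below uses base R's `read.csv()`, which also has an `na.strings` argument defaulting to `"NA"`; it stands in for the RODBC call and is not our pipeline code. The column names `test_code` and `value` are hypothetical, and the commented-out `sqlQuery()` line assumes an open connection `channel` and a `query` string that are not defined here.

```r
# Hypothetical lab extract; "NA" in test_code is the lab code for sodium.
csv_text <- paste(
  "test_code,value",
  "CA,2.3",
  "K,4.1",
  "NA,140",
  "CL,",
  sep = "\n"
)

# Default na.strings = "NA": the sodium code is read as a missing value.
labs_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Only the empty string is treated as "not available": sodium is preserved.
labs_fixed <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Number of sodium rows found under each setting.
sum(labs_default$test_code == "NA", na.rm = TRUE)  # 0
sum(labs_fixed$test_code == "NA", na.rm = TRUE)    # 1

# With RODBC, the equivalent change might look like this (assumed call, not run here):
# labs <- RODBC::sqlQuery(channel, query, na.strings = "")
```

In the fixed version the empty `value` on the last row is still read as missing, so genuinely absent results keep being represented as `NA`.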

While the fix was a simple one-line change, the problem we uncovered led to plenty of follow-up questions!

Recently, there’s been a push for improvement in data quality standards, such as “Datasheets for Datasets” (Gebru et al. 2021) and the explosion of feature stores, model stores, and evaluation stores¹.

Takeaways

References

Cohen, Joseph Paul, Tianshi Cao, Joseph D. Viviano, Chin-Wei Huang, Michael Fralick, Marzyeh Ghassemi, Muhammad Mamdani, Russell Greiner, and Yoshua Bengio. 2021. “Problems in the Deployment of Machine-Learned Models in Health Care.” CMAJ 193 (35): E1391–94. https://doi.org/10.1503/cmaj.202066.
Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. “Datasheets for Datasets.” Communications of the ACM 64 (12): 86–92. http://arxiv.org/abs/1803.09010.

  1. What kind of “store” do we think is next? 🤔