The NA bug, or, what happens when the same word is used in different contexts.
Our team had been working actively on developing CHARTwatch, an early warning system for patients in general internal medicine at St. Michael’s Hospital. In November 2019 we were ready to move to a silent deployment phase, which means our entire pipeline was running (from data extraction to data processing to model prediction), but no outputs were going to the end-user.
Typically, the goal of the silent deployment phase is to uncover unexpected behaviors with the data, system, or model. During model development and evaluation, we had only worked with historical extracts of the data. When moving from historical data to live data, there’s the risk of running into data issues (Cohen et al. 2021).
The data can be different due to external factors. For example, all of our models were trained on data prior to COVID-19, but shortly after the beginning of our silent deployment phase, we began to observe cases of COVID-19 in the hospital.
The data can be different due to data entry errors. For example, a body temperature could incorrectly be entered as 3700 °C instead of 37.00 °C.
The data can be different due to selection bias. For example, during training we excluded patients with really short and really long visits, as they were rare. However, we may encounter these kinds of visits in the live data.
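A data entry error like the temperature example can be caught with a simple plausibility check. The following is a minimal sketch in R, not the check used in CHARTwatch: the data frame, its column names, and the bounds of 25 °C and 45 °C are all assumptions made for illustration.

```r
# Hypothetical live extract; column names and values are made up for illustration.
vitals <- data.frame(
  visit_id = c(1, 2, 3),
  temperature_c = c(36.8, 3700, 38.2)
)

# Assumed plausibility bounds for body temperature, in degrees Celsius.
temp_min <- 25
temp_max <- 45

# Flag rows whose temperature falls outside the plausible range.
vitals$implausible <- vitals$temperature_c < temp_min |
  vitals$temperature_c > temp_max

# Keep only the flagged rows for review.
flagged <- vitals[vitals$implausible, ]
print(flagged)
```

Here only the visit with a recorded temperature of 3700 is flagged. In practice the bounds would need to be agreed on with clinicians.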
We had set up a monitoring dashboard to measure model inputs and model outputs. On close inspection, we made a discovery that was unquestionably odd… no sodium labs had been measured since we had moved to silent testing!
Did this make sense? NA! Sodium is measured in routinely ordered blood tests. It’ll usually get ordered alongside other tests (such as calcium, chloride, glucose, and potassium) as part of a basic metabolic panel. In Figure 1, we look at the daily counts of labs on units in which CHARTwatch was silently deployed. The other labs were regularly measured, but our pipeline had not detected a single sodium lab. There was NA way sodium would be missing!
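The kind of check that surfaces a problem like this can be sketched in a few lines of R. This is an illustration under stated assumptions rather than our dashboard code: the data frame, the lab names, and the list of expected labs are hypothetical.

```r
# Hypothetical one-day lab extract; lab names and values are made up.
labs <- data.frame(
  lab_name = c("potassium", "chloride", "glucose", "potassium", "calcium"),
  value = c(4.1, 102, 5.6, 3.9, 2.3),
  stringsAsFactors = FALSE
)

# Labs we would expect to see every day (an assumption for this sketch).
expected_labs <- c("sodium", "potassium", "chloride", "glucose", "calcium")

# Count observations per expected lab; labs that never appear get a count of 0.
counts <- vapply(
  expected_labs,
  function(lab) sum(labs$lab_name == lab, na.rm = TRUE),
  integer(1)
)

# Expected labs with no observations at all deserve a closer look.
missing_labs <- names(counts)[counts == 0]
print(counts)
print(missing_labs)
```

A routinely ordered lab with a count of zero is a strong hint that something upstream of the model is wrong.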
After hours of detective work, we found the issue:
In R, the programming language we used to develop CHARTwatch, the symbol NA stands for “not available” and is used to represent missing data.
In chemistry, Na is the symbol used to represent the chemical element of sodium.
Depending on the context, the symbol meant something different! Our data extraction pipeline was interpreting the chemical element Na as “not available!”
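The same behavior is easy to reproduce in base R. The sketch below uses read.csv rather than our actual database extraction, and it assumes the sodium lab is stored under the uppercase code "NA". The other lab codes and the values are made up.

```r
# Hypothetical extract in which sodium is coded as the string "NA".
csv_text <- "lab_code,value
NA,140
K,4.1
CL,102"

# With the default na.strings = "NA", the sodium code is read as a missing value.
labs <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(labs$lab_code)

# Number of lab codes that were silently turned into missing values.
n_lost <- sum(is.na(labs$lab_code))
print(n_lost)
```

The sodium row is still there, but its code is now missing, so anything that filters or counts by lab code will never see it.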
The fix was quite straightforward. We updated the parameters of one of our function calls to specify that "" (empty string) should be used to represent “not available,” instead of "NA". From the documentation of the RODBC package:
na.strings: character string(s) to be mapped to NA when reading character data, default “NA”
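To illustrate the effect of the parameter without a database connection, here is a minimal sketch using base R's read.csv, which has an na.strings argument with the same default. The data is hypothetical, and the commented RODBC call at the end is an assumption about what such a call could look like, not our production code.

```r
# Hypothetical extract in which sodium is coded as the string "NA".
csv_text <- "lab_code,value
NA,140
K,4.1
CL,102"

# Treat only the empty string as missing, so "NA" survives as a lab code.
labs <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)
print(labs$lab_code)

# Assumed RODBC equivalent (channel and query are hypothetical placeholders):
# labs <- RODBC::sqlQuery(channel, "SELECT lab_code, value FROM labs",
#                         na.strings = "")
```

With this setting the sodium code is kept as the ordinary string "NA", and only empty fields are treated as missing.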
After deploying this fix, sodium counts were back to normal (as seen in Figure 2).
While the fix was a simple one-line change, the problem we uncovered led to plenty of follow-up questions!
Were there other cases where the same symbol meant two different things based on the context?
What does our electronic health record use to represent a missing value? Do they go with a number that’s biologically impossible (e.g., a body temperature of -1000)? Do they use a specific symbol or term (e.g., “not measured,” “missing”)?
How are these decisions made?
Recently, there’s been a push for improvement in data quality standards, such as “Datasheets for Datasets” (Gebru et al. 2021) and the explosion of feature stores, model stores, and evaluation stores.[1]
NA (sodium) ≠ NA (not available)
Silent deployment is important.
Thorough metadata and data quality standards are important for mitigating these kinds of issues.
[1] What kind of “store” do we think is next? 🤔