Smoking Status Variable - writing rules from scratch

In this tutorial, we will be using the smoking status example data to write rules that will allow us to classify charts according to the following labels: Current smoker, Former smoker, Never smoked, or Not dictated.


  1. Download the smoking status example data.
    • text_data.csv
    • labels_train.csv
    • labels_valid.csv
  2. Upload the text data file.
    • From the Settings view, press on the Project Settings tab.
    • Press on the Data File button to add the smoking status text data file (text_data.csv).
    • Set the Data ID column to be 0.
    • Set the Data First Row to be 1.
    • Set the Data Column to be 1.
  3. Upload the labels files.
    • Still in the Project Settings tab, set the Create Train and Validation Set flag to be False. We don’t need to create train and validation sets, since we will be uploading our own train and validation labels.
    • Set Prediction Mode to be False.
    • Press on the folder icon next to Train Label File to upload the smoking status train data labels (labels_train.csv).
    • Press on the folder icon next to Valid Label File to upload the smoking status valid data labels (labels_valid.csv).
    • Set the Label ID Column to be 0.
    • Set the Label First Row to be 1.
  4. Set the rules folder.
    • Still in the Project Settings tab, press on the folder icon next to Rules Folder. Set the rules folder to be a folder in which you can save files. All of your variable rules files will be saved in the rules folder.
    • Press on the Save Project Settings button.
  5. Create a new variable.
    • Press on the Variable Settings tab.
    • Press on the green ‘+’ symbol. This will toggle a text input field next to Current Variable. Type in smoking_status. This will be the name of your variable.
    • Set the Label column to be 1.
    • Press on the Save Variable Settings button.
  6. Set the classifier settings.
    • Press on the Classifier Settings tab.
    • Set the Classifier Type to RegexClassifier.
    • Set the Negative Label to Not dictated.
  7. Set up your rules.
    • Go to the Rules view.
    • Press on the green ‘+’ symbol to add a new label.
    • Double click on the new button and type in Current smoker. You have now added a label for Current smoker.
    • Add labels for Former smoker and Never smoked.
  8. Add rules for Current smoker.
    • Select the Current smoker label.
    • Press on the “Add Primary Rule” button.
    • In the text box, write “smok”.
    • Press on the green button next to the text box, and select “Replace Rule” from the dropdown menu.
    • A new field will appear. This is where you enter your secondary rule. In the secondary rule text box, type in “current”, then press on the Enter key. Next, type in “or” and press on the Enter key. Finally, type in “every day” and press on the Enter key.
    • Set the secondary rule score to be 1.
    • You have just created your first set of rules for the Current smoker label! CHARTextract will search for sentences that contain the text “smok” (primary rule), which matches words such as “smoker” and “smoking”. If the sentence also contains either “current” or “every day” (secondary rule), then the sentence is assigned a score of 1 for the Current smoker label.
  9. Add rules for Former smoker.
    • Select the Former smoker label.
    • Add a primary rule with the word “smok”.
    • Add a Replace secondary rule. Set the text to be history of OR used to, and assign a score of 1.
  10. Add rules for Never smoked.
    • Select the Never smoked label.
    • Add a primary rule with the word “smok”.
    • Add a Replace secondary rule. Set the text to be never OR no history of, and assign a score of 1.
  11. Run the tool.
    • Press on the Run button.
    • The top pane will display misclassified charts (i.e., the label predicted by CHARTextract is not the same as the ground truth label).
  12. Refine the rules.
    • Press on the misclassified chart with ID 1003.
    • CHARTextract classified this chart as Never smoked, but its ground truth label is Former smoker.
    • If you press on the yellow highlighted text, you will see that CHARTextract detected the word “smok” in the sentence.
    • The rules in their current state are not able to correctly classify this chart. Let’s fix this.
    • Go back to the Rules view. Press on the Former smoker label.
    • Edit the secondary rule and add in the following: OR former.
    • Re-run the tool by pressing on the Run button.
    • You’ll notice that the chart with ID 1003 no longer shows up in the misclassified instances.
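
The rule logic from steps 8 to 12 can be sketched in Python as follows. This is a rough approximation for illustration, not CHARTextract's actual implementation: the sentence splitter, the scoring (one point per sentence in which both the primary and the secondary rule match), the tie-breaking, and the names `classify`, `split_sentences`, and `SECONDARY` are all assumptions. The example sentence is made up and is not the text of chart 1003.

```python
import re

# Rules from steps 8-10: every label shares the primary pattern "smok";
# each secondary pattern is the OR of the phrases entered in the tutorial.
PRIMARY = re.compile(r"smok", re.IGNORECASE)
SECONDARY = {
    "Current smoker": r"current|every day",
    "Former smoker": r"history of|used to",
    "Never smoked": r"never|no history of",
}
NEGATIVE_LABEL = "Not dictated"  # returned when no label scores above 0


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def classify(text, secondary=SECONDARY):
    # Score each label: +1 for every sentence that matches the primary rule
    # and that label's secondary rule (assumed scoring scheme).
    scores = {label: 0 for label in secondary}
    for sentence in split_sentences(text):
        if not PRIMARY.search(sentence):
            continue
        for label, pattern in secondary.items():
            if re.search(pattern, sentence, re.IGNORECASE):
                scores[label] += 1
    # Ties go to the label listed first in the dictionary (assumption).
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else NEGATIVE_LABEL


# Step 12: extend the Former smoker secondary rule with "OR former".
refined = {**SECONDARY, "Former smoker": r"history of|used to|former"}

chart = "Patient is a former smoker."  # hypothetical chart text
print(classify(chart))           # Not dictated
print(classify(chart, refined))  # Former smoker
```

With the original rules, none of the secondary patterns match the example sentence, so it falls back to the negative label. After the refinement, the Former smoker rule matches.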

Further exercises:

  • Instead of adding the rules specified in the tutorial, create your own rules.
  • Try using the advanced view instead.
  • Instead of writing rules from scratch, start from pre-existing rules and modify them.
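
When experimenting with your own rules, it can help to check them outside the tool. The sketch below mirrors the column settings from steps 2 and 3 and the misclassification list from step 11. It assumes that column and row indices are zero-based and that row 0 is a header. The CSV contents, the stand-in predictions, and the helper names `read_column` and `misclassified` are hypothetical, and the real example files may be laid out differently.

```python
import csv
import io

# Hypothetical file contents (assumption): stand-ins for text_data.csv
# and a labels file, each with a header in row 0.
TEXT_CSV = "id,text\n1001,Smokes every day.\n1002,Patient has never smoked.\n"
LABELS_CSV = "id,smoking_status\n1001,Current smoker\n1002,Never smoked\n"


def read_column(handle, id_col, value_col, first_row):
    # Return {id: value}, skipping the rows before first_row.
    rows = list(csv.reader(handle))
    return {row[id_col]: row[value_col] for row in rows[first_row:]}


def misclassified(predictions, truth):
    # IDs whose predicted label differs from the ground-truth label.
    return sorted(i for i in truth if predictions.get(i) != truth[i])


# Data ID column 0, Data Column 1, Data First Row 1 (step 2).
texts = read_column(io.StringIO(TEXT_CSV), id_col=0, value_col=1, first_row=1)
# Label ID Column 0, Label column 1, Label First Row 1 (steps 3 and 5).
labels = read_column(io.StringIO(LABELS_CSV), id_col=0, value_col=1, first_row=1)

# Stand-in predictions (hypothetical); in practice these come from your rules.
predictions = {"1001": "Current smoker", "1002": "Not dictated"}
print(misclassified(predictions, labels))  # ['1002']
```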