Smoking Status Variable - writing rules from scratch
In this tutorial, we will be using the smoking status example data to write rules that will allow us to classify charts according to the following labels: Current smoker, Former smoker, Never smoked, or Not dictated.
- Download the smoking status example data:
  - text_data.csv
  - labels_train.csv
  - labels_valid.csv
- Upload the text data file.
  - From the Settings view, press on the Project Settings tab.
  - Press on the Data File button to add the smoking status text data file (text_data.csv).
  - Set the Data ID column to be 0.
  - Set the Data First Row to be 1.
  - Set the Data Column to be 1.
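One way to read these three settings is sketched below in Python. It assumes the values are zero-based indexes and that row 0 of text_data.csv is a header row; this is an interpretation of the settings, not CHARTextract's own loading code. The `load_text_data` helper and the sample rows are hypothetical.

```python
import csv
import io


def load_text_data(handle, id_column=0, first_row=1, data_column=1):
    """Map chart ID -> chart text, skipping rows before `first_row`.

    Assumes zero-based column and row indexes; this mirrors the settings
    in the tutorial but is not CHARTextract's actual loading code.
    """
    charts = {}
    for row_index, row in enumerate(csv.reader(handle)):
        if row_index < first_row or not row:
            continue  # skip the header row(s) and blank lines
        charts[row[id_column]] = row[data_column]
    return charts


# Made-up rows for illustration only; the real file is text_data.csv.
sample = io.StringIO(
    "id,text\n"
    "1,Patient smokes every day.\n"
    "2,No history of smoking.\n"
)
charts = load_text_data(sample)
print(charts)
```

To try the same helper on the downloaded file, open it with `open("text_data.csv", newline="")` and pass the file handle in place of `sample`.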
- Upload the labels files.
  - Still in the Project Settings tab, set the Create Train and Validation Set flag to be False. We don’t need to create train and validation sets, since we will be uploading our own train and validation labels.
  - Set Prediction Mode to be False.
  - Press on the folder icon next to Train Label File to upload the smoking status training labels (labels_train.csv).
  - Press on the folder icon next to Valid Label File to upload the smoking status validation labels (labels_valid.csv).
  - Set the Label ID Column to be 0.
  - Set the Label First Row to be 1.
- Set the rules folder.
  - Still in the Project Settings tab, press on the folder icon next to Rules Folder. Choose a folder in which you can save files. All of your variable rules files will be saved in the rules folder.
  - Press on the Save Project Settings button.
- Create a new variable.
  - Press on the Variable Settings tab.
  - Press on the green ‘+’ symbol. This will toggle a text input field next to Current Variable. Type in smoking_status. This will be the name of your variable.
  - Set the Label column to be 1.
  - Press on the Save Variable Settings button.
- Set the classifier settings.
  - Press on the Classifier Settings tab.
  - Set the Classifier Type to RegexClassifier.
  - Set the Negative Label to Not dictated.
- Set up your rules.
  - Go to the Rules view.
  - Press on the green ‘+’ symbol to add a new label.
  - Double click on the new button and type in Current smoker. You have now added a label for Current smoker.
  - Add labels for Former smoker and Never smoked.
- Add rules for Current smoker.
  - Select the Current smoker label.
  - Press on the “Add Primary Rule” button.
  - In the text box, write “smok”.
  - Press on the green button next to the text box, and select “Replace Rule” from the dropdown menu.
  - A new field will appear. This is where you enter your secondary rule. In the secondary rule text box, type in “current”, then press on the Enter key. Next, type in “or” and press on the Enter key. Type “every day” and press on the Enter key.
  - Set the secondary rule score to be 1.
  - You have just created your first set of rules for the Current smoker label! CHARTextract will search for sentences that contain the text “smok” (primary rule). If the sentence also contains either “current” or “every day” (secondary rule), then the sentence will be replaced with a score of 1 for the Current smoker label.
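The primary and secondary rule for Current smoker can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions: matching is case-insensitive, Python's `re` module stands in for CHARTextract's matching, and a sentence that matches only the primary rule scores 0. The `score_sentence` function and the sample sentences are hypothetical.

```python
import re

PRIMARY = re.compile(r"smok", re.IGNORECASE)
# "current" OR "every day", as entered in the secondary rule box.
SECONDARY = re.compile(r"current|every day", re.IGNORECASE)
SECONDARY_SCORE = 1


def score_sentence(sentence):
    """Return the Current smoker score contributed by one sentence."""
    if not PRIMARY.search(sentence):
        return 0  # the primary rule did not match, so the sentence is ignored
    if SECONDARY.search(sentence):
        return SECONDARY_SCORE  # Replace rule: the sentence's score becomes 1
    return 0  # assumption: a primary-only match scores 0 in this sketch


# Made-up sentences for illustration only.
for sentence in [
    "Patient smokes every day.",
    "She is a current smoker.",
    "Smoking was discussed.",
    "Currently taking aspirin.",
]:
    print(score_sentence(sentence), sentence)
```

The last sentence contains “current” but not “smok”, so the secondary rule is never consulted for it.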
- Add rules for Former smoker.
  - Select the Former smoker label.
  - Add a primary rule with the word “smok”.
  - Add a Replace secondary rule. Set the text to be “history of” OR “used to”, and assign a score of 1.
- Add rules for Never smoked.
  - Select the Never smoked label.
  - Add a primary rule with the word “smok”.
  - Add a Replace secondary rule. Set the text to be “never” OR “no history of”, and assign a score of 1.
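Taken together, the three rule sets and the Negative Label setting suggest a decision procedure like the one sketched below. The sketch assumes that a chart's score for a label is the sum of its sentence scores, that the highest-scoring label wins, and that Not dictated is returned when no label scores above zero. The `classify` helper, the sentence splitting, and the tie handling are hypothetical and are not CHARTextract's documented behaviour.

```python
import re

NEGATIVE_LABEL = "Not dictated"
PRIMARY = re.compile(r"smok", re.IGNORECASE)
# Secondary Replace rules from the tutorial, each with a score of 1.
SECONDARY_RULES = {
    "Current smoker": re.compile(r"current|every day", re.IGNORECASE),
    "Former smoker": re.compile(r"history of|used to", re.IGNORECASE),
    "Never smoked": re.compile(r"never|no history of", re.IGNORECASE),
}


def classify(chart_text):
    """Return (label, per-label scores) for one chart."""
    scores = {label: 0 for label in SECONDARY_RULES}
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", chart_text):
        if not PRIMARY.search(sentence):
            continue  # only sentences matching the primary rule are scored
        for label, secondary in SECONDARY_RULES.items():
            if secondary.search(sentence):
                scores[label] += 1
    best_label = max(scores, key=scores.get)  # ties go to the first label
    if scores[best_label] == 0:
        return NEGATIVE_LABEL, scores
    return best_label, scores


# Made-up chart text for illustration only.
print(classify("Patient smokes every day."))
print(classify("Blood pressure is normal."))
print(classify("No history of smoking."))
```

In this sketch the sentence “No history of smoking.” matches both “history of” and “no history of”, so Former smoker and Never smoked tie. Overlaps of this kind are a common reason for refining rules after the first run.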
- Run the tool.
  - Press on the Run button.
  - The top pane will display misclassified charts (i.e., charts for which the label predicted by CHARTextract is not the same as the ground truth label).
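Conceptually, the misclassified list is the set of charts whose predicted label differs from the label in the uploaded labels file. A minimal sketch of that comparison follows, using made-up IDs and labels rather than the tutorial data; the `misclassified` helper is hypothetical.

```python
def misclassified(predictions, ground_truth):
    """Return {chart_id: (predicted, actual)} for charts whose labels differ."""
    return {
        chart_id: (predictions.get(chart_id), actual)
        for chart_id, actual in ground_truth.items()
        if predictions.get(chart_id) != actual
    }


# Made-up IDs and labels for illustration only.
ground_truth = {"a": "Former smoker", "b": "Never smoked", "c": "Not dictated"}
predictions = {"a": "Never smoked", "b": "Never smoked", "c": "Not dictated"}

errors = misclassified(predictions, ground_truth)
print(errors)

accuracy = 1 - len(errors) / len(ground_truth)
print(f"accuracy: {accuracy:.2f}")
```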
- Refine the rules.
  - Press on the misclassified chart with ID 1003.
  - CHARTextract classified this chart as Never smoked, but its ground truth label is Former smoker.
  - If you press on the yellow highlighted text, you will see that CHARTextract detected the word “smok” in the sentence.
  - The rules in their current state are not able to correctly classify this chart. Let’s fix this.
  - Go back to the Rules view. Press on the Former smoker label.
  - Edit the secondary rule and add in the following: OR “former”.
  - Re-run the tool by pressing on the Run button.
  - You’ll notice that the chart with ID 1003 no longer shows up in the misclassified instances.
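The effect of adding “former” can be checked in isolation. The sketch below compares the Former smoker secondary rule before and after the edit on a made-up sentence; the actual text of chart 1003 is not reproduced here, and Python's `re` module stands in for CHARTextract's matching, which is an assumption.

```python
import re

before = re.compile(r"history of|used to", re.IGNORECASE)
after = re.compile(r"history of|used to|former", re.IGNORECASE)

# Hypothetical sentence; not taken from chart 1003.
sentence = "Patient is a former smoker."

print("before edit:", bool(before.search(sentence)))
print("after edit:", bool(after.search(sentence)))
```

Testing a candidate pattern against a few known sentences like this can help catch rules that are too narrow or too broad before re-running the whole project.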
Further exercises:
- Instead of adding the rules specified in the tutorial, create your own rules.
- Try using the advanced view instead.
- Instead of writing rules from scratch, start from pre-existing rules and modify them.