Smoking Status Variable - modifying existing rules
In this tutorial, we will use the smoking status example data and the existing rulesets to classify charts with one of the following labels: Current smoker, Former smoker, Never smoked, or Not dictated.
- Download the smoking status example data.
  - text_data.csv
  - labels_train.csv
  - labels_valid.csv
- Download the existing smoking status rules.
  - After downloading the rules, extract the zipped file and save the rules to a folder. Remember the path to this folder! You will need it later in Step 5.
- Upload the text data file.
  - From the Settings view, press on the Project Settings tab.
  - Press on the Data File button to add the smoking status text data file (text_data.csv).
  - Set the Data ID column to be 0.
  - Set the Data First Row to be 1.
  - Set the Data Column to be 1.
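Before uploading, we can sanity-check the file against these column settings with a short script. This is only a sketch: the sample rows are made up, and it assumes the settings are zero-based (ID in column 0, text in column 1, data starting at row 1 after a header row). The helper `read_charts` is hypothetical and is not part of CHARTextract.

```python
import csv
import io

# Hypothetical stand-in for text_data.csv; the real file's contents will differ.
SAMPLE = (
    "id,text\n"
    "1001,Patient denies smoking.\n"
    "1003,Former smoker of 1 pack per day.\n"
)

def read_charts(handle, id_col=0, text_col=1, first_row=1):
    """Return {chart_id: text}, skipping every row before first_row."""
    rows = list(csv.reader(handle))
    return {row[id_col]: row[text_col] for row in rows[first_row:] if row}

charts = read_charts(io.StringIO(SAMPLE))
print(len(charts), "charts loaded")
```

To check the real file, pass `open("text_data.csv", newline="")` instead of the in-memory sample.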
- Upload the labels files.
  - Still in the Project Settings tab, set the Create Train and Validation Set flag to be False. We don’t need to create train and validation sets, since we will be uploading our own train and validation labels.
  - Set Prediction Mode to be False.
  - Press on the folder icon next to Train Label File to upload the smoking status train data labels (labels_train.csv).
  - Press on the folder icon next to Valid Label File to upload the smoking status valid data labels (labels_valid.csv).
  - Set the Label ID Column to be 0.
  - Set the Label First Row to be 1.
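A quick check of the label files can catch typos in label names before we run the tool. The sketch below uses made-up rows and assumes the label itself sits in column 1 (the tutorial only specifies the ID column and the first row); `read_labels` and `unexpected_labels` are hypothetical helpers.

```python
import csv
import io

# The four labels used in this tutorial.
ALLOWED = {"Current smoker", "Former smoker", "Never smoked", "Not dictated"}

# Hypothetical stand-in for labels_train.csv; label column 1 is an assumption.
SAMPLE_LABELS = (
    "id,label\n"
    "1001,Never smoked\n"
    "1003,Former smoker\n"
    "1004,Ex smoker\n"
)

def read_labels(handle, id_col=0, label_col=1, first_row=1):
    """Return {chart_id: label}, skipping every row before first_row."""
    rows = list(csv.reader(handle))
    return {row[id_col]: row[label_col] for row in rows[first_row:] if row}

def unexpected_labels(labels):
    """Return the chart IDs whose label is not one of the four tutorial labels."""
    return sorted(cid for cid, label in labels.items() if label not in ALLOWED)

labels = read_labels(io.StringIO(SAMPLE_LABELS))
print(unexpected_labels(labels))
```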
- Set the Rules folder.
  - Still in the Project Settings tab, press on the folder icon next to Rules Folder. Set the rules folder to be the folder from Step 2 (i.e., the folder where you saved the smoking status rules).
  - Press on the Save Project Settings button.
- Load the variable rules.
  - Go to the Variable Settings tab.
  - In the dropdown menu for Current variable, you should see the smoking_status variable. Select it.
  - Press on the Save Variable Settings button.
- Set the classifier settings.
  - Press on the Classifier Settings tab.
  - Set the Classifier Type to RegexClassifier.
  - Set the Negative Label to Not dictated.
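To build some intuition for these settings, here is a minimal sketch of a score-based regex classifier. It is not CHARTextract's implementation: the patterns and scores are invented, and it assumes the Negative Label is what a chart receives when no rule matches.

```python
import re

NEGATIVE_LABEL = "Not dictated"  # assumed fallback when no rule fires

# Hypothetical (pattern, score) rules per label; not the released ruleset.
RULES = {
    "Current smoker": [(r"\bsmokes\b", 2)],
    "Former smoker": [(r"\bquit smoking\b", 2)],
    "Never smoked": [(r"\bnever smoked\b", 2)],
}

def classify(text):
    """Return the highest-scoring label, or NEGATIVE_LABEL if nothing matches."""
    scores = {}
    for label, rules in RULES.items():
        total = sum(
            score
            for pattern, score in rules
            if re.search(pattern, text, re.IGNORECASE)
        )
        if total > 0:
            scores[label] = total
    if not scores:
        return NEGATIVE_LABEL
    return max(scores, key=scores.get)

print(classify("Patient smokes daily."))
print(classify("No relevant history recorded."))
```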
- View the existing rules.
  - Go to the Rules view.
  - You should see the Current smoker, Former smoker and Never smoked labels.
- Run the tool.
  - Press on the Run button.
  - The top pane will display misclassified charts (i.e., charts for which the label predicted by CHARTextract is not the same as the ground truth label).
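The notion of a misclassified chart can be expressed in a few lines. This sketch uses made-up predictions and ground truth labels keyed by chart ID; `misclassified` is a hypothetical helper, not a CHARTextract function.

```python
def misclassified(predicted, truth):
    """Return (chart_id, predicted_label, true_label) for every disagreement."""
    return [
        (cid, predicted[cid], truth[cid])
        for cid in sorted(truth)
        if cid in predicted and predicted[cid] != truth[cid]
    ]

# Hypothetical labels keyed by chart ID.
truth = {"1001": "Never smoked", "1003": "Former smoker"}
predicted = {"1001": "Never smoked", "1003": "Current smoker"}

print(misclassified(predicted, truth))
```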
- Refine the rules.
  - Keep in mind that the rules we are releasing were developed on data that may be different from yours. Thus, we will want to refine the rules.
  - Press on the misclassified chart with ID 1003.
  - CHARTextract classified this chart as Current smoker, but its ground truth label is Former smoker.
  - If you press on the yellow highlighted text, you will see that CHARTextract detected the words “former smoker of 1 pack”, and this gave a score of 4 to the Current smoker label and a score of 3 to the Former smoker label.
  - The rules in their current state are not able to correctly classify this chart. Let’s fix this.
  - Go back to the Rules view. Press on the Current smoker label.
  - Add a secondary Ignore rule to the second Primary rule (Rule #2). In the text input field, add the following: former. Press on the dropdown menu next to “appears, then score”, and select “Appears Before 2”. The primary rule will now be ignored if the word former appears before it.
  - Re-run the tool by pressing on the Run button.
  - You’ll notice that the chart with ID 1003 no longer shows up in the misclassified instances.
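The effect of the Ignore rule can be sketched outside the tool. The code below is an illustration under stated assumptions, not CHARTextract's logic: the pattern is a made-up stand-in for Primary Rule #2, the helper `rule_score` is hypothetical, and the meaning of the 2 in “Appears Before 2” is not modeled (the sketch simply looks for the word anywhere before the match).

```python
import re

def rule_score(text, pattern, score, ignore_if_before=None):
    """Score a primary rule, skipping matches preceded by the ignore word."""
    total = 0
    for match in re.finditer(pattern, text, re.IGNORECASE):
        preceding = text[:match.start()]
        if ignore_if_before and re.search(
            rf"\b{re.escape(ignore_if_before)}\b", preceding, re.IGNORECASE
        ):
            continue  # the ignore word appears before this match
        total += score
    return total

# Hypothetical stand-in for Primary Rule #2 of the Current smoker label.
PATTERN = r"smoker of \d+ pack"
chart = "Former smoker of 1 pack per day."

print(rule_score(chart, PATTERN, 4))                             # rule fires
print(rule_score(chart, PATTERN, 4, ignore_if_before="former"))  # rule ignored
```

With the Current smoker rule suppressed for this chart, the Former smoker score of 3 reported by the tool is no longer outvoted.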
Further exercises:
- Write your own rules.
- Try this in the advanced view.