05. Supervised Learning: Categorization

Overview of Supervised Learning: Categorizers

“Supervised” learning simply means that the system is trained with a “labeled” dataset. For example, in the Titanic dataset, each passenger’s row includes a variable recording whether that passenger lived or died. With a labeled dataset, the supervised learning approach splits (or samples) the data into a training set and a test set. The learning model is trained using the training set: it is automatically optimized by adjusting the weights of the various features (variables) to best predict the outcome (the label). The model is then evaluated on the test set to see how well its predictions (developed from the training set) match the labels. This is a relatively easy-to-understand model of AI/ML.
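The split described above can be sketched in a few lines of R. This is a minimal illustration using the built-in iris data (introduced later on this page); the 70/30 split ratio and the seed value are illustrative choices, not fixed rules.

```r
# Minimal sketch of the supervised-learning workflow: split a labeled
# dataset into a training set and a test set.
set.seed(42)                                   # make the random split repeatable
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # randomly pick 70% of the rows
train_set <- iris[train_idx, ]                 # used to fit (train) the model
test_set  <- iris[-train_idx, ]                # held out to test predictions
nrow(train_set)   # 105 rows for training
nrow(test_set)    # 45 rows for testing
```

Every row keeps its label (here, Species), so the model can be fit on train_set and then scored against the known labels in test_set.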

If the outcome measure is a continuous numeric variable (like cost, age, weight, percentage, time, or distance), then the system uses a regression model to produce a numeric output. If the outcome measure is a set of discrete categories like lived/died, buy/hold/sell, accept/reject, or egg size (extra large, large, medium), then the model outputs the category. A financial AI that reviews loan applications is a categorizer that outputs either “approve” or “deny” for the customer’s loan application. This page’s main topic is such a “categorizer” system.

First Demo of “Decision Tree” Categorizing of Flower Species

[Image: Iris versicolor]

In 1936, the famed British statistician Ronald Fisher conducted a study on three similar species of iris flowers collected by Edgar Anderson. He hoped that precise measurements of the length and width of the petals and sepals could be used to differentiate the species, allowing a prediction of species from just the four measurements. The resulting dataset, with 50 samples of each species, is known as the Fisher (or Anderson) iris dataset.

The “iris” dataset is built into R and is commonly used to introduce students to categorization. The three species of iris are setosa, virginica, and versicolor. The model is developed in the 20-minute YouTube video “Decision Tree Classification in R”: https://www.youtube.com/watch?v=JFJIQ0_2ijg . This 2017 video has received nearly 75,000 views, but it now needs one change from its directions: the first statement in your R code should be install.packages("rpart.plot"). The only other tip is to use RStudio and enter his code in the top-left window (the source-code window). Highlight the parts you want to run and click the “Run” icon at the top of the window. Your code will run in the “console” window below.
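The video walks through the model step by step; the condensed sketch below shows the same idea, assuming the rpart and rpart.plot packages are installed. It fits a decision tree that predicts Species from the four measurements, plots the tree, and tabulates right versus wrong predictions. (The video’s exact code differs in details.)

```r
# install.packages("rpart.plot")   # run once if the package is missing
library(rpart)        # decision-tree fitting
library(rpart.plot)   # nicer tree plots

# Fit a classification tree: predict Species from all four measurements
tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(tree)      # draws the decision tree

# Predict the species for every row and tabulate correct vs. incorrect
pred <- predict(tree, iris, type = "class")
table(actual = iris$Species, predicted = pred)
```

The table has one row per true species and one column per predicted species, so correct predictions land on the diagonal.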

Multi-Variate Classification with Error Table & Decision Tree Graph

One dataset that is frequently used to learn classification was collected on Italian wines. The data records chemical test results (features) for 178 samples spanning three varieties of Italian wine (labeled wine A, wine B, and wine C). The dataset has 14 columns: the wine label plus 13 features from different chemical tests. The tutorial below predicts the type of wine from five of those features, and it shows the results in prediction tables of correct and incorrect predictions for both the training set and the test set. A decision tree (or “ctree”) is graphed to show how the system uses the five features to “decide” on a predicted label.

Use the “wine.csv” download button below for the data file. Load the R script into the top-left RStudio window (just “open” the R file). This is a great example of the power of a few lines of R code to perform complex operations. Classification using RStudio: https://www.youtube.com/watch?v=ImbXdYrT59
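The tutorial’s train/test error-table pattern can be sketched as follows. Since the column names in wine.csv depend on your download, this sketch runs on the built-in iris data instead; it assumes the partykit package (a current home of ctree) is installed. Substitute your wine label and feature columns in the formula to reproduce the tutorial’s result.

```r
# install.packages("partykit")   # run once if the package is missing
library(partykit)   # ctree(): conditional-inference decision trees

set.seed(1)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))  # 80/20 split (illustrative)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit the tree on the training set only
model <- ctree(Species ~ ., data = train)
plot(model)   # graph of how the features split into a predicted label

# Error tables: rows = true labels, columns = predicted labels
table(actual = train$Species, predicted = predict(model, train))
table(actual = test$Species,  predicted = predict(model, test))
```

Comparing the two tables shows whether the model merely memorized the training set or generalizes to data it has never seen.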

Predicting “Wellbeing” with Real-World Health Data

Imagine access to community data about individuals’ high blood pressure, feelings of safety, ancestry, home language, sedentariness (is that a word?), education, access to doctors, and a hundred other variables. Can this data be used to predict the physical and emotional health of an individual? Australia has collected this information in a dataset that can be used to explore exactly this question. The individuals were asked to self-report their feeling of “wellbeing,” so that data column is used as the “label” for supervised learning. The exercise below explores ways to predict wellbeing using two powerful, common algorithms: Naïve Bayes and k-Nearest-Neighbors. You can not only compare the accuracy of the two algorithms but also, with guidance, tweak or tune the algorithms to reduce error.

This is a strong example of real-world AI use: the data scientist needs to both select appropriate algorithms AND tune them to the specific relationships in the data factors (variables). The video narrates ways to repeat operations in the R code that are not apparent from the R-script file alone, so FOLLOW the VIDEO DIRECTIONS! Note that you will need to set the path to the data file on YOUR COMPUTER. The example uses many R libraries; the install.packages() commands are provided as comments in case the packages are not already installed on your system. Here’s the tutorial link: R Stats: Naive Bayes and k-NN: https://www.youtube.com/watch?v=MbBvtnpcx2c
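The shape of the two algorithms can be sketched briefly. Since the wellbeing file’s path and column names depend on your download, this sketch again uses the built-in iris data as a stand-in; it assumes the e1071 package (for naiveBayes) and the class package (for knn), which are the packages the tutorial’s approach commonly uses.

```r
# install.packages(c("e1071", "class"))   # run once if missing
library(e1071)   # naiveBayes()
library(class)   # knn()

set.seed(7)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Naive Bayes: fit a model on the training set, then predict test labels
nb_model <- naiveBayes(Species ~ ., data = train)
nb_pred  <- predict(nb_model, test)

# k-NN: no separate fitting step; k = 5 is a tunable choice (try others)
knn_pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# One prediction table per algorithm, for side-by-side comparison
table(actual = test$Species, predicted = nb_pred)
table(actual = test$Species, predicted = knn_pred)
```

Tuning k in knn() (and, on the wellbeing data, choosing which feature columns to feed each algorithm) is exactly the kind of adjustment the video walks through.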

Prediction Accuracy: How accurate are the two models above? How would you compare their accuracy? What might account for the differences?
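One common way to compare the models is a single accuracy number computed from each prediction table: correct predictions (the diagonal) divided by all predictions. The tiny actual/predicted vectors below are made-up placeholders; substitute the vectors from your own model run.

```r
# Placeholder label vectors standing in for a real model's output
actual    <- factor(c("A", "A", "B", "B", "C", "C"))
predicted <- factor(c("A", "B", "B", "B", "C", "C"))

tab <- table(actual, predicted)       # the prediction (confusion) table
accuracy <- sum(diag(tab)) / sum(tab) # diagonal = correct predictions
accuracy                              # 5 of 6 correct = 0.8333...
```

Computing this for both the Naive Bayes and k-NN tables gives one directly comparable score per algorithm.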

Insight into AI and ML in Education

The first article suggests that investment in school technology is increasing and that some of that investment goes toward the design and development of AI/ML systems. The second article describes the ways that schools are currently using AI and ML systems. Read both and compare: which one is aware of the emerging power of AI/ML in education, and which one seems unaware? Where would you place the educators you know: in the aware or unaware column?