Supervised learning is a way of learning the characteristics of labeled categories or values from relatively large amounts of data. The key is that the “right” category or value for each example is known during training. Once trained, the system can categorize (or assign values to) unlabeled data. The TITANIC data set is an example of a labeled set because, for every person, we know the outcome (whether they lived or died).
The data is split into two sets: the TRAINING set and the TEST set. The training set is used to train the system to correlate specific data features with the known outcome (the label, which is known in advance). When training seems complete, the system is tested on the test set, which it has not previously seen. For each test example, the system predicts the category or value without being shown the outcome, and its guess is compared to the known category or value. Using the Titanic example, the system would learn from a training set by correlating the passengers’ data features with the label (lived or died). Its predictive ability would then be tested on the test set (which the system had not seen, and whose outcomes are hidden from it) to see how well it predicts who survived when the Titanic sank.
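The train/test idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up toy values (not the real Titanic data), and the helper names `split`, `train`, and `accuracy` are hypothetical. The "model" simply learns the most common outcome for each value of one feature.

```python
import random

# Made-up toy records (NOT real Titanic data): (sex, passenger_class, survived)
records = [
    ("female", 1, 1), ("female", 2, 1), ("female", 3, 0), ("female", 1, 1),
    ("male", 1, 0), ("male", 3, 0), ("male", 2, 0), ("male", 3, 0),
    ("female", 3, 1), ("male", 1, 1), ("female", 2, 1), ("male", 2, 0),
]

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (training, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train(training_rows):
    """Learn the most common outcome for each value of the 'sex' feature."""
    counts = {}
    for sex, _cls, survived in training_rows:
        counts.setdefault(sex, [0, 0])[survived] += 1
    return {sex: (1 if c[1] > c[0] else 0) for sex, c in counts.items()}

def accuracy(model, test_rows, default=0):
    """Fraction of test rows whose predicted label matches the known label."""
    correct = sum(
        model.get(sex, default) == survived for sex, _cls, survived in test_rows
    )
    return correct / len(test_rows)

training_set, test_set = split(records)   # the model never sees test_set while training
model = train(training_set)
score = accuracy(model, test_set)         # compare guesses to the hidden outcomes
print(model, score)
```

The important design point is that `accuracy` is computed only on rows that were held out of `train`, so the score reflects how well the learned rule generalizes and not how well it memorized.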
Machine Learning Algorithms (big picture of ways ML systems think)
Pedro Domingos has been a leader in machine learning theory and development. Domingos is not a dynamic presenter, but he has a special ability to make complicated ideas accessible to an interested listener or reader. His recent book, The Master Algorithm, clearly explains the five major ways that ML systems can “think.” To help you keep track of them as you watch his 1-hour Google Talk, the five ways are listed below. Here’s the link to his talk: https://www.youtube.com/watch?v=B8J4uefCQMc
- Fill in gaps in existing knowledge (formal logic, “If-Then” chains)
- Emulate the brain (neural networks, connectionism, deep learning)
- Simulate evolution (genetic algorithms)
- Systematically reduce uncertainty (Bayesian models, Markov chains)
- Notice similarities between old and new (reason by analogy, nearest neighbor, recommenders)
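As a small illustration of the last item (reasoning by analogy), here is a minimal nearest-neighbor sketch in Python. The labeled examples are invented numbers, and the function name `nearest_neighbor` is hypothetical. A new case receives the label of the most similar stored case, where "similar" is assumed here to mean the smallest straight-line (Euclidean) distance.

```python
import math

# Made-up labeled examples: (age, fare) -> label. Purely illustrative values.
known = [
    ((22.0, 7.0), "died"),
    ((38.0, 71.0), "lived"),
    ((26.0, 8.0), "died"),
    ((35.0, 53.0), "lived"),
]

def nearest_neighbor(query, examples):
    """Return the label of the stored example closest to query."""
    _features, label = min(examples, key=lambda ex: math.dist(query, ex[0]))
    return label

print(nearest_neighbor((30.0, 60.0), known))  # closest stored point is (35.0, 53.0)
```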
Using ML to decipher individual genetic differences. Riccardo Sabatini’s 2016 TED talk (16 min) shows how ML is now being used to understand YOUR genome and to make predictions about you (both your strengths and your diseases). Learn about YOUR personal future: https://www.ted.com/talks/riccardo_sabatini_how_to_read_the_genome_and_build_a_human_being
Hands-On with Hans Rosling and Gapminder
Continuing our earlier exploration of data sources and data visualization, here’s Hans Rosling, a master of explaining complex phenomena with simple data visualizations. Start with a 4.5-minute video of 200 years of world history, shown through amazing graphs of global data: https://www.gapminder.org/videos/200-years-that-changed-the-world/
Try your hand at creating similar bubble graphs. Choose your own data for x- and y-axes (pull-down menu for each axis). Explore relationships you can see with powerful visualization of the data: https://www.gapminder.org/tools/#$chart-type=bubbles
Check out the Gapminder data sources, which are freely downloadable for any projects for you and your students. There are links to 519 data sources on this page, and some links take you to other massive groups of datasets. Most of the data is up to date and comes from global sources like the World Health Organization. You can view any of the datasets to see what the variables (features) are. You can download any dataset as a spreadsheet file. Some arrive as “csv” files, which R reads easily; otherwise, open the file in Excel, save it as a .csv file, and then open that file in R. Here’s the link. Enjoy exploring the world’s data! https://www.gapminder.org/data/
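Once you have a .csv file, checking which variables it contains takes only a few lines. The sketch below uses Python's standard `csv` module (the text above uses R, so treat this as an alternative illustration). The file contents, column names, and the helper `list_variables` are all invented stand-ins; a real Gapminder download may be laid out differently.

```python
import csv
import io

# Stand-in for a downloaded file; the column names and values are invented.
# For a real download, replace io.StringIO(sample) with
# open("your_file.csv", newline="") (hypothetical file name).
sample = """country,2000,2010
Country A,10,12
Country B,20,25
"""

def list_variables(csv_file):
    """Return the column names (variables) and the data rows of a CSV file object."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    return reader.fieldnames, rows

columns, rows = list_variables(io.StringIO(sample))
print(columns)                                # the variables (features) in the file
print(rows[0]["country"], rows[0]["2010"])    # values are read as text strings
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need converting (for example with `float`) before any calculation.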