Machine Learning with Python Professional Training
Basic Data Mining Sessions
Course Descriptions and Videos
Introduction to Data Mining
Module 1. Introduction to the Course and Introduction to Data Mining
The learning outcomes of this module are:
- Describe the difference between analytics and analysis and identify viable and profitable business problems for data analytics.
- Apply knowledge of the different application areas of analytics to develop analytics approaches more effectively in the organization.
- Identify common challenges in the use of analytics and how to overcome them in data mining projects.
- Understand when to apply descriptive, predictive, or prescriptive analytics.
Introduction to Predictive Modeling
Module 2. Introduction to Data Mining (continued) and Introduction to Predictive Modeling
The learning outcomes of this module are:
- Describe the evolution of data mining and the power and applicability of contemporary data mining approaches to organizational business problems.
- Utilize knowledge of the most common data mining application areas to select appropriate data mining tools, techniques, and methodologies for various projects.
- Apply knowledge from various disciplines to handle data analytics tasks more effectively.
- Understand the patterns that data mining can discover (e.g., associations, classifications, and clusters) and use them more effectively.
- Avoid common traps in data mining.
The Data Mining Process (CRISP-DM)
Module 3. The Data Mining Process, in particular, CRISP-DM
Use the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology to carry out data mining projects. The six phases in CRISP-DM are: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Supervised Segmentation (Example: Decision Tree)
Module 4. Supervised Segmentation (Example: Decision Tree)
This module delves into one of the main topics of data mining: predictive modeling. We will begin by thinking of predictive modeling as supervised segmentation: how can we segment the population into groups that differ from each other with respect to some quantity of interest? In particular, how can we segment the population with respect to something that we would like to predict or estimate? The target of this prediction can be something we would like to avoid, such as which customers are likely to leave the company when their contracts expire, which accounts have been defrauded, which potential customers are likely not to pay off their account balances (write-offs, such as defaulting on one’s phone bill or credit card balance), or which web pages contain objectionable content. The target might instead be cast in a positive light, such as which consumers are most likely to respond to an advertisement or special offer, or which web pages are most appropriate for a search query.
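As a minimal sketch of supervised segmentation, the Python snippet below scores candidate attributes by how well they split a population into groups that differ on the target. It assumes entropy-based information gain as the purity measure, which is one common choice when building decision trees and not necessarily the one used in this module. The customer records and the names contract, region, and churned are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the whole population minus the size-weighted
    entropy of the segments produced by splitting on `attribute`."""
    parent = entropy([row[target] for row in rows])
    segments = {}
    for row in rows:
        segments.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(seg) / len(rows) * entropy(seg)
                   for seg in segments.values())
    return parent - weighted


# Hypothetical customers: contract type separates churners perfectly,
# region does not separate them at all.
customers = [
    {"contract": "monthly", "region": "north", "churned": "yes"},
    {"contract": "monthly", "region": "south", "churned": "yes"},
    {"contract": "monthly", "region": "north", "churned": "yes"},
    {"contract": "monthly", "region": "south", "churned": "yes"},
    {"contract": "annual", "region": "north", "churned": "no"},
    {"contract": "annual", "region": "south", "churned": "no"},
    {"contract": "annual", "region": "north", "churned": "no"},
    {"contract": "annual", "region": "south", "churned": "no"},
]

# Pick the attribute whose segments are purest with respect to the target.
best_attribute = max(
    ["contract", "region"],
    key=lambda attr: information_gain(customers, attr, "churned"),
)
print(best_attribute)
```

A decision tree learner applies this kind of selection repeatedly, splitting each resulting segment again until the segments are pure enough or too small to split further.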
Discriminant Functions
Module 5. Discriminant Functions
This module covers an approach in which we specify the structure of the model but leave certain numeric parameters unspecified. The data mining procedure then calculates the best parameter values given a particular set of training data. A very common case is where the structure of the model is a parameterized mathematical function or equation of a set of numeric attributes. The attributes used in the model could be chosen based on domain knowledge regarding which attributes ought to be informative in predicting the target variable, or they could be chosen based on other data mining techniques, such as attribute selection procedures. The data miner specifies the form of the model and the attributes; the goal of the data mining is to tune the parameters so that the model fits the data as well as possible. This general approach is called parameter learning or parametric modeling.
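The idea of fixing the form of the model and letting the data set the parameters can be sketched in a few lines of Python. The example assumes a linear function f(x) = w*x + b of a single numeric attribute and uses least squares as the fitting objective. Both are illustrative choices rather than the module's prescribed method, and the data and function names are hypothetical.

```python
def fit_linear(xs, ys):
    """Return (w, b) that minimize the squared error of f(x) = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def classify(x, w, b):
    """Use the fitted function as a discriminant: the sign picks the class."""
    return 1 if w * x + b > 0 else -1


# Hypothetical training data: one numeric attribute, classes coded -1 / +1.
usage = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

# The data miner chose the form (a line); the data choose w and b.
w, b = fit_linear(usage, labels)
predictions = [classify(x, w, b) for x in usage]
print(w, b, predictions)
```

Other objective functions, such as the one used in logistic regression, keep the same linear form and differ only in how "fits the data as well as possible" is defined.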
Model Performance Analysis
Module 6. Model Performance Analysis
One of the most important fundamental notions of data science is that of overfitting and generalization. If we allow ourselves enough flexibility in searching for patterns in a particular dataset, we will find patterns. Unfortunately, these “patterns” may be just chance occurrences in the data. As discussed previously, we are interested in patterns that generalize—that predict well for instances that we have not yet observed. Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
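A small Python sketch can make overfitting concrete. It assumes an extreme, purely illustrative "table model" that memorizes the label of every training customer, and it uses randomly generated labels so that there is no real pattern to find. The customer IDs, labels, and function names are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_instances(start_id, count):
    """Customers with a unique ID and a coin-flip label: nothing real to learn."""
    return [(start_id + i, rng.choice(["stay", "leave"])) for i in range(count)]


training = make_instances(0, 1000)
holdout = make_instances(1000, 1000)  # customers the model has never seen

# "Table model": memorize the label of every training customer.
table = dict(training)
# Fallback for unseen customers: the most common training label.
majority = Counter(table.values()).most_common(1)[0][0]


def predict(customer_id):
    return table.get(customer_id, majority)


def accuracy(instances):
    return sum(predict(cid) == label for cid, label in instances) / len(instances)


train_accuracy = accuracy(training)
holdout_accuracy = accuracy(holdout)
print(train_accuracy, holdout_accuracy)
```

The memorized "patterns" reproduce the training data perfectly, yet on the holdout customers the model does no better than guessing the majority label. Comparing performance on training data with performance on held-out data is how that gap is detected.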
Model Performance Evaluation Metrics
Module 7. Model Performance Evaluation Metrics
What is a good model? For data science to add value to an application, it is important for the data scientists and other stakeholders to consider carefully what they would like to achieve by mining data. Both data scientists themselves and the people who work with them often avoid—perhaps without even realizing it—connecting the results of mining data back to the goal of the undertaking. This may manifest itself in the reporting of a statistic without a clear understanding of why it is the right statistic, or in the failure to figure out how to measure performance in a meaningful way.
Often it is not possible to perfectly measure one’s goal, for example because the systems are inadequate, or because it is too costly to gather the right data, or because it is difficult to assess causality. So, we might conclude that we need to measure some surrogate for what we would really like to measure. It is nonetheless crucial to think carefully about what we would really like to measure. If we must choose a surrogate, we should do it via careful, data-analytic thinking.
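As one hedged illustration of reporting a statistic without checking that it is the right one, the Python sketch below evaluates a trivial model on a hypothetical fraud dataset in which only 10 of 1,000 accounts are fraudulent. The data, labels, and function names are assumptions made for the example.

```python
# Hypothetical data: 10 fraudulent accounts among 1,000.
actual = ["fraud"] * 10 + ["ok"] * 990
# A trivial "model" that never flags fraud.
predicted = ["ok"] * 1000


def accuracy(actual, predicted):
    """Fraction of all instances classified correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive):
    """Fraction of the truly positive instances that the model caught."""
    caught = [p for a, p in zip(actual, predicted) if a == positive]
    return sum(p == positive for p in caught) / len(caught)


model_accuracy = accuracy(actual, predicted)
fraud_recall = recall(actual, predicted, "fraud")
print(model_accuracy, fraud_recall)
```

The model is correct on 99% of accounts while catching none of the fraud, so accuracy alone would be a poor surrogate if the goal of the project is to find defrauded accounts.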