Python Professional Training - Basic Data Mining Sessions | New Jersey Alliance for Clinical and Translational Science

Machine Learning with Python Professional Training
Basic Data Mining Sessions

Course Descriptions and Videos

Introduction to Data Mining

Module 1. Introduction to the Course and Introduction to Data Mining

The learning outcomes of this module are:

Describe the difference between analytics and analysis, and identify viable and profitable business problems for data analytics.
Apply knowledge of the different application areas of analytics to develop analytics approaches more effectively within the organization.
Identify common challenges in the use of analytics so that they can be overcome in data mining projects.
Understand when to apply descriptive, predictive, or prescriptive analytics.

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+1+-+Introduction+to+DataMining-NJIT/1_jycr82py

Introduction to Predictive Modeling

Module 2. Introduction to Data Mining (continued) and Introduction to Predictive Modeling

The learning outcomes of this module are:

Describe the evolution of data mining and the power and applicability of contemporary data mining approaches to organizational business problems.
Utilize knowledge of the most common data mining application areas to select appropriate data mining tools, techniques, and methodologies for various projects.
Apply knowledge from various disciplines to handle data analytics tasks more effectively.
Understand the patterns that data mining can discover (e.g., associations, classifications, and clusters) and use them more effectively.
Avoid common traps in data mining.

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+2+-+Introduction+to+Predictive+Modeling-NJIT/1_wwdj46po

The Data Mining Process (CRISP-DM)

Module 3. The Data Mining Process, in particular, CRISP-DM

Use the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology to carry out data mining projects. The six phases in CRISP-DM are: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+3+-+CRISP-DM+-+Overview+-+NJIT/1_8nmz622e

Supervised Segmentation (Example: Decision Trees)

Module 4. Supervised Segmentation (Example: Decision Trees)

This module delves into one of the main topics of data mining: predictive modeling. We begin by thinking of predictive modeling as supervised segmentation: how can we segment the population into groups that differ from each other with respect to some quantity of interest? In particular, how can we segment the population with respect to something that we would like to predict or estimate? The target of this prediction can be something we would like to avoid, such as which customers are likely to leave the company when their contracts expire, which accounts have been defrauded, which potential customers are likely not to pay off their account balances (write-offs, such as defaulting on one’s phone bill or credit card balance), or which web pages contain objectionable content. The target might instead be cast in a positive light, such as which consumers are most likely to respond to an advertisement or special offer, or which web pages are most appropriate for a search query.
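
To make the idea of supervised segmentation concrete, here is a minimal sketch in plain Python. It assumes information gain (entropy reduction) as the measure of how well an attribute segments the population, which is one common choice in decision tree induction; the module description does not name a specific measure. The customer records and the attribute names `plan`, `long_tenure`, and `churned` are hypothetical toy data, not course material.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by segmenting rows on one attribute."""
    parent = entropy([r[target] for r in rows])
    segments = {}
    for r in rows:
        segments.setdefault(r[attribute], []).append(r[target])
    # Entropy of each segment, weighted by the share of rows it contains.
    weighted = sum(len(s) / len(rows) * entropy(s) for s in segments.values())
    return parent - weighted

# Hypothetical toy data: did the customer leave when the contract expired?
customers = [
    {"plan": "prepaid", "long_tenure": "no", "churned": "yes"},
    {"plan": "prepaid", "long_tenure": "yes", "churned": "yes"},
    {"plan": "prepaid", "long_tenure": "no", "churned": "yes"},
    {"plan": "contract", "long_tenure": "yes", "churned": "no"},
    {"plan": "contract", "long_tenure": "no", "churned": "no"},
    {"plan": "contract", "long_tenure": "yes", "churned": "no"},
]

gains = {a: information_gain(customers, a, "churned") for a in ("plan", "long_tenure")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

In this toy data, segmenting on `plan` yields two groups that are each pure with respect to `churned`, so it scores higher than `long_tenure`. A decision tree learner would typically repeat this kind of attribute selection within each resulting segment.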

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+4+-+Supervised+Segmentation/1_m15zbx0y

Discriminant Functions

Module 5. Discriminant Functions

This module covers an approach in which the data miner specifies the structure of the model, leaving certain numeric parameters unspecified. The data mining procedure then calculates the best parameter values given a particular set of training data. A very common case is where the structure of the model is a parameterized mathematical function or equation of a set of numeric attributes. The attributes used in the model could be chosen based on domain knowledge regarding which attributes ought to be informative in predicting the target variable, or they could be chosen using other data mining techniques, such as attribute selection procedures. The data miner specifies the form of the model and the attributes; the goal of the data mining is to tune the parameters so that the model fits the data as well as possible. This general approach is called parameter learning or parametric modeling.
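
As a rough illustration of parameter learning, the sketch below fixes the form of the model as a linear discriminant, w1*x1 + w2*x2 + b, and lets a training loop tune the weights and bias. It assumes the perceptron update rule, which is only one simple way to tune such parameters and is not necessarily the method used in the lesson. The function names and the six training points are hypothetical.

```python
def predict(weights, bias, x):
    """Linear discriminant: class 1 if w . x + b > 0, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def fit_perceptron(points, labels, learning_rate=1.0, max_epochs=1000):
    """Tune weights and bias with the perceptron rule until no training errors remain."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            if error != 0:
                # Nudge the boundary toward classifying this point correctly.
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return weights, bias

# Hypothetical, linearly separable training data with two numeric attributes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = fit_perceptron(points, labels)
training_predictions = [predict(weights, bias, x) for x in points]
print(weights, bias, training_predictions)
```

The data miner chose the form of the function and the two attributes; only the numeric values of `weights` and `bias` were learned from the data. On data that is not linearly separable this loop would simply stop after `max_epochs`, which is one reason other fitting procedures are often preferred in practice.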

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+5+-+Discriminant+Functions-New/1_f2w2vv1i

Model Performance Analysis

Module 6. Model Performance Analysis

One of the most important fundamental notions of data science is that of overfitting and generalization. If we allow ourselves enough flexibility in searching for patterns in a particular dataset, we will find patterns. Unfortunately, these “patterns” may be just chance occurrences in the data. As discussed previously, we are interested in patterns that generalize—that predict well for instances that we have not yet observed. Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called overfitting the data.
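
The following sketch shows overfitting in its most extreme form, under stated assumptions: the data are generated at random with coin-flip labels, so there is no real pattern to find, and the "model" is a lookup table that memorizes every training instance. All names and sizes here are hypothetical choices for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

def make_instances(n, n_attributes=40):
    """Random binary attributes with a coin-flip label: there is no real pattern to learn."""
    return [
        (tuple(random.randint(0, 1) for _ in range(n_attributes)), random.randint(0, 1))
        for _ in range(n)
    ]

train = make_instances(1000)
holdout = make_instances(1000)

# "Table model": memorize the label of every training instance exactly.
table = {features: label for features, label in train}
majority = Counter(label for _, label in train).most_common(1)[0][0]

def predict(features):
    # Fall back to the training majority class for instances never seen before.
    return table.get(features, majority)

def accuracy(instances):
    return sum(predict(f) == y for f, y in instances) / len(instances)

train_accuracy = accuracy(train)
holdout_accuracy = accuracy(holdout)
print(f"training accuracy: {train_accuracy:.2f}, holdout accuracy: {holdout_accuracy:.2f}")
```

The table model should look perfect on the data it memorized and perform at roughly chance level on the holdout set. The gap between the two numbers is the signal that the "patterns" it found do not generalize, which is why performance is normally estimated on data not used for training.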

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+6+-+Model+Performance+Analytics-New/1_a9kkufd8

Model Performance Evaluation Metrics

Module 7. Model Performance Evaluation Metrics

What is a good model? For data science to add value to an application, it is important for the data scientists and other stakeholders to consider carefully what they would like to achieve by mining data. Both data scientists themselves and the people who work with them often avoid—perhaps without even realizing it—connecting the results of mining data back to the goal of the undertaking. This may manifest itself in the reporting of a statistic without a clear understanding of why it is the right statistic, or in the failure to figure out how to measure performance in a meaningful way.

Often it is not possible to perfectly measure one’s goal, for example because the systems are inadequate, or because it is too costly to gather the right data, or because it is difficult to assess causality. So, we might conclude that we need to measure some surrogate for what we would really like to measure. It is nonetheless crucial to think carefully about what we would really like to measure. If we must choose a surrogate, we should do it via careful, data-analytic thinking.
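
A small worked example can show why reporting a statistic without asking whether it is the right one is risky. The sketch below assumes accuracy, precision, and recall as the candidate metrics; the module description does not list specific metrics, so treat these as illustrative. The data are hypothetical: 100 accounts, of which 5 are fraudulent.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def metrics(actual, predicted):
    """Accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical ground truth: the first 5 accounts are fraudulent (1), the rest are not (0).
actual = [1] * 5 + [0] * 95

# Model A never flags anything. Model B flags 8 accounts and catches 4 of the 5 frauds.
always_negative = [0] * 100
flagging_model = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

baseline = metrics(actual, always_negative)
candidate = metrics(actual, flagging_model)
print(baseline)
print(candidate)
```

Both models have the same accuracy here, yet the first catches no fraud at all. Which metric is appropriate depends on the goal of the project, for instance the relative cost of a missed fraud versus a false alarm.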

*Pre-recorded Lesson: https://rutgers.mediaspace.kaltura.com/media/Module+7+-+ModelPerformanceEvaluationMetrics/1_ojx6rg5p

