The bread and butter of practical machine learning

November 28, 2019


‘In theory there is no difference between theory and practice. But in practice, there is.’

What is the Core machine learning track?

First of all, this learning track is a complete solution for credit risk modeling. It teaches how to assess credit risk not only with the standard classification approach but also with regression and survival analysis, so different target variables – binary, numeric, time-based – are used for a thorough assessment.

Secondly, the training is reproducible on other types of machine learning tasks. The code libraries that we deliver are re-usable for most ML problems, with the exception of specialized areas such as computer vision and natural language processing.

Trainees learn the essentials of machine learning through extensive practice. We start with an introduction to statistics, SQL and databases. Then we practice data manipulation, pre-processing and feature engineering – essential steps for building high-performing machine learning models. This is followed by feature selection and model training. Credit risk modeling is the base ‘learner’ in the training program; as mentioned above, we use ‘standard’ classification for binary default/non-default, survival analysis to predict time till default and regression to predict other quantitative measures of default. We then interpret the models with LIME and Shapley values, measure model performance and select the best-performing models by analyzing the bias-variance tradeoff and various evaluation metrics. Reject inference analysis is performed against real-life cases to minimize false positive and false negative predictions.

The course provides a sound mix of theoretical and technical insights and conforms to the standard curriculum of typical data scientist specializations. It also delivers a practical implementation that lays the groundwork for credit risk modeling in the context of the Basel and IFRS 9 guidelines. Trainees will learn the tools needed to assess the 3 key credit risk parameters: Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD).
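
For orientation, these three parameters combine into the expected loss on an exposure: EL = PD × LGD × EAD. A minimal R illustration with made-up numbers:

```r
# Expected loss from the three key credit risk parameters (illustrative values)
pd  <- 0.04     # probability of default over the horizon
lgd <- 0.45     # share of the exposure lost if default occurs
ead <- 250000   # exposure at default, in currency units

expected_loss <- pd * lgd * ead
expected_loss   # 4500
```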


The Core machine learning track is composed of 5 interconnected parts:

  • a sequence of end-to-end R / Python tutorials with code for the entire CRISP-DM cycle, from data exploration through model interpretation, selection and calibration
  • instructor-led labs by DS/ML practitioners
  • lectures on the fundamentals of DS/ML theory by university professors
  • DataCamp short interactive online courses
  • practice-based projects

Optional: we use 1-2 relevant real-life Kaggle competitions as additional learners alongside the credit risk modeling project by re-engineering those competitions with the customer’s data.


What trainees will learn:

Trainees learn the ‘classic’ junior data scientist toolbox: {data exploration + visualization + classification + regression}

Online specializations and skill tracks in DS/ML (see links at the bottom of this page) that correspond to the common definition of the ‘junior/middle data scientist’ skillset usually span 6-8 months and require 300-400 hours of studying @ 12-15 hours per week. Such specializations typically exclude deep learning & big data infrastructure but sometimes cover additional topics like association rules, recommender systems, graph analysis or time series modeling. Our syllabus is similar, with a few notable differences and enhancements:

  • Full program duration is 3-3.5 months

  • Guided labs by ML practitioners ~2 times / 4 hours per week

  • 20-25 lecture hours by university professors on ML fundamentals

  • Trainees complete 20+ DataCamp short online courses, most of which are part of ‘Data Scientist with R / Python’ skill tracks

  • Model interpretability is the new norm in DS/ML. Still a frontier of DS/ML research, it is not yet widely taught in online courses or textbooks. Trainees learn the latest SHAP and LIME techniques to explain ‘black box’ models

  • Special emphasis is placed on the cornerstone concepts of ML: bias-variance tradeoff, overfitting and underfitting, model selection using performance metrics (ROC/AUC, Gini, Sensitivity, Specificity, Kappa, RMSE, etc.)

  • The following data exploration and preparation tasks – treatment of outliers, anomalies & missing values, variable discretization, feature selection (using 6-8 algorithms) – are often overlooked in online courses and textbooks but are covered in detail in our program

  • Credit risk modeling is the base learner in the capstone practice-based project

    • 300+ variables generated in DataStore from XML credit reports

    • Several target variable types reflecting different ways to measure credit risk – ‘classic’ binary default / non-default, continuous numeric, time-based values – are handled with classification, regression and survival analysis models

    • ‘Dormant’ borrowers who recently took a new loan after a prolonged period of inactivity are treated as a separate category

    • Borrowers with 2+ loans are another separate category with a higher risk profile

    • Aggregate credit risk of a borrower as a single entity is assessed, i.e. probability of default on any of the borrower’s outstanding loans

    • Probability of default is also calculated for each loan depending on its type, amount and term, as well as the borrower’s overall debt

  • Deep learning frameworks (Keras, TensorFlow) are used for model training alongside ‘conventional’ algorithms like boosting, bagging, SVM, etc.

  • Reject inference analysis is a highly recommended part of the training

The following topics – clustering, association rules, sequence rules, causality, graph/network analysis – are available as custom training programs upon request. They are customized to specific data and business requirements.


What trainees will take away, i.e. our deliverables:

  • R / Python code libraries are delivered in 2 forms: as executable code inside RMarkdown files and as a 200+ page PDF/HTML book
  • Lecture notes
  • ML textbooks
  • DataCamp subscriptions for self-study through 300+ courses beyond the syllabus
  • Optional: review and re-engineering of relevant Kaggle competitions
  • Optional: TeamHub, the codebase & documentation knowledgebase

Pre-requisites

A mathematics or computer science background is not required; trainees without prior programming experience will learn to write code during the training. We welcome business domain experts – risk managers, financial analysts, marketing & customer service professionals – to learn DS/ML.


Why the Core machine learning track?

The Core machine learning track might be a good match if:

  • you need to build a DS/ML team that has the skills to tackle real-life business tasks from ‘day 1’
  • you are considering training employees to become data scientists in order to grow your DS/ML team
  • IFRS 9 and Basel regulations call for more sophisticated credit risk modeling than you currently have, i.e. full-fledged PD, LGD and EAD modeling is required

And if you believe that:

  • it is easier and faster to train 10 business experts in DS/ML than to train 3-5 data scientists in 10 different business domains
  • it is cheaper, because the business domain experts are likely already your employees, so you depend less on outside hiring and rely more on internal human resources

The key reason: it saves time and money

Firstly, guided training takes trainees through the syllabus twice as fast as online ‘data scientist’ specializations.

Secondly, code re-use is efficient and effective. It will eventually save you hundreds or thousands of man-hours and thousands of US$ in labour costs once trainees have learned and start re-using the standardized code templates that come as part of this learning track.

Thirdly, you will likely depend less on hiring data scientists from outside because your business domain experts can learn these skills via in-house training and start contributing valuable data analysis insights to DS/ML projects. Cross-functional teams will speed up DS/ML adoption by adding intelligent layers of data-driven decisions across a range of operations.

Our code templates work end-to-end, from raw data mining to model interpretation & selection, and contain substantially more re-usable code than online courses. Data cleaning, exploration and manipulation usually take up to 80% of the time in DS/ML projects. These fairly simple activities are often the most expensive in terms of the time they consume.

Shorter model-to-production time = higher ROI on DS/ML investment

Data scientists will no longer have to spend hundreds of hours writing code for nearly the same recurring tasks across different projects. No time has to be spent adapting generic coding exercises from online courses or tutorials; instead, trainees start delivering production-grade solutions by re-using the code templates they learn during the training. This frees senior data scientists to focus on more complex tasks, share advanced solutions with the DS/ML team and mentor juniors, so that more gets done with the available resources.

Multiply hundreds or thousands of man-hours by the labour cost per hour – that is how much payroll money we can help you save, especially when the training comes with our TeamHub solution for code & documentation sharing.

A few more reasons

Please see our FAQ section for a more detailed overview of how the Core machine learning track delivers value to your business:

  • Model interpretability
  • Data analytics as a by-product
  • Practical implementation of real-life tasks

The Core machine learning track syllabus

1. Introduction to R programming

The learning materials:

  • Online courses: 5 core and 1 optional DataCamp courses; Python equivalents are also available
  • Textbooks: none
  • Ready-to-use code libraries: none

Objectives: Introduction to programming for trainees without prior programming experience

Difficulty level: Beginner

* Vectors, lists, data frames, factors
* If/Else, loops, functions, apply functions, time and dates
* Basics of R programming
* Optional: Functional programming in R
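
A few lines of R illustrating the basics listed above (vectors, data frames, conditionals, apply functions); the data are made up:

```r
# Vectors and a data frame
amounts <- c(1200, 500, 7300, 250)
loans   <- data.frame(id = 1:4, amount = amounts, term = c(12, 6, 36, 3))

# If/else inside a function
risk_band <- function(amount) {
  if (amount > 5000) "high" else "standard"
}

# Apply the function over a vector
sapply(loans$amount, risk_band)
# [1] "standard" "standard" "high"     "standard"
```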

2. SQL fundamentals

The learning materials:

  • Online courses: 2 core DataCamp courses, 1 optional course by Mode Analytics, 1 optional set of online SQL exercises
  • Textbooks: optional, available upon request
  • Ready-to-use code libraries: none

Objectives: Introduction to data manipulation with SQL (Structured Query Language) for trainees without prior experience with database management systems

Difficulty level: Beginner

* Select, Where, Between, Order by, Group by, Having
* Left join, Inner join, Semi join, Except, Intersect
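
A small sketch of these clauses run from R against an in-memory SQLite database via the DBI package; the table and column names are made up for illustration:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A toy loans table (made-up data)
dbWriteTable(con, "loans", data.frame(
  borrower_id = c(1, 1, 2, 3),
  amount      = c(1000, 4000, 2500, 700),
  status      = c("paid", "default", "paid", "paid")
))

# GROUP BY / HAVING: borrowers whose total exposure exceeds 2000
dbGetQuery(con, "
  SELECT borrower_id, SUM(amount) AS total_amount
  FROM loans
  GROUP BY borrower_id
  HAVING SUM(amount) > 2000
  ORDER BY total_amount DESC
")

dbDisconnect(con)
```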

3. Statistics fundamentals

The learning materials:

  • Online courses: 2 core DataCamp courses, 1 core and 2 optional Stepik courses
  • Textbooks: optional, available upon request
  • Ready-to-use code libraries: none

Objectives: Introduction to the concept of statistical distributions; learning to draw insights, develop hypotheses and run experiments

Difficulty level: Beginner

* Density, variance, binomial & Poisson distributions, intro to Bayesian statistics
* Hypothesis testing, confidence intervals, linear regression for testing
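
A short R sketch of the kind of exercises in this block: simulating a binomial sample, testing a hypothesis about a default rate and fitting a simple regression (all numbers are made up):

```r
set.seed(42)

# Simulate 1000 loans with a true 5% default rate
defaults <- rbinom(n = 1000, size = 1, prob = 0.05)

# Point estimate, confidence interval and a test of the
# hypothesis that the default rate equals 4%
prop.test(x = sum(defaults), n = length(defaults), p = 0.04)

# A simple linear regression used to test a relationship
x <- rnorm(100)
y <- 2 * x + rnorm(100)
summary(lm(y ~ x))
```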

4. Data manipulation, exploration and visualization

The learning materials:

  • Online courses: 9 core and 2 optional DataCamp courses
  • Textbooks: 2 textbooks
  • Ready-to-use code libraries: 7 RMarkdown files with template code and charts

Objectives: Foundational data manipulation skills; how to analyze datasets and effectively communicate data analysis by building powerful visualizations and interactive charts

Difficulty level: Beginner

* Importing data
* Filter, arrange, mutate, summarize, group_by, various versions of joins
* Line plot, boxplot, histogram, scatterplot

* Generation of dataset summaries
* Exploration of categorical and numerical data
* Variation. Univariate analysis of categorical and numerical variables.
  High Cardinality. Visualization
* Covariation. Covariate analysis of categorical and numerical variables.
  Target profiling. Visualization
* Correlation and relationship. Correlation tables, linear regression, RMSE and R2
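
A condensed R sketch of the verbs and charts listed above, using dplyr and ggplot2 on the built-in mtcars dataset (any tabular dataset would work the same way):

```r
library(dplyr)
library(ggplot2)

# filter / mutate / group_by / summarise
mtcars %>%
  filter(cyl %in% c(4, 6)) %>%
  mutate(kpl = mpg * 0.425) %>%          # miles per gallon -> km per litre
  group_by(cyl) %>%
  summarise(mean_kpl = mean(kpl), n = n())

# A boxplot and a scatterplot
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
```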

5. Data transformation

The learning materials:

  • Online courses: 4 core and 2 optional DataCamp courses
  • Textbooks: 3 textbooks
  • Ready-to-use code libraries: 11 RMarkdown files with template code

Objectives: Foundational data transformation skills; how to prepare datasets for predictive analytics and modeling

Difficulty level: Beginner

* Information Value and Weight of Evidence
* Detection and treatment of outliers and anomalies
* Detection and treatment of missing values
* Cleaning and tidying up data. String manipulation
* Manipulating categorical/factor data
* Time and dates

* Removing duplicate entries, incomplete observations,
  variables with zero or near-zero variance
* Discretization with standard tools
* Equal width, equal frequency, custom range discretization
* Visualization of output
* Custom discretization/binning
* Pre-processing. Dummy variables, centering, scaling,
  class distance calculations
* Dataset preparation. Splitting train and test datasets.
  Imbalanced class datasets. Generating response variables
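
A minimal sketch of a few of these steps with the caret package; the data frame and column names (`income`, `age`, `default`) are illustrative assumptions, not the project's actual variables:

```r
library(caret)
set.seed(1)

# Illustrative dataset: a numeric predictor with missing values
# and an imbalanced binary target
df <- data.frame(
  income  = c(rnorm(490, 50000, 15000), rep(NA, 10)),
  age     = round(runif(500, 20, 70)),
  default = factor(sample(c("no", "yes"), 500, replace = TRUE, prob = c(0.9, 0.1)))
)

# Drop near-zero-variance variables (none here, but the check is routine)
nzv <- nearZeroVar(df)
if (length(nzv) > 0) df <- df[, -nzv]

# Impute missing values, then center and scale the numeric predictors
pp <- preProcess(df[, c("income", "age")],
                 method = c("medianImpute", "center", "scale"))
df[, c("income", "age")] <- predict(pp, df[, c("income", "age")])

# Stratified train/test split that preserves the class imbalance
idx   <- createDataPartition(df$default, p = 0.8, list = FALSE)
train <- df[idx, ]
test  <- df[-idx, ]
```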

6. Feature selection

The learning materials:

  • Online courses: no specific online course
  • Textbooks: no specific textbooks
  • Ready-to-use code libraries: 7 RMarkdown files with template code

Objectives: Identifying the most important variables or parameters that help predict the outcome during model construction

Difficulty level: Intermediate

* Feature selection overview
* PCA as an alternative view on feature selection
* Feature selection based on IV and WoE
* Variable importance from ML algorithms
* Lasso, ridge and elastic net regressions
* Recursive feature elimination
* Genetic algorithms
* Simulated annealing
* Variable importance; dropout loss based on DALEX package
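
Two of the approaches above in a minimal R sketch: embedded feature selection with a cross-validated lasso (glmnet) and permutation-based dropout loss with DALEX. It re-uses the illustrative `train` data frame from the data transformation sketch:

```r
library(glmnet)

# Model matrix of predictors and the binary response
x <- model.matrix(default ~ ., data = train)[, -1]   # drop the intercept column
y <- train$default

# Cross-validated lasso (alpha = 1); coefficients shrunk to exactly zero
# are effectively dropped from the model
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min")

# Permutation-based variable importance (dropout loss) with DALEX
library(DALEX)
glm_fit   <- glm(default ~ ., data = train, family = binomial)
explainer <- explain(glm_fit,
                     data = train[, setdiff(names(train), "default")],
                     y    = as.numeric(train$default) - 1)
plot(model_parts(explainer))
```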

7. Supervised learning – Classification

The learning materials:

  • Online courses: 2 core and 1 optional DataCamp courses
  • Textbooks: 2 core textbooks and several optional textbooks upon request
  • Ready-to-use code libraries: 22 RMarkdown files with template code

Objectives: Setting parameters and controls for classification models; understanding basic classification model metrics and selection criteria.

Difficulty level: Intermediate

* Train controls
* Resampling. Cross-validation, bootstrap
* Metrics. ROC/AUC, Gini
* Sensitivity and specificity. Asymmetrical costs of errors.
  Threshold (cut-off) levels
* Metrics. Accuracy, Precision, Recall, Kappa
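
A sketch of how these controls and metrics are typically wired together with caret and pROC, re-using the illustrative `train`/`test` split from the data transformation sketch:

```r
library(caret)
library(pROC)

# 5-fold cross-validation with class probabilities and an ROC-based summary
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Logistic regression scored by cross-validated ROC/AUC
glm_cv <- train(default ~ ., data = train, method = "glm",
                trControl = ctrl, metric = "ROC")

# Hold-out metrics: confusion matrix, sensitivity/specificity, AUC and Gini
probs <- predict(glm_cv, newdata = test, type = "prob")[, "yes"]
preds <- factor(ifelse(probs > 0.5, "yes", "no"), levels = levels(test$default))
confusionMatrix(preds, test$default, positive = "yes")

auc_value <- auc(roc(test$default, probs))
gini      <- 2 * auc_value - 1    # Gini coefficient derived from AUC
```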

Objectives: Training classification models; introduction to most commonly used classification algorithms.

* Hyper-parameter tuning
* Logistic Regression
* Decision trees
* Bagging, random forest
* Boosting
* Elastic net, Lasso, Ridge, regularization
* Support vector machines
* Neural networks. Feed-forward, MLP
* Classification with deep neural networks (Keras, TensorFlow)
* Ensemble learning
* Optional: Cost sensitive classification
* Optional: Multi-class & multi-label classification
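
A minimal example of training and tuning one of the algorithms above (a random forest) with the same caret controls; Keras/TensorFlow models follow the same train-then-evaluate pattern through their own API:

```r
# Random forest with a small tuning grid over mtry, evaluated on the
# same cross-validation setup (ctrl) as the logistic regression above
rf_fit <- train(default ~ ., data = train, method = "rf",
                trControl = ctrl, metric = "ROC",
                tuneGrid = expand.grid(mtry = c(1, 2)))
rf_fit$results   # cross-validated ROC, sensitivity and specificity per mtry
```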

Objectives: Detection of model misfit, model diagnostics and selection; how to select the right models using a range of metrics. Special emphasis is placed on a thorough understanding of these fundamentally important machine learning concepts.

* Bias-variance trade-off. Overfitting and underfitting. Learning curves
* Diagnostics and selection of classification models
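
Comparing candidate models on their cross-validation resamples is one simple diagnostic for over- and underfitting; a sketch with the two caret models above (for a strict comparison, fix the resampling indices in trainControl):

```r
# Side-by-side distribution of resampled metrics for the two models
res <- resamples(list(logistic = glm_cv, random_forest = rf_fit))
summary(res)   # ROC, sensitivity and specificity per model
bwplot(res)    # visual comparison across resamples
```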

Objectives: Model interpretability; how to interpret models’ decisions on any given observation for algorithms that are generally considered black-boxes.

* Model decision interpretation with LIME
* Model decision interpretation with Shapley values
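
A sketch of per-observation explanations with Shapley values via DALEX (the lime package follows a similar explainer/explain pattern); it assumes DALEX's built-in support for caret models and re-uses the illustrative objects from the previous sketches:

```r
library(DALEX)

# Wrap the caret random forest in a DALEX explainer
rf_explainer <- explain(rf_fit,
                        data  = test[, setdiff(names(test), "default")],
                        y     = as.numeric(test$default) - 1,
                        label = "random forest")

# Shapley values for a single application: how each feature pushed the
# predicted default probability up or down relative to the average
new_obs <- test[1, setdiff(names(test), "default")]
shap    <- predict_parts(rf_explainer, new_observation = new_obs, type = "shap")
plot(shap)
```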

8. Supervised learning – Regression

The learning materials:

  • Online courses: 2 core DataCamp courses
  • Textbooks: 2 core textbooks and several optional textbooks upon request
  • Ready-to-use code libraries: 7 RMarkdown files with template code

Objectives: Training regression models with a range of algorithms, model diagnostics and selection.

Difficulty level: Intermediate

* Bagging, random forest
* Boosting
* Elastic net, Lasso, Ridge, regularization
* Support vector machines
* Neural networks. Feed-forward, MLP
* Diagnostics and selection of regression models
* Regressions with deep neural networks (Keras, TensorFlow)
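
The same caret workflow carries over to regression with a numeric target such as a loss amount; a minimal sketch on made-up data:

```r
library(caret)
set.seed(2)

# Illustrative regression target: loss amount driven by income plus noise
reg_df <- data.frame(income = rnorm(300, 50000, 15000),
                     age    = round(runif(300, 20, 70)))
reg_df$loss <- pmax(0, 10000 - 0.1 * reg_df$income + rnorm(300, sd = 2000))

# Cross-validated random forest selected on RMSE
reg_ctrl <- trainControl(method = "cv", number = 5)
rf_reg   <- train(loss ~ ., data = reg_df, method = "rf",
                  trControl = reg_ctrl, metric = "RMSE")
rf_reg$results   # cross-validated RMSE, R-squared and MAE per mtry
```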

Optional. Survival analysis as a sub-class of regression models

* Optional. Survival analysis
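
A minimal sketch of the survival-analysis view of credit risk with the survival package: time until default is the outcome, with censoring for loans that have not defaulted by the end of observation (all data made up):

```r
library(survival)

# Made-up loan histories: months observed, default indicator
# (1 = defaulted, 0 = censored / still performing) and two borrower features
surv_df <- data.frame(
  months  = c(5, 24, 12, 36, 8, 30, 18, 10),
  default = c(1, 0, 1, 0, 1, 0, 0, 1),
  income  = c(30, 80, 60, 90, 25, 70, 55, 40),
  n_loans = c(3, 1, 2, 1, 4, 1, 2, 3)
)

# Cox proportional hazards model for time to default
cox_fit <- coxph(Surv(months, default) ~ income + n_loans, data = surv_df)
summary(cox_fit)

# Predicted survival (non-default) curve for a new borrower profile
plot(survfit(cox_fit, newdata = data.frame(income = 50, n_loans = 2)))
```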

9. The practice-based capstone project

Credit risk modeling is the base learner in the practice-based capstone project.

The learning materials:

  • Online courses: no specific online course
  • Textbooks: no specific textbooks
  • Ready-to-use code libraries: 7 RMarkdown and SQL files with template code

Objectives: disclosure and review of the SQL code

Difficulty level: Intermediate

* Data warehouse model and SQL code
* Feature engineering in SQL – running on the customer's data

Objectives: Practice-based modeling using several methods and target variables for a holistic credit risk assessment

* Classic binary classification *Good/Bad* model
* Regression modeling of credit risk with non-binary response variables,
  e.g. quantitative measures of exposure and default
* 'What-If' analysis: predicting borrowing behaviour and its effect on credit risk

* Optional. Survival analysis: predicting time to default
* Optional. Comparison and combination of segmented classification models

Objectives: Adding reject inference analysis to the toolbox. Optional but highly recommended.

* Reject inference analysis of approved and rejected loan applications
  for model re-calibration and fine-tuning
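
One simple reject-inference approach (hard cut-off augmentation) as a sketch: score the rejected applications with a scorecard built on approved loans, assign inferred labels, and re-train on the combined sample. The data frames `approved` / `rejected`, their columns and the 0.30 threshold are illustrative assumptions, not the program's exact recipe:

```r
# Assumed inputs: `approved` with an observed `default` factor ("no"/"yes"),
# and `rejected` with the same predictors but no observed outcome
scorecard <- glm(default ~ income + age, data = approved, family = binomial)

# Score the rejected applications and infer labels via a hard cut-off
rejected$p_default <- predict(scorecard, newdata = rejected, type = "response")
rejected$default   <- factor(ifelse(rejected$p_default > 0.30, "yes", "no"),
                             levels = levels(approved$default))

# Re-train on approved + inferred rejects to reduce selection bias
combined   <- rbind(approved, rejected[, names(approved)])
scorecard2 <- glm(default ~ income + age, data = combined, family = binomial)
```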

Project finalization

* Combination of the methods above into a unified credit risk model

The Core machine learning track summary

* 9 chapters, 200+ page book
* 20+ DataCamp courses @ 5 hours per course, plus 9 optional courses
* 50-60 hours of guided labs @ 2 hours, 2 times a week, over 12-14 weeks
* Standard program duration is 14-15 weeks @ 15 hours per week of trainee time
* Actual training duration may vary depending on prior programming experience
  of the trainees and the number of optional topics chosen from the syllabus