November 28, 2019

- What is Core machine learning track?
- What trainees will learn
- What trainees will take away, i.e. our deliverables
- Pre-requisites
- Why Core machine learning track?
- The Core machine learning track syllabus
  - 1. Introduction to R programming
  - 2. SQL fundamentals
  - 3. Statistics fundamentals
  - 4. Data manipulation, exploration and visualization
  - 5. Data transformation
  - 6. Feature selection
  - 7. Supervised learning – Classification
  - 8. Supervised learning – Regression
  - 9. The practice-based capstone project
- The Core machine learning track summary
- Useful links
‘In theory there is no difference between theory and practice. But in practice, there is.’
What is Core machine learning track?
First and foremost, this learning track is a complete solution for credit risk modeling. It teaches how to assess credit risk not only with the standard classification approach but also with regressions and survival analysis; hence different target variables – binary, numeric, time-based – are used for a thorough assessment.
Secondly, the training carries over to other types of machine learning tasks. The code libraries that we deliver are re-usable for most ML problems, except computer vision, natural language processing and other specialized topics.
Trainees learn the essentials of machine learning through extensive practice. We start with an introduction to statistics, SQL and databases. Then we practice data manipulation, pre-processing and feature engineering – essential steps for building high-performing machine learning models. This is followed by feature selection and model training. Credit risk modeling is the base ‘learner’ in the training program; as mentioned above, we use ‘standard’ classification for binary default/non-default, survival analysis to predict time till default, and regressions to predict other quantitative measures of default. We then interpret the models with LIME and Shapley values, measure model performance, and select the best-performing ones by analyzing the bias-variance tradeoff and various evaluation metrics. Reject inference analysis is performed against real-life cases to minimize false positive and false negative predictions.
The course provides a sound mix of both theoretical and technical insights and conforms to a standard curriculum of typical data scientist specializations. It also delivers practical implementation that lays ground for credit risk modeling in the context of the Basel and IFRS 9 guidelines. Trainees will learn necessary tools to assess 3 key credit risk parameters: Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD).
Core machine learning track is composed of 5 interconnected parts:
- a sequence of end-to-end R / Python tutorials with code for the entire CRISP-DM cycle, from data exploration through model interpretation/selection/calibration;
- instructor-led labs by DS/ML practitioners;
- lectures on the fundamentals of DS/ML theory by university professors;
- DataCamp short interactive online courses;
- practice-based projects.
Optional: we use 1-2 relevant real-life Kaggle competitions as additional learners alongside the credit risk modeling project by re-engineering those competitions with customer’s data.
What trainees will learn:
Trainees learn the ‘classic’ junior data scientist toolbox: {data exploration + visualization + classification + regression}
Online specializations and skill tracks in DS/ML (see links at the bottom of this page) that correspond to the common definition of the ‘junior/middle data scientist’ skillset usually span 6-8 months and require 300-400 hours of study at 12-15 hours per week. Such specializations typically exclude deep learning and big data infrastructure but sometimes cover additional topics like association rules, recommender systems, graph analysis or time series modeling. Our syllabus is similar, with a few notable differences and enhancements.
Full program duration is 3-3.5 months
Guided labs by ML practitioners ~2 times / 4 hours per week
20-25 lecture hours by university professors on ML fundamentals
Trainees complete 20+ DataCamp short online courses, most of which are part of ‘Data Scientist with R / Python’ skill tracks
Model interpretability is the new norm in DS/ML. Still a frontline of DS/ML research, it is not yet widely taught in online courses or textbooks. Trainees learn the latest SHAP and LIME techniques to explain ‘black box’ models
Special emphasis is placed on the cornerstone concepts of ML: bias-variance tradeoff, overfitting and underfitting, model selection using performance metrics (ROC/AUC, Gini, Sensitivity, Specificity, Kappa, RMSE, etc.)
The following data exploration tasks – treatment of outliers, anomalies & missing values, variable discretization, feature selection (using 6-8 algorithms) – are often overlooked in online courses and textbooks but covered in detail in our program.
Credit risk modeling is the base learner in the capstone practice-based project
300+ variables generated in DataStore from XML credit reports
Several target variable types reflecting different ways to measure credit risk – ‘classic’ binary default/non-default, continuous numeric, time-based values – are handled with classification, regression and survival analysis models
‘Dormant’ borrowers who recently took a new loan after a prolonged period of inactivity are treated as a separate category
Borrowers with 2+ loans are another separate category of a higher risk profile
Aggregate credit risk of a borrower as a single entity is assessed, i.e. probability of default on any of the borrower’s outstanding loans
Probability of default is also calculated for each loan depending on its type, amount, term as well as borrower’s overall debt
Deep learning frameworks (Keras, Tensorflow) are used for model training alongside ‘conventional’ algorithms like boosting, bagging, SVM, etc.
Reject inference analysis is a highly recommended part of the training
The following topics – clustering, association rules, sequence rules, causality, graph/network analysis – are available as custom training programs upon request. They are customized to specific data and business requirements.
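The evaluation metrics listed above reward being understood from first principles. As a minimal sketch (hypothetical toy scores; Python shown here, though the track delivers both R and Python materials), ROC AUC can be computed with the rank-sum formula, and the Gini coefficient follows as 2·AUC − 1:

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formula; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any tie group sharing the same score
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical default scores: 1 = default, 0 = non-default
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.30, 0.35, 0.40, 0.80, 0.90]
auc = roc_auc(y_true, y_score)
gini = 2 * auc - 1  # Gini is a linear rescaling of AUC
```

In practice trainees use library implementations (e.g. from caret or scikit-learn); the point of the hand计算 above is seeing that AUC is purely a ranking metric.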
What trainees will take away, i.e. our deliverables:
- R / Python code libraries are delivered in 2 forms: as executable code inside RMarkdown files and as a 200+ page pdf/html book.
- Lecture notes
- ML textbooks
- DataCamp subscriptions for self-study through 300+ courses beyond the syllabus
- Optional: review and re-engineering of relevant Kaggle competitions
- Optional: TeamHub, the codebase & documentation knowledgebase
Pre-requisites
A mathematics or computer science background is not a requirement; trainees without prior programming experience will learn to write code during the training. We welcome business domain experts – risk managers, financial analysts, marketing & customer service professionals – to learn DS/ML.
Why Core machine learning track?
Core machine learning track might be a good match if:
- you need to build a DS/ML team that should have the skills to tackle real-life business tasks from ‘day 1’
- you are considering training employees into data scientists to grow the DS/ML team
- IFRS 9 and Basel regulations call for more sophisticated credit risk modeling than your current approach, i.e. full-fledged PD, LGD, EAD modeling is required
And if you believe that:
- it is easier and faster to train 10 business experts in DS/ML rather than 3-5 data scientists in 10 different business domains
- it is cheaper because the business domain experts are likely your employees, hence you depend less on outside hiring and rely more on internal human resources
The key reason: it saves time and money
Firstly, guided training takes trainees through the syllabus 2x faster than online ‘data scientist’ specializations.
Secondly, code re-use is efficient and effective. It will eventually save you hundreds/thousands of man-hours and thousands of US$ in labour costs once trainees have learned and start re-using the standardized code templates that come as part of this learning track.
Thirdly, you will likely depend less on hiring data scientists from outside because your business domain experts can learn these skills via in-house training and start contributing valuable data analysis insights to DS/ML projects. Cross-functional teams will speed up DS/ML adoption by adding intelligent layers of data-driven decisions across a range of operations.
Our code templates work end-to-end from raw data mining to model interpretation & selection and contain substantially more re-usable code than online courses. Data cleaning, exploration and manipulation usually take up to 80% of time in DS/ML. Those fairly simple activities are often the most expensive ones in terms of time they consume.
Shorter model-to-production time = higher ROI on DS/ML investment
Data scientists will no longer have to spend hundreds of hours writing code for nearly identical recurring tasks across projects. No time has to be spent adapting generic coding exercises from online courses or tutorials; instead, trainees start delivering production-grade solutions by re-using the code templates that they learn during the training. This frees senior data scientists to focus on more complex tasks, share advanced solutions with the DS/ML team and mentor juniors, so that more gets done with available resources.
Multiply hundreds and thousands of man-hours by the labour cost per hour – that’s how much payroll money we can help save, especially when the training comes with our TeamHub solution for code & documentation sharing.
A few more reasons
Please see our FAQ section for a more detailed overview of how Core machine learning track delivers value to your business:
- Model interpretability
- Data analytics as a by-product
- Practical implementation of real-life tasks
The Core machine learning track syllabus
1. Introduction to R programming
The learning materials:
- Online courses: 5 core and 1 optional DataCamp courses, Python equivalent courses also available
- Textbooks: none
- Ready-to-use code libraries: none
Objectives: Introduction to programming for trainees without prior programming experience
Difficulty level: Beginner
* Vectors, lists, data frames, factors
* If/Else, loops, functions, apply functions, time and dates
* Basics of R programming
* Optional: Functional programming in R
2. SQL fundamentals
The learning materials:
- Online courses: 2 core DataCamp courses, 1 optional course by Mode Analytics, 1 optional set of online SQL exercises
- Textbooks: optional, available upon request
- Ready-to-use code libraries: none
Objectives: Introduction to data manipulation with SQL, Structured Query Language, for trainees without prior experience in database management systems
Difficulty level: Beginner
* Select, Where, Between, Order by, Group by, Having
* Left join, Inner join, Semi join, Except, Intersect
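The clauses above can be rehearsed without a database server. A minimal sketch using Python's built-in sqlite3 module, with hypothetical table names and data; the LEFT JOIN keeps borrowers who have no loans, and GROUP BY aggregates per borrower:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE borrowers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE loans (borrower_id INTEGER, amount REAL);
INSERT INTO borrowers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
INSERT INTO loans VALUES (1, 100), (1, 250), (2, 400);
""")

# LEFT JOIN keeps 'Cat' (no loans); COUNT(l.amount) ignores her NULL row.
# The HAVING clause is trivially true here and shown only for its syntax.
rows = con.execute("""
SELECT b.name,
       COUNT(l.amount)            AS n_loans,
       COALESCE(SUM(l.amount), 0) AS total
FROM borrowers b
LEFT JOIN loans l ON l.borrower_id = b.id
GROUP BY b.name
HAVING COUNT(*) >= 1
ORDER BY total DESC
""").fetchall()
```

Running this yields one row per borrower with loan count and total exposure, which is exactly the shape of query trainees later reuse for feature engineering.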
3. Statistics fundamentals
The learning materials:
- Online courses: 2 core DataCamp courses, 1 core and 2 optional Stepik courses
- Textbooks: optional, available upon request
- Ready-to-use code libraries: none
Objectives: Introduction of the concept of statistical distributions, learning to draw insights, develop a hypothesis, run experiments
Difficulty level: Beginner
* Density, variance, binomial & Poisson distributions, intro to Bayesian statistics
* Hypothesis testing, confidence intervals, linear regression for testing
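As a small stdlib-Python illustration of the confidence-interval topic (hypothetical sample numbers; the Wald normal approximation to the binomial is the simplest of several interval methods):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z=1.96 gives the usual 95% interval.
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the sample proportion
    return p - z * se, p + z * se

# Hypothetical sample: 30 defaults observed among 400 loans
lo, hi = proportion_ci(30, 400)
```

So an observed 7.5% default rate on 400 loans is consistent with a true rate anywhere between roughly 4.9% and 10.1% – the kind of uncertainty reasoning the hypothesis-testing module builds on.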
4. Data manipulation, exploration and visualization
The learning materials:
- Online courses: 9 core and 2 optional DataCamp courses
- Textbooks: 2 textbooks
- Ready-to-use code libraries: 7 RMarkdown files with template code and charts
Objectives: Foundational data manipulation skills; how to analyze datasets, to effectively communicate data analysis by building powerful visualizations and interactive charts
Difficulty level: Beginner
* Importing data
* Filter, arrange, mutate, summarize, group_by, various versions of joins
* Line plot, boxplot, histogram, scatterplot
* Generation of dataset summaries
* Exploration of categorical and numerical data
* Variation. Univariate analysis of categorical and numerical variables. High cardinality. Visualization
* Covariation. Covariate analysis of categorical and numerical variables. Target profiling. Visualization
* Correlation and relationship. Correlation tables, linear regression, RMSE and R2
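In the track these verbs come from dplyr (or pandas in the Python variant); as a rough stdlib-Python sketch of the filter → group_by → summarize chain on hypothetical loan records:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical loan records: (product, amount)
loans = [
    ("mortgage", 120_000), ("mortgage", 90_000),
    ("consumer", 5_000), ("consumer", 8_000), ("consumer", 2_000),
]

groups = defaultdict(list)
for product, amount in loans:
    if amount > 1_000:                  # filter(): keep material loans only
        groups[product].append(amount)  # group_by(): bucket rows by product

# summarize(): per-group count and mean amount
summary = {p: (len(a), mean(a)) for p, a in groups.items()}
```

The dplyr equivalent is a single pipe (`loans %>% filter(...) %>% group_by(...) %>% summarize(...)`); spelling it out imperatively makes each verb's job explicit.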
5. Data transformation
The learning materials:
- Online courses: 4 core and 2 optional DataCamp courses
- Textbooks: 3 textbooks
- Ready-to-use code libraries: 11 RMarkdown files with template code
Objectives: Foundational data transformation skills; how to prepare datasets for predictive analytics and modeling
Difficulty level: Beginner
* Information Value and Weight of Evidence
* Detection and treatment of outliers and anomalies
* Detection and treatment of missing values
* Cleaning and tidying up data. String manipulation
* Manipulating categorical/factor data
* Time and dates
* Removing duplicate entries, incomplete observations, variables with zero or near-zero variance
* Discretization with standard tools
* Equal width, equal frequency, custom range discretization
* Visualization of output
* Custom discretization/binning
* Pre-processing. Dummy variables, centering, scaling, class distance calculations
* Dataset preparation. Splitting train and test datasets. Imbalanced class datasets. Generating response variables
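Information Value and Weight of Evidence from the first bullet reduce to a short formula over binned counts: WoE = ln(%good / %bad) per bin, IV = Σ (%good − %bad) · WoE. A minimal sketch with hypothetical bin counts:

```python
import math

def woe_iv(bins):
    """Weight of Evidence per bin and total Information Value.

    bins: list of (n_good, n_bad) tuples, one per bin/category.
    Assumes every bin contains at least one good and one bad observation.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woe, iv = [], 0.0
    for g, b in bins:
        pct_good = g / total_good
        pct_bad = b / total_bad
        w = math.log(pct_good / pct_bad)   # WoE of this bin
        woe.append(w)
        iv += (pct_good - pct_bad) * w     # this bin's IV contribution
    return woe, iv

# Hypothetical 3-bin variable: (goods, bads) per bin
woe, iv = woe_iv([(400, 10), (300, 30), (100, 60)])
```

A positive WoE marks a bin with proportionally more goods than bads; the IV of ~1.4 here would flag this (hypothetical) variable as strongly predictive.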
6. Feature selection
The learning materials:
- Online courses: no specific online course
- Textbooks: no specific textbooks
- Ready-to-use code libraries: 7 RMarkdown files with template code
Objectives: Identifying the most important variables or parameters that help in prediction of the outcome in model construction
Difficulty level: Intermediate
* Feature selection overview
* PCA as an alternative view at feature selection
* Feature selection based on IV and WoE
* Variable importance from ML algorithms
* Lasso, ridge and elastic net regressions
* Recursive feature elimination
* Genetic algorithms
* Simulated annealing
* Variable importance; dropout loss based on DALEX package
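The module covers 6-8 selection algorithms; as the simplest possible illustration of the underlying idea, here is a univariate filter that ranks hypothetical features by absolute Pearson correlation with the target (real selection in the track uses the richer methods listed above):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_features(features, target, top=2):
    """Univariate filter: keep the `top` features by |correlation with target|."""
    scored = [(name, abs(pearson(col, target))) for name, col in features.items()]
    scored.sort(key=lambda t: -t[1])
    return [name for name, _ in scored[:top]]

# Hypothetical features; 'debt_ratio' tracks the target, the others do not
target = [0, 0, 1, 1, 1, 0]
features = {
    "debt_ratio": [0.1, 0.2, 0.8, 0.9, 0.7, 0.3],
    "age":        [25, 40, 33, 29, 51, 44],
    "noise":      [5, 1, 4, 2, 3, 6],
}
best = rank_features(features, target, top=1)
```

Univariate filters ignore interactions between features, which is precisely why the syllabus pairs them with wrapper methods like recursive feature elimination.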
7. Supervised learning – Classification
The learning materials:
- Online courses: 2 core and 1 optional DataCamp courses
- Textbooks: 2 core textbooks and several optional textbooks upon request
- Ready-to-use code libraries: 22 RMarkdown files with template code
Objectives: Setting parameters and controls for classification models; understanding basic classification model metrics and selection criteria.
Difficulty level: Intermediate
* Train controls
* Resampling. Cross-validation, bootstrap
* Metrics. ROC/AUC, Gini
* Sensitivity and specificity. Asymmetrical costs of errors. Threshold (cut-off) levels
* Metrics. Accuracy, Precision, Recall, Kappa
Objectives: Training classification models; introduction to most commonly used classification algorithms.
* Hyper-parameter tuning
* Logistic Regression
* Decision trees
* Bagging, random forest
* Boosting
* Elastic net, Lasso, Ridge, regularization
* Support vector machines
* Neural networks. Feed-forward, MLP
* Classification with deep neural networks (Keras, Tensorflow)
* Ensemble learning
* Optional: Cost sensitive classification
* Optional: Multi-class & multi-label classification
Objectives: Detection of model misfit, model diagnostics and selection; how to select the right models using a range of metrics; special emphasis is placed on profound understanding of these fundamentally important concepts of machine learning.
* Bias-variance trade-off. Overfitting and underfitting. Learning curves
* Diagnostics and selection of classification models
Objectives: Model interpretability; how to interpret models’ decisions on any given observation for algorithms that are generally considered black-boxes.
* Model decision interpretation with LIME
* Model decision interpretation with Shapley values
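The pieces of this chapter – a training loop, a cut-off, and confusion-matrix metrics – fit together in a few lines. The sketch below uses hypothetical toy data and plain Python rather than the caret/Keras stack taught in the track; it fits a one-feature logistic regression by batch gradient descent, then computes sensitivity and specificity at a 0.5 threshold:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """One-feature logistic regression fitted with plain batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical data: higher debt ratio -> default (1) more likely
xs = [0.10, 0.20, 0.30, 0.60, 0.70, 0.90]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)

# Confusion-matrix metrics at a 0.5 cut-off
preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
tp = sum(1 for p, y in zip(preds, ys) if p == 1 and y == 1)
tn = sum(1 for p, y in zip(preds, ys) if p == 0 and y == 0)
sensitivity = tp / ys.count(1)  # true positive rate
specificity = tn / ys.count(0)  # true negative rate
```

Moving the 0.5 cut-off trades sensitivity against specificity – the asymmetrical-cost discussion above is about choosing that trade-off deliberately rather than by default.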
8. Supervised learning – Regression
The learning materials:
- Online courses: 2 core DataCamp courses
- Textbooks: 2 core textbooks and several optional textbooks upon request
- Ready-to-use code libraries: 7 RMarkdown files with template code
Objectives: Training regression models with a range of algorithms, model diagnostics and selection.
Difficulty level: Intermediate
* Bagging, random forest
* Boosting
* Elastic net, Lasso, Ridge, regularization
* Support vector machines
* Neural networks. Feed forward, MLP
* Diagnostics and selection of regression models
* Regressions with deep neural networks (Keras, Tensorflow)
Objectives (optional): Survival analysis as a sub-class of regression models
* Optional. Survival analysis
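The mechanics behind regression diagnostics can be shown compactly. A sketch of one-variable ordinary least squares with RMSE on hypothetical data (the track itself trains far richer models, from random forests to deep networks):

```python
import math

def fit_ols(xs, ys):
    """Closed-form ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def rmse(ys, preds):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

# Hypothetical pairs: loan amount (scaled) vs. a quantitative default measure
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]
a, b = fit_ols(xs, ys)
err = rmse(ys, [a + b * x for x in xs])
```

RMSE on held-out data, compared against training RMSE, is one of the diagnostics trainees use to detect the over/underfitting discussed in the classification chapter.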
9. The practice-based capstone project
Credit risk modeling is the base learner in the practice-based capstone project.
The learning materials:
- Online courses: no specific online course
- Textbooks: no specific textbooks
- Ready-to-use code libraries: 7 RMarkdown and SQL files with template code
Objectives: Disclosure and review of the SQL code
Difficulty level: Intermediate
* Datawarehouse model and SQL code
* Feature engineering in SQL -- running on customer's data
Objectives: Practice-based modeling using several methods and target variables for a holistic credit risk assessment
* Classic binary classification *Good/Bad* model
* Regression modeling of credit risk with non-binary response variables, e.g. quantitative measures of exposure and default
* 'What-If' analysis: predicting borrowing behaviour and effect on credit risk
* Optional. Survival analysis: predicting time to default
* Optional. Comparison and combination of segmented classification models
Objectives: Include reject inference analysis into the toolbox. Optional but highly recommended.
* Reject inference analysis of approved and rejected loan applications for model re-calibration and fine-tuning
Project finalization
* Combination of the methods above into a unified credit risk model
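As a back-of-the-envelope illustration of the borrower-as-a-single-entity view mentioned earlier (probability of default on any of the borrower's outstanding loans): if loan-level PDs were independent – a simplifying assumption for illustration only, not the capstone's modeling approach – the borrower-level PD would be:

```python
def borrower_pd(loan_pds):
    """P(default on ANY loan) = 1 - P(no default on every loan).

    Assumes loan-level PDs are independent, which is an illustrative
    simplification; real borrower-level risk is modeled jointly.
    """
    p_no_default = 1.0
    for p in loan_pds:
        p_no_default *= 1.0 - p
    return 1.0 - p_no_default

# Hypothetical borrower with two outstanding loans
pd_any = borrower_pd([0.05, 0.10])
```

Note that the aggregate PD (14.5% here) always exceeds the largest single-loan PD, which is one reason borrowers with 2+ loans form a higher-risk category in the syllabus.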
The Core machine learning track summary
* 9 chapters, 200+ pages book
* 20+ DataCamp courses @ ~5 hours per course, plus 9 optional courses
* 50-60 hours of guided labs @ 2 hours per session, ~2 sessions per week, over 12-14 weeks
* Standard program duration is 14-15 weeks @15 hours per week spent by trainees
* Actual training duration may vary depending on the trainees' prior programming experience and the number of optional topics chosen from the syllabus
Useful links
A few links to the leading online data science skill tracks and programs.
- DataCamp – Data Scientist with R
- DataCamp – Data Scientist with Python
- Udacity – Data Scientist
- DataQuest – Data Scientist In Python
- University of California, San Diego – MicroMasters Program in Data Science
- MIT – MicroMasters Program in Statistics and Data Science
- MIPT (МФТИ) – Machine Learning and Data Analysis
- Open Data Science ML course