
1 ML in the business landscape and data collection

1.1 Machine learning (ML)

In short, a machine learning problem generally relates to making a prediction when data is available. Machine learning is the science (and art) of programming computers so they can learn from data (Géron 2023). It is about extracting knowledge from data: a research field at the intersection of statistics, artificial intelligence, and computer science, also known as predictive analytics or statistical learning (Müller and Guido 2016). Machine learning involves programming, but only some programming problems require ML. To detect whether the problem we are facing is an ML problem, we need a clear objective and an understanding of the benefit for the company (or client) we are working with (Burton and Shah 2013).

In the following paragraphs, we describe some examples. In doing so, we use terminology that may sound unfamiliar to you; we will cover it in later chapters.

Understanding the goal allows us to determine what kind of data we expect to handle and which models we could apply. Suppose we are facing a problem in the housing market, and the business objective is to detect investment opportunities, such as buying undervalued (underpriced) houses, by predicting housing prices. In this example, we would expect housing data such as location (latitude and longitude), housing median age, total rooms, etc. In that case, we may apply supervised models such as linear regression and evaluate model performance with the RMSE. Another example, from the financial sector, could be predicting whether a new bank customer will default on a loan (will repay it or not). That would be a classification problem, and we could use models such as the logit or LDA and measure performance with a confusion matrix.

On the other hand, in the same housing-prices example, if we already have the information mentioned above and we want to know how crime affects prices in some areas, we could address that with a linear regression, but it would not be an ML problem; it would be a causality problem. It would be an ML problem if we wanted to predict house prices in areas where crime has increased. Even then, it would not necessarily benefit our client or us. For example, if the client is a housing builder, the prediction could help decide where to build. But if the client is unaffected by the relationship between crime and house prices, it would still be an ML problem, yet one that does not benefit the client or us.

In conclusion, before handling data and running algorithms, we suggest establishing a business goal, determining whether it is an ML problem, and assessing whether solving it would benefit the company (or client).

1.2 Difference between machine learning and the causality approach

1.3 Business objectives and data sources

In this book, we will use two cross-sectional data sets, each consisting of a sample taken at a given point in time: i) house pricing, a sample of houses; ii) credit analysis, a sample of bank clients. For time-series objectives, which consist of observations on one or several variables over time, see chapters four and five of (Bernal 2023).

1.4 Variables terminology and notation

Many of the models we will cover in this book are of this kind:

y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+...+\beta_{n}x_{n}+e

where y is called the dependent variable (in other materials also the explained variable, output variable, or response variable). The x's are called the independent variables, input variables, predictors, or features. The \beta's are the parameters to be estimated, and e is the error term.

In regression, the idea is to estimate the parameters \beta_{0}, \beta_{1},...,\beta_{n} in order to predict the value of y. Once estimated, we represent the predicted values and estimated parameters as:

\hat{y}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{1}+\hat{\beta}_{2}x_{2}+...+\hat{\beta}_{n}x_{n}

Also, if we compare y_{i} with its predicted value \hat{y}_{i}, we call the difference the residual, usually denoted by \hat{e}_{i}. It is defined as:

\hat{e}_{i}=y_{i}-\hat{y}_{i}=y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{1i}-\hat{\beta}_{2}x_{2i}-...-\hat{\beta}_{n}x_{ni}

In other words, each \hat{e}_{i} is the residual of one observation. For example, if we have a data set with n variables and m observations, there will be m residuals.
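To make this notation concrete, the short R sketch below uses a small simulated data set (purely illustrative, not the book's housing or credit data), fits a linear regression with lm(), and extracts the fitted values \hat{y}_{i} and residuals \hat{e}_{i}:

# Illustrative simulated data (not the book's data sets)
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(100) # generated with known parameters
fit <- lm(y ~ x1 + x2)        # estimates beta_0, beta_1 and beta_2
y_hat <- fitted(fit)          # predicted values
e_hat <- residuals(fit)       # residuals: y minus y-hat
head(cbind(y, y_hat, e_hat))  # one residual per observation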

1.4.1 House pricing

In this case, we are facing a problem in the housing market, and the business objective is to detect investment opportunities, such as buying undervalued (underpriced) houses, by predicting housing prices.

We can get the data from GitHub:

house<-read.csv("https://raw.githubusercontent.com/abernal30/ml_book/main/housing.csv")
str(house)
#> 'data.frame':    584 obs. of  52 variables:
#>  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ SalePrice    : int  181500 223500 140000 250000 307000 129900 118000 345000 279500 325300 ...
#>  $ MSSubClass   : int  20 60 70 60 20 50 190 60 20 60 ...
#>  $ MSZoning     : int  4 4 4 4 4 5 4 4 4 4 ...
#>  $ LotFrontage  : int  80 68 60 84 75 51 50 85 91 101 ...
#>  $ LotArea      : int  9600 11250 9550 14260 10084 6120 7420 11924 10652 14215 ...
#>  $ LotShape     : int  4 1 1 1 4 4 4 1 1 1 ...
#>  $ LotConfig    : int  3 5 1 3 5 5 1 5 5 1 ...
#>  $ Neighborhood : int  25 6 7 14 21 18 4 16 6 16 ...
#>  $ Condition1   : int  2 3 3 3 3 1 1 3 3 3 ...
#>  $ BldgType     : int  1 1 1 1 1 1 2 1 1 1 ...
#>  $ HouseStyle   : int  3 6 6 6 3 1 2 6 3 6 ...
#>  $ OverallQual  : int  6 7 7 8 8 7 5 9 7 8 ...
#>  $ OverallCond  : int  8 5 5 5 5 5 6 5 5 5 ...
#>  $ YearRemodAdd : int  1976 2002 1970 2000 2005 1950 1950 2006 2007 2006 ...
#>  $ RoofStyle    : int  2 2 2 2 2 2 2 4 2 2 ...
#>  $ Exterior1st  : int  9 13 14 13 13 4 9 15 13 13 ...
#>  $ MasVnrType   : int  3 2 3 2 4 3 3 4 4 2 ...
#>  $ MasVnrArea   : int  0 162 0 350 186 0 0 286 306 380 ...
#>  $ ExterQual    : int  4 3 4 3 3 4 4 1 3 3 ...
#>  $ ExterCond    : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ Foundation   : int  2 3 1 3 3 1 1 3 3 3 ...
#>  $ BsmtQual     : int  3 3 4 3 1 4 4 1 3 1 ...
#>  $ BsmtExposure : int  2 3 4 1 1 4 4 4 1 1 ...
#>  $ BsmtFinType1 : int  1 3 1 3 3 6 3 3 6 6 ...
#>  $ BsmtFinSF1   : int  978 486 216 655 1369 0 851 998 0 0 ...
#>  $ BsmtUnfSF    : int  284 434 540 490 317 952 140 177 1494 1158 ...
#>  $ HeatingQC    : int  1 1 3 1 1 3 1 1 1 1 ...
#>  $ CentralAir   : int  2 2 2 2 2 2 2 2 2 2 ...
#>  $ Electrical   : int  5 5 5 5 5 2 5 5 5 5 ...
#>  $ X1stFlrSF    : int  1262 920 961 1145 1694 1022 1077 1182 1494 1158 ...
#>  $ X2ndFlrSF    : int  0 866 756 1053 0 752 0 1142 0 1218 ...
#>  $ BsmtFullBath : int  0 1 1 1 1 0 1 1 0 0 ...
#>  $ BsmtHalfBath : int  1 0 0 0 0 0 0 0 0 0 ...
#>  $ FullBath     : int  2 2 1 2 2 2 1 3 2 3 ...
#>  $ HalfBath     : int  0 1 0 1 0 0 0 0 0 1 ...
#>  $ BedroomAbvGr : int  3 3 3 4 3 2 2 4 3 4 ...
#>  $ KitchenQual  : int  4 3 3 3 3 4 4 1 3 3 ...
#>  $ TotRmsAbvGrd : int  6 6 7 9 7 8 5 11 7 9 ...
#>  $ Fireplaces   : int  1 1 1 1 1 2 2 2 1 1 ...
#>  $ FireplaceQu  : int  5 5 3 5 3 5 5 3 3 3 ...
#>  $ GarageType   : int  2 2 6 2 2 6 2 4 2 4 ...
#>  $ GarageYrBlt  : int  1976 2001 1998 2000 2004 1931 1939 2005 2006 2005 ...
#>  $ GarageFinish : int  2 2 3 2 2 3 2 1 2 2 ...
#>  $ GarageArea   : int  460 608 642 836 636 468 205 736 840 853 ...
#>  $ PavedDrive   : int  3 3 3 3 3 3 3 3 3 3 ...
#>  $ WoodDeckSF   : int  298 0 0 192 255 90 0 147 160 240 ...
#>  $ OpenPorchSF  : int  0 42 35 84 57 0 4 21 33 154 ...
#>  $ MoSold       : int  5 9 2 12 8 4 1 7 8 11 ...
#>  $ YrSold       : int  2007 2008 2006 2008 2007 2008 2008 2006 2007 2006 ...
#>  $ SaleType     : int  9 9 9 9 9 9 9 7 7 7 ...
#>  $ SaleCondition: int  5 5 1 5 5 1 5 6 6 6 ...

We have housing data such as the sale price (SalePrice), lot area, overall quality and condition, the year remodeled, the number of rooms, etc. We will apply the following model:

SalePrice=\beta_{0}+\beta_{1}LotArea+\beta_{2}OverallQual+...+\beta_{n}variable_{n}+e
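As a preview (a minimal sketch, not the specification we develop later in the book), such a model can be fitted in R with lm(), regressing SalePrice on all the remaining columns; the Id column is dropped because it is only a row identifier:

# Sketch: linear regression of SalePrice on every other column, excluding the Id identifier
house_fit <- lm(SalePrice ~ . - Id, data = house)
head(coef(house_fit)) # a few of the estimated parameters (the beta-hats)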

1.4.2 The credit analysis

In the credit analysis case, we are interested in predicting whether a new bank customer will default on a loan (that is, repay it or not). It is a classification problem, and we will use models such as the logit or LDA. We can also get the data from GitHub.

credit<-read.csv("https://raw.githubusercontent.com/abernal30/ml_book/main/credit.csv")

The data set has historical information from Lendingclub (https://www.lendingclub.com/), a fintech marketplace bank. The original data set has at least 2 million observations and 150 variables; here you will find only 873 observations (rows) and 71 columns. Each row represents a Lendingclub client. We previously cleaned the data (missing values, correlated variables, and zero- and near-zero-variance predictors).
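As a rough illustration of this kind of check (a sketch that assumes the caret package is installed; it is not the exact cleaning procedure used to prepare this file):

library(caret)                    # assumed available, for nearZeroVar()
colSums(is.na(credit))            # number of missing values per column
nearZeroVar(credit, names = TRUE) # columns flagged as zero- or near-zero-variance predictors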

str(credit[,1:5])
#> 'data.frame':    873 obs. of  5 variables:
#>  $ Default    : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
#>  $ term       : int  1 1 2 2 1 1 1 1 1 1 ...
#>  $ installment: num  123 820 433 290 405 ...
#>  $ grade      : int  3 3 2 6 3 2 2 1 2 3 ...
#>  $ emp_title  : num  299 209 623 126 633 636 481 540 631 314 ...

For this case, the model will be:

Default=\beta_{0}+\beta_{1}term+\beta_{2}grade+...+\beta_{n}variable_{n}+e

The variable “Default”, which originally had the name “loan_status”, has two labels:

table(credit[,"Default"])
#> 
#> Charged Off  Fully Paid 
#>         145         728

“Charge off” means that the credit grantor wrote the account off its receivables as a loss, and the account is closed to future charges. When an account displays a status of “charge off,” it is closed to future use, although the customer still owes the debt. For this example, we will consider Charged Off as equivalent to default and Fully Paid as no default.
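Because the labels are imbalanced, it can also help to look at proportions rather than raw counts (an illustrative one-liner, not a step we rely on later):

prop.table(table(credit[,"Default"])) # share of Charged Off vs. Fully Paid clients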

The previous output shows that the “Default” variable class is “character,” and a function we will apply below only accepts numeric or factor variables, so we transform that variable into a “factor”:

credit[,"Default"]<-factor(credit[,"Default"])

1.5 Take a Quick Look at the Data Structure

Most ML literature suggests looking at the data structure to see issues, such as numerical or categorical variables, missing values, etc. Because several books cover that, we will not cover it in this book. I suggest chapter two of the book (Bernal 2023). However, why do we include this section in this book? We expect this book to be an introductory guide for ML. Then, the book’s structure is the steps to develop a ML analysis without being redundant with other materials.