By Rendani Mbuvha
This notebook will provide all the interactive material for the MasterClass: Predictive Analytics in Insurance workshop. The workshop will cover:
The workshop will be based on data from the Computational Actuarial Science Package
Workshop attendees are requested to bring their own laptops with the programmes and packages below installed.
First of all, we need to install R, R studio, and the required packages.
In R, install the following libraries:
Using install.packages(“CASdatasets”, repos = “http://cas.uqam.ca/pub/R/”, type=“source”)
Use the command on the R console install.packages(“package-name”)
While the workshop will briefly touch on some basics of R. R will mainly be used as an analysis tool. Some good introductory R material can be found here:
Radford Neal’s R Course (Browsing through weeks 1-4 is sufficient)
Pre-made libraries are the most efficient way to work in R. This allows one to use already defined functions in-order to run functions or fit models they require.
The statement below will load the library CASdatasets which contains most of the datasets that we will use in this workshop.
Set working directory to the folder where you would to save your work.
setwd("~/Dropbox/Masterclass")
#for windows
#dir.create("C:/Documents/Masterclass")
#setwd("C:/Documents/Masterclass")
rm(list=ls()) #clear workspace
library(knitr)
library(CASdatasets) #mainly datasets
library(ROCR)
library(rpart.plot)
library(rpart)
library(caret)
library(h2o)
library(pROC)
library(corrplot)
How to find help
?help
The vector is the basic unit of computations in R
x = c(1,3,2,5)
x
x = c(1,6,2)
x
Adding vectors
y = c(1,4,3)
length(x)
length(y)
x+y
Listing and removing variables to the workspace
ls()
rm(x,y)
ls()
Creating matrices
?matrix #calling help webpage
x=matrix(data=c(1,2,3,4), nrow=2, ncol=2)
x
Some mathematical functions
matrix(c(1,2,3,4),2,2,byrow=TRUE)
sqrt(x)
x^2
#Random Sample from Standard Normal
x=rnorm(50)
y=x+rnorm(50,mean=50,sd=.1)
#Correlation Matrix
cor(x,y)
set.seed(1303)
rnorm(50)
set.seed(3)
y=rnorm(100)
mean(y)
var(y)
sqrt(var(y))
sd(y)
# Graphics
x=rnorm(500)
y=rnorm(500)
plot(x,y)
plot(x,y,xlab="x-axis",ylab="y-axis",main="Plot of X vs Y")
pdf("Figure.pdf")
plot(x,y,col="green")
dev.off()
# Indexing Data
A=matrix(1:16,4,4)
A
A[2,3]
A[c(1,3),c(2,4)]
A[1:3,2:4]
A[1:2,]
A[,1:2]
A[1,]
A[-c(1,3),]
A[-c(1,3),-c(1,3,4)]
dim(A)
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
Download it here
# Loading Data from csv
Bank = read.csv("bank-full.csv",header=T,sep = ";")
dim(Bank)
head(Bank)
Removing Nulls
Bank=na.omit(Bank)
Number of dimensions
dim(Bank)
Names of variables
names(Bank)
Find Correlations
numerical_vars = sapply(Bank, is.numeric)
corrm = cor(Bank[ , numerical_vars], use = "complete.obs")
corrplot(corrm, method = "circle", tl.cex = 0.5)
ggplot(Bank, aes(x=education, y = age, fill=education)) +
geom_boxplot()
ggplot(Bank, aes(x = balance, fill = education)) +
geom_density()+xlim(-10,10000)
ggplot(Bank, aes(balance)) +
geom_density() +
facet_wrap(c("education",'y'), labeller = "label_both", scales = "free")+xlim(-1000,10000)
Here we will explore performing statistical inference on data sets using Traditional frequentist methods.
Load package that contains functions for fitting parametric distributions to data.
library(fitdistrplus)
Lets explore fitting some distributions to the itamtplcost which contains large claims (>500 Euro) of Italian Motor-TPL company since 1997.To find out more about the data set use the “help” function.
help(itamtplcost)
data("itamtplcost") #load the dataset
head(itamtplcost) #see some of the top rows
summary(itamtplcost$UltimateCost) #lets see some summmary statistics of the Ultimate costs
Lets fit the Gamma distribution to the data using maximum likelihood.
x = itamtplcost$UltimateCost / 1000000 # Creating a separate scaled variable for claims cost
gamma_mle = fitdist(x, "gamma", method = "mle")
plot(gamma_mle) #Plots the density function and some diagnostic plots
summary(gamma_mle) #get a summary of the fit including the estimated parameters
Consider what would happen if you fit a normal distribution instead
norm_mle = fitdist(x, "norm", method = "mle")
plot(norm_mle) #Plots the density function and some diagnostic plots
summary(norm_mle)
Clearly in this case the Gamma distribution is a better fit looking the higher values of the log likelihood, AIC and BIC values. The diagnostic plots also concur. One can also change the method of estimation from maximum likelihood to method of moments by changing the method parameter to “mme”.
Remember that AIC = \(-\frac{2}{N} * loglik + 2\frac{d}{N}\)
One can also check for outliers using boxplots
boxplot(x, col = "green", main = "Boxplot of Ultimate Claims Cost")
We will now show how one can perform regression using the freMTPLfreq data set which contains risk features and claim numbers of around 413 169 of a French insurer.
help(freMTPLfreq)
data(freMTPLfreq)
head(freMTPLfreq) #see some of the first rows
summary(freMTPLfreq)
One can create some bins for continuous variables like DriveAge and CarAge using the cut function.
freMTPLfreq$DriverAge_bin = cut(freMTPLfreq$DriverAge,breaks = c(0,25,30,40,50,60,70,Inf),right = FALSE) # This "cuts" DriverAge into bins
freMTPLfreq$CarAge_bin = cut(freMTPLfreq$CarAge,breaks = c(0,2,5,10,15,Inf),right = FALSE) # This "cuts" DriverAge into bins
freMTPLfreq$Density_bin = cut(freMTPLfreq$Density,breaks = c(0,50,100,500,1500,5000,10000,Inf),right = FALSE) # This "cuts" DriverAge into bins
summary(freMTPLfreq)
Lets build a regression model to predict the number of claims per unit exposure given the risk characteristics. Assuming that the number of claims follows a Poisson distribution.
Lets assume that we don’t know whether to use the continuous variables above of the binned ones
pois_reg = glm(ClaimNb ~ Gas+CarAge+DriverAge+Density+Brand+CarAge_bin+DriverAge_bin+Density_bin,offset = log(Exposure), data = freMTPLfreq,family = poisson(link=log))
summary(pois_reg)
We see that some of the variables are not significant perhaps we could perform some variable selection to see which combinations of variables yields the best model.
We do this using stepwise regression. Starting from the “full model” and iteratively removing variables that do not contribute to model fit. Lets use the stepwise method to choose between the continuous and discrete binned variables.
pois_null =glm(ClaimNb ~ 1 ,offset = log(Exposure), data = freMTPLfreq,family = poisson(link=log))
step(pois_null, scope=list(lower=pois_null, upper=pois_reg), direction="forward")
From the above a prudent model would therefore be as shown below. Note that the AIC value is lower compared to the over parameterised model in indicating that there is no predictive value in keeping both types of variables.
Also note that there might some factor levels that are insignificant, this typically means that they should re-group together to create new factor levels. For example the Brand variable can be re-grouped into “BrandJapanese (except Nissan) or Korean” and “other”
pruned_reg =glm(ClaimNb ~ Gas+Brand+CarAge_bin+DriverAge_bin+Density_bin,offset = log(Exposure), data = freMTPLfreq,family = poisson(link=log))
summary(pruned_reg)
This section will now focus on supervised learning techniques.
Here we will be using the German Credit Dataset. This dataset contains information of 1,000 credit records.
Look at the data and see what it?s about:
data("credit")
help("credit")
summary of the variables
summary(credit)
checking_status duration credit_history purpose credit_amount savings
A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250 A61:603
A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366 A62:103
A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320 A63: 63
A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271 A64: 48
3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972 A65:183
Max. :72.0 A46 : 50 Max. :18424
(Other): 55
employment installment_rate personal_status other_parties residence_since
A71: 62 Min. :1.000 A91: 50 A101:907 Min. :1.000
A72:172 1st Qu.:2.000 A92:310 A102: 41 1st Qu.:2.000
A73:339 Median :3.000 A93:548 A103: 52 Median :3.000
A74:174 Mean :2.973 A94: 92 Mean :2.845
A75:253 3rd Qu.:4.000 3rd Qu.:4.000
Max. :4.000 Max. :4.000
property_magnitude age other_payment_plans housing existing_credits
A121:282 Min. :19.00 A141:139 A151:179 Min. :1.000
A122:232 1st Qu.:27.00 A142: 47 A152:713 1st Qu.:1.000
A123:332 Median :33.00 A143:814 A153:108 Median :1.000
A124:154 Mean :35.55 Mean :1.407
3rd Qu.:42.00 3rd Qu.:2.000
Max. :75.00 Max. :4.000
job num_dependents telephone foreign_worker class
A171: 22 Min. :1.000 A191:596 A201:963 Min. :0.0
A172:200 1st Qu.:1.000 A192:404 A202: 37 1st Qu.:0.0
A173:630 Median :1.000 Median :0.0
A174:148 Mean :1.155 Mean :0.3
3rd Qu.:1.000 3rd Qu.:1.0
Max. :2.000 Max. :1.0
Lets do some processing
credit$class = factor(credit$class)
age_d = quantile(credit$age,seq(0,1,by=0.1)) #discretise age into deciles
credit$age=cut(credit$age,age_d)
Create and training - test data split
#convert class into discrete
n_records_gc = nrow(credit)
n_train_gc = round(0.7*n_records_gc,0)
training_ids = sample(1:n_records_gc,n_train_gc)
testing_ids = setdiff(1:n_records_gc,training_ids)
#Creating subsets.
training_gc= credit[training_ids,]
testing_gc = credit[testing_ids,]
head(training_gc)
checking_status duration credit_history purpose credit_amount savings employment
391 A14 18 A32 A40 1820 A61 A73
503 A13 9 A32 A43 1126 A62 A75
991 A14 12 A34 A46 3565 A65 A72
203 A14 27 A33 A49 5117 A61 A74
197 A14 6 A34 A43 1382 A61 A73
714 A11 9 A34 A43 1138 A61 A73
installment_rate personal_status other_parties residence_since property_magnitude
391 2 A94 A101 2 A122
503 2 A91 A101 4 A121
991 2 A93 A101 1 A122
203 3 A93 A101 4 A123
197 1 A92 A101 1 A123
714 4 A93 A101 4 A121
age other_payment_plans housing existing_credits job num_dependents telephone
391 (28,30] A143 A152 1 A174 1 A192
503 (45,52] A143 A152 1 A173 1 A191
991 (36,39] A143 A152 2 A172 2 A191
203 (23,26] A143 A152 2 A173 1 A191
197 (26,28] A143 A152 2 A173 1 A192
714 (23,26] A143 A152 2 A172 1 A191
foreign_worker class
391 A201 0
503 A201 0
991 A201 0
203 A201 0
197 A201 0
714 A201 0
What is the balance of classes in the different data splits
print("Full Dataset")
[1] "Full Dataset"
table(credit$class)
0 1
700 300
print("Training Dataset")
[1] "Training Dataset"
table(training_gc$class)
0 1
480 220
print("Testing Dataset")
[1] "Testing Dataset"
table(testing_gc$class)
0 1
220 80
glm.logistic = glm(class ~ checking_status+duration+purpose+other_parties+savings, data = training_gc, family = binomial(link = "logit"))
summary(glm.logistic)
Call:
glm(formula = class ~ checking_status + duration + purpose +
other_parties + savings, family = binomial(link = "logit"),
data = training_gc)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1502 -0.7953 -0.4585 0.9141 2.6274
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.407275 0.259131 -1.572 0.116022
checking_statusA12 -0.243229 0.227911 -1.067 0.285877
checking_statusA13 -1.399152 0.412669 -3.390 0.000698 ***
checking_statusA14 -1.753380 0.252553 -6.943 3.85e-12 ***
duration 0.039677 0.007906 5.019 5.20e-07 ***
purposeA41 -1.087781 0.372314 -2.922 0.003482 **
purposeA410 -0.916395 0.762178 -1.202 0.229233
purposeA42 -0.226305 0.274347 -0.825 0.409436
purposeA43 -0.695570 0.265585 -2.619 0.008819 **
purposeA44 0.626872 0.902953 0.694 0.487528
purposeA45 -0.126148 0.649915 -0.194 0.846099
purposeA46 0.710168 0.409705 1.733 0.083031 .
purposeA48 -14.323999 552.625543 -0.026 0.979321
purposeA49 -0.371833 0.359075 -1.036 0.300421
other_partiesA102 0.705302 0.473324 1.490 0.136197
other_partiesA103 -0.904446 0.415119 -2.179 0.029349 *
savingsA62 -0.131973 0.304776 -0.433 0.665004
savingsA63 -0.909127 0.488428 -1.861 0.062697 .
savingsA64 -0.976342 0.530072 -1.842 0.065489 .
savingsA65 -0.751079 0.298788 -2.514 0.011945 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 871.48 on 699 degrees of freedom
Residual deviance: 705.91 on 680 degrees of freedom
AIC: 745.91
Number of Fisher Scoring iterations: 14
Test the model fit on testing set
glm_pred= predict(glm.logistic, newdata = testing_gc, type = 'response') # use the model to #make predictions on the testing set.
Evaluate model performance
#Create a prediction object
glm.prediction = prediction(glm_pred, testing_gc$class)
#infer True positive and false positive rates
glm.perf = performance(glm.prediction, measure = "tpr", x.measure = "fpr")
#accuracy and area under the curve
glm.acc = performance(glm.prediction, measure = "acc")
glm.auc = performance(glm.prediction, measure = "auc")
Plot accuracy at different cutoffs
plot(glm.acc,main=" GLM Model Accuracy by Cutoff",lwd=3,col=rainbow(10))
Using Decision Trees
tree = rpart(class~ checking_status+duration+purpose+other_parties+savings, data = training_gc)
#make predictions on testing set
tree_pred = predict(tree,testing_gc)[,2]
Visualise the Tree Created
rpart.plot(tree)
rpart.rules(tree)
class
0.13 when checking_status is A13 or A14
0.21 when checking_status is A11 or A12 & duration is 12 to 23 & purpose is A40 or A410 or A42 or A45 & savings is A62 or A63 or A64
0.27 when checking_status is A11 or A12 & duration < 12 & purpose is A40 or A410 or A42 or A45
0.28 when checking_status is A11 or A12 & duration < 32 & purpose is A41 or A43 or A48 or A49
0.30 when checking_status is A11 or A12 & duration is 23 to 32 & purpose is A40 or A410 or A42 or A44 or A45 or A46 & savings is A62 or A63 or A64
0.37 when checking_status is A11 or A12 & duration is 12 to 23 & purpose is A410 or A42 & savings is A61 or A65
0.61 when checking_status is A11 or A12 & duration is 12 to 23 & purpose is A40 or A45 & savings is A61 or A65
0.65 when checking_status is A11 or A12 & duration >= 32
0.76 when checking_status is A11 or A12 & duration is 23 to 32 & purpose is A40 or A410 or A42 or A44 or A45 or A46 & savings is A61 or A65
0.79 when checking_status is A11 or A12 & duration < 23 & purpose is A44 or A46
initialize h2o
localH2O = h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 days 1 hours
H2O cluster timezone: Africa/Nairobi
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.2
H2O cluster version age: 2 months and 12 days
H2O cluster name: H2O_started_from_R_rendanimbuvha_cva598
H2O cluster total nodes: 1
H2O cluster total memory: 1.60 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.1 (2018-07-02)
h2o.removeAll(timeout_secs = 30)
[1] 30
Convert Dataframes to h2o format
training_gc.hex = as.h2o(training_gc)
|
| | 0%
|
|===================================================================================| 100%
test_gc.hex = as.h2o(testing_gc)
|
| | 0%
|
|===================================================================================| 100%
predictors =c( "checking_status","duration","credit_history","purpose","credit_amount","savings","employment","installment_rate","personal_status","other_parties" , "residence_since","property_magnitude", "age","other_payment_plans", "housing", "existing_credits","job","num_dependents","telephone","foreign_worker")
target ="class"
Fit Random Forest
rf_fit = h2o.randomForest(
training_frame = training_gc.hex, nfolds = 2,
keep_cross_validation_predictions = TRUE, x = predictors, y = target,
ntrees = 100,
stopping_rounds = 2,
score_each_iteration = TRUE
)
|
| | 0%
|
|= | 1%
|
|===================================================================================| 100%
#Make predictions for the testing set
rf_pred = predict(rf_fit, test_gc.hex)
|
| | 0%
|
|===================================================================================| 100%
rf_pred_vec = as.vector(rf_pred$p1)
Obtain Variable importance Plot
h2o.varimp_plot(rf_fit)
Fit xgboost which is just a computationally efficient gradient boosting with penalty for more complex trees
xgboost_fit = h2o.xgboost(
training_frame = training_gc.hex, nfolds = 5,
keep_cross_validation_predictions = TRUE, x = predictors, y = target,
ntrees = 100,
stopping_rounds = 2,
score_each_iteration = TRUE
)
|
| | 0%
|
|============================ | 34%
|
|======================================== | 48%
|
|============================================================ | 72%
|
|====================================================================== | 84%
|
|============================================================================ | 92%
|
|===================================================================================| 100%
#Make predictions for the testing set
xgb_pred = predict(xgboost_fit, test_gc.hex)
|
| | 0%
|
|===================================================================================| 100%
xgb_pred_vec = as.vector(xgb_pred$p1)
h2o.varimp_plot(xgboost_fit)
gbm_fit = h2o.gbm(
training_frame = training_gc.hex,
nfolds = 2,
keep_cross_validation_predictions = TRUE,
x = predictors,
y = target
)
#Make predictions for the testing set
gbm_pred = predict(gbm_fit, test_gc.hex)
gbm_pred_vec = as.vector(gbm_pred$p1)
h2o.varimp_plot(gbm_fit)
nn_fit = h2o.deeplearning(x = predictors,
y = target,
training_frame = training_gc.hex,
nfolds = 5,
distribution = "AUTO",
activation = "TanhWithDropout",
hidden = c(50, 50, 50),
input_dropout_ratio = 0.1,
sparse = TRUE,
l1 = 0.001,
epochs = 1000,
keep_cross_validation_predictions = TRUE,
variable_importances = TRUE)
#quantile_alpha = 0.8)
#Make predictions for the testing set
nn_pred = predict(nn_fit, test_gc.hex)
nn_pred_vec = as.vector(nn_pred$p1)
h2o.varimp_plot(nn_fit)
roc_glm = plot.roc(testing_gc$class, glm_pred, main="ROC Comparison", col="black",auc=TRUE)
roc_tree = lines.roc(testing_gc$class, tree_pred, col="red")
roc_rf = lines.roc(testing_gc$class, rf_pred_vec, col="blue")
roc_xgb = lines.roc(testing_gc$class, xgb_pred_vec, col="green")
roc_gbm = lines.roc(testing_gc$class, gbm_pred_vec, col="orange")
roc_nn = lines.roc(testing_gc$class, nn_pred_vec, col="purple")
legend("bottomright", legend=c(paste("GLM AUC = ", round(auc(roc_glm),2)), paste("D TREE AUC = ", round(auc(roc_tree),2)), paste("RF AUC = ", round(auc(roc_rf),2)), paste("XGB AUC = ", round(auc(roc_xgb),2)), paste("GBM AUC = ", round(auc(roc_gbm),2)),paste("NN AUC = ", round(auc(roc_nn),2))), col=c("black", "red", "blue", "green","orange", "purple"), lwd=3, bty = "n")
The models above seem to give an okay performance but rather these were obtained using “Default” hyperparameters which are parameters of parameters of the model. One can improve this search for the right hyperparamters. This can be computationally intensive and is done by greedily or randomly searching the hyperparameter space.
Tuning the GBM
#we are going to create a list of hyperparameters to search for
ntrees_opt = c(100)
learn_rate_opt = seq(0.01, 0.2, 0.01)
max_depth_opt = seq(3, 10)
sample_rate_opt = c(0.8, 1.0)
col_sample_rate_opt = c(0.6, 0.8, 1)
gbm_hyper_params = list(ntrees = ntrees_opt,
learn_rate = learn_rate_opt,
max_depth = max_depth_opt,
sample_rate = sample_rate_opt,
col_sample_rate = col_sample_rate_opt
)
search_criteria = list(strategy = "RandomDiscrete",
max_runtime_secs = 4000,
max_models = 30
)
## GBM gridsearch
gbm_fit_grid = h2o.grid(
algorithm = "gbm",
grid_id = "gbm_grid",
x = predictors,
y = target,
hyper_params = gbm_hyper_params,
search_criteria = search_criteria,
training_frame = training_gc.hex,
nfolds = 5,
keep_cross_validation_predictions = TRUE,
stopping_rounds = 2,
stopping_tolerance = 1e-3,
stopping_metric = "AUTO"
)
sorted_grid_gbm = h2o.getGrid(grid_id = "gbm_grid", sort_by = "AUC", decreasing = TRUE)
print(sorted_grid_gbm)
gbm_fit_tuned_f = h2o.getModel(sorted_grid_gbm@model_ids[[1]])
#predict on testing set
gbm_tuned_pred = predict(gbm_fit_tuned_f, test_gc.hex)
gbm_tuned_pred_vec = as.vector(gbm_tuned_pred$p1)
Tuning the Neural Network
structure = list(c(50, 50, 50),c(30, 30, 30),c(50, 60, 50))
learning_rates = c(1e-3,1e-4)
drop_outs = c(0.1, 0.2)
nn_hyperparameters = list(hidden = structure , l1 = learning_rates, input_dropout_ratio = drop_outs)
search_criteria = list(strategy = "RandomDiscrete",
max_runtime_secs = 4000,
max_models = 40
)
nn_model_grid = h2o.grid("deeplearning",
grid_id = "nn_grid",
hyper_params = nn_hyperparameters,
x = predictors,
y = target,
distribution = "AUTO",
activation = "TanhWithDropout",
training_frame = training_gc.hex,
nfolds = 2,
stopping_rounds = 3,
epochs = 1000,
search_criteria = search_criteria,
stopping_tolerance = 0.05,
stopping_metric = "AUC")
sorted_grid_nn = h2o.getGrid(grid_id = "nn_grid", sort_by = "AUC", decreasing = TRUE)
print(sorted_grid_nn)
nn_fit_tuned_f = h2o.getModel(sorted_grid_nn@model_ids[[1]])
#predict on testing set
nn_tuned_pred = predict(nn_fit_tuned_f, test_gc.hex)
nn_tuned_pred_vec = as.vector(nn_tuned_pred$p1)
Lets check performance again
roc_gbm_t = plot.roc(testing_gc$class, gbm_tuned_pred_vec, main="ROC Curves Tuned Models", col="green",auc=TRUE)
roc_nn_t = lines.roc(testing_gc$class, nn_tuned_pred_vec, col="purple")
legend("bottomright", legend=c(paste("GBM T AUC = ", round(auc(roc_gbm_t),2)), paste("NN T AUC = ", round(auc(roc_nn_t),2))), col=c("green", "purple"), lwd=3, bty = "n")
h2o.shutdown()
Session on Unsupervised Learning Techniques
install.packages(c("factoextra", "fpc", "NbClust"))
also installing the dependencies ‘rematch’, ‘forcats’, ‘cellranger’, ‘zip’, ‘SparseM’, ‘MatrixModels’, ‘haven’, ‘curl’, ‘readxl’, ‘openxlsx’, ‘minqa’, ‘nloptr’, ‘RcppEigen’, ‘carData’, ‘pbkrtest’, ‘quantreg’, ‘maptools’, ‘rio’, ‘lme4’, ‘whisker’, ‘viridis’, ‘car’, ‘ellipse’, ‘flashClust’, ‘leaps’, ‘scatterplot3d’, ‘ggsci’, ‘cowplot’, ‘ggsignif’, ‘gridExtra’, ‘modeltools’, ‘dendextend’, ‘FactoMineR’, ‘ggpubr’, ‘mclust’, ‘flexmix’, ‘prabclus’, ‘diptest’, ‘mvtnorm’, ‘trimcluster’
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/rematch_1.0.1.tgz'
Content type 'application/x-gzip' length 12397 bytes (12 KB)
==================================================
downloaded 12 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/forcats_0.3.0.tgz'
Content type 'application/x-gzip' length 209101 bytes (204 KB)
==================================================
downloaded 204 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/cellranger_1.1.0.tgz'
Content type 'application/x-gzip' length 102084 bytes (99 KB)
==================================================
downloaded 99 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/zip_1.0.0.tgz'
Content type 'application/x-gzip' length 121059 bytes (118 KB)
==================================================
downloaded 118 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/SparseM_1.77.tgz'
Content type 'application/x-gzip' length 1112464 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/MatrixModels_0.4-1.tgz'
Content type 'application/x-gzip' length 387316 bytes (378 KB)
==================================================
downloaded 378 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/haven_1.1.2.tgz'
Content type 'application/x-gzip' length 937657 bytes (915 KB)
==================================================
downloaded 915 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/curl_3.2.tgz'
Content type 'application/x-gzip' length 427918 bytes (417 KB)
==================================================
downloaded 417 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/readxl_1.1.0.tgz'
Content type 'application/x-gzip' length 1498484 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/openxlsx_4.1.0.tgz'
Content type 'application/x-gzip' length 2958017 bytes (2.8 MB)
==================================================
downloaded 2.8 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/minqa_1.2.4.tgz'
Content type 'application/x-gzip' length 321982 bytes (314 KB)
==================================================
downloaded 314 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/nloptr_1.0.4.tgz'
Content type 'application/x-gzip' length 1043348 bytes (1018 KB)
==================================================
downloaded 1018 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/RcppEigen_0.3.3.4.0.tgz'
Content type 'application/x-gzip' length 4648821 bytes (4.4 MB)
==================================================
downloaded 4.4 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/carData_3.0-1.tgz'
Content type 'application/x-gzip' length 1779010 bytes (1.7 MB)
==================================================
downloaded 1.7 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/pbkrtest_0.4-7.tgz'
Content type 'application/x-gzip' length 258535 bytes (252 KB)
==================================================
downloaded 252 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/quantreg_5.36.tgz'
Content type 'application/x-gzip' length 2111161 bytes (2.0 MB)
==================================================
downloaded 2.0 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/maptools_0.9-3.tgz'
Content type 'application/x-gzip' length 2139843 bytes (2.0 MB)
==================================================
downloaded 2.0 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/rio_0.5.10.tgz'
Content type 'application/x-gzip' length 151938 bytes (148 KB)
==================================================
downloaded 148 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/lme4_1.1-18-1.tgz'
Content type 'application/x-gzip' length 7680606 bytes (7.3 MB)
==================================================
downloaded 7.3 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/whisker_0.3-2.tgz'
Content type 'application/x-gzip' length 64237 bytes (62 KB)
==================================================
downloaded 62 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/viridis_0.5.1.tgz'
Content type 'application/x-gzip' length 1866361 bytes (1.8 MB)
==================================================
downloaded 1.8 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/car_3.0-2.tgz'
Content type 'application/x-gzip' length 1482315 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/ellipse_0.4.1.tgz'
Content type 'application/x-gzip' length 70509 bytes (68 KB)
==================================================
downloaded 68 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/flashClust_1.01-2.tgz'
Content type 'application/x-gzip' length 24127 bytes (23 KB)
==================================================
downloaded 23 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/leaps_3.0.tgz'
Content type 'application/x-gzip' length 101301 bytes (98 KB)
==================================================
downloaded 98 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/scatterplot3d_0.3-41.tgz'
Content type 'application/x-gzip' length 333783 bytes (325 KB)
==================================================
downloaded 325 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/ggsci_2.9.tgz'
Content type 'application/x-gzip' length 2977735 bytes (2.8 MB)
==================================================
downloaded 2.8 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/cowplot_0.9.3.tgz'
Content type 'application/x-gzip' length 2457197 bytes (2.3 MB)
==================================================
downloaded 2.3 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/ggsignif_0.4.0.tgz'
Content type 'application/x-gzip' length 73587 bytes (71 KB)
==================================================
downloaded 71 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/gridExtra_2.3.tgz'
Content type 'application/x-gzip' length 1102824 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/modeltools_0.2-22.tgz'
Content type 'application/x-gzip' length 210141 bytes (205 KB)
==================================================
downloaded 205 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/dendextend_1.8.0.tgz'
Content type 'application/x-gzip' length 2084103 bytes (2.0 MB)
==================================================
downloaded 2.0 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/FactoMineR_1.41.tgz'
Content type 'application/x-gzip' length 3625207 bytes (3.5 MB)
==================================================
downloaded 3.5 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/ggpubr_0.1.7.tgz'
Content type 'application/x-gzip' length 1718583 bytes (1.6 MB)
==================================================
downloaded 1.6 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/mclust_5.4.1.tgz'
Content type 'application/x-gzip' length 3916013 bytes (3.7 MB)
==================================================
downloaded 3.7 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/flexmix_2.3-14.tgz'
Content type 'application/x-gzip' length 1726928 bytes (1.6 MB)
==================================================
downloaded 1.6 MB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/prabclus_2.2-6.tgz'
Content type 'application/x-gzip' length 380326 bytes (371 KB)
==================================================
downloaded 371 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/diptest_0.75-7.tgz'
Content type 'application/x-gzip' length 350588 bytes (342 KB)
==================================================
downloaded 342 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/mvtnorm_1.0-8.tgz'
Content type 'application/x-gzip' length 260771 bytes (254 KB)
==================================================
downloaded 254 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/trimcluster_0.1-2.1.tgz'
Content type 'application/x-gzip' length 18712 bytes (18 KB)
==================================================
downloaded 18 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/factoextra_1.0.5.tgz'
Content type 'application/x-gzip' length 408822 bytes (399 KB)
==================================================
downloaded 399 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/fpc_2.1-11.1.tgz'
Content type 'application/x-gzip' length 659237 bytes (643 KB)
==================================================
downloaded 643 KB
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/NbClust_3.0.tgz'
Content type 'application/x-gzip' length 120005 bytes (117 KB)
==================================================
downloaded 117 KB
The downloaded binary packages are in
/var/folders/zr/78rycd0909z1vfk2tyjn5_mc0000gn/T//Rtmp0UQSqv/downloaded_packages
library(factoextra)
Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(fpc)
library(NbClust)
credit_num_vars = sapply(credit, is.numeric)
credit_num = credit[ , credit_num_vars]
credit_num_s = scale(credit_num)
# Hierarchical clustering
hc_clust = eclust(credit_num_s, "hclust", k = 3, hc_metric = "euclidean", hc_method = "ward.D2", graph = FALSE)
# Visualize dendrograms the clusters
fviz_dend(hc_clust, show_labels = FALSE, palette = "jco", as.ggplot = TRUE)
# K-means clustering
km_clust = eclust(credit_num_s, "kmeans", k = 2, nstart = 25, graph = FALSE)
# Visualize k-means clusters
fviz_cluster(km_clust, geom = "point", ellipse.type = "norm",
palette = "jco", ggtheme = theme_minimal())
#Lets do the some Principal Component Analysis on the same numerical variable
credit_pca = prcomp(credit_num_s,
center = TRUE,
scale. = TRUE)
summary(credit_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 1.3187 1.1928 1.0811 0.9868 0.9496 0.9001 0.83744 0.53147
Proportion of Variance 0.2174 0.1778 0.1461 0.1217 0.1127 0.1013 0.08766 0.03531
Cumulative Proportion 0.2174 0.3952 0.5413 0.6631 0.7758 0.8770 0.96469 1.00000
principals_components = credit_pca$rotation
plot(credit_pca,main="Variance Explained by Each Component",type="l")
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.00
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.
Hussami, N., Kraljevic, T., Lanford, J., Nykodym, T., Rao, A., and Wang, A. (Aug. 2015). Generalized Linear Modeling with H2O. http://h2o.ai/resources.
Computational Actuarial Science With R by Arthur Charpentier (editor), 2015, Boca Raton, FL: CRC Press, 618 pages, ISBN: 978‐1‐4665‐9259‐9.