In today’s post, I will go through how to get started with solving Kaggle competitions in R using packages such as xgboost and recipes. The Kaggle competition used in the example is the IEEE-CIS Fraud Detection: https://www.kaggle.com/c/ieee-fraud-detection/overview.
In just 100 lines of code and without creating any new features, we will create an xgboost model which reaches an AUC of 93.5%. This is quite far down the leaderboard as the competition is fierce, but it is actually only about 3 percentage points behind the current leader.
The IEEE-Kaggle competition is about predicting fraud for credit cards, based on a vast number of features (about 400). It is a supervised machine learning problem as we have access to the dependent variable, isFraud, which is equal to 1 in the case of fraud.
There are a total of about 591,000 observed transactions in the training set, of which about 144,000 also have a corresponding identity record (which seems to be information about the website where the credit card was used).
In this post, we will solve the problem using xgboost, which is one of the most popular implementations of gradient boosting machines (GBM). Note that while xgboost used to be the most popular algorithm on Kaggle, Microsoft’s lightgbm has challenged that position, which I will (hopefully) cover in a later post.
First, put the data in a /data folder and import the files. As a csv-reader, I recommend using Jim Hester’s vroom-package as it’s incredibly fast (read more here: https://github.com/r-lib/vroom).
After reading in the data, I join the relevant files and split the data into training and test sets using rsample::initial_split.
library(tidyverse)
library(rsample)
library(yardstick)
library(recipes)
library(xgboost)
library(vroom)

identity <- vroom("data/train_identity.csv", delim = ",")
transaction <- vroom("data/train_transaction.csv", delim = ",")
score_identity <- vroom("data/test_identity.csv", delim = ",")
score_transaction <- vroom("data/test_transaction.csv", delim = ",")

# The test identity file may name its columns id-01, id-02, ... instead of
# id_01, id_02, ... as in the training file. Normalising the names is harmless
# if they already match.
names(score_identity) <- gsub("-", "_", names(score_identity))

# Combine data
# Note: not all transactions have a corresponding identity.
df <- transaction %>% left_join(identity, by = "TransactionID")
score <- score_transaction %>% left_join(score_identity, by = "TransactionID")

# Split in train/test
split <- initial_split(df, prop = 0.8)
train <- training(split)
test <- testing(split)
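To illustrate the note about transactions without an identity, here is a small base R sketch of what a left join does in that situation. The toy values are made up, and `merge(..., all.x = TRUE)` is used here only as a base R stand-in for `left_join`.

```r
# Four toy transactions, of which only two have an identity record.
transaction_toy <- data.frame(
  TransactionID  = 1:4,
  TransactionAmt = c(10, 25, 40, 5)
)
identity_toy <- data.frame(
  TransactionID = c(2L, 4L),
  DeviceType    = c("mobile", "desktop")
)

# all.x = TRUE keeps every transaction, like a left join.
joined_toy <- merge(transaction_toy, identity_toy,
                    by = "TransactionID", all.x = TRUE)

# Transactions without an identity get NA in the identity columns.
print(joined_toy)
```

All four transactions survive the join, and the two without an identity get missing values. This is one reason why the data contains so many NAs that have to be handled later.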
Creating the recipe
Now let’s create a recipe. Remember, a recipe is a pre-defined collection of functions which is to be trained on your training data, and then applied on your test data and scoring data. Using recipes is an extremely effective approach to avoid data leakage between your training and testing data (in fact, recipes has quickly become one of my favourite packages in R).
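The idea of "train on the training data, apply everywhere else" can be sketched in a few lines of base R. This is only an illustration of the principle, not how recipes is implemented. The helper names `fit_mean_imputer` and `apply_mean_imputer` and the toy data are hypothetical.

```r
# Toy data: `amount` has missing values in both the training and the test set.
train_toy <- data.frame(amount = c(10, 20, NA, 30))
test_toy  <- data.frame(amount = c(NA, 1000))

# "prep": learn the imputation value from the training data only.
fit_mean_imputer <- function(data, column) {
  list(column = column, value = mean(data[[column]], na.rm = TRUE))
}

# "bake": apply the stored value to any data set without re-estimating it.
apply_mean_imputer <- function(imputer, data) {
  missing <- is.na(data[[imputer$column]])
  data[[imputer$column]][missing] <- imputer$value
  data
}

imputer    <- fit_mean_imputer(train_toy, "amount")
train_done <- apply_mean_imputer(imputer, train_toy)
test_done  <- apply_mean_imputer(imputer, test_toy)

print(test_done)
```

The missing test value is filled with the training mean, so the extreme test observation never influences the preprocessing. Had the mean been computed on the combined data, information from the test set would have leaked into the training step.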
Here, I impute missing values with mean/median/mode (note: this is not strictly necessary for xgboost and may even worsen your model, but it is necessary for other model types such as glm).
There are also some factor variables with way too many levels, so I lump these together using step_other, which puts all factor levels with < 1% of observations into an “other” category. Finally, note that step_dummy has to be applied before fitting an xgboost model, as xgboost requires a numeric matrix as input.
# Define recipe, which treats missing data, factors with too many levels and creates dummy variables.
# Note: older recipes releases call the imputation steps step_meanimpute,
# step_modeimpute and step_medianimpute.
rec <- recipe(isFraud ~ ., data = train) %>%
  step_rm(TransactionID) %>%
  step_impute_mean(all_numeric()) %>%
  step_impute_mode(all_nominal()) %>%
  # Logical columns become integers and are then median-imputed
  step_integer(has_type(match = "logical")) %>%
  step_impute_median(all_numeric()) %>%
  step_zv(all_predictors()) %>%
  step_other(all_nominal(), threshold = 0.01) %>%
  step_dummy(all_nominal()) %>%
  check_missing(all_predictors())

# Prepare the recipe and use juice/bake to get the data!
prepped_rec <- prep(rec)
train <- juice(prepped_rec)
test <- bake(prepped_rec, new_data = test)
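To see what lumping rare levels and creating dummy variables does to a single column, here is a base R sketch on made-up data. It mimics the effect of step_other and step_dummy but is not the recipes implementation. The helper `lump_rare` and the toy e-mail domains are hypothetical.

```r
# Toy column with one dominant level, one common level and two rare ones.
email <- c(rep("gmail.com", 150), rep("yahoo.com", 48),
           "rare-a.com", "rare-b.com")

# Lump every level that covers less than `threshold` of the rows into "other".
lump_rare <- function(x, threshold = 0.01) {
  share <- table(x) / length(x)
  rare  <- names(share)[share < threshold]
  factor(ifelse(x %in% rare, "other", x))
}

email_lumped <- lump_rare(email)
print(table(email_lumped))

# Dummy encoding: one numeric 0/1 column per level except the first,
# which serves as the reference level.
dummies <- model.matrix(~ email_lumped)[, -1, drop = FALSE]
print(dim(dummies))
```

The four original levels collapse to three, and the single text column becomes two numeric columns, which is the kind of input a matrix-based model needs.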
Defining the model
Let’s build the xgboost-model. Now, the xgboost package is a bit strict when it comes to data format, and requires its input to be a matrix. This is quite easy to fix, as long as one remembers to dummify the categorical variables beforehand.
The parameters chosen here have not been tuned, but are my “best guess”. The model could easily be improved further by using hyper-parameter tuning, which will be covered in a future blog post. The chosen parameters are as follows:
- booster: Here the obvious choice is “gbtree” as we want a tree-based model (I haven’t really seen a useful application of “gblinear”).
- objective: We are predicting a 1/0 variable, so we need the “binary:logistic” objective.
- gamma: This parameter controls overfitting through requiring a minimum loss reduction in order to create a split. Often, a large gamma results in an underfitted model.
- max_depth: This is one of the most important parameters and defines the depth of your decision trees. A deeper tree allows for more complex models, with far more interactions, which again may result in overfitting. A tree depth of 6 is normally enough, but I have seen problems where you need much deeper trees (40+) due to the complex nature of the problem. Be careful, though - a large max_depth will greatly increase your training time.
- eta: This controls the learning rate of the algorithm. A larger learning rate results in a model that hits optimum faster, but it increases the probability of overshooting. In my experience, it is typically set between 0.01 - 0.10.
- min_child_weight: Controls the minimum sum of instance weights (roughly, the minimum number of observations) needed in a child node for a split to be made. Typically set between 1 and 5.
- subsample: The fraction of your training data to be used in each tree. A low fraction may help prevent overfitting.
- colsample_bytree: The fraction of your variables to be used in each tree. A low fraction may help prevent overfitting, particularly if you have a problem where one variable is much stronger than the others. In that case, it might be helpful to force the model to create trees without said variable, e.g. to make the model useful in the situation where that particular variable is missing or wrong.
- tree_method: I typically use either “exact” (very slow), “hist” (much faster, but may be less precise) or “gpu_hist” (fastest, but requires a suitable GPU).
- early_stopping_rounds: This tells your model to stop training after N consecutive training rounds in which the validation metric has not improved.
- nrounds: The number of trees to grow. Be careful here - if nrounds is not large enough, your model might not have converged. This is a quite complex problem with many variables, so we need a large number of trees.
# Convert the data to xgb.DMatrix objects (via dense numeric matrices)
xgtrain <- xgb.DMatrix(as.matrix(train %>% select(-isFraud)), label = train$isFraud)
xgtest <- xgb.DMatrix(as.matrix(test %>% select(-isFraud)), label = test$isFraud)

params <- list(
  booster = "gbtree",
  objective = "binary:logistic",
  eval_metric = "auc",
  gamma = 1,
  max_depth = 6,
  eta = 0.05,
  min_child_weight = 2,
  subsample = 0.7,
  colsample_bytree = 0.7,
  tree_method = "hist"
)

# Train model with ROC AUC as evaluation metric.
# early_stopping_rounds is an argument of xgb.train rather than a booster
# parameter, and early stopping monitors the last entry of the watchlist,
# so the validation set goes last.
xgmodel <- xgb.train(
  params = params,
  data = xgtrain,
  nrounds = 800,
  watchlist = list(train = xgtrain, val = xgtest),
  early_stopping_rounds = 10,
  maximize = TRUE,
  verbose = 0
)
Note: If you train your model and see that you have chosen too low a value for nrounds (i.e. your validation AUC is clearly still improving at the last iteration), you can simply keep training the same model by passing it to xgb.train through the xgb_model argument.
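Continuing training from an existing model could look roughly like the sketch below. It uses the small agaricus example data shipped with the xgboost package instead of the fraud data, and the parameter values and round counts are arbitrary choices for illustration.

```r
library(xgboost)

# Small example data set shipped with the xgboost package.
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
y <- agaricus.train$label

toy_params <- list(objective = "binary:logistic", eta = 0.1, max_depth = 2)

# Training log-loss, computed by hand from predicted probabilities.
log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

# First run: deliberately too few rounds.
first_fit <- xgb.train(params = toy_params, data = dtrain,
                       nrounds = 5, verbose = 0)
loss_first <- log_loss(y, predict(first_fit, dtrain))

# Second run: start from first_fit and add 20 more rounds on top of it.
continued_fit <- xgb.train(params = toy_params, data = dtrain,
                           nrounds = 20, xgb_model = first_fit, verbose = 0)
loss_continued <- log_loss(y, predict(continued_fit, dtrain))

print(c(first = loss_first, continued = loss_continued))
```

The second call does not start from scratch: nrounds is the number of additional rounds, so the continued model has the trees from the first run plus the new ones.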
Now we are ready to evaluate the model:
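# Add predictions; yardstick expects the truth column to be a factor
test <- test %>%
  mutate(
    prediction = predict(xgmodel, xgtest),
    isFraud_fct = factor(isFraud, levels = c(0, 1))
  )

# Fraud (1) is the second factor level, so mark it as the event of interest
test %>% yardstick::roc_auc(isFraud_fct, prediction, event_level = "second")

# Plot importance
imp <- xgb.importance(model = xgmodel)
xgb.ggplot.importance(imp, top_n = 15) + theme_minimal()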
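If you wonder what the AUC number actually measures, it is the probability that a randomly chosen fraud case gets a higher score than a randomly chosen non-fraud case. The base R sketch below computes it from ranks on made-up data. The helper `auc_by_rank` is hypothetical and only meant as a sanity check, not as a replacement for yardstick.

```r
# AUC via the rank-sum (Mann-Whitney) formulation.
auc_by_rank <- function(truth, score) {
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  ranks <- rank(score)  # ties receive average ranks
  (sum(ranks[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: three fraud cases (1) and three normal cases (0).
truth_toy <- c(0, 0, 1, 0, 1, 1)
score_toy <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.70)

print(auc_by_rank(truth_toy, score_toy))
```

In the toy data, 8 of the 9 possible fraud/non-fraud pairs are ordered correctly, so the AUC is 8/9. Only the pair with scores 0.35 (fraud) and 0.40 (non-fraud) is ranked the wrong way round.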
Delivering your model
We are now ready to turn in our model. Because we used recipes, applying our steps to the scoring data set is very simple - we just bake the prepped recipe with new_data = score.
# The recipe was defined with isFraud, so add a placeholder outcome before baking
score <- score %>% mutate(isFraud = 0)

xgscore <- bake(prepped_rec, new_data = score) %>%
  select(-isFraud) %>%
  as.matrix() %>%
  xgb.DMatrix()

# Build the submission: one predicted fraud probability per TransactionID
res <- score %>%
  mutate(
    isFraud = predict(xgmodel, xgscore),
    TransactionID = as.integer(TransactionID)
  ) %>%
  select(TransactionID, isFraud)

# write_csv(res, "submission.csv")
Now your model is ready for delivery!
In this post, we have outlined the basic steps for creating a xgboost model for a Kaggle-competition, but the hardest part still remains - the feature engineering (i.e. creating new features based on existing features, such as aggregated values).
Don’t be fooled by the term machine learning - hand-crafted features are still extremely important if you want to create top-tier models, and almost all Kaggle competitions are won by models including a large number of hand-crafted features.
Unfortunately, the feature engineering is unusually tricky in this particular case, due to the presence of anonymised columns in the data set and the vast number of variables (about 400). If you want to look into how to do feature engineering for this particular case, there are several interesting notebooks available at Kaggle: https://www.kaggle.com/c/ieee-fraud-detection/overview
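To give a flavour of what an aggregated feature looks like, here is a minimal base R sketch on made-up data. The column names card1 and TransactionAmt are assumed to match the competition data, while the new feature names and all values are hypothetical.

```r
# Toy transactions for two cards.
toy <- data.frame(
  card1          = c("A", "A", "A", "B", "B"),
  TransactionAmt = c(10, 20, 90, 50, 50)
)

# Aggregated feature: mean amount per card, repeated on every row of that card.
toy$card_mean_amt <- ave(toy$TransactionAmt, toy$card1, FUN = mean)

# Derived feature: how large is this transaction relative to the card's mean?
toy$amt_vs_card_mean <- toy$TransactionAmt / toy$card_mean_amt

print(toy)
```

The third transaction is 2.25 times the typical amount for its card, which is the kind of signal a tree model cannot easily construct on its own. In a real pipeline, such aggregates should be computed on the training data only and then applied to the test and scoring data, for the same leakage reasons as discussed for recipes.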