Gradient Boosted Trees (GBT) is an ensemble method that incrementally learns new trees, each one fitted to the residual error of the current ensemble. This residual error resembles a gradient step in a linear model. A GBT estimates the gradient step with a new tree and adds that tree to the current ensemble, so that the whole model moves in the optimizing direction. This is not a very formal explanation, but it gives my intuition.
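To make the intuition concrete, here is a minimal sketch of that loop in plain Python. It assumes squared error loss (where the negative gradient is exactly the residual), a single input feature, and depth-1 trees (stumps) as the base learner; the names `fit_stump` and `boost` and the toy data are made up for illustration, not taken from any library.

```python
import random

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) on one feature by least squares."""
    best = None
    # Every distinct value except the largest is a candidate split point,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(xs, ys, n_rounds=20):
    base = sum(ys) / len(ys)          # start from a constant prediction
    preds = [base] * len(ys)
    trees, losses = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        # For squared error, the negative gradient w.r.t. the current
        # prediction is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)   # new tree approximates the gradient step
        trees.append(tree)
        preds = [p + tree(x) for p, x in zip(preds, xs)]  # update the ensemble
        losses.append(mse(ys, preds))
    return base, trees, losses

# Toy data: a noisy quadratic (an assumption, purely for illustration).
rng = random.Random(0)
xs = [i / 50 for i in range(100)]
ys = [x * x + rng.gauss(0, 0.05) for x in xs]
base, trees, losses = boost(xs, ys)
print(f"training MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Each round only looks at what the current ensemble still gets wrong, so the training loss cannot go up from one round to the next in this setup.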

One formal way to think about GBT is this: consider all possible tree constructions; our algorithm just selects the useful ones for the given data. Hence, compared to all possible trees, the number of trees constructed in the model is very small. This is similar to constructing this infinite number of trees and averaging them with weights estimated by LASSO.
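Written out, the LASSO view looks roughly like the following. The symbols are my own notation for this sketch: $T_m$ ranges over all possible trees, $\alpha_m$ is the weight of tree $m$, $L$ is the loss, and $\lambda$ is the regularization strength.

$$
F(x) = \sum_{m} \alpha_m T_m(x), \qquad
\hat{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{n} L\Big(y_i, \sum_{m} \alpha_m T_m(x_i)\Big) + \lambda \sum_{m} |\alpha_m|
$$

The $\ell_1$ penalty pushes most $\alpha_m$ to exactly zero, which matches the picture of only a small number of trees ending up in the model.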

GBT has several hyperparameters, mostly for regularization.

- Early stopping: How many rounds your GBT continues.
- Shrinkage: Limit the update of each tree with a coefficient.
- Data subsampling: Do not use the whole data for each tree; instead, sample instances. The sample ratio can be lower for larger datasets.
- One side note: Subsampling without shrinkage performs poorly.

Then my initial setting is:

- Run pretty long, with many rounds, while observing the loss on validation data.
- Use a small shrinkage value.
- Sample 0.5 of the data.
- Sample 0.9 of the features as well, or do the reverse.
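
The setting above can be sketched in plain Python as follows. This is only an illustration under assumptions: squared error loss, stumps as base learners, a shrinkage of 0.1 standing in for "small", and a `patience` parameter (my own name) deciding how many rounds to wait for a validation improvement before stopping. The function names and the toy data are hypothetical.

```python
import random

def fit_stump(X, residuals, rows, features):
    """Least-squares stump that only sees the sampled rows and features."""
    best = None
    for j in features:
        values = sorted({X[i][j] for i in rows})
        for threshold in values[:-1]:      # keep both sides non-empty
            left = [residuals[i] for i in rows if X[i][j] <= threshold]
            right = [residuals[i] for i in rows if X[i][j] > threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    _, j, threshold, lm, rm = best
    return lambda row: lm if row[j] <= threshold else rm

def mse(y, preds):
    return sum((a - b) ** 2 for a, b in zip(y, preds)) / len(y)

def boost(X_tr, y_tr, X_va, y_va, max_rounds=300, shrinkage=0.1,
          row_ratio=0.5, feature_ratio=0.9, patience=20, seed=0):
    rng = random.Random(seed)
    n, d = len(X_tr), len(X_tr[0])
    base = sum(y_tr) / n
    pred_tr, pred_va = [base] * n, [base] * len(X_va)
    trees = []
    best_loss, best_round = mse(y_va, pred_va), 0
    for m in range(1, max_rounds + 1):
        residuals = [y - p for y, p in zip(y_tr, pred_tr)]
        # Subsample rows and features independently for every tree.
        rows = rng.sample(range(n), max(2, int(row_ratio * n)))
        features = rng.sample(range(d), max(1, int(feature_ratio * d)))
        tree = fit_stump(X_tr, residuals, rows, features)
        trees.append(tree)
        # Shrinkage: each tree only contributes a fraction of its fit.
        pred_tr = [p + shrinkage * tree(r) for p, r in zip(pred_tr, X_tr)]
        pred_va = [p + shrinkage * tree(r) for p, r in zip(pred_va, X_va)]
        loss = mse(y_va, pred_va)
        if loss < best_loss:
            best_loss, best_round = loss, m
        elif m - best_round >= patience:   # early stopping on validation loss
            break
    return base, trees[:best_round], best_loss

# Toy data with coarse feature values (an assumption, for illustration only).
data_rng = random.Random(1)
X = [[data_rng.randrange(10) / 10 for _ in range(5)] for _ in range(180)]
y = [3 * r[0] + 2 * (r[1] > 0.5) + data_rng.gauss(0, 0.1) for r in X]
X_tr, y_tr, X_va, y_va = X[:120], y[:120], X[120:], y[120:]
base, trees, best_loss = boost(X_tr, y_tr, X_va, y_va)
print(f"kept {len(trees)} trees, validation MSE {best_loss:.4f}")
```

The round limit is deliberately generous; the validation loss decides how many trees are actually kept, so the small shrinkage value does not need to be tuned together with the number of rounds.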