Hunt the paper tiger: from boosting to XGBoost, intuitively, mathematically, implementably

We will use this simple data set [1], {x, y}, in all our tutorials. If we use the 4th column as the label, the 3rd column will be the feature, and vice versa.

(1)   \begin{equation*} \begin{tabular}{llll} \textbf{Sleep quality} & \textbf{Time in bed} & \textbf{Wake up} & \textbf{Heart rate} \\ 100\% & 8:32 & \blacksmiley{} & 59 \\ 3\% & 0:16 & \frownie{} & 72 \\ 98\% & 8:30 & \frownie{} & 57 \\ 97\% & 9:06 & \blacksmiley{} & NaN \\ 72\% & 6:44 & \blacksmiley{} & 68 \\ 83\% & 7:12 & \blacksmiley{} & 60 \\ 78\% & 7:18 & \blacksmiley{} & 57 \\ 69\% & 7:27 & \blacksmiley{} & 56 \\ 74\% & 7:35 & \frownie{} & 64 \\ 81\% & 9:19 & \blacksmiley{} & 62 \\ \end{tabular} \end{equation*}
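
For readers who want to follow along in code, here is a minimal sketch (assuming pandas is installed) that builds the same table as a DataFrame. The column names and the 1/0 encoding of the smiley/frowny "Wake up" column are my own choices, not part of the original data set.

import pandas as pd

# Toy sleep data set [1]: times in bed converted to hours, wake-up mood encoded as
# 1 (happy face) / 0 (sad face). Column names are illustrative.
df = pd.DataFrame({
    "sleep_quality": [1.00, 0.03, 0.98, 0.97, 0.72, 0.83, 0.78, 0.69, 0.74, 0.81],
    "time_in_bed_h": [8 + 32/60, 16/60, 8 + 30/60, 9 + 6/60, 6 + 44/60,
                      7 + 12/60, 7 + 18/60, 7 + 27/60, 7 + 35/60, 9 + 19/60],
    "wake_up":       [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    "heart_rate":    [59, 72, 57, None, 68, 60, 57, 56, 64, 62],   # NaN kept as missing
})
print(df)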

Boosting

Gradient boosting for regression

The loss function of gradient boosting is defined as

(2)   \begin{equation*} \mathcal{L}^{(t)}  =\underbrace{ \sum_{i=1}^n l(\overbrace{y_i}^{\text{label}},  \underbrace{ \hat{y_i}^{(t-1)} + f_t(x_i) }_{(z_0+\Delta{z}) \text{ where } z_0 = \hat{y_i}^{(t-1)} \text{ and } \Delta{z} = f_t(x_i) } ) }_{\mathcal{L}^{(t)}(z_0+\Delta{z})} + \cancelto{\text{ignore now}}{\Omega(f_t)}. \end{equation*}
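
To make Equation 2 concrete before moving on, here is a minimal sketch of gradient boosting for regression with squared loss, where the negative gradient is simply the residual: each round fits a small tree to the current residuals and adds it to the prediction. The regularization term \Omega(f_t) is ignored, as above. Using scikit-learn's DecisionTreeRegressor as the weak learner, predicting heart rate from the wake-up mood (4th column as label, 3rd as feature), and the hyperparameters are all illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: heart rate (label) vs. wake-up mood (feature); the row with a missing
# heart rate is dropped.
X = np.array([[1], [0], [0], [1], [1], [1], [1], [0], [1]])
y = np.array([59., 72., 57., 68., 60., 57., 56., 64., 62.])

learning_rate = 0.3
y_hat = np.full_like(y, y.mean())                 # f_0: start from the mean
trees = []

for t in range(20):                               # 20 boosting rounds
    residual = y - y_hat                          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    y_hat += learning_rate * tree.predict(X)      # y_hat^(t) = y_hat^(t-1) + f_t(x)
    trees.append(tree)

print(np.round(y_hat, 1))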

Gradient boosting for classification

Caveat: To avoid heavy notation, we omit the summation symbol \sum in what follows.

For a binary classification problem, we can define the odds as

(3)   \begin{equation*} \text{odds} = \frac{\text{win}}{\text{lose}} \end{equation*}

and probability as

(4)   \begin{equation*} p = \frac{\text{win}}{\text{win}+\text{lose}}. \end{equation*}

You might wonder why we define these. In the following derivation, you will see that this definition makes the result consistent with the regression case.

With some simple algebra,

(5)   \begin{equation*} p = \frac{\text{odds}}{1+\text{odds}} =\frac{ e^{\log{(\text{odds})}} }{ 1 + e^{\log{(\text{odds})}} }. \end{equation*}
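
As a quick numerical check of Equations 3-5 (a sketch; the numbers come from the toy data above, with 7 happy wake-ups counted as "win" and 3 sad ones as "lose"):

import math

win, lose = 7, 3
odds = win / lose                                        # Equation 3: 7/3
p = win / (win + lose)                                   # Equation 4: 0.7
log_odds = math.log(odds)
p_again = math.exp(log_odds) / (1 + math.exp(log_odds))  # Equation 5
print(round(p, 3), round(p_again, 3))                    # both print 0.7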

We can define our loss function as cross entropy, such that

(6)   \begin{equation*} \begin{aligned} \mathcal{L}(y, F_{m-1} + \gamma)  = & -(y \log(p) + (1-y) \log(1-p)) \\ = & -( y(\log(p) - \log(1-p)) + \log(1-p)) \\ = & - y\log{\frac{p}{1-p}}  + \log(1+\text{odds}) \\ = & -[ y \log(\text{odds}) - \log(1+e^{\log(\text{odds})}) ] , \end{aligned} \end{equation*}

in which

(7)   \begin{equation*} \log(\text{odds}) = F_m = {F_{m-1} + \gamma}. \end{equation*}

We want to find the \gamma that minimizes the loss, in symbols,

(8)   \begin{equation*} \underset{\gamma}{\text{arg\,min\;}} \mathcal{L}. \end{equation*}

We could work directly on Equation 6, using gradient descent or a closed-form solution, such that

(9)   \begin{equation*} \dv{\mathcal{L}}{\gamma}=0 \end{equation*}

However, this would be quite complex.

We can instead use a Taylor series to approximate the loss function; you should convince yourself that this makes things simpler, such that

(10)   \begin{equation*} \mathcal{L}(y, F_{m-1}+\gamma) \approx \mathcal{L}(y, F_{m-1}) + \dv{\mathcal{L}}{F} \gamma + \frac{1}{2}\dv[2]{\mathcal{L}}{F} \gamma^2 \end{equation*}

Caveat: Two kinds of derivatives of \mathcal{L} appear here, one with respect to F and one with respect to \gamma.
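
If you wonder how good this approximation is, here is a small numerical sketch (not part of the derivation; the label and current log-odds are arbitrary choices) comparing the true cross-entropy loss of Equation 6 with its second-order expansion, using finite differences for the two derivatives:

import numpy as np

def loss(y, F):
    # Equation 6 with F = log(odds): -[y*F - log(1 + e^F)]
    return -(y * F - np.log(1 + np.exp(F)))

y, F, eps = 1.0, 0.5, 1e-5                             # illustrative label and log-odds
g = (loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)                  # dL/dF
h = (loss(y, F + eps) - 2 * loss(y, F) + loss(y, F - eps)) / eps**2    # d2L/dF2

for gamma in (0.1, 0.5, 1.0):
    exact = loss(y, F + gamma)
    approx = loss(y, F) + g * gamma + 0.5 * h * gamma ** 2             # Equation 10
    print(gamma, round(exact, 4), round(approx, 4))

For small \gamma the two values agree closely.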

Following Equation 9, we set

(11)   \begin{equation*} \dv{\mathcal{L}}{\gamma} = \dv{}{\gamma} {\Bigg[ \mathcal{L}(y, F_{m-1}) + \dv{\mathcal{L}}{F} \gamma + \frac{1}{2}\dv[2]{\mathcal{L}}{F} \gamma^2 \Bigg]} = 0, \end{equation*}

from which \gamma can be solved as

(12)   \begin{equation*} \gamma = -\frac{ \dv{\mathcal{L}}{F} }{ \dv[2]{\mathcal{L}}{F} }. \end{equation*}

With Equation 6, the first-order derivative with respect to F = \log(\text{odds}) can be calculated as

(13)   \begin{equation*} \dv{\mathcal{L}}{\log(\text{odds})} = \dv{}{\log(\text{odds})} \Big[ -y\log(\text{odds}) + \log(1+e^{\log(\text{odds})}) \Big] = - \left(y - \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}\right) = -{(y-p)}. \end{equation*}

The second derivative of \mathcal{L} with respect to \log(\text{odds}) is

(14)   \begin{equation*} \begin{aligned} \dv[2]{\mathcal{L}}{\log(\text{odds})} = & \dv{}{\log(\text{odds})} { \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} } \\ = & \frac{e^{\log(\text{odds})}(1 + e^{\log(\text{odds})}) - e^{2\log(\text{odds})}}{ (1 + e^{\log(\text{odds})})^2 } \\ = & \frac{1}{(1 + e^{\log(\text{odds})})} \frac{e^{\log(\text{odds})}}{ (1 + e^{\log(\text{odds})}) } \\ = & p(1 - p). \end{aligned} \end{equation*}
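
Putting Equations 12-14 together gives the per-sample Newton step \gamma = (y-p)/(p(1-p)). Below is a minimal sketch of a few boosting rounds for classification on the toy data (predicting the wake-up mood from heart rate, with the NaN row dropped); scikit-learn's DecisionTreeRegressor as the weak learner, the per-leaf aggregation of the step, and the hyperparameters are illustrative assumptions, and regularization is again ignored.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: wake-up mood (1 = happy, 0 = sad) as label, heart rate as feature.
X = np.array([[59.], [72.], [57.], [68.], [60.], [57.], [56.], [64.], [62.]])
y = np.array([1., 0., 0., 1., 1., 1., 1., 0., 1.])

learning_rate = 0.3
F = np.full_like(y, np.log(y.mean() / (1 - y.mean())))   # F_0 = log-odds of the base rate

for m in range(20):
    p = 1 / (1 + np.exp(-F))                             # Equation 5
    grad = -(y - p)                                      # Equation 13
    hess = p * (1 - p)                                   # Equation 14
    tree = DecisionTreeRegressor(max_depth=1).fit(X, -grad)   # fit the residual y - p
    leaf = tree.apply(X)                                 # leaf index of each sample
    # Per-leaf Newton step (Equation 12): gamma = -sum(grad) / sum(hess) within the leaf
    gamma = np.array([-grad[leaf == l].sum() / hess[leaf == l].sum() for l in leaf])
    F = F + learning_rate * gamma

print(np.round(1 / (1 + np.exp(-F)), 3))                 # predicted P(happy wake-up)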

XGBoost

Reference

1. Dana D. Sleep Data: Personal Sleep Data from Sleep Cycle iOS App. Kaggle. https://www.kaggle.com/danagerous/sleep-data#
