1 Introduction

This technical blog provides a comprehensive demonstration of the Iteratively Reweighted Least Squares (IRLS) algorithm for estimating the coefficients of a logistic regression model. Logistic regression is one of the most widely used tools for modeling binary outcomes in disciplines ranging from epidemiology (e.g., cancer versus no cancer) to economics (e.g., credit default versus non‑default). When posed as a generalized linear model with the canonical logit link, maximum‑likelihood estimation naturally leads to an elegant iterative scheme: the IRLS algorithm transforms the nonlinear score equations into a sequence of weighted least‑squares problems on a “working response,” thereby leveraging the simplicity and numerical stability of ordinary least squares within a Newton‑Raphson framework.

2 Exponential Family

Recall that one possible representation of an exponential family is the following: \[ f(y; \Theta, \phi)=\exp\bigl\{\frac{y\eta(\Theta)-b(\Theta)}{a(\phi)}+c(y,\phi) \bigr\} \] In the canonical form \(\eta(\Theta)=\Theta=g(\mu)\), and: \[ f(y; \Theta, \phi)=\exp\bigl\{\frac{y\Theta-b(\Theta)}{a(\phi)}+c(y,\phi) \bigr\} \] The log-likelihood in this case is: \[ \log\{f(y; \Theta, \phi)\}=\frac{y\Theta-b(\Theta)}{a(\phi)}+c(y,\phi)=l(\Theta) \] Differentiating the log-likelihood with respect to \(\Theta\) produces the score function \(l^{\prime}(\Theta)\), whilst differentiating once more gives the second derivative \(l^{\prime\prime}(\Theta)\), whose negative is the observed Fisher information: \[ l^{\prime}(\Theta)=\frac{y-b^{\prime}(\Theta)}{a(\phi)} \\ l^{\prime\prime}(\Theta)=\frac{-b^{\prime\prime}(\Theta)}{a(\phi)} \\ \] It can be shown that \(E[l^{\prime}(\Theta)]=0\): \[ \int{f(y|\Theta)}dy=1 \\ \frac{\partial}{\partial\Theta}\int{f(y|\Theta)}dy=0 \\ \int{\frac{\partial f(y|\Theta)}{\partial\Theta}}dy=0 \\ \int{\frac{\partial f(y|\Theta)}{\partial\Theta} \times \frac{f(y|\Theta)}{f(y|\Theta)}}dy=0\\ \int{\frac{\partial\log(f(y|\Theta))}{\partial\Theta}f(y|\Theta)}dy=E\bigl[ \frac{\partial\log(f(y|\Theta))}{\partial\Theta}\bigr]=E[l^{\prime}(\Theta)]=0\\ \] It can also be shown that: \[ E[-l^{\prime\prime}(\Theta)]=E\bigl[ -\frac{\partial^{2}\log(f(y|\Theta))}{\partial\Theta^{2}}\bigr]=cov\bigl[ \frac{\partial\log(f(y|\Theta))}{\partial\Theta}\bigr]=cov[l^{\prime}(\Theta)]=I(\Theta) \] where \(I(\Theta)\) is the Fisher information, which is a positive semidefinite matrix in the multivariate case.
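
These two identities can be illustrated numerically. The following R sketch (a minimal simulation with an assumed Bernoulli example, so \(a(\phi)=1\) and \(b^{\prime\prime}(\Theta)=p(1-p)\)) checks that the score has mean approximately zero and variance approximately equal to the Fisher information:

set.seed(1)
p     = 0.3
theta = log(p / (1 - p))                 # canonical parameter (log-odds)
y     = rbinom(1e5, size = 1, prob = p)  # many independent Bernoulli draws
score = y - p                            # l'(theta) = (y - b'(theta)) / a(phi), with a(phi) = 1
c(mean_score  = mean(score),             # should be close to 0
  var_score   = var(score),              # should be close to p * (1 - p)
  fisher_info = p * (1 - p))             # -E[l''(theta)] = b''(theta) / a(phi)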

Typically, we deal with i.i.d. observations \(y_1, \ldots, y_N\) having a common density \(f(y|\mathbf{\Theta})\); therefore, the log-likelihood has the following form: \[ l(\Theta)=\sum_{i=1}^{N}\log[f(y_i|\mathbf{\Theta})] \] The objective of maximum likelihood estimation is then: \[ \hat{\Theta}_{MLE}=\underset{\Theta}{\mathrm{argmax}}(l(\Theta)) \]
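
To make this objective concrete, here is a minimal R sketch, with simulated Bernoulli data and purely illustrative names, that maximizes the log-likelihood numerically with optim():

set.seed(1)
y = rbinom(100, size = 1, prob = 0.3)               # simulated i.i.d. Bernoulli data
negll = function(theta) {                           # theta is the log-odds
  p = 1 / (1 + exp(-theta))
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))   # negative log-likelihood
}
fit = optim(par = 0, fn = negll, method = "BFGS")   # numerical maximization
c(theta_hat = fit$par,                              # MLE of the log-odds
  p_hat     = 1 / (1 + exp(-fit$par)),              # implied probability estimate
  p_bar     = mean(y))                              # sample mean, the closed-form MLE of p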

3 Newton-Raphson Method and Fisher Scoring

Let’s perform a Taylor series approximation of \(l(\Theta)\) about \(\Theta=\Theta_0\), keeping terms up to second order: \[ l(\Theta) \approx l(\Theta_0)+l^{\prime}(\Theta_0)(\Theta-\Theta_0)+\tfrac{1}{2}l^{\prime\prime}(\Theta_0)(\Theta-\Theta_0)^2 \] Differentiate with respect to \(\Theta\): \[ l^{\prime}(\Theta) \approx l^{\prime}(\Theta_0)+l^{\prime\prime}(\Theta_0)(\Theta-\Theta_0) \] If \(\Theta^{*}=\hat{\Theta}_{MLE}\), then \(l^{\prime}(\Theta^{*})=0\). Therefore, plugging in \(\Theta=\Theta^{*}\): \[ 0 \approx l^{\prime}(\Theta_0)+l^{\prime\prime}(\Theta_0)(\Theta^{*}-\Theta_0)\\ l^{\prime\prime}(\Theta_0)\Theta_0 \approx l^{\prime}(\Theta_0)+l^{\prime\prime}(\Theta_0)\Theta^{*} \\ l^{\prime\prime}(\Theta_0)\Theta_0 - l^{\prime}(\Theta_0) \approx l^{\prime\prime}(\Theta_0)\Theta^{*} \\ \Theta^{*}\approx\Theta_0-\frac{l^{\prime}(\Theta_0)}{l^{\prime\prime}(\Theta_0)} \] Thus, the algorithm starts from an initial guess \(\Theta_0\) and proceeds in the following manner until convergence: \[ \Theta_{t}=\Theta_{t-1}-\frac{l^{\prime}(\Theta_{t-1})}{l^{\prime\prime}(\Theta_{t-1})} \] In the multivariate case: \[ \mathbf{\Theta_{t}}=\mathbf{\Theta_{t-1}}-J^{-1}(\mathbf{\Theta_{t-1}})U(\mathbf{\Theta_{t-1}}), \] where \(U\) is the score vector and \(J\) is the Hessian of the log-likelihood. Note that \(J(\mathbf{\Theta_{t-1}})\) might not be invertible; therefore, \(l^{\prime\prime}(\Theta)\) can be replaced with its expected value: \[ \Theta_{t}=\Theta_{t-1}-\frac{l^{\prime}(\Theta_{t-1})}{E[l^{\prime\prime}(\Theta_{t-1})]}=\Theta_{t-1}+\frac{l^{\prime}(\Theta_{t-1})}{E[-l^{\prime\prime}(\Theta_{t-1})]}= \Theta_{t-1}+\frac{l^{\prime}(\Theta_{t-1})}{I(\Theta_{t-1})} \] This method is called Fisher scoring. It is straightforward to verify that for a canonical link function \(\eta(\Theta)=\Theta\), the Fisher scoring method and Newton-Raphson are equivalent, i.e. \(E[l^{\prime\prime}(\Theta)]=\int{\frac{-b^{\prime\prime}(\Theta)}{a(\phi)}f(y|\Theta)}dy=\frac{-b^{\prime\prime}(\Theta)}{a(\phi)}=l^{\prime\prime}(\Theta)\).
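
The following R sketch (again simulated Bernoulli data; names are illustrative) runs the univariate Newton-Raphson update for the log-odds. Under the canonical link this is exactly Fisher scoring, and the iterate converges to the closed-form MLE:

set.seed(1)
y     = rbinom(100, size = 1, prob = 0.3)
theta = 0                                       # initial guess
for (t in 1:25) {
  p     = 1 / (1 + exp(-theta))
  score = sum(y) - length(y) * p                # l'(theta)
  hess  = -length(y) * p * (1 - p)              # l''(theta); here also -I(theta)
  step  = score / hess
  theta = theta - step                          # Newton-Raphson update
  if (abs(step) < 1e-10) break                  # stop when the step is negligible
}
c(theta_newton = theta, theta_closed_form = log(mean(y) / (1 - mean(y))))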

4 Binomial Distribution

4.1 Binomial pmf as a member of the exponential families

A binomial random variable
\[ Y_i \sim \mathrm{Binomial}(n_i, p_i), \quad E[Y_i] = \mu_i = n_ip_i, \quad \mathrm{Var}(Y_i) = n_ip_i(1-p_i), \]
has a probability mass function (pmf) that can be re-written in the exponential family form: \[ P(Y_i=y_i)=\binom{n_i}{y_i}\,p_i^{y_i}\,(1-p_i)^{n_i-y_i} =\exp\bigl\{\log\bigl[\binom{n_i}{y_i}\,p_i^{y_i}\,(1-p_i)^{n_i-y_i}\bigr]\bigr\} \\ =\exp\bigl\{\log\binom{n_i}{y_i} + \log(p_i^{y_i})+\log\bigl((1-p_i)^{n_i-y_i}\bigr) \bigr\} \\ =\exp\bigl\{\log\binom{n_i}{y_i} + {y_i}\log(p_i)+(n_i-y_i)\log(1-p_i) \bigr\} \\ =\exp\bigl\{\log\binom{n_i}{y_i} + {y_i}\log(p_i)-y_i\log(1-p_i) +n_i\log(1-p_i) \bigr\} \\ =\exp\bigl\{y_i\log\frac{p_i}{1-p_i}+n_i\log(1-p_i)+\log\binom{n_i}{y_i}\bigr\} \]
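
As a quick sanity check, the final exponential-family expression can be compared numerically against R’s built-in dbinom() for arbitrary illustrative values:

ni = 10; yi = 3; p = 0.4                                        # arbitrary illustrative values
exp(yi * log(p / (1 - p)) + ni * log(1 - p) + lchoose(ni, yi))  # exponential-family form
dbinom(yi, size = ni, prob = p)                                 # built-in pmf gives the same value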

Recall that exponential family densities have the following form: \[ f(y_i; \Theta_i, \phi)=\exp\bigl\{\frac{y_i\Theta_i-b(\Theta_i)}{a(\phi)}+c(y_i,\phi) \bigr\} \] By comparison: \[ \eta(\Theta_i)=\Theta_i=\log\frac{p_i}{1-p_i}=\log\frac{\frac{\mu_i}{n_i}}{1-\frac{\mu_i}{n_i}}=\log\frac{\mu_i}{n_i-\mu_i}=g(\mu_i) \\ a(\phi)=1 \]
Now: \[ \Theta_i=\log\frac{p_i}{1-p_i} \rightarrow p_i=\frac{e^{\Theta_i}}{1+e^{\Theta_i}} \rightarrow \log(1-p_i)=\log(1-\frac{e^{\Theta_i}}{1+e^{\Theta_i}}) = -\log(1+e^{\Theta_i})=-\log(1+e^{\eta_i}) \]

Therefore, the density can be re-written as: \[ P(Y_i=y_i)=\exp\bigl\{y_i\eta_i-n_i\log(1+e^{\eta_i})+\log\binom{n_i}{y_i}\bigr\} \] In the canonical GLM form:

  • Natural parameter: \(\Theta_i = \eta_i=\log\frac{p_i}{1-p_i}\).
  • Cumulant: \(b(\Theta_i) = n_i\log\bigl(1+e^{\Theta_i}\bigr)\).

For N observations the log-likelihood function, ignoring the constant, has the following form: \[ \ell=\log \bigl[ \prod_{i=1}^{N}f(y_i;\Theta_i)\bigr]=\sum_{i=1}^{N}\bigl\{y_i\eta_i-n_i\log(1+e^{\eta_i})\bigr\} \] Linking \(\mu_i\) and \(\eta_i\) via the logit gives
\[ \eta_i = g(\mu_i) = \log\frac{\mu_i}{n_i-\mu_i} = \log\frac{p_i}{1-p_i} =\mathbf{x_i}^\top\mathbf{\beta}=\sum_{j=1}^{p}x_{ij}\beta_j \] Note that \(\ell\) is a function of \(\mathbf{\eta}\), which is in turn a function of \(\mathbf{\beta}\). Therefore, by the chain rule: \[ \frac{\partial\ell}{\partial\beta}=\frac{\partial\ell}{\partial\eta}\frac{\partial\eta}{\partial\beta} \]

4.2 Log-Likelihood function, Score vector, and Hessian matrix

  1. Recall the log‑likelihood (up to a constant): \[ \ell(\beta) = \sum_{i=1}^N \bigl[y_i\,\eta_i - n_i\log(1+e^{\eta_i})\bigr]. \]
  2. Calculate the Score (gradient): \[ \frac{\partial\ell}{\partial\beta_j}=\sum_{i=1}^N \bigl[y_i - n_i\frac{e^{\eta_i}}{1+e^{\eta_i}}\bigr]x_{ij}=\sum_{i=1}^N \bigl[y_i - n_ip_i\bigr]x_{ij}=\sum_{i=1}^N \bigl[y_i - \mu_i\bigr]x_{ij}\\ \] In the matrix form: \[ U(\mathbf{\beta}) = X^\top(\mathbf{y} - \mathbf{\mu}) \]
  3. Calculate the Fisher information (negative Hessian): \[ -\frac{\partial}{\partial\beta_k}\Bigl( \frac{\partial\ell}{\partial\beta_j}\Bigr)= -\frac{\partial}{\partial\beta_k}\Bigl(\sum_{i=1}^N \bigl[y_i - n_i\frac{e^{\eta_i}}{1+e^{\eta_i}}\bigr]x_{ij} \Bigr) = \sum_{i=1}^N n_i\frac{e^{\eta_i}}{(1+e^{\eta_i})^2}x_{ij}x_{ik} = \sum_{i=1}^N \bigl[n_ip_i(1-p_i)\bigr]x_{ij}x_{ik}\\ \] In the matrix form: \[ I(\mathbf{\beta})= X^\top W X, \quad W =\mathrm{diag}(w_{11}, \ldots ,w_{NN}), \quad w_{ii}= n_ip_i(1-p_i) \]
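
The R sketch below (simulated Bernoulli data with \(n_i=1\); all names are illustrative) builds the score vector and the Fisher information in matrix form, and checks the first score component against a finite-difference derivative of the log-likelihood:

set.seed(1)
N    = 200
X    = cbind(1, rnorm(N))                       # intercept + one covariate
beta = c(-0.5, 1.0)
eta  = drop(X %*% beta)
p    = 1 / (1 + exp(-eta))
y    = rbinom(N, size = 1, prob = p)            # n_i = 1 for every observation

mu     = p                                      # n_i * p_i with n_i = 1
U      = t(X) %*% (y - mu)                      # score vector X'(y - mu)
W      = diag(p * (1 - p))                      # weights n_i * p_i * (1 - p_i)
I_beta = t(X) %*% W %*% X                       # Fisher information X'WX

# Finite-difference check of the first score component
loglik = function(b) sum(y * (X %*% b) - log(1 + exp(X %*% b)))
h = 1e-6
c(analytic = U[1],
  numeric  = (loglik(beta + c(h, 0)) - loglik(beta - c(h, 0))) / (2 * h))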

4.3 Newton-Raphson & Weighted Least Squares

The Newton update
\[ \mathbf{\beta^{(t)}}= \mathbf{\beta^{(t-1)}}+ \bigl(X^\top W^{(t-1)}X\bigr)^{-1} X^\top\bigl(\mathbf{y} - \mathbf{\mu^{(t-1)}}\bigr) \]
can be rewritten as: \[ \mathbf{\beta^{(t)}}= \bigl(X^\top W^{(t-1)}X\bigr)^{-1} X^\top W^{(t-1)}X\mathbf{\beta^{(t-1)}}+ \bigl(X^\top W^{(t-1)}X\bigr)^{-1} X^\top W^{(t-1)}(W^{(t-1)})^{-1} \bigl(\mathbf{y} - \mathbf{\mu^{(t-1)}}\bigr)\\ =\bigl(X^\top W^{(t-1)}X\bigr)^{-1} X^\top W^{(t-1)}\bigl[ X\mathbf{\beta^{(t-1)}}+(W^{(t-1)})^{-1} \bigl(\mathbf{y} - \mathbf{\mu^{(t-1)}}\bigr)\bigr]\\ =\bigl(X^\top W^{(t-1)}X\bigr)^{-1} X^\top W^{(t-1)}\mathbf{z^{(t-1)}}, \] where \(\mathbf{z^{(t-1)}}=X\mathbf{\beta^{(t-1)}}+(W^{(t-1)})^{-1}(\mathbf{y}-\mathbf{\mu^{(t-1)}})\) is the working response.
This is equivalent to solving the weighted least-squares problem (compare the OLS solution \((X^{T}X)^{-1}X^{T}\mathbf{y}\) with \((X^{T}WX)^{-1}X^{T}W\mathbf{z}\), i.e. OLS applied to the rescaled quantities \(W^{1/2}X\) and \(W^{1/2}\mathbf{z}\)):
\[ \underset{\beta}{\mathrm{argmin}}\;(\mathbf{z}-X\mathbf{\beta})^{T}W(\mathbf{z}-X\mathbf{\beta}), \] which gives the IRLS method its name.
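
As a concrete check, the following R sketch (simulated Bernoulli data with \(n_i=1\); names are illustrative) verifies that one Newton step equals one weighted least-squares fit of the working response \(z\) on \(X\). Here lm() with the weights argument solves exactly the weighted least-squares problem above:

set.seed(1)
N    = 200
X    = cbind(1, rnorm(N))                       # design matrix with intercept
y    = rbinom(N, size = 1, prob = 0.4)          # Bernoulli responses (n_i = 1)
beta = c(0, 0)                                  # current iterate beta^(t-1)

eta = drop(X %*% beta)
p   = 1 / (1 + exp(-eta))
mu  = p                                         # n_i * p_i with n_i = 1
w   = p * (1 - p)                               # diagonal of W
z   = eta + (y - mu) / w                        # working response

newton = beta + solve(t(X) %*% (X * w), t(X) %*% (y - mu))   # one Newton step
wls    = coef(lm(z ~ X - 1, weights = w))                    # one weighted least-squares fit
cbind(newton = drop(newton), wls = wls)                      # the two columns coincide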

5 R Implementation

5.1 Notes

  • Under the canonical logit link, Fisher scoring coincides exactly with the Newton–Raphson procedure.
  • In many practical settings each observation is a single Bernoulli trial (i.e. \(Y_i\in\{0,1\}\)), so one may simply set \(n_i=1\) and the same IRLS updates apply.
  • A recommended initialization for \(\beta\) is obtained by regressing the empirical logit \[ \log\frac{y_i + \varepsilon}{\,1 - y_i + \varepsilon} \] on the covariates. Here a small constant \(\varepsilon>0\) is added to numerator and denominator to guard against infinite values and improve numerical stability.

5.2 Custom Function

IRLS_logistic_binomial = function(data_matrix, Y, ni = 1, maxiter = 25, tol = 1e-6) {
  
  n = nrow(data_matrix)
  X = data_matrix                                   # first column is assumed to be the intercept (all 1s)
  
  # 1. Initialization: regress the empirical logit on the covariates (+0.5 for stability)
  y_init = log((Y + 0.5) / (1 - Y + 0.5))
  lm0    = lm(y_init ~ X[ , -1])
  betas  = matrix(coef(lm0), ncol = 1)
  
  # 2. IRLS iterations
  for (iter in seq_len(maxiter)) {
    eta = X %*% betas                               # linear predictor
    p   = exp(eta) / (1 + exp(eta))                 # fitted probabilities
    mu  = ni * p                                    # fitted means n_i * p_i
    W   = diag(as.numeric(mu * (1 - p)), n, n)      # weights n_i * p_i * (1 - p_i)
    # Equivalent element-wise form: z = eta + (Y - mu) / (ni * p * (1 - p))
    z   = eta + solve(W) %*% (Y - mu)               # working response
    
    # Weighted least-squares update: (X'WX)^{-1} X'Wz
    betas_new = solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
    
    if (max(abs(betas_new - betas)) < tol) {
      betas = betas_new
      break
    }
    betas = betas_new
  }
  
  return(betas)                                     # estimated coefficients as a column matrix
}

5.3 Test data

We will use the S&P Stock Market Data (Smarket) included in the ISLR package. The objective is to predict the direction of the market (a positive or negative return) from the percentage returns over the previous five days and the volume traded.

# install.packages("ISLR")

require(ISLR)
## Loading required package: ISLR
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up

5.4 Built-in function

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
              data = Smarket, 
              family = binomial(link="logit"))

summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial(link = "logit"), data = Smarket)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Note that the convergence criterion is met after 3 iterations of Fisher scoring.

5.5 Direct calculations

We will run the same 3 iterations to check whether our results match those above. Some pre-processing is required for the direct calculation.

x = Smarket[, 2:7]                                      # returns at Lags 1-5 and Volume
y = matrix(as.numeric(Smarket[, 9] == "Up"), ncol = 1)  # convert response to a 0/1 column matrix

x = cbind(matrix(rep(1, length(y)), ncol = 1), x)       # add a column of 1s for the intercept
colnames(x) = NULL                                      # remove column names
rownames(x) = NULL                                      # remove row names
x = as.matrix(x)                                        # convert to matrix

round(IRLS_logistic_binomial(x,y,maxiter = 3), 6)                               # 3 iterations
##           [,1]
## [1,] -0.126000
## [2,] -0.073074
## [3,] -0.042301
## [4,]  0.011085
## [5,]  0.009359
## [6,]  0.010313
## [7,]  0.135441

Indeed, our results match the coefficients reported by glm() exactly.