Title: | Generalised Linear Models by Subsampling and One-Step Polishing |
---|---|
Description: | Fast fitting of generalised linear models on moderately large datasets, by taking an initial sample, fitting in memory, then evaluating the score function for the full data in the database. Thomas Lumley <doi:10.1080/10618600.2019.1610312>. |
Authors: | Thomas Lumley [aut, cph], Shangqing Cao [ctb, cre] |
Maintainer: | Shangqing Cao <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2024-12-09 07:22:42 UTC |
Source: | CRAN |
Fast generalized linear model in a database
dbglm(formula, family = binomial(), tbl, sd = FALSE, weights = .NotYetImplemented(), subset = .NotYetImplemented(), ...)
... |
This argument is required for S3 method extension. |
formula |
A model formula. It can have interactions but cannot have any transformations except |
family |
Model family |
tbl |
An object inheriting from |
sd |
Experimental: compute the standard deviation of the score as well as the mean in the update and use it to improve the information matrix estimate |
weights |
Weights are not supported.
subset |
If you want to analyze a subset, use |
For a dataset of size N, the subsample is of size N^(5/9). Unless N is large, the approximation won't be very good. Also, with small N it's quite likely that, e.g., some factor levels will be missing in the subsample.
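To give a sense of scale, the N^(5/9) rule can be evaluated directly. The snippet below is only an illustration: the helper name `subsample_size` is hypothetical and not part of the package, and rounding up to a whole number of rows is an assumption.

```r
# Hypothetical helper (not part of dbglm): subsample size under the N^(5/9) rule.
# Rounding up with ceiling() is an assumption made for illustration.
subsample_size <- function(N) ceiling(N^(5/9))

# Compare the subsample size with the full data size for a few values of N.
N <- c(1e4, 1e6, 1e8)
sizes <- data.frame(
  N        = N,
  n        = subsample_size(N),
  fraction = subsample_size(N) / N
)
print(sizes)
```

The sampled fraction shrinks as N grows, which is why the approach is aimed at larger datasets.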
A list with elements
tildebeta |
coefficients from subsample |
hatbeta |
final coefficient estimate
tildeV |
variance matrix from subsample |
hatV |
final variance matrix estimate
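How the subsample and final quantities relate can be sketched in plain R, without a database. This is a minimal illustration of a generic one-step update for logistic regression on simulated data, not the package's implementation: the simulated data, the n/N rescaling of the subsample variance, and every variable name other than `tildebeta`, `hatbeta`, `tildeV` and `hatV` are assumptions.

```r
set.seed(1)

# Simulated "full" data (assumption: stands in for a database table).
N <- 100000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-1 + 0.5 * x))
full <- data.frame(x = x, y = y)

# Step 1: fit the model in memory on a subsample of size about N^(5/9).
n   <- ceiling(N^(5/9))
sub <- full[sample.int(N, n), ]
fit <- glm(y ~ x, family = binomial(), data = sub)
tildebeta <- coef(fit)   # coefficients from subsample
tildeV    <- vcov(fit)   # variance matrix from subsample

# Step 2: evaluate the score at tildebeta on the full data.
# For the logistic model the score is X'(y - mu).
X     <- model.matrix(~ x, data = full)
mu    <- plogis(drop(X %*% tildebeta))
score <- crossprod(X, full$y - mu)

# Step 3: one Newton-type step. The full-data information is approximated
# by rescaling the subsample information, so its inverse is (n / N) * tildeV
# (an assumption of this sketch).
hatV    <- (n / N) * tildeV
hatbeta <- tildebeta + drop(hatV %*% score)

print(rbind(tildebeta = tildebeta, hatbeta = hatbeta))
```

In this sketch only the score sum needs the full data, which is the part the package description says is evaluated in the database.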
http://notstatschat.tumblr.com/post/171570186286/faster-generalised-linear-models-in-largeish-data
Data of vehicles registered in New Zealand as of November 2017
data(fleet1)
A tibble with 10000 rows and 34 variables:
character colour of the car
numeric horsepower of the car
numeric mass of the vehicle in kg
numeric number of seats in the car