Re: bigglm() results different from glm()

10 messages Options
Embed this post
Permalink
utkarshsinghal

Re: bigglm() results different from glm()

Reply Threaded More More options
Print post
Permalink
Hi Sir,

Thanks for making package available to us. I am facing few problems if
you can give some hints:

Problem-1:
The model summary and residual deviance matched (in the mail below) but
I didn't understand why AIC is still different.

 > AIC(m1)
[1] 532965

 > AIC(m1big_longer)
[1] 101442.9


Problem-2:
chunksize argument is there in bigglm but not in biglm, consequently,
udate.biglm is there, but not update.bigglm
Is my observation correct? If yes, why is this difference?


Regards
Utkarsh

/

/
From: Thomas Lumley <tlumley_at_u.washington.edu
<mailto:tlumley_at_u.washington.edu?Subject=Re:%20%5BR%5D%20bigglm%28%29%20results%20different%20from%20glm%28%29>>

Date: Tue, 17 Mar 2009 00:50:20 -0700 (PDT)

This is a surprisingly interesting problem that took a while to debug,
because the computations all seemed correct.

Your model hasn't converged yet. You can get the right answer either by
running longer:

/> summary(m1big_longer) /

Large data regression model: bigglm(y ~ ttment, data = dat, family =
poisson(link = "log"),

     chunksize = 100000, maxit = 20)
Sample size = 100000

              Coef (95% CI) SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2 0.405 0.401 0.408 0.002 0

or supplying starting values:

/> summary(m1big_started) /

Large data regression model: bigglm(y ~ ttment, data = dat, family =
poisson(link = "log"),

     chunksize = 100000, start = c(2, 0)) Sample size = 100000

              Coef (95% CI) SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2 0.405 0.401 0.408 0.002 0

The bug is that you weren't told about the lack of convergence. There is
a flag in the object, but it is only set after the model is converged,
so it is not there when convergence fails.

/> m1big$converged /

*NULL *
/> m1big_longer$converged /

*[1] TRUE *
/> m1big_started$converged /

*[1] TRUE *For the next version I will make sure there is a clear
warning when the model hasn't converged. The default maximum number of
iterations is fairly small, by design --- if it isn't working, you want
to find out and specify starting values rather than wait for dozens of
potentially slow iterations. This strategy obviously breaks down when
you don't notice that failure. :(

      -thomas

On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:

 > Dear all,
<http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8169.html#8182qlink1>
/> /
/> I am using the bigglm package to fit a few GLM's to a large dataset
(3 million /
/> rows, 6 columns). While trying to fit a Poisson GLM I noticed that the /
/> coefficient estimates were very different from what I obtained when
estimating /
/> the model on a smaller dataset using glm(), I wrote a very basic toy
example to /
/> compare the results of bigglm() against a glm() call. Consider the
following /
/> code: /
/> /
/> /
/>> require(biglm) /
/>> options(digits=6, scipen=3, contrasts = c("contr.treatment",
"contr.poly")) /
/>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
ttment=gl(2,50000)) /
/>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log")) /
/>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log")) /
/>> summary(m1) /
/> /
/> <snipped output for this email> /
/> Coefficients: /
/> Estimate Std. Error z value Pr(>|z|) /
/> (Intercept) 2.30305 0.00141 1629 <2e-16 *** /
/> ttment2 0.40429 0.00183 221 <2e-16 *** /
/> /
/> Null deviance: 151889 on 99999 degrees of freedom /
/> Residual deviance: 101848 on 99998 degrees of freedom /
*/> AIC: 533152 /
*/> /
/>> summary(m1big) /
/> Large data regression model: bigglm(y ~ ttment, data = dat, family = /
/> poisson(link = "log")) /
/> Sample size = 100000 /
/> Coef (95% CI) SE p /
/> (Intercept) 2.651 2.650 2.653 0.001 0 /
/> ttment2 4.346 4.344 4.348 0.001 0 /
/> /
/>> m1big$deviance /
/> [1] 287158986 /
/> /
/> /
/> Notice that the coefficients and deviance are quite different in the
model /
/> estimated using bigglm(). If I change the chunk to
seq(1000,10000,1000) the /
/> estimates remain the same. /
/> /
/> Can someone help me understand what is causing these differences? /
/> /
/> Here is my version info: /
/> /
/>> version /
/> _ /
/> platform i386-pc-mingw32 /
/> arch i386 /
/> os mingw32 /
/> system i386, mingw32 /
/> status /
/> major 2 /
/> minor 8.1 /
/> year 2008 /
/> month 12 /
/> day 22 /
/> svn rev 47281 /
/> language R /
/> version.string R version 2.8.1 (2008-12-22) /
/> /
/> /
/> Many thanks in advance for your help, /
/> /
/> Francisco /
/> /
/> -- /
/> Francisco J. Zagmutt /
/> Vose Consulting /
/> 2891 20th Street /
/> Boulder, CO, 80304 /
*/> USA /
*/> www.voseconsulting.com /
/> /
/> ______________________________________________ /
/> R-help_at_r-project.org mailing list /
/> https://stat.ethz.ch/mailman/listinfo/r-help /
/> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html /
/> and provide commented, minimal, self-contained, reproducible code. /
/> /

Thomas Lumley Assoc. Professor, Biostatistics
tlumley_at_u.washington.edu University of Washington, Seattle


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Greg Snow-2

Re: bigglm() results different from glm()

Reply Threaded More More options
Print post
Permalink
I don't know why the AIC is different, but remember that there are multiple definitions for AIC (generally differing in the constant added) and it may just be a difference in the constant, or it could be that you have not fit the whole dataset (based on your other question).

For an lm model biglm only needs to make a single pass through the data.  This was the first function written for the package and the update mechanism was an easy way to write the function (and still works well).

The bigglm function came later and the models other than Gaussian require multiple passes through the data so instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart to the beginning of the dataset.

Also note that the bigglm function usually only does a few passes through the data, usually this is good enough, but in some cases you may need to increase the number of passes.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111


> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-
> project.org] On Behalf Of utkarshsinghal
> Sent: Friday, July 03, 2009 7:26 AM
> To: [hidden email]; r help
> Subject: Re: [R] bigglm() results different from glm()
>
> Hi Sir,
>
> Thanks for making package available to us. I am facing few problems if
> you can give some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below) but
> I didn't understand why AIC is still different.
>
>  > AIC(m1)
> [1] 532965
>
>  > AIC(m1big_longer)
> [1] 101442.9
>
>
> Problem-2:
> chunksize argument is there in bigglm but not in biglm, consequently,
> udate.biglm is there, but not update.bigglm
> Is my observation correct? If yes, why is this difference?
>
>
> Regards
> Utkarsh
>
> /
>
> /
> From: Thomas Lumley <tlumley_at_u.washington.edu
> <mailto:tlumley_at_u.washington.edu?Subject=Re:%20%5BR%5D%20bigglm%28%2
> 9%20results%20different%20from%20glm%28%29>>
>
> Date: Tue, 17 Mar 2009 00:50:20 -0700 (PDT)
>
> This is a surprisingly interesting problem that took a while to debug,
> because the computations all seemed correct.
>
> Your model hasn't converged yet. You can get the right answer either by
> running longer:
>
> /> summary(m1big_longer) /
>
> Large data regression model: bigglm(y ~ ttment, data = dat, family =
> poisson(link = "log"),
>
>      chunksize = 100000, maxit = 20)
> Sample size = 100000
>
>               Coef (95% CI) SE p
> (Intercept) 2.304 2.301 2.307 0.001 0
> ttment2 0.405 0.401 0.408 0.002 0
>
> or supplying starting values:
>
> /> summary(m1big_started) /
>
> Large data regression model: bigglm(y ~ ttment, data = dat, family =
> poisson(link = "log"),
>
>      chunksize = 100000, start = c(2, 0)) Sample size = 100000
>
>               Coef (95% CI) SE p
> (Intercept) 2.304 2.301 2.307 0.001 0
> ttment2 0.405 0.401 0.408 0.002 0
>
> The bug is that you weren't told about the lack of convergence. There
> is
> a flag in the object, but it is only set after the model is converged,
> so it is not there when convergence fails.
>
> /> m1big$converged /
>
> *NULL *
> /> m1big_longer$converged /
>
> *[1] TRUE *
> /> m1big_started$converged /
>
> *[1] TRUE *For the next version I will make sure there is a clear
> warning when the model hasn't converged. The default maximum number of
> iterations is fairly small, by design --- if it isn't working, you want
> to find out and specify starting values rather than wait for dozens of
> potentially slow iterations. This strategy obviously breaks down when
> you don't notice that failure. :(
>
>       -thomas
>
> On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:
>
>  > Dear all,
> <http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8169.html#8182qlink1>
> /> /
> /> I am using the bigglm package to fit a few GLM's to a large dataset
> (3 million /
> /> rows, 6 columns). While trying to fit a Poisson GLM I noticed that
> the /
> /> coefficient estimates were very different from what I obtained when
> estimating /
> /> the model on a smaller dataset using glm(), I wrote a very basic toy
> example to /
> /> compare the results of bigglm() against a glm() call. Consider the
> following /
> /> code: /
> /> /
> /> /
> />> require(biglm) /
> />> options(digits=6, scipen=3, contrasts = c("contr.treatment",
> "contr.poly")) /
> />> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
> ttment=gl(2,50000)) /
> />> m1 <- glm(y~ttment, data=dat, family=poisson(link="log")) /
> />> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log")) /
> />> summary(m1) /
> /> /
> /> <snipped output for this email> /
> /> Coefficients: /
> /> Estimate Std. Error z value Pr(>|z|) /
> /> (Intercept) 2.30305 0.00141 1629 <2e-16 *** /
> /> ttment2 0.40429 0.00183 221 <2e-16 *** /
> /> /
> /> Null deviance: 151889 on 99999 degrees of freedom /
> /> Residual deviance: 101848 on 99998 degrees of freedom /
> */> AIC: 533152 /
> */> /
> />> summary(m1big) /
> /> Large data regression model: bigglm(y ~ ttment, data = dat, family =
> /
> /> poisson(link = "log")) /
> /> Sample size = 100000 /
> /> Coef (95% CI) SE p /
> /> (Intercept) 2.651 2.650 2.653 0.001 0 /
> /> ttment2 4.346 4.344 4.348 0.001 0 /
> /> /
> />> m1big$deviance /
> /> [1] 287158986 /
> /> /
> /> /
> /> Notice that the coefficients and deviance are quite different in the
> model /
> /> estimated using bigglm(). If I change the chunk to
> seq(1000,10000,1000) the /
> /> estimates remain the same. /
> /> /
> /> Can someone help me understand what is causing these differences? /
> /> /
> /> Here is my version info: /
> /> /
> />> version /
> /> _ /
> /> platform i386-pc-mingw32 /
> /> arch i386 /
> /> os mingw32 /
> /> system i386, mingw32 /
> /> status /
> /> major 2 /
> /> minor 8.1 /
> /> year 2008 /
> /> month 12 /
> /> day 22 /
> /> svn rev 47281 /
> /> language R /
> /> version.string R version 2.8.1 (2008-12-22) /
> /> /
> /> /
> /> Many thanks in advance for your help, /
> /> /
> /> Francisco /
> /> /
> /> -- /
> /> Francisco J. Zagmutt /
> /> Vose Consulting /
> /> 2891 20th Street /
> /> Boulder, CO, 80304 /
> */> USA /
> */> www.voseconsulting.com /
> /> /
> /> ______________________________________________ /
> /> R-help_at_r-project.org mailing list /
> /> https://stat.ethz.ch/mailman/listinfo/r-help /
> /> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html /
> /> and provide commented, minimal, self-contained, reproducible code. /
> /> /
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley_at_u.washington.edu University of Washington, Seattle
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
utkarshsinghal

Re: bigglm() results different from glm()

Reply Threaded More More options
Print post
Permalink
In reply to this post by utkarshsinghal
Thank you Mr. Lumley and Mr. Greg. That was helpful.

Regards
Utkarsh



Thomas Lumley wrote:

> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>>
>> Hi Sir,
>>
>> Thanks for making package available to us. I am facing few problems
>> if you can give some hints:
>>
>> Problem-1:
>> The model summary and residual deviance matched (in the mail below)
>> but I didn't understand why AIC is still different.
>>
>>> AIC(m1)
>> [1] 532965
>>
>>> AIC(m1big_longer)
>> [1] 101442.9
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance.  Only differences in AIC between models
> are meaningful, not individual values.
>
>>
>> Problem-2:
>> chunksize argument is there in bigglm but not in biglm, consequently,
>> udate.biglm is there, but not update.bigglm
>> Is my observation correct? If yes, why is this difference?
>>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over.  You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
>       -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> [hidden email]    University of Washington, Seattle
>
>
>
>

I don't know why the AIC is different, but remember that there are multiple definitions for AIC (generally differing in the constant added) and it may just be a difference in the constant, or it could be that you have not fit the whole dataset (based on your other question).

For an lm model biglm only needs to make a single pass through the data.  This was the first function written for the package and the update mechanism was an easy way to write the function (and still works well).

The bigglm function came later and the models other than Gaussian require multiple passes through the data so instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart to the beginning of the dataset.

Also note that the bigglm function usually only does a few passes through the data, usually this is good enough, but in some cases you may need to increase the number of passes.

Hope this helps,

--

Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
utkarshsinghal

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink

The AIC of the biglm models is highly dependent on the size of chunks
selected (example provided below). This I can somehow expect because the
model error will increase with the number of chunks.

It will be helpful if you can provide your opinion for comparing
different models in such cases:

    * can I compare two models fitted with different chunksizes, or
      should I always use the same chunk size.

    * although I am not going to use AIC at all in my model selection,
      but I think any other model parameters will also vary in the same
      way. Am I right?
    * what would be the ideal chunksize? should it be the maximum
      possible size R and my system's RAM is able to handle?

Any comments will be helpful.


*Example of AIC variation with chunksize:*

I ran the following code on my data which has 10000 observations and 3
independent variables

 > chunksize = 500
 > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
 > for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
data=xx[(i+1):(i+chunksize),])
 > AIC(fit)
[1] 30647.79

Here are the AIC for other chunksizes:
chunksize    AIC
500          30647.79
1000        29647.79
2000        27647.79
2500        26647.79
5000        21647.79
10000      11647.79


Regards
Utkarsh




utkarshsinghal wrote:

> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>>
>>>
>>> Hi Sir,
>>>
>>> Thanks for making package available to us. I am facing few problems
>>> if you can give some hints:
>>>
>>> Problem-1:
>>> The model summary and residual deviance matched (in the mail below)
>>> but I didn't understand why AIC is still different.
>>>
>>>> AIC(m1)
>>> [1] 532965
>>>
>>>> AIC(m1big_longer)
>>> [1] 101442.9
>>
>> That's because AIC.default uses the unnormalized loglikelihood and
>> AIC.biglm uses the deviance.  Only differences in AIC between models
>> are meaningful, not individual values.
>>
>>>
>>> Problem-2:
>>> chunksize argument is there in bigglm but not in biglm,
>>> consequently, udate.biglm is there, but not update.bigglm
>>> Is my observation correct? If yes, why is this difference?
>>>
>>
>> Because update.bigglm is impossible.
>>
>> Fitting a glm requires iteration, which means that it requires
>> multiple passes through the data. Fitting a linear model requires
>> only a single pass. update.biglm can take a fitted or partially
>> fitted biglm and add more data. To do the same thing for a bigglm you
>> would need to start over again from the beginning of the data set.
>>
>> To fit a glm, you need to specify a data source that bigglm() can
>> iterate over.  You do this with a function that can be called
>> repeatedly to return the next chunk of data.
>>
>>       -thomas
>>
>> Thomas Lumley            Assoc. Professor, Biostatistics
>> [hidden email]    University of Washington, Seattle
>>
>>
>>
>>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data.  This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function usually only does a few passes
> through the data, usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Greg Snow-2

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink
Are you sure that you are fitting all the models on the same total data?  A first glance looks like you may be including more data in some of the chunk sizes, or be producing an error that update does not know how to deal with.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111

From: utkarshsinghal [mailto:[hidden email]]
Sent: Monday, July 06, 2009 8:58 AM
To: Thomas Lumley; Greg Snow
Cc: r help
Subject: Re: [R] bigglm() results different from glm()+Another question


The AIC of the biglm models is highly dependent on the size of chunks selected (example provided below). This I can somehow expect because the model error will increase with the number of chunks.

It will be helpful if you can provide your opinion for comparing different models in such cases:

 *   can I compare two models fitted with different chunksizes, or should I always use the same chunk size.

 *   although I am not going to use AIC at all in my model selection, but I think any other model parameters will also vary in the same way. Am I right?
 *   what would be the ideal chunksize? should it be the maximum possible size R and my system's RAM is able to handle?
Any comments will be helpful.


Example of AIC variation with chunksize:

I ran the following code on my data which has 10000 observations and 3 independent variables

> chunksize = 500
> fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> AIC(fit)
[1] 30647.79

Here are the AIC for other chunksizes:
chunksize    AIC
500          30647.79
1000        29647.79
2000        27647.79
2500        26647.79
5000        21647.79
10000      11647.79


Regards
Utkarsh




utkarshsinghal wrote:
Thank you Mr. Lumley and Mr. Greg. That was helpful.

Regards
Utkarsh



Thomas Lumley wrote:

On Fri, 3 Jul 2009, utkarshsinghal wrote:



Hi Sir,

Thanks for making package available to us. I am facing few problems if you can give some hints:

Problem-1:
The model summary and residual deviance matched (in the mail below) but I didn't understand why AIC is still different.


AIC(m1)
[1] 532965


AIC(m1big_longer)
[1] 101442.9

That's because AIC.default uses the unnormalized loglikelihood and AIC.biglm uses the deviance.  Only differences in AIC between models are meaningful, not individual values.



Problem-2:
chunksize argument is there in bigglm but not in biglm, consequently, udate.biglm is there, but not update.bigglm
Is my observation correct? If yes, why is this difference?

Because update.bigglm is impossible.

Fitting a glm requires iteration, which means that it requires multiple passes through the data. Fitting a linear model requires only a single pass. update.biglm can take a fitted or partially fitted biglm and add more data. To do the same thing for a bigglm you would need to start over again from the beginning of the data set.

To fit a glm, you need to specify a data source that bigglm() can iterate over.  You do this with a function that can be called repeatedly to return the next chunk of data.

      -thomas

Thomas Lumley            Assoc. Professor, Biostatistics
[hidden email]<mailto:[hidden email]>    University of Washington, Seattle




I don't know why the AIC is different, but remember that there are multiple definitions for AIC (generally differing in the constant added) and it may just be a difference in the constant, or it could be that you have not fit the whole dataset (based on your other question).

For an lm model biglm only needs to make a single pass through the data.  This was the first function written for the package and the update mechanism was an easy way to write the function (and still works well).

The bigglm function came later and the models other than Gaussian require multiple passes through the data so instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart to the beginning of the dataset.

Also note that the bigglm function usually only does a few passes through the data, usually this is good enough, but in some cases you may need to increase the number of passes.

Hope this helps,


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
utkarshsinghal

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink
Trust me, it is the same total data I am using, even the chunksizes are
all equal. I also crosschecked by manually creating the chunks and
updating as in example given on biglm help page.
 > ?biglm


Regards
Utkarsh



Greg Snow wrote:

>
> Are you sure that you are fitting all the models on the same total
> data?  A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> [hidden email]
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:[hidden email]]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
>
> The AIC of the biglm models is highly dependent on the size of chunks
> selected (example provided below). This I can somehow expect because
> the model error will increase with the number of chunks.
>
> It will be helpful if you can provide your opinion for comparing
> different models in such cases:
>
>     * can I compare two models fitted with different chunksizes, or
>       should I always use the same chunk size.
>
>     * although I am not going to use AIC at all in my model selection,
>       but I think any other model parameters will also vary in the
>       same way. Am I right?
>     * what would be the ideal chunksize? should it be the maximum
>       possible size R and my system's RAM is able to handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data which has 10000 observations and 3
> independent variables
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
> chunksize    AIC
> 500          30647.79
> 1000        29647.79
> 2000        27647.79
> 2500        26647.79
> 5000        21647.79
> 10000      11647.79
>
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
> Hi Sir,
>
> Thanks for making package available to us. I am facing few problems if
> you can give some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below)
> but I didn't understand why AIC is still different.
>
>
> AIC(m1)
>
> [1] 532965
>
>
> AIC(m1big_longer)
>
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance.  Only differences in AIC between models
> are meaningful, not individual values.
>
>
>
> Problem-2:
> chunksize argument is there in bigglm but not in biglm, consequently,
> udate.biglm is there, but not update.bigglm
> Is my observation correct? If yes, why is this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over.  You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
>       -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> [hidden email] <mailto:[hidden email]>    
> University of Washington, Seattle
>
>
>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data.  This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function usually only does a few passes
> through the data, usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,
>
>  
>


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Greg Snow-2

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink
How many rows does xx have?

Let's look at your example for chunksize 10000, you initially fit the first 10000 observations, then the seq results in just the value 10000 which means that you do the update based on vaues 10001 through 20000, if xx only has 10000 rows, then this should give at least one error.  If xx has 20000 or more rows, then only chunksize 10000 will ever see the 20000th value, the other chunksizes will use less of the data.

Also looking at the help for update.biglm, the 2nd argument is "moredata" not "data", so if the code below is the code that you actually ran, then the new data chunks are going into the "..." argument (and being ignored as that is there for future expansion and does nothing yet) and the "moredata" argument is left empty, which should also be giving an error.  For the code below, the model is only being fit to the initial chunk and never updated, so with different chunk sizes, there is different amounts of data per model.  You can check this by doing summary(fit) and looking at the sample size in the 2nd line.

It is easier for us to help you if you provide code that can be run by copying and pasting (we don't have xx, so we can't just run the code below, you could include a line to randomly generate an xx, or a link to where a copy of xx can be downloaded from).  It also helps if you mention any errors or warnings that you receive in the process of running your code.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111

From: utkarshsinghal [mailto:[hidden email]]
Sent: Tuesday, July 07, 2009 12:10 AM
To: Greg Snow
Cc: Thomas Lumley; r help
Subject: Re: [R] bigglm() results different from glm()+Another question

Trust me, it is the same total data I am using, even the chunksizes are all equal. I also crosschecked by manually creating the chunks and updating as in example given on biglm help page.
> ?biglm


Regards
Utkarsh



Greg Snow wrote:
Are you sure that you are fitting all the models on the same total data?  A first glance looks like you may be including more data in some of the chunk sizes, or be producing an error that update does not know how to deal with.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]<mailto:[hidden email]>
801.408.8111

From: utkarshsinghal [mailto:[hidden email]]
Sent: Monday, July 06, 2009 8:58 AM
To: Thomas Lumley; Greg Snow
Cc: r help
Subject: Re: [R] bigglm() results different from glm()+Another question


The AIC of the biglm models is highly dependent on the size of chunks selected (example provided below). This I can somehow expect because the model error will increase with the number of chunks.

It will be helpful if you can provide your opinion for comparing different models in such cases:

 *   can I compare two models fitted with different chunksizes, or should I always use the same chunk size.

 *   although I am not going to use AIC at all in my model selection, but I think any other model parameters will also vary in the same way. Am I right?
 *   what would be the ideal chunksize? should it be the maximum possible size R and my system's RAM is able to handle?
Any comments will be helpful.


Example of AIC variation with chunksize:

I ran the following code on my data which has 10000 observations and 3 independent variables

> chunksize = 500
> fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> AIC(fit)
[1] 30647.79

Here are the AIC for other chunksizes:
chunksize    AIC
500          30647.79
1000        29647.79
2000        27647.79
2500        26647.79
5000        21647.79
10000      11647.79


Regards
Utkarsh




utkarshsinghal wrote:
Thank you Mr. Lumley and Mr. Greg. That was helpful.

Regards
Utkarsh



Thomas Lumley wrote:


On Fri, 3 Jul 2009, utkarshsinghal wrote:




Hi Sir,

Thanks for making package available to us. I am facing few problems if you can give some hints:

Problem-1:
The model summary and residual deviance matched (in the mail below) but I didn't understand why AIC is still different.



AIC(m1)
[1] 532965



AIC(m1big_longer)
[1] 101442.9

That's because AIC.default uses the unnormalized loglikelihood and AIC.biglm uses the deviance.  Only differences in AIC between models are meaningful, not individual values.




Problem-2:
chunksize argument is there in bigglm but not in biglm, consequently, udate.biglm is there, but not update.bigglm
Is my observation correct? If yes, why is this difference?

Because update.bigglm is impossible.

Fitting a glm requires iteration, which means that it requires multiple passes through the data. Fitting a linear model requires only a single pass. update.biglm can take a fitted or partially fitted biglm and add more data. To do the same thing for a bigglm you would need to start over again from the beginning of the data set.

To fit a glm, you need to specify a data source that bigglm() can iterate over.  You do this with a function that can be called repeatedly to return the next chunk of data.

      -thomas

Thomas Lumley            Assoc. Professor, Biostatistics
[hidden email]<mailto:[hidden email]>    University of Washington, Seattle





I don't know why the AIC is different, but remember that there are multiple definitions for AIC (generally differing in the constant added) and it may just be a difference in the constant, or it could be that you have not fit the whole dataset (based on your other question).

For an lm model biglm only needs to make a single pass through the data.  This was the first function written for the package and the update mechanism was an easy way to write the function (and still works well).

The bigglm function came later and the models other than Gaussian require multiple passes through the data so instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart to the beginning of the dataset.

Also note that the bigglm function usually only does a few passes through the data, usually this is good enough, but in some cases you may need to increase the number of passes.

Hope this helps,



        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
utkarshsinghal

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink
Hi Greg,

Many thanks for your precious time. Here is a workable code:

set.seed(1)
xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
x3=runif(10000,0,10))
xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)

chunksize = 500
fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
moredata=xx[(i+1):(i+chunksize),])
AIC(fit)
[1] 28956.91

And the AIC for other chunksizes:
chunksize    AIC
500          28956.91
1000        27956.91
2000        25956.91
2500        24956.91
5000        19956.91
10000        9956.91

Also I noted that the estimated coefficients are not dependent on
chunksize and AIC is exactly a linear function of chunksize. So I guess
it is some problem with the calculation of AIC, may be in some degree of
freedom or adding some constant somewhere.

And my comments below.


Regards
Utkarsh


Greg Snow wrote:

>
> How many rows does xx have?
>
> Let's look at your example for chunksize 10000, you initially fit the
> first 10000 observations, then the seq results in just the value 10000
> which means that you do the update based on vaues 10001 through 20000,
> if xx only has 10000 rows, then this should give at least one error.  
> If xx has 20000 or more rows, then only chunksize 10000 will ever see
> the 20000^th value, the other chunksizes will use less of the data.
>
Understood your point and apologize that you had to spend time going
into the logic inside for loop. I definitely thought of that but my
actual problem was the variation in AICs (which I was sure about), so to
ignore this loop problem (temporarily), I deliberately chose the
chunksizes such that the number of rows is a multiple of chunksize. I
knew there is still one extra iteration happening and I checked that it
was not causing any problem, the "moredata" in the last iteration will
be all NA's and "update" does nothing in such a case.

For example:
Let's say chunksize=5000, even though "xx" has only 10000 rows, "fit2"
and "fit3" below are exactly same.

fit1 = biglm(y~x1+x2+x3, data=xx[1:5000,])
fit2 = update(fit1, moredata=xx[5001:10000,])
fit3 = update(fit2, moredata=xx[10001:15000,])
AIC(fit1); AIC(fit2); AIC(fit3)
[1] 5018.282
[1] 19956.91
[1] 19956.91

(The AIC matches with the table above and no warnings at all)

I checked all these things before sending my first mail and dropped the
idea of refining the for loop as this will save me a few lines of code
and also the loop looks good and easy to understand. Moreover it is
neither taking any extra run time nor producing any warnings or errors.

>  
>
> Also looking at the help for update.biglm, the 2^nd argument is
> "moredata" not "data", so if the code below is the code that you
> actually ran, then the new data chunks are going into the "..."
> argument (and being ignored as that is there for future expansion and
> does nothing yet) and the "moredata" argument is left empty, which
> should also be giving an error.  For the code below, the model is only
> being fit to the initial chunk and never updated, so with different
> chunk sizes, there is different amounts of data per model.  You can
> check this by doing summary(fit) and looking at the sample size in the
> 2^nd line.
>
My fault in writing the mail. In the actual code, I gave "update(fit,
xx[(i+1):(i+chunksize),])" ,i.e., I just passed the new chunk as the 2nd
argument without mentioning the argument name, which is correct, but
while writing the mail I added the argument name as "data" without
checking what it is.

>  
>
> It is easier for us to help you if you provide code that can be run by
> copying and pasting (we don't have xx, so we can't just run the code
> below, you could include a line to randomly generate an xx, or a link
> to where a copy of xx can be downloaded from).  It also helps if you
> mention any errors or warnings that you receive in the process of
> running your code.
>
>  
>
> Hope this helps,
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> [hidden email]
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:[hidden email]]
> *Sent:* Tuesday, July 07, 2009 12:10 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
> Trust me, it is the same total data I am using, even the chunksizes
> are all equal. I also crosschecked by manually creating the chunks and
> updating as in example given on biglm help page.
> > ?biglm
>
>
> Regards
> Utkarsh
>
>
>
> Greg Snow wrote:
>
> Are you sure that you are fitting all the models on the same total
> data?  A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> [hidden email] <mailto:[hidden email]>
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:[hidden email]]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
>
> The AIC of the biglm models is highly dependent on the size of chunks
> selected (example provided below). This I can somehow expect because
> the model error will increase with the number of chunks.
>
> It will be helpful if you can provide your opinion for comparing
> different models in such cases:
>
>     * can I compare two models fitted with different chunksizes, or
>       should I always use the same chunk size.
>
>     * although I am not going to use AIC at all in my model selection,
>       but I think any other model parameters will also vary in the
>       same way. Am I right?
>     * what would be the ideal chunksize? should it be the maximum
>       possible size R and my system's RAM is able to handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data which has 10000 observations and 3
> independent variables
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
> chunksize    AIC
> 500          30647.79
> 1000        29647.79
> 2000        27647.79
> 2500        26647.79
> 5000        21647.79
> 10000      11647.79
>
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
>
> Hi Sir,
>
> Thanks for making package available to us. I am facing few problems if
> you can give some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below)
> but I didn't understand why AIC is still different.
>
>
>
> AIC(m1)
>
> [1] 532965
>
>
>
> AIC(m1big_longer)
>
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance.  Only differences in AIC between models
> are meaningful, not individual values.
>
>
>
>
> Problem-2:
> chunksize argument is there in bigglm but not in biglm, consequently,
> udate.biglm is there, but not update.bigglm
> Is my observation correct? If yes, why is this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over.  You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
>       -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> [hidden email] <mailto:[hidden email]>    
> University of Washington, Seattle
>
>
>
>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data.  This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function usually only does a few passes
> through the data, usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,
>
>  
>
>  
>


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Greg Snow-2

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink
OK, it appears that the problem is the df.resid component of the biglm object.  Everything else is being updated by the update function except the df.resid piece, so it is based solely on the initial fit and the chunksize used there.  The df.resid piece is then used in the computation of the AIC and hence the differences that you see.  There could also be a difference in the p-values and confidence intervals, but at those high of numbers, the differences are smaller than can be seen at the level of rounding done.

This appears to be a bug/overlooked piece to me, Thomas is cc'd on this so he should be able to fix this.

A work around in the meantime is to do something like:

> fit$df.resid <- 10000-4

Then compute the AIC.

Also as an aside, if you change your seq to: seq(chunksize, 10000-chunksize, chunksize) then you won't get the error messages.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111

From: utkarshsinghal [mailto:[hidden email]]
Sent: Wednesday, July 08, 2009 2:24 AM
To: Greg Snow
Cc: Thomas Lumley; r help
Subject: Re: [R] bigglm() results different from glm()+Another question

Hi Greg,

Many thanks for your precious time. Here is a workable code:

set.seed(1)
xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10), x3=runif(10000,0,10))
xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)

chunksize = 500
fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
for(i in seq(chunksize,10000,chunksize)) fit=update(fit, moredata=xx[(i+1):(i+chunksize),])
AIC(fit)
[1] 28956.91

And the AIC for other chunksizes:
chunksize    AIC
500          28956.91
1000        27956.91
2000        25956.91
2500        24956.91
5000        19956.91
10000        9956.91

Also I noted that the estimated coefficients are not dependent on chunksize and AIC is exactly a linear function of chunksize. So I guess it is some problem with the calculation of AIC, may be in some degree of freedom or adding some constant somewhere.

And my comments below.


Regards
Utkarsh


Greg Snow wrote:
How many rows does xx have?
Let's look at your example for chunksize 10000, you initially fit the first 10000 observations, then the seq results in just the value 10000 which means that you do the update based on vaues 10001 through 20000, if xx only has 10000 rows, then this should give at least one error.  If xx has 20000 or more rows, then only chunksize 10000 will ever see the 20000th value, the other chunksizes will use less of the data.
Understood your point and apologize that you had to spend time going into the logic inside for loop. I definitely thought of that but my actual problem was the variation in AICs (which I was sure about), so to ignore this loop problem (temporarily), I deliberately chose the chunksizes such that the number of rows is a multiple of chunksize. I knew there is still one extra iteration happening and I checked that it was not causing any problem, the "moredata" in the last iteration will be all NA's and "update" does nothing in such a case.

For example:
Let's say chunksize=5000, even though "xx" has only 10000 rows, "fit2" and "fit3" below are exactly same.

fit1 = biglm(y~x1+x2+x3, data=xx[1:5000,])
fit2 = update(fit1, moredata=xx[5001:10000,])
fit3 = update(fit2, moredata=xx[10001:15000,])
AIC(fit1); AIC(fit2); AIC(fit3)
[1] 5018.282
[1] 19956.91
[1] 19956.91

(The AIC matches with the table above and no warnings at all)

I checked all these things before sending my first mail and dropped the idea of refining the for loop as this will save me a few lines of code and also the loop looks good and easy to understand. Moreover it is neither taking any extra run time nor producing any warnings or errors.



Also looking at the help for update.biglm, the 2nd argument is "moredata" not "data", so if the code below is the code that you actually ran, then the new data chunks are going into the "..." argument (and being ignored as that is there for future expansion and does nothing yet) and the "moredata" argument is left empty, which should also be giving an error.  For the code below, the model is only being fit to the initial chunk and never updated, so with different chunk sizes, there is different amounts of data per model.  You can check this by doing summary(fit) and looking at the sample size in the 2nd line.
My fault in writing the mail. In the actual code, I gave "update(fit, xx[(i+1):(i+chunksize),])" ,i.e., I just passed the new chunk as the 2nd argument without mentioning the argument name, which is correct, but while writing the mail I added the argument name as "data" without checking what it is.



It is easier for us to help you if you provide code that can be run by copying and pasting (we don't have xx, so we can't just run the code below, you could include a line to randomly generate an xx, or a link to where a copy of xx can be downloaded from).  It also helps if you mention any errors or warnings that you receive in the process of running your code.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]<mailto:[hidden email]>
801.408.8111

From: utkarshsinghal [mailto:[hidden email]]
Sent: Tuesday, July 07, 2009 12:10 AM
To: Greg Snow
Cc: Thomas Lumley; r help
Subject: Re: [R] bigglm() results different from glm()+Another question

Trust me, it is the same total data I am using, even the chunksizes are all equal. I also crosschecked by manually creating the chunks and updating as in example given on biglm help page.
> ?biglm


Regards
Utkarsh



Greg Snow wrote:
Are you sure that you are fitting all the models on the same total data?  A first glance looks like you may be including more data in some of the chunk sizes, or be producing an error that update does not know how to deal with.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]<mailto:[hidden email]>
801.408.8111

From: utkarshsinghal [mailto:[hidden email]]
Sent: Monday, July 06, 2009 8:58 AM
To: Thomas Lumley; Greg Snow
Cc: r help
Subject: Re: [R] bigglm() results different from glm()+Another question


The AIC of the biglm models is highly dependent on the size of chunks selected (example provided below). This I can somehow expect because the model error will increase with the number of chunks.

It will be helpful if you can provide your opinion for comparing different models in such cases:

 *   can I compare two models fitted with different chunksizes, or should I always use the same chunk size.

 *   although I am not going to use AIC at all in my model selection, but I think any other model parameters will also vary in the same way. Am I right?
 *   what would be the ideal chunksize? should it be the maximum possible size R and my system's RAM is able to handle?
Any comments will be helpful.


Example of AIC variation with chunksize:

I ran the following code on my data which has 10000 observations and 3 independent variables

> chunksize = 500
> fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> AIC(fit)
[1] 30647.79

Here are the AIC for other chunksizes:
chunksize    AIC
500          30647.79
1000        29647.79
2000        27647.79
2500        26647.79
5000        21647.79
10000      11647.79


Regards
Utkarsh




utkarshsinghal wrote:
Thank you Mr. Lumley and Mr. Greg. That was helpful.

Regards
Utkarsh



Thomas Lumley wrote:



On Fri, 3 Jul 2009, utkarshsinghal wrote:





Hi Sir,

Thanks for making package available to us. I am facing few problems if you can give some hints:

Problem-1:
The model summary and residual deviance matched (in the mail below) but I didn't understand why AIC is still different.




AIC(m1)
[1] 532965




AIC(m1big_longer)
[1] 101442.9

That's because AIC.default uses the unnormalized loglikelihood and AIC.biglm uses the deviance.  Only differences in AIC between models are meaningful, not individual values.





Problem-2:
chunksize argument is there in bigglm but not in biglm, consequently, udate.biglm is there, but not update.bigglm
Is my observation correct? If yes, why is this difference?

Because update.bigglm is impossible.

Fitting a glm requires iteration, which means that it requires multiple passes through the data. Fitting a linear model requires only a single pass. update.biglm can take a fitted or partially fitted biglm and add more data. To do the same thing for a bigglm you would need to start over again from the beginning of the data set.

To fit a glm, you need to specify a data source that bigglm() can iterate over.  You do this with a function that can be called repeatedly to return the next chunk of data.

      -thomas

Thomas Lumley            Assoc. Professor, Biostatistics
[hidden email]<mailto:[hidden email]>    University of Washington, Seattle






I don't know why the AIC is different, but remember that there are multiple definitions for AIC (generally differing in the constant added) and it may just be a difference in the constant, or it could be that you have not fit the whole dataset (based on your other question).

For an lm model biglm only needs to make a single pass through the data.  This was the first function written for the package and the update mechanism was an easy way to write the function (and still works well).

The bigglm function came later and the models other than Gaussian require multiple passes through the data so instead of the update mechanism that biglm uses, bigglm requires the data argument to be a function that returns the next chunk of data and can restart to the beginning of the dataset.

Also note that the bigglm function usually only does a few passes through the data, usually this is good enough, but in some cases you may need to increase the number of passes.

Hope this helps,




        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
utkarshsinghal

Re: bigglm() results different from glm()+Another question

Reply Threaded More More options
Print post
Permalink
That was definitely helpful.

"if you change your seq to: seq(chunksize, 10000-chunksize, chunksize)
then you won't get the error messages"

Even previously, I was not getting any error messages or warnings at
all, i.e., "update" removes the extra NA rows silently, for example:

set.seed(1)
xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
x3=runif(10000,0,10))
xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
fit1 = biglm(y~x1+x2+x3, data=xx[1:7000,])
fit2 = update(fit1, moredata=xx[7001:10000,])
fit3 = update(fit2, moredata=xx[7001:14000,])

Here fit2 and fit3 are exactly same.
Also if the number of rows is not a multiple of chunksize then your
sequence will not accommodate the last small chunk of data .

Regards
Utkarsh


Greg Snow wrote:

>
> OK, it appears that the problem is the df.resid component of the biglm
> object.  Everything else is being updated by the update function
> except the df.resid piece, so it is based solely on the initial fit
> and the chunksize used there.  The df.resid piece is then used in the
> computation of the AIC and hence the differences that you see.  There
> could also be a difference in the p-values and confidence intervals,
> but at those high of numbers, the differences are smaller than can be
> seen at the level of rounding done.
>
>  
>
> This appears to be a bug/overlooked piece to me, Thomas is cc'd on
> this so he should be able to fix this.
>
>  
>
> A work around in the meantime is to do something like:
>
> > fit$df.resid <- 10000-4
>
>  
>
> Then compute the AIC.
>
>  
>
> Also as an aside, if you change your seq to: seq(chunksize,
> 10000-chunksize, chunksize) then you won't get the error messages.
>
>  
>
> Hope this helps,
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> [hidden email]
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:[hidden email]]
> *Sent:* Wednesday, July 08, 2009 2:24 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
> Hi Greg,
>
> Many thanks for your precious time. Here is a workable code:
>
> set.seed(1)
> xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
> x3=runif(10000,0,10))
> xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
>
> chunksize = 500
> fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> moredata=xx[(i+1):(i+chunksize),])
> AIC(fit)
> [1] 28956.91
>
> And the AIC for other chunksizes:
> chunksize    AIC
> 500          28956.91
> 1000        27956.91
> 2000        25956.91
> 2500        24956.91
> 5000        19956.91
> 10000        9956.91
>
> Also I noted that the estimated coefficients are not dependent on
> chunksize and AIC is exactly a linear function of chunksize. So I
> guess it is some problem with the calculation of AIC, may be in some
> degree of freedom or adding some constant somewhere.
>
> And my comments below.
>
>
> Regards
> Utkarsh
>
>
> Greg Snow wrote:
>
> How many rows does xx have?
>
>     Let's look at your example for chunksize 10000, you initially fit
>     the first 10000 observations, then the seq results in just the
>     value 10000 which means that you do the update based on vaues
>     10001 through 20000, if xx only has 10000 rows, then this should
>     give at least one error.  If xx has 20000 or more rows, then only
>     chunksize 10000 will ever see the 20000^th value, the other
>     chunksizes will use less of the data.
>
> Understood your point and apologize that you had to spend time going
> into the logic inside for loop. I definitely thought of that but my
> actual problem was the variation in AICs (which I was sure about), so
> to ignore this loop problem (temporarily), I deliberately chose the
> chunksizes such that the number of rows is a multiple of chunksize. I
> knew there is still one extra iteration happening and I checked that
> it was not causing any problem, the "moredata" in the last iteration
> will be all NA's and "update" does nothing in such a case.
>
> For example:
> Let's say chunksize=5000, even though "xx" has only 10000 rows, "fit2"
> and "fit3" below are exactly same.
>
> fit1 = biglm(y~x1+x2+x3, data=xx[1:5000,])
> fit2 = update(fit1, moredata=xx[5001:10000,])
> fit3 = update(fit2, moredata=xx[10001:15000,])
> AIC(fit1); AIC(fit2); AIC(fit3)
> [1] 5018.282
> [1] 19956.91
> [1] 19956.91
>
> (The AIC matches with the table above and no warnings at all)
>
> I checked all these things before sending my first mail and dropped
> the idea of refining the for loop as this will save me a few lines of
> code and also the loop looks good and easy to understand. Moreover it
> is neither taking any extra run time nor producing any warnings or errors.
>
>
>  
>
> Also looking at the help for update.biglm, the 2^nd argument is
> "moredata" not "data", so if the code below is the code that you
> actually ran, then the new data chunks are going into the "..."
> argument (and being ignored as that is there for future expansion and
> does nothing yet) and the "moredata" argument is left empty, which
> should also be giving an error.  For the code below, the model is only
> being fit to the initial chunk and never updated, so with different
> chunk sizes, there is different amounts of data per model.  You can
> check this by doing summary(fit) and looking at the sample size in the
> 2^nd line.
>
> My fault in writing the mail. In the actual code, I gave "update(fit,
> xx[(i+1):(i+chunksize),])" ,i.e., I just passed the new chunk as the
> 2nd argument without mentioning the argument name, which is correct,
> but while writing the mail I added the argument name as "data" without
> checking what it is.
>
>
>  
>
> It is easier for us to help you if you provide code that can be run by
> copying and pasting (we don't have xx, so we can't just run the code
> below, you could include a line to randomly generate an xx, or a link
> to where a copy of xx can be downloaded from).  It also helps if you
> mention any errors or warnings that you receive in the process of
> running your code.
>
>  
>
> Hope this helps,
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> [hidden email] <mailto:[hidden email]>
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:[hidden email]]
> *Sent:* Tuesday, July 07, 2009 12:10 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
> Trust me, it is the same total data I am using, even the chunksizes
> are all equal. I also crosschecked by manually creating the chunks and
> updating as in example given on biglm help page.
> > ?biglm
>
>
> Regards
> Utkarsh
>
>
>
> Greg Snow wrote:
>
> Are you sure that you are fitting all the models on the same total
> data?  A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> [hidden email] <mailto:[hidden email]>
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:[hidden email]]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
>
> The AIC of the biglm models is highly dependent on the size of chunks
> selected (example provided below). This I can somehow expect because
> the model error will increase with the number of chunks.
>
> It will be helpful if you can provide your opinion for comparing
> different models in such cases:
>
>     * can I compare two models fitted with different chunksizes, or
>       should I always use the same chunk size.
>
>     * although I am not going to use AIC at all in my model selection,
>       but I think any other model parameters will also vary in the
>       same way. Am I right?
>     * what would be the ideal chunksize? should it be the maximum
>       possible size R and my system's RAM is able to handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data which has 10000 observations and 3
> independent variables
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
> chunksize    AIC
> 500          30647.79
> 1000        29647.79
> 2000        27647.79
> 2500        26647.79
> 5000        21647.79
> 10000      11647.79
>
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
>
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
>
>
> Hi Sir,
>
> Thanks for making package available to us. I am facing few problems if
> you can give some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below)
> but I didn't understand why AIC is still different.
>
>
>
>
> AIC(m1)
>
> [1] 532965
>
>
>
>
> AIC(m1big_longer)
>
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance.  Only differences in AIC between models
> are meaningful, not individual values.
>
>
>
>
>
> Problem-2:
> chunksize argument is there in bigglm but not in biglm, consequently,
> udate.biglm is there, but not update.bigglm
> Is my observation correct? If yes, why is this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over.  You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
>       -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> [hidden email] <mailto:[hidden email]>    
> University of Washington, Seattle
>
>
>
>
>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data.  This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function usually only does a few passes
> through the data, usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,
>
>  
>
>  
>
>  
>


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.