R 2.10.0: Error in gsub/calloc

Richard R. Liu

I'm running R 2.10.0 under Mac OS X 10.5.8; however, I don't think this
is a Mac-specific problem.

I have a very large (158,908 possible sentences, ca. 58 MB) plain text
document d which I am
trying to tokenize:  t <- strapply(d, "\\w+", perl = T).  I am
encountering the following error:

Error in base::gsub(pattern, rs, x, ...) :
  Calloc could not allocate (-1398215180 of 1) memory

This happens regardless of whether I run in 32- or 64-bit mode.  The
machine has 8 GB of RAM, so
I can hardly believe that RAM is a problem.

Thanks,
Richard

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Uwe Ligges-3

Re: R 2.10.0: Error in gsub/calloc

What is strapply() and what is d?

Uwe Ligges




Richard R. Liu

Re: R 2.10.0: Error in gsub/calloc

I apologize for not being clear.  d is a character vector of length
158908.  Each element in the vector has been designated by sentDetect
(package: openNLP) as a sentence.  Some of these are really
sentences.  Others are merely groups of meaningless characters
separated by white space.  strapply is a function in the package
gsubfn.  It applies the regular expression (second argument) to each
element of the first argument.  Every match is then passed to the
designated function (third argument; in my case it is missing, hence the
identity function).  Thus, with strapply I am simply performing a
white-space tokenization of each sentence.  I am doing this in the
hope of being able to distinguish true sentences from false ones on
the basis of mean token length, maximum token length, or similar.
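
The token statistics described above could be sketched in base R roughly as follows. This is only an illustration under assumptions: `d` here is a small made-up stand-in for the real vector, a token is taken to be any run of non-whitespace characters, and `regmatches()` is assumed to be available (it was added to base R after 2.10.0).

```r
# Toy stand-in for the real character vector (hypothetical data).
d <- c("This is a real sentence.", "x q zz 1 a b")

# Extract every run of non-whitespace characters from each element.
tokens <- regmatches(d, gregexpr("[^[:space:]]+", d))

# Token lengths per element, then one summary value per element.
token_len <- lapply(tokens, nchar)
mean_len <- vapply(token_len,
                   function(n) if (length(n)) mean(n) else NA_real_,
                   numeric(1))
max_len <- vapply(token_len,
                  function(n) if (length(n)) as.numeric(max(n)) else NA_real_,
                  numeric(1))

data.frame(mean_len, max_len)
```

Elements without any token get `NA` for both statistics, so they can be filtered separately.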

Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  [hidden email]


Kenneth Roy Cabrera Torres

Re: R 2.10.0: Error in gsub/calloc

Try the patched version of R.
This may be the same problem I had with a large
database when using gsub().

HTH

Bert Gunter

Re: R 2.10.0: Error in gsub/calloc

In reply to this post by Richard R. Liu
Try:

tokens <- strsplit(d,"[^[:space:]]+")

This splits each "sentence" in your vector into a vector of groups of
whitespace characters, which I think you can then work with as you
described. (The result is a list of such vectors; see strsplit().)

## example:

> x <- "xx  xdfg; *&^%kk    "

> strsplit(x,"[^[:blank:]]+")
[[1]]
[1] ""     "  "   " "    "    "


You might have to use perl = TRUE and "\\w+", depending on your locale and
what "[:space:]" matches there.

If this works, it should be way faster than strapply() and should not have
any memory allocation issues either.

HTH.

Bert Gunter
Genentech Nonclinical Biostatistics
 
 

Richard R. Liu

Re: R 2.10.0: Error in gsub/calloc

In reply to this post by Kenneth Roy Cabrera Torres
Kenneth,

Thanks for the hint.  I downloaded and installed the latest patch, but
to no avail.  I can reproduce the error on a single sentence, the
longest in the document.  It contains 743,393 characters.  It isn't a
true sentence, but since it is more than three standard deviations
longer than the mean sentence length, I might be able to use the mean
and the standard deviation as a way of weeding out the really evident
"non-sentences" before I take into account the characteristics of the
tokens.
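
The length-based screening mentioned above might look something like the sketch below. The data are invented for illustration (50 short sentences plus one very long pseudo-sentence), and the three-standard-deviation cutoff is simply the rule of thumb from the message, not a tested threshold.

```r
# Hypothetical data: 50 ordinary sentences plus one very long pseudo-sentence.
d <- c(rep("A short ordinary sentence.", 50),
       paste(rep("x", 5000), collapse = " "))

len <- nchar(d)

# Flag elements more than three standard deviations above the mean length.
cutoff  <- mean(len) + 3 * sd(len)
suspect <- len > cutoff

# Keep only the elements that were not flagged.
d_clean <- d[!suspect]
c(flagged = sum(suspect), kept = length(d_clean))
```

Because the mean and standard deviation are themselves inflated by extreme elements, a median-based cutoff may be worth comparing on real data.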

Regards,
Richard

William Dunlap

Re: R 2.10.0: Error in gsub/calloc

Here is a more self-contained way to reproduce the problem in 2.10.0
using the prebuilt Windows executable.  Putting a trace on gsub in
the call to strapply showed that it died in the first call to gsub
when the replacement included "\\1" and the string was about 900000
characters long (and included 150000 "words").  It looks like it
dies if the string is >= 731248 characters.

> d<-substring(paste(collapse=" ", sapply(1:150000,function(i)"abcde")), 1, 731248)
> nchar(d)
[1] 731248
> substring(d, nchar(d)-10)
[1] " abcde abcd"
> p<-gsub("([[:alpha:]]+)", "\\1", d, perl=FALSE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
  Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
  Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = FALSE) :
  Reached total allocation of 1535Mb: see help(memory.size)
> p<-gsub("([[:alpha:]]+)", "\\1", d, perl=TRUE)
Error in gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
  Calloc could not allocate (-2146542248 of 1) memory
In addition: Warning messages:
1: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)
2: In gsub("([[:alpha:]]+)", "\\1", d, perl = TRUE) :
  Reached total allocation of 1535Mb: see help(memory.size)

Make d one character shorter and it succeeds with either
perl=TRUE or perl=FALSE.
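
To find out which elements of a longer vector trigger the error, one option is to run the substitution element by element and catch failures. The helper below is a hypothetical sketch (the name `gsub_fails` and the tiny input are made up here); on an R build without the problem it should simply report that nothing failed.

```r
# Returns TRUE for each element on which the substitution raises an error.
gsub_fails <- function(x, pattern = "([[:alpha:]]+)", replacement = "\\1") {
  vapply(x, function(s) {
    tryCatch({
      gsub(pattern, replacement, s)
      FALSE
    }, error = function(e) TRUE)
  }, logical(1), USE.NAMES = FALSE)
}

# Hypothetical small input; replace with the real vector to locate failures.
d <- c("abcde abcde", "some other words")
failed <- gsub_fails(d)
data.frame(nchar = nchar(d), failed)
```

Sorting the result by `nchar` would show whether failures start at a particular string length.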

> version
               _                            
platform       i386-pc-mingw32              
arch           i386                        
os             mingw32                      
system         i386, mingw32                
status                                      
major          2                            
minor          10.0                        
year           2009                        
month          10                          
day            26                          
svn rev        50208                        
language       R                            
version.string R version 2.10.0 (2009-10-26)
> sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252  
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] tcltk_2.10.0

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

Gabor Grothendieck

Re: R 2.10.0: Error in gsub/calloc

In reply to this post by Richard R. Liu
Note that you don't need perl = T, since by default strapply uses Tcl
regular expressions and they support \w.  What happens if you omit
perl = T?

Also, please specify the version of gsubfn you are using, and if it's not
the latest, try again with the latest version.


Prof Brian Ripley

Re: R 2.10.0: Error in gsub/calloc

In reply to this post by William Dunlap
This seems to be simply integer overflow in a calculation.
Changed in R-patched to use doubles.

The issue I patched for Kenneth Roy Cabrera was for perl = FALSE only.
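
As a rough illustration of why an overflowing size calculation shows up as a negative allocation request, the sketch below simulates signed 32-bit wrap-around using doubles. The helper `wrap32` is made up for this example, and the "intended" sizes are derived under the assumption that the value wrapped exactly once, which the thread does not confirm.

```r
# Simulate how a value stored in a signed 32-bit integer wraps around.
# Doubles are used so that the arithmetic itself cannot overflow in R.
wrap32 <- function(x) {
  x <- x %% 2^32
  if (x >= 2^31) x - 2^32 else x
}

wrap32(2^31 - 1)  # largest representable value, returned unchanged
wrap32(2^31)      # one past the limit wraps to the most negative value

# Sizes that would wrap to the two values reported in this thread,
# ASSUMING exactly one wrap-around (an assumption, not a known fact).
intended_mac <- -1398215180 + 2^32
intended_win <- -2146542248 + 2^32
c(intended_mac, intended_win) > 2^31 - 1
```

Both reconstructed sizes exceed 2^31 - 1, which is consistent with the overflow explanation; computing the size in doubles avoids the wrap.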

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Richard R. Liu

Re: R 2.10.0: Error in gsub/calloc

In reply to this post by Gabor Grothendieck
I am using gsubfn 0.5-0.  When I do not specify perl = TRUE I now get  
the following error on the same document:

Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
   [tcl] bad index "1e+05": must be integer?[+-]integer? or end?[+-]integer?.
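
The index "1e+05" in the message suggests that a numeric position of 100000 was converted to text in scientific notation before being handed to Tcl. That reading is an inference from the error text rather than something confirmed in the thread. The sketch below only shows how base R formats such a number, with `scipen` set explicitly to R's default.

```r
old <- options(scipen = 0)  # R's default penalty for scientific notation

sci   <- format(100000)                      # double: gives "1e+05"
fixed <- format(100000, scientific = FALSE)  # forced fixed notation
int   <- as.character(100000L)               # integer: no exponent
fmt   <- sprintf("%d", 100000L)              # explicit integer formatting

options(old)  # restore the previous setting
c(sci = sci, fixed = fixed, int = int, fmt = fmt)
```

Passing indices as integers, or formatting them with `scientific = FALSE`, avoids the exponent form.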

Regards,
Richard


Richard R. Liu

Re: R 2.10.0: Error in gsub/calloc

In reply to this post by Bert Gunter
Bert,

Thanks for the tip.  Yes, strsplit works, and works fast!  For me,  
white-space tokenization means splitting at the white spaces, so the  
"^" and the outermost square brackets should/can be omitted.

Regards ... from Basel to South San Francisco,
Richard
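
The difference between the two split patterns discussed in this exchange can be seen on the example string from the earlier post. This is a small sketch in base R; the variable names `gaps` and `tokens` are chosen here for illustration.

```r
x <- "xx  xdfg; *&^%kk    "

# Splitting on runs of NON-whitespace keeps the whitespace groups.
gaps <- strsplit(x, "[^[:space:]]+")[[1]]

# Splitting on runs of whitespace keeps the tokens themselves.
tokens <- strsplit(x, "[[:space:]]+")[[1]]

tokens
nchar(tokens)
```

With the second pattern, `nchar(tokens)` gives the token lengths directly, which is what the mean and maximum token length statistics need.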

On Nov 3, 2009, at 22:03 , Bert Gunter wrote:

> Try:
>
> tokens <- strsplit(d,"[^[:space:]]+")
>
> This splits each "sentence" in your vector into a vector of groups of
> whitespace characters that you can then play with as you described,  
> I think
> (The results is a list of such vectors -- see strsplit()).
>
> ## example:
>
>> x <- "xx  xdfg; *&^%kk    "
>
>> strsplit(x,"[^[:blank:]]+")
> [[1]]
> [1] ""     "  "   " "    "    "
>
>
> You might have to use perl = TRUE and "\\w+", depending on your locale and
> what "[:space:]" does there.
>
> If this works, it should be way faster than strapply() and should not have
> any memory allocation issues either.
>
> HTH.
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
>
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Richard R. Liu
> Sent: Tuesday, November 03, 2009 11:32 AM
> To: Uwe Ligges
> Cc: [hidden email]
> Subject: Re: [R] R 2.10.0: Error in gsub/calloc
>
> I apologize for not being clear.  d is a character vector of length
> 158908.  Each element in the vector has been designated by sentDetect
> (package: openNLP) as a sentence.  Some of these are really
> sentences.  Others are merely groups of meaningless characters
> separated by white space.  strapply is a function in the package
> gsubfn.  It applies to each element of the first argument the regular
> expression (second argument).  Every match is then sent to the
> designated function (third argument, in my case missing, hence the
> identity function).  Thus, with strapply I am simply performing a
> white-space tokenization of each sentence.  I am doing this in the
> hope of being able to distinguish true sentences from false ones on
> the basis of mean length of token, maximum length of token, or similar.
>
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
>
> Tel.:  +41 61 331 10 47
> Email:  [hidden email]
>
>

Gabor Grothendieck

Re: R 2.10.0: Error in gsub/calloc

Note that strapply without perl = TRUE runs an order of magnitude
faster than with perl = TRUE and accepts nearly the same set of regular
expressions anyway, since its default engine is Tcl regular expressions.
strsplit should still be fastest where it applies, since splitting is
its only purpose.
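For comparison, here is a rough base-R sketch of extracting matches rather than splitting. It is not what strapply does internally; it swaps in `gregexpr()` plus `regmatches()`, and `regmatches()` only arrived in later R releases (2.14.0), so it would not run on the R 2.10.0 discussed in this thread. The vector `d` is again a small invented stand-in.

```r
# Stand-in data: Bert's example string plus one invented sentence.
d <- c("xx  xdfg; *&^%kk    ", "This is a real sentence.")

# gregexpr() locates every run of word characters in each element;
# regmatches() then pulls the matched substrings out as a list.
words <- regmatches(d, gregexpr("[[:alnum:]_]+", d))
```

Unlike the whitespace split, this drops punctuation, so "xdfg;" becomes "xdfg" and "*&^%kk" becomes "kk".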

On Fri, Nov 6, 2009 at 1:43 AM, Richard R. Liu <[hidden email]> wrote:

> Bert,
>
> Thanks for the tip.  Yes, strsplit works, and works fast!  For me,
> white-space tokenization means splitting at the white spaces, so the "^" and
> the outermost square brackets should/can be omitted.
>
> Regards ... from Basel to South San Francisco,
> Richard

Richard R. Liu

Re: R 2.10.0: Error in gsub/calloc

Gabor,

What about the error message that I got with strapply?  That seemed to be the
same kind of problem (i.e., integer overflow of index) as with gsub.
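To make the overflow hypothesis concrete, here is a small arithmetic sketch. It assumes (this is not confirmed anywhere in the thread) that the requested size was held in a 32-bit signed integer; `wrap_int32` is a hypothetical helper, and `candidate` is just one unsigned value that would wrap to the number in the error message.

```r
# R's own integers become NA on overflow, so emulate C-style 32-bit
# wraparound with doubles, which represent these values exactly.
wrap_int32 <- function(x) ((x + 2^31) %% 2^32) - 2^31

reported  <- -1398215180          # size shown in the error message
candidate <- reported + 2^32      # one unsigned value that wraps to it

candidate                          # 2896752116, i.e. about 2.7 GiB
wrap_int32(candidate) == reported  # TRUE
```

A request of roughly 2.7 GiB exceeds the 2^31 - 1 limit of a signed 32-bit integer, which would be consistent with the error appearing in both 32- and 64-bit mode regardless of installed RAM.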

Regards,
Richard

On Fri, 6 Nov 2009 08:00:06 -0500, Gabor Grothendieck wrote

> Note that strapply without perl = TRUE runs an order of magnitude
> faster than with perl = TRUE and takes nearly the same set of regular
> expressions anyways since its default is tcl regular expressions.
> strsplit should still be fastest where it applies since splitting is
> its only purpose.


--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.:  +41 61 331 10 47
Email:  [hidden email]

Gabor Grothendieck

Re: R 2.10.0: Error in gsub/calloc

I will have a look at it this weekend if you can give me sufficient
info to reproduce it. I noticed there was an attachment on one of your
emails and it seems to be some sort of binary file with no
accompanying description.
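One way to supply something reproducible without sending the real text might be a synthetic stand-in, sketched below. The helper `make_sentences` and all of its parameters are hypothetical, and random letters are of course not guaranteed to trigger the same error as the real data.

```r
set.seed(1)  # make the random data repeatable

# Hypothetical helper: build n pseudo-sentences of random lowercase "words".
make_sentences <- function(n, words_per_sentence = 10) {
  vapply(seq_len(n), function(i) {
    lens <- sample(1:12, words_per_sentence, replace = TRUE)
    paste(vapply(lens, function(k) paste(sample(letters, k, replace = TRUE),
                                         collapse = ""), character(1)),
          collapse = " ")
  }, character(1))
}

d_small <- make_sentences(1000)  # scale n up toward 158908 to approach the real size
length(d_small)
sum(nchar(d_small))              # total characters, a rough proxy for size in bytes
```

Posting the call that fails on such a vector, together with sessionInfo() output, would give enough to attempt a reproduction.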

On Fri, Nov 6, 2009 at 10:01 AM, Richard R. Liu <[hidden email]> wrote:

> Gabor,
>
> What about the error message that I got with strapply?  That seemed to be the
> same kind of problem (i.e., integer overflow of index) as with gsub.
>
> Regards,
> Richard
