collapsing unicode white space

Scott Wilson-11

collapsing unicode white space

Hi everyone,

I need to implement a W3C processing algorithm which states:

10.1.8 Rule for Getting Text Content with Normalized White Space
The rule for getting text content with normalized white space is given  
in the following algorithm. The algorithm always returns a string,  
which MAY be empty.

        • Let input be the Element to be processed.
        • Let result be the result of applying the rule for getting text  
content to input.
        • In result, convert any sequence of one or more Unicode white space  
characters into a single U+0020 SPACE.
        • Return result.

The step I'm having problems with is "convert any sequence of one or  
more Unicode white space characters into a single U+0020 SPACE."

The StringUtils replace() and CharSetUtils squeeze() methods would
seem best suited to solving this one but, for one thing, there doesn't
seem to be a set syntax defined for easily specifying Unicode white
space chars.

Has anyone else solved a similar problem using Commons Lang, or should
I consider using something else?

Thanks!

S


/-/-/-/-/-/
Scott Wilson
Apache Wookie: http://incubator.apache.org/projects/wookie.html



Sujit Pal

Re: [lang] collapsing unicode white space

Hi Scott,

I just use something like this:

s = s.replaceAll("\\s+", " ");

or, since you are doing Unicode:

String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
System.out.println("before=" + s);
s = s.replaceAll("\u0200+", "\u0200");
System.out.println("after=" + s);

Gives me this:
before=ThisȀȀisȀaȀȀtest
after=ThisȀisȀaȀtest

Of course, you lose the null checking that commons-lang gives you. Using
CharSetUtils.squeeze() also gives me identical results...

String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
System.out.println("before=" + s);
s = org.apache.commons.lang.CharSetUtils.squeeze(s, new String[] {"\u0200"});
System.out.println("after=" + s);
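One caveat with the "\\s+" one-liner: as far as I know, Java's \s matches only ASCII white space unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS (Java 7 or later). A minimal sketch of a Unicode-aware variant follows; the class and method names are made up for illustration, and mapping null to "" is an assumption rather than anything the list has agreed on.

```java
import java.util.regex.Pattern;

public class CollapseWhitespaceRegex {
    // With UNICODE_CHARACTER_CLASS, \s follows the Unicode White_Space
    // property, so it also covers U+0085, U+00A0, U+2003 and friends.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: replaces each run of white space with one U+0020.
    static String collapse(String s) {
        return s == null ? "" : UNICODE_WS.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "This\u00A0\u0085is\u2003a \t test";
        System.out.println(collapse(s)); // prints "This is a test"
    }
}
```

Unlike a per-character loop, this leaves leading and trailing white space as a single space instead of stripping it, which seems closer to the wording of the W3C step.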

Also changed your subject line to include [lang] per guidelines on this
list.

-sujit

On Thu, 2009-10-29 at 16:21 +0000, Scott Wilson wrote:

> The step I'm having problems with is "convert any sequence of one or  
> more Unicode white space characters into a single U+0020 SPACE."


Scott Wilson-11

Re: [lang] collapsing unicode white space

Thanks, Sujit.

The main problem I'm having is with normalizing the wide range of
Unicode white space characters (e.g. U+0085, U+00A0...) to U+0020
before squeezing. The only thing I can find is the isWhitespace()
method, which would require iterating over each of the characters in
the string and testing/replacing them individually. I was wondering if
there is a charset pattern that squeeze() could take that would
represent all Unicode white space characters?
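
For reference, here is a small sketch of how the two JDK predicates classify a few of these characters. The class and method names are hypothetical; only Character.isWhitespace() and Character.isSpaceChar() are real JDK calls.

```java
public class WhitespacePredicates {
    // Hypothetical helper: reports how the JDK classifies one character.
    static String describe(char ch) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                (int) ch, Character.isWhitespace(ch), Character.isSpaceChar(ch));
    }

    public static void main(String[] args) {
        // Tab, space, next line (NEL), no-break space, em space
        for (char ch : new char[] {'\t', ' ', '\u0085', '\u00A0', '\u2003'}) {
            System.out.println(describe(ch));
        }
    }
}
```

Going by the Javadoc, isWhitespace() excludes the non-breaking spaces such as U+00A0, isSpaceChar() excludes control characters such as tab, and neither one matches U+0085, so a loop built on these two predicates would probably need an explicit check for that character.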

S

On 29 Oct 2009, at 18:26, Sujit Pal wrote:

> Hi Scott,
>
> I just use something like this:
>
> s = s.replaceAll("\\s+", " ");

Scott Wilson-11

Re: [lang] collapsing unicode white space

Well, after a bit of research I finally found a solution to this problem. StringUtils and CharSetUtils play a role, but there was still a bit of a gap to fill.

Here is the code:

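import org.apache.commons.lang.CharSetUtils;
import org.apache.commons.lang.StringUtils;

/**
 * Replaces each Unicode space character with U+0020, collapses runs of
 * spaces into one, and strips leading and trailing white space. When
 * includeWhitespace is true, characters matched by
 * Character.isWhitespace() (tab, line feed, ...) are replaced as well.
 */
private static String normalize(String in, boolean includeWhitespace) {
    if (in == null) return "";
    StringBuilder out = new StringBuilder(in.length());
    for (int x = 0; x < in.length(); x++) {
        char ch = in.charAt(x);
        if (Character.isSpaceChar(ch) || (Character.isWhitespace(ch) && includeWhitespace)) {
            out.append(' ');
        } else {
            out.append(ch);
        }
    }
    // Collapse runs of U+0020 into a single space, then trim both ends.
    String result = CharSetUtils.squeeze(out.toString(), " ");
    return StringUtils.strip(result);
}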
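
For anyone who would rather avoid the extra dependency, the same idea can be sketched in a single pass without commons-lang. This is only an illustration: the class and method names are hypothetical, and it assumes the same character test as normalize(in, true) above, so it inherits that method's behaviour (for example, U+0085 is left alone).

```java
public class NormalizeSketch {
    // Hypothetical helper: the same test normalize(in, true) applies.
    static boolean isSpace(char ch) {
        return Character.isSpaceChar(ch) || Character.isWhitespace(ch);
    }

    // Collapses runs of space characters into one U+0020 and drops
    // leading and trailing runs, all in one pass over the input.
    static String collapse(String in) {
        if (in == null) return "";
        StringBuilder out = new StringBuilder(in.length());
        boolean pendingSpace = false;
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (isSpace(ch)) {
                // Remember the run, but only once some text has been written.
                pendingSpace = out.length() > 0;
            } else {
                // Emit the single space only when more text follows the run.
                if (pendingSpace) out.append(' ');
                pendingSpace = false;
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "[Hello world]"
        System.out.println("[" + collapse("\u00A0 Hello\t\u2003 world \n") + "]");
    }
}
```

Deferring the space until the next non-space character is what removes the need for a separate squeeze and strip step.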

Interestingly enough, none of the libraries I tested (e.g. jdom, dom4j) has a "normalize Unicode white space/space chars" method.

I've committed the code into Apache Wookie (incubating) as part of a UnicodeUtils class: https://svn.apache.org/viewvc/incubator/wookie/trunk/src/org/apache/wookie/util/UnicodeUtils.java?revision=832940&view=markup

If there is interest in adding the method(s) to StringUtils I can submit a patch.

S
