Unicode, NFC,NFD and node names

gregoryjoseph

Unicode, NFC,NFD and node names

Hi list,

Given the following code:

import java.text.Normalizer;
...

         final Session session = ...

         final Repository rep = session.getRepository();
         System.out.println(rep.getDescriptor("jcr.repository.name") + " " + rep.getDescriptor("jcr.repository.version"));

         final Node root = session.getRootNode();
         final String name = "föö";
         System.out.println("Normalizer.isNormalized(name, Normalizer.Form.NFC) = " + Normalizer.isNormalized(name, Normalizer.Form.NFC)); // true
         System.out.println("Normalizer.isNormalized(name, Normalizer.Form.NFD) = " + Normalizer.isNormalized(name, Normalizer.Form.NFD)); // false
         root.addNode(name);
         session.save();

         final Node node1 = root.getNode(name);
         System.out.println("node1 = " + node1);
         final Node node2 = root.getNode(Normalizer.normalize(name, Normalizer.Form.NFC));
         System.out.println("node2 = " + node2);
         final Node node3 = root.getNode(Normalizer.normalize(name, Normalizer.Form.NFD)); // fails
         System.out.println("node3 = " + node3);

There's a good chance fetching node3 won't work. It may depend on the
underlying OS and database, but with OS X and Derby it fails. That's
not really surprising, given that
Normalizer.normalize(name, Normalizer.Form.NFC).equals(Normalizer.normalize(name, Normalizer.Form.NFD))
is NOT true.
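
For reference, the difference between the two forms shows up at the code point level. The following is a minimal sketch that only assumes java.text.Normalizer from the JDK (and Java 8 streams); the class name NormalizationForms and the codePoints helper are made up for illustration and are not part of the code above.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class NormalizationForms {

    // Renders each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "föö" written with escapes, so the source file encoding cannot change the form.
        String nfc = Normalizer.normalize("f\u00F6\u00F6", Normalizer.Form.NFC);
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println("NFC: " + codePoints(nfc)); // U+0066 U+00F6 U+00F6
        System.out.println("NFD: " + codePoints(nfd)); // U+0066 U+006F U+0308 U+006F U+0308
        System.out.println("equals: " + nfc.equals(nfd)); // false
    }
}
```

In NFC each "ö" is the single code point U+00F6; in NFD it is "o" followed by the combining diaeresis U+0308, so the two strings differ in both length and content.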

Now, taking into account the fact that all sorts of clients use
different normalization forms (Firefox seems to encode URL parameters
with NFD, Safari with NFC; Linux uses NFC, the OS X Finder seems to
favor NFD), wouldn't it be a safe bet to normalize all input at the
repository level? Or do you consider this something client
applications should do?

ref: http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

Thanks for any tip, pointer, idea, feedback or reaction!

Cheers,

-greg


gregoryjoseph

Re: Unicode, NFC,NFD and node names

fwiw, the following solves the simple problem shown by my previous  
example:

     private Session wrap(final SessionImpl origSession) throws RepositoryException {
         final WorkspaceImpl workspace = (WorkspaceImpl) origSession.getWorkspace();
         final RepositoryImpl rep = (RepositoryImpl) origSession.getRepository();
         return new SessionImpl(rep, origSession.getSubject(), workspace.getConfig()) {
             public Path getQPath(String path) throws MalformedPathException, IllegalNameException, NamespaceException {
                 // this is the only relevant part:
                 return super.getQPath(Normalizer.normalize(path, Normalizer.Form.NFC));
             }
         };
     }

If there were a way to swap the session implementation, or the
NameResolver and/or PathResolver implementations that are used by
default, I might give this a spin.
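
Short of swapping Jackrabbit internals, the same idea can be applied on the client side, before a name ever reaches the JCR API. The following is only a sketch under that assumption: the class NodeNames and its toNfc method are hypothetical helpers, not part of JCR or Jackrabbit, and the choice of NFC as the canonical form is arbitrary.

```java
import java.text.Normalizer;

public final class NodeNames {

    private NodeNames() {
    }

    // Returns the NFC form of a node name or path; input that is already NFC is returned unchanged.
    public static String toNfc(String nameOrPath) {
        if (Normalizer.isNormalized(nameOrPath, Normalizer.Form.NFC)) {
            return nameOrPath;
        }
        return Normalizer.normalize(nameOrPath, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "/content/fo\u0308o\u0308"; // NFD: "o" + combining diaeresis
        String composed = "/content/f\u00F6\u00F6";     // NFC: precomposed "ö"
        System.out.println(toNfc(decomposed).equals(composed)); // true
    }
}
```

Calls such as root.addNode(NodeNames.toNfc(name)) and root.getNode(NodeNames.toNfc(name)) would then agree on one form, at the cost of every caller having to remember to go through the helper.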

Any opinions about the whole problem?

Cheers,

-g



Tobias Bocanegra-3

Re: Unicode, NFC,NFD and node names

hi,
i don't think it should be the job of the repository to do
normalization of the paths. likewise, a good filesystem (a case
sensitive one :-) does no normalization of its paths either.

regards, toby

gregoryjoseph

Re: Unicode, NFC,NFD and node names

Hi Toby,

On Nov 5, 2009, at 12:26 AM, Tobias Bocanegra wrote:

> hi,
> i don't think it should be the job of the repository to do
> normalization of the paths. likewise, a good filesystem (a case
> sensitive one :-) does no normalization of its paths either.

Since I wrote this yesterday in quite a rush, let me just stress the
fact that I'm only talking about Unicode normalization forms; a
filesystem won't have to bother about that, since it doesn't have a
whole slew of clients who decide to use one form or the other for no
apparent reason. For "fun", you might want to see this: http://www.mail-archive.com/bug-bash@.../msg05818.html

I can see why one would want to differentiate between the two forms
in *values*; in item names, not so much.

Thoughts?

-g



Tobias Bocanegra-3

Re: Unicode, NFC,NFD and node names

2009/11/5 Grégory Joseph <[hidden email]>:

> Hi Toby,
>
> On Nov 5, 2009, at 12:26 AM, Tobias Bocanegra wrote:
>
>> hi,
>> i don't think it should be the job of the repository to do
>> normalization of the paths. likewise, a good filesystem (a case
>> sensitive one :-) does no normalization of its paths either.
>
> Since I wrote this yesterday in quite a rush, let me just stress the fact
> that I'm only talking about unicode normalization forms; a filesystem won't
> have to bother about that, since it doesn't have a whole slew of clients who
> decide to use one form or the other for no apparent reason. For "fun", you
> might want to see this:
> http://www.mail-archive.com/bug-bash@.../msg05818.html
>
> I can see why one would want to make a differentiation between the 2 forms
> in *values*; in item names, not so much.
well, i see a repository somewhere in between filesystems and databases.

however, i think the path to an item needs to be solid - the search
can still provide you with all stemming and normalization you need.
regards, toby

gregoryjoseph

Re: Unicode, NFC,NFD and node names


On Nov 5, 2009, at 3:39 PM, Tobias Bocanegra wrote:

> 2009/11/5 Grégory Joseph <[hidden email]>:
>> Hi Toby,
>>
>> On Nov 5, 2009, at 12:26 AM, Tobias Bocanegra wrote:
>>
>>> hi,
>>> i don't think it should be the job of the repository to do
>>> normalization of the paths. likewise, a good filesystem (a case
>>> sensitive one :-) does no normalization of its paths either.
>>
>> Since I wrote this yesterday in quite a rush, let me just stress the fact
>> that I'm only talking about unicode normalization forms; a filesystem won't
>> have to bother about that, since it doesn't have a whole slew of clients who
>> decide to use one form or the other for no apparent reason. For "fun", you
>> might want to see this:
>> http://www.mail-archive.com/bug-bash@.../msg05818.html
>>
>> I can see why one would want to make a differentiation between the 2 forms
>> in *values*; in item names, not so much.
> well, i see a repository somewhere in between filesystems and databases.
>
> however, i think the path to an item needs to be solid - the search
> can still provide you with all stemming and normalization you need.

I can see why one wouldn't want this as the default behaviour; is
there any chance the current PathResolver implementation could become
configurable or swappable?





Alexander Klimetschek

Re: Unicode, NFC,NFD and node names

2009/11/6 Grégory Joseph <[hidden email]>:
> I can see why one wouldn't want this as the default behaviour; is there any
> chance the current PathResolver implementation could become configurable or
> swappable?

I think nobody sees a real issue with that (yet). Your original
example code that fails under certain combinations (OSX and Derby) is
not a good case: it can be expected to fail that way, because the
original name "föö" is changed within the Java application itself. I
expect that any string in a Java application follows the same UTF-8
encoding & normalization. If you find a combination (e.g. including a
browser or other client, using WebDAV, etc.) where it fails, this
would be helpful.

Also note that most (all?) people use the URL space as node names, to
map it back and forth and unify the naming, just as in a plain Unix
filesystem. This gives plain ASCII and leaves out any umlauts.

Regards,
Alex

--
Alexander Klimetschek
[hidden email]
Alexander Klimetschek

Re: Unicode, NFC,NFD and node names

In reply to this post by gregoryjoseph
2009/11/6 Grégory Joseph <[hidden email]>:
> I can see why one wouldn't want this as the default behaviour; is there any
> chance the current PathResolver implementation could become configurable or
> swappable?

Sorry forgot to answer your question: no, it's not easily swappable by
configuration.

Regards,
Alex

--
Alexander Klimetschek
[hidden email]
gregoryjoseph

Re: Unicode, NFC,NFD and node names

In reply to this post by Alexander Klimetschek
Hi Alex,

On Nov 6, 2009, at 4:46 PM, Alexander Klimetschek wrote:

> 2009/11/6 Grégory Joseph <[hidden email]>:
>> I can see why one wouldn't want this as the default behaviour; is there any
>> chance the current PathResolver implementation could become configurable or
>> swappable?
>
> I think nobody sees a real issue with that (yet). Your original
> example code that fails under certain combinations (OSX and Derby) is
> not a good case, as it can be expected to fail that way, as the
> original name "föö" provided is changed within the java application
> itself. I expect that any string in a Java application follows the
> same utf-8 encoding & normalization. If you find a combination (eg.
> including a browser or other client, using webdav, etc.) where it
> fails, this would be helpful.

Mount a WebDAV folder in the OS X Finder and create a node with
umlauts: it will be created in NFD form.
(Use java.text.Normalizer.isNormalized() or String.getBytes() to check.)
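
A minimal sketch of that check, assuming nothing beyond the JDK: it prints the UTF-8 bytes of a single "ö" in each form next to the isNormalized() results. The class name FormInspector and the hex helper are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class FormInspector {

    // Returns the UTF-8 bytes of s as space-separated upper-case hex pairs.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00F6";    // "ö" as one code point (NFC)
        String decomposed = "o\u0308"; // "o" + combining diaeresis (NFD)

        System.out.println(hex(composed));   // C3 B6
        System.out.println(hex(decomposed)); // 6F CC 88

        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
    }
}
```

Passing an explicit charset to getBytes() matters here: the no-argument variant uses the platform default encoding, which would make the byte dump depend on the machine it runs on.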

Map the same folder using Linux or Windows, and I'm pretty sure the
files will be created using the NFC form.
TBH, I still have to try that; I stumbled upon the issue earlier
because of something rather silly: at some point, a path is passed to
a servlet, and this path was not encoded on the client side (i.e. the
HTML used to trigger this call was wrong); somehow, it seems Firefox
respected the original form (NFD) while Safari apparently tampered
with it and converted it to NFC first.

Granted, this isn't really convincing. Now that this piece is patched
and the URLs are encoded, clients seem to behave much better, in that
they don't tamper with the normal form anymore. Still, I have no
control over the form in which a node is created. This could mean (to
be verified) that in the case of a node type that does not allow
same-name siblings, one could actually create two nodes with an
"apparent" same name.
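
To make the "apparent" same name concrete, here is a small sketch in which a plain HashSet stands in for whatever exact-match name comparison a repository performs. That stand-in is an assumption made for illustration and says nothing definite about how Jackrabbit checks same-name siblings; the class ApparentDuplicates and the countDistinct method are hypothetical.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ApparentDuplicates {

    // Counts distinct names, optionally folding each one to NFC first.
    static int countDistinct(List<String> names, boolean normalizeFirst) {
        Set<String> seen = new HashSet<>();
        for (String name : names) {
            seen.add(normalizeFirst ? Normalizer.normalize(name, Normalizer.Form.NFC) : name);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Both entries render as "föö": the first is NFC, the second NFD.
        List<String> siblings = Arrays.asList("f\u00F6\u00F6", "fo\u0308o\u0308");

        System.out.println(countDistinct(siblings, false)); // 2
        System.out.println(countDistinct(siblings, true));  // 1
    }
}
```

Without normalization the two names count as different entries, even though a user would read them as the same name.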

> Also note that most (all?) people use the URL space as node names, to
> map it back and forth and unify the naming, just as in a plain unix
> filesystem. This gives plain ASCII and leaves out any umlautes.

Sure; same remark as above though, without enforcing the  
normalization, you could end up with what could appear as  
"duplicates" (even though they're really not)

> 2009/11/6 Grégory Joseph <[hidden email]>:
>> I can see why one wouldn't want this as the default behaviour; is there any
>> chance the current PathResolver implementation could become configurable or
>> swappable?
>
> Sorry forgot to answer your question: no, it's not easily swappable by
> configuration.

Encoding URLs properly is probably going to solve most of my problems.
I've been looking at patching this, but it does indeed seem pretty
contrived, requiring quite some code on our side just to change the
type of PathResolver to use (starting from
org.apache.jackrabbit.core.jndi.RegistryHelper and going all the way
down to javax.jcr.Repository#login). Could this maybe be something
that would find its place in the WorkspaceConfig?

Cheers,

-g


Alexander Klimetschek

Re: Unicode, NFC,NFD and node names

2009/11/6 Grégory Joseph <[hidden email]>:
> Map a webdav folder to OSX's finder, create a node with umlauts, it will be
> created with the NFD form.
> (java.text.Normalizer.isNormalized() to see that, or String.getBytes())
>
> Map the same folder using Linux or Windows, I'm pretty sure the files will
> be created using the NFC form.
> TBH, I still have to try that;

An explicit failure case would be good, as I think nobody has seen
this issue (yet) with Jackrabbit.

The only occurrence of this different normalization issue was with
certain filenames (containing "special" characters) in SVN that was
used both on Windows and Mac. But that was using the standard C-based
SVN client. I think with Java the UTF-8 support is better.

> Still, I have no control over the form in which a node is created.
> This could mean (to be verified) that in the case of a node type that
> does not allow same-name siblings, one could actually create two nodes
> with an "apparent" same name.

I think (feel free to correct me here) that under Java both strings
should be equal(), regardless of their normalization when serialized
and stored onto disk.

> Encoding URLs properly is probably going to solve most of my problems. I've
> been looking at patching this, but it does indeed seem pretty contrived,
> requiring quite some code on our side just to change the type of
> PathResolver to use (starting from
> org.apache.jackrabbit.core.jndi.RegistryHelper and going all the way down to
> javax.jcr.Repository#login). Could this maybe be something that would find
> its place in the WorkspaceConfig?

I think that would be an advanced setting, since JCR compliance is
based on a PathResolver working according to the spec, and people
should not be easily allowed to "break" Jackrabbit this way.

Rather, if this is really an issue, it should simply be fixed in
Jackrabbit (in PathResolver, or wherever else the String might need to
be normalized).

Regards,
Alex

--
Alexander Klimetschek
[hidden email]