Apache Directory Project - LDAP protocol implementation and data containing accents


Jérôme Baumgarten

2005-08-30, 5:45 pm

Hi,

I'm (still) working on my LDAP proxy implementation and I am running into
trouble with data containing accents.

For example, accents seem to disappear from the returned DN value.

Also, accents in a filter are incorrectly received (decoded ?) in
SearchHandler, for example the filter (sn=*é*) is retrieved as
(sn=*Ã©*).

A wild guess would be some kind of coding / decoding problem (charset
related ?). I believe that if I get it in my own SearchHandler
implementation, the one provided with ApacheDS should also get it.
That may be a problem for people dealing with data having accents
(like my firstname).
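
The symptom described above matches what happens when UTF-8 bytes are decoded with a single-byte charset. A minimal sketch, assuming the filter arrives as UTF-8 and is decoded as ISO-8859-1 (the class and method names are illustrative only, not part of ApacheDS):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8 (what travels on the wire), then wrongly decode as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String filter = "(sn=*\u00e9*)";            // (sn=*é*)
        String garbled = misdecode(filter);
        // é (U+00E9) is the two bytes C3 A9 in UTF-8; read one byte at a time
        // they become U+00C3 (Ã) and U+00A9 (©).
        System.out.println(garbled.equals("(sn=*\u00c3\u00a9*)")); // prints true
    }
}
```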

Are my assumptions correct ?

What would be the best way to get rid of that problem?

Regards,
Jérôme

Emmanuel Lecharny

2005-08-30, 5:45 pm

> Also, accents in a filter are incorrectly received (decoded ?) in
> SearchHandler, for example the filter (sn=*é*) is retrieved as
> (sn=*Ã©*).


Are you using UTF-8 to encode your string? Data are stored in UTF-8
format in Ldap.


>
> A wild guess would be some kind of coding / decoding problem (charset
> related ?). I believe that if I get it in my own SearchHandler
> implementation, the one provided with ApacheDS should also get it.
> That may be a problem for people dealing with data having accents
> (like my firstname).
>
> Are my assumptions correct ?


Hmmmm. Just try using new String("Jérôme", "UTF-8");

Emmanuel Lécharny ;)


Jérôme Baumgarten

2005-08-31, 7:45 am

On 8/30/05, Emmanuel Lecharny <elecharny-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Are you using UTF-8 to encode your string? Data are stored in UTF-8
> format in Ldap.


I did some other tests and I get the following (clients and server
running on a Windows box) :

* JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
incorrect w.r.t accents

* Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents

* JNDI test code : filter is incorrect w.r.t accents

* JLDAP test code : filter is incorrect w.r.t accents

* OpenLDAP ldapsearch (but running on a Linux box) : filter is
correct w.r.t accents

I can fix these problems if I do the following :

String filter = LdapProxyUtils.filterToString(request.getFilter());
try {
    filter = new String(filter.getBytes(), "UTF-8");
} catch (UnsupportedEncodingException ueEx) {
    throw new RuntimeException(ueEx);
}

But I don't really understand why I must do so since "RFC 2254 - The
String Representation of LDAP Search Filters" says that it is
represented as an UTF-8 string. Thus I would expect the filter value
to be correct, no matter the platform my LDAP proxy is running on.
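
One possible explanation for why this workaround helps: getBytes() without an argument uses the platform default charset, so if the filter was earlier decoded with that same default, re-encoding restores the original wire bytes, which can then be decoded as UTF-8. The sketch below makes both charsets explicit. It assumes the wrong decode used ISO-8859-1 (a Windows box would more likely use windows-1252), and the class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripRepair {
    // Undo a wrong decode: re-encode with the charset that was wrongly used,
    // then decode the recovered bytes as UTF-8.
    static String repair(String garbled, Charset wronglyUsed) {
        return new String(garbled.getBytes(wronglyUsed), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = "(sn=*\u00e9*)".getBytes(StandardCharsets.UTF_8); // bytes as sent by a client
        String garbled = new String(wire, StandardCharsets.ISO_8859_1); // wrong decode
        String repaired = repair(garbled, StandardCharsets.ISO_8859_1);
        System.out.println(repaired.equals("(sn=*\u00e9*)"));           // prints true

        // Decoding the wire bytes as UTF-8 in the first place gives the same result.
        String direct = new String(wire, StandardCharsets.UTF_8);
        System.out.println(direct.equals(repaired));                    // prints true
    }
}
```

Under this reading, the repair only works when the charset used for re-encoding matches the one used for the wrong decode, which would explain why behaviour differs between platforms with different defaults.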

Also, has anyone tested search on ApacheDS with filter containing
accents ? The problems I'm facing right now may also be present with
ApacheDS.

>
> Hmmmm. Just try using new String("Jérôme", "UTF-8");
>
> Emmanuel Lécharny ;)


Jérôme

Emmanuel Lecharny

2005-08-31, 7:45 am

On Wed, 2005-08-31 at 11:00 +0200, Jérôme Baumgarten wrote:
> On 8/30/05, Emmanuel Lecharny <elecharny-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> I did some other tests and I get the following (clients and server
> running on a Windows box) :
>
> * JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
> incorrect w.r.t accents
>
> * Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents
>
> * JNDI test code : filter is incorrect w.r.t accents
>
> * JLDAP test code : filter is incorrect w.r.t accents
>
> * OpenLDAP ldapsearch (but running on a Linux box) : filter is
> correct w.r.t accents
>
> I can fix these problems if I do the following :
>
> String filter = LdapProxyUtils.filterToString(request.getFilter());
> try {
> filter = new String(filter.getBytes(), "UTF-8");
> } catch (UnsupportedEncodingException ueEx) {
> throw new RuntimeException(ueEx);
> }
>
> But I don't really understand why I must do so since "RFC 2254 - The
> String Representation of LDAP Search Filters" says that it is
> represented as an UTF-8 string. Thus I would expect the filter value
> to be correct, no matter the platform my LDAP proxy is running on.




It's not a question of tool or platform. Values are stored in UTF-8 in
LDAP if they are Strings (from RFC 2251) :

"
4.1.2. String Types
The LDAPString is a notational convenience to indicate that, although
strings of LDAPString type encode as OCTET STRING types, the ISO
10646 [13] character set (a superset of Unicode) is used, encoded
following the UTF-8 algorithm [14]. Note that in the UTF-8 algorithm
characters which are the same as ASCII (0x0000 through 0x007F) are
represented as that same ASCII character in a single byte. The other
byte values are used to form a variable-length encoding of an
arbitrary character."

So you must send String values encoded in UTF-8 when querying an LDAP server. If you use a tool,
there is a good chance that a conversion is done from your locale to UTF-8 (i.e. ISO-8859-1 to UTF-8 in your case).

If you write a piece of code to send requests to LDAP, you *MUST* do this conversion yourself. Using
a simple new String("Jérôme") is not enough, as it will internally encode "Jérôme" using UTF-16.

So you should always use a new String("Jérôme", "UTF-8") before sending data to LDAP. It applies to search filters, too.
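
In Java, the two-argument String constructor takes a byte array, so the conversion to UTF-8 is usually done with getBytes. A minimal sketch of encoding a value to UTF-8 bytes and decoding it again (the helper names toWire, fromWire and hex are made up for this illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Encode {
    // Convert a Java String to the UTF-8 bytes that would be sent to the server.
    static byte[] toWire(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Convert UTF-8 bytes received from the server back to a Java String.
    static String fromWire(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated uppercase hex, for inspection only.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "J\u00e9r\u00f4me";                 // Jérôme
        byte[] wire = toWire(name);
        System.out.println(hex(wire));                    // 4A C3 A9 72 C3 B4 6D 65
        System.out.println(fromWire(wire).equals(name));  // prints true
    }
}
```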


> Also, has anyone tested search on ApacheDS with filter containing
> accents ? The problems I'm facing right now may also be present with
> ApacheDS.


Sure, we have a problem with accents!!! Strings are created in ApacheDS
using new String(byte[] data) without specifying the UTF-8 encoding. So this is
a bug. It would be cool to add a JIRA issue with a simple test case.

However, we are currently tracking down a bug related to encoding and
binary values; fixing it may fix your problem.

Emmanuel Lécharny


Niclas Hedhman

2005-08-31, 7:45 am

On Wednesday 31 August 2005 17:34, Emmanuel Lecharny wrote:
> So you always should use a new String("Jérôme", "UTF-8") before sending
> data to Ldap. It applies to search filters, too.


I think you are somewhat mistaken. The above constructor doesn't exist, nor
makes any sense.

The straightforward way is the InputStreamReader and OutputStreamWriter with
the encoding set to "UTF-8" wrapping the stream to the external system.
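
A rough sketch of the stream-wrapping approach, using in-memory byte streams as a stand-in for the connection to the external system (the class and method names are illustrative assumptions):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class StreamEncodingDemo {
    // Write text through a UTF-8 OutputStreamWriter and return the raw bytes produced.
    static byte[] write(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read raw bytes back through a UTF-8 InputStreamReader.
    static String read(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "J\u00e9r\u00f4me";               // Jérôme
        byte[] onTheWire = write(name);
        System.out.println(onTheWire.length);           // 8 bytes for 6 characters
        System.out.println(read(onTheWire).equals(name)); // prints true
    }
}
```

With this design the encoding is stated once, where the stream is wrapped, and the rest of the code only handles String values.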


Cheers
Niclas

Enrique Rodriguez

2005-08-31, 7:45 am

Niclas Hedhman wrote:
> On Wednesday 31 August 2005 17:34, Emmanuel Lecharny wrote:
>
>
> I think you are somewhat mistaken. The above constructor doesn't exist, nor
> makes any sense.


Constructor doesn't exist, but how about :

try
{
    String string = new String( "Jérôme".getBytes( "UTF-8" ), "UTF-8" );
}
catch (UnsupportedEncodingException e)
{
}

Enrique

Niclas Hedhman

2005-08-31, 7:45 am

On Wednesday 31 August 2005 19:25, Enrique Rodriguez wrote:

Duh! I get the feeling you don't understand encodings.

Run;

try
{
    String string1 = "Jérôme";
    String string2 = new String( "Jérôme".getBytes( "UTF-8" ), "UTF-8" );
    String string3 = new String( "Jérôme".getBytes( "ISO-8859-1" ), "ISO-8859-1" );
    System.out.println( string1.equals( string2 ) );
    System.out.println( string2.equals( string3 ) );
    System.out.println( string3.equals( string1 ) );
}
catch (UnsupportedEncodingException e)
{
}


If you haven't got it; String does not have encoding, it is Unicode (which has
very little to do with UTF-8/16 encoding). A stream of bytes which represents
unicode characters has an encoding, and only when you convert that stream of
bytes to/from String object do you need to apply an encoding. Hence, if you
have a byte array and you want the String constructor to convert that to a
String, you need to tell it what encoding is used in the byte array.
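
To illustrate the point that the encoding belongs to the bytes and not to the String, here is a small sketch (class and method names are made up for the example) in which one String yields byte arrays of different lengths depending on the charset, and each array decodes back to the same String only when the matching charset is named:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringHasNoEncoding {
    // Number of bytes the String occupies once encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // True if encoding and decoding with the same charset returns the original String.
    static boolean roundTrips(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        String name = "J\u00e9r\u00f4me"; // Jérôme, 6 characters
        System.out.println(encodedLength(name, StandardCharsets.UTF_8));      // 8
        System.out.println(encodedLength(name, StandardCharsets.ISO_8859_1)); // 6
        System.out.println(encodedLength(name, StandardCharsets.UTF_16BE));   // 12
        System.out.println(roundTrips(name, StandardCharsets.UTF_8));         // true
        System.out.println(roundTrips(name, StandardCharsets.ISO_8859_1));    // true

        // Decoding with a charset other than the one used for encoding changes the text.
        String mismatched = new String(name.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(mismatched.equals(name));                          // false
    }
}
```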


Cheers
Niclas

Enrique Rodriguez

2005-08-31, 5:45 pm

Niclas Hedhman wrote:
> On Wednesday 31 August 2005 19:25, Enrique Rodriguez wrote:
>
> Duh! I get the feeling you don't understand encodings.


My bad. I put 2 seconds into writing something that worked in the IDE
and posted it to the list. Looking at it now, that was stupid: pointlessly
converting to and from a byte[].

OK, back to committing ...

Enrique
