Home > Archive > Apache Directory Project > August 2005 > LDAP protocol implementation and data containing accents
LDAP protocol implementation and data containing accents
| Jérôme Baumgarten 2005-08-30, 5:45 pm |
| Hi,
I'm (still) working on my LDAP proxy implementation and I'm running into
trouble with data containing accents.
For example, accents seem to disappear from the returned DN value.
Also, accents in a filter are incorrectly received (decoded?) in
SearchHandler; for example, the filter (sn=*é*) is retrieved as
(sn=*Ã©*).
A wild guess would be some kind of coding / decoding problem (charset
related ?). I believe that if I get it in my own SearchHandler
implementation, the one provided with ApacheDS should also get it.
That may be a problem for people dealing with data having accents
(like my first name).
Are my assumptions correct?
What would be the best way to get rid of that problem?
Regards,
Jérôme
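
The symptom described above (é arriving as Ã©) is what appears when UTF-8 bytes are decoded with a single-byte charset. The following is only a minimal sketch of that effect, not the SearchHandler code: the class and method names are hypothetical, and ISO-8859-1 is assumed as the wrong charset (for these two bytes, windows-1252 gives the same result).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ApacheDS or the proxy discussed here.
final class MojibakeDemo {

    // Encode the text as UTF-8, then decode those bytes as if they were ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; its UTF-8 form is the two bytes C3 A9.
        String garbled = misdecode("(sn=*\u00e9*)");
        // C3 decodes to U+00C3 and A9 to U+00A9 in ISO-8859-1.
        System.out.println(garbled.equals("(sn=*\u00c3\u00a9*)"));
    }
}
```

Unicode escapes are used instead of literal accents so that the result does not depend on the encoding of the source file.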
| |
| Emmanuel Lecharny 2005-08-30, 5:45 pm |
| > Also, accents in a filter are incorrectly received (decoded ?) in
> SearchHandler, for example the filter (sn=*é*) is retrieved as
> (sn=*Ã©*).
Are you using UTF-8 to encode your string? Data is stored in UTF-8
format in LDAP.
>
> A wild guess would be some kind of coding / decoding problem (charset
> related ?). I believe that if I get it in my own SearchHandler
> implementation, the one provided with ApacheDS should also get it.
> That may be a problem for people dealing with data having accents
> (like my firstname).
>
> Are my assumptions correct ?
Hmmmm. Just try using new String("Jérôme", "UTF-8");
Emmanuel Lécharny ;)
| |
| Jérôme Baumgarten 2005-08-31, 7:45 am |
| On 8/30/05, Emmanuel Lecharny <elecharny-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Are you using UTF-8 to encode your string? Data are stored in UTF-8
> format in Ldap.
I did some other tests and I get the following (clients and server
running on a Windows box) :
* JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
incorrect w.r.t accents
* Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents
* JNDI test code : filter is incorrect w.r.t accents
* JLDAP test code : filter is incorrect w.r.t accents
* OpenLDAP ldapsearch (but running on a Linux box) : filter is
correct w.r.t accents
I can fix these problems if I do the following :
    String filter = LdapProxyUtils.filterToString(request.getFilter());
    try {
        filter = new String(filter.getBytes(), "UTF-8");
    } catch (UnsupportedEncodingException ueEx) {
        throw new RuntimeException(ueEx);
    }
But I don't really understand why I must do so, since "RFC 2254 - The
String Representation of LDAP Search Filters" says that it is
represented as a UTF-8 string. Thus I would expect the filter value
to be correct, no matter which platform my LDAP proxy is running on.
Also, has anyone tested a search on ApacheDS with a filter containing
accents? The problems I'm facing right now may also be present in
ApacheDS.
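
A possible reading of why the workaround above helps: if the filter bytes were decoded with a single-byte charset somewhere upstream, re-encoding with that same charset recovers the original bytes, which can then be decoded as UTF-8. The sketch below makes that charset an explicit parameter instead of relying on the platform default used by getBytes(). The class and method names are hypothetical, and ISO-8859-1 is an assumption about what happened on the Windows box.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of LdapProxyUtils or ApacheDS.
final class FilterRepair {

    // Undo a wrong decode: turn the text back into the bytes it came from,
    // then decode those bytes as the UTF-8 they were all along.
    static String repair(String garbled, Charset wronglyUsedCharset) {
        byte[] originalBytes = garbled.getBytes(wronglyUsedCharset);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = "(sn=*\u00c3\u00a9*)"; // what the handler received
        String fixed = repair(garbled, StandardCharsets.ISO_8859_1);
        System.out.println(fixed.equals("(sn=*\u00e9*)"));
    }
}
```

This only masks the problem and only works when the wrong charset can represent every original byte; decoding the incoming bytes as UTF-8 in the first place avoids the round trip.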
>
> Hmmmm. Just try using new String("Jérôme", "UTF-8");
>
> Emmanuel Lécharny ;)
Jérôme
| |
| Emmanuel Lecharny 2005-08-31, 7:45 am |
| On Wed, 2005-08-31 at 11:00 +0200, Jérôme Baumgarten wrote:
> On 8/30/05, Emmanuel Lecharny <elecharny-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> I did some other tests and I get the following (clients and server
> running on a Windows box) :
>
> * JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
> incorrect w.r.t accents
>
> * Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents
>
> * JNDI test code : filter is incorrect w.r.t accents
>
> * JLDAP test code : filter is incorrect w.r.t accents
>
> * OpenLDAP ldapsearch (but running on a Linux box) : filter is
> correct w.r.t accents
>
> I can fix these problems if I do the following :
>
> String filter = LdapProxyUtils.filterToString(request.getFilter());
> try {
> filter = new String(filter.getBytes(), "UTF-8");
> } catch (UnsupportedEncodingException ueEx) {
> throw new RuntimeException(ueEx);
> }
>
> But I don't really understand why I must do so since "RFC 2254 - The
> String Representation of LDAP Search Filters" says that it is
> represented as an UTF-8 string. Thus I would expect the filter value
> to be correct, no matter the platform my LDAP proxy is running on.
It's not a question of tool or platform. Values are stored in UTF-8 in
LDAP if they are Strings (from RFC 2251) :
"
4.1.2. String Types
The LDAPString is a notational convenience to indicate that, although
strings of LDAPString type encode as OCTET STRING types, the ISO
10646 [13] character set (a superset of Unicode) is used, encoded
following the UTF-8 algorithm [14]. Note that in the UTF-8 algorithm
characters which are the same as ASCII (0x0000 through 0x007F) are
represented as that same ASCII character in a single byte. The other
byte values are used to form a variable-length encoding of an
arbitrary character."
So you must send String values encoded in UTF-8 when querying an LDAP server. If you use a tool,
there is a good chance that a conversion is done from your locale to UTF-8 (i.e. ISO-8859-1 to UTF-8 in your case).
If you write a piece of code to send requests to LDAP, you *MUST* do this conversion yourself. Using
a simple new String("Jérôme") is not enough, as it will internally encode "Jérôme" using UTF-16.
So you always should use a new String("Jérôme", "UTF-8") before sending data to Ldap. It applies to search filters, too.
> Also, has anyone tested search on ApacheDS with filter containing
> accents ? The problems I'm facing right now may also be present with
> ApacheDS.
Sure, we have a problem with accents!!! Strings are created in ApacheDS
using new String(byte[] data) without specifying a UTF-8 encoding. So this is
a bug. It would be cool to add a JIRA issue with a simple test case.
However, we are currently tracking down a bug related to encoding and
binary values; fixing it may fix your problem.
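
As an illustration of the kind of fix described above, the sketch below decodes LDAPString octets with an explicit UTF-8 decoder instead of new String(byte[] data). It is not the ApacheDS code: the class and method names are hypothetical, and rejecting malformed input (rather than silently substituting a replacement character) is a design choice assumed here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not taken from ApacheDS.
final class StrictUtf8 {

    // Decode octets as UTF-8, independent of the platform default charset.
    // Malformed input is reported instead of being replaced.
    static String decodeStrict(byte[] octets) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("octets are not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {0x4A, (byte) 0xC3, (byte) 0xA9}; // "Jé" in UTF-8
        System.out.println(decodeStrict(utf8).equals("J\u00e9"));
        try {
            decodeStrict(new byte[] {(byte) 0xE9}); // ISO-8859-1 'é', invalid as UTF-8
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A stray ISO-8859-1 byte such as 0xE9 is rejected here, which makes a client that forgot to convert to UTF-8 visible instead of producing a quietly damaged value.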
Emmanuel Lécharny
| |
| Niclas Hedhman 2005-08-31, 7:45 am |
| On Wednesday 31 August 2005 17:34, Emmanuel Lecharny wrote:
> So you always should use a new String("Jérôme", "UTF-8") before sending
> data to Ldap. It applies to search filters, too.
I think you are somewhat mistaken. The above constructor doesn't exist, nor
makes any sense.
The straightforward way is the InputStreamReader and OutputStreamWriter with
the encoding set to "UTF-8" wrapping the stream to the external system.
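
A minimal sketch of the reader/writer approach just mentioned, using in-memory streams as a stand-in for the external system. The class and method names are hypothetical. An actual LDAP connection carries BER-encoded messages rather than plain text, so this only illustrates where the charset is specified.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the byte arrays stand in for a socket.
final class Utf8Streams {

    // Characters go in, UTF-8 bytes come out.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // UTF-8 bytes go in, characters come out.
    static String readUtf8(byte[] bytes) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[256];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String name = "J\u00e9r\u00f4me";
        byte[] wire = writeUtf8(name);
        System.out.println(wire.length);                 // two accents take two bytes each
        System.out.println(readUtf8(wire).equals(name)); // round trip preserves the text
    }
}
```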
Cheers
Niclas
| |
| Enrique Rodriguez 2005-08-31, 7:45 am |
| Niclas Hedhman wrote:
> On Wednesday 31 August 2005 17:34, Emmanuel Lecharny wrote:
>
>
> I think you are somewhat mistaken. The above constructor doesn't exist, nor
> makes any sense.
Constructor doesn't exist, but how about :
    try
    {
        String string = new String( "Jérôme".getBytes( "UTF-8" ), "UTF-8" );
    }
    catch (UnsupportedEncodingException e)
    {
    }
Enrique
| |
| Niclas Hedhman 2005-08-31, 7:45 am |
| On Wednesday 31 August 2005 19:25, Enrique Rodriguez wrote:
Duh! I get the feeling you don't understand encodings.
Run:
    try
    {
        String string1 = "Jérôme";
        String string2 = new String( "Jérôme".getBytes( "UTF-8" ), "UTF-8" );
        String string3 = new String( "Jérôme".getBytes( "ISO-8859-1" ), "ISO-8859-1" );
        System.out.println( string1.equals( string2 ) );
        System.out.println( string2.equals( string3 ) );
        System.out.println( string3.equals( string1 ) );
    }
    catch (UnsupportedEncodingException e)
    {
    }
If you haven't got it; String does not have encoding, it is Unicode (which
has very little to do with UTF-8/16 encoding). A stream of bytes which
represents Unicode characters has an encoding, and only when you convert that
stream of bytes to/from String object do you need to apply an encoding. Hence,
if you have a byte array and you want the String constructor to convert that
to a String, you need to tell it what encoding is used in the byte array.
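
To make the point above concrete, this sketch shows the same String producing different byte sequences under two charsets, while both sequences decode back to equal Strings when the matching charset is named. The class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class EncodingBoundary {

    // Render bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "J\u00e9r\u00f4me";
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(hex(utf8));   // 4A C3 A9 72 C3 B4 6D 65
        System.out.println(hex(latin1)); // 4A E9 72 F4 6D 65

        // Decoding each array with the charset that produced it gives equal Strings.
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String fromLatin1 = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(fromUtf8.equals(fromLatin1)); // true
    }
}
```

The ISO-8859-1 bytes E9 and F4 are the same values that show up as =E9 and =F4 when the name is sent through quoted-printable mail.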
Cheers
Niclas
| |
| Enrique Rodriguez 2005-08-31, 5:45 pm |
| Niclas Hedhman wrote:
> On Wednesday 31 August 2005 19:25, Enrique Rodriguez wrote:
>
> Duh! I get the feeling you don't understand encodings.
My bad. I put 2 seconds into writing something that worked in the IDE
and posted it to the list. Looking at it now, that was stupid: pointlessly
converting to and from a byte[].
OK, back to committing ...
Enrique