IIS Index Server - Problem with UNICODE and Index Server


Author Problem with UNICODE and Index Server
Mathias Dahl

2004-08-27, 6:17 pm


I found a page a while ago that stated that MS Index Server
handles UNICODE. Now, I have an index server running and it
is indexing a directory with files where some of the files
are UTF-16 encoded text files containing UNICODE
characters. For example, I mixed Latin and Cyrillic
characters. The text files were created using Notepad, and the
characters were pasted from the Character Map tool.

Now, if I search for something in these documents using
Latin characters I get hits, but when I use Cyrillic
characters I just get an error message saying something like
"All words were ignored".

I used the search tool that you can access from MMC and I
paste characters from the Character Map directly into the
search field.

I have also tested using an application we have (that uses
the index server COM objects) and that does not work either,
but to rule out UNICODE bugs in our own application I thought
using the included search tool should work.

What am I missing here? Is this something that is dependent
on the server "locale" or similar concepts?

Any hints regarding my problem and also UNICODE vs Index
Server in general are greatly appreciated. Searching the
internet did not give me very much information (or I used
the wrong search phrase...).

Thanks!

Mathias Dahl
Hilary Cotter

2004-08-28, 7:47 am

There are two factors here.

The first is how the docs are indexed. They will be indexed according to the
language rules of the word breaker. The word breaker chosen will be the word
breaker for the language specified in your ms.locale metatag. If there is no
language tag, these words will be broken according to the neutral word
breaker.

There is minimal word breaking done while indexing, but there is
some. If you do not specify the correct Unicode setting in your code page,
the indexer will treat a character as a letter, instead of rendering two
characters as a single Unicode representation of a character.
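
As a rough illustration of the point above (not Index Server code), the Python sketch below shows what happens when UTF-16 text is misread one byte at a time as a single-byte code page. The sample word and the choice of cp1252 as the "wrong" code page are assumptions made only for the demonstration.

```python
# Illustration only: UTF-16 text misread byte by byte as a
# single-byte "ANSI" code page. Sample data is hypothetical.
text = "мир"  # three Cyrillic letters

utf16_bytes = text.encode("utf-16-le")     # two bytes per letter here
misread = utf16_bytes.decode("cp1252")     # wrong: one character per byte
correct = utf16_bytes.decode("utf-16-le")  # right: byte pairs form letters

print(len(text), len(utf16_bytes))  # letters vs. bytes
print(repr(misread))                # punctuation and control characters
print(correct)                      # the original word again
```

Read the wrong way, the three letters turn into six unrelated characters, none of which is a Cyrillic letter.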

The second factor is that at query time the server will apply the word breaker
for the server's regional setting, which means that if your server is configured
for US English it will expect ANSI characters, and if you feed it Cyrillic
characters it will think that you are sending it gibberish and raise the
ignored words error. So you have to tell your IS server what language you
are using by

1) setting objQuery.LocaleID using ixsso, or CiLocale using IDQ
2) setting your Session.CodePage to Cyrillic or whatever code page you are
querying in.
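
A minimal sketch of how an application might pick these two values per search, written in Python for illustration. The LCID and code page numbers are the standard Windows values for these languages; the table, the language tags used as keys, and the function name are hypothetical and not part of Index Server.

```python
# Sketch: look up query-time language settings for one search.
# The numbers are standard Windows values; the names are made up.
LANGUAGE_SETTINGS = {
    "en-US": {"lcid": 0x0409, "code_page": 1252},  # English (United States)
    "de-DE": {"lcid": 0x0407, "code_page": 1252},  # German (Germany)
    "ru-RU": {"lcid": 0x0419, "code_page": 1251},  # Russian (Cyrillic)
}


def settings_for(language: str) -> dict:
    """Return the LCID and ANSI code page for a language tag."""
    try:
        return LANGUAGE_SETTINGS[language]
    except KeyError:
        raise ValueError(f"no settings known for {language!r}") from None


russian = settings_for("ru-RU")
print(russian["lcid"], russian["code_page"])
```

The returned values are what would be assigned to the locale and code page settings listed above before running the query.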

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


Mathias Dahl

2004-08-30, 2:50 am

"Hilary Cotter" <hilary.cotter@gmail.com> writes:

Hi Hilary, thanks for the reply:

> The first is how the docs are indexed. They will be indexed
> according to the language rules of the word breaker. The word breaker
> chosen will be the word breaker for the language specified in your
> ms.locale metatag. If there is no language tag, these words will be
> broken according to the neutral word breaker.


I found this page on MSDN on how to set that ms.locale metatag:

http://msdn.microsoft.com/library/d...uwebqy_3sv9.asp

Now, most of the files we index are not HTML files, so setting that
meta tag is not possible. Right? If we cannot do this, will our intent
to index files written in different languages fail? If so, the next
part of your explanation is not important to me either, right?

> There is minimal word breaking done while indexing, but
> there is some. If you do not specify the correct Unicode setting in
> your code page, the indexer will treat a character as a letter,
> instead of rendering two characters as a single Unicode
> representation of a character.


How do I "specify the correct Unicode setting in my code page"? I
thought Unicode was Unicode and code pages were just a small, specific set
of characters for a certain language and region.

> The second factor is that at query time the server will apply the word
> breaker for the server's regional setting, which means that if your
> server is configured for US English it will expect ANSI characters,
> and if you feed it Cyrillic characters it will think that you are
> sending it gibberish and raise the ignored words error. So you have
> to tell your IS server what language you are using by
>
> 1) setting objQuery.LocaleID using ixsso, or CiLocale using IDQ
> 2) setting your Session.CodePage to Cyrillic or whatever code page you are
> querying in.


Some follow-up questions to this:

1. So you say I can do this each time I am doing a search, right?

2. Also, what we need to do to enable searching for text in multiple
languages is to have the user (we have a search interface for the
users) specify the language when they do the search (or fetch the language
from their user properties, or whatever)?

> breaker for the server's regional setting, which means that if your
> server is configured for US English it will expect ANSI characters,


Hmm, this seems a bit strange. Can't I tell the index server that
"here is a string of unicode characters for you to search with"?
Expecting ANSI chars just because the regional setting is English
seems strange.

(I understand (mostly) the language part of your description though:
to be able to do correct parsing of the indexed text and the search
text, it needs to know what language I use.)

When you write this:

> The second factor is at query time the server will apply the word
> breaker for the server's regional setting, which means if your


Do you mean language or regional setting here? I may be misinterpreting what
you are trying to describe, but are you maybe mixing up the terms
"language", "locale" and "regional setting" here? I know that they
are connected, but to understand this I guess I need to know exactly
what you mean.

Lastly, sorry if I seem a bit uninformed; maybe you can point me in
the right direction to look for information about this? I just find
pieces here and there...

I appreciate all help, thanks.

/Mathias
Hilary Cotter

2004-09-02, 6:44 pm

The ms.locale metatag only applies to correctly formatted HTML. Most Office
documents can also embed a language within them. By default they will have the
language that matches your regional settings.

You can always change them. For instance, in Word, click Format, point to
Style, click Modify (for your style), click Format, then Language, and
select the language for that style or selection.

If your document is not HTML or an Office document, or your iFilter is not
language aware, the word breaker that corresponds to your regional settings
will be applied. So, if your server has the German regional settings
(Start, Settings, Control Panel, Regional Settings, and German (Germany)
shows up as your locale), and you are indexing/searching text files, the
German word breaker will be applied.

It seems to me that you are having a problem with your query string being
interpreted as ASCII instead of being rendered as Unicode. This is why you
are getting the ignored words error when you are searching on Cyrillic
characters.

In your ASP page, if you set <% Session.Codepage=1251 %> at the top, this
should correct this behavior.
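
To illustrate why the code page matters for the query string, here is a small Python sketch (an illustration, not Index Server code). It checks whether a Cyrillic search term can be represented in code page 1252, the ANSI code page for US English, versus 1251. The search term and the helper name are hypothetical.

```python
# Sketch: can a search term be represented in a given code page?
query = "привет"  # hypothetical Cyrillic search term


def fits(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in code_page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


print(fits(query, "cp1252"))  # Western European code page
print(fits(query, "cp1251"))  # Cyrillic code page

# What a lossy conversion to the wrong code page leaves behind:
print(query.encode("cp1252", errors="replace"))
```

Under code page 1252 the term does not fit, and a lossy conversion turns it into six question marks. If something similar happens to the query on its way to the word breaker, an "ignored words" error would be a plausible result, though that part is an assumption.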

--
Hilary Cotter


Mathias Dahl

2004-09-02, 6:44 pm

"Hilary Cotter" <hilary.cotter@gmail.com> writes:

> In your asp page if you set <% Session.Codepage=1251 %> at
> the top, this should correct this behavior.


I am not using ASP; I use the Query COM objects directly
from a DLL we have created.

Thanks for the info though; each little snippet I get
gets me closer to understanding how this strange beast
works.

Does anyone have any pointers to some really beefy documentation
where you can read up on the basics and how to use the index
server from your applications? Preferably related to
multiple languages/encodings.

/Mathias
Hilary Cotter

2004-09-02, 6:44 pm

Use objQuery.CodePage.
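
A minimal sketch of setting these properties before a query runs, written in Python so it can be tried without Index Server. It assumes only that the query object exposes the LocaleID and CodePage properties named in this thread. The stand-in object and the helper name are hypothetical; real code would use the actual COM query object instead.

```python
from types import SimpleNamespace


def apply_language(query_obj, lcid: int, code_page: int) -> None:
    """Set the two language-related properties named in this thread."""
    query_obj.LocaleID = lcid
    query_obj.CodePage = code_page


# Stand-in for the real COM query object, so the sketch runs anywhere.
fake_query = SimpleNamespace(Query="", LocaleID=None, CodePage=None)

# 1049 is the Windows LCID for Russian; 1251 is the Cyrillic code page.
apply_language(fake_query, lcid=1049, code_page=1251)
print(fake_query.LocaleID, fake_query.CodePage)
```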

--
Hilary Cotter



Copyright 2003 - 2008 webservertalk.com