    Unicode corruption in Characterization field for chinese character  
Dan Meineck




 
10-11-04 12:49 PM

Hi there, I wonder if anyone can help me. I am creating an Index Server based
search plug-in for a .NET site, using Cisso as my way of querying Index
Server. It's all working nicely, and I have now started looking at how the
server indexes pages with Unicode characters, in this example Chinese.

I am indexing flat HTML pages in a publish directory which have a meta
element of MS.LOCALE set to the locale of the correct language, in my case
'zh-CN'.

Setting the codepage of the Cisso wrapper allows the foreign characters to
render correctly, and setting the localeid to Chinese allows Chinese
characters to be accepted as search terms.

My problem is that when I run a search and get the results back, the value
retrieved directly from the Cisso dataset in the Characterization column,
for the Chinese content result, is corrupt:

"my keywords. 锘?html>. Latest News鏅寸鍦板尯鏀垮簻 鈥撴湇鍔″
ぇ浼? 鏅寸鍦板尯鏀垮簻
 鈥撴湇鍔″ぇ浼楁櫞绌哄湴鍖
斂搴滅幇鏈?5浣嶅鍛樸備
 _潵鑷拰_h〃鐫28涓夊尯浜
皯缇や紬锛屽苟鍦ㄤ换鏈
 殑鍥涘勾閲岋紝璐熻矗鏅寸
板尯鐨勫畯瑙傛斂绛_笌
勫垝锛屾彁渚涘叕
 辨湇鍔″拰鍐冲畾鍚_鏈嶅姟
鐨勬敹璐广?
 閫氳繃鏈綉绔欙紝鎮ㄥ彲_ラ
_笎娣卞叆浜嗚В鏅寸鏀垮
 鍚勯」鏂逛究甯傛皯鐨勬湇鍔
′互鍙婃斂搴滃姛鑳斤紝鍚
 勯儴闂ㄧ殑鑱旂郴鏂瑰紡鍜屾
搴滃勾搴︽姤鍛娿俉hat's N
ewTwo Column Lorem ipsum do
lor sit amet, consete"

- Notice the '?html' where '<html>' should be, and the missing 'W' in
'What's NewTwo Column'. I will add the HTML source of the indexed page below
to clarify:

<html><head><title>dan</title>
<meta name="MS.LOCALE" content="zh-cn">
<meta name="keywords" content="my keywords">
<meta name="comments" content="">
<meta name="author" content="Admin">
<meta name="accessrights" content=",1,2,">
<meta name="immediacyurl" content="http://localhost/immsample501">
<meta name="lastsavedtm" content="08/10/2004 10:31:53">
<meta name="categories" content=",">
<meta name="language" content="--">
</head><body>Latest News晴空地区政府 –服务大众

晴空地区政府
 –服务大众晴空地区政府现有55位委员。他们来自和代表着28个选区人民群众,并在任期的四年里,负责晴空地区的宏观政策与规划,提供公共服务和决定各种服务的收费。

 通过本网站,您可以逐渐深入了解晴空政府各项方便市民的服务以及政府功能,各部门的联系方式和政府年度报告。
What''s NewTwo Column
Lorem ipsum dolor sit
amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor
invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum
dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing
elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et accusam et
justo duo dolores et ea rebum. Stet clita kasd gubergren.
dan dan dan dan dan dan my keywords my keywords my keywords my keywords my
keywords
</body>
</html>

- It looks as if the corruption comes straight out of Index Server. Can
anyone shed any light on this? Another problem I found: if the title is in
Chinese text, it gets ignored, because for some reason the meta data below
it is corrupted.

Thanks,

Dan








    Re: Unicode corruption in Characterization field for chinese character  
Hilary Cotter




 
10-16-04 02:25 AM

Your server locale has to be Chinese for the characterization to show up
correctly.
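
One way to check what kind of corruption this is: the garbled text in the
first post resembles UTF-8 bytes decoded with the Simplified Chinese code
page (GBK, code page 936), and the "锘?" ahead of "html>" resembles a UTF-8
byte order mark read the same way. That is only a guess from the output; the
thread does not say how the page was saved or what Index Server does
internally. A minimal Python sketch under that assumption (the function names
are made up for illustration, and this is not Cisso or Index Server code):

```python
# Sketch only: reproduces the *kind* of corruption shown in the thread.
# Assumption (not confirmed in the thread): the page is saved as UTF-8,
# possibly with a byte order mark, and is then decoded as GBK (cp936).

UTF8_BOM = b"\xef\xbb\xbf"


def misread_utf8_as_gbk(text: str, with_bom: bool = False) -> str:
    """Encode text as UTF-8, then decode those bytes as GBK."""
    raw = text.encode("utf-8")
    if with_bom:
        raw = UTF8_BOM + raw
    # errors="replace" substitutes U+FFFD for bytes GBK cannot map.
    return raw.decode("gbk", errors="replace")


def recover_utf8_from_gbk(garbled: str) -> str:
    """Reverse the misread; works only if no bytes were lost on the way."""
    return garbled.encode("gbk").decode("utf-8")


if __name__ == "__main__":
    original = "晴空地区政府"  # taken from the page body in the first post
    garbled = misread_utf8_as_gbk(original)
    print(garbled)                                    # GBK view of the UTF-8 bytes
    print(recover_utf8_from_gbk(garbled) == original)
    print(misread_utf8_as_gbk("<html>", with_bom=True))
```

If the first line printed looks like the start of the corrupt characterization,
the page bytes are probably UTF-8 being read with the zh-CN code page. Python
may treat the stray third BOM byte differently from Index Server, so the "<"
need not disappear here the way it does in the thread.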







All times are GMT.