AOL Webserver - Working with Chinese characters in Tcl/AOLserver


Author Working with Chinese characters in Tcl/AOLserver
Janine Sisk

2007-09-05, 7:11 am

This may be more of a Tcl question than an AOLserver one, but I'm
guessing that people on this list are more likely to have run into
it. So here goes.

I'm working with strings encoded in big5 and gb2312 (traditional and
simplified Chinese, respectively). I'm exec'ing out to a Java
program that translates from one to the other. I have found that the
only way to get my data to the program intact, and get the response
back intact, is to store the data in intermediate files with the
fconfigure command to set the encoding. Anything else ends up
mangling the data. I can't, for example, grab the return value of
the command directly from the exec; if I do, it's mangled. I have
to have the Java program write it to a file, and then read it with
the encoding set, in order to get the data intact.

As you can imagine, having to write two files per page request isn't
exactly ideal, even with caching. So has anyone else done this and
found a way to do it?

thanks,

janine


Bas Scheffers

2007-09-05, 7:11 am

Instead of using exec, have you tried to open a pipe
(open "| javacmd") and use fconfigure on the I/O channel returned by this?
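
A minimal sketch of the pipe approach suggested here, for readers who have not used it. The helper name `read_cmd_output` is made up for illustration, and the command line passed in stands in for the real Java converter, whose invocation is not shown in the thread.

```tcl
# Run a command through a pipe and decode its output with a chosen encoding.
# read_cmd_output is a hypothetical helper, not part of Tcl or AOLserver.
proc read_cmd_output {cmdline enc} {
    # A leading "|" makes [open] start the command and return a channel
    # attached to the command's standard output.
    set chan [open "|$cmdline" r]
    # Decode the child's bytes as $enc rather than the system encoding.
    fconfigure $chan -encoding $enc
    set data [read $chan]
    # [close] raises an error if the child exits non-zero or writes to stderr.
    close $chan
    return $data
}
```

Called as, say, `read_cmd_output [list java Converter input.txt] gb2312` (where `Converter` and `input.txt` are placeholder names), this would replace the output file but still needs some way to hand the input to the program.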

Cheers,
Bas.

On 5 Sep 2007, at 17:05, Janine Sisk wrote:

> This may be more of a Tcl question than an AOLserver one, but I'm
> guessing that people on this list are more likely to have run into
> it. So here goes.
>
> I'm working with strings encoded in big5 and gb2312 (traditional
> and simplified Chinese, respectively). I'm exec'ing out to an Java
> program that translates from one to the other. I have found that
> the only way to get my data to the program intact, and get the
> response back intact, is to store the data in intermediate files
> with the fconfigure command to set the encoding. Anything else
> ends up mangling the data. I can't, for example, grab the return
> value of the command directly from the exec; if I do, it's
> mangled. I have to have the Java program write it to a file, and
> then read it with the encoding set, in order to get the data intact.
>
> As you can imagine, having to write two files per page request
> isn't exactly ideal, even with caching. So has anyone else done
> this and found a way to do it?
>
> thanks,
>
> janine
>
>
> --
> AOLserver - http://www.aolserver.com/
>
> To Remove yourself from this list, simply send an email to
> <listserv@listserv.aol.com> with the
> body of "SIGNOFF AOLSERVER" in the email message. You can leave the
> Subject: field of your email blank.



Dossy Shiobara

2007-09-05, 1:11 pm

On 2007.09.05, Janine Sisk <janine@FURFLY.NET> wrote:
> I'm working with strings encoded in big5 and gb2312 (traditional and
> simplified Chinese, respectively). I'm exec'ing out to an Java
> program that translates from one to the other. [...]


Is that Java program doing anything else to the data? If you're just
using Java to transcode Tcl strings, you're really hurting yourself for
no reason:

set big5string [encoding convertto big5 $gb2312string]

set gb2312string [encoding convertto gb2312 $big5string]

Tcl's encoding support is probably one of its strengths as a scripting
language.
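
For data that arrives as raw bytes rather than as an already-decoded Tcl string, the conversion is usually written as two steps. The sketch below assumes that situation; `transcode` is a made-up helper name.

```tcl
# Re-encode raw bytes from one character encoding to another.
# transcode is a hypothetical helper name.
proc transcode {bytes from to} {
    # Step 1: decode the external bytes into Tcl's internal Unicode string.
    set text [encoding convertfrom $from $bytes]
    # Step 2: encode that string as bytes in the target encoding.
    return [encoding convertto $to $text]
}

# The character U+4E2D is the byte pair A4 A4 in big5.
set gbBytes [transcode [binary format H* a4a4] big5 gb2312]
```

Note that this only changes how the same characters are stored as bytes. It does not turn a traditional character into its simplified counterpart, and a character that does not exist in the target encoding cannot be represented there.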

> I can't, for example, grab the return value of the command directly
> from the exec; if I do, it's mangled.


I don't think you can tell [exec] what encoding the I/O will be.
Perhaps you could/should see if there's a TIP for [exec -encoding $name
$command] already ...

-- Dossy

--
Dossy Shiobara | dossy@panoptic.com | http://dossy.org/
Panoptic Computer Network | http://panoptic.com/
"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on." (p. 70)


Janine Sisk

2007-09-05, 7:11 pm

I did try opening a pipe at one point (I tried a lot of different
things last night) but I can't remember if that was before or after I
found fconfigure.

I did find that it does not seem to be possible to open a pipe for
both reading and writing. I tried it and was able to write data to
the pipe (simulating stdin) but my read from the same channel just
hung. Reading from the channel does work if it is opened only for
reading.

The combination of an open pipe with fconfigure would allow me to
send data to the command properly, but I'd still have to use a file
to get the output. Still, 50% fewer files is a good thing!

However, Dossy's suggestion of using the encoding command is very
intriguing - I've looked at that page of Tcl commands hundreds of
times and never noticed it. If it does what I need, that will be a
much better solution.

Thanks to both of you!

janine




Jeff Rogers

2007-09-05, 7:11 pm

Janine Sisk wrote:

> I did find that it does not seem to be possible to open a pipe for both
> reading and writing. I tried it and was able to write data to the pipe
> (simulating stdin) but my read from the same channel just hung. Reading
> from the channel does work if it is opened only for reading.


Opening a pipe for both reading and writing does work, but buffering and
eof handling can trip you up. Buffering can be turned off with
fconfigure or you can explicitly flush, but eof isn't so simple to deal
with - if your inferior process waits for EOF before writing anything
then it will get stuck because there is no way to half-close the pipe.
If the inferior process writes as soon as it has data available then it
will likely work ok. (Incidentally, I asked about this exact same
issue about a week ago and got no response; I think it got lost in all
the debate about trac).
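
A rough sketch of the case that does work, where the child answers each line as soon as it receives it. `pipe_roundtrip` is a made-up helper name, and `cat` is used only as a stand-in for a converter that behaves this way.

```tcl
# Send one line to a child process and read one line back over the same pipe.
# pipe_roundtrip is a hypothetical helper name.
proc pipe_roundtrip {cmdline enc line} {
    # "r+" opens the pipe for writing (child's stdin) and reading (its stdout).
    set chan [open "|$cmdline" r+]
    # Line buffering flushes our output at every newline, so the child sees
    # the data immediately instead of waiting for Tcl's buffer to fill.
    fconfigure $chan -encoding $enc -buffering line
    puts $chan $line
    # This blocks until the child writes a complete line; a child that waits
    # for EOF on its input before answering would hang here.
    set reply [gets $chan]
    close $chan
    return $reply
}

set echoed [pipe_roundtrip cat utf-8 "\u4e2d\u6587"]
```

A program that buffers its own output, or that reads all of stdin before writing, would need to be changed before this pattern could be used with it.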

-J


Jeff Rogers

2007-09-05, 7:11 pm

Dossy Shiobara wrote:

>
> I don't think you can tell [exec] what encoding the I/O will be.
> Perhaps you could/should see if there's a TIP for [exec -encoding $name
> $command] already ...


There is -
TIP #259: Making 'exec' optionally binary safe
http://www.tcl.tk/cgi-bin/tct/tip/259.html

Unfortunately it has been around in draft state for a year and a half
now with no apparent action and is targeted for Tcl 8.6, so it's a ways off.

-J


Janine Sisk

2007-09-05, 7:11 pm

This is still not working out very well... let me explain more about
what I'm doing, and maybe it will ring a bell for someone.

I'm working with a site that stores its content in big5, and is run
through a conversion program to create a gb2312 version for those who
prefer the simplified characters. I know these are the charsets
being used; I've seen the config files for the converter.
Unfortunately the converter was written by a Chinese company with no
English info available, does not appear in Google, and is no longer
supported even by the original authors. So basically I have to write
my own program to do what it does, without any info on how it does
what it does.

I'm currently working with a snippet of text from the site, but the
eventual idea is to have the converter run under a separate web
server and have it grab the page from the big5 site, convert it, and
send it out to the browser. This is how the existing translator
works, as far as I can tell.

Regardless of whether I'm reading the snippet from a text file or
getting an entire page via ns_http, I have to set the encoding to
utf-8 in order to get the data properly. It does not display
properly if I call it big5. This is odd, but not terribly so; the
database and source AOLserver are both configured to use utf-8, so
this is at least consistent.

The only conversion that works with the Java program is to go utf-8
to utf-8s, which it calls simplified utf-8. Google tells me that
this is a bastardized format of sorts, proposed by Oracle and not
widely accepted. Unfortunately it is, so far, the only one that
works. Data comes in as utf-8, gets converted to utf-8s, and goes
out through AOLserver configured to use utf-8. All is well.

The problem is, Tcl doesn't support utf-8s, and as far as I can tell
there is no other format that will work. This will leave me stuck
with the Java program, and I have serious concerns about the
performance of any sort of exec, let alone one that involves writing
files.

Any suggestions?

thanks,

janine




Bas Scheffers

2007-09-05, 7:11 pm

On 6 Sep 2007, at 08:17, Janine Sisk wrote:
> The problem is, Tcl doesn't support utf-8s, and as far as I can
> tell there is no other format that will work. This will leave me
> stuck with the Java program, and I have serious concerns about the
> performance of any sort of exec, let alone one that involves
> writing files.

In that case, would it not make sense to just implement the
"simplifying proxy" in Java itself (i.e.: use Tomcat or Jetty as
server) and forget about AOLserver for that? Java's i18n is very good
and sounds like it might well be the best tool for the job...

Cheers,
Bas.


Dossy Shiobara

2007-09-05, 7:11 pm

On 2007.09.05, Jeff Rogers <dvrsn@DIPHI.COM> wrote:
> [...] if your inferior process waits for EOF before writing anything
> then it will get stuck because there is no way to half-close the pipe.
> If the inferior process writes as soon as it has data available then
> it will likely work ok. (Incidentally, I asked about this exact same
> issue about a week ago and got no response; I think it got lost in all
> the debate about trac).


Perhaps it's time to introduce ns_exec, that yields read and write
Tcl_Channel's using pipe()?

Good idea? Bad idea? Something you'd like to see in the AOLserver 4.5
tree?

-- Dossy



Dossy Shiobara

2007-09-05, 7:11 pm

On 2007.09.05, Janine Sisk <janine@FURFLY.NET> wrote:
> The only conversion that works with the Java program is to go utf-8
> to utf-8s, which it calls simplified utf-8. Google tells me that
> this is a bastardized format of sorts, proposed by Oracle and not
> widely accepted.


http://download.oracle.com/docs/cd/.../appunicode.htm

| Oracle's AL32UTF8 character set supports 1-byte, 2-byte, 3-byte,
| and 4-byte values. Oracle's UTF8 character set supports 1-byte,
| 2-byte, and 3-byte values, but not 4-byte values.

Are you using Oracle with NLS_LANG set to AL32UTF8, or just UTF8?

I spotted this OpenACS forum message, in case it helps you see what to
check and/or change:

http://openacs.org/forums/message-v...ssage_id=198856

> The problem is, Tcl doesn't support utf-8s, and as far as I can tell
> there is no other format that will work.


If you tell Oracle you want AL32UTF8, then you'll get UTF-8 as Tcl
expects (and can handle).

-- Dossy



Janine Sisk

2007-09-06, 1:11 am

I'm not using Oracle, but Postgres. The database is in UTF-8 format,
so that "should" be what I'm getting out of it. I'm not getting the
data directly from the database anyway, but via ns_http. I have no
idea how UTF-8S is getting in to the mix other than it's the only
format I can find to convert to that works.

Bas is probably right, I should just do this in pure java. My main
concern is that I'm doing something wrong that's making this harder
than it needs to be. Using the encoding command would be so simple...

I already saw that message, thanks, and have followed it as much as
possible.

janine




Jeff Rogers

2007-09-06, 1:11 am

Janine Sisk wrote:

> I'm working with a site that stores it's content in big5, and is run
> through a conversion program to create a gb2312 version for those who
> prefer the simplified characters. I know these are the charsets being
> used; I've seen the config files for the converter. Unfortunately the
> converter was written by a Chinese company with no English info
> available, does not appear in Google, and is no longer supported even by
> the original authors. So basically I have to write my own program to do
> what it does, without any info on how it does what it does.


I haven't dealt with Chinese characters at all, but this sounds like
you're doing character set translations, not character encoding
conversions. Tcl's 'encoding' command won't help you here - you'd need
a monster "string map" command to change all 6000? code points from one
into the other. To draw a much simplified analogy, this is like
translating cp1252 to iso8859-1 - you can't do it by simply changing the
encoding, you must translate the character set from one to the other by
mapping the characters that do not appear in the target character set
(in the case of cp1252->iso8859-1 you might map both the left and right
single quotes to an apostrophe)
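
To make the "string map" idea concrete, here is a toy sketch. The three pairs in the table (U+570B to U+56FD, U+5B78 to U+5B66, U+6F22 to U+6C49) are common traditional/simplified pairs chosen purely for illustration; a usable table would need thousands of entries, and the names `trad2simp` and `simplify` are made up.

```tcl
# A tiny, illustrative traditional-to-simplified table. The \u escapes keep
# the script independent of the encoding of the source file itself.
set trad2simp [list \
    \u570b \u56fd \
    \u5b78 \u5b66 \
    \u6f22 \u6c49]

# Replace every character found in the table; everything else passes through.
proc simplify {text} {
    global trad2simp
    return [string map $trad2simp $text]
}

set simplified [simplify "\u570b\u5b78 ABC"]
```

The mapping works on decoded Tcl strings, so the input has to be converted from big5 (or utf-8) first and the result converted to the output encoding afterwards.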


> The only conversion that works with the Java program is to go utf-8 to
> utf-8s, which it calls simplified utf-8. Google tells me that this is a
> bastardized format of sorts, proposed by Oracle and not widely
> accepted. Unfortunately it is, so far, the only one that works. Data
> comes in as utf-8, gets converted to utf-8s, and goes out through
> AOLserver configured to use utf-8. All is well.


I think simplified utf-8 is the same as regular utf-8 for all code
points < U+10000 (i.e., anything that fits in a single UTF-16 code unit,
which is Java's native string format). So if your characters are all
below that you can call it utf-8 without issue.
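
As a quick check of that claim for an ordinary Chinese character, the sketch below prints the utf-8 bytes of U+4E2D, which lies well below U+10000 and takes three bytes. `utf8_hex` is a made-up helper name.

```tcl
# Return the utf-8 bytes of a string as lowercase hex.
# utf8_hex is a hypothetical helper name.
proc utf8_hex {str} {
    binary scan [encoding convertto utf-8 $str] H* hex
    return $hex
}

puts [utf8_hex \u4e2d]   ;# prints e4b8ad
```

Only characters at or above U+10000 would be written differently by the two formats, as far as I understand the proposal.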

> The problem is, Tcl doesn't support utf-8s, and as far as I can tell
> there is no other format that will work. This will leave me stuck with
> the Java program, and I have serious concerns about the performance of
> any sort of exec, let alone one that involves writing files.


It sounds like the Java program is your best bet since it does the
translation already; do you have the source to the Java program? You
might be able to modify it to run better in a pipe, or by being a
persistent process so you avoid the fork/exec overhead on every run
(e.g., by running it inside tomcat as someone else suggested). If
you're really adventurous you could try getting it to run under tcljava
but I have no idea if that even works inside AOLserver.

-J


Jeff Rogers

2007-09-06, 1:11 am

Dossy Shiobara wrote:
> On 2007.09.05, Jeff Rogers <dvrsn@DIPHI.COM> wrote:
>
> Perhaps it's time to introduce ns_exec, that yields read and write
> Tcl_Channel's using pipe()?
>
> Good idea? Bad idea? Something you'd like to see in the AOLserver 4.5
> tree?


+1 (Ok, we aren't Apache)

I'd rather see something in the Tcl core; maybe an implementation in
AOLserver would work as a proving ground.

Are you thinking of
lassign [ns_exec "somecommand"] read_fd write_fd
or
set result [ns_exec -binary "somecommand" << $input]
? I could see either having advantages in different situations.

-J


Tom Jackson

2007-09-06, 1:11 am

On Wednesday 05 September 2007 16:55, Jeff Rogers wrote:
> I haven't dealt with chinese characters at all, but this sounds like
> you're doing character set translations, not character encoding
> conversions. tcl's 'encoding' command won't help you here - you'd need
> a monster "string map" command to change all 6000? code points from one
> into the other. To draw a much simplified analogy, this is like
> translating cp1252 to iso8859-1 - you can't do it by simply changing the
> encoding, you must translate the character set from one to the other by
> mapping the characters that do not appear in the target character set
> (in the case of cp1252->iso8859-1 you might map both the left and right
> single quotes to an apostrophe)


This is what I was thinking. Simplifying a character set isn't 'simple'. And
it would seem impossible to go from the simple character set to the complex
one. It isn't quite a translation, which would be impossible, but the map
will likely have one entry for every char in the larger set, whereas you can
use an algorithm to convert UTF-16 to UTF-8. The key is the map. If this was
built into Tcl, or you could put it into Tcl, you could dispense with ipc,
java and files.

tom jackson


Tom Jackson

2007-09-06, 1:11 am

On Wednesday 05 September 2007 16:11, Dossy Shiobara wrote:
> Perhaps it's time to introduce ns_exec, that yields read and write
> Tcl_Channel's using pipe()?
>
> Good idea? Bad idea? Something you'd like to see in the AOLserver 4.5
> tree?


Yes, pipe is missing, and it is one of the few missing pieces. I guess
ns_exec would imply a simple pipe where AOLserver exec's some process, but
it seems that the new process would need to know about the pipe. There are
also named pipes (FIFOs) and message queues, which are pretty portable and
would be nice to have too. At the very least, you would need two pipes to
have a conversation.

tom jackson


Dossy Shiobara

2007-09-06, 1:11 am

On 2007.09.05, Janine Sisk <janine@FURFLY.NET> wrote:
> Bas is probably right, I should just do this in pure java. My main
> concern is that I'm doing something wrong that's making this harder
> than it needs to be. Using the encoding command would be so simple...
>
> I already saw that message, thanks, and have followed it as much as
> possible.


If you're getting the data via ns_http and the data is coming to you as
utf8 and you need to transcode to big5, you'd need to do something like:

# ns_http queue here ...
ns_http wait $id data
set output [encoding convertto big5 \
    [encoding convertfrom utf-8 $data]]

The "convertfrom utf-8" might SEEM unnecessary, but I believe ns_http
neither does the necessary external-to-utf conversion on its data nor
returns it as a byte array via Tcl_NewByteArrayObj (BAD!) ...

-- Dossy



Dossy Shiobara

2007-09-06, 1:11 am

On 2007.09.05, Jeff Rogers <dvrsn@DIPHI.COM> wrote:
> Are you thinking of
> lassign [ns_exec "somecommand"] read_fd write_fd
> or
> set result [ns_exec -binary "somecommand" << $input]
> ? I could see either having advantages in different situations.


I'm thinking of the former.

-- Dossy



Dave Bauer

2007-09-06, 1:11 am

> The key is the map. If this was
> built into Tcl, or you could put it into Tcl, you could dispense with ipc,
> java and files.
>

Here is a crazy idea.
Imagine creating a text file with every character you have in the
input data. Or just take say, a good sample of all the input you'd
have, and get a list of unique characters.

Pass it to the Java program.

Use the output to create a map and convert it to Tcl (or whatever).


Might work, might not. You could write a few tests by comparing output
from the Java program to your new maintainable map. This assumes there is
a one-to-one mapping. I have no idea, I am not familiar with Chinese.
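
A rough sketch of the map-building step, assuming the sample characters and the converter's output for them are already available as two decoded Tcl strings in the same order. `build_map` is a made-up name, and the length check is only a crude test of the one-to-one assumption.

```tcl
# Build a [string map] table from the characters fed to the converter and
# the characters it returned. build_map is a hypothetical helper name.
proc build_map {inputChars outputChars} {
    if {[string length $inputChars] != [string length $outputChars]} {
        error "input and output differ in length; mapping is not one-to-one"
    }
    set map [list]
    foreach src [split $inputChars ""] dst [split $outputChars ""] {
        # Only record characters the converter actually changed.
        if {$src ne $dst} {
            lappend map $src $dst
        }
    }
    return $map
}

# Example with two changed characters and one unchanged one.
set map [build_map "\u570bA\u5b78" "\u56fdA\u5b66"]
set converted [string map $map "\u5b78\u570b"]
```

The resulting list can be saved to a file and loaded once at server startup, so no external process is needed per request.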

Dave


Tom Jackson

2007-09-06, 1:11 am

On Wednesday 05 September 2007 17:53, Dave Bauer wrote:
> Might work, might not. You could write a few tests by comparing output
> from the Java to your new maintainable map. This assumes there is a
> one-to-one mapping. I have no idea, I am not familiar with Chinese.


There are several steps, but processing is char-by-char. You have to know how
to read a char. With UTF-16, this is easy. UTF-8 is harder to read. Once you
have a char, you need a method to map it to the new character set. Handling
errors is another story.

Also, if you are reading a channel, you can fconfigure as binary. If you don't
do this, then Tcl will probably convert it to UTF-8. Use your char reader on
a binary channel unless Tcl can do the conversion all by itself.
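
A small sketch of the binary-channel approach for the case where Tcl can do the conversion itself: the channel delivers untouched bytes, and the decoding is a separate, explicit step. `read_decoded` is a made-up helper name.

```tcl
# Read a file as raw bytes, then decode it explicitly.
# read_decoded is a hypothetical helper name.
proc read_decoded {path enc} {
    set chan [open $path r]
    # Binary translation stops Tcl from applying any encoding or newline
    # conversion while reading.
    fconfigure $chan -translation binary
    set bytes [read $chan]
    close $chan
    # Turn the raw bytes into a Tcl string using the stated encoding.
    return [encoding convertfrom $enc $bytes]
}
```

Keeping the two steps apart makes it easy to try a different encoding name on the same bytes when the declared charset turns out to be wrong.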

Don't expect too much; I have yet to find a browser which reads UTF-8 100%
correctly.

see: http://rmadilo.com/files/utf-8/UTF-8-test.txt

tom jackson


Janine Sisk

2007-09-06, 7:11 pm

I think I will just go with moving the thing to Tomcat; it looks
like that's going to be easier than rewriting the mapping process in
Tcl.

Thanks for the input, everyone!

janine



