| Author |
How to socket and utf-8?
|
|
| yarco.w@gmail.com 2005-11-18, 5:54 pm |
| When coding a server, we always use the following code:
while(1) {
tmp_sd = accept(sd, (struct sockaddr*)&tmp_sin, &len);
//1
len = recv(tmp_sd, buf, MAX, 0); //2
send(tmp_sd, buf, len, 0); //3
close(tmp_sd);
}
1. And when accept a client, i want to check the client's IP. How to do
it?
2. When using utf-8 for communication, should i translate it into ascii
for normal using?
Can i deal with it directly like ascii?
Thanks a lot.
| |
| Måns Rullgård 2005-11-18, 5:54 pm |
| yarco.w@gmail.com writes:
> When coding a server, we always use the following code:
>
> while(1) {
> tmp_sd = accept(sd, (struct sockaddr*)&tmp_sin, &len);
> //1
> len = recv(tmp_sd, buf, MAX, 0); //2
> send(tmp_sd, buf, len, 0); //3
> close(tmp_sd);
> }
>
> 1. And when accept a client, i want to check the client's IP. How to do
> it?
The address of the connecting client is stored in the second argument
to accept().
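A minimal sketch of reading that second argument, assuming an IPv4 listening socket. The helper names format_peer and accept_and_report are made up for illustration and are not part of the original code:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: format an IPv4 peer address as "a.b.c.d:port".
   Returns 0 on success, -1 on failure or if out is too small. */
int format_peer(const struct sockaddr_in *sin, char *out, size_t outlen)
{
    char ip[INET_ADDRSTRLEN];

    if (inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof ip) == NULL)
        return -1;
    int n = snprintf(out, outlen, "%s:%u", ip, (unsigned)ntohs(sin->sin_port));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: accept one client on the listening socket sd and
   report who connected. Returns the connected descriptor, or -1. */
int accept_and_report(int sd)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof peer;   /* in/out argument: reset before every accept() */
    char text[64];

    int fd = accept(sd, (struct sockaddr *)&peer, &len);
    if (fd == -1)
        return -1;
    if (format_peer(&peer, text, sizeof text) == 0)
        fprintf(stderr, "client connected from %s\n", text);
    return fd;
}
```

Note that the length argument is both input and output, so it should not be reused to hold the return value of recv() as in the loop quoted above.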
> 2. When using utf-8 for communication, should i translate it into ascii
> for normal using?
> Can i deal with it directly like ascii?
Nothing special needs to be done. As long as both ends expect the
same encoding, everything should work.
--
Måns Rullgård
mru@inprovide.com
| |
| Pascal Bourguignon 2005-11-18, 5:54 pm |
| yarco.w@gmail.com writes:
> 2. When using utf-8 for communication, should i translate it into ascii
> for normal using?
http://en.wikipedia.org/wiki/Utf-8
In C, you don't have a notion of character.
The type char is merely a small integer, perhaps signed perhaps
unsigned, and the type unsigned char is merely a small unsigned
integer, and the type signed char is merely a small signed integer.
ASCII is an encoding, which is a direct mapping between some
_characters_ and some _integers_. Since the integers of the ASCII
encoding are between 0 and 127, they're small enough to be held in C
variables of type unsigned char. It's just a coincidence.
If you wanted to use the UNICODE encoding, which (to a first
approximation) is a direct mapping between some more _characters_ and
some _integers_, but bigger integers up to 0x10ffff, you'll need to use
unsigned long int C variables.
Now, both ASCII and UNICODE have a 1-1 mapping between a set of
characters and a set of integers.
But there are other encodings, such as UTF-8, or UTF-16, or
ISO-2022-JP, etc, that map a character to a sequence of numbers of
variable length. However, despite this variable length of characters
encoded in UTF-8, this encoding has some nice properties:
- a character X encoded in ASCII has the same code as the same
character X encoded in UTF-8.
- no multi-byte sequence of UTF-8 contains a byte equal to one of the
ASCII subset: all multi-byte sequences in UTF-8 use only numbers
between 128 and 255.
So when you use C variables of type unsigned char, you can safely
handle utf-8 byte sequences, as long as you're not interested in the
actual characters represented by the byte sequence, or as long as the
only characters in this utf-8 byte sequence are ASCII characters.
By the way you cannot "translate utf-8 to ASCII", because most
characters encodable in utf-8 cannot be encoded in ASCII:
$ echo é|iconv -f utf-8 -t ascii
iconv: illegal input sequence at position 0
So you can easily process utf-8 data as a whole, without having to
translate it. What would be "normal use" for your strings?
> Can i deal with it directly like ascii?
Globally, yes.
If you want to process the characters, in general, no.
In some cases, yes.
For example: "C'est ça la vie" is encoded in UTF-8 as these bytes:
43 27 65 73 74 20 c3 a7 61 20 6c 61 20 76 69 65
If you want to split this string on spaces (bytes 20), you can do it
as if it was encoded in ASCII, because the space in UNICODE has the
same code as in ASCII, and because UTF-8 doesn't use this code for
anything else than a space. So you can get these four subsequences of
bytes:
43 27 65 73 74
c3 a7 61
6c 61
76 69 65
which, when decoded from UTF-8 give you back these four strings:
"C'est"
"ça"
"la"
"vie"
If you keep in mind that in C you are not processing characters, but
bytes, and if you keep in mind the properties of the UTF-8 encoding,
then you can do a great deal without having to decode UTF-8 bytes to
characters. You may want to use:
typedef unsigned char byte;
const byte* bytes=(const byte*)"Hello World";
instead of char and string...
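The space-splitting example above can be sketched in C without decoding anything. This is only an illustration: struct field and split_on_space are hypothetical names, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* One field of the input: a pointer into the original buffer plus a length.
   Nothing is copied and nothing is decoded. */
struct field {
    const unsigned char *start;
    size_t len;
};

/* Split buf[0..len) on the space byte 0x20. Stores at most max fields in
   out and returns how many were stored. Empty fields are skipped.
   This is safe for UTF-8 because 0x20 never occurs inside a multi-byte
   sequence. */
size_t split_on_space(const unsigned char *buf, size_t len,
                      struct field *out, size_t max)
{
    size_t n = 0;
    size_t i = 0;

    while (i < len && n < max) {
        while (i < len && buf[i] == 0x20)   /* skip separators */
            i++;
        size_t begin = i;
        while (i < len && buf[i] != 0x20)   /* scan one field */
            i++;
        if (i > begin) {
            out[n].start = buf + begin;
            out[n].len = i - begin;
            n++;
        }
    }
    return n;
}
```

Fed the 16 bytes of "C'est ça la vie", it yields four fields of 5, 3, 2 and 3 bytes, the second one being c3 a7 61.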
--
__Pascal Bourguignon__ http://www.informatimago.com/
I need a new toy.
Tail of black dog keeps good time.
Pounce! Good dog! Good dog!
| |
| SM Ryan 2005-11-18, 5:54 pm |
| yarco.w@gmail.com wrote:
# 2. When using utf-8 for communication, should i translate it into ascii
# for normal using?
# Can i deal with it directly like ascii?
If you are using the 7-bit ASCII subset, UTF-8 and ASCII are identical.
Non-ASCII characters are encoded as two or more bytes in the range
0x80-0xFF. If you pass through signed characters <0 or unsigned characters
>=128 unmolested, you will preserve the Unicode characters.
--
SM Ryan http://www.rawbw.com/~wyrmwif/
Raining down sulphur is like an endurance trial, man. Genocide is the
most exhausting activity one can engage in. Next to soccer.
| |
| Nils O. Selåsdal 2005-11-18, 5:54 pm |
| yarco.w@gmail.com wrote:
> When coding a server, we always use the following code:
>
> while(1) {
> tmp_sd = accept(sd, (struct sockaddr*)&tmp_sin, &len);
> //1
> len = recv(tmp_sd, buf, MAX, 0); //2
> send(tmp_sd, buf, len, 0); //3
> close(tmp_sd);
> }
>
> 1. And when accept a client, i want to check the client's IP. How to do
> it?
the getpeername function.
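A minimal sketch of the getpeername() route, which works on any connected descriptor even when the address returned by accept() was not kept. The helper name peer_ip is hypothetical, and only IPv4 and IPv6 are handled:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: write the textual IP address of the peer connected
   to fd into out. Returns 0 on success, -1 on error or unknown family. */
int peer_ip(int fd, char *out, size_t outlen)
{
    struct sockaddr_storage ss;          /* large enough for IPv4 and IPv6 */
    socklen_t len = sizeof ss;

    if (getpeername(fd, (struct sockaddr *)&ss, &len) == -1)
        return -1;

    if (ss.ss_family == AF_INET) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *)&ss;
        return inet_ntop(AF_INET, &sin->sin_addr, out, (socklen_t)outlen) ? 0 : -1;
    }
    if (ss.ss_family == AF_INET6) {
        const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)&ss;
        return inet_ntop(AF_INET6, &sin6->sin6_addr, out, (socklen_t)outlen) ? 0 : -1;
    }
    return -1;                           /* e.g. AF_UNIX: no IP address */
}
```

A buffer of INET6_ADDRSTRLEN bytes is large enough for either family.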
> 2. When using utf-8 for communication, should i translate it into ascii
> for normal using?
Depends on what you want to do with it. If you want to e.g. display
it on something that doesn't understand utf-8 you must do something.
utf-8 doesn't provide direct access to the individual characters if
you ever need to do that.
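To illustrate what "no direct access" means: finding the n-th character requires a scan from the start, because characters occupy one to four bytes. A rough sketch, assuming the input is already valid UTF-8 (utf8_count and utf8_offset are made-up names, and "character" here means code point):

```c
#include <stddef.h>

/* Continuation bytes in UTF-8 have the bit pattern 10xxxxxx.
   Every other byte starts a new code point. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Number of code points in s[0..len), assuming valid UTF-8. */
size_t utf8_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (!is_continuation(s[i]))
            n++;
    return n;
}

/* Byte offset of the code point with the given zero-based index,
   or len if the string has fewer code points. Linear time. */
size_t utf8_offset(const unsigned char *s, size_t len, size_t index)
{
    size_t seen = 0;
    for (size_t i = 0; i < len; i++) {
        if (!is_continuation(s[i])) {
            if (seen == index)
                return i;
            seen++;
        }
    }
    return len;
}
```

For the 16 bytes of "C'est ça la vie" this counts 15 code points, and the "a" after "ç" (index 7) sits at byte offset 8.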
> Can i deal with it directly like ascii?
For most purposes, yes.
| |
| yarco.w@gmail.com 2005-11-19, 5:51 pm |
| Thanks for reply.
I am doing this, for example:
A client send a command which is encoded in utf-8 to the server.
Then the server parse the command, send the response also encoded in
utf-8 to the client...
I don't know whether i can treat it as normal string:
if i do:
char* msg = "GET apple";
send(sd, msg, strlen(msg), 0);
what's the difference between using ascii and utf-8 for transferring?
I get confused...
Why not:
utf8* msg = u"GET apple";
send(sd, (char*)msg, utflen(msg)*sizeof(utf8), 0);
mmm...
Would you mean i can only translate it into Unicode for checking
whether it is command?
| |
| Pascal Bourguignon 2005-11-19, 5:51 pm |
| yarco.w@gmail.com writes:
> Thanks for reply.
> I am doing this, for example:
> A client send a command which is encoded in utf-8 to the server.
> Then the server parse the command, send the response also encoded in
> utf-8 to the client...
> I don't know whether i can treat it as normal string:
> if i do:
> char* msg = "GET apple";
> send(sd, msg, strlen(msg), 0);
> what's the difference between using ascii and utf-8 for transferring?
> I get confused...
> Why not:
> utf8* msg = u"GET apple";
> send(sd, (char*)msg, utflen(msg)*sizeof(utf8), 0);
>
> mmm...
> Would you mean i can only translate it into Unicode for checking
> whether it is command?
[240]> (ext:convert-string-to-bytes "GET apple" charset:ascii)
#(71 69 84 32 97 112 112 108 101)
[241]> (ext:convert-string-to-bytes "GET apple" charset:utf-8)
#(71 69 84 32 97 112 112 108 101)
[242]> (equalp (ext:convert-string-to-bytes "GET apple" charset:ascii)
(ext:convert-string-to-bytes "GET apple" charset:utf-8))
T
So for this specific string, "GET apple" it doesn't make a difference
whether you encode it in ASCII or in UTF-8: you obtain the same byte
sequence.
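The same check can be made from C: a buffer whose bytes are all below 0x80 is at once valid ASCII and valid UTF-8, with identical meaning. A small sketch (is_all_ascii is a hypothetical helper name):

```c
#include <stddef.h>

/* Returns 1 if every byte of s[0..len) is in the ASCII range 0..127,
   0 otherwise. Such a buffer reads the same as ASCII and as UTF-8. */
int is_all_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return 0;
    return 1;
}
```

So a literal such as "GET apple" can be passed to send() with strlen() unchanged, whichever of the two encodings the protocol names.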
Now, if your command was instead: "REÇOIT une pomme", it would matter:
[244]> (ext:convert-string-to-bytes "REÇOIT une pomme" charset:utf-8)
#(82 69 195 135 79 73 84 32 117 110 101 32 112 111 109 109 101)
[245]> (ext:convert-string-to-bytes "REÇOIT une pomme" charset:ASCII)
*** - Character #\u00C7 cannot be represented in the character set
CHARSET:ASCII
The following restarts are available:
ABORT :R1 ABORT
Break 1 [246]>
as you can see, there's no ASCII encoding for this string.
Ok, so you're using UTF-8, and you send this byte sequence:
#(82 69 195 135 79 73 84 32 117 110 101 32 112 111 109 109 101)
What happens when you decode it as ASCII?
(ext:convert-string-from-bytes
(ext:convert-string-to-bytes "REÇOIT une pomme" charset:utf-8)
charset:ascii)
*** - invalid byte #xC3 in CHARSET:ASCII conversion
The following restarts are available:
ABORT :R1 ABORT
Break 1 [250]>
Well, you've got a problem because ASCII bytes can only be between 0 and 127.
Let's try something else, let's try to decode it as an ISO-8859-1
(Latin-1) bytes:
[251]> (ext:convert-string-from-bytes
(ext:convert-string-to-bytes "REÇOIT une pomme" charset:utf-8)
charset:iso-8859-1)
"REÃ<87>OIT une pomme"
Well, the command is not REÇOIT any more (the two bytes of Ç were decoded
as Ã followed by the control character 0x87, written <87> above), so I
don't know how your server will be able to understand the command...
(Note that in iso-8859-1 the code 0x87 encodes no graphical character,
but a control character "ESA"). http://en.wikipedia.org/wiki/Iso-8859-1
Now, you could define your protocol differently, and say that messages
are made of bytes, and that if the first four bytes are:
71 69 84 32
then it's a GET command and you will call: do_get(msg+4);
and let do_get do whatever it wants with the following bytes, which
can be specified to be UTF-8 bytes if you need.
Similarly, you could define your protocol to say that if the message
starts with these bytes:
82 69 195 135 79 73 84 32
then it's a GET command too, and you will call do_get(msg+8);
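A sketch of such a byte-oriented dispatch, comparing prefixes with memcmp() and never decoding the argument. The names parse_command, GET_EN and GET_FR are invented for the example; the French prefix uses 84 for the letter T.

```c
#include <stddef.h>
#include <string.h>

enum command { CMD_UNKNOWN, CMD_GET };

/* Command prefixes spelled out as bytes, trailing space included. */
static const unsigned char GET_EN[] = {71, 69, 84, 32};                   /* "GET "             */
static const unsigned char GET_FR[] = {82, 69, 195, 135, 79, 73, 84, 32}; /* "REÇOIT " in UTF-8 */

static int has_prefix(const unsigned char *msg, size_t len,
                      const unsigned char *prefix, size_t plen)
{
    return len >= plen && memcmp(msg, prefix, plen) == 0;
}

/* Recognise the command and hand back the remaining bytes untouched.
   *arg and *arg_len are only set when a command is recognised. */
enum command parse_command(const unsigned char *msg, size_t len,
                           const unsigned char **arg, size_t *arg_len)
{
    size_t skip;

    if (has_prefix(msg, len, GET_EN, sizeof GET_EN))
        skip = sizeof GET_EN;
    else if (has_prefix(msg, len, GET_FR, sizeof GET_FR))
        skip = sizeof GET_FR;
    else
        return CMD_UNKNOWN;

    *arg = msg + skip;
    *arg_len = len - skip;
    return CMD_GET;
}
```

Working with an explicit length rather than strlen() also copes with messages that recv() delivers without a terminating zero byte.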
If you defined your protocol this way, you could even do as in HTTP,
let the command specify the encoding used for the data, so you could receive
these commands:
71 69 84 47 65 83 67 73 73 32 102 105 108 101
G E T / A S C I I SP <some ASCII bytes>
71 69 84 47 75 79 73 56 45 82 32 198 193 202 204
G E T / K O I 8 - R SP <some KOI8-R bytes>
71 69 84 47 85 84 70 45 56 32 209 132 208 176 208 185 208 187
G E T / U T F - 8 SP <some UTF-8 bytes>
You could parse them as:
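/* assumes <string.h>, typedef unsigned char byte, and a writable,
   zero-terminated msg */
const byte get[]={71,69,84,0};                          /* "GET" */
byte* slash=(byte*)strchr((char*)msg,47);               /* '/' */
byte* space=slash?(byte*)strchr((char*)slash,32):NULL;  /* ' ' */
if(slash!=NULL&&space!=NULL){
    slash[0]=0;                                         /* terminate the command name */
    space[0]=0;                                         /* terminate the encoding name */
    if(strcmp((char*)msg,(const char*)get)==0){
        byte* encoding=slash+1;
        byte* encoded_bytes=space+1;
        do_get(encoding,encoded_bytes);
    }
}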
--
"Debugging? Klingons do not debug! Our software does not coddle the
weak."
| |
| yarco.w@gmail.com 2005-11-20, 5:51 pm |
| Thanks, Pascal Bourguignon.
I'm trying to create a dict server in RFC2229.
Any suggestion for socket programming?
For example, i don't know how to test whether a client is still alive??
Someone said use write()...does there exist a function
is_alive(sock_description) to test it?
Thank you very much.
| |
| Pascal Bourguignon 2005-11-20, 5:51 pm |
| yarco.w@gmail.com writes:
> Thanks, Pascal Bourguignon.
> I'm trying to create a dict server in RFC2229.
> Any suggestion for socket programming?
> For example, i don't know how to test whether a client is still alive??
When the client dies, the socket gets closed automatically.
So next time you try to read or write to it, you get a EBADF error.
> Someone said use write()...does there exist a function
> is_alive(sock_description) to test it?
No, you just use read or write. It would be useless to have an
is_alive, because the client could die between your call to is_alive
and to read or write!
> Thank you very much.
--
__Pascal Bourguignon__ http://www.informatimago.com/
Nobody can fix the economy. Nobody can be trusted with their finger
on the button. Nobody's perfect. VOTE FOR NOBODY.
| |
| Måns Rullgård 2005-11-20, 5:51 pm |
| Pascal Bourguignon <spam@mouse-potato.com> writes:
> yarco.w@gmail.com writes:
>
>
> When the client dies, the socket gets closed automatically.
> So next time you try to read or write to it, you get a EBADF error.
Are you sure about that? The system will detect that the other end
has vanished, that's for sure. However, the file descriptor will
remain open (otherwise it could be reused, causing all sorts of
trouble). Writing to a socket where the other end has closed should
give an EPIPE error, reading should just indicate end of file. It is
also possible to get an ECONNRESET error, depending on how the link
was broken.
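On the reading side this could look roughly as follows. It is a sketch only: enum peer_state and read_some are hypothetical names, and a blocking socket is assumed.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

enum peer_state { PEER_DATA, PEER_CLOSED, PEER_RESET, PEER_ERROR };

/* Read once from fd and classify the outcome.
   recv() returning 0 is end of file: the peer closed its sending side.
   ECONNRESET means the connection was torn down abruptly.
   *got receives the byte count when the result is PEER_DATA. */
enum peer_state read_some(int fd, void *buf, size_t cap, ssize_t *got)
{
    ssize_t n;

    do {
        n = recv(fd, buf, cap, 0);
    } while (n == -1 && errno == EINTR);   /* retry if interrupted by a signal */

    *got = n;
    if (n > 0)
        return PEER_DATA;
    if (n == 0)
        return PEER_CLOSED;
    return errno == ECONNRESET ? PEER_RESET : PEER_ERROR;
}
```

In every case the descriptor itself stays open until the server calls close() on it.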
--
Måns Rullgård
mru@inprovide.com
| |
| Nils O. Selåsdal 2005-11-20, 5:51 pm |
| Måns Rullgård wrote:
> Pascal Bourguignon <spam@mouse-potato.com> writes:
>
>
>
>
> Are you sure about that? The system will detect that the other end
> has vanished, that's for sure. However, the file descriptor will
> remain open (otherwise it could be reused, causing all sorts of
> trouble). Writing to a socket where the other end has closed should
> give an EPIPE error, reading should just indicate end of file. It is
Remember to handle SIGPIPE if so.
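One common way to do that is to ignore the signal once at startup, so that a write to a dead connection fails with EPIPE instead of killing the process. A sketch under that assumption (ignore_sigpipe and send_all are hypothetical helper names):

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Ignore SIGPIPE process-wide. Returns 0 on success, -1 on failure. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Send the whole buffer.
   Returns 1 when everything was sent, 0 when the peer is gone
   (EPIPE or ECONNRESET), and -1 on any other error. */
int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);
        if (n == -1) {
            if (errno == EINTR)
                continue;                   /* interrupted: try again */
            if (errno == EPIPE || errno == ECONNRESET)
                return 0;                   /* the other end has gone away */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 1;
}
```

Without the call to ignore_sigpipe(), the default action of SIGPIPE terminates the server before the EPIPE return value can be inspected.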
| |
| Pascal Bourguignon 2005-11-20, 5:51 pm |
| Måns Rullgård <mru@inprovide.com> writes:
> Pascal Bourguignon <spam@mouse-potato.com> writes:
>
>
> Are you sure about that?
No.
> The system will detect that the other end
> has vanished, that's for sure. However, the file descriptor will
> remain open (otherwise it could be reused, causing all sorts of
> trouble). Writing to a socket where the other end has closed should
> give an EPIPE error, reading should just indicate end of file. It is
> also possible to get an ECONNRESET error, depending on how the link
> was broken.
Right, EPIPE.
--
"Remember, Information is not knowledge; Knowledge is not Wisdom;
Wisdom is not truth; Truth is not beauty; Beauty is not love;
Love is not music; Music is the best." -- Frank Zappa
| |
| joe@invalid.address 2005-11-20, 5:51 pm |
| Måns Rullgård <mru@inprovide.com> writes:
> Pascal Bourguignon <spam@mouse-potato.com> writes:
>
>
> Are you sure about that? The system will detect that the other end
> has vanished, that's for sure. However, the file descriptor will
> remain open (otherwise it could be reused, causing all sorts of
> trouble). Writing to a socket where the other end has closed should
> give an EPIPE error, reading should just indicate end of file. It
> is also possible to get an ECONNRESET error, depending on how the
> link was broken.
It's more than that. The system probably won't detect that the other
end has closed the connection on the next write. It's perfectly legal
for one side of the connection to close its write end while leaving
its read end open, and several protocols make use of this.
In this case, if the sending system detects the FIN from the receiving
system, it won't send a FIN in return until the application calls
close(). If the application doesn't do that, the sending system will
happily send data on that connection.
The receiving system, when it gets data for a connection that *it*
knows is closed, should send a reset segment. When the sending system
sees that, it will return an error on the next send the application
attempts. Between those two events, the application could try to send
more than once.
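The half-close mentioned at the start can be sketched with shutdown(). This is an illustration only: finish_sending and read_until_eof are invented names, and a blocking stream socket is assumed.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Close only the sending direction. The peer will read end of file,
   but this side can still receive whatever the peer sends back. */
int finish_sending(int fd)
{
    return shutdown(fd, SHUT_WR);
}

/* Read until the peer signals end of file or the buffer is full.
   Returns the number of bytes stored, or -1 on error. */
ssize_t read_until_eof(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, 0);
        if (n == 0)
            break;                          /* peer finished sending */
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

A client would typically send its request, call finish_sending(), and then call read_until_eof() to collect the whole reply. Because an end of file may only mean such a half-close, a server that reads 0 from recv() can still try to send its answer.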
Joe
--
Gort, klatu barada nikto
| |
| Daniel C. Bastos 2005-11-21, 7:49 am |
| In article <m38xvix5r4.fsf@invalid.address>,
joe@invalid.address wrote:
> Måns Rullgård <mru@inprovide.com> writes:
>
>
> It's more than that. The system probably won't detect that the other
> end has closed the connection on the next write. It's perfectly legal
> for one side of the connection to close its write end while leaving
> its read end open, and several protocols make use of this.
>
> In this case, if the sending system detects the FIN from the receiving
> system, it won't send a FIN in return until the application calls
> close(). If the application doesn't do that, the sending system will
> happily send data on that connection.
>
> The receiving system, when it gets data for a connection that *it*
> knows is closed, should send a reset segment.
I don't understand you here. Above you say that one system closes its
write end and leaves its read end open --- eliciting a FIN. Call this
system A.
System B *will* receive FIN. Suppose system B keeps writing. System A
which sent FIN will not send RST. First because A is still reading,
second because only a close on that socket will generate RST. I don't
even see a SHUT_RDWR eliciting RST either --- which I expected, but it
doesn't seem to happen.
> When the sending system sees that, it will return an error on the
> next send the application attempts. Between those two events, the
> application could try to send more than once.
As someone pointed out, SIGPIPE could be delivered before the
application gets a chance to handle EPIPE from the write call.
| |
| joe@invalid.address 2005-11-21, 5:53 pm |
| "Daniel C. Bastos" <dbast0s@yahoo.com.br> writes:
> In article <m38xvix5r4.fsf@invalid.address>,
> joe@invalid.address wrote:
>
>
> I don't understand you here. Above you say that one system closes
> its write end and leaves its read end open --- eliciting a FIN. Call
> this system A.
>
> System B *will* receive FIN. Suppose system B keeps writing. System
> A which sent FIN will not send RST. First because A is still
> reading, second because only a close on that socket will generate
> RST. I don't even see a SHUT_RDWR eliciting RST either --- which I
> expected, but it doesn't seem to happen.
If system A closes only its write end of the connection, yes. I was
talking about the case where system A calls close(), not shutdown(). I
see that I didn't make that clear, sorry.
>
> As someone pointed out, SIGPIPE could be delivered before the
> application gets a chance to handle EPIPE from the write call.
Which is why he suggested catching or ignoring SIGPIPE.
Joe
--
Gort, klatu barada nikto