Unix Programming - Validating multibyte strings


Author Validating multibyte strings
Simon Morgan

2005-09-24, 7:48 am

Hi,

The following code is meant to validate a string of multibyte characters
by using mbcheck() to call mblen() on each character in the string passed
to it. The problem is that it isn't working how I expect. I've included in
the comments what I think mbcheck() should be returning for each string
given my understanding of how the multibyte system works.

#include <stdio.h>
#include <stdlib.h>

int mbcheck(const char *);

int main(void) {
    char *a[] = {
        "\x05\x87\x80\x36\xed\xaa", /* 0 */
        "\x20\xe4\x50\x88\x3f", /* -1 */
        "\xde\xad\xbe\xef", /* -1 */
        "\x8a\x60\x92\x74\x41" /* 0 */
    };
    int i;

    for (i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
        printf("%d\n", mbcheck(a[i]));
        puts("--");
    }

    return 0;
}

int mbcheck(const char *s) {
    int n;

    for (mblen(NULL, 0); ; s += n) {
        printf("checking %#.8x\n", *s);
        if ((n = mblen(s, MB_CUR_MAX)) <= 0)
            return n;
        printf("%d\n", n);
    }
}

Does mblen() rely on a locale being set? Reading the man page it doesn't
look like it. This code is for an exercise in the book "C Programming: A
Modern Approach". The strings are supposedly Shift-JIS encoded kanji and I
have no idea which locale that relates to if there is one.

Also could somebody please explain to me what's with all the hexadecimal
f's in the output? As you've probably realised I'm still learning C but
seeing as s points to a char shouldn't printf() only be reading 1 byte and
padding the output with 0?

Many thanks.

--
"Being a social outcast helps you stay concentrated on the really important
things, like thinking and hacking." - Eric S. Raymond

Ulrich Eckhardt

2005-09-24, 7:48 am

Simon Morgan wrote:
> Does mblen() rely on a locale being set? Reading the man page it doesn't
> look like it.


You need to update your manpages, mine (current Debian) explicitly mentions
locales.

> The strings are supposedly Shift-JIS encoded kanji and I
> have no idea which locale that relates to if there is one.


Just for your info, but how is mblen() supposed to know this encoding if not
via the locale?

Uli

--
http://www.erlenstar.demon.co.uk/unix/
Simon Morgan

2005-09-24, 7:48 am

On Sat, 24 Sep 2005 13:51:01 +0200, Ulrich Eckhardt wrote:

> You need to update your manpages, mine (current Debian) explicitly
> mentions locales.


I just spotted it in the NOTES section, which I didn't read. Sorry.

> Just for your info, but how is mblen() supposed to know this encoding if
> not via the locale?


I thought that the same multibyte encoding rules might apply to all
locales, i.e. a function such as mblen won't need to know the locale to
validate a string but a function used for displaying it would. I'm still
learning C so please excuse my ignorance.

--
"Being a social outcast helps you stay concentrated on the really important
things, like thinking and hacking." - Eric S. Raymond
