08-22-04 11:08 PM
Michael B Allen wrote:
> On Fri, 20 Aug 2004 21:49:38 -0400, Mohun Biswas wrote:
>
>
> Well, Unicode is a charset, not an encoding, and the encoding is what really
> matters. Also, I think this test will work with most 8-bit encodings
> such as Latin1, so it's not as bad as ASCII only. It would also work with
> UTF-8, I think.
>
> But ultimately you're right -- if the input is UCS-2 the above won't work.
>
>
> I think that is a much more intense kind of examination and there's
> no way it can be perfect. I'm just wondering if there was some kind of
> clever trick to detecting text vs. everything else.
>
How about this: if every byte is in the range 0x20 through 0x7e, then it's
text (0x7f is DEL, which is itself a control char). Oh, yeah, and allow \n
and \t. The presence of a \r, 0x0d, suggests it's MS-DOS text.
I've had 'less' ask me if I still want to look at a file when it saw a
control char, if that's the kind of functionality you're looking for.
Then, of course, you've got bell, and vertical tab, and form feed, and
EOF - I've seen files where a literal 0x1a (Ctrl-Z) is an "end-of-file"
marker character. As far as I know that convention comes from CP/M and
carried over into MS-DOS. :-)
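A rough sketch of that check in Python, for what it's worth. The function names are made up for illustration, and this is strictly the 7-bit ASCII version: Latin1 or UTF-8 text with high bytes gets flagged as not-text, and so does anything containing bell, vertical tab, form feed or 0x1a. DEL (0x7f) is left out of the printable range because it is a control character.

```python
# Printable ASCII is 0x20..0x7e; range() excludes its upper bound, so 0x7f (DEL) is out.
PRINTABLE = frozenset(range(0x20, 0x7F))
# The only control characters this heuristic tolerates: \t, \n, \r.
ALLOWED_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def looks_like_text(data: bytes) -> bool:
    """Return True if every byte is printable ASCII or an allowed control char.

    Note: an empty input counts as text, since there is nothing to object to.
    """
    return all(b in PRINTABLE or b in ALLOWED_CONTROLS for b in data)


def has_carriage_return(data: bytes) -> bool:
    """Return True if a \\r (0x0d) is present, hinting at MS-DOS line endings."""
    return b"\r" in data


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\r\n"
    print(looks_like_text(sample), has_carriage_return(sample))
```

In practice you would probably run this on only the first few kilobytes of a file rather than the whole thing, and extend ALLOWED_CONTROLS if you decide form feed and friends are acceptable.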
Good Luck!
Rich