Unix Programming - Binary File Test


Author Binary File Test
Michael B Allen

2004-08-22, 6:08 pm

What is a good but simple/fast test for a binary file? I was thinking
anything with a byte < 0x20 is definitely binary.

Mike
Rich Teer

2004-08-22, 6:08 pm

On Fri, 20 Aug 2004, Michael B Allen wrote:

> What is a good but simple/fast test for a binary file? I was thinking
> anything with a byte < 0x20 is definitely binary.


There is no such thing as a "binary" file. The distinction
between "text" and "binary" files is a fabrication of MS-DOS
(or perhaps CP/M). A file is a file.

--
Rich Teer, SCNA, SCSA, author of "Solaris Systems Programming",
publishing in August 2004.

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
Michael B Allen

2004-08-22, 6:08 pm

On Fri, 20 Aug 2004 20:51:05 -0400, Rich Teer wrote:

> On Fri, 20 Aug 2004, Michael B Allen wrote:
>
>
> There is no such thing as a "binary" file. The distinction between
> "text" and "binary" files is a fabrication of MS-DOS (or perhaps CP/M). A
> file is a file.


Ok, well can someone recommend a good but simple/fast test for a
fabrication of MS-DOS "binary" file?

Right now I'm considering a file to be "binary" if it has bytes that
match the following expression:

((unsigned int)ch < 0x09 || ((unsigned int)ch < 0x20 && ch > 0x0d))
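As a rough sketch of how that expression might be applied to a whole buffer (for example, the first block read from a file), something like the following could work. The function names is_suspect_byte and looks_binary are made up for illustration, and the sketch assumes that a single matching byte is enough to call the data "binary".

```c
#include <stddef.h>

/* Returns 1 for control bytes outside the TAB (0x09) .. CR (0x0d)
 * range, mirroring the expression above. Bytes >= 0x20, including
 * everything >= 0x80, are not flagged. */
static int is_suspect_byte(unsigned char ch)
{
    return ch < 0x09 || (ch > 0x0d && ch < 0x20);
}

/* Returns 1 if any of the first len bytes of buf is "suspect",
 * 0 otherwise. Hypothetical helper, not from any library. */
int looks_binary(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++) {
        if (is_suspect_byte(p[i]))
            return 1;
    }
    return 0;
}
```

Working on unsigned char avoids any question about whether plain char is signed on a given platform. Stopping at the first hit keeps the scan cheap for data that really is binary.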

Mike
Mohun Biswas

2004-08-22, 6:08 pm

Michael B Allen wrote:
> Right now I'm considering a file to be "binary" if it has bytes that
> match the following expression:
>
> ((unsigned int)ch < 0x09 || ((unsigned int)ch < 0x20 && ch > 0x0d))


What if the file contains Unicode or some other non-ascii text? How do
you want to classify that? I think you can determine ascii vs non-ascii
like this but not more generally binary vs text.

An obvious thing to do is look at the source of the 'file' program. Or I
believe nowadays there's a "libmagic" which file uses.

--
Thanks,
M.Biswas
Nick Landsberg

2004-08-22, 6:08 pm

Michael B Allen wrote:

> On Fri, 20 Aug 2004 20:51:05 -0400, Rich Teer wrote:
>
>
>
>
> Ok, well can someone recommend a good but simple/fast test for a
> fabrication of MS-DOS "binary" file?
>
> Right now I'm considering a file to be "binary" if it has bytes that
> match the following expression:
>
> ((unsigned int)ch < 0x09 || ((unsigned int)ch < 0x20 && ch > 0x0d))
>
> Mike


Why not use !isspace(ch) to do the tests?

e.g.

if (!isprint(ch) && !isspace(ch))
{
    /* must be something else */
}

Take a look at ctype.h for a list of other macros
you can use.
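A minimal sketch of that idea applied to a buffer is below. The name count_unprintable is hypothetical, and the result depends on the current locale: in the default "C" locale every byte >= 0x80 counts as unprintable, which is stricter than the hand-written range test quoted above.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts bytes that are neither printable nor whitespace in the
 * current C locale. The cast to unsigned char matters: passing a
 * negative char value to the <ctype.h> functions is undefined. */
size_t count_unprintable(const char *buf, size_t len)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)buf[i];

        if (!isprint(ch) && !isspace(ch))
            n++;
    }
    return n;
}
```

Returning a count rather than a yes/no answer leaves the threshold to the caller, who might tolerate a stray control character in an otherwise readable file.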

NPL




--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
Michael B Allen

2004-08-22, 6:08 pm

On Fri, 20 Aug 2004 21:49:38 -0400, Mohun Biswas wrote:

> Michael B Allen wrote:
>
> What if the file contains Unicode or some other non-ascii text? How do
> you want to classify that? I think you can determine ascii vs non-ascii
> like this but not more generally binary vs text.


Well Unicode is a charset not an encoding and the encoding is what really
matters. Also, I think this test will work with most 8-bit encodings
such as Latin1, so it's not as bad as ASCII only. It would also work with
UTF-8 I think.

But ultimately you're right -- if the input is UCS-2 the above won't work.
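To make the encoding point concrete, here is a small sketch that runs the control-byte expression quoted earlier in the thread over the word "café" in three encodings. The byte values were written out by hand, first_control_byte is a made-up name, and little-endian byte order is assumed for the UCS-2 sample.

```c
#include <stddef.h>

/* Returns the offset of the first byte matching the control-byte
 * expression from earlier in the thread, or -1 if none matches. */
long first_control_byte(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char ch = buf[i];

        if (ch < 0x09 || (ch > 0x0d && ch < 0x20))
            return (long)i;
    }
    return -1;
}

/* "café" in three encodings. */
static const unsigned char latin1[] = { 'c', 'a', 'f', 0xe9 };
static const unsigned char utf8[]   = { 'c', 'a', 'f', 0xc3, 0xa9 };
static const unsigned char ucs2le[] = { 'c', 0, 'a', 0, 'f', 0, 0xe9, 0 };
```

With these samples, the Latin1 and UTF-8 arrays produce no match because every non-ASCII byte is >= 0x80, while the UCS-2 array is flagged at offset 1, where the zero high byte of 'c' sits.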

> An obvious thing to do is look at the source of the 'file' program. Or I
> believe nowadays there's a "libmagic" which file uses.


I think that is a much more intense kind of examination and there's
no way it can be perfect. I'm just wondering if there was some kind of
clever trick to detecting text vs. everything else.

Mike
Rich Grise

2004-08-22, 6:08 pm

Michael B Allen wrote:

> On Fri, 20 Aug 2004 21:49:38 -0400, Mohun Biswas wrote:
>
>
> Well Unicode is a charset not an encoding and the encoding is what really
> matters. Also, I think this test will work with most 8-bit encodings
> such as Latin1, so it's not as bad as ASCII only. It would also work with
> UTF-8 I think.
>
> But ultimately you're right -- if the input is UCS-2 the above won't work.
>
>
> I think that is a much more intense kind of examination and there's
> no way it can be perfect. I'm just wondering if there was some kind of
> clever trick to detecting text vs. everything else.
>


How about if every byte is in the range 0x20 through 0x7f, then it's text.
Oh, yeah, and \n and \t. The presence of a \r, 0x0d, means it's MS-DOS
text.

I've had 'less' ask me if I still want to look at a file when it saw a
control char, if that's the kind of functionality you're looking for.

Then, of course, you've got bell, and vertical tab, and form feed, and
EOF - I've seen files where a literal 0x1a is an "end-of-file" marker
character. This may have been from the days of paper tape I/O. :-)
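The range check described at the top of this post could be sketched roughly as follows. The enum and the name classify_ascii are invented for the example. The sketch takes "0x20 through 0x7f" literally, so DEL (0x7f) passes, and it treats bell, vertical tab, form feed and 0x1a as binary, which is only one possible reading of the last paragraph.

```c
#include <stddef.h>

enum text_kind { KIND_BINARY, KIND_TEXT, KIND_DOS_TEXT };

/* Classifies a buffer using a 7-bit range test:
 *   - \n, \t and bytes 0x20..0x7f are accepted as text,
 *   - \r is accepted but marks the data as DOS-style text,
 *   - any other byte makes the whole buffer "binary". */
enum text_kind classify_ascii(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    int saw_cr = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char ch = p[i];

        if (ch == '\r')
            saw_cr = 1;
        else if (ch != '\n' && ch != '\t' && (ch < 0x20 || ch > 0x7f))
            return KIND_BINARY;
    }
    return saw_cr ? KIND_DOS_TEXT : KIND_TEXT;
}
```

Note that an empty buffer comes back as KIND_TEXT here, and that any byte above 0x7f is rejected, so this version only recognizes plain ASCII.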

Good Luck!
Rich

Lev Walkin

2004-08-22, 6:08 pm

Rich Grise wrote:
>
>
> How about if every byte is in the range 0x20 through 0x7f, then it's text.
> Oh, yeah, and \n and \t. The presence of a \r, 0x0d, means it's MS-DOS
> text.



Если использовать этот алгоритм, данная строка будет распознана как бинарная
последовательность.

[Translation: "If you use this algorithm, this string will be recognized as a
binary sequence."]


Isn't it a "text" string? But your algorithm won't recognize it.
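The sample sentence above is Russian encoded as UTF-8, so every letter is a two-byte sequence with both bytes above 0x7f, and a 7-bit range test rejects it. One possible way to accept such text, offered here as a swapped-in technique rather than anything proposed in the thread, is to check that the high bytes at least have the shape of UTF-8. The name is_utf8_shaped is hypothetical, and the check is deliberately simplified: it does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Returns 1 if buf is structurally plausible UTF-8: each lead byte
 * is followed by the right number of 10xxxxxx continuation bytes.
 * Simplified: overlong forms and surrogates are not rejected. */
int is_utf8_shaped(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    while (i < len) {
        unsigned char ch = p[i++];
        size_t need;

        if (ch < 0x80)
            need = 0;                  /* ASCII */
        else if ((ch & 0xe0) == 0xc0)
            need = 1;                  /* 110xxxxx */
        else if ((ch & 0xf0) == 0xe0)
            need = 2;                  /* 1110xxxx */
        else if ((ch & 0xf8) == 0xf0)
            need = 3;                  /* 11110xxx */
        else
            return 0;                  /* stray continuation or bad lead */

        for (; need > 0; need--) {
            if (i >= len || (p[i] & 0xc0) != 0x80)
                return 0;              /* truncated or malformed */
            i++;
        }
    }
    return 1;
}
```

This says nothing about control characters, so it would have to be combined with one of the control-byte tests to be useful. Random binary data is unlikely to pass it for long, but short buffers can pass by accident.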


--
Lev Walkin
vlm@lionet.info
David Schwartz

2004-08-22, 6:08 pm


"Michael B Allen" <mba2000@ioplex.com> wrote in message
news:pan.2004.08.20.20.38.13.162418.3418@ioplex.com...

> What is a good but simple/fast test for a binary file? I was thinking
> anything with a byte < 0x20 is definitely binary.


Define "binary file". Then it should be obvious how to test for such a
thing.

DS


Mark Rafn

2004-08-23, 5:55 pm

Michael B Allen <mba2000@ioplex.com> wrote:
>What is a good but simple/fast test for a binary file? I was thinking
>anything with a byte < 0x20 is definitely binary.


You're going to get a bunch of snippy responses saying that there is no
difference between text and binary. Don't ignore these just because they're
short and sometimes mean. They're basically right: pretend you have an
isbinary() call available. On Unix, it always returns true.

What they mean to say is "What will you do differently for a file based on
this classification"? That will determine what test to use, and whether a
test is needed at all. If you're worried about display, then isprint(ch) is a
handy macro to use. If you're worried about conversion of line endings in
transmission to another system, then you probably have to ask the user: since
text is a subset of binary, you can never be sure you're not incorrectly
handling a binary file which just happens to have all characters in the ascii
printing range.
--
Mark Rafn dagon@dagon.net <http://www.dagon.net/>