Binary File Test





    Binary File Test  
Michael B Allen




 
08-22-04 11:08 PM

What is a good but simple/fast test for a binary file? I was thinking
anything with a byte < 0x20 is definitely binary.

Mike








    Re: Binary File Test  
Rich Teer




 
08-22-04 11:08 PM

On Fri, 20 Aug 2004, Michael B Allen wrote:

> What is a good but simple/fast test for a binary file? I was thinking
> anything with a byte < 0x20 is definitely binary.

There is no such thing as a "binary" file.  The distinction
between "text" and "binary" files is a fabrication of MS-DOS
(or perhaps CP/M).  A file is a file.

--
Rich Teer, SCNA, SCSA, author of "Solaris Systems Programming",
publishing in August 2004.

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich








    Re: Binary File Test  
Michael B Allen




 
08-22-04 11:08 PM

On Fri, 20 Aug 2004 20:51:05 -0400, Rich Teer wrote:

> On Fri, 20 Aug 2004, Michael B Allen wrote:
> 
>
> There is no such thing as a "binary" file.  The distinction between
> "text" and "binary" files is a fabrication of MS-DOS (or perhaps CP/M). A
> file is a file.

Ok, well can someone recommend a good but simple/fast test for a
fabrication of MS-DOS "binary" file?

Right now I'm considering a file to be "binary" if it has bytes that
match the following expression:

((unsigned int)ch < 0x09 || ((unsigned int)ch < 0x20 && ch > 0x0d))
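
A minimal sketch of how the expression above might be applied to a whole buffer. The function name `has_control_bytes` and the buffer-based interface are assumptions made for illustration; they do not come from the post. Reading the bytes as `unsigned char` avoids sign extension on platforms where plain `char` is signed.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if any byte in buf matches the
 * expression from the post (a control character other than
 * 0x09..0x0d, i.e. other than \t \n \v \f \r), otherwise 0. */
int has_control_bytes(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned int ch = buf[i];
        if (ch < 0x09 || (ch < 0x20 && ch > 0x0d))
            return 1;   /* stop at the first suspicious byte */
    }
    return 0;
}
```

Note that this expression lets 0x7f (DEL) and every byte of 0x80 and above through, so it only answers "does the data contain low control characters", not "is this text".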

Mike








    Re: Binary File Test  
Mohun Biswas




 
08-22-04 11:08 PM

Michael B Allen wrote:
> Right now I'm considering a file to be "binary" if it has bytes that
> match the following expression:
>
>   ((unsigned int)ch < 0x09 || ((unsigned int)ch < 0x20 && ch > 0x0d))

What if the file contains Unicode or some other non-ASCII text? How do
you want to classify that? I think you can determine ASCII vs non-ASCII
like this but not more generally binary vs text.

An obvious thing to do is look at the source of the 'file' program. Or I
believe nowadays there's a "libmagic" which file uses.

--
Thanks,
M.Biswas








    Re: Binary File Test  
Nick Landsberg




 
08-22-04 11:08 PM

Michael B Allen wrote:

> On Fri, 20 Aug 2004 20:51:05 -0400, Rich Teer wrote:
>
> 
>
>
> Ok, well can someone recommend a good but simple/fast test for a
> fabrication of MS-DOS "binary" file?
>
> Right now I'm considering a file to be "binary" if it has bytes that
> match the following expression:
>
>   ((unsigned int)ch < 0x09 || ((unsigned int)ch < 0x20 && ch > 0x0d))
>
> Mike

Why not use !isspace(ch) to do the tests?

e.g.

if (!isprint((unsigned char)ch) && !isspace((unsigned char)ch))
{
    /* must be something else */
}

Take a look at ctype.h for a list of other macros
you can use.
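
One way the ctype approach might look over a whole buffer; this is a sketch only. The function name `count_non_text` is an assumption. The conversion to `unsigned char` matters because passing a negative `char` to a `<ctype.h>` function is undefined behavior. The result also depends on the locale: in the default "C" locale, typical implementations treat every byte above 0x7e as non-printable, so 8-bit and UTF-8 text gets counted as non-text.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes in s[0..len) that are neither
 * printable nor whitespace according to the current locale. */
size_t count_non_text(const char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        /* Convert to unsigned char first: passing a negative char
         * to a <ctype.h> function is undefined behavior. */
        unsigned char ch = (unsigned char)s[i];
        if (!isprint(ch) && !isspace(ch))
            count++;
    }
    return count;
}
```

A caller could treat a count of zero as "probably text", or allow a small fraction of odd bytes before giving up; which threshold makes sense depends on what the classification is for.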

NPL




--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch








    Re: Binary File Test  
Michael B Allen




 
08-22-04 11:08 PM

On Fri, 20 Aug 2004 21:49:38 -0400, Mohun Biswas wrote:

> Michael B Allen wrote: 
>
> What if the file contains Unicode or some other non-ASCII text? How do
> you want to classify that? I think you can determine ASCII vs non-ASCII
> like this but not more generally binary vs text.

Well, Unicode is a charset, not an encoding, and the encoding is what really
matters. Also, I think this test will work with most 8-bit encodings
such as Latin1, so it's not as bad as ASCII only. It would also work with
UTF-8, I think.

But ultimately you're right -- if the input is UCS-2 the above won't work.
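
To make the encoding point concrete, here is a small sketch that applies the control-byte expression quoted earlier in the thread to the same three characters in two encodings. The helper name `first_control_byte` and the sample arrays are assumptions made for illustration. UTF-8 encodes non-ASCII characters entirely with bytes of 0x80 and above, so it passes; UCS-2 pairs every ASCII-range character with a 0x00 byte, which the test flags.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte matching
 * the control-byte expression from the thread, or -1 if none does. */
long first_control_byte(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned int ch = buf[i];
        if (ch < 0x09 || (ch < 0x20 && ch > 0x0d))
            return (long)i;
    }
    return -1;
}

/* "Hi" followed by U+0415 (Cyrillic capital IE), encoded as UTF-8 ... */
static const unsigned char sample_utf8[] = { 0x48, 0x69, 0xd0, 0x95 };

/* ... and the same three characters as little-endian UCS-2. */
static const unsigned char sample_ucs2le[] = {
    0x48, 0x00, 0x69, 0x00, 0x15, 0x04
};
```

With these samples, `first_control_byte(sample_utf8, sizeof sample_utf8)` returns -1, while the UCS-2 array is rejected at index 1, the zero byte that follows `H`.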

> An obvious thing to do is look at the source of the 'file' program. Or I
> believe nowadays there's a "libmagic" which file uses.

I think that is a much more intense kind of examination and there's
no way it can be perfect. I'm just wondering if there was some kind of
clever trick to detecting text vs. everything else.

Mike








    Re: Binary File Test  
Rich Grise




 
08-22-04 11:08 PM

Michael B Allen wrote:

> On Fri, 20 Aug 2004 21:49:38 -0400, Mohun Biswas wrote:
> 
>
> Well, Unicode is a charset, not an encoding, and the encoding is what really
> matters. Also, I think this test will work with most 8-bit encodings
> such as Latin1, so it's not as bad as ASCII only. It would also work with
> UTF-8, I think.
>
> But ultimately you're right -- if the input is UCS-2 the above won't work.
> 
>
> I think that is a much more intense kind of examination and there's
> no way it can be perfect. I'm just wondering if there was some kind of
> clever trick to detecting text vs. everything else.
>

How about if every byte is in the range 0x20 through 0x7f, then it's text.
Oh, yeah, and \n and \t. The presence of a \r, 0x0d, means it's MS-DOS
text.
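
The rule just described could be sketched roughly as follows. The enum and function names are assumptions made for illustration; the accepted range of 0x20 through 0x7f is taken as stated, so DEL (0x7f) counts as text here.

```c
#include <stddef.h>

/* Names below are illustrative, not from the post. */
enum text_kind { KIND_BINARY, KIND_TEXT, KIND_DOS_TEXT };

/* Classifies buf using the rule above: bytes 0x20..0x7f plus \n and \t
 * are text, and a \r anywhere marks the data as MS-DOS text. */
enum text_kind classify_ascii(const unsigned char *buf, size_t len)
{
    int saw_cr = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char ch = buf[i];
        if (ch == '\r')
            saw_cr = 1;             /* 0x0d suggests MS-DOS line endings */
        else if (ch != '\n' && ch != '\t' && (ch < 0x20 || ch > 0x7f))
            return KIND_BINARY;     /* outside the accepted set */
    }
    return saw_cr ? KIND_DOS_TEXT : KIND_TEXT;
}
```

Because every byte above 0x7f falls outside the accepted set, any Latin-1 or UTF-8 text containing non-ASCII characters is classified as binary by this rule.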

I've had 'less' ask me if I still want to look at a file when it saw a
control char, if that's the kind of functionality you're looking for.

Then, of course, you've got bell, and vertical tab, and form feed, and
EOF - I've seen files where a literal 0x1a is an "end-of-file" marker
character. This may have been from the days of paper tape I/O. :-)

Good Luck!
Rich









    Re: Binary File Test  
Lev Walkin




 
08-22-04 11:08 PM

Rich Grise wrote: 
>
>
> How about if every byte is in the range 0x20 through 0x7f, then it's text.
> Oh, yeah, and \n and \t. The presence of a \r, 0x0d, means it's MS-DOS
> text.


Если использовать этот алгоритм, данная строка будет распознана как
бинарная последовательность.
("If you use this algorithm, this string will be recognized as a binary
sequence.")


Isn't it a "text" string? But your algorithm won't recognize it.


--
Lev Walkin
vlm@lionet.info








    Re: Binary File Test  
David Schwartz




 
08-22-04 11:08 PM


"Michael B Allen" <mba2000@ioplex.com> wrote in message
news:pan.2004.08.20.20.38.13.162418.3418@ioplex.com...

> What is a good but simple/fast test for a binary file? I was thinking
> anything with a byte < 0x20 is definitely binary.

Define "binary file". Then it should be obvious how to test for such a
thing.

DS










    Re: Binary File Test  
Mark Rafn




 
08-23-04 10:55 PM

Michael B Allen  <mba2000@ioplex.com> wrote:
>What is a good but simple/fast test for a binary file? I was thinking
>anything with a byte < 0x20 is definitely binary.

You're going to get a bunch of snippy responses saying that there is no
difference between text and binary.  Don't ignore these just because they're
short and sometimes mean.  They're basically right: pretend you have an
isbinary() call available.  On Unix, it always returns true.

What they mean to say is "What will you do differently for a file based on
this classification?"  That will determine what test to use, and whether a
test is needed at all.  If you're worried about display, then isprint(ch) is
a handy macro to use.  If you're worried about conversion of line endings in
transmission to another system, then you probably have to ask the user: since
text is a subset of binary, you can never be sure you're not incorrectly
handling a binary file which just happens to have all characters in the ASCII
printing range.
--
Mark Rafn    dagon@dagon.net    <http://www.dagon.net/>







