Web Servers on Unix and Linux - mod_proxy_html and nested <body> tags


Bill

2005-03-14, 5:54 pm

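I have encountered a situation where some of the web pages my company
has contain nested <body> tags.

<BODY>
.
.
.
<body>
.
.
.
</body>
.
.
.
</BODY>

When I run it through mod_proxy_html, the internal tags look like
they are getting dropped. Has anyone run into a similar situation, and
how did you resolve it?

David Dorward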

2005-03-14, 8:46 pm

Bill wrote:

> I have encountered a situation where some of the web pages my company
> has contain nested <body> tags.


> When I run it through mod_proxy_html, the internal tags look like
> they are getting dropped. Has anyone run into a similar situation, and
> how did you resolve it?


I haven't run into a situation like that, but I'd resolve it by ensuring all
the pages being served were valid HTML in the first place.

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is

Nick Kew

2005-03-14, 8:46 pm

Bill wrote:
> I have encountered a situation where some of the web pages my company
> has contain nested <body> tags.
>
> <BODY>
> .
> .
> .
> <body>
> .
> .
> .
> </body>
> .
> .
> .
> </BODY>
>
> When I run it through mod_proxy_html, the internal tags look like
> they are getting dropped. Has anyone run into a similar situation, and
> how did you resolve it?
>


Erm, and that's a problem exactly how?

mod_proxy_html inherits much of its parsing, including this, from
libxml2. If you run your pages through xmllint, you'll see the same
thing. If your markup happened to be well-formed XML, you could
suppress libxml2's html corrections by parsing as XML instead of HTML.
Or you could run mod_publisher to give you more control over parse
modes and handling of broken markup.
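
Whether the XML route is open at all depends on the pages being well-formed XML. A minimal sketch of that check follows, assuming Python's standard xml.etree.ElementTree (which uses expat, not libxml2, so it only approximates what libxml2's XML parser would accept). The two sample strings and the x() handler are hypothetical, not taken from the real site.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if markup parses as a single well-formed XML document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hypothetical legacy markup: unquoted attribute, unclosed tags, nested BODY.
legacy = '<BODY><TABLE BORDER=0><TR><TD><br><BODY onLoad="x()"></TABLE></BODY>'

# Hypothetical cleaned-up equivalent: quoted attributes, every tag closed.
clean = '<body><table border="0"><tr><td><br/></td></tr></table></body>'

print(is_well_formed_xml(legacy))  # False
print(is_well_formed_xml(clean))   # True
```

Markup like the legacy sample fails this check, so parsing as XML would not be an option for it without first cleaning it up.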

--
Nick Kew

Bill

2005-03-15, 5:53 pm

Here is a snippet from the original HTML file:


<BODY TEXT="000000" BGCOLOR="F8F0D9" BACKGROUND="">

<FORM><SCRIPT LANGUAGE="JavaScript"
SRC="../files/pp.js/$File/pp.js"></script>
<link rel="stylesheet" href="../files/my_style.css/$File/my_style.css"
type="text/css">
<link rel="stylesheet"
href="../files/my_style2.css/$File/my_style2.css" type="text/css">
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR VALIGN=top><TD WIDTH="360"><div id="specsTitle">SMW</div><br><BODY
onLoad='javascript...


Please note the beginning of a nested <BODY> tag within the <TABLE>
tag. After being run through mod_proxy_html, this is what is being
served to the browser:


<body text="000000" bgcolor="F8F0D9" background="">
<form><script language="JavaScript"
src="../files/pp.js/$File/pp.js"></script>
<link rel="stylesheet" href="../files/my_style.css/$File/my_style.css"
type="text/css">
<link rel="stylesheet"
href="../files/my_style2.css/$File/my_style2.css" type="text/css">
<table border="0" cellspacing="0" cellpadding="0">
<tr valign="top"><td width="360"><div id="specsTitle">SMW</div><br>
onLoad='javascript...


Notice now that the <BODY> tag within the table has been removed. The
onLoad event is being treated as text and is being displayed on the
page as opposed to occurring when the page is loaded. What could cause
this internal <BODY> tag to get dropped?

Tim

2005-03-15, 8:49 pm

On 15 Mar 2005 08:01:40 -0800,
"Bill" <gardneriv@yahoo.com> posted:

> Notice now that the <BODY> tag within the table has been removed. The
> onLoad event is being treated as text and is being displayed on the
> page as opposed to occurring when the page is loaded. What could cause
> this internal <BODY> tag to get dropped?


I thought that'd already been explained. But hasn't it occurred to you that
if the proxy has problems with MALFORMED HTML, then so will some browsers?
There's one solution, and one solution only: Fix up the broken HTML.
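
Fixing the HTML starts with finding the affected pages. A rough sketch of one way to locate extra <body> start tags follows, assuming Python's standard html.parser rather than libxml2, so it reports what is in the source and does not reproduce mod_proxy_html's corrections. The class and function names, the sample markup, and the init() handler are all hypothetical.

```python
from html.parser import HTMLParser


class BodyTagCounter(HTMLParser):
    """Record the position of every <body> start tag in a document."""

    def __init__(self):
        super().__init__()
        self.body_starts = []  # (line, column) of each <body> start tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> and <body> both match.
        if tag == "body":
            self.body_starts.append(self.getpos())


def find_extra_body_tags(markup):
    """Return positions of every <body> start tag after the first one."""
    parser = BodyTagCounter()
    parser.feed(markup)
    parser.close()
    return parser.body_starts[1:]


# Hypothetical page with a second BODY tag inside a table cell.
sample = (
    '<BODY TEXT="000000">\n'
    '<TABLE><TR><TD><br><BODY onLoad="init()">\n'
    '</TD></TR></TABLE></BODY>'
)

for line, column in find_extra_body_tags(sample):
    print(f"extra <body> start tag at line {line}, column {column}")
```

Running this over each served page gives a list of places where an onLoad handler (or other attribute) would need to move to the real <body> tag.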

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.

Nick Kew

2005-03-16, 5:56 pm

Tim wrote:

> I thought that'd already been explained. But hasn't it occurred to you that
> if the proxy has problems with MALFORMED HTML, then so will some browsers?
> There's one solution, and one solution only: Fix up the broken HTML.
>


Entirely right of course, but not the whole story.

A browser expects to work for a single user on a workstation where it
is one of a very few active tasks, and can help itself to oodles of
CPU and memory. So it can put a lot of effort into error-correction.

The proxy doesn't have that luxury. It needs to be able to process
thousands of concurrent requests, and cares a lot more about efficiency
than a browser. So it's less forgiving than a typical browser.

mod_publisher offers more options, including expending more resources
on error correction.

--
Nick Kew