Strange Network Problem
I'm tearing my hair out with this one. The gateway machine to the outside
world periodically becomes inaccessible to some hosts on the lan.
This is the setup.
LAN -> switch -> gateway/firewall -> DMZ -> ADSL modem -> Internet
What happens is that hosts on the LAN sometimes get no response from the
gateway. These hosts can be Windows or Linux boxes.
The gateway is a Debian Sarge box with two nics. Eth0 is on the DMZ side,
eth1 is on the LAN side. Shorewall is configured as a nat firewall.
Every once in a while, hosts on the lan no longer see the gateway. They
can't ping it, can't get access through it. At the same time, other hosts
seem to see it okay. The problem will fix itself after a short period. It
can go for hours without a problem, and then become very flaky.
If you do a disable followed by an enable on the ethernet adaptor from a
Windows box, the problem will fix itself. It seems that the gateway sees
the broadcast dhcp request and wakes up. (The dhcp and dns servers are on
another machine.)
I've changed almost everything that I could change. I've changed the nic
in the gateway, I've changed the gateway machine completely (with another
Debian box). I've replaced the switch. I've replaced the cable between
the switch and the gateway. I've replaced Shorewall with a script file to
configure iptables. I've also disconnected some parts of the lan from the
switch. And I've followed all the cabling to make sure nothing is plugged
into the wrong spot.
I've had tcpdump or ethereal running on the gateway and other machines,
and there doesn't seem to be excessive traffic that could cause a problem,
or any other unexplained packets.
Everything seems to point to some configuration problem with the gateway.
But I've already tried replacing Shorewall with a script, so I don't think
it's there.
My interfaces file looks something like this.
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 192.168.2.2
    netmask 255.255.255.248
    network 192.168.2.0
    broadcast 192.168.2.255
    gateway 192.168.2.1

auto eth1
iface eth1 inet static
    address 192.168.1.1
    netmask 255.255.255.0
    network 192.168.1.0
    broadcast 192.168.1.255
Hope someone can give me some ideas where to look next.
Dan
| |
| Snowbat 2006-04-28, 1:13 am |
| On Fri, 28 Apr 2006 10:51:55 +0800, Dan N wrote:
> auto eth0
> iface eth0 inet static
> address 192.168.2.2
> netmask 255.255.255.248
> network 192.168.2.0
> broadcast 192.168.2.255
> gateway 192.168.2.1
>
> auto eth1
> iface eth1 inet static
> address 192.168.1.1
> netmask 255.255.255.0
> network 192.168.1.0
> broadcast 192.168.1.255
Not sure if this relates to your problem but for eth0, either the
netmask or the broadcast address is incorrect.
For netmask 255.255.255.248 the broadcast address should be 192.168.2.7
Broadcast address 192.168.2.255 would suit a netmask of 255.255.255.0
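The arithmetic behind that can be checked with a few lines of bash. This is only a sketch: broadcast_for is a made-up helper name, not part of any Debian tool, and it assumes a dotted-quad address and netmask. It sets every host bit by OR-ing each octet of the address with the inverted netmask octet.

```shell
#!/bin/bash
# Compute the broadcast address for an IPv4 address and dotted netmask.
# broadcast_for is a hypothetical helper name used only for illustration.
broadcast_for() {
    local IFS=.                      # split arguments on dots, join output with dots
    local -a ip=($1) mask=($2)
    local i out=()
    for i in 0 1 2 3; do
        # Address OR (inverted netmask), one octet at a time.
        out+=( $(( ip[i] | (255 - mask[i]) )) )
    done
    echo "${out[*]}"
}

broadcast_for 192.168.2.2 255.255.255.248   # eth0 as posted: prints 192.168.2.7
broadcast_for 192.168.1.1 255.255.255.0     # eth1 as posted: prints 192.168.1.255
```

So eth1 is consistent as posted, while eth0 needs either broadcast 192.168.2.7 or a /24 netmask.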
--
*** Posted via a free Usenet account from http://www.teranews.com ***
| |
| Bit Twister 2006-04-28, 1:13 am |
| On Fri, 28 Apr 2006 10:51:55 +0800, Dan N wrote:
>
> I'm tearing my hair out with this one. The gateway machine to the outside
> world periodically becomes inaccessible to some hosts on the lan.
>
> This is the setup.
>
> LAN -> switch -> gateway/firewall -> DMZ -> ADSL modem -> Internet
>
> What happens is that hosts on the LAN sometimes get no response from the
> gateway. These hosts can be Windows or Linux boxes.
>
> The gateway is a Debian Sarge box with two nics. Eth0 is on the DMZ side,
> eth1 is on the LAN side. Shorewall is configured as a nat firewall.
>
> Every once in a while, hosts on the lan no longer see the gateway. They
> can't ping it, can't get access through it. At the same time, other hosts
> seem to see it okay. The problem will fix itself after a short period. It
> can go for hours without a problem, and then become very flaky.
>
> If you do a disable followed by an enable on the ethernet adaptor from a
> Windows box, the problem will fix itself. It seems that the gateway sees
> the broadcast dhcp request and wakes up. (The dhcp and dns servers are on
> another machine.)
Sit down and draw a picture of your setup with all the players.
When a node goes out, and others work, what is unique to that node?
Example, you said some box fails, others work.
That indicates gateway is working. If gateway is failing, all nodes
would fail.
You mentioned release/renew on doze box seems to fix doze box.
My SWAG would say the dhcp server is at fault.
Your other comment about
"The problem will fix itself after a short period. It
can go for hours without a problem,"
also leans towards dhcp.
Your dhcp client will ask the dhcp server for lease renewal half way
through the lease time. If request fails, it keeps trying at each half way
point in the remaining lease time.
I have no idea what happens when the lease expires on a doze box.
I _think_ a linux box will continue to work until you bounce the network.
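To line outage times up against lease renewals, it helps to know when a client would start renewing. A rough sketch, assuming the RFC 2131 default timers (renew at half the lease, rebind at seven eighths) and a lease length given in seconds. A DHCP server can override both timers, and lease_timers is a made-up helper name.

```shell
#!/bin/bash
# Print the default renewal (T1) and rebinding (T2) times, in seconds
# after the lease was granted, for a lease of the given length.
# Assumes the RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease.
lease_timers() {
    local lease=$1
    echo "T1=$(( lease / 2 )) T2=$(( lease * 7 / 8 ))"
}

lease_timers 86400   # a one-day lease: prints T1=43200 T2=75600
```

If the failures cluster around those offsets from the time a lease was obtained, dhcp is worth a closer look. If they don't, it probably isn't the culprit.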
| |
|
| On Thu, 27 Apr 2006 23:35:05 -0500, Bit Twister wrote:
> Sit down and draw a picture of your setup with all the players. When a
> node goes out, and others work, what is unique to that node.
Seems to be any node, or many nodes. It can be hard to tell exactly how
many don't work, because it's likely to fix itself before they can all be
tested. It's not any particular location on the network, it seems to be
happening to every host.
> Example, you said some box fails, others work. That indicates gateway is
> working. If gateway is failing, all nodes would fail.
That makes sense, but failures are always to the gateway, not to
other hosts.
> You mentioned release/renew on doze box seems to fix doze box. My SWAG
> would say the dhcp server is at fault.
I don't think it's dhcp because:
1. at least one of the linux boxes has a fixed ip but is displaying the
problem
2. an ipconfig on the windows box shows a valid ip address
3. it's only the gateway that can't be pinged.
Thanks for your comments. Any light on the picture is helpful.
Dan
| |
| Michael Paoli 2006-04-28, 1:13 am |
| Dan N wrote:
> I'm tearing my hair out with this one. The gateway machine to the outside
> world periodically becomes inaccessible to some hosts on the lan.
> This is the setup.
> LAN -> switch -> gateway/firewall -> DMZ -> ADSL modem -> Internet
> What happens is that hosts on the LAN sometimes get no response from the
> gateway. These hosts can be Windows or Linux boxes.
> The gateway is a Debian Sarge box with two nics. Eth0 is on the DMZ side,
> eth1 is on the LAN side. Shorewall is configured as a nat firewall.
I'd suggest starting with some basic "divide and conquer"
troubleshooting. Most notably, I might suggest a few possible things
to check when the problem is active ... and then various other possible
logical troubleshooting and evidence gathering, e.g.:
o Can the host/device on the LAN talk to the switch (blinky lights on
switch may help provide useful clues)
o Can the gateway/firewall talk to the switch (blinky lights on switch
may help provide useful clues)
o Can the gateway/firewall talk through the switch to the host/device
on the LAN and/or vice versa
o Can traffic be (at least temporarily) simplified to aid in diagnosis
(e.g. reduce or extremely simplify traffic to/from host/device on
LAN and/or temporarily eliminate other LAN traffic to the switch
(e.g. unplug other switch connections for a bit))
o What about the blinky lights on the other various Ethernet
interfaces (e.g. host/device and gateway/firewall). Note that
the meaning and interpretation of the light indicators can vary
radically from one particular hardware device to another, so be sure
to interpret their intended meanings appropriately
o if hardware and/or software isn't 100% healthy, it may not tell the
truth 100% about its status/behavior (e.g. blinky lights can
sometimes lie). Corollary: diagnostic code is often the least
exercised in software/firmware/hardware, and can often be less than
100% truthful, though most commonly, what it claims to be the
problem, when there is a problem, is typically at least "in the
ballpark" of what/where the problem is ... but not necessarily
always so.
o Substitution? Can you swap components/connections, e.g. change
cables, flip hosts/devices among switch ports (note that it may
take the switch up to typically about 40 seconds, +- possibly a
fair bit, to (re)learn which Ethernet MAC address(es) are on what
ports, any autonegotiation(s) involved might also potentially add a
bit to the time required to (re)establish connectivity that should
be present). Have you tried a different switch? A different
Ethernet interface card? A different driver/module for the
Ethernet interface card (e.g. I've had switches fail on me -
intermittently at first, I've had Ethernet interface cards work for
months/years and then slowly get more and more flakey, and I've run
into driver/module issues for some Ethernet interface
cards/chipsets.)
o Did you jiggle it? :-) Do the failure(s) or lack thereof correlate
to some jiggling about and/or stressing cables/connections - most
notably/likely at interface connections (e.g. does pushing,
pulling, twisting, or pulling/pushing in any particular direction
where cable goes into an interface correlate to the problem being
or not being present?).
o Logs? Are any of the systems/devices giving you useful log
indications/messages/diagnostics that correlate to the problem(s)?
o Correlations? Do the problems correlate to any particular events,
e.g. time, temperature, furniture being moved about, particular
hosts/devices being turned on or used in certain ways, volumes or
types of traffic, etc.?
o Can you logically or physically drop device(s) onto any of the
relevant connections to watch relevant packet(s) on the wire(s), to
see where traffic goes missing or wrong?
o Have you checked ARP tables and MAC addresses and the like to be
sure nothing strange or incorrect is happening there?
Anyway, there's an answer to be found out there somewhere. Hopefully
that gives you at least some useful clue(s) of stuff to look for
and/or check that you may not have yet already fully covered and/or
considered. Who knows, ... might be as "simple" as a flakey switch or
some misconfigured device, but it will probably be necessary to do
some detective work and/or dig a fair bit deeper to find and isolate
the problem.
| |
| Davorin Vlahovic 2006-04-28, 1:12 pm |
| On 2006-04-28, Dan N <dan@localhost.localdomain> wrote:
> I'm tearing my hair out with this one. The gateway machine to the outside
> world periodically becomes inaccessible to some hosts on the lan.
>
> This is the setup.
>
> LAN -> switch -> gateway/firewall -> DMZ -> ADSL modem -> Internet
>
> What happens is that hosts on the LAN sometimes get no response from the
> gateway. These hosts can be Windows or Linux boxes.
It could be several things: faulty switch, faulty cable(s) or perhaps
you've got a rogue machine which gets set up with an IP that exists
elsewhere in the network.
You should check the ARP cache on the gateway/firewall.
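One way to catch a duplicate IP is to log the ARP cache at intervals and look for an address whose MAC changes. A minimal sketch, assuming a Linux box that exposes the cache in /proc/net/arp. The names arp_pairs and changed_ips and the log file /tmp/arp.log are made up for illustration.

```shell
#!/bin/bash
# Print "ip mac" pairs from a file in /proc/net/arp format.
# The header line and incomplete entries (all-zero MAC) are skipped.
arp_pairs() {
    awk 'NR > 1 && $4 != "00:00:00:00:00:00" { print $1, $4 }' "${1:-/proc/net/arp}"
}

# Read accumulated "ip mac" lines on stdin and print every IP address
# that has been seen with more than one distinct MAC.
changed_ips() {
    sort -u | awk '{ n[$1]++ } END { for (ip in n) if (n[ip] > 1) print ip }' | sort
}

# Usage (for example once a minute from cron or a loop):
#   arp_pairs >> /tmp/arp.log
# and later:
#   changed_ips < /tmp/arp.log
```

An address that shows up in the changed_ips output has been answered for by two different cards, which is what a rogue machine with a duplicate IP would look like.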
--
Uspjesne regije, tvrtke, muskarci i zene znaju da je uvijek bolje biti
prvorazredna verzija sebe nego drugorazredna verzija nekog drugog.
| |
| Bit Twister 2006-04-28, 1:12 pm |
| On Fri, 28 Apr 2006 13:40:47 +0800, Dan N wrote:
<snip>
> I don't think it's dhcp because;
> 1. at least one of the linux boxes has a fixed ip but is displaying the
> problem
> 2. an ipconfig on the windows box shows a valid ip address
But is it a valid default route during failure?
> 3. its only the gateway that can't be pinged.
Since you are fixated on the gateway and have a static ip linux box,
run a test script to log faults to see if it is random or happens at particular
hours or happens in clustered groups.
Do feel free to add nodes to the test script.
Do NOT add nodes past the router.
run
ifconfig eth0
on the linux box and verify errors/faults are normal.
Example: RX packets:64116 errors:0 dropped:0 overruns:0 frame:0
TX packets:52776 errors:2 dropped:0 overruns:0 carrier:4
collisions:0 txqueuelen:1000
If yours are high, you have hardware problems and you cannot get good
test results until you clear that up.
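The same counters can be read without parsing ifconfig output. A sketch, assuming a kernel that exposes per-interface counters under /sys/class/net/<iface>/statistics. If that directory is missing, the ifconfig output carries the same numbers. nic_errors is a made-up helper name.

```shell
#!/bin/bash
# Print the error-related counters found in a statistics directory,
# one "name value" pair per line. Counters that are not readable are skipped.
nic_errors() {
    local dir=$1 f
    for f in rx_errors tx_errors rx_dropped tx_dropped collisions; do
        [ -r "$dir/$f" ] && echo "$f $(cat "$dir/$f")"
    done
    return 0
}

# Usage:
#   nic_errors /sys/class/net/eth0/statistics
```

Run it before and after a failure window. Counters that jump between the two runs point at hardware or cabling rather than configuration.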
If you see failures for
    all nodes, at the same time      it's a hardware problem
    gateway, dmz, router only        it's gateway's LAN side nic
    dmz, router only                 it's gateway dmz nic or dmz
    router only                      it's dmz router nic or router

click up two terminals on the linux box.
In one terminal,
save script into ck_net
chmod +x ck_net
./ck_net
now in the other terminal
tail -f /tmp/ping.log
You can use Ctrl-C to abort the ck_net and tail commands.
-------->8-------->8- cut below here------->8-------->8-------->8------------
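#!/bin/bash
# Ping each listed node once a minute and log every failure with a timestamp.

_log_fn=/tmp/ping.log

# Send one ping with a 2 second deadline; append a log line if it fails.
function ping_ck ()
{
    _node=$1
    ping -c 1 -w 2 "$_node" > /dev/null
    if [ $? -ne 0 ] ; then
        echo "$_node $(date) ping failed" >> "$_log_fn"
    fi
}

# Replace the *_ip placeholders with the real addresses before running.
while true; do
    sleep 60
    ping_ck gateway_ip
    ping_ck dmz_ip
    ping_ck router_ip   # if possible
    ping_ck doze_ip
done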
#************** end script *************************
| |
|
| On Fri, 28 Apr 2006 07:57:17 -0500, Bit Twister wrote:
Thanks for all the suggestions. I'll have a closer look on Monday and let
you know how I get on.
Dan