|
| Author |
Troubleshooting SANs
|
|
| seanh012@gmail.com 2007-03-10, 1:12 pm |
| I work for a consulting firm, and have begun to do troubleshooting on
small SANs, mostly HP MSA1500cs based.
Many times the problem the customer is talking about is some vague
intermittent slowness issue or something like that. In cases like
this, my troubleshooting goes something like this:
1. Check switch logs for marginal ports or other errors (usually
Brocade 4/24s or similar).
2. Update to the latest firmware and driver levels on HBAs, switch, MSA,
etc.
If the problem still exists, I'll call HP support, but more often than
not they can't really help from here. So the only approach that
yields results is to start unplugging stuff until I see the problem
disappear.
In one recent instance, I had a customer start shutting blades off
until he found that one of them had an HBA that was mysteriously
causing the intermittent slowness for the whole SAN. The HBA actually
seemed to work, and there were no errors in the Windows event logs,
switch logs, SANsurfer, or anything.
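The shut-things-off approach described here can at least be made systematic by halving the set of powered-on hosts each round. A minimal sketch, assuming there is exactly one culprit and that the slowness shows up reliably within whatever observation window is used. The `problem_present` callback is hypothetical: it stands for powering on only the listed hosts, watching the SAN, and reporting whether the slowness appeared.

```python
from typing import Callable, Sequence


def find_culprit(hosts: Sequence[str],
                 problem_present: Callable[[Sequence[str]], bool]) -> str:
    """Narrow down a single misbehaving host by halving the active set.

    problem_present(active) is a hypothetical check supplied by the caller:
    run with only `active` hosts powered on and return True if the
    slowness was observed.
    """
    if not hosts:
        raise ValueError("need at least one host")
    candidates = list(hosts)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the slowness appears with only this half running, the culprit
        # is in it; otherwise it must be in the other half.
        if problem_present(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Illustrative stand-in for the real observation step.
blades = [f"blade{n}" for n in range(1, 9)]
print(find_culprit(blades, lambda active: "blade6" in active))
```

With eight blades this takes three rounds instead of up to seven, but an intermittent fault means each round has to run long enough to be trusted.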
There has got to be a better way to find this kind of thing. On an IP
network, I would run Ethereal or some other packet analyzer to try and
see what is talking on the network when the problem manifests. But
I've never really found anything like that for a fibre channel SAN.
As I said, I'm pretty new to SANs, so any direction would be helpful.
Thanks,
Sean
| |
|
| User <seanh012@gmail.com> wrote in message
news:1173550832.940149.73810@j27g2000cwj.googlegroups.com...
Hi Sean,
check
[url]http://www.finisar.com/index.php?file=product&var=product&div_id=smenu3&level=B&sub_cat_id=3&dlink=SAN%20Monitoring%20and%20Analysis[/url]
Good luck,
Piotr
| |
| Sean Howard 2007-03-13, 1:16 am |
| > Hi Sean,
>
> check
> [url]http://www.finisar.com/index.php?file=product&var=product&div_id=smenu3&level=B&sub_cat_id=3&dlink=SAN%20Monitoring%20and%20Analysis[/url]
>
> Good luck,
> Piotr
Yeah I found some of that stuff. The problem with everything I've found is
that it requires Taps. I haven't found anything equivalent to a "mirroring
port" on a switch.
Does such a thing exist?
| |
|
|
User "Sean Howard" <seanh012@gmail.com> wrote in message
news:vZ6dnQ0LoffOaWjYnZ2dnUVZ_oCmnZ2d@comcast.com...
>
> Yeah I found some of that stuff. The problem with everything I've found
> is that it requires Taps. I haven't found anything equivalent to a
> "mirroring port" on a switch.
>
> Does such a thing exist?
Yes, it does, but not on every product. As far as I am aware, you can find it
on Brocade 48000 directors and Brocade 5000 FC switches.
There are good reasons for using taps in SAN monitoring and troubleshooting
(listed below, as found in a Finisar document covering this problem):
1. Multiple ports mirrored to one port cause buffer overflow and dropped
packets.
2. Packets go through a buffer and are retimed, making accurate
time-sensitive measurements such as jitter, packet gap analysis, or
latency impossible.
3. Most mirror ports filter anomalies, thus making troubleshooting
impossible.
4. Turning on port mirroring puts a load on the switch's CPU/transfer logic,
thus impacting the switch's operational performance.
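Point 1 in the list is simple arithmetic: once the combined traffic of the mirrored ports exceeds what the single mirror port can carry, something has to be dropped. A rough sketch with made-up port rates (the function name and the numbers are illustrative only, not taken from any vendor tool):

```python
def mirror_oversubscription(source_rates_gbps, mirror_rate_gbps):
    """Return the ratio of total mirrored traffic to mirror-port capacity.

    A ratio above 1.0 means the mirror port cannot carry everything it is
    sent, so frames have to be buffered and eventually dropped.
    """
    if mirror_rate_gbps <= 0:
        raise ValueError("mirror port rate must be positive")
    return sum(source_rates_gbps) / mirror_rate_gbps


# Illustrative numbers only: three busy ports mirrored to one 4 Gb/s port.
ratio = mirror_oversubscription([2.0, 3.0, 1.5], 4.0)
print("oversubscribed" if ratio > 1.0 else "fits", ratio)
```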
Piotr
| |
| mark.stubblefield@gmail.com 2007-03-16, 7:13 pm |
| Since when do 48k's (or *any* Brocade switch) support port mirroring?
One would think the Condors could handle it, but I've never seen it
implemented in Brocade's product line. I'm not sure about Cisco.
To the OP, the only way I know of is tapping the fabric. There are FC
protocol analyzers, but they sit in-band.
-Mark
| |
| Moojit 2007-03-19, 1:13 am |
|
<seanh012@gmail.com> wrote in message
news:1173550832.940149.73810@j27g2000cwj.googlegroups.com...
You're correct. There is no such thing as port mirroring, or a fibre channel
software analyzer comparable to Ethernet's Ethereal. Without an inline fibre
channel analyzer (Finisar is the de facto standard), your best bet in this
scenario is to use an application such as SCSI Utility For Windows
to monitor the HBA port statistics and determine what errors may be
happening.
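Whatever tool supplies the HBA port statistics, the useful signal is usually which error counters move between two readings, not their absolute values. A minimal sketch, assuming the counters have already been copied into plain dictionaries. The counter names are hypothetical, loosely modelled on typical FC link error counters, and nothing here reads them from a real utility.

```python
def changed_counters(before: dict, after: dict) -> dict:
    """Return the counters that increased between two snapshots."""
    deltas = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


# Hypothetical snapshots taken a few minutes apart on one HBA port.
before = {"link_failure": 0, "loss_of_sync": 2, "invalid_crc": 10}
after = {"link_failure": 0, "loss_of_sync": 2, "invalid_crc": 57}
print(changed_counters(before, after))
```

Taking the snapshots while the slowness is actually happening, on every host that shares the array, makes it easier to spot the one port that behaves differently.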
The Moojit
| |
|
| Sean,
I'm going to guess that this wasn't an FC problem. I'm more inclined to believe it was a SCSI problem. Specifically,
I would guess that the blade you shut down was doing Target Resets.
If an initiator sends a target reset to a target and this target is providing LUNs for multiple initiators, all the
outstanding IOs for all the initiators get reset. The initiators time out and retry the IO, which succeeds. The end
result is that all the initiators slow down but no errors are displayed. Zoning won't help.
You can limit the possible suspects by seeing which initiators are slowing down and which target they have in common.
The HP box might provide some higher debug level that exposes target resets so you can track them down.
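Narrowing the suspects this way is a set intersection: find the targets that every slow initiator has in common, then list everyone who can reach those targets. A small sketch, assuming the initiator-to-target mapping has been written out by hand from the zoning and LUN presentation. The host and target names are invented.

```python
def suspects(access: dict[str, set[str]], slow: set[str]) -> dict[str, set[str]]:
    """Map each target shared by all slow initiators to every initiator
    that can reach it (any of which could be the one sending resets)."""
    if not slow:
        return {}
    common = set.intersection(*(access[i] for i in slow))
    return {
        target: {i for i, targets in access.items() if target in targets}
        for target in sorted(common)
    }


# Invented example: which targets each blade can see.
access = {
    "blade1": {"msa_a"},
    "blade2": {"msa_a", "msa_b"},
    "blade3": {"msa_b"},
    "blade4": {"msa_a"},
}
print(suspects(access, slow={"blade1", "blade2"}))
```

The result includes initiators that are not themselves reported as slow, since the host issuing the resets may not be the one whose users complain.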
From my experience, the most likely culprit is a Windows 2003 SP1 cluster node (probably with an older storport driver).
I suggest that whenever you see this problem, you just upgrade all the Windows clusters and all the storport drivers.
Follow http://support.microsoft.com/default.aspx?scid=kb;EN-US;923830
MSCS uses resets to decide quorum ownership, and when the nodes get in a pickle, they do too many resets. Too many resets show
up as slow storage. Cluster nodes do log resets in the cluster log, although they don't call them resets; look for
/arbitrat/ as in arbitration or something like that.
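Searching a cluster log for that pattern can be scripted so the same check is run on every node. A minimal sketch that only does the case-insensitive /arbitrat/ match suggested here. It assumes nothing about the log's timestamp or field layout, and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, List

# Matches "arbitrate", "arbitrating", "arbitration", in any letter case.
ARBITRATION = re.compile(r"arbitrat", re.IGNORECASE)


def arbitration_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention arbitration."""
    return [line.rstrip("\n") for line in lines if ARBITRATION.search(line)]


# Made-up sample lines; a real cluster log will look different.
sample = [
    "node1: Arbitrating for quorum disk\n",
    "node1: resource online\n",
    "node1: ARBITRATION completed\n",
]
hits = arbitration_lines(sample)
print(len(hits), "matching lines")
```

An open file object can be passed in place of `sample`. A burst of matches around the time users reported slowness would support the reset theory.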
There is also the Emulex TPRLO command, which is an FC issue. You can research TPRLOs. If the offending blade had
Emulex cards, see if TPRLO was enabled. (By default it shouldn't be, and if it is, you'll get the same problems.)
| |
| jason_cook@ntlworld.com 2007-04-02, 7:14 am |
| On 25 Mar, 00:50, Bob S <bsremovemetrac...@nycap.rr.com> wrote:
I work as a SAN consultant for HP, and I agree that embedding taps into
environments is a very good idea. I have three Finisar analysers, and
one of the biggest problems is getting the change approved to add or
remove them; getting the customer to install taps removes this
obstacle. The Cisco platform does have the SD port (mirror...)
functionality, but you don't see the whole picture when using it. The last
time I was involved with an escalation on MDS, Cisco themselves
asked for a Finisar trace.
Kind Regards
Jason
|
|
|
|
|