Subject: Re: [Linux-cluster] bonding
From: Scott McClanahan (scot@trnswrks.com)
Date: Apr 12, 2007 7:19:20 am
List: com.redhat.linux-cluster

I don't think I need to increase max_bonds, since I only have one bond on each node, but I have considered resorting to the old MII or ETHTOOL ioctl method to determine link state. You are running a newer kernel, and I haven't checked the changelog to see which differences might be pertinent, but the main difference is that you are using e1000 drivers where I am using the e100 driver. I just can't seem to associate the link status failures with any other events on the box. It's really strange.
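For reference, the bonding driver's use_carrier option is what selects between carrier-based detection (use_carrier=1, the default) and the MII/ETHTOOL ioctl method (use_carrier=0). A minimal sketch of adding that option to the options line quoted further down in this thread follows. The helper name set_bonding_option is hypothetical, and whether use_carrier=0 helps with e100 here is untested.

```python
def set_bonding_option(options_line, key, value):
    """Return the 'options bonding ...' line with key=value set or replaced.

    Hypothetical helper for illustration; it only rewrites a string and
    does not touch /etc/modprobe.conf.
    """
    tokens = options_line.split()
    if tokens[:2] != ["options", "bonding"]:
        raise ValueError("not an 'options bonding' line")
    # Drop any existing setting for this key, then append the new one.
    params = [p for p in tokens[2:] if p.split("=", 1)[0] != key]
    params.append(f"{key}={value}")
    return " ".join(["options", "bonding"] + params)


# The options line from the e100 nodes, as quoted later in the thread.
current = "options bonding miimon=100 mode=1"
print(set_bonding_option(current, "use_carrier", 0))
# prints: options bonding miimon=100 mode=1 use_carrier=0
```

The bonding module would need to be reloaded for a changed option to take effect.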

On Thu, 2007-04-12 at 09:52 -0400, rhu@bidmc.harvard.edu wrote:

I have the same hardware configuration for 11 nodes, but without any of the spurious failover events. The main thing I had to do differently was to increase the bond device count to 2 (the driver defaults to only 1), as I have mine teamed between dual tg3/e1000 ports from the mobo and PCI card. bond0 is on a gigabit switch, while bond1 is on 100 Mbps. In /etc/modprobe.conf:

alias bond0 bonding
alias bond1 bonding
options bonding max_bonds=2 mode=1 miimon=100 updelay=200
alias eth0 e1000
alias eth1 e1000
alias eth2 tg3
alias eth3 tg3
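As a sanity check on an options line like the one above: the bonding documentation says updelay and downdelay should be multiples of miimon, otherwise they are rounded down to the nearest multiple. The sketch below parses the line and flags a mismatch. The function names are hypothetical, and treating a missing option as 0 is an assumption of this sketch.

```python
def parse_bonding_options(line):
    """Split an 'options bonding k=v ...' line into a dict of strings."""
    tokens = line.split()
    if tokens[:2] != ["options", "bonding"]:
        raise ValueError("not an 'options bonding' line")
    return dict(tok.split("=", 1) for tok in tokens[2:])


def delay_warnings(opts):
    """Return warnings for updelay/downdelay values that miimon cannot honor."""
    warnings = []
    miimon = int(opts.get("miimon", 0))  # assumption: absent means 0
    for name in ("updelay", "downdelay"):
        delay = int(opts.get(name, 0))
        if delay and miimon == 0:
            warnings.append(f"{name}={delay} is set but miimon is 0")
        elif miimon and delay % miimon:
            warnings.append(f"{name}={delay} is not a multiple of miimon={miimon}")
    return warnings


opts = parse_bonding_options(
    "options bonding max_bonds=2 mode=1 miimon=100 updelay=200"
)
print(opts["max_bonds"], delay_warnings(opts))
# prints: 2 []
```

The configuration in this message passes the check, since 200 is a multiple of 100.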

So eth0/eth2 are teamed, and eth1/eth3 are teamed. In dmesg:

e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
bonding: bond0: making interface eth0 the new active one 0 ms earlier.
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: enslaving eth2 as a backup interface with a down link.
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status up for interface eth2, enabling it in 200 ms.
bonding: bond0: link status definitely up for interface eth2.
e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
bonding: bond1: making interface eth1 the new active one 0 ms earlier.
bonding: bond1: enslaving eth1 as an active interface with an up link.
bonding: bond1: enslaving eth3 as a backup interface with a down link.
bond0: duplicate address detected!
tg3: eth3: Link is up at 100 Mbps, full duplex.
tg3: eth3: Flow control is off for TX and off for RX.
bonding: bond1: link status up for interface eth3, enabling it in 200 ms.
bonding: bond1: link status definitely up for interface eth3.

$ uname -srvmpio
Linux 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0a

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:54

$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0b

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:53

On Thu, 2007-04-12 at 08:45 -0400, Scott McClanahan wrote:

I have every node in my four-node cluster set up to do active-backup bonding, and the drivers loaded for the bonded network interfaces vary between tg3 and e100. All interfaces with the e100 driver loaded report errors much like what you see here:

bonding: bond0: link status definitely down for interface eth2, disabling it
e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
bonding: bond0: link status definitely up for interface eth2.

This happens all day on every node. I have configured the bonding module to do MII link monitoring at an interval of 100 milliseconds, and it is using basic carrier link detection to test whether the interface is alive. No modules were custom built on these nodes, and the OS is CentOS 4.3.
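To put numbers on "all day", one option is to count the bonding driver's down/up messages per interface from saved kernel log text, which also makes it easier to line the flaps up against other events on the box. This is a rough sketch: the regular expressions are written against the three log lines quoted above, and other driver versions may word their messages differently.

```python
import re
from collections import Counter

# Patterns match the bonding messages quoted in this thread.
DOWN = re.compile(r"bonding: (\w+): link status definitely down for interface (\w+)")
UP = re.compile(r"bonding: (\w+): link status definitely up for interface (\w+)")


def count_link_events(lines):
    """Count 'definitely down' and 'definitely up' events per (bond, slave)."""
    downs, ups = Counter(), Counter()
    for line in lines:
        match = DOWN.search(line)
        if match:
            downs[match.groups()] += 1
            continue
        match = UP.search(line)
        if match:
            ups[match.groups()] += 1
    return downs, ups


sample = [
    "bonding: bond0: link status definitely down for interface eth2, disabling it",
    "e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex",
    "bonding: bond0: link status definitely up for interface eth2.",
]
downs, ups = count_link_events(sample)
print(dict(downs), dict(ups))
# prints: {('bond0', 'eth2'): 1} {('bond0', 'eth2'): 1}
```

On a real node the lines would come from dmesg output or the syslog file rather than the inline sample.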

Some more relevant information is below (this display is consistent across all nodes):

[smccl@tf35 ~]$ uname -srvmpio
Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686 i386 GNU/Linux

[smccl@tf35 ~]$ head -5 /etc/modprobe.conf
alias bond0 bonding
options bonding miimon=100 mode=1
alias eth0 tg3
alias eth1 tg3
alias eth2 e100

[smccl@tf35 ~]$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.1 (October 29, 2004)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:0c:86:a4

Slave Interface: eth2
MII Status: up
Link Failure Count: 12
Permanent HW addr: 00:02:55:ac:a2:ea

Any idea why these e100 links report failures so often? They are directly plugged into a Cisco Catalyst 4506. Thanks.
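Since the same check has to be repeated on every node, the per-slave "Link Failure Count" can be pulled out of the /proc/net/bonding output with a few lines of Python. This is a minimal sketch that assumes the field labels shown in the output above; the function name slave_failure_counts is made up for illustration.

```python
def slave_failure_counts(text):
    """Map each slave interface name to its Link Failure Count."""
    counts = {}
    current = None  # slave section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Link Failure Count:") and current is not None:
            counts[current] = int(line.split(":", 1)[1])
    return counts


# Slave sections copied from the bond0 output quoted in this message.
sample = """\
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:0c:86:a4

Slave Interface: eth2
MII Status: up
Link Failure Count: 12
Permanent HW addr: 00:02:55:ac:a2:ea
"""
print(slave_failure_counts(sample))
# prints: {'eth0': 0, 'eth2': 12}
```

On a live node the text would come from reading /proc/net/bonding/bond0 instead of the inline sample.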

Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts 02120-2140
617-754-8754 | Fax: 617-754-8730 | Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.

[Attachment: ATT362682.txt (plain text document)]