Hi,
I have the following setup: multiple HP BL30p blade servers running Red Hat ES3, kernel 2.4.21-32.0.1.ELsmp. All servers in the chassis share two internal switches. Each switch has 24 ports, where 16 ports are 'down link' ports to the servers (presented as eth0 on Switch A and eth1 on Switch B), 2 ports are for interconnectivity between the switches (disabled), and the remaining 4 ports are physical ports. Each switch has one physical port configured as a trunk and connected to an upstream Cisco 6513 chassis. All of the blade servers are in the same VLAN.

     _________________________________
    |        HP BL30p servers         |
    |         1 2 3 4 5 6 7 8         |
    <<HP Switch A>>>-<<<HP Switch B>>
            |               |
            |               |
    <Cisco Switch A>-<Cisco Switch B>
            |               |
            |               |
    <<<<<<Router/Default gateway>>>>>

To get high availability on the network connection, bonding in active/backup mode is being used. As the 16 down link ports are internal ports and will never have a link failure (except if the whole switch suffers a hardware error), ARP monitoring is used to monitor the default gateway of the servers.

From /etc/modules.conf:

    alias bond0 bonding
    options bond0 mode=1 arp_interval=1000 arp_ip_target=Router_IP

Without simulating any link failures everything works fine, e.g. both eth0 (HP Switch A) and eth1 (HP Switch B) can function as the primary interface without any problems. Running tcpdump on the bond0 interface shows a lot of "arp who-has" requests for the Router_IP.

To simulate a failure, the link between Cisco Switch A and HP Switch A is removed. This is NOT detected by the bonding module, and it leaves the servers that have eth0 active unreachable.

But if I, for example, lower the arp_interval to something like 60 for one of the servers, that server will notice the above link failure. Fine, let's lower the arp_interval to 60 ms for all of the servers. Then we are back to the same problem: the bonding module does not detect the failure to reach Router_IP.

Looking at /usr/src/linux-2.4/Documentation/networking/bonding.txt:

    1. Driver support

    The ARP monitor relies on the network device driver to maintain two
    statistics: the last receive time (dev->last_rx), and the last
    transmit time (dev->trans_start). If the network device driver does
    not update one or both of these, then the typical result will be
    that, upon startup, all links in the bond will immediately be
    declared down, and remain that way. A network monitoring tool
    (tcpdump, e.g.) will show ARP requests and replies being sent and
    received on the bonding device.

And at /usr/src/linux-2.4/drivers/net/bonding/bond_main.c:

    /*
     * When using arp monitoring in active-backup mode, this function is
     * called to determine if any backup slaves have went down or a new
     * current slave needs to be found.
     * The backup slaves never generate traffic, they are considered up by merely
     * receiving traffic. If the current slave goes down, each backup slave will
     * be given the opportunity to tx/rx an arp before being taken down - this
     * prevents all slaves from being taken down due to the current slave not
     * sending any traffic for the backups to receive. The arps are not necessarily
     * necessary, any tx and rx traffic will keep the current slave up. While any
     * rx traffic will keep the backup slaves up, the current slave is responsible
     * for generating traffic to keep them up regardless of any other traffic they
     * may have received.
     * see loadbalance_arp_monitor for arp monitoring in load balancing mode
     */
    static void bond_activebackup_arp_mon(struct net_device *bond_dev)
    {
    ..
    ..
    ..
    ..
    ..
            if (slave) {
                    /* if we have sent traffic in the past 2*arp_intervals but
                     * haven't xmit and rx traffic in that time interval, select
                     * a different slave. slave->jiffies is only updated when
                     * a slave first becomes the curr_active_slave - not necessarily
                     * after every arp; this ensures the slave has a full 2*delta
                     * before being taken out. if a primary is being used, check
                     * if it is up and needs to take over as the curr_active_slave
                     */

The questions are:

Do the other blade servers' ARP queries (ARP monitoring of the Router_IP) affect the last_rx and trans_start counters on the other servers in the same VLAN/chassis?

Does the ARP monitoring function really monitor a host, or does it simply rely on the counters of the interface? And are those counters affected by ARP queries from nearby servers?

If so, would that mean it is impossible to use ARP monitoring if other internal traffic or broadcasting takes place in the same VLAN?

Any help is appreciated. I am sure this scenario must be possible to implement, or is it not?

Best regards
Mathias Kanstrup
mathias.kanstrup@ongame.com
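
The suspicion raised in the questions above can be tried out in a toy model. The sketch below is not the driver's code: it assumes the monitor looks only at a last-receive and a last-transmit timestamp, that any received frame (including a neighbour's broadcast ARP request) refreshes the receive timestamp, and that the uplink to the router is already broken so no ARP replies arrive. The function name `failure_detected`, the seven neighbouring blades and the evenly spread probe phases are all illustrative assumptions.

```python
def failure_detected(own_interval_ms, neighbour_intervals_ms, duration_ms=10_000):
    """Return True if the toy monitor ever declares the active slave down.

    Assumption: the uplink is broken for the whole run, so the only frames
    received are broadcast ARP requests from neighbouring blades that share
    the same internal switch.
    """
    last_rx = 0  # ms, toy stand-in for dev->last_rx
    last_tx = 0  # ms, toy stand-in for dev->trans_start
    count = len(neighbour_intervals_ms)
    for now in range(1, duration_ms + 1):
        # Our own ARP probe goes out once per interval.
        if now % own_interval_ms == 0:
            last_tx = now
        # Each neighbour broadcasts its probe at its own interval; phases
        # are spread evenly, which is an arbitrary modelling choice.
        for i, interval in enumerate(neighbour_intervals_ms):
            phase = (i * interval) // count
            if now % interval == phase:
                last_rx = now
        # The toy monitor runs once per interval and uses a 2*interval window.
        if now % own_interval_ms == 0:
            window = 2 * own_interval_ms
            if now - last_rx >= window or now - last_tx >= window:
                return True
    return False


# All blades probing every 1000 ms: neighbours keep last_rx fresh.
print(failure_detected(1000, [1000] * 7))  # False
# One blade at 60 ms, neighbours still at 1000 ms: gaps exceed 120 ms.
print(failure_detected(60, [1000] * 7))    # True
# All blades at 60 ms: neighbours keep last_rx fresh again.
print(failure_detected(60, [60] * 7))      # False
```

Under these assumptions the model reproduces the three observations described above, which would be consistent with the monitor relying on interface timestamps rather than on replies from Router_IP. It does not prove that this is what the 2.4 driver actually does.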