Multiple link aggregation questions: LAG / LACP / IEEE 802.3ad / etc.

Discussion in 'Linux Networking' started by Rahul, Sep 5, 2008.

  1. Rahul

    Rahul Guest

    I'm still a bit uncertain whether I've set up my Linux box and the
    switches correctly in my quest for Ethernet channel bonding.

    Goal: to bond eth0 and eth1 on each blade and thus attain close to 2Gbps
    transmit and receive, i.e. I want load balancing / bandwidth
    aggregation. I do *not* care at all about fault tolerance.

    Equipment: 1 server (3 eth ports), 23 blades, 2 Dell 6248 switches. Each
    switch has 48 Gbit ports.

    I set up bond0 on each blade and used mode=6, adaptive load balancing
    (balance-alb). From what I read this seems most suitable (correct me if
    I am wrong, please!) since it supports both transmit- and receive-side
    balancing.
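
    For reference, this is roughly the configuration I used on each blade
    (modprobe.conf style; the exact file name and the example address will
    differ on your distro/subnet):

        # /etc/modprobe.conf (or a file under /etc/modprobe.d/, depending on distro)
        alias bond0 bonding
        options bonding mode=6 miimon=100    # mode=6 = balance-alb, check links every 100ms

        # load the driver, bring up the bond, then enslave the two NICs
        modprobe bonding
        ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up   # example address
        ifenslave bond0 eth0 eth1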

    Now comes the confusing parts:

    1. Do I need Link Aggregation Groups (LAGs) on the switch or not for my
    blade-to-switch connections? I have seen conflicting views on this
    online. http://www.linuxfoundation.org/en/Net:Bonding says that
    balance-alb "does not require any special switch support", and so do
    many other tutorials, which do not mention any switch-side
    configuration being required at all!

    Others say it still needs LAGs. My common sense says I ought to somehow
    tell the switch that two of its ports are going to the same blade.

    2. Will the switch see two MAC IDs or just a single one for the bond0
    device if I examine its address tables? For alb it ought to be both,
    right? But I tried examining the ARP table on the server, and for each
    blade IP only the bond0 MAC is listed. (I have pasted the commands I
    used to check at the end of this post.) Is that a sign something is
    wrong, or just misinformed paranoia? I read the descriptions of the
    bonding modes (some for load balancing, others for fault tolerance) and
    it seems some present both MAC IDs and others just a single one. True?

    3. How about the switch-to-switch connections? If I want to connect 8
    Ethernet cables switch-to-switch (to aggregate bandwidth again), do I
    need a LAG here or not? (8 is the magic number because that's the
    maximum number of ports my switch will allow me to aggregate.)

    4. Each LAG group has an LACP option. Enable it or not? The core Linux
    bonding documentation seems to have no mention of LACP; only
    vendor-specific info seems to exist (Cisco, Dell, etc.). I guess LACP
    is related to IEEE 802.3ad? Is it only a way to avoid having to
    manually aggregate ports into LAG groups, or does it have an advantage
    as a load-balancing protocol over my chosen adaptive load balancing?
    (Rough sketch of my understanding just below.)
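
    From what I can tell, if I wanted the bonding driver's own 802.3ad mode
    (the one that actually speaks LACP), the blade side would look roughly
    like this, and unlike alb it *does* need an LACP-enabled LAG configured
    on the matching switch ports. This is a sketch of my understanding, not
    something I have tested:

        # 802.3ad / LACP dynamic aggregation (instead of the mode=6 line)
        options bonding mode=4 miimon=100 lacp_rate=slow
        # mode=4    = IEEE 802.3ad dynamic link aggregation
        # lacp_rate = how often LACPDUs are sent (slow = every 30s, fast = every 1s)
        # The switch ports the slaves plug into must be members of a LAG
        # with LACP enabled, otherwise the aggregate never comes up properly.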

    I guess it boils down to two questions: (1) To LAG or not-to-LAG (same
    for LACP) (2) Is my mode=6 (Adaptive load balancing) the appropriate
    mode?
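
    In case it is relevant to question 2, these are (roughly) the commands
    I was using to poke around; addresses and interface names are whatever
    your setup uses:

        # on the server: which MAC does each blade IP resolve to?
        arp -n

        # on a blade: the bond's mode, its slaves, and their MAC addresses
        cat /proc/net/bonding/bond0
        ifconfig -a | grep -i hwaddr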
     
    Rahul, Sep 5, 2008
    #1

  2. Rick Jones

    Rick Jones Guest

    Do you expect to get 2Gbps over a _single_ connection/flow?
    You mentioned blades - I cannot recall from earlier which blades these
    were, but are they connecting to the outside world through a _switch_
    module in the blade chassis or a pass-through module?
    Cannot really help much there.
    IIRC they are one and the same :)

    rick jones
     
    Rick Jones, Sep 6, 2008
    #2

  3. Rahul

    Rahul Guest

    Thanks again for your comments, Rick. Yes - am I wrong in expecting
    that? The machines are Dell PowerEdge SC1435s ("nodes"). They have twin
    Ethernet ports each. They connect to a switch; the switch connects to a
    server; the server to the world. I am *not* interested in node-to-world
    performance, mostly node-to-node and node-to-server.

    Could very well be! But then why does Dell have a separate toggle for
    LACP? That implies I can have a LAG but not LACP. Maybe it's just a
    Dell error. Could switch users from other vendors comment on their
    configs, so that we can see if this is a Dell-specific quirk?
     
    Rahul, Sep 9, 2008
    #3
  4. Rick Jones

    Rick Jones Guest

    I think so but my point of view may not be shared by others. IIRC the
    only mode that will spread the _outbound_ traffic of a single
    connection/flow across multiple links in the bond/trunk/aggregate is
    mode-rr aka round-robin.
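
    (For reference, round-robin is just mode 0 of the Linux bonding driver,
    e.g. something along these lines in place of the mode=6 option:)

        # balance-rr: stripe outbound frames across all slaves in turn
        options bonding mode=0 miimon=100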

    I've never been terribly fond of that mode because it leads to
    out-of-order TCP segments, a resulting increase in ACKs and, depending
    on the number of links in the bond/trunk/aggregate, spurious TCP
    retransmissions.

    I am not familiar with any switch with a similar round-robin mode for
    the inbound traffic. Doesn't mean they don't exist mind you...

    Those adaptive modes which are doing clever things with MAC addresses
    are (probably) doing them for different destinations (IP addresses).
    It would be necessary to _constantly_ be sending ARP refreshes (as in
    an ARP frame for virtually every frame carrying a TCP segment) to get
    traffic between a single pair of IPs to spread across different MAC
    addresses.

    IMO the best-if-not-only way to get > 1Gbit/s for a single TCP
    connection is to use a 10G link.
    The "nodes" connect directly to an external switch and not some switch
    internal to the blade chassis? I'm not familiar with Dell blades, but
    for HP C-Class blades, there are I/O modules which plug into the back
    of the blade chassis to connect the eth ports on the blades themselves
    with the outside world. Those can either be pass-through modules or
    they can be actual switches. That is why I was asking about what was
    in the blade chassis along with the blades themselves. If you have
    switch modules you would need to bond/trunk/aggregate to _that_ switch
    module, and then have another bond/trunk/aggregate between the
    "chassis switch" and the external switch to which the server is
    connected.

    rick jones
    --
    The computing industry isn't as much a game of "Follow The Leader" as
    it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
    - Rick Jones
    these opinions are mine, all mine; HP might not want them anyway... :)
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
     
    Rick Jones, Sep 9, 2008
    #4
  5. Rahul

    Rahul Guest

    Rick, my bad. Maybe I confused you with my misleading use of the term
    "blades"? These are Dell PowerEdge SC1435 rack-mount servers:
    http://www.dell.com/content/products/productdetails.aspx/pedge_sc1435?c=us&cs=555&l=en&s=biz

    Each server has twin Ethernet ports on its back panel. We connected
    these using ordinary CAT5e cables to ports on a Dell PowerConnect 6248
    switch, which has 48 Gbit ports.

    Does that clarify the situation better?
     
    Rahul, Sep 9, 2008
    #5
  6. Rahul

    Rahul Guest

    Interesting. Any downsides to mode-rr? Is it transmit-side load
    balancing only? Also, can I ask why you think some of the other
    "smarter" modes (alb / 802.3ad) do not achieve a bandwidth multiplier?
    Is that just a personal preference, or is there anything fundamentally
    iffy about those modes?

    I thought a LAG was the same idea. If a switch cannot distinguish
    between two similar links and clubs them together, doesn't that achieve
    the same effect? Maybe I am wrong.
    Right. Which is why mode=6 (alb) will only (IMO) give a bandwidth
    multiplier when speaking to *at least* two different peers. When
    talking to a single peer (single IP), no advantage.
    Too expensive for a university-research cluster! :)
     
    Rahul, Sep 9, 2008
    #6
  7. Rick Jones

    Rick Jones Guest

    Yes. Standalone systems. Understood. My end conclusion about
    single-stream, aggregation and 10Gig still stands though :)

    rick jones
     
    Rick Jones, Sep 9, 2008
    #7
  8. Rick Jones

    Rick Jones Guest

    It leads to out-of-order TCP segments, which leads to an increase in
    the number of ACKs, which will increase CPU utilization per KB
    transferred (service demand in netperf-speak), and at the larger link
    counts in a single aggregate, spurious TCP retransmissions which will
    waste bandwidth and suppress the congestion window.

    Unless I've really misunderstood what is going on, the modes playing
    tricks with ARP cannot on first principles affect a single flow. They
    get traffic to flow over different links by handing out different MAC
    addresses to queries for their one local IP. Even if we assume that
    every segment sent on a TCP connection does an ARP cache lookup, the
    only way it could get a new MAC address each time would be if there
    was an ARP update between every TCP segment. I cannot imagine any of
    the modes in Linux bonding doing something so terribly inefficient.
    It would make mode-rr look positively pristine in comparison.

    The point of link aggregation was to increase aggregate throughput and
    provide a modicum of HA. Increasing the speed of a single flow was
    not part of the design center.
    It all depends on what the switch does. My experience with other
    switches (non-Dell) has been that when presented with an aggregate the
    switch will hash on some addressing in the frame to pick the link on
    which it will place the frame. Sometimes this is simply the MAC,
    sometimes it may include the IP. I've heard unconfirmed rumours that
    some switches may even go so far as to look at TCP/UDP port numbers.
    However, none of that would result in traffic for a single flow
    flowing over multiple links in parallel.
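
    As a toy sketch of that hashing idea (an example hash only, not any
    vendor's actual algorithm, and certainly not the Dell 6248's): the same
    address pair always hashes to the same link, so one flow never uses
    more than one link of the aggregate.

        links=8           # ports in the LAG
        src=0x1a          # e.g. last byte of the source MAC
        dst=0x2b          # e.g. last byte of the destination MAC
        echo $(( (src ^ dst) % links ))   # same pair -> same link index, every time
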
    Right, and you said you needed an increase for comms to a single peer
    right?
    How did the line go in "The Right Stuff?" "No bucks, no Buck Rogers."
    :)

    rick jones
     
    Rick Jones, Sep 9, 2008
    #8
