Sunday, December 2, 2012

Virtual Networks : The way ahead - 2

In this post, I will start looking at the Layer 2 multipath technologies that have become prominent as virtualization changed the data center environment. This post focuses on MLAG.

L2 Multi-Path : MLAG
Probably everyone who has worked with a switch or router knows about LAG (Link Aggregation), which was standardized as IEEE 802.3ad. Before getting into what M-LAG is, how it differs from LAG, and what it borrows from LAG, I would like to recap what LAG means and how it works.

What is LAG and how does it work?

Link aggregation is a fairly old technology that lets you bond multiple parallel links into a single virtual link (from the STP perspective). Because STP sees the parallel links as one link, it detects no loop, and all the physical links can be fully utilized.

Link Aggregation Control Protocol (LACP, IEEE 802.3ad) detects multiple links available between two devices and configures them to be used as one aggregate. The two sides detect each other by exchanging LACP PDUs. One end is the Actor and the other end is the Partner. LACP PDUs are sent at regular intervals to the multicast MAC address 01:80:C2:00:00:02. During LACP negotiation, the triplet {Admin Key, System ID, System Priority} identifies the LAG instance, so all ports participating in a LAG on a device must share the same triplet value.
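To make the triplet idea concrete, here is a minimal Python sketch. It only models the bookkeeping: ports that advertise the same {Admin Key, System ID, System Priority} end up in the same bundle. The `LagId` type, the port names and the values are made up for illustration and do not come from any real switch.

```python
from collections import defaultdict
from typing import NamedTuple


class LagId(NamedTuple):
    """Hypothetical container for the LACP triplet described above."""
    admin_key: int
    system_id: str        # MAC-style identifier of the device
    system_priority: int


def group_ports_by_lag(ports):
    """Group port names by the triplet each port advertises."""
    groups = defaultdict(list)
    for name, lag_id in ports.items():
        groups[lag_id].append(name)
    return dict(groups)


# Illustrative ports: the first two share a triplet, the third does not.
ports = {
    "1/1/1": LagId(10, "00:11:22:33:44:55", 32768),
    "1/1/2": LagId(10, "00:11:22:33:44:55", 32768),
    "1/1/3": LagId(20, "00:11:22:33:44:55", 32768),
}
bundles = group_ports_by_lag(ports)
```

Here ports 1/1/1 and 1/1/2 land in one bundle, while 1/1/3 carries a different Admin Key and stays on its own.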

LACP has two modes: Active and Passive. In Active mode, a port starts sending LACP PDUs to look for a Partner as soon as the physical link comes up. In Passive mode, a port sends LACP PDUs only in response to LACP PDUs received from the remote side. When a LAG is configured manually, it is the operator's responsibility to make sure the configuration is the same on both endpoints. The capabilities of the ports within a LAG must be consistent, i.e. speed and duplex must match on all ports. Some implementations additionally require auto-negotiation to be disabled on the member ports when LACP is used.
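One consequence of the two modes is worth spelling out: if both ends are Passive, nobody sends the first PDU and the LAG never negotiates. A tiny Python sketch of that rule (the function name is my own, not part of any LACP implementation):

```python
def lacp_negotiates(mode_a: str, mode_b: str) -> bool:
    """Return True if two ends with the given LACP modes will exchange PDUs.

    A Passive port only answers PDUs it receives, so at least one end
    has to be Active for the exchange to start.
    """
    valid_modes = {"active", "passive"}
    if mode_a not in valid_modes or mode_b not in valid_modes:
        raise ValueError("mode must be 'active' or 'passive'")
    return "active" in (mode_a, mode_b)
```

So active/active and active/passive pairs come up, while passive/passive stays silent.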

There are two reasons to implement LAG: (a) to improve link reliability, i.e. if one of the links in the LAG goes down, the LAG stays operationally up; (b) to expand bandwidth, i.e. the bandwidth available in the LAG is the sum of the bandwidths of all its member links.

To keep the frames of a flow in sequence, traffic is distributed over the links in the LAG using a per-flow hashing algorithm. Hashing transforms an input into a fixed-size value or key. In an Ethernet LAG, the hash input can be the source/destination MAC addresses, the source/destination IP addresses, or both. Layer 4 header fields can be added to the hash input as well. The hash result gives the ID of the egress port on which the flow is sent.
So, this is how a LAG works…
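The per-flow hashing idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: real switches use hardware hash functions that differ per vendor, and I use CRC32 over the MAC and IP addresses simply because it is available in the standard library. The function and field names are hypothetical.

```python
import zlib


def pick_member_link(src_mac, dst_mac, src_ip, dst_ip, links):
    """Pick the egress link for a flow by hashing its addresses.

    The same addresses always produce the same hash, so every frame of
    a flow leaves on the same member link and stays in sequence.
    """
    if not links:
        raise ValueError("the LAG has no member links")
    key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}".encode()
    return links[zlib.crc32(key) % len(links)]


members = ["port-1", "port-2", "port-3", "port-4"]
first = pick_member_link("aa:aa", "bb:bb", "10.0.0.1", "10.0.0.2", members)
second = pick_member_link("aa:aa", "bb:bb", "10.0.0.1", "10.0.0.2", members)
```

Both calls describe the same flow, so `first` and `second` name the same port. A different flow may or may not land on another port, which is why a single large flow can never use more than one member link.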

What is M-LAG and how does it work?

Multi-Chassis LAG is an emerging technology mainly meant to solve the problems caused by the inefficiencies of the Spanning Tree Protocol (STP) in data center environments. Normally, in link aggregation topologies, the two devices involved are directly connected. Now imagine you could make two physical boxes behave as if they used a single control plane and coordinated switching fabrics. Then links terminating on the two physical boxes would effectively terminate within the same control plane, and you could aggregate them. Welcome to the wonderful world of Multi-Chassis Link Aggregation (MLAG).
MLAG nicely solves the STP problem: no bandwidth is wasted and close-to-full redundancy is retained.
MLAG is the simplest L2 multipathing strategy that vendors offer nowadays. It allows multiple physical switches to appear to other devices on the network as a single switch, although each switch is still managed independently. This lets you multihome a physical host to each of the switches in the MLAG group while actively forwarding on all links, instead of having some links active and others lying dormant in a standby state. LACP (802.3ad) is commonly used to negotiate these links.

Let me go into detail about this with a picture:
Device 1 treats the two links as a regular Link Aggregation (LAG). Devices 2 and 3 participate in the MLAG to create the perception of a LAG. In effect, MLAG adds multi-path capability to traditional LAG, although the number of paths is generally limited to two. With MLAG, both links that are dual homed from Device 1 can be actively forwarding traffic. If one device in the MLAG fails, for example Device 3, traffic is redistributed to Device 2, which gives both device level and link level redundancy while using both links when they are up. MLAG can be used in conjunction with LAG and other existing technologies. The limitation of two paths is not a big limitation today, because many DC networks are designed with dual uplinks, i.e. in a large cross section of current deployments you don’t have more than two uplinks to multi-path over anyway.
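From Device 1's point of view, the failover described above looks just like a member link going down in a normal LAG. The following Python sketch illustrates that view. It is a simplification under my own assumptions: the uplink names are made up, and flows are spread with a CRC32 hash modulo the number of live links, which is not how any particular vendor does it.

```python
import zlib


def distribute(flows, uplinks):
    """Map each flow to one of the uplinks that is currently up."""
    live = [name for name, is_up in uplinks.items() if is_up]
    if not live:
        raise RuntimeError("no operational uplink left in the LAG")
    return {flow: live[zlib.crc32(flow.encode()) % len(live)] for flow in flows}


flows = [f"vm{i}->gateway" for i in range(8)]
uplinks = {"to-device-2": True, "to-device-3": True}

before = distribute(flows, uplinks)   # flows may use either uplink

uplinks["to-device-3"] = False        # Device 3 (or its link) fails
after = distribute(flows, uplinks)    # every flow now uses Device 2
```

After the failure all eight flows are carried over the link to Device 2, and Device 1 needed no knowledge of MLAG to get there.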

“Proprietary” implementations of MLAG

MLAG implementations are mostly proprietary. The “proprietariness” of MLAG is confined to the two switches in the tier that offers the MLAG, i.e. Device 2 and Device 3 in the picture above need to be from the same vendor. Device 1, on the other hand, simply treats both ports as a regular LAG and as such could come from another vendor. So, for example, MLAG can be used in conjunction with NIC teaming, where Device 1 is a server dual homed to two switches operating as an MLAG. MLAG can also be used in conjunction with upcoming standards-based technologies such as VEPA to switch VMs directly in the network over active-active paths from the server. To learn what VEPA is, you can look at my previous post.

How do Device 2 and Device 3 communicate so that they look like a single partner in an MLAG?

So, the million-dollar question: how do Device 2 and Device 3 in the example above come to know that they are part of an MLAG? These two devices have to advertise the same LACP triplet {Admin Key, System ID, System Priority} to the partner, Device 1, so that the connection stays intact. To do this, Device 2 and Device 3 normally run a protocol that is implementation specific/vendor specific. IEEE does have a standard in this area, IEEE 802.1AX, which came as a revision to Link Aggregation. However, the communication mechanism between the two devices is vendor specific and is not defined in that standard. For example, in the implementation followed by Alcatel-Lucent for such a topology, MC-LAG control protocol information is exchanged between Device 2 and Device 3. This exchange results in an active/standby selection and ensures that the ports of only one of the two devices (Device 2 or Device 3) are active and carrying traffic. The MC-LAG control protocol runs only between MC-LAG peers. It uses UDP packets (destination port 1025) and can use MD5 for authentication. It serves as a keep-alive to confirm that the peer device is alive, and it is also used to synchronize LAG parameters. MC-LAG peers are not required to be directly connected to each other. If the MC-LAG peer is not found, both devices (Device 2 and Device 3) become active, and Device 1 then brings up all links of the LAG.
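The active/standby behaviour described above can be sketched as a small decision function. This is only my own illustration, not the Alcatel-Lucent algorithm: the hold time, the identifiers and especially the tie-break (the lower ID becomes active) are assumptions I made to keep the example short.

```python
def mc_lag_role(local_id, peer_id, last_keepalive, now, hold_time=3.0):
    """Decide whether this MC-LAG device should be active or standby.

    Assumptions (hypothetical): timestamps are in seconds, the peer is
    considered lost when no keep-alive arrived within hold_time, and the
    device with the lower ID wins the active role when both are alive.
    """
    peer_alive = (
        peer_id is not None
        and last_keepalive is not None
        and (now - last_keepalive) <= hold_time
    )
    if not peer_alive:
        # Peer not found: this device becomes active on its own.
        return "active"
    return "active" if local_id < peer_id else "standby"
```

With both peers alive, `mc_lag_role("device-2", "device-3", 10.0, 11.0)` gives "active" and the mirror call on Device 3 gives "standby". Once the keep-alives stop, each device evaluates to "active", which matches the case where the peer is not found and both devices become active.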

Why is M-LAG needed in data center networks? Why will normal LAG not help?

Why is M-LAG needed in DC networks? Why do we need this kind of multipath configuration in the first place? That takes me back to where I started this blog. The impact of server virtualization is one of the prime reasons for the situation. IT administrators are looking to pack several virtual machines (VMs) onto a physical server in order to reduce cost and power consumption. As more VMs are packed onto a single server, the bandwidth demands from the server edge all the way to the core of the network grow at a rapid pace. Additionally, with more virtual machines on a single server, the redundancy and resiliency requirements from the server edge to the core of the network increase.
Traditionally, the approach to increasing bandwidth from the server to the network edge has been to add more Network Interface Cards (NICs) and use Link Aggregation (LAG), or “NIC teaming” as it is commonly called, to bond links and achieve higher bandwidth. Something like the following figure can be visualized for this scenario:


If any of the links in the group of aggregated links fails, the traffic load is redistributed among the remaining links. Link aggregation provides a simple and easy way to both increase bandwidth and add resiliency. It is also commonly used between two switches for the same reasons. However, in both cases link aggregation works only between two individual devices, for example switch to switch or server to switch. If the device on either end of the link aggregation group (or trunk, as it is also called) fails, there is a complete loss of connectivity. So we need device level redundancy along with link level redundancy. As link level redundancy can be achieved with LAG, let us explore some options for device level redundancy.

Layer 3 routing protocols for device level redundancy

Router redundancy protocols such as VRRP, in conjunction with interior gateway protocols such as OSPF, provide adequate resiliency, failover and redundancy in the network. These mechanisms are used for device level redundancy where Layer 3 routing and segmentation are deployed in the network. However, as you can see from my previous post, virtualization technologies are driving current Layer 2 topologies to become “flatter” and “faster”. As virtual machine movement today is typically restricted to within a subnet boundary, device level redundancy through Layer 3 protocols may not be a good option.

Why may current STP not be very useful here?

In Layer 2 topologies, protocols such as the spanning tree protocol have typically provided redundancy around both link and device failures.
Spanning tree protocol works by blocking ports on redundant paths so that every node in the network is reachable through a single path. If a device or link fails, the spanning tree algorithm opens up one or more of the blocked redundant paths to let traffic flow, while still reducing the topology to a tree structure that prevents loops.
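The effect of blocking redundant paths can be shown with a small Python sketch. Note that this is not the real STP algorithm: actual STP elects a root bridge and compares path costs using BPDUs, whereas here I simply pick the root by hand and grow a tree with a breadth-first search. The switch names are made up.

```python
from collections import deque


def spanning_tree(links, root):
    """Split links into forwarding (tree) links and blocked (redundant) links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors.get(node, [])):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((node, peer)))
                queue.append(peer)

    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# A triangle of switches: one link has to be blocked to break the loop.
links = [("core", "access-1"), ("core", "access-2"), ("access-1", "access-2")]
forwarding, blocked = spanning_tree(links, "core")

# If the core to access-2 link fails, the blocked link is put to use.
remaining = [("core", "access-1"), ("access-1", "access-2")]
forwarding_after, blocked_after = spanning_tree(remaining, "core")
```

In the triangle, the access-1 to access-2 link ends up blocked, so a third of the installed links carries no traffic until something fails. That idle capacity is exactly what the L2 multipath technologies try to recover.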

STP with Link Aggregation (LAG):

Spanning tree protocol can be used in combination with link aggregation: links between two nodes, such as switch to switch connections, can be aggregated to increase bandwidth and resiliency between the nodes or devices. Spanning tree would typically treat the aggregated link as a single logical port in its calculations to come up with a loop free topology. Here is such a normal STP + LAG topology:


So, how does MLAG help here?

If you read the content above carefully, it all boils down to one point: we need a provision that gives device level redundancy along with link level redundancy. We reached this situation because spanning tree protocol does not provide this without blocking links, and that is a shortcoming of STP. Highly virtualized data centers, however, require high performance as well as resiliency, as mentioned earlier in this post. One way to meet such requirements is to extend the link level redundancy capabilities of link aggregation and add support for device level redundancy. This can be accomplished by allowing one end of the link aggregated port group to be dual homed into two different devices. The other end of the group is still single homed into a single device. Let us examine this topology through a figure:

  • Device 1 => No change in LAG behavior. LAG hashing distributes traffic as before.
  • Device 2 & Device 3 => Communicate with each other through an ISL (inter-switch link). The ISL can itself be a LAG interface. The two devices talk to each other through a proprietary protocol so that together they create the perception of a normal link aggregation group towards Device 1. They also coordinate over the ISL so that learning, forwarding and bridging happen without any loops.
  • Communication protocol between Device 2 and Device 3 => proprietary
  • Device 2 and Device 3 should come from the same vendor.
  • Device 1 can be a switch or a server and need not be from the same vendor as Device 2 and Device 3. Device 1 does not participate in any proprietary protocol.
  • If a link from Device 1 goes down, an alternative path is chosen just as if one of the links in a normal LAG had gone down.
  • If any link on Device 2 or Device 3 goes down, the proprietary communication mechanism between Device 2 and Device 3 decides how to provide alternate connectivity, as if the two devices shared a single aggregation group like a normal LAG.
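Why Device 1 needs no changes can be sketched from its side of the links. In this simplified Python model (my own assumption of how the check could look, with made-up identifiers), Device 1 bundles two ports only when both ports report the same partner System ID and key. An MLAG pair passes this check because both switches advertise one shared identity.

```python
def can_bundle(partner_by_port):
    """Return True when every local port sees the same partner identity.

    partner_by_port maps a local port name to the (system_id, key) pair
    learned from the partner on that port.
    """
    identities = set(partner_by_port.values())
    return len(identities) == 1


# Two independent switches: different System IDs, so no single LAG.
plain = {"port-1": ("device-2-mac", 10), "port-2": ("device-3-mac", 10)}

# An MLAG pair: both switches advertise the same (hypothetical) shared ID.
mlag = {"port-1": ("shared-mlag-id", 10), "port-2": ("shared-mlag-id", 10)}
```

`can_bundle(plain)` is False while `can_bundle(mlag)` is True, which is the whole trick: all the proprietary work happens between Device 2 and Device 3, and Device 1 just runs a normal LAG.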

Vendor Offerings

MLAG is offered by the prominent data center vendors, the best known of which are:

Cisco’s Virtual Port Channel – Cisco provides the MLAG feature under the name Virtual Port Channel (vPC). This feature is available on Nexus 7000 and Nexus 5000 switches. Cisco supports configuring two switches into a vPC domain.
Arista’s Multi-Chassis Link Aggregation – This feature is implemented in Arista’s EOS and is present across Arista’s product lines. Here two switches can participate in an MLAG.
Avaya’s Split Multi-Link Trunking (SMLT) – Avaya supports the Split Multi-Link Trunking feature on the Ethernet Routing Switch 8600, 8300, 5x00, and 1600 series. Switches are deployed as SMLT pairs in a cluster.
Extreme Networks Multi System LAG – Extreme Networks supports the Multi System LAG feature to join two switches into an MLAG pair.

Advantages of MLAG 

  • Can be built on top of existing LAG
  • A simple software upgrade on the existing infrastructure can bring in the MLAG feature

Disadvantages of MLAG

  • Each traffic flow is hashed onto one of the member links, meaning that adding a physical link to an MLAG bundle doesn’t always result in a commensurate bandwidth boost.
  • You can’t link switches from two vendors to form an MLAG group. For example, you could uplink a Cisco switch into an Arista MLAG pair, but you won’t be able to have an Arista switch and a Cisco switch form the MLAG pair.
  • It’s also important to understand that an MLAG pair is still two physical switches with minds of their own. Therefore, complex communication must be maintained between the pair at all times to ensure a stable, loop-free topology. How an MLAG pair behaves when communication between the two members is lost is a key design element that you will need to review with the vendor from which the switches are bought.

[In my next post, I will take up other L2 Multi-Path technologies]

Saturday, December 1, 2012

Virtual Networks : The Way Ahead - 1


Virtualization changed the server market dramatically, dramatically enough to raise a new market force, VMware, in that market. Apart from changing market dynamics, virtualization started changing the school of thought about information transfer, a school the networking market has joined only lately.

In the past, network designers built fat-tree topologies in which traffic traveled in a north-south orientation, up and down the tree. That’s an adequate design for client-facing traffic and workloads that don’t move. A smart designer could put systems that need to talk to one another close together and reduce the amount of traffic flowing up and down the tree.
Network topologies were always shaped by the Spanning Tree Protocol, which forced a tree-like structure from core to edge. Today, we refer to this as north/south alignment because traffic flows were predominantly server to LAN core to WAN core to WAN edge to client.
Virtualization breaks this paradigm. Virtual machines talk to other VMs in other racks and rows in an east-west fashion, and VMs can move to unpredictable data center locations. A designer can’t know where a workload is at any given time, because it’s no longer physically constrained. In that world, the fat tree fails at scale, and typical Spanning Tree topologies fail as well. L2 Multi-Path (L2MP) technologies are replacing Spanning Tree as the alternative.


Today’s network architects and engineers have a multitude of options to meet the demands raised by virtualization. I would like to group the significant data center network technologies into three major categories:

(i) Layer 2 multi-path
(ii) Layer 2 extension
(iii) Software-defined networking

I will take a quick first stab at these technologies here and go deeper into them in my next posts.


L2 Multi-Path
Layer 2 multi-path tackles the built-in limitations of Spanning Tree Protocol by enabling all links to forward traffic while ensuring redundancy and eliminating loops that could take down a network. While some of these L2 multipath technologies are standards based or working-group based, some are proprietary. The IETF has a working group that introduced TRILL (Transparent Interconnection of Lots of Links), whereas IEEE has the 802.1aq standard known as SPB (Shortest Path Bridging). Emerging protocols such as TRILL and SPB let designers create meshes or fabrics that enable traffic to take the shortest path between switches.
Proprietary options include MLAG and virtual chassis, which allow multiple switches to act like a single device.

L2 Extension
One of the reasons server virtualization became prominent is that it makes moving servers a cakewalk: virtual machines can be moved across servers without any physical movement. VM movement raises some problems of its own, and Layer 2 extension technologies were developed to solve them. Layer 2 extension allows physically separate data centers to be linked into a single Layer 2 domain across Layer 3 boundaries. Originally aimed at carrier networks (think VPLS and Q-in-Q, among others), some Layer 2 extension protocols are appearing in the data center because they support moving VMs from one data center to another, an ideal capability for load sharing, business continuity and disaster recovery. We will look at Cisco’s Overlay Transport Virtualization (OTV), Virtual Extensible Local Area Network (VXLAN), Network Virtualization using Generic Routing Encapsulation (NVGRE) and Stateless Transport Tunneling (STT).

Software Defined Networking(SDN)
Software-defined networking is emerging as an alternative to the traditional switch model in which the control plane resides within each switch. While SDN and OpenFlow are not synonymous, OpenFlow demonstrates SDN’s promise: take the decision-making away from the switches and routers and move it into a centralized controller that tells the network as a whole how to forward traffic, allowing for more flexible networks that can respond in near real time to changing conditions. It also doesn’t hurt that OpenFlow and SDN have the potential to make networking gear less expensive. In addition to SDN, in my next posts I will try to dig into OpenFlow, a new protocol for communication between switches and a controller, explain the potential implications of SDN and OpenFlow, and evaluate their impact on data center networks.
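The centralized-controller idea can be illustrated with a toy Python model. To be clear about the assumptions: this is not OpenFlow and uses no real controller API. The `Controller` class, the switch names and the dictionary-based flow tables are all hypothetical. The point is only that one central component sees the whole topology, computes a path, and tells each switch on the path what its next hop is.

```python
from collections import deque


class Controller:
    """Toy central controller holding the topology and all flow tables."""

    def __init__(self, links):
        self.neighbors = {}
        for a, b in links:
            self.neighbors.setdefault(a, set()).add(b)
            self.neighbors.setdefault(b, set()).add(a)
        # One forwarding table per switch: destination -> next hop.
        self.flow_tables = {switch: {} for switch in self.neighbors}

    def shortest_path(self, src, dst):
        """Breadth-first search over the topology known to the controller."""
        previous = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                break
            for peer in sorted(self.neighbors[node]):
                if peer not in previous:
                    previous[peer] = node
                    queue.append(peer)
        if dst not in previous:
            return None
        path = []
        node = dst
        while node is not None:
            path.append(node)
            node = previous[node]
        return path[::-1]

    def install_path(self, src, dst):
        """Compute a path and push a next-hop entry to every switch on it."""
        path = self.shortest_path(src, dst)
        if path is None:
            return None
        for hop, next_hop in zip(path, path[1:]):
            self.flow_tables[hop][dst] = next_hop
        return path


controller = Controller([("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s3", "s4")])
path = controller.install_path("s1", "s4")
```

After this call, only the switches on the chosen path (s1 and s3) hold an entry for s4. The switches themselves made no decision, which is the separation of control and forwarding that SDN is about.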



[In my next post, I will take a deep dive into L2 Multipath technologies]