In this post, I will start focusing on Layer 2 multipath technologies, which became prominent as virtualization changed the data center environment. This post covers MLAG.
L2 Multi-Path: MLAG
Probably everyone who has put their hands and brains around a switch or router knows about LAG (Link Aggregation), which was standardized as IEEE 802.3ad. Before we go into what M-LAG is, how it differs from LAG, and what it borrows from LAG, I would like to first recall what LAG means and how it works.
What is LAG and how does it work?
Link aggregation is a fairly old technology that allows you to bond multiple parallel links into a single virtual link (from the STP perspective). With the parallel links replaced by a single logical link, STP detects no loops and all the physical links can be fully utilized.
The Link Aggregation Control Protocol, LACP (IEEE 802.3ad), detects multiple links available between two devices and configures them to be used as aggregated bandwidth. The two sides detect each other's availability by exchanging LACP PDUs; one end is the Actor, while the other end is the Partner. LACP PDUs are sent at regular intervals to the multicast MAC address 01:80:C2:00:00:02. During LACP negotiation, the triplet {Admin Key, System ID, System Priority} identifies the LAG instance, so for a given LAG, all participating ports on that device must carry the same triplet value.
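To make the triplet idea concrete, here is a minimal Python sketch, illustrative only and not an LACP implementation: ports advertising identical triplets are grouped into the same LAG instance, while a differing Admin Key splits a port off into its own group. The port names and values are made up for the example.

```python
from collections import defaultdict

# Hypothetical advertised triplets: (admin_key, system_id, system_priority)
ports = {
    "eth1": (10, "00:11:22:33:44:55", 32768),
    "eth2": (10, "00:11:22:33:44:55", 32768),  # same triplet -> same LAG
    "eth3": (20, "00:11:22:33:44:55", 32768),  # different Admin Key -> different LAG
}

# Group ports by triplet: each distinct triplet is one LAG instance.
lags = defaultdict(list)
for port, triplet in ports.items():
    lags[triplet].append(port)

for triplet, members in lags.items():
    print(f"LAG {triplet}: members {members}")
```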
LACP has two modes: Active and Passive. In Active mode, a port sends out LACP PDUs to seek Partners as soon as the physical link comes up. In Passive mode, a port sends LACP PDUs only in response to PDUs received from the remote side. When a LAG is manually configured, it is the operator's responsibility to ensure the configuration is the same on both endpoints. The capabilities of the ports within a LAG must be consistent, i.e., speed and duplex must match on all ports, and auto-negotiation must be disabled when LACP is used.
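A hedged sketch of the mode rule (not the real LACP state machine): PDU exchange can start as long as at least one side is Active, because a Passive port only ever answers PDUs it receives, so Passive/Passive deadlocks.

```python
def lacp_converges(local_mode: str, remote_mode: str) -> bool:
    """Return True if LACP PDU exchange can begin for this port pair.

    A Passive port never initiates, so Passive/Passive never starts:
    neither side ever sends the first PDU.
    """
    return "active" in (local_mode, remote_mode)

assert lacp_converges("active", "active")        # both initiate
assert lacp_converges("active", "passive")       # passive side answers
assert not lacp_converges("passive", "passive")  # nobody speaks first
```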
There are two reasons to implement LAG: a) to improve link reliability, i.e., if one of the links in the LAG goes down, the LAG stays operationally up; and b) to expand bandwidth, i.e., the available bandwidth of the LAG is the sum of the bandwidths of all member links (for example, four 10 Gb/s links form one 40 Gb/s logical link).
To keep each traffic flow in sequence, traffic is distributed over the links in the LAG using per-flow hashing. Hashing is an operation that transforms an input into a fixed value or key. In an Ethernet LAG, the hash input can be the source/destination MAC addresses, the source/destination IP addresses, or both; even Layer 4 header fields can be added to the hash criteria. The hash output selects the egress port to which the flow is sent, so all packets of one flow take the same link.
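Here is a minimal sketch of per-flow hashing in Python. Real switches compute the hash in hardware with vendor-specific algorithms; the field names here just mirror the criteria listed above. The point is that a flow's header tuple always hashes to the same value, so its packets always exit on the same member link and stay in order.

```python
import hashlib

def select_egress_port(src_mac, dst_mac, src_ip, dst_ip, l4_ports, num_links):
    # Hash the flow identifiers, then map the digest onto one member link.
    # Same flow -> same hash -> same link, which preserves packet order.
    flow_key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{l4_ports}".encode()
    digest = hashlib.md5(flow_key).digest()
    return int.from_bytes(digest[:4], "big") % num_links

# Every packet of this flow maps to the same link index (0..3 here).
link = select_egress_port("aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02",
                          "10.0.0.1", "10.0.0.2", (49152, 443), 4)
print(f"flow pinned to member link {link}")
```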
So, this is how a LAG works…
What is M-LAG and how does it work?
Multi-Chassis LAG is an emerging technology that is mainly meant to solve problems caused by the inefficiencies of the Spanning Tree Protocol (STP) in data center environments. Normally, in link aggregation topologies, the two devices involved are directly connected. Imagine you could pretend that two physical boxes use a single control plane and coordinated switching fabrics; then links terminated on two physical boxes would actually terminate within the same control plane, and you could aggregate them. Welcome to the wonderful world of Multi-Chassis Link Aggregation (MLAG).
MLAG nicely solves the STP problem: no bandwidth is wasted
and close-to-full redundancy is retained.
MLAG is the simplest L2 multipathing strategy that vendors offer nowadays. MLAG allows multiple physical switches to appear to other devices on the network as a single switch, although each switch is still managed independently. This allows you to multihome a physical host to each of the switches in the MLAG group while actively forwarding on all links, instead of having some links active and others wasted, lying dormant in a standby state. LACP (802.3ad) is commonly used to arbitrate these links.
Let me go into detail about this with a picture:
Device 1 treats the two links as a regular link aggregation (LAG). Devices 2 and 3 participate in the MLAG to create the perception of a LAG. In effect, MLAG adds multipath capability to traditional LAG, albeit with the number of paths generally limited to two. With MLAG, both links dual-homed from Device 1 can actively forward traffic. If one device in the MLAG fails, for example Device 3, traffic is redistributed to Device 2, providing both device-level and link-level redundancy while utilizing both active links. MLAG can be used in conjunction with LAG and other existing technologies. The limitation of two paths isn't really such a big limitation today, because many DC networks are designed with dual uplinks; in a large cross-section of current deployments, you don't have more than two uplinks to multipath over anyway.
“Proprietary” implementations of MLAG
MLAG implementations are mostly proprietary. The “proprietariness” of MLAG is confined to the two switches in the tier offering the MLAG, i.e., Device 2 and Device 3 in the picture above need to be from the same vendor. Device 1, on the other hand, simply treats both ports as a regular LAG and as such can come from another vendor. So, for example, MLAG can be used in conjunction with NIC teaming, where Device 1 could be a server dual-homed to two switches operating as an MLAG pair. MLAG can also be used in conjunction with upcoming standards-based technologies such as VEPA to switch VMs directly in the network over active-active paths from the server. To learn what VEPA is, you can always look into my previous post.
How do Device 2 and Device 3 communicate so that they appear to their partner as a single device in the MLAG?
So, the million dollar question: how do Device 2 and Device 3 in the above example come to know that they are connected in an MLAG? The two devices have to advertise the same LACP triplet {Admin Key, System ID, System Priority} to the partner Device 1 so that the connection stays intact. For this, Device 2 and Device 3 normally follow a protocol that is implementation/vendor specific. IEEE does have a standard for this feature, IEEE 802.1AX, which comes as a revision of Link Aggregation; however, the communication mechanism between the two devices is vendor specific and is not covered by that standard. For example, in the implementation followed by Alcatel-Lucent for such a topology, MC-LAG control protocol information is exchanged between Device 2 and Device 3. This exchange results in active/standby selection and ensures that only one of the two devices' (Device 2 or Device 3) ports is active and carrying traffic. The MC-LAG control protocol runs only between MC-LAG peers. The protocol uses UDP packets (destination port 1025) and can use MD5 for authentication. It serves as a keepalive to ensure the peer device is active, and is also used to synchronize LAG parameters. MC-LAG peers are not required to be directly connected to each other. If an MC-LAG peer is not found, both devices (Device 2 and Device 3) become active, and Device 1 then brings up all links in the LAG.
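As a rough Python sketch of that keepalive idea: only the UDP destination port 1025 and the keepalive/peer-detection behavior come from the Alcatel-Lucent description above; the peer address, hold time, and payload format are invented for illustration.

```python
import socket
import time

# Hypothetical peer address; port 1025 per the Alcatel-Lucent example.
PEER_ADDR = ("192.0.2.2", 1025)
HOLD_TIME = 3.0  # assumed hold time; the real value is vendor specific

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_keepalive():
    # Payload is purely illustrative, not the real MC-LAG PDU format.
    sock.sendto(b"MCLAG-KEEPALIVE", PEER_ADDR)

def peer_alive(last_rx: float) -> bool:
    # No keepalive within the hold time -> assume the peer is gone.
    # Per the text, both devices then go active and Device 1 brings
    # up all links in the LAG.
    return (time.time() - last_rx) < HOLD_TIME
```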
Why is M-LAG needed in data center networks? Why will normal LAG not help?
Why is M-LAG needed in DC networks? What is the need for these kinds of multipath configurations in the first place? That points me back to where I started this blog: the impact of server virtualization is one of the prime reasons. IT administrators are looking to pack several virtual machines (VMs) onto a physical server in order to reduce cost and power consumption. As more VMs are packed onto a single server, the bandwidth demands from the server edge all the way to the core of the network grow at a rapid pace. Additionally, with more virtual machines on a single server, the redundancy and resiliency requirements from the server edge to the core of the network increase.
Traditionally, the approach to increasing bandwidth from the server to the network edge has been to add more Network Interface Cards (NICs) and use Link Aggregation (LAG), or “NIC teaming” as it is commonly called, to bond links and achieve higher bandwidth. Something like the following figure can be visualized for this scenario:
If any of the links in the group of aggregated links fails, the traffic load is redistributed among the remaining links. Link aggregation provides a simple and easy way to both increase bandwidth and add resiliency, and it is also commonly used between two switches for the same purpose. However, in both cases, link aggregation works only between two individual devices, for example switch to switch, or server to switch. If either of the devices at the ends of the link aggregation group (or trunk, as it is also called) fails, there is complete loss of connectivity. So, we need device-level redundancy along with link-level redundancy. As link-level redundancy can be achieved with LAG, let us explore some options for device-level redundancy.
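A hedged sketch of the redistribution step: when a member link fails, the flow hash is simply taken modulo the surviving links, so traffic moves onto the remaining members. How much traffic actually moves (and whether flows on surviving links are also remapped) is vendor specific; the link names here are hypothetical.

```python
def rehash_flow(flow_hash: int, links: list, failed: set) -> str:
    # Drop failed members from the candidate set and re-map the flow.
    alive = [l for l in links if l not in failed]
    if not alive:
        raise RuntimeError("LAG operationally down: no member links left")
    return alive[flow_hash % len(alive)]

links = ["eth1", "eth2", "eth3", "eth4"]
print(rehash_flow(0xBEEF, links, failed=set()))     # spread over four links
print(rehash_flow(0xBEEF, links, failed={"eth2"}))  # spread over the three survivors
```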
Layer 3 routing protocols for device-level redundancy
Various router redundancy protocols such as VRRP, in conjunction with interior gateway protocols such as OSPF, provide adequate resiliency, failover, and redundancy in the network. These mechanisms provide device-level redundancy where Layer 3 routing and segmentation are deployed. However, as you can see from my previous post, virtualization technologies are driving current Layer 2 topologies to go “flatter” and “faster.” As virtual machine movement today is typically restricted to within a subnet boundary, device-level redundancy through Layer 3 protocols may not be a good option.
Why current STP may not be very useful here
In Layer 2 topologies, protocols such as the Spanning Tree Protocol have typically provided redundancy around both link and device failures. STP works by blocking ports on redundant paths so that all nodes in the network are reachable through a single path. If a device or link failure occurs, the spanning tree algorithm opens up selected redundant paths to allow traffic to flow, while still reducing the topology to a tree structure that prevents loops.
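To see why this wastes bandwidth, here is a toy Python sketch. It is not the real STP algorithm (which elects a root bridge and compares path costs); it just shows that any loop-free tree over N switches keeps N-1 links forwarding and blocks the rest, leaving those links idle.

```python
# Toy topology: three switches fully meshed; any spanning tree must
# block one of the three links to break the loop.
edges = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]

seen, tree, blocked = {"sw1"}, [], []
for a, b in edges:                # greedy tree building, not real STP
    if a in seen and b in seen:
        blocked.append((a, b))    # redundant path: this port gets blocked
    else:
        tree.append((a, b))
        seen.update((a, b))

print("forwarding:", tree)        # [('sw1','sw2'), ('sw2','sw3')]
print("blocked:", blocked)        # [('sw1','sw3')] -> bandwidth sits idle
```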
STP with Link Aggregation (LAG)
The Spanning Tree Protocol can be used in combination with link aggregation, where links between two nodes, such as switch-to-switch connections, are aggregated to increase bandwidth and resiliency between the devices. Spanning tree typically treats the aggregated link as a single logical port in its calculations to come up with a loop-free topology. See such a typical STP + LAG combination topology:
So, how does MLAG help here?
If you have read the above carefully, it all boils down to one point: we need a mechanism that gives device-level redundancy along with link-level redundancy. We reached this situation because the Spanning Tree Protocol does not provide this; it is a shortcoming of STP. Highly virtualized data centers require high performance as well as resiliency, as mentioned earlier in this post. One way to meet these requirements is to extend the link-level redundancy capabilities of link aggregation and add support for device-level redundancy. This can be accomplished by allowing one end of the link-aggregated port group to be dual-homed into two different devices, while the other end remains single-homed into a single device. Let us examine this topology through a figure:
- Device 1 => No change in LAG behavior. LAG hashing distributes traffic as before.
- Device 2 & Device 3 => Communicate with each other through an ISL (inter-switch link), which can itself be a LAG interface. These two devices run a proprietary protocol between them so that together they create the perception of a normal link aggregation group towards Device 1. They also communicate over the ISL so that learning, forwarding, and bridging happen without any loops.
- Communication protocol between Device 2 and Device 3 => proprietary.
- Device 2 and Device 3 must belong to the same vendor.
- Device 1 can be a switch or a server, and need not be from the same vendor as Device 2 and Device 3. Device 1 does not participate in any proprietary protocol.
- If a link from Device 1 goes down, an alternative path is chosen exactly as if one of the links in a normal LAG went down.
- If any links on Device 2 or Device 3 go down, the proprietary communication mechanism between Device 2 and Device 3 decides how to provide alternate connectivity, as if there were a single aggregation group across them, just like a normal LAG (see the sketch after this list).
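One concrete job of that ISL coordination is keeping the two forwarding tables consistent. Here is a conceptual Python sketch of MAC-table synchronization between the peers; the message shapes and port names are invented, since the real mechanism is proprietary as noted above.

```python
class MlagPeer:
    """Toy model of one MLAG member syncing MAC learning over the ISL."""

    def __init__(self, name, peer=None):
        self.name, self.peer, self.mac_table = name, peer, {}

    def learn(self, mac, port):
        # Locally learned MAC: record it, then tell the peer so frames
        # arriving on the peer's MLAG leg are not flooded back.
        self.mac_table[mac] = port
        if self.peer is not None:
            self.peer.sync_from_peer(mac, port)

    def sync_from_peer(self, mac, port):
        # Peer-learned MACs point at the shared MLAG port-channel, so
        # either member can forward toward Device 1 without looping.
        self.mac_table.setdefault(mac, f"mlag:{port}")

d2, d3 = MlagPeer("Device2"), MlagPeer("Device3")
d2.peer, d3.peer = d3, d2
d2.learn("aa:bb:cc:00:00:01", "po1")
print(d3.mac_table)  # Device 3 knows the MAC without seeing the frame
```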
Vendor Offerings
The MLAG service is offered by prominent data center vendors, among which the most famous are:
Cisco's Virtual Port Channel – Cisco provides the MLAG feature under the name Virtual Port Channel (vPC). This feature is available on Nexus 7000 and Nexus 5000 switches. Cisco supports configuring two switches into a vPC domain.
Avaya's Split Multi-Link Trunking (SMLT) – Avaya supports the Split Multi-Link Trunking feature on the Ethernet Routing Switch 8600, 8300, 5x00, and 1600 series. Switches are deployed as SMLT pairs in a cluster.
Extreme Networks' Multi-System LAG – Extreme Networks supports the Multi-System LAG feature to join two switches into an MLAG pair.
Advantages of MLAG
- Can be built on existing LAG deployments.
- A simple software upgrade on the existing infrastructure can bring in the MLAG feature.
Disadvantages of MLAG
- Per-flow hashing pins each flow to one of the member links, meaning that adding a physical link to an MLAG bundle doesn't always result in a commensurate bandwidth boost.
- You can't link switches from two vendors to form an MLAG group. For example, you could uplink a Cisco switch into an Arista MLAG pair, but you can't have an Arista switch and a Cisco switch form the MLAG pair themselves.
- It's also important to understand that an MLAG pair is still two physical switches with minds of their own. Complex communication must therefore be maintained between the pair at all times to ensure a stable, loop-free topology. Understanding how an MLAG pair behaves when communication is lost between the two members is a key design element you will need to review with the vendor from which the switches are bought.
[In my next post, I will take up other L2 Multi-Path technologies]