Monday, March 23, 2009

CCIE vs. CCDE

Many people have asked my opinion about CCIE vs. CCDE from time to time. Which one is better? Should I take the CCIE, or the CCDE along with a mid-level professional certification such as CCNP/CCIP? Which certification will give me a better chance of getting a job? And so on.

I don’t have the CCDE yet, so my answer below should be considered partial. You know, I’m the kind of guy who thinks he should have done or completed something first before giving a full review of it. Been there, done that, then write the review. But that’s just me.

This afternoon I passed the CCDE written exam. I didn’t spend much time preparing. I was busy writing a low-level design, a network migration plan and a pre-sales document for customers in three different countries. And I currently have a personal issue that leaves me really short of time, so studying for the CCDE was my lowest priority.

No time to study, no practice test, no reading books at all.

What I did: I looked at the CCDE written blueprint and realized I had done and implemented most of the technologies listed, either in real-world networks or during my previous CCIE labs. So I just spent a couple of hours over the weekend reading Networkers presentations to refresh my memory on some specific technologies, then walked to the testing center. Networkers slides are priceless, since they explain the technologies in detail with implementation case studies and best practices. The material is now available to the public as Cisco Networkers Virtual, for only €200 per annual subscription.

I don’t know if it’s possible to do the same for the CCDE lab later on. Ah, it is not a real lab actually, but a computer-based exam instead. I guess that while the CCDE written asks about the implications of protocols or designs as separate, unrelated questions, the CCDE lab should offer scenario-based questions: a series of questions we have to answer to build or improve a design. So knowing the implications of running one protocol is not enough; we must know its implications in a specific topology and scenario, with different requirements and other protocols running too. And the scenario in the CCDE lab will start with gathering information about the requirements, before we can decide which design fits those requirements.

There is a practical exam demo on the Cisco Learning Network, as well as some sample questions in the Networkers presentations. I encourage you to check the slides and do the demo.

So here is the fact: the CCIE was built with a main focus on constructing a complex network through practical implementation in the lab, and troubleshooting the issues along the way. We can only troubleshoot something if we know how it works in normal operation. So studying for the CCIE makes us understand how a protocol works in detail, teaches us the limitations and implications of running multiple protocols at the same time, gives us hands-on experience implementing them, and lets us decide the best way to enable the protocols or features to meet the requirements.
But it is not a design exam.

The requirements stated in CCIE questions are not there to test design skill, but to ensure the candidates have a deep understanding of the topic. What kind of design can we do with fewer than 10 routers anyway? We will surely utilize all the lab routers to enable the required protocols and features, without having to know best-practice design or the implications of running those chosen protocols in real-world scenarios.
That is where the CCDE comes into play.

Most of the folks who built the CCDE program hold the CCIE and started their careers as TAC engineers, who know exactly how the protocols work and how to implement them, and have seen how networks fail, before they all moved on to become design engineers. They got their design knowledge by implementing and troubleshooting real-world networks, with different topologies and requirements. And personally, I believe this is the best way to become a real design expert.

So what is my answer to the original question? Which one is better, CCIE or CCDE? Well, it depends, as always :). To get a job in the current market you may want to get the CCIE, or multiple mid-level certifications such as CCNP/CCIP as well as other vendor certifications, depending on the requirements listed in the job description. Especially since the CCDE lab will still need time before it’s available worldwide. But if you really want to become a network designer with solid knowledge and experience, why don’t you follow the path taken by most network designers, including those who made the CCDE program: take the CCIE first, get implementation and design experience in the real world, and then take the CCDE at the end to backfill the missing knowledge, or to certify your skills as a network designer, or just for fun.

I suggest you read the Networkers slides about the CCDE or check out the Cisco Learning Network for more information.

If you had asked me a different question: which one is better, the CCIE or an IE certification from another vendor such as the JxxIE or HxIE? The answer would be easy and straightforward: CCIE! Why, you may ask? Well, because even when those vendors are hiring, most of the time they write: CCIE preferable.

With that, I rest my case.

Saturday, March 07, 2009

Deep Diving Router Architecture, Part III

In the previous two parts we discussed the hardware architecture at length. So where do we go from here? Let’s now discuss the features and applications running on top of the hardware architecture we have been discussing so far. I have run out of publicly available pictures on Google to explain this topic, and obviously I can’t use pictures from my company’s internal documents. So let this part be the picture-less discussion.

The following are a few sample features and applications that are required of a modern, next-generation router:

High Availability (HA) and Fast Convergence
Routers fail eventually. The failure may happen in the route processor module, the power supply, the switch fabric, a line card, or somehow the whole chassis. The key point here is not how to avoid the failure, but how to handle it so as to minimize the time required to switch the traffic to a redundant path or module.

For most of us who like to see a network as a collection of nodes connected to each other, failure comes in only two flavors: link failure or node failure. For these two cases, router vendors have been introducing Fast Convergence (FC) features into their products, such as IGP FC and MPLS TE Fast Re-Route (FRR), to reduce the network convergence time to a minimum. And the key for this type of failure is to detect it as soon as possible. If the nodes are connected by a direct link, Loss of Signal (LoS) may be used to signal the failure to an upper-layer protocol such as the IGP. If it is not a direct link, we may use a feature called Bidirectional Forwarding Detection (BFD), which basically sends hello packets from one end to the other.
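The BFD-style detection logic just described can be sketched in a few lines. This is a toy model, not any real BFD implementation; the class name, timer values and method names are all invented, but the core idea is faithful: declare the neighbor down once a multiple of the hello interval passes without a received hello.

```python
# Toy sketch of BFD-style failure detection: the session goes down once
# (detect_mult * interval) elapses without a hello from the neighbor.
# All names and timer values here are illustrative assumptions.

class BfdSession:
    def __init__(self, interval_ms=50, detect_mult=3):
        self.interval_ms = interval_ms   # hello transmit interval
        self.detect_mult = detect_mult   # missed hellos tolerated before "down"
        self.last_rx_ms = 0              # time the last hello was received
        self.state = "up"

    def rx_hello(self, now_ms):
        """Neighbor's hello arrived: refresh the liveness timestamp."""
        self.last_rx_ms = now_ms
        self.state = "up"

    def poll(self, now_ms):
        """Called periodically; returns True if the session just went down."""
        detect_time = self.interval_ms * self.detect_mult
        if self.state == "up" and now_ms - self.last_rx_ms > detect_time:
            self.state = "down"          # this is the trigger for the IGP to reconverge
            return True
        return False

s = BfdSession()
s.rx_hello(now_ms=0)
assert s.poll(now_ms=100) is False   # within the 150 ms detect time
assert s.poll(now_ms=200) is True    # three 50 ms hellos missed: down
```

The point of the small detect time is exactly what the paragraph above says: fast failure detection is what makes fast convergence possible when there is no direct link to lose signal on.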

When hardware fails, we expect to see packet loss for a fraction of time. In most cases this is inevitable, and the only thing we can do is minimize the packet loss or reduce the convergence time. For a router with redundant route processors, say the primary route processor fails and it has to switch over to the secondary; it can use a feature called Non-Stop Forwarding (NSF) during the switchover, until the secondary route processor is ready to take over completely, to avoid any packet loss. NSF offers some degree of transparency, since the failing node can inform its neighbors that it’s going down :) but promises it will come back online again, so please, neighbors, don’t flush the routes from the routing table for a certain period of time, and please keep forwarding traffic to the failing node.

The failing node itself must use the modular concept explained in the previous discussion, so the forwarding plane runs somewhere other than the route processor, for example on the line cards. Before the failure, the router must run the Stateful Switchover (SSO) feature to ensure the redundant route processor is synchronized with the primary route processor, the fabric and the line cards. During the switchover, while waiting for the initialization of the secondary route processor to complete, packet forwarding is still done on the line cards using the last state of the local forwarding table before the failure. So if the failing node can still forward packets to its neighbors, even using the pre-failure forwarding table, and the neighbors are willing to continue forwarding packets to the failing node because they have been informed it will come back online soon, then we should not see any packet loss at all. Later, the SSO/NSF feature should return the forwarding table to the current state once the secondary route processor has taken over completely.
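The neighbor's side of this NSF contract can be sketched as a simple grace-period timer: when a neighbor signals a restart, keep its routes (marked stale) instead of flushing them, and keep forwarding until the grace period expires. The class, field names and the 120-second value are invented for illustration.

```python
# Illustrative sketch of a neighbor honoring NSF/graceful restart: routes
# from a restarting node are kept as "stale" for a grace period rather
# than flushed, so forwarding continues. Names and timer are assumptions.

class NeighborState:
    GRACE_PERIOD = 120  # seconds; illustrative value only

    def __init__(self):
        self.routes_stale = False
        self.restart_deadline = None

    def neighbor_restarting(self, now):
        # Neighbor promised to come back: mark routes stale, start the timer.
        self.routes_stale = True
        self.restart_deadline = now + self.GRACE_PERIOD

    def keep_forwarding(self, now):
        # Keep using the stale routes only while the grace period lasts.
        return self.routes_stale and now < self.restart_deadline

n = NeighborState()
n.neighbor_restarting(now=0)
assert n.keep_forwarding(now=60) is True    # inside the grace period
assert n.keep_forwarding(now=130) is False  # expired: flush the routes
```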

The new HA feature being pushed recently is Non-Stop Routing (NSR). NSR is expected to offer full transparency to the neighbors. With NSF, the IGP relationship is torn down during the failure, even though the neighbors continue using the routes from the failing node for the agreed period of time. With NSR, the IGP relationship should remain up during the switchover.

If we go back to the hardware design and architecture, we can see now that the first requirement is to have the secondary route processor always synchronized with the other route processor, the fabric and the line cards. If this cannot be achieved, then we should see packet loss during the switchover. Obviously we all understand that if the failure is in a line card or the fabric while traffic is passing through it, we should expect packet loss regardless of any HA features we enable. And for a modular switch fabric architecture, we should have several different fabric modules, and the failure of one module should not affect the total packet-forwarding capacity of the whole switch fabric.

Quality of Service
Quality of Service (QoS), the ability to give differentiated treatment to packets, is a must-have requirement, especially during network congestion. Where exactly may congestion occur?

If we use the carrier class router architecture in Part II, we can see that the congestion may happen on the following:
- Egress queue, a queue in the egress line card before the physical interface: holds packets waiting to be transmitted onto the physical media
- Fabric queue, a queue in the egress line card that receives packets from the switch fabric: it has to normalize the packets received from the fabric, for example if they were converted to fixed-size cells; or it can become congested because the egress queue behind it is congested
- Ingress queue, a queue in the ingress line card before packets are sent to the switch fabric: as a consequence of congestion in the fabric queue or in the fabric itself, this queue can become congested as well

Congestion may happen in the switch fabric itself. But normally a carrier-class router has huge forwarding capacity inside the switch fabric, enough to accommodate a fully loaded chassis with all line cards, unless the switch fabric is modular and the failure of some fabric modules reduces that capacity.

So the key here is that we should be able to differentiate services at many points inside the router. For example, if the egress physical ports are congested, we should be able to ensure that high-priority packets in the egress queue are transmitted first. The same goes for the fabric queue. And even inside the fabric we should be able to prioritize some packets in case the fabric queue or the fabric itself is congested. When there is congestion in the egress queue, it should inform the fabric queue, which in turn informs the ingress queue to slow down sending packets to the fabric. This mechanism is known as back pressure. The communication from the fabric queue to the ingress queue normally goes through a bypass link, not through the fabric, since the intelligent fabric described in Part II is one-directional, from ingress to egress. And slowing down the packets sent to the fabric actually means the ingress packet engine should start dropping low-priority packets, so that a lower rate of traffic reaches the ingress queue.
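The back-pressure chain just described can be modeled with a toy queue: when the egress side congests, the ingress engine admits only high-priority packets, so a lower rate ever reaches the fabric. The queue limit, field names and the collapsed fabric hop are all simplifications invented for this sketch.

```python
# Toy model of back pressure: a congested egress queue makes the ingress
# packet engine drop low-priority packets before they reach the fabric.
# Thresholds and names are illustrative assumptions, not a real design.

from collections import deque

EGRESS_LIMIT = 4      # absurdly small, just to trigger the effect

egress_q = deque()

def egress_congested():
    return len(egress_q) >= EGRESS_LIMIT

def ingress_admit(packet):
    """Ingress engine: under back pressure, only high-priority packets pass."""
    if egress_congested() and packet["prio"] == "low":
        return False               # dropped at ingress, never hits the fabric
    egress_q.append(packet)        # (fabric hop omitted for brevity)
    return True

for i in range(EGRESS_LIMIT):
    ingress_admit({"prio": "low", "seq": i})    # fill the egress queue

assert ingress_admit({"prio": "low", "seq": 99}) is False   # squeezed out
assert ingress_admit({"prio": "high", "seq": 100}) is True  # still admitted
```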

It is clear now where we can deploy QoS tools at different points inside the router. Policing, for example, should be done in the ingress packet engine. The egress queue can use shaping or queuing mechanisms and congestion avoidance tools. The fabric queue may only need to be able to inform the ingress queue when there is congestion.

Btw, the QoS marking used inside the router is normally derived from the marking set on the packet, such as CoS, DSCP or EXP. When the packet travels within the router, the external marking is used to create an internal marking that is used along the forwarding path until the packet leaves the router. It should be the task of the ingress packet engine to do the conversion.
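That external-to-internal conversion is essentially a table lookup at the ingress packet engine. The mapping below is invented for illustration (the DSCP values match common conventions, but the internal class names are made up); real routers keep such a table in hardware.

```python
# Sketch of external-to-internal QoS marking conversion at the ingress
# packet engine: the packet's DSCP maps to an internal traffic class used
# along the forwarding path. The mapping table itself is an assumption.

DSCP_TO_INTERNAL = {
    46: "priority",     # EF: e.g. voice
    26: "gold",         # AF31: e.g. business data
    0:  "best-effort",
}

def internal_class(dscp):
    # Unknown external markings fall through to best-effort treatment.
    return DSCP_TO_INTERNAL.get(dscp, "best-effort")

assert internal_class(46) == "priority"
assert internal_class(63) == "best-effort"
```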

One other important QoS point is support for the recent hierarchical QoS model. In a normal network, a packet that comes to the router has only one tag or identifier to distinguish the priority of the packet for a given source or flow. In an MPLS network, the tag is the EXP bits. In a normal IP network, the identifier can be CoS or DSCP. They are all associated with only one type of source or flow, so only one QoS action needs to be applied. But what if there are multiple tags, and different QoS tools are required for different tags? Say that in a Carrier Ethernet environment the packet reaching the router carries two 802.1q tags: the S-tag to identify the provider’s aggregation point, for example, and the C-tag to identify different customer VLANs (this is known as Q-in-Q). We may want to apply a QoS action to the packet as a unit, meaning we apply QoS to the S-tag, but we also want to apply QoS per C-tag. This means the router must support a hierarchical QoS model, where the parent QoS class affects the whole packet while the child classes can be specific to each customer tag.

Multicast
In a network of multiple nodes, multicast traffic means a single packet coming from one source gets replicated to multiple nodes, depending on the requests to join the multicast group. Now it’s time to look in more detail and ask the question: who does the replication inside the router?

A multicast packet can be distinguished easily by its destination multicast group address. Inside the router, the replication can happen in the ingress line card, called ingress replication, or in the egress line card, called egress replication. Using a multicast control protocol such as PIM, the ingress line card should be able to learn the destination line cards for any multicast group address. Let’s say we have two ports in the ingress line card, and a multicast packet (S,G) is received on one port. From the lookup, the ingress packet engine or network processor finds out that the other port on the same line card is interested in the multicast group, as well as some other line cards. The ingress line card may then do ingress replication: replicate the packet into multiple copies and send them to the other port on the same line card as well as to the other line cards.

Now, if we always do ingress replication there is a huge drawback in terms of performance. Let’s say the rate of multicast traffic received by the ingress line card is X Gbps, and there are 10 egress ports, on different line cards, interested in the multicast group. If ingress replication is done, the ingress card must make 10 copies of each packet, so the total rate sent from the ingress line card to the switch fabric is now 10X Gbps. In this scenario it’s better to use egress replication, since the ingress line card just needs to send a single copy to each interested egress line card. And if there are multiple interested ports on an egress card, the replication can be done by that egress line card to send the same packet to all those ports. Egress replication avoids the unnecessarily huge amount of traffic in the ingress queue and the fabric that ingress replication would have caused.
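The fabric-load arithmetic above can be written out: ingress replication puts one copy per interested egress port onto the fabric, while egress replication puts at most one copy per interested egress line card. The function and its parameters are invented for illustration.

```python
# Fabric load for ingress vs. egress replication, per the discussion above.
# A sketch with invented names; ceiling division gives the number of egress
# line cards that must each receive one copy under egress replication.

def fabric_load_gbps(rate_gbps, interested_ports, ports_per_card,
                     ingress_replication):
    if ingress_replication:
        return rate_gbps * interested_ports       # one copy per egress port
    cards = -(-interested_ports // ports_per_card)  # ceiling division
    return rate_gbps * cards                      # one copy per egress card

# 1 Gbps stream, 10 interested ports concentrated on 2 cards of 5 ports each:
assert fabric_load_gbps(1, 10, 5, ingress_replication=True) == 10   # 10X
assert fabric_load_gbps(1, 10, 5, ingress_replication=False) == 2   # 2X
```

When every interested port sits on a different card the two schemes load the fabric equally; the win grows with port fan-out per egress card, and fabric replication (next paragraph) reduces the ingress contribution to a single copy.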

In a carrier-class router, the switch fabric is more intelligent: it can replicate multicast packets inside the fabric itself. So again, the ingress line card just needs to send a single packet to the fabric; based on the interested egress line cards, the fabric replicates the packet and sends it to those cards, and each egress line card can do another replication if more than one of its ports is interested in the multicast group.

Performance and Scalability
Once you have reached this point, I guess you have started asking questions in your head for any feature or protocol: is it done in hardware or software? Is it done by the central CPU or distributed to the line cards? Is it done in the ingress line card or the egress? If yes, then good; we are finally making progress here.

Before I continue I would like to mention one critical hardware component in the forwarding plane: Ternary Content Addressable Memory (TCAM). In simple words, TCAM is a high-speed memory used to store the entries of the forwarding table, or of other features such as access control lists, in order to do high-performance hardware switching. Remember the concept of pushing the forwarding table to the line card processor, then from the line card processor to the hardware? TCAM is used to store that information. So now you know: we should ensure there is enough space there to keep the information; in other words, the TCAM is one limiting point in the forwarding path. If the route processor pushes more forwarding entries than the TCAM can handle, we may end up with an inconsistent forwarding table between the route processor and the line card. This means that even though the route processor knows what to do with a packet, the hardware may not have the entry and will just drop it.
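The TCAM-overflow failure mode is easy to demonstrate with a toy table: once the hardware table is full, routes the route processor knows about never reach the forwarding hardware, and matching packets are dropped. Capacity, names and the dictionary stand-in for real TCAM are all assumptions made for this sketch.

```python
# Sketch of TCAM overflow: a full hardware table silently diverges from the
# route processor's view. Sizes and names are invented for illustration.

TCAM_CAPACITY = 4   # absurdly small, just to show the effect

tcam = {}

def program_route(prefix, next_hop):
    """Line card tries to install a FIB entry into TCAM."""
    if len(tcam) >= TCAM_CAPACITY:
        return False          # entry never reaches hardware: inconsistency
    tcam[prefix] = next_hop
    return True

def hw_forward(prefix):
    return tcam.get(prefix)   # a miss means a drop in this simplified model

for i in range(5):
    program_route(f"10.0.{i}.0/24", "eth1")   # fifth install fails

assert hw_forward("10.0.3.0/24") == "eth1"
assert hw_forward("10.0.4.0/24") is None   # RP knows it, hardware doesn't
```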

Looking at the modular architecture of a next-generation router, it is clear that in order to achieve non-blocking or line-rate packet switching, every component in the forwarding path must support line-rate performance. It means that if we want to forward X Gbps of traffic without any congestion, then every component, from the ingress processor and queue in the ingress line card, through the capacity of the fabric and the fabric queue, to the egress processor and egress queue in the egress line card, should be able to process X Gbps or more. So if you want to know where the bottleneck inside the router is, check the processing capacity of each component. If you know the capacity from the ingress line card to the fabric is only X Gbps, but you put more ports in the ingress line card with a total capacity of more than X, you are oversubscribing. And by knowing the congested point, you can figure out which QoS tools to apply and where exactly to apply them. In this example, using egress QoS won’t help, since the congestion point is the queue toward the fabric, not the egress.
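The oversubscription check above is a one-line ratio: summed port capacity on the card versus the card's capacity toward the fabric. The function name and sample figures are invented for illustration.

```python
# Oversubscription ratio for a line card, per the reasoning above:
# total port bandwidth divided by the card-to-fabric bandwidth.
# Function name and numbers are illustrative assumptions.

def oversubscription_ratio(port_gbps, num_ports, to_fabric_gbps):
    return (port_gbps * num_ports) / to_fabric_gbps

# Four 10G ports behind a 20G connection to the fabric -> 2:1 oversubscribed,
# so the congestion point is the ingress queue toward the fabric, and that
# is where the QoS tools must act (egress QoS would not help here).
assert oversubscription_ratio(10, 4, 20) == 2.0
assert oversubscription_ratio(10, 2, 20) == 1.0   # line rate, no congestion
```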

Now, why bother to keep increasing route processor performance, if we know the actual forwarding is done in the line cards? Well, because we still need the route processor for the control plane. You need a good CPU to process large numbers of IGP or BGP control packets. You still need plenty of memory to store the routes received from neighbors before they can be pushed down to the hardware. You also need good storage capacity to keep the router software image as well as system logging and crash dump information.

NGN Multi-Service Features and Application

It is common for a next-generation network to carry multiple different services. Common applications, other than multicast for IPTV, are MPLS L3VPN for business customers, Internet, point-to-point L2VPN and multipoint L2VPN with VPLS, and so on. The complexity comes when we have to combine and run these features at the same time.

For example, in an MPLS-based network, the label imposition for the next hop is done in the ingress line card. But what if we run another feature, such as a type of L2VPN that is software-based or performed in the route processor? We may need to do the label imposition in the egress line card for that reason.

And what if we have to do multiple lookups? For example, we may have to remove two MPLS labels on the last label switch router when Penultimate Hop Popping (PHP) is not used in an MPLS L3VPN network. First we need a lookup to know what to do with the topmost MPLS label. Most probably we want to keep the topmost label to get the EXP bits for QoS. Then we have to do another lookup on the VPN label, the second one in the stack, to associate it with a VRF. Last, after all the MPLS labels have been stripped off, we still need another lookup in the IP forwarding table to know through which egress interface to send the packet. Doing several lookups in the same location, such as ingress, may introduce us to the concept of recirculation, where the packet is looped inside the ingress line card: after the first lookup the packet is not sent to the fabric, but instead gets its layer 2 information rewritten with the ingress line card itself as the destination, and is sent back to the first piece of hardware that processes incoming packets. So it looks like just another packet that needs to be processed by the line card.
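The three lookups in that no-PHP case can be sketched as a short function: pop the transport label (keeping its EXP bits for QoS), resolve the VPN label to a VRF, then do an IP lookup inside that VRF. The label values, VRF names and table layouts are invented; real hardware does this with recirculation rather than a single function call.

```python
# Sketch of the triple lookup on the last LSR without PHP, as described
# above. Labels, VRF names and tables are illustrative assumptions.

VPN_LABEL_TO_VRF = {24001: "CUST_A"}
VRF_FIB = {"CUST_A": {"192.168.1.0/24": "GigE0/1"}}

def last_hop_lookup(label_stack, dst_prefix):
    top = label_stack.pop(0)            # lookup 1: topmost transport label
    exp = top["exp"]                    # keep EXP bits for the internal QoS class
    vpn = label_stack.pop(0)            # lookup 2: VPN label -> VRF
    vrf = VPN_LABEL_TO_VRF[vpn["label"]]
    egress = VRF_FIB[vrf][dst_prefix]   # lookup 3: IP lookup inside the VRF
    return exp, vrf, egress

stack = [{"label": 16001, "exp": 5}, {"label": 24001, "exp": 5}]
assert last_hop_lookup(stack, "192.168.1.0/24") == (5, "CUST_A", "GigE0/1")
```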

Multicast VPN can give us a different challenge. But just to summarize: by knowing how a protocol or feature works, and which component inside the router performs each task related to that feature, we can foresee any issues that may occur during the implementation of the design. And we may be able to find a workaround to overcome them.

Frankly speaking, I really can’t go into more detail, for various reasons. First, it’s already 4 am. I have been awake for almost 48 hours writing this Deep Diving trilogy while doing some other things at the same time, so I’ve got to sleep. Have I mentioned how grateful I am to whoever invented Red Bull? But for now, even the strongest energy drink won’t make me last forever.

Second, although I want to write more on this subject, I may not be able to do so. It’s really difficult to go into more detail while still avoiding confidential information from my company. Oh well, let’s see how it goes. I may have a fresh idea after getting some proper sleep.

Good night.
End of the trilogy.

Friday, March 06, 2009

Deep Diving Router Architecture, Part II

So in the first part I explained the basics of the internal packet switching process inside a router. Normally we look at a router just as a node with multiple interfaces, and our focus is on how the router communicates with the others to build the routing table. Once the table has been built, we can assume the packet goes in one interface and out another, depending on the destination. In the case of a multicast packet, one packet goes in one interface and out several other interfaces, depending on the join requests for the multicast group. And even when there are features such as filters and Quality of Service, if we see a router as a simple node with ingress and egress interfaces, we normally think the features are applied either in the ingress or egress direction relative to the router, and that they just work like magic.

Now you can see that there are several other tasks that are as important as building the routing table. The first is to build the forwarding table based on the routing table. The forwarding table contains the next-hop information and next-hop interface for each destination, just like the routing table, with the addition of the layer 2 information of the next hop. The packet must be sent out with a new layer 2 header, so it is important to rewrite this information onto the packet. The next task is the lookup process to match the destination against the entries in the forwarding table. The packet must be stored somewhere while waiting for the lookup to complete. Then the packet must be moved to a different location (in older routers the actual packet may stay in the same physical memory location, with different pointers to distinguish the state before and after the lookup). Last but not least is applying features or policies to the packet inside the router. It’s really crucial to understand what the above tasks are, as well as where exactly they are done.
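A minimal data-structure sketch shows what the forwarding table adds on top of the routing table: the pre-resolved layer 2 rewrite string alongside the next hop and egress interface. Every field value here is invented for illustration.

```python
# Sketch of a forwarding-table (FIB) entry as described above: next hop,
# egress interface, plus the layer-2 rewrite the hardware needs. All
# addresses and names are illustrative assumptions.

fib = {
    "10.1.1.0/24": {
        "next_hop": "192.0.2.1",
        "egress_if": "GigE0/0",
        "l2_rewrite": "00:1b:0c:aa:bb:cc",  # next hop's MAC, pre-resolved
    },
}

def forward(dst_prefix):
    entry = fib[dst_prefix]
    # Rewrite the frame's destination MAC and hand it to the egress interface.
    return entry["egress_if"], entry["l2_rewrite"]

assert forward("10.1.1.0/24") == ("GigE0/0", "00:1b:0c:aa:bb:cc")
```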

First, let’s all understand the concept of separating the router into two planes: control, and forwarding or data. Actually there is a third one, called the management plane, which is used to connect to, interact with and manage the router itself, but let’s just focus on the first two. The control plane is where all communication between routers using routing protocols happens, in order to build the routing table and the forwarding table used to switch packets from ingress interface to egress interface. The process of switching the packet between interfaces in the same router is part of the data or forwarding plane.

Let’s look at a brief architecture of a next-generation, carrier-class router in the picture below.


The architecture uses a modular concept where most of the important tasks are performed in different locations by different components. This contrasts sharply with the simple architecture in Part I, where there is only a single main board, a central route processor and memory, and PCI bus communication to move the packet from the network card to the processor and back to the network card. The route processor is still the main brain of the system, but the function of switching packets, including the lookup, can be done by different hardware altogether. The network card or line card may have its own processor to do the lookup and dedicated hardware to do the actual packet switching. And to connect the different line cards we use a module called the switch fabric, known as the backplane of the router. The modular approach is chosen to address the challenges of scalability and to avoid an all-in-one approach where one module can become a single point of failure for the whole system.

So the central route processor can now be considered as just another card, and it is still required to do the control plane function: running the routing protocols with other routers to build the routing and forwarding tables that can be pushed to the network processor in the line card. Once the line card has this information, it will be able to do the lookup and the layer 2 rewrite on the packet. To increase performance during switching, or when applying features such as packet filters, we can have dedicated hardware programmed to do only specific instructions, called an Application Specific Integrated Circuit (ASIC).

The picture below describes how the forwarding information is built by the central route processor and then pushed to the network processor in the line card.


The route processor uses routing protocols such as ISIS, OSPF and BGP to build the Routing Information Base (RIB), known as the routing table. In next-generation networks, it is common to switch packets not on IP information but on the MPLS label. So the MPLS label for a specific route or destination IP prefix is communicated and agreed among the routers using the label distribution protocols: LDP, RSVP or even BGP. Obviously the label distribution protocols depend on the underlying routing protocol for the routers to communicate with each other. And the routing table is used along with the label database to build the Label Forwarding Information Base (LFIB). Where the Forwarding Information Base is derived from the routing table and contains the next-hop IP information, the next-hop interface and the layer 2 information to be rewritten onto the packet, the LFIB contains the next-hop information along with the MPLS labels that need to be popped from or pushed onto the packet before it can be sent out the egress interface.
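The LFIB derivation just described, joining each RIB prefix's next hop with the label that next hop advertised, can be sketched in a few lines. Prefixes, router names and label values are invented, and real label tables carry more state (incoming label, swap vs. push vs. pop), but the join itself is the core idea.

```python
# Sketch of deriving an LFIB from the RIB plus label bindings learned via a
# label distribution protocol such as LDP, per the paragraph above.
# All prefixes, neighbors and labels are illustrative assumptions.

rib = {"10.1.1.0/24": {"next_hop": "R2", "egress_if": "GigE0/0"}}
ldp_bindings = {("R2", "10.1.1.0/24"): 16001}   # label advertised by R2

def build_lfib(rib, bindings):
    lfib = {}
    for prefix, route in rib.items():
        # Join the route's next hop with the label that next hop advertised.
        out_label = bindings.get((route["next_hop"], prefix))
        lfib[prefix] = {
            "out_label": out_label,       # label to push/swap on the packet
            "egress_if": route["egress_if"],
        }
    return lfib

lfib = build_lfib(rib, ldp_bindings)
assert lfib["10.1.1.0/24"] == {"out_label": 16001, "egress_if": "GigE0/0"}
```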

Both the forwarding table and the label forwarding table can be pushed to the network processor in the line card using an Inter Process Communication (IPC) interface. If all incoming packets are processed by the network processors, then we have distributed the processing challenge from a central model to a distributed one. Going a bit further, the network processor can build specific instructions defining what action needs to be taken on packets arriving at the line card, and push this information to hardware built to run specific instructions, such as an ASIC. ASICs nowadays can process packets not only at layer 2 but at layers 3 and 4 as well, to deploy features such as packet filters. And the layer 2 and layer 3/4 processing can be split across two different ASICs for performance reasons.

Everything up to the point where the forwarding information is pushed to the line card and to the specific hardware is part of the control plane. The actual packet switching by the hardware or ASIC from one line card to another is part of the forwarding plane.

The carrier-class router from Cisco extends the modular concept even further by introducing the concept of the Modular Services Card (MSC). The line card is separated into two components: the physical part and the intelligent part. The physical part (known as the PLIM, Physical Layer Interface Module) deals with everything at layer 1 of the TCP/IP stack, including providing the physical ports where we plug in the cables. The MSC is the one that does the upper-layer processing once the PLIM has constructed the bits or digital signals from the network media into a single TCP/IP packet. The purpose is obviously to address scalability: the physical part can be replaced or upgraded while the MSC remains the same, or if one day we want to upgrade the MSC capacity we can do so without touching the physical cabling on the ports.


Let’s look a bit closer at how a packet gets processed inside the line card using the MSC architecture above. This is a very famous discussion known as the Life of a Packet.

From the PLIM the packet is sent to the MSC through the midplane (you can also consider this process as happening in a single line card without the separation into PLIM, midplane and MSC). Then the packet is processed by the ingress packet engine, which has all the information and instructions received from the line card processor to decide what to do with every incoming packet. Once it has been decided to send the packet to another line card, or to the route processor (for cases where the packet is destined to one of the router’s own IP addresses, or is a control packet to manage the router), the packet must be sent to the backplane or switch fabric, with an additional internal header to ensure that only the destination line card will receive it. In some architectures, packets traveling across the fabric must be standardized or normalized to a fixed size or length, because it is easier and faster for the hardware to process packets of the same size. In some architectures the packet is converted to a different format (such as fixed-size cells with a new header) when it travels across the backplane. So there should be a buffer, a place to queue the packets before they can be transmitted into the backplane. The backplane itself is another module or card designed specifically to connect all the other line cards. We will discuss the backplane or switch fabric later.

From the backplane the packet is transmitted to the destination line card, and obviously this requires another buffer or queue to convert the packet back to its original format or length. Then there is another process in the egress packet engine, in case some features need to be applied. For specific cases, MPLS label imposition to push the label can happen here. In most cases the Layer 2 rewrite or MPLS label imposition is done in the ingress engine, so the egress engine doesn't need to do any lookup or further processing other than applying additional features in the egress direction. In a carrier-class router the egress engine can be the same packet engine as the ingress one, or it can be separate hardware to guarantee performance. Before the packet is sent out through the physical interface, there should be another queue where packets wait their turn before they can be processed and moved onto the network media.

When you look at the physical layout of the card in the picture below, it is really easy to identify each component, and you can see there are several different chips doing different tasks. The forwarding path from the ingress physical port to the switch fabric and back out through the egress physical port can be seen clearly.


In some other architectures, the hardware that processes the incoming packet cannot do the lookup itself, so it has to consult the route processor; but it can send the packet to the destination line card directly over the backplane. When the ingress line card receives the packet, it may put it in a buffer or queue while waiting for the route processor to do the lookup and give the instruction where to send it. This means the ingress line card doesn't have to send the whole packet to the route processor; instead it can make a copy of only the Layer 3 header and send that to the route processor. Once the ingress line card knows the destination line card, it can send the packet out to the egress line card directly.
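The "copy only the header" optimization can be sketched as follows, with all structures and names invented: the whole packet stays buffered on the ingress card, only the Layer 3 header crosses to the route processor, and the answer is the egress card to forward to.

```python
# Hypothetical sketch: buffer the full packet locally, punt only the
# Layer 3 header to the route processor, then forward the buffered
# packet straight to the egress card the route processor names.

L3_HEADER_LEN = 20  # IPv4 header length without options

def route_processor_lookup(l3_header: bytes, table: dict) -> int:
    """The route processor only ever sees the header copy."""
    dest = l3_header[16:20]        # destination address field of an IPv4 header
    return table.get(dest, -1)     # -1 means no route

def switch_packet(packet: bytes, table: dict):
    buffered = packet                        # full packet waits on the ingress card
    header_copy = packet[:L3_HEADER_LEN]     # only this crosses to the route processor
    egress_card = route_processor_lookup(header_copy, table)
    return egress_card, buffered             # ingress card sends the packet itself

table = {b"\x0a\x00\x00\x01": 7}             # 10.0.0.1 -> egress line card 7
pkt = bytes(16) + b"\x0a\x00\x00\x01" + b"payload"
card, data = switch_packet(pkt, table)
print(card)  # 7
```

The saving is in the bus traffic to the route processor: 20 bytes per packet instead of the full frame.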

When we discuss the switch fabric or backplane, very basic and mid-range routers may still use a bus architecture, as shown in the picture below. Even if the line card has its own processor to do the lookup, with a bus backplane the packet sent from the ingress line card will be received by all the other line cards.


The bus uses a concept similar to Ethernet: the ingress line card puts the packet onto the bus, all the other line cards can receive it, and only the switching engine or the destination egress line card takes the packet to process it further. You can see immediately that the bottleneck of this system is the capacity of the backplane.
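The bottleneck can be put in numbers. A shared bus serializes all senders, so aggregate throughput is capped at the bus capacity no matter how many line cards want to talk; the figures below are invented purely for illustration.

```python
# Toy model of the bus bottleneck: total throughput across all line
# cards can never exceed the shared bus capacity.

def aggregate_throughput_gbps(n_active_cards: int, card_rate_gbps: float,
                              bus_capacity_gbps: float) -> float:
    demanded = n_active_cards * card_rate_gbps
    return min(demanded, bus_capacity_gbps)   # the bus is the hard ceiling

# 8 cards each wanting 10 Gbps over a hypothetical 32 Gbps bus:
print(aggregate_throughput_gbps(8, 10.0, 32.0))   # 32.0  (bus-limited)
print(aggregate_throughput_gbps(2, 10.0, 32.0))   # 20.0  (card-limited)
```

A crossbar, discussed next, removes exactly this serialization by letting multiple disjoint ingress/egress pairs transfer in parallel.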

A better backplane architecture uses the crossbar, shown below.


With a crossbar, each ingress line card can send a packet to any other line card at any given time. But since an egress line card can only receive from one ingress line card at any point in time, there must be a controller or scheduler to ensure that only one ingress line card connects to each egress line card. The controller can be integrated as part of the switch fabric, or it can be a separate external module to offer scalability and redundancy.
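What the scheduler has to solve each time slot can be sketched with a greedy matching. Real crossbar schedulers (iSLIP-style iterative matching, for instance) are far more refined and fair; this minimal version only illustrates the constraint that no egress may hear from more than one ingress at once.

```python
# Minimal greedy crossbar scheduler: grant (ingress, egress) pairs for
# one time slot such that every ingress sends to at most one egress and
# every egress receives from at most one ingress.

def schedule(requests):
    """requests: list of (ingress, egress) pairs wanting to send this slot."""
    granted, busy_ingress, busy_egress = [], set(), set()
    for ingress, egress in requests:
        if ingress not in busy_ingress and egress not in busy_egress:
            granted.append((ingress, egress))
            busy_ingress.add(ingress)
            busy_egress.add(egress)
    return granted

# Ingress 0 and 1 both want egress 2; only one can win this slot.
grants = schedule([(0, 2), (1, 2), (3, 4)])
print(grants)  # [(0, 2), (3, 4)]  -- (1, 2) must wait for the next slot
```

Note that (0, 2) and (3, 4) are granted in parallel, which is the gain over the bus: disjoint pairs no longer block each other.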

There are router architectures that still have both a crossbar fabric and a bus. The bus may still be required for backward compatibility: old line cards may not have the new fabric connection to the backplane, so they have to use the bus, and a newer line card that already has a fabric connection still needs a connection to the bus if it has to send packets to a bus-only line card. In some cases the bus is also still used by line cards to send packets to the central route processor.

The latest switch fabric technology is very intelligent: it can do lookup and packet replication within the fabric and provide full line-rate connectivity to the egress. For example, if each line card is connected to the fabric with an X Gigabit per second connection, then at any given point in time, as long as the traffic sent to an egress line card is still less than or equal to X Gigabits per second, it will flow without any congestion even if the packets come from multiple ingress line cards. In a carrier-class router, the capacity to receive packets from the fabric is normally double or more the capacity to send to the fabric. That means if each line card can send X Gbps to the backplane, each line card can receive 2 to 2.5X Gbps from the backplane, to accommodate multiple ingress line cards sending packets to the same egress line card at the same time.
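The speedup arithmetic above can be checked directly; the rates below are made up, only the 2X receive-side speedup mirrors the text.

```python
# Congestion check for an egress line card behind a fabric with
# receive-side speedup: the card sends at most X Gbps into the fabric
# but can absorb speedup * X Gbps coming out of it.

def egress_congested(ingress_rates_gbps, send_capacity_gbps, speedup=2.0):
    """True if the combined ingress bursts exceed the egress receive capacity."""
    receive_capacity = speedup * send_capacity_gbps
    return sum(ingress_rates_gbps) > receive_capacity

X = 40.0  # each card sends up to 40 Gbps into the fabric (invented figure)
print(egress_congested([40.0, 40.0], X))        # False: 80 <= 2 * 40
print(egress_congested([40.0, 40.0, 40.0], X))  # True: 120 > 80
```

So with a 2X speedup, two full-rate ingress cards can burst to the same egress card simultaneously without loss, but a third one overwhelms it, which is where the backpressure mechanism described next comes in.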


In this type of fabric there can be a bypass link between the ingress line card and the egress line card, but this bypass link is not used to forward the actual packets. Usually the link is used by the egress line card to inform the ingress line card when there is congestion, so the ingress line card can slow down the rate of packets sent into the switch fabric.

There are other things to discuss while the packet is in the fabric. As I mentioned before, the packet can be standardized into a fixed size (by fragmenting the packet if it is larger than the threshold and adding padding if it is smaller). By converting the packet into an internal format, such as fixed-size cells with an internal header, the processing inside the switch fabric can be faster. In a carrier-class router there are different stages of the fabric, so even inside the fabric a lookup process needs to be done to ensure the packet is sent only to the right egress line card. This is why there is a new internal header: the lookup process in the fabric may not be the same as the lookup in the ingress line card processor, which is based on the IP or MPLS label forwarding table. If the fabric doesn't do any lookup, then it is up to the ingress line card to put on the internal header that identifies the destination egress line card. With the internal header added to the packet, in the case of a crossbar fabric the controller can determine which egress line card this ingress line card should be connected to, so the packet reaches the destination line card. This internal header can be considered additional overhead on the packet inside the fabric.
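The normalization step can be sketched end to end. The cell size and header layout here are invented (real fabrics use their own formats): larger packets are fragmented into fixed-size cells, the last cell is padded, and each cell carries a small internal header so the egress side can reassemble.

```python
# Sketch of fixed-size cell segmentation and reassembly for the fabric.
# Header layout is made up: dest card, sequence number, last-cell flag,
# pad length.

CELL_PAYLOAD = 48   # assumed fixed cell payload size

def segment(packet: bytes, dest_card: int):
    """Chop a packet into fixed-size cells, padding the final one."""
    chunks = [packet[i:i + CELL_PAYLOAD]
              for i in range(0, len(packet), CELL_PAYLOAD)] or [b""]
    cells = []
    for seq, chunk in enumerate(chunks):
        pad_len = CELL_PAYLOAD - len(chunk)
        last = 1 if seq == len(chunks) - 1 else 0
        header = bytes([dest_card, seq, last, pad_len])
        cells.append(header + chunk + bytes(pad_len))
    return cells

def reassemble(cells):
    """Strip headers and padding to recover the original packet."""
    packet = b""
    for cell in cells:
        dest, seq, last, pad_len = cell[:4]
        payload = cell[4:]
        packet += payload[:CELL_PAYLOAD - pad_len] if pad_len else payload
    return packet

pkt = bytes(range(100))                  # a 100-byte packet becomes 3 cells
cells = segment(pkt, dest_card=5)
print(len(cells), reassemble(cells) == pkt)  # 3 True
```

The 4 bytes of header per 48-byte cell, plus the padding in the last cell, is exactly the internal overhead the paragraph above mentions.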

Up to this point, do you still think the knowledge of internal packet switching is not important? Well, my friend, it seems you really want to push your luck. So please continue to the next part, where I will try to explain the implications of the hardware architecture for the features and applications running on top of it.

End of part two.

Thursday, March 05, 2009

Deep Diving Router Architecture, Part I

When I was young (how old do you think I am now?) I used to look at a router as a “black box” or just a node. I mean, I was not interested in the internal packet switching process inside the router itself, and I focused more on the protocols and features that run between nodes. Well, actually, interest is not the best word to describe it. If you don’t work for a company that makes routers, do you think you can get detailed information about what is really going on inside the box? Now I’m still young (I guess), but at least I have had the chance to dive deep down to the architecture level of router hardware.

And actually, such knowledge is not always required in our daily job anyway. Most network engineers, even CCIEs, may just assume that the router is a box with multiple interfaces whose function is to forward the packet to the next hop based on the routing table built from dynamic or static routing protocols. Then we put more focus on the communication between routers to build that routing table, instead of the packet switching process from one interface to the other inside a router. In OSPF, the LSA packets, database and SPF calculation can be very complex and give us lots of headaches, especially if we have to do redistribution with another IGP protocol or BGP and so on. So once we can see the routes in the routing table, and there is no other treatment such as a filter or policy, we normally happily assume that the packet will be processed and forwarded to the next hop. Then we can focus on the other features or applications that run on top of the routing, which will probably give us another kind of headache entirely.

So for most of us mere mortals, it may be enough to say that packet switching within a router means switching the packet from the ingress (input) interface to the egress (output) interface. In CCIE we do need to dig a bit deeper, for example when we have to determine the sequence of feature execution in the router. Does NAT come first, or the Access Control List? How about policy-based routing that overrides the routing table? And so on. But we never really bother to look at which internal part of the router does this or that. Later I can explain why most of us don’t bother, other than the lack of resources available to learn it.

Why is it important to understand the internal packet switching?
For me personally, it is to understand the limitations that the hardware places on protocol or feature implementations. And this is important for any design engineer. I mean, we can build a network design that specifies the number and type of hardware for core routers, aggregation, access and so on. Then we recommend the protocols and features to be enabled, and come up with a nice and complete configuration to be pasted into the box. In reality, there is a standard for a protocol but every vendor may implement it differently, depending on their interpretation of the standard, or perhaps because they invent their own approach to following it. And some features, or the way the protocols are implemented, depend on the hardware architecture. We may end up in a situation where the new network has been up and running, and only after some time do we start noticing a performance or scalability issue due to hardware limitations inside the routers, when we really have heavy traffic in the network or when we want to expand the design.


A very simplified packet switching process is shown in the picture above. The packet travels on the wire with Layer 3 and Layer 2 header information as per the TCP/IP protocol stack. The interface processor in a router picks it up, inspects and strips the Layer 2 header, and sends it to the route processor for further processing. While it waits for the route processor to do a Layer 3 lookup in the routing table (and forwarding table) to check what should be done with it, the packet itself must be stored in a queue or buffer. Once the next hop is determined, the route processor knows which interface it should send the packet to. Then the packet can be moved to an output queue to wait before it is transmitted back onto the wire, gets rewritten with the new Layer 2 header containing the information of the next hop, and finally leaves the box. The input and output queues can be virtual, referring to the same physical memory, so the packet never actually moves anywhere. But this makes it possible to apply different treatment while the packet is considered to be in the input queue (before the lookup) and when it is already in the output queue, after the lookup has been done and the destination interface for the packet has been determined.

So the keywords are: Layer 3 and Layer 2 headers, input queue, routing table and forwarding table, lookup, moving the packet between different locations or queues, output queue, and Layer 2 rewrite.
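The whole simplified path can be written out with those keywords as stages. Everything here is an invented toy (four-byte “addresses”, dictionary tables): it only shows the order of operations, not any real router’s data structures.

```python
# Toy walk-through of the simplified switching path: strip Layer 2,
# input queue, Layer 3 lookup, output queue, Layer 2 rewrite.
from collections import deque

L2_LEN = 14  # Ethernet header length

def switch(frame: bytes, fib: dict, adjacency: dict):
    packet = frame[L2_LEN:]                 # strip the Layer 2 header
    input_queue = deque([packet])           # wait here for the lookup
    packet = input_queue.popleft()
    dest = packet[:4]                       # pretend the first 4 bytes are the dest IP
    egress_if = fib[dest]                   # the Layer 3 lookup
    output_queue = deque([packet])          # wait here for the wire
    packet = output_queue.popleft()
    new_l2 = adjacency[egress_if]           # Layer 2 rewrite for the next hop
    return egress_if, new_l2 + packet

fib = {b"\x0a\x00\x00\x01": "Gi0/1"}        # hypothetical forwarding table
adjacency = {"Gi0/1": b"\xaa" * L2_LEN}     # hypothetical next-hop L2 header
egress, out_frame = switch(b"\xbb" * L2_LEN + b"\x0a\x00\x00\x01payload",
                           fib, adjacency)
print(egress)  # Gi0/1
```

The two `deque` objects stand in for the input and output queues; as the text notes, in a real box they can be virtual views over the same memory.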

Let’s see it once again in more detail. Here is a snapshot from Vijay Bollapragada’s Inside Cisco IOS Software Architecture book, for the most basic switching process, called process switching.


Once the interface processor receives the packet from the network media on the input or ingress interface, it has to store it in a buffer or memory (1), and at the same time it has to interrupt the processor (2) to inform it that there is a packet that needs to be processed. The book focuses on software architecture, so it explains how the processor then invokes a process (3), called ip_input in Cisco IOS, to start doing the lookup in the routing and forwarding tables. This lookup determines the output or egress interface the router needs to send the packet out of, along with the Layer 2 information that needs to be written to the packet before it can be sent (4). The processor then does the Layer 2 rewrite (5) and moves the packet to be processed by the egress interface processor (6), and off the packet goes back to the network media. Step 7 just informs the main processor that the packet has been sent out, so the memory can be freed and the packet counter on the interface can be incremented.

I have to admit that I won’t be able to explain it as well as Vijay (and his co-authors) do, so I suggest reading the book for those who are still curious. But my point here is just to emphasize that there are different tasks to be done besides the lookup, such as moving the packet from the ingress to the egress interface and writing the new Layer 2 information to the packet, and these will become important for the later discussion.

Again, why do we need to worry about the internal process of packet switching? Hang on there. I know we usually put more focus on the interaction between routers via routing protocols, to ensure each router can build its routing table successfully. Once we have the table, the Layer 3 lookup process itself can be done very fast. For each incoming packet we need to compare the destination against the database containing the list of all destinations with the associated egress interface. It can be done quickly, especially since a vendor like Cisco has invented a mechanism so the comparison doesn’t need to go through the entries in the list one by one. Instead, Cisco Express Forwarding (CEF) builds an mtrie data structure from the routing table, as shown in the next picture. Once the entry has been found, it gives a pointer to the adjacency table, which contains the Layer 2 information of the next hop.
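The idea behind the trie lookup can be shown with a plain binary trie. This is a conceptual sketch only, not CEF itself (CEF’s mtrie is multiway and tuned for hardware): walk the destination address bit by bit, remember the last matching entry seen, and let that entry point at the adjacency information for the next hop.

```python
# Conceptual longest-prefix-match lookup with a binary trie. The leaf
# (or interior) entries hold a pointer to the adjacency information,
# standing in for CEF's adjacency table.

class TrieNode:
    def __init__(self):
        self.children = {}      # bit ("0"/"1") -> TrieNode
        self.adjacency = None   # pointer into the adjacency table, if any

def insert(root, prefix_bits: str, adjacency: str):
    node = root
    for bit in prefix_bits:
        node = node.children.setdefault(bit, TrieNode())
    node.adjacency = adjacency

def lookup(root, addr_bits: str):
    node, best = root, None
    for bit in addr_bits:
        if node.adjacency is not None:
            best = node.adjacency           # longest match seen so far
        node = node.children.get(bit)
        if node is None:
            break
    else:
        if node.adjacency is not None:
            best = node.adjacency
    return best

root = TrieNode()
insert(root, "0000101", "next-hop A")    # shorter prefix
insert(root, "00001010", "next-hop B")   # more specific prefix wins
print(lookup(root, "00001010"))  # next-hop B
print(lookup(root, "00001011"))  # next-hop A
```

The point of the structure is that lookup cost grows with the address length, not with the number of routes in the table, which is why it beats scanning the list entry by entry.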


Enough with the lookup process and how the router determines which interface it should send the packet to. There is a whole book dedicated to explaining CEF in more detail, and since I want to focus on the hardware architecture rather than the software or algorithms of the lookup, I suggest you read that Cisco Express Forwarding book as well as Vijay’s book.

Now, let’s talk about moving the packet from the ingress interface to the egress interface. As discussed previously, the packet can be stored in a central memory while waiting for the lookup process. So the ingress interface processor must store the packet there, and the egress interface processor can copy the packet (with the new Layer 2 information) from the same central location. As you can see, with this design the bottleneck is the central memory performance, and obviously the memory must be able to serve multiple requests from different interface processors at the same time.


To improve the memory performance, one may want to use local memory on each interface. So the packet is stored in the local memory of the ingress interface, then copied to the shared central memory over the bus, and the local memory of the egress interface can get the packet from there. You may start asking: why doesn’t the ingress interface send the packet directly to the egress interface memory? Hold your horses for a while. It is possible, but it requires some intelligence on the ingress interface processor to determine which egress interface memory it should send the packet to. In other words, the ingress interface components may need to do the lookup. I will talk more about this in the next part.
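That two-copy path can be made explicit with a sketch, using plain dictionaries as stand-ins for the three memories (all structures invented for illustration):

```python
# Sketch of the two-copy path: ingress local memory -> shared central
# memory over the bus -> egress local memory. The extra hop is what a
# smarter ingress engine would eliminate by sending directly.

ingress_local = {}   # interface -> packet buffered on arrival
shared_memory = {}   # buffer handle -> packet
egress_local = {}    # interface -> packet ready to transmit

def copy_in(interface: str, packet: bytes, handle: int):
    ingress_local[interface] = packet
    shared_memory[handle] = ingress_local[interface]     # copy #1, over the bus

def copy_out(interface: str, handle: int):
    egress_local[interface] = shared_memory.pop(handle)  # copy #2, over the bus

copy_in("Gi0/0", b"packet-bytes", handle=42)
copy_out("Gi0/1", handle=42)
print(egress_local["Gi0/1"])  # b'packet-bytes'
```

Every packet crosses the bus twice here; a direct ingress-to-egress transfer would halve the bus load, but only if the ingress side already knows the destination, which is the intelligence question raised above.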

When you open the chassis of an old mid-range router, you may see something similar to the picture below. The main board is the base component connecting all the other components. There is a central route processor, central memory, the network interface cards, a PCI bus connecting the network cards to the route processor, and other components such as the flash where we store the software image, the boot ROM that runs the firmware required during the boot process before the router software image can be loaded, and so on.


Back to our keywords quickly. The Layer 3 and Layer 2 headers are inside the packet. The input queue or buffer can be in the ingress network card’s local memory or in central memory. The routing table and forwarding table are built by the route processor, using protocols to communicate with other routers. The Layer 3 lookup (along with finding the Layer 2 information of the next hop) is done by the route processor, using an algorithm to compare the destination against the routing table and forwarding table. Moving the packet between different locations or queues means the packet in the ingress network card’s local memory must be copied to the central memory over the PCI bus, from where the egress network card’s local memory can fetch it. The output queue is the egress network card’s local memory or the central memory. The Layer 2 rewrite, putting the new Layer 2 information into the packet, must be done by the route processor before the packet can be sent out of the router. All the features such as filters or NAT are done by the route processor; applying a feature on the ingress or egress interface can simply be a matter of applying it to the packet before or after the lookup has been done.

Looking at the picture above, does it remind you of something? Yes, it looks just like the components of a normal PC motherboard! This is one reason why some talented people can build their own router software, load it onto a normal PC, put in multiple network cards, and claim they can compete with or even beat a router built on dedicated hardware by a router vendor.

My take on this: it depends. If you compare the free router on a normal PC to some old mid-range router, this might be true, because all the tasks inside such a router are done in the central processor and memory, so what it takes is good software to do the lookup and packet switching, with optimization to ensure it utilizes the resources in a proper or better way.

But how about the latest features in next-generation networks? Do you think some people will build those for free? The features in a router are getting so complicated that it takes a team decision on how to implement them, even when there is a standard already defined. In the second part I will explain how far a vendor has gone to develop a modern, next-generation router. Because obviously the challenge is not how to switch the packet from the ingress interface to the egress interface, but how to do so as fast as possible. And it has to be done consistently for different types of packets, for different sizes of packets, in massive amounts, to accommodate today’s demand for huge bandwidth. Then later on we will face more challenges in deploying features that should be done in hardware, for example applying different treatment to packets based on priority on the egress network card, to ensure high-priority packets are transmitted first back onto the network media or the wire. Or rewriting the Layer 2 information of the packet, which should also be done in hardware to ensure maximum performance.

If you have read this far, and you think all the information above is more than enough to help you in your daily job, and you think it’s more important to go back to all the headaches caused by the communication between routers, or the protocols and features that need to run across multiple routers, then you are completely welcome to keep seeing a router as a black box, a node with multiple interfaces where packets go in and out. And there is really no harm if you want to skip the next part and decide not to bother at all with the internal packet switching process inside a router.

End of part one.