For some time I’ve had some issues on my network. They don’t happen very often, so they’ve been hard to track down. I know for sure that one of the issues is that one of my Linksys SGE2024 switches refuses to hold on to its configuration if it loses power. This is made more difficult by the fact that the power supply in the unit has very low capacitance. As a result, the switch appears to lose its configuration randomly. Take a look at my last post, What is a UPS, Really?, for the details on that.
However, I also had these weird “storms” where I’d start losing packets. Despite my best efforts to diagnose it, I wouldn’t find a problem, and then suddenly it would be fixed. Honestly, there’s nothing more frustrating than a problem that fixes itself before you can put your finger on it, and then comes back later when you’re not ready.
This problem, and the lack of diagnostic capabilities on the Linksys SGE2024 switches, led me to replace them with some new HP V1910 switches. However, as soon as I started deploying them I began seeing similar, but not identical, issues. While not perfect, the enhanced diagnostics of the V1910 switches allowed me to work out that there was a problem with the connection between my virtualization server and the switch.
The virtualization server is a Dell R710, which has four 1 Gb Ethernet ports. Since I expected that over time I’d want to get more than 1 Gb/s of traffic to the virtual servers, I decided to team the Ethernet ports together to aggregate the capacity of three of the four ports. The fourth port I used for the virtualization host itself. What is called teaming on the Windows side is known on the switch side as link aggregation. Generally speaking, this aggregation is negotiated through the Link Aggregation Control Protocol (LACP), although it doesn’t have to be done that way. LACP defines how the two ends communicate to form an aggregated channel.
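To make the idea of aggregated capacity a bit more concrete, here is a minimal sketch of how an aggregated group might spread traffic across its member links. It assumes a simple policy that hashes the source and destination MAC addresses so that each conversation always stays on one link. The `pick_link` function, the CRC-based hash, and the sample addresses are all made up for illustration; this is not the algorithm the V1910 or the Dell teaming utility actually uses.

```python
import zlib


def pick_link(src_mac, dst_mac, num_links):
    """Choose a member link for a conversation (hypothetical hashing policy)."""
    # A stable hash keeps every frame of one conversation on the same link,
    # so frames within that conversation are not reordered.
    key = f"{src_mac}->{dst_mac}".encode()
    return zlib.crc32(key) % num_links


# Twelve made-up conversations from one server to twelve different clients.
flows = [("00:00:00:00:00:01", f"00:00:00:00:01:{i:02x}") for i in range(12)]

# Count how many conversations land on each of the three teamed ports.
per_link = [0] * 3
for src, dst in flows:
    per_link[pick_link(src, dst, 3)] += 1

print(per_link)
```

Under a policy like this, the team raises the total capacity available to many conversations, but any single conversation is still limited to the speed of one link.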
When I looked at the LACP status on the switch, it told me that the ports weren’t properly bonding together. With more research I found that the infrastructure guy I had helping me, who is also a friend of mine, had used network bridging to connect the NICs together and had not actually used network teaming. The teaming had to be set up with a special utility that Dell didn’t see fit to include in the latest driver package. When my friend didn’t see it, he tried a workaround.
The problem was that each of those adapters was operating independently, and Windows was bridging them in software. This means that every multicast frame that one of the adapters received would be copied to the other two adapters. This is what a bridge does: multicast (public) traffic is sent everywhere.
So imagine a scenario where each of these three adapters is in turn bound to a virtual switch inside the server. Well, the virtual switch itself will send out multicast traffic on every port. Then imagine that the physical switch, which the physical network adapters are connected to, is also replicating the packets. What you end up with is the same packet getting sent over and over again. I have no idea whether it was three times per iteration, six times per iteration, or some other number, but I do know that as soon as I plugged the second adapter into the physical switch I got a lot of traffic very quickly.
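To get a feel for how fast this kind of loop grows, here is a rough simulation. It is a simplified model, not a reconstruction of my actual network: it assumes one flooded frame, a physical switch and a software bridge joined by some number of parallel links, and no loop protection on either side. It also ignores the virtual switches and every other port on the physical switch. The function name `frames_per_hop` is my own invention for this sketch.

```python
def frames_per_hop(num_links, hops):
    """Count copies of ONE flooded frame in flight on the switch-to-bridge
    links at each hop, assuming neither side has any loop protection."""
    # Hop 0: the physical switch floods the frame out every link to the server.
    in_flight = list(range(num_links))
    counts = [len(in_flight)]
    for _ in range(hops):
        next_in_flight = []
        for arrived_on in in_flight:
            # The receiving device (bridge or switch) floods the frame out
            # every parallel link except the one it arrived on.
            for link in range(num_links):
                if link != arrived_on:
                    next_in_flight.append(link)
        in_flight = next_in_flight
        counts.append(len(in_flight))
    return counts


for links in (1, 2, 3):
    print(links, frames_per_hop(links, 5))
# 1 [1, 0, 0, 0, 0, 0]
# 2 [2, 2, 2, 2, 2, 2]
# 3 [3, 6, 12, 24, 48, 96]
```

In this model a single link is harmless, two links keep the same frame circulating forever, and three links double the number of copies at every hop. That lines up with the trouble starting the moment the second adapter was plugged in.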
So the mystery was somewhat solved. The packet loss was because I was accidentally flooding the switches. I had created a wiring loop, even if part of that loop was completely virtual. Once I fixed the teaming, the problem went away and the network started working just fine.
So why did the network ever work? Well, eventually the switches’ flood protection would kick in and shut down some of the ports, but this took 15 minutes or more in some cases. So about the time I’d really get into looking at the problem, the switch would have solved it. That brings me back to diagnostics: the relatively poor diagnostics on the Linksys made it impossible for me to figure out what was going on. That’s really disappointing for a product that’s supposed to be marketed to people who can make mistakes.