Networking issues are complex and critical for any IT infrastructure: we fix one, and ten new ones pop up from nowhere. Because the network is the foundation on which an entire enterprise or telco business is built and run, any sluggishness, breakdown, or breach in it is a nightmare issue for network engineers to solve. Troubleshooting is like running a maze, but it becomes an interesting art when played by the rules of Dr. Network Troubleshooter. Let's meet Dr. Network Troubleshooter, then.
It is commonly said in the health care world that catching a disease at an early stage prevents a breakdown of the whole system in the long run. The same holds true in networking. Generally, issues are observed at various stages: during deployment (Day 1) or after deployment (Day 2) in the field, starting from pilot testing, through extensive feature/integration/interoperability testing in a staging period, or even after going live. The impact of the same issue differs from stage to stage. However, each stage also gives hints about the cause.
Issues found in the initial period of lab testing (pre-production) tend to be misconfigurations of a feature, whereas design anomalies cause issues at later stages.
In the same way, issues reported or seen in a live network are commonly software failures (e.g. memory leaks) or hardware failures (e.g. a faulty fan RPM), but not necessarily misconfiguration. Sometimes there are "timing" issues too, such as momentary packet drops. Since a network is a real-time reactive system, timing issues are hard to reproduce: they do not occur frequently, or they happen only in reaction to some change such as a shift in traffic pattern or topology, a protocol timer failure, or high temperature due to a cooling failure. Their impact is temporary, but it can be a loud one.
So it is always a good idea to note at which stage the symptoms are being reported and decide the next course of action accordingly.
Now, it is very common for symptoms to be reported with only superficial information, e.g. "huge packet drops observed" or "link flapped multiple times". Troubleshooting such issues becomes a challenging journey.
For such issues, it is always a good idea to break the entire infrastructure design into multiple parts based on the products and features involved. For example, an OpenStack infrastructure can be divided into Neutron, Open vSwitch, DPDK, the NICs, RHEL (the OS), the OpenDaylight controller, Director, and so on. Sometimes it is just one product, but multiple features colliding with each other, e.g. bonding with DPDK ports, where bonding involves protocols like LACP or different balancing algorithms.
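This divide-and-conquer sweep can be automated in a rough way. The sketch below is a minimal illustration, not an official checklist: the layer names and health-check commands are assumptions for an OpenStack-style stack, and the command runner is injectable so the sweep can be dry-run without touching a real host.

```python
import subprocess

# Hypothetical layer -> health-check command map (illustrative assumptions).
LAYER_CHECKS = {
    "neutron":     ["openstack", "network", "agent", "list"],
    "openvswitch": ["ovs-vsctl", "show"],
    "nic":         ["ethtool", "eth0"],
    "os":          ["ip", "-s", "link"],
}

def check_layers(checks, runner=subprocess.run):
    """Run each layer's check and return the layers whose check failed.

    `runner` is injectable so the sweep can be simulated in tests.
    """
    failing = []
    for layer, cmd in checks.items():
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failing.append(layer)
    return failing
```

A failing layer tells us where to zoom in next; a clean sweep suggests the collision is between features rather than inside one product.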
Remember, networking is all about visualization. Drawing the end-to-end picture, the packet path, the critical junctures, etc. certainly makes tracking the journey easier.
First aid in emergency:
Issues reported from a live/production network are always an emergency/critical case. The first priority while working on those cases is to restore normalcy. This requires isolating the troublemaking part and recovering the system through redundancy or a workaround. For example, a traffic disruption may be caused by some event outside the infrastructure, like a sudden burst of broadcast traffic, or it could be a case of a DoS attack. In such circumstances, an external policy/ACL may require further tightening, so the traffic can't sneak in easily and disturb the setup until it stabilizes.
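Spotting such a burst early usually comes down to watching counters. Here is a minimal sketch, assuming a Linux host where per-interface counters live under `/sys/class/net/<iface>/statistics`; the packets-per-second threshold is a deployment-specific assumption, not a universal constant.

```python
def read_rx_packets(iface):
    """Read the cumulative RX packet counter for a Linux interface."""
    with open(f"/sys/class/net/{iface}/statistics/rx_packets") as f:
        return int(f.read())

def detect_burst(samples, threshold_pps):
    """Flag sampling intervals whose packet rate exceeds threshold_pps.

    `samples` is a list of (timestamp_seconds, counter) pairs taken with
    read_rx_packets() or any other counter source.
    """
    alerts = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rate = (c1 - c0) / (t1 - t0)
        if rate > threshold_pps:
            alerts.append((t1, rate))
    return alerts
```

An alert is the cue to tighten the ACL or rate-limit on the offending segment before the burst destabilizes the rest of the setup.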
A second case could be a broken forwarding path. While traffic can be rerouted to the next available alternative, a new path (ports) may need to be set up if required, or QoS imposed on critical traffic in order to protect it.
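The failover decision itself is simple to state in code. This is a toy sketch of the selection logic only (the path names and status map are made up for illustration); real reconvergence is handled by the routing protocol or controller.

```python
def pick_path(paths, status):
    """Return the first path that is 'up', in preference order.

    `paths` lists candidate forwarding paths, primary first; `status`
    maps path name -> 'up'/'down'. Illustrative names only.
    """
    for path in paths:
        if status.get(path) == "up":
            return path
    return None  # no alternative left: set up a new port or escalate
```

The `None` case is exactly the situation described above, where a new path has to be provisioned or QoS used to keep the critical traffic alive on whatever capacity remains.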
While we work on recovery, one concern with a production setup is the limited freedom of diagnosis. So the best way to reach the root cause is to simulate a similar setup in a testing lab; even though we can't match the scale, a small simulator can recreate some of the conditions.
Once symptoms are reported, the next step, in order to identify which products/features/daemons are causing the issue, is to focus on:
- Relevant Commands, running config
- Logs (if DEBUG available)
- Events (TRAPS, UPGRADES, UPDATES, MIGRATION)
- Traffic Pattern (Burst, BUUM)
- SOS report
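The indicators above are easiest to compare when captured as one snapshot. Below is a minimal collection sketch; the indicator-to-command map is an assumption for a Linux/OVS host (adjust it to your platform), and the runner is injectable for testing.

```python
import subprocess
import time

# Indicator -> command map: plausible examples, not an exhaustive list.
INDICATOR_CMDS = {
    "running-config": ["ip", "addr", "show"],
    "link-stats":     ["ip", "-s", "link"],
    "error-logs":     ["journalctl", "-p", "err", "--since", "-1 hour"],
}

def collect_indicators(cmds=INDICATOR_CMDS, runner=subprocess.run):
    """Snapshot each indicator's output into one timestamped bundle."""
    bundle = {"collected_at": time.time()}
    for name, cmd in cmds.items():
        try:
            result = runner(cmd, capture_output=True, text=True, timeout=30)
            bundle[name] = result.stdout
        except Exception as exc:  # keep collecting even if one command fails
            bundle[name] = f"ERROR: {exc}"
    return bundle
```

Two bundles taken before and after an event (an upgrade, a migration, a traffic burst) give a clean diff to narrow the problem down.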
These are very important indicators for narrowing down the problem further. However, analysing them requires significant technical expertise and skills in the area closest to the issue. From here on, the journey moves from the strategic track to the technical track.
One more thing to note here: FALSE ALARMS. While we debug an issue, it is quite possible that a junction leads us down the wrong path. Identifying the right path takes years of handling various issues and comes only with experience.
Troubleshooting is a stage-by-stage game and should be played like Super Mario. I hope Dr. Network Troubleshooter's rules will help you on your journey.