
Today, many network operating systems have a fairly static logging system that cannot change its verbosity dynamically in response to events. Look at the examples below from Cisco NX-OS and Juniper JUNOS.
The snippet below is from Cisco NX-OS.
Table 1. System Message Severity Levels
Level Description
0 emergency System unusable
1 alert Immediate action needed
2 critical Critical condition
3 error Error condition
4 warning Warning condition
5 notification Normal but significant condition
6 informational Informational message only
7 debugging Appears during debugging only
The snippet below is from Juniper JUNOS.
Table 2: System Log Message Severity Levels
Value Severity Level Description
N/A none Disables logging of the associated facility to a destination.
0 emergency System panic or other condition that causes the router to stop functioning.
1 alert Conditions that require immediate correction, such as a corrupted system database.
2 critical Critical conditions, such as hard errors.
3 error Error conditions that generally have less serious consequences than errors at the emergency, alert, and critical levels.
4 warning Conditions that warrant monitoring.
5 notice Conditions that are not errors but might warrant special handling.
6 info Events or non-error conditions of interest.
7 any Includes all severity levels.
A computer network is a live, reactive system. When an event occurs on a network device (e.g. a switch), the related components (e.g. the debugger) write messages to a log file or syslog server according to the configured severity level (INFO, WARNING, or ERR in most cases). This level is mostly kept at low verbosity to avoid an overwhelming volume of log messages, so it may not capture enough detail to help when troubleshooting after an event such as a protocol timeout. Also, on a live production network we can't run many experiments to narrow down the problem.
Why doesn't each logger have an option to switch to a more verbose severity level when a specified event occurs?
For example, when LACP (Link Aggregation Control Protocol) negotiation fails, the component that declares the final verdict, such as “LACP has failed”, should raise its logging level so that the next retry attempt produces more debug messages showing what was going on and the possible reasons for the failure. Every networking protocol has a retry timer, hello, keepalive, or similar mechanism, so on the second attempt the logging level goes up (from INFO to DEBUG), which helps collect more logs; on the third attempt it can drop back or stay raised, whichever suits the implementation. Once the protocol stabilizes or stops retrying, the debugger component should revert to the original severity level. This behavior should be controlled by a knob that the user/admin can enable if they wish; otherwise the default behavior remains the same as today.
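The escalate-on-failure behavior described above can be sketched with Python's standard `logging` module. This is only an illustration under stated assumptions: the `EscalatingLogger` class, its method names, and the `verbose_retries` parameter are hypothetical and do not correspond to any vendor's implementation.

```python
import logging


class EscalatingLogger:
    """Hypothetical helper: raises a logger to DEBUG after a failure event."""

    def __init__(self, name, base_level=logging.INFO, enabled=True,
                 verbose_retries=2):
        self.logger = logging.getLogger(name)
        self.base_level = base_level
        self.enabled = enabled                  # the admin "knob"
        self.verbose_retries = verbose_retries  # retries logged at DEBUG
        self.verbose_left = 0
        self.logger.setLevel(base_level)

    def on_failure(self, message):
        """Called by the component that declares the final verdict."""
        self.logger.warning(message)
        # Escalate only if the knob is on and we are not already escalated.
        if self.enabled and self.logger.level != logging.DEBUG:
            self.verbose_left = self.verbose_retries
            self.logger.setLevel(logging.DEBUG)

    def on_retry(self):
        """Called at the start of each retry attempt."""
        if self.verbose_left == 0:
            self.restore()          # budget used up: back to the base level
        else:
            self.verbose_left -= 1  # this retry is logged at DEBUG

    def on_stable(self):
        """Called when the protocol stabilizes or stops retrying."""
        self.restore()

    def restore(self):
        self.verbose_left = 0
        self.logger.setLevel(self.base_level)


if __name__ == "__main__":
    logging.basicConfig(format="%(name)s|%(levelname)s|%(message)s")
    lacp = EscalatingLogger("lacp", verbose_retries=2)
    lacp.logger.debug("hidden: logger is still at INFO")
    lacp.on_failure("LACP has failed")
    lacp.on_retry()
    lacp.logger.debug("visible: detailed state during the retry")
    lacp.on_stable()
    lacp.logger.debug("hidden again: level reverted to INFO")
```

With `enabled=False` the logger never leaves its base level, which matches the requirement that the default behavior stays the same as today. Whether escalation lasts for one retry, several, or until the protocol stabilizes is a design choice left to the implementation.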
Look at the log messages below, taken from pre-production machines running ovs-vswitchd when LACP negotiation failed. They are not at all useful when we have to troubleshoot after the event: the admin has no further data on why the members were disabled.
2023-07-21T12:17:57.729Z|522426|bond|INFO|member dpdk1: disabled
2023-07-21T12:17:57.729Z|522427|bond|INFO|bond dpdk_bond0: all members disabled
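To show how such a line could serve as the trigger event, here is a small Python sketch that splits the pipe-delimited lines above into fields and recognizes the "all members disabled" message. The field names (`timestamp`, `sequence`, `module`, `level`, `message`) are my reading of the sample lines, and the `VlogRecord`, `parse_vlog_line`, and `is_bond_down` names are hypothetical.

```python
from typing import NamedTuple, Optional


class VlogRecord(NamedTuple):
    timestamp: str
    sequence: int
    module: str
    level: str
    message: str


def parse_vlog_line(line: str) -> Optional[VlogRecord]:
    """Split one log line into five fields; return None if it does not fit."""
    # Split at most four times so '|' inside the message text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5 or not parts[1].isdigit():
        return None
    timestamp, sequence, module, level, message = parts
    return VlogRecord(timestamp, int(sequence), module, level, message)


def is_bond_down(record: VlogRecord) -> bool:
    """True for the 'all members disabled' verdict logged by the bond module."""
    return (record.module == "bond"
            and record.message.endswith("all members disabled"))


if __name__ == "__main__":
    sample = ("2023-07-21T12:17:57.729Z|522427|bond|INFO|"
              "bond dpdk_bond0: all members disabled")
    record = parse_vlog_line(sample)
    if record is not None and is_bond_down(record):
        print("trigger: raise logging verbosity for the next retry")
```

An external watcher built on this could react to the trigger by raising verbosity, for instance with Open vSwitch's `ovs-appctl vlog/set` command, although having the component escalate itself would avoid the delay of an outside observer.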
Vendor switches and routers deal with many such protocols, such as LLDP, VRRP, and BFD, so the proposed solution would really help resolve issues faster.
This idea gives admins and vendor engineers more data to troubleshoot the problem quickly.
This would really help in live networks, where an event sometimes happens only at certain times, or happens once and never reappears. Logs collected (on a syslog server or in log files) during that window may help us fix otherwise unreproducible issues. Raising the logging level dynamically when a network event occurs helps in troubleshooting protocol state machine failures.
What do you think, network engineers? Ask your vendor and raise an RFE. 🙂