Cause of high CPU load on Juniper peering router's routing engine
Recently the routing engine CPU utilization on two of our Juniper peering routers increased from ~10-20% average load to 80+%. I'm trying to figure out what's causing this (and how to get this high load back down).
Some info on the routers: both run the same JunOS version, both are connected to the same two peering IXP LANs, and both have a large number (several hundred) of (almost identical) IPv4 and IPv6 sessions. Each router has a connection to a different IP transit provider and is connected in the same way to the rest of our network. The routing engines' CPU load isn't a flat line at 80+%; it drops back to normal levels for minutes to hours at a time, but these drops are not that frequent.
Things I've checked:
- no configuration changes have been made at the moment the increase started
- there's no increase in non-unicast traffic directed at the control plane
- there's no (substantial) change in the amount of traffic being forwarded (though even an increase shouldn't matter)
- show system processes summary indicates the rpd process is causing the high CPU load
- there are no rapidly flapping BGP peers causing a large amount of BGP changes
One possible explanation I can come up with is that one or more peers on one of the IXPs both routers are connected to is sending a large number of BGP updates. Currently I only have statistics on the number of BGP messages for my transit sessions (which show no abnormal activity), and with several hundred BGP sessions on the peering LANs it's not easy to spot the problematic session(s) if I have to create graphs for all of them.
My questions are:
- are there any other things I should check to find the cause of this increase in CPU load on the routing engines?
- how can I easily find out which sessions are causing these problems (if my assumption is right)? Enabling BGP traceoptions generates huge amounts of data, but I'm not sure if it gives me any real insights.
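One way to narrow this down without pre-building graphs for every session might be to snapshot the per-peer InPkt counters from show bgp summary twice and rank peers by the delta. Below is a minimal sketch in Python; the column layout is assumed from typical Junos output, and actually collecting the two snapshots (via SSH, PyEZ, etc.) is left out:

```python
# Hedged sketch: find BGP peers with unusually high message rates by
# diffing two snapshots of "show bgp summary" output taken some minutes
# apart. Assumed columns: Peer, AS, InPkt, OutPkt, OutQ, Flaps, ...

def parse_summary(text):
    """Map peer address -> InPkt counter from 'show bgp summary' output."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Peer lines start with an IP address followed by numeric columns;
        # header and state lines are skipped.
        if len(fields) >= 3 and fields[0][0].isdigit():
            try:
                counts[fields[0]] = int(fields[2])  # assumed InPkt column
            except ValueError:
                continue
    return counts

def top_talkers(before, after, n=10):
    """Return the n peers whose InPkt counter grew the most between snapshots."""
    deltas = {peer: after[peer] - before.get(peer, 0) for peer in after}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

With a few hundred sessions, running this once against two snapshots taken a few minutes apart should make one or two churny peers stand out immediately.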
There might be some helpful information for you at the Juniper Knowledge Center.
If RPD is consuming high CPU, then perform the following checks and verify the following parameters:
Check the interfaces: Check if any interfaces are flapping on the router. This can be verified by looking at the output of the show log messages and show interfaces ge-x/y/z extensive commands. Troubleshoot why they are flapping; if possible, consider configuring hold-time for link up and link down to damp the flaps.
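As a hedged example (the interface name and timer values here are placeholders, not from this thread), a hold-time that damps a flapping link could look like this, with values in milliseconds:

```
[edit]
set interfaces ge-0/0/0 hold-time up 2000 down 0
```

This delays reporting link-up by 2 seconds while reporting link-down immediately, so a bouncing interface generates fewer transitions for rpd to process.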
Check if there are syslog error messages related to interfaces or any FPC/PIC, by looking at the output of show log messages.
Check the routes: Verify the total number of routes that are learned by the router by looking at the output of show route summary. Check if it has reached the maximum limit.
Check the RPD tasks: Identify what is keeping the process busy. This can be checked by first enabling task accounting with set task accounting on. Important: this itself might increase the CPU load and utilization, so do not forget to turn it off when you are done collecting the required output. Then run show task accounting and look for the threads with the highest CPU time:
user@router> show task accounting
Task                       Started          User Time  System Time  Longest Run
Scheduler                   146051              1.085        0.090        0.000
Memory                           1              0.000            0        0.000
<omit>
BGP.184.108.40.206+179         268             13.975        0.087        0.328
BGP.0.0.0.0+179           18375163  1w5d 23:16:57.823    48:52.877        0.142
BGP RT Background              134              8.826        0.023        0.099
Find out why a thread related to a particular prefix or protocol is consuming high CPU.
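To make that easier with hundreds of threads, the accounting output can be ranked programmatically. A small sketch in Python (the column layout is assumed from the sample output above; cumulative entries whose times use d/h:m:s notation, such as BGP.0.0.0.0+179, are simply skipped here):

```python
# Hedged sketch: rank threads in "show task accounting" output by the
# User Time column. Assumed columns: Task, Started, User Time,
# System Time, Longest Run; task names may contain spaces.

def rank_by_user_time(text):
    """Return (task name, user time) pairs sorted by user time, descending."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # e.g. the "<omit>" marker line
        try:
            user_time = float(fields[-3])  # assumed User Time column
        except ValueError:
            continue  # header line or d/h:m:s style cumulative entries
        name = " ".join(fields[:-4])
        rows.append((name, user_time))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Feeding it a captured copy of the output surfaces the busiest per-peer BGP thread at the top of the list.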
You can also verify whether routes are oscillating (route churn) by looking at the output of the shell command rtsockmon -t.
Check RPD memory: Sometimes high memory utilization can indirectly lead to high CPU.
RPD is a bit of an annoying black box. On top of the great suggestions rtsockmon -t and show task accounting, I'd also like to add show krt queue as a potentially useful tool.
show krt queue will show you any route updates going from the control plane to the forwarding plane. You should see nothing queued most of the time. When a flap happens, updates can stay queued there for quite some time.
A couple of those points were too obvious to mention (flapping interfaces, errors in logs), but the rtsockmon and task accounting suggestions were insightful. It looks like a lot of CPU cycles are used for SNMP, so next up is figuring out which boxes and tools are polling these routers.
Sorry if they were too obvious; I've come from a support background where getting a user to check if it's plugged in was a hassle!