A company using multi-vendor routing platforms (Cisco and Juniper) has an HQ and multiple spoke sites connected by an MPLS provider. Each remote site has a GRE tunnel to the headquarters (HQ) and runs BGP over it.
After attending security training, your Security Team raised concerns about ICMP-based attacks and decided to block ICMP messages on all physical interfaces connected to outside networks, on all border routers, in all sites.
Some time later, all the BGP sessions between Cisco and Juniper devices started flapping up/down, impacting the connectivity between HQ and Juniper-based sites, while the BGP sessions between the (Cisco-based) HQ and the other Cisco-based sites were OK.
As most of you have spotted already, dropping all ICMP messages affects Path MTU Discovery (PMTUD), which in turn impacts end-to-end connectivity (in this case, the BGP session)... but why is there a difference between Cisco and Juniper? We will see that, but first let’s do some simple math:
- by default, Ethernet MTU is 1500 bytes (a full Ethernet frame is 1518 bytes = 1500 bytes payload + 14 bytes Ethernet II header + 4 bytes checksum)
- by default, GRE tunnel MTU is 1476 = 1500 - (20 bytes IP header + 4 bytes GRE header)
- MPLS adds a 4-byte overhead for each label; if the MPLS MTU is not configured, then with the default 1500-byte interface MTU this leaves 1492 bytes for the IP packet (accounting for 2 labels)
- by default, the TCP MSS (Maximum Segment Size) is automatically calculated by subtracting 40 (20-byte IP header + 20-byte TCP header) from the MTU of the outgoing interface:
  - TCP MSS is the maximum size of the TCP payload
  - TCP MSS is negotiated (the lower value should be chosen) between source and destination during the TCP 3-way handshake, in the SYN & SYN/ACK packets
  - for example: MSS for a TCP session going out an Ethernet interface would be 1500 - 40 = 1460 bytes
  - another example: MSS for a TCP session going out a GRE tunnel interface would be 1476 - 40 = 1436 bytes
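The default values listed above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and constant names are mine (not taken from any router software), and it assumes IPv4 and TCP headers without options.

```python
# Header sizes in bytes (IPv4 and TCP without options, basic GRE header).
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4

ETHERNET_MTU = 1500


def gre_tunnel_mtu(physical_mtu: int) -> int:
    """IP MTU left inside a GRE tunnel after the outer IP and GRE headers."""
    return physical_mtu - (IP_HEADER + GRE_HEADER)


def tcp_mss(interface_mtu: int) -> int:
    """Default TCP MSS derived from the MTU of the outgoing interface."""
    return interface_mtu - (IP_HEADER + TCP_HEADER)


def effective_mss(local_mss: int, remote_mss: int) -> int:
    """MSS used by the session: the lower of the two advertised values."""
    return min(local_mss, remote_mss)


print(gre_tunnel_mtu(ETHERNET_MTU))           # 1476
print(tcp_mss(ETHERNET_MTU))                  # 1460
print(tcp_mss(gre_tunnel_mtu(ETHERNET_MTU)))  # 1436
```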
Now, for the BGP sessions, the math is like this:
- the largest TCP segment carrying BGP UPDATE data would have a payload of 1436 bytes = this is the TCP MSS for BGP over a GRE tunnel (see above)
- when such a packet reaches the PE, its size would be 1500 bytes = 1436 (BGP payload) + 20 (TCP header) + 20 (IP header) + 4 (GRE header) + 20 (outer IP header)
- the quiz does not make any reference to the MTU size inside the MPLS cloud, as there is no MTU configuration on the MPLS links. This was done on purpose to create the quiz => a packet of 1500 bytes is too large for the MPLS links (because the PE needs to add 2 labels = another 8 bytes)
- as a result, the PE will need to fragment the packet carrying the BGP UPDATE message
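To double-check the numbers above, here is a small Python sketch that rebuilds the packet layer by layer and tests it against the MPLS link. The helper names are hypothetical, and the 1500-byte link MTU is the default assumed throughout the quiz.

```python
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4
MPLS_LABEL = 4


def gre_packet_size(tcp_payload: int) -> int:
    """Size of the GRE-encapsulated IP packet as it leaves the CE."""
    inner_packet = tcp_payload + TCP_HEADER + IP_HEADER
    return inner_packet + GRE_HEADER + IP_HEADER  # add GRE + outer IP header


def labelled_size(ip_packet: int, labels: int) -> int:
    """Size once the PE has pushed the given number of MPLS labels."""
    return ip_packet + labels * MPLS_LABEL


def fits_mpls_link(ip_packet: int, link_mtu: int, labels: int) -> bool:
    """True if the labelled packet can cross the link without fragmentation."""
    return labelled_size(ip_packet, labels) <= link_mtu


packet = gre_packet_size(1436)
print(packet)                           # 1500
print(labelled_size(packet, 2))         # 1508
print(fits_mpls_link(packet, 1500, 2))  # False
```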
Path MTU Discovery (PMTUD)
For completeness of this article, in short, Path MTU Discovery consists of:
- the source host sets the DF-bit in the IP header to indicate that the packet must not be fragmented in transit
- intermediate routers (the PE in our quiz) will drop these large packets, because they exceed the MTU of the outgoing interface and because the routers are not allowed to fragment them due to the DF-bit setting
- intermediate routers will send an ICMP “Fragmentation Needed and DF set” message back to the source host (the CE router for the BGP session, in our quiz)
- very important: the ICMP “Fragmentation Needed” message also contains the recommended MTU value
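The steps above can be sketched as a toy simulation in Python. It is not how any router implements PMTUD; the path is just a list of per-hop MTUs that I made up, and the `icmp_allowed` flag stands in for the firewall filter at the sender.

```python
from typing import List, Optional, Tuple


def forward(packet_size: int, df_bit: bool, hop_mtus: List[int]) -> Tuple[bool, Optional[int]]:
    """Send one packet along the path.

    Returns (delivered, reported_mtu). reported_mtu is the value carried in
    the ICMP "Fragmentation Needed" message when a DF packet is dropped.
    """
    for mtu in hop_mtus:
        if packet_size > mtu and df_bit:
            return False, mtu  # dropped, ICMP sent back to the source
        # Without DF the router would fragment and keep forwarding.
    return True, None


def discover_path_mtu(start_size: int, hop_mtus: List[int],
                      icmp_allowed: bool, max_tries: int = 10) -> Optional[int]:
    """Shrink the packet size until it gets through; None means black hole."""
    size = start_size
    for _ in range(max_tries):
        delivered, reported = forward(size, True, hop_mtus)
        if delivered:
            return size
        if not icmp_allowed:
            return None  # ICMP filtered: the sender never learns the MTU
        size = reported
    return None


path = [1500, 1492, 1500]  # hypothetical path with one smaller hop
print(discover_path_mtu(1500, path, icmp_allowed=True))   # 1492
print(discover_path_mtu(1500, path, icmp_allowed=False))  # None
```

The second call is the situation in this quiz: the DF-bit is set, the PE drops the packet, and the ICMP message that would fix it is discarded.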
Cisco vs. Juniper
The difference between the BGP sessions established between Cisco-only sites (that were not impacted) and Cisco-Juniper ones (the impacted sites) lies in the DF-bit setting!
- By default, Cisco does not set the DF-bit for GRE tunnels => this means that a 1500-byte packet carrying a BGP UPDATE would be fragmented by the PE before being sent over the 1492-byte MPLS links.
- Juniper, on the other hand, sets the DF-bit for GRE tunnels by default => so a 1500-byte packet carrying a BGP UPDATE with the DF-bit set would not fit the 1492-byte MPLS links. The PEs would drop it and send back to the CEs an ICMP “Fragmentation Needed” indicating the MTU of the outgoing link (see above screenshot: 1492).
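The two behaviours can be compared with a short Python sketch of what the PE does with an oversized packet. The function name and return format are hypothetical; the only rule borrowed from IPv4 itself is that every fragment except the last carries a payload that is a multiple of 8 bytes.

```python
from typing import List, Tuple

IP_HEADER = 20


def handle_at_pe(packet_size: int, df_bit: bool, mtu: int) -> Tuple[str, List[int]]:
    """Return the PE's action and the sizes of the IP packets it sends on."""
    if packet_size <= mtu:
        return "forward", [packet_size]
    if df_bit:
        # Juniper default per the article: DF set, so drop and send ICMP.
        return "drop+icmp", []
    # Cisco default per the article: DF clear, so the PE fragments.
    payload = packet_size - IP_HEADER
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + IP_HEADER)
        payload -= chunk
    return "fragment", sizes


print(handle_at_pe(1500, False, 1492))  # ('fragment', [1492, 28])
print(handle_at_pe(1500, True, 1492))   # ('drop+icmp', [])
```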
This is visible on both Cisco PE and Juniper CE:
- debugging ICMP on PE:
R5#
*Mar 1 00:22:32.851: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar 1 00:22:33.747: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
R5#
*Mar 1 00:22:40.291: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar 1 00:22:41.699: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
- firewall logs on Juniper CE with the drops:
root@Router-1> show firewall
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                                Bytes              Packets
deny-icmp-ge-0/0/0.0-i              22400                  400
root@Router-1> show firewall log
Log :
Time      Filter                  Action  Interface   Protocol  Src Addr     Dest Addr
23:14:23  DENY_ICMP-ge-0/0/0.0-i  D       ge-0/0/0.0  ICMP      192.168.2.1  192.168.255.2
23:14:07  DENY_ICMP-ge-0/0/0.0-i  D       ge-0/0/0.0  ICMP      192.168.2.1  192.168.255.2
23:13:59  DENY_ICMP-ge-0/0/0.0-i  D       ge-0/0/0.0  ICMP      192.168.2.1  192.168.255.2
...
root@Router-1> show firewall log detail
Time of Log: 2014-01-24 23:14:23 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
Time of Log: 2014-01-24 23:14:07 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
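The “ICMP type: 3, ICMP code: 4” seen in the logs is exactly the message that carries the MTU hint. As an illustration, here is a minimal Python sketch that reads the next-hop MTU field from the first 8 bytes of such a message, following the layout from RFC 1191. The sample bytes are built by hand (checksum left at 0 and not verified), so this is not a capture from the lab.

```python
import struct
from typing import Optional


def parse_frag_needed(icmp: bytes) -> Optional[int]:
    """Return the next-hop MTU from an ICMP type 3 / code 4 message, else None."""
    if len(icmp) < 8:
        return None
    # type (1 byte), code (1), checksum (2), unused (2), next-hop MTU (2)
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack("!BBHHH", icmp[:8])
    if (icmp_type, code) != (3, 4):
        return None
    return next_hop_mtu


# Hand-made sample header: type 3, code 4, next-hop MTU 1492.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
print(parse_frag_needed(sample))  # 1492
```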
If you are curious how the BGP session behaves on each end, here it is:
Cisco CE in HQ
The BGP session gets established but it does not learn any route. Notice:
- the 0 counter on the PfxRcd
- the Up/Down timer never gets more than “1:29” = 90 sec (the BGP default holdtime)
CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes
%BGP-5-NBR_RESET: Neighbor 192.168.12.2 reset (BGP Notification sent)
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down BGP Notification sent
CE-HQ#
*Jan 24 22:42:18.519: %BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp summary
...
Neighbor        V     AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
192.168.12.2    4  65200        2        8    1935    0     0  00:01:27  0
192.168.13.2    4  65300       11       15    1935    0     0  00:04:35  848
CE-HQ#
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes
Juniper CE in remote site
The BGP session gets established and prefixes are learned over it. Notice the Flaps counter is non-zero.
root@Router-1> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table          Tot Paths  Act Paths  Suppressed  History  Damp State  Pending
inet.0              1934       1934           0        0           0        0
Peer              AS  InPkt  OutPkt  OutQ  Flaps  Last Up/Dwn  State|#Active/Received...
192.168.12.1   65000      9      16     0     10         1:27  1934/1934/1934/0
root@Router-1> show firewall
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                                Bytes              Packets
deny-icmp-ge-0/0/0.0-i               5152                   92
root@Router-1> show firewall log detail
Time of Log: 2014-01-24 22:46:38 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
Solutions
1. Set a higher MTU inside the MPLS cloud
As mentioned above, the MPLS MTU was not set to take the labels into account, for the sake of this quiz. Considering the default physical interface MTU of 1500, the MTU left for IP packets across the MPLS links would be 1492 (for 2 labels). This value is easily seen in the ICMP “Fragmentation Needed” messages, as shown above.
MPLS providers usually offer an MTU of 1500 bytes to their customers. To do this, we need to increase the MPLS MTU to at least 1508. The usual setting is 1516 (to accommodate 4 labels), but for this quiz we use only 2 MPLS labels:
PE-2#sh mpls interfaces
Interface              IP            Tunnel   Operational
FastEthernet0/1        Yes (ldp)     No       Yes
PE-2#
PE-2#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
PE-2(config)#int fa0/1
PE-2(config-if)#mpls mtu 1508
PE-2#
*Mar 1 00:47:45.651: %SYS-5-CONFIG_I: Configured from console by console
PE-2#sh mpls int detail
Interface FastEthernet0/1:
        IP labeling enabled (ldp):
          Interface config
        LSP Tunnel labeling not enabled
        BGP tagging not enabled
        Tagging operational
        Fast Switching Vectors:
          IP to MPLS Fast Switching Vector
          MPLS Turbo Vector
        MTU = 1508
PE-2#
This is, by far, the best solution, because it avoids fragmentation!!
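The 1508 and 1516 values follow directly from the 4 bytes per label. A quick Python sketch of that sizing rule (function names are mine, for illustration only):

```python
MPLS_LABEL = 4  # bytes per label


def required_mpls_mtu(customer_ip_mtu: int, labels: int) -> int:
    """MPLS MTU needed so that customers keep the full IP MTU."""
    return customer_ip_mtu + labels * MPLS_LABEL


def usable_ip_mtu(mpls_mtu: int, labels: int) -> int:
    """IP MTU left for customers with a given MPLS MTU and label stack."""
    return mpls_mtu - labels * MPLS_LABEL


print(required_mpls_mtu(1500, 2))  # 1508
print(required_mpls_mtu(1500, 4))  # 1516
print(usable_ip_mtu(1500, 2))      # 1492
```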
2. Allow ICMP “Fragmentation Needed” in the ACL (on the Juniper side)
Another solution to this problem is to modify the access-list / Juniper filter to permit the ICMP messages of type 3 (destination unreachable), code 4 (fragmentation needed), which are used by PMTUD (Path MTU Discovery).
In general, it’s good practice to allow the ICMP “Fragmentation Needed” messages in access-lists whenever the ICMP protocol is filtered.
For this quiz, these ICMP messages need to be allowed by the firewall filter only on the Juniper devices (because it’s the Juniper that sets the DF-bit in the GRE packets).
root@Router-1> show configuration firewall
filter DENY_ICMP {
    interface-specific;
    term ALLOW_PMTUD {
        from {
            protocol icmp;
            icmp-type unreachable;
            icmp-code fragmentation-needed;
        }
        then {
            count allow-pmtud;
            log;
            accept;
        }
    }
    term DENY_ICMP {
        from {
            protocol icmp;
        }
        then {
            count deny-icmp;
            log;
            discard;
        }
    }
    term ALLOW_ALL {
        then accept;
    }
}
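The order of the terms matters here: the filter is evaluated top-down and the first matching term decides the action, so ALLOW_PMTUD has to come before DENY_ICMP. The following Python sketch models that first-match logic. The packet dictionaries and the `evaluate` helper are invented for illustration and are not a Junos API.

```python
from typing import Dict, List, Tuple

# (term name, match conditions, action) in the same order as the filter.
TERMS: List[Tuple[str, Dict[str, object], str]] = [
    ("ALLOW_PMTUD", {"protocol": "icmp", "icmp_type": 3, "icmp_code": 4}, "accept"),
    ("DENY_ICMP", {"protocol": "icmp"}, "discard"),
    ("ALLOW_ALL", {}, "accept"),  # no conditions: matches everything
]


def evaluate(packet: Dict[str, object]) -> Tuple[str, str]:
    """Return (term, action) of the first term whose conditions all match."""
    for name, conditions, action in TERMS:
        if all(packet.get(key) == value for key, value in conditions.items()):
            return name, action
    return "implicit", "discard"  # never reached here because of ALLOW_ALL


frag_needed = {"protocol": "icmp", "icmp_type": 3, "icmp_code": 4}
echo_request = {"protocol": "icmp", "icmp_type": 8, "icmp_code": 0}
bgp = {"protocol": "tcp", "dst_port": 179}

print(evaluate(frag_needed))   # ('ALLOW_PMTUD', 'accept')
print(evaluate(echo_request))  # ('DENY_ICMP', 'discard')
print(evaluate(bgp))           # ('ALLOW_ALL', 'accept')
```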
After committing this change, the BGP sessions between the Cisco HQ and the Juniper sites became stable. The firewall counters and logs show that ICMP “Fragmentation Needed” messages are allowed on Juniper:
root@Router-1> show firewall
Filter: __default_bpdu_filter__
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                                Bytes              Packets
allow-pmtud-ge-0/0/0.0-i              224                    4
deny-icmp-ge-0/0/0.0-i                168                    2
root@Router-1> show firewall log detail
Time of Log: 2014-01-22 22:55:30 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: accept, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
3. Apply “allow-fragmentation” on the Tunnel Interface (on Juniper)
By default, GRE packets will be dropped if they exceed the MTU of the outgoing physical interface. Instead of dropping them, you can tell the Juniper router to split them into several IP fragments. This is achieved with the “allow-fragmentation” command under the gr- (tunnel) interface:
root@Router-1> show configuration interfaces gr-0/0/0
unit 0 {
    tunnel {
        source 192.168.255.2;
        destination 192.168.255.1;
        allow-fragmentation;
    }
    family inet {
        address 192.168.12.2/30;
    }
}
Once you allow fragmentation of the GRE packets, the router no longer sets the DF-bit on them. This is why I consider this solution more of a “workaround”: it does not actually solve the problem, since large BGP UPDATE messages are still sent and they get fragmented on the MPLS PE routers.
A real solution would be to avoid fragmentation!
4. Implement MSS Clamping
Another good solution to avoid fragmentation is to use “MSS Clamping”. This feature modifies (usually decreases) the MSS value in the SYN and SYN/ACK packets to the configured value. As shown above, the MSS value for the BGP sessions that run over GRE tunnels is 1436 (= 1476 (GRE MTU) - 40 (IP+TCP headers)).
On Cisco devices, this is implemented at the global level with “ip tcp mss” or at the interface level with “ip tcp adjust-mss”:
CE-HQ(config)#ip tcp mss 1400
CE-HQ(config)#end
CE-HQ#
CE-HQ#clear ip bgp 192.168.12.2
CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down User reset
%BGP_SESSION-5-ADJCHANGE: neighbor 192.168.12.2 IPv4 Unicast topology base removed from session User reset
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp s
CE-HQ#sh ip bgp summary
...
Neighbor        V     AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
192.168.12.2    4  65200       13       12    2835    0     0  00:00:19  999
192.168.13.2    4  65300       83       88    2835    0     0  01:10:53  848
CE-HQ#
CE-HQ#sh ip bgp nei 192.168.12.2 | i max
  Number of NLRIs in the update sent: max 1010, min 0
minRTT: 48 ms, maxRTT: 484 ms, ACK hold: 200 ms
Datagrams (max data segment is 1400 bytes):
CE-HQ#
This is a screenshot of the TCP 3-way handshake for the BGP between HQ and remote-site:
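To see why an MSS of 1400 is enough, here is a rough Python sketch of the clamping logic and the resulting packet size on the MPLS links. The names are hypothetical, and the 2-label stack and 1500-byte link MTU are the values assumed in this quiz.

```python
IP_HEADER = 20
TCP_HEADER = 20
GRE_HEADER = 4
MPLS_LABEL = 4


def clamp_mss(advertised_mss: int, clamp_to: int) -> int:
    """MSS clamping never raises the advertised value, it only lowers it."""
    return min(advertised_mss, clamp_to)


def size_on_mpls_link(mss: int, labels: int = 2) -> int:
    """Full-size TCP segment after inner IP, GRE, outer IP and MPLS labels."""
    gre_packet = mss + TCP_HEADER + IP_HEADER + GRE_HEADER + IP_HEADER
    return gre_packet + labels * MPLS_LABEL


mss = clamp_mss(1436, 1400)
print(mss)                      # 1400
print(size_on_mpls_link(1436))  # 1508 -> too large for a 1500-byte link
print(size_on_mpls_link(mss))   # 1472 -> fits, no fragmentation needed
```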
5. Additional tests run on Juniper
On Juniper, I tried several other options that, theoretically, represent solutions to this quiz:
1. use “no-gre-path-mtu-discovery” to disable PMTUD for GRE. This can be applied either on the GRE interface or under “system internet-options”.
For unknown reasons (I suspect the virtual hardware that I used for testing), this solution did not work for me.
2. use “no-path-mtu-discovery” to disable PMTUD for all outgoing TCP connections. This can also be applied either on the GRE interface or under “system internet-options”.
Although this may look like a solution at first sight, it does not work, because it disables PMTUD for the TCP session (BGP, in our case), which is the inner header, not for the outer IP header.
Last but not least, let me mention here that, with current IOS versions, BGP performs PMTUD by default:
CE-HQ#sh ip bgp nei 192.168.12.2 | i path-mtu
  Transport(tcp) path-mtu-discovery is enabled
CE-HQ#
- Source : http://www.costiser.ro