February 2014

Access-lists vs Prefix-lists

The main purpose of this post is to show how prefix lists work and how to decipher them vs regular access lists. Access-lists do a great job on Cisco devices, not just for security but all kinds of route filtering, QoS and so on.

A prefix-list is a bit different from an access-list, and it’s important to know the differences and when to use each.

I’ve created the following simple topology to illustrate what I’m going to be doing. There are 2 routers, both running BGP. Router1 will have numerous loopbacks with IP addresses that will be advertised into the BGP process. On router2 I’ll use various access-lists and prefix-lists to see what kind of results I get. Remember though that prefix-lists can be used with other routing protocols and not just BGP.

This is the topology:

[Figure: Access-lists vs Prefix-lists topology]

This is the config on each:

R1#sh run | begin bgp
router bgp 100
 no synchronization
 bgp log-neighbor-changes
 network 1.1.1.1 mask 255.255.255.255
 neighbor 10.1.1.10 remote-as 200
 no auto-summary

R2#sh run | begin bgp
router bgp 200
 no synchronization
 bgp log-neighbor-changes
 neighbor 10.1.1.9 remote-as 100
 no auto-summary

I’ll put the following subnets on R1 and advertise them in BGP:

192.168.1.1/24
192.168.2.1/24
192.168.3.1/25
192.168.3.129/25
192.168.4.1/25
192.168.4.129/26
192.168.4.193/26

#R1
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
!
interface Loopback1
 ip address 192.168.1.1 255.255.255.0
!
interface Loopback2
 ip address 192.168.2.1 255.255.255.0
!
interface Loopback3
 ip address 192.168.3.1 255.255.255.128
!
interface Loopback4
 ip address 192.168.3.129 255.255.255.128
!
interface Loopback5
 ip address 192.168.4.1 255.255.255.128
!         
interface Loopback7
 ip address 192.168.4.129 255.255.255.192
!         
interface Loopback8
 ip address 192.168.4.193 255.255.255.192

This is R1’s BGP config now:

R1#sh run | begin bgp
router bgp 100
 no synchronization
 bgp log-neighbor-changes
 network 1.1.1.1 mask 255.255.255.255
 network 192.168.1.0
 network 192.168.2.0
 network 192.168.3.0 mask 255.255.255.128
 network 192.168.3.128 mask 255.255.255.128
 network 192.168.4.0 mask 255.255.255.128
 network 192.168.4.128 mask 255.255.255.192
 network 192.168.4.192 mask 255.255.255.192
 neighbor 10.1.1.10 remote-as 200
 no auto-summary

On Router2, we can see the routes advertised:

R2#sh ip bgp 
BGP table version is 10, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.1/32       10.1.1.9                 0             0 100 i
*> 192.168.1.0      10.1.1.9                 0             0 100 i
*> 192.168.2.0      10.1.1.9                 0             0 100 i
*> 192.168.3.0/25   10.1.1.9                 0             0 100 i
*> 192.168.3.128/25 10.1.1.9                 0             0 100 i
*> 192.168.4.0/25   10.1.1.9                 0             0 100 i
*> 192.168.4.128/26 10.1.1.9                 0             0 100 i
*> 192.168.4.192/26 10.1.1.9                 0             0 100 i

Let’s say I want to filter out the network 192.168.4.0/25. If I use an access-list I need to do it as follows. Create the access list:

R2#conf t
R2(config)#access-list 5 deny   192.168.4.0 0.0.0.127
R2(config)#access-list 5 permit any

Add a rule to the BGP config:

R2#sh run | begin bgp
router bgp 200
 no synchronization
 bgp log-neighbor-changes
 neighbor 10.1.1.9 remote-as 100
 neighbor 10.1.1.9 distribute-list 5 in
 no auto-summary

You can see that the 192.168.4.0/25 route has now been filtered out:

R2#sh ip bgp
   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.1/32       10.1.1.9                 0             0 100 i
*> 192.168.1.0      10.1.1.9                 0             0 100 i
*> 192.168.2.0      10.1.1.9                 0             0 100 i
*> 192.168.3.0/25   10.1.1.9                 0             0 100 i
*> 192.168.3.128/25 10.1.1.9                 0             0 100 i
*> 192.168.4.128/26 10.1.1.9                 0             0 100 i
*> 192.168.4.192/26 10.1.1.9                 0             0 100 i
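To see why that single ACL line only removes 192.168.4.0/25, here is a minimal Python sketch of wildcard matching. It assumes the usual behaviour of a standard ACL in a distribute-list, which compares only the network address of the route and ignores the prefix length. The function name `acl_matches` is made up for illustration.

```python
import ipaddress

def acl_matches(route_network: str, acl_address: str, wildcard: str) -> bool:
    """Return True if the route's network address matches address/wildcard.

    Bits set to 1 in the wildcard are "don't care" bits; every other bit
    must be identical in the route and in the ACL address.
    """
    net = int(ipaddress.IPv4Address(route_network))
    addr = int(ipaddress.IPv4Address(acl_address))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (net & care) == (addr & care)

# Network addresses of the three 192.168.4.x routes advertised by R1
routes = ["192.168.4.0", "192.168.4.128", "192.168.4.192"]

# access-list 5 deny 192.168.4.0 0.0.0.127
denied = [r for r in routes if acl_matches(r, "192.168.4.0", "0.0.0.127")]
print(denied)  # ['192.168.4.0']
```

The two /26 networks start at .128 and .192, so bit 7 of the last octet differs from the ACL address and they fall through to the permit any line.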

Let’s say I wanted to filter out the 192.168.4.x/26’s as well. To do so I’d have to add another line to my access-list for each network. With a prefix-list it’s much easier. Let’s remove the access-list and start again.

NB: Prefix-lists, like access-lists, have an implicit DENY at the end. In an ACL you’d place a permit any at the end. The prefix-list version of this is permit 0.0.0.0/0 le 32.

First I’ll create the prefix-list:

R2(config)#ip prefix-list exclude_4 seq 5 deny 192.168.4.0/24 ge 25 le 26
R2(config)#ip prefix-list exclude_4 seq 10 permit 0.0.0.0/0 le 32

Now I’ll apply it to the BGP process:

router bgp 200
 no synchronization
 bgp log-neighbor-changes
 neighbor 10.1.1.9 remote-as 100
 neighbor 10.1.1.9 prefix-list exclude_4 in
 no auto-summary

When checking the BGP table I see the following:

R2#sh ip bgp
   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.1/32       10.1.1.9                 0             0 100 i
*> 192.168.1.0      10.1.1.9                 0             0 100 i
*> 192.168.2.0      10.1.1.9                 0             0 100 i
*> 192.168.3.0/25   10.1.1.9                 0             0 100 i
*> 192.168.3.128/25 10.1.1.9                 0             0 100 i

You can see that 192.168.4.0/25 and both 192.168.4.x/26s are gone thanks to the prefix-list.

The basics of a prefix-list are as follows. If I write

ip prefix-list exclude_4 seq 5 deny 192.168.4.0/24 ge 25 le 26

The /24 tells IOS to match on the first 24 bits, i.e. 192.168.4. The ge 25 le 26 part then tells IOS to match only those prefixes that have a subnet mask of /25 or /26. If I had another network advertised, say 192.168.4.192/27, it would NOT match: even though the 192.168.4 part matches, it has a subnet mask of /27.
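The matching rule can be sketched in a few lines of Python. This is only a model of the logic described here, not how IOS implements it, and the helper names `entry_matches` and `evaluate` are invented for the example. It assumes the usual ge/le semantics: no ge/le means an exact length match, le alone means "entry length up to le", ge alone means "ge up to 32".

```python
import ipaddress

def entry_matches(route, prefix, ge=None, le=None):
    """Check one prefix-list entry against a route such as '192.168.4.0/25'."""
    route_net = ipaddress.ip_network(route)
    entry_net = ipaddress.ip_network(prefix)
    if ge is None and le is None:
        low = high = entry_net.prefixlen          # exact length only
    else:
        low = entry_net.prefixlen if ge is None else ge
        high = 32 if le is None else le
    # The leading bits must match AND the mask length must be in range
    return route_net.subnet_of(entry_net) and low <= route_net.prefixlen <= high

def evaluate(route, entries):
    """First matching entry wins; implicit deny at the end."""
    for action, prefix, ge, le in entries:
        if entry_matches(route, prefix, ge, le):
            return action
    return "deny"

# ip prefix-list exclude_4 seq 5 deny 192.168.4.0/24 ge 25 le 26
# ip prefix-list exclude_4 seq 10 permit 0.0.0.0/0 le 32
exclude_4 = [
    ("deny", "192.168.4.0/24", 25, 26),
    ("permit", "0.0.0.0/0", None, 32),
]

routes = [
    "1.1.1.1/32", "192.168.1.0/24", "192.168.2.0/24",
    "192.168.3.0/25", "192.168.3.128/25",
    "192.168.4.0/25", "192.168.4.128/26", "192.168.4.192/26",
]
permitted = [r for r in routes if evaluate(r, exclude_4) == "permit"]
print(permitted)
# ['1.1.1.1/32', '192.168.1.0/24', '192.168.2.0/24', '192.168.3.0/25', '192.168.3.128/25']
print(evaluate("192.168.4.192/27", exclude_4))  # permit (a /27 is outside ge 25 le 26)
```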

Let’s say I now wanted to match all 192.168.x.x/25 prefixes but leave the /26’s in place. This is easy with a prefix-list:

R2(config)#ip prefix-list exclude_4 seq 5 deny 192.168.0.0/16 ge 25 le 25
R2(config)#ip prefix-list exclude_4 seq 10 permit 0.0.0.0/0 le 32

I’ve told the IOS to only match on the first 16 bits, i.e. 192.168 – I then told IOS to only match those prefixes that have a subnet mask of /25. If I apply this to my BGP process I can see that it works as expected:

R2#sh ip bgp          
   Network          Next Hop            Metric LocPrf Weight Path
*> 1.1.1.1/32       10.1.1.9                 0             0 100 i
*> 192.168.1.0      10.1.1.9                 0             0 100 i
*> 192.168.2.0      10.1.1.9                 0             0 100 i
*> 192.168.4.128/26 10.1.1.9                 0             0 100 i
*> 192.168.4.192/26 10.1.1.9                 0             0 100 i

Only the 3 /25’s have disappeared, everything else is still there.

You can also do all of this with extended access-lists, but it’s so much more work, so why make life difficult? Once you understand the logic of prefix-lists it becomes very easy.

#Source: http://mellowd.co.uk/ccie/?p=447

Cisco XRv has been released

Many of you have probably heard of Cisco VIRL or what is now called Cisco Modeling
Labs (CML). CML is due for release later this year and is supposed to include
support for IOS, IOS XE, NXOS and IOS XR.

Last year Cisco released the Cloud Services Router (CSR1000v) which is a virtual
router running IOS XE.

Now Cisco has released the XRv which is a virtual router running IOS XR. This is
great news for anyone that wants to learn IOS XR or to test changes and to try
different concepts on IOS XR.

The VM can be run on ESXi or KVM/QEMU which gives flexibility. The installation
guide is located here.

Every VM needs 3GB of RAM to start but when it’s running it seems to only use 1GB
and has a very low CPU usage.

At this moment the download is restricted but Cisco is working on removing the restriction.
Expect this to be solved this coming week. XRv will be available in three packages.

There is a bandwidth restriction of 2 Mbit/s, which should not be an issue for labbing purposes.

The initial release has some nasty bugs, so make sure not to use more than one CPU.

To be able to run the image the server must meet the following requirements.

The complete release notes are here.

After you have deployed the VM, which can be done with OVF template or by creating
a VM and using VMDK, don’t forget to create a serial interface and tie this to the
network or you will not be able to see any output from the VM. This is described
in the release notes.

Cisco also provides a free workbook with some basic concepts for IOS XR which can be
found here.

I would also recommend this IOS XR workbook by Jeffrey Fry, which is also free. It’s a great
contribution to the community and it can be found here.

Have fun learning IOS XR!

Source : http://lostintransit.se/

BGP Best Path Selection Algorithm

BGP is the protocol used to announce prefixes throughout the Internet. It’s a very robust protocol, and very useful for carrying a lot of prefixes, such as the Internet prefixes or the internal client prefixes of an ISP.

When a prefix is received in BGP, the path passes through two steps before being chosen as a candidate to populate the RIB.

The first step consists of checking whether the path is valid. If it is, the prefix will get into the BGP table, and the second step of selection will start.

In order to pass this first check, the path must meet the following requirements:

- The prefix must not be marked as “not-synchronized”
- There must be a route in the RIB to reach the next-hop
- For prefixes learned through eBGP sessions, the local ASN must not be in the AS_PATH of the prefix

In the second step, the best path to reach the prefix is selected. If there is only one path, no comparison needed. If there are many paths to reach the prefix, there is a special algorithm that BGP uses to select the best path, and this is what I want to talk about.

This algorithm dictates the following:

1. Prefer the path with the highest WEIGHT
2. Prefer the path with the highest LOCAL PREFERENCE
3. Prefer the path that was locally originated via a network or redistribute command over one originated via the aggregate-address command
4. Prefer the path with the shortest AS_PATH
5. Prefer the path with the lowest ORIGIN type
6. Prefer the path with the lowest MULTI-EXIT DISCRIMINATOR (MED)
7. Prefer eBGP over iBGP
8. Prefer the path with the lowest IGP metric to the BGP next-hop
9. When both paths are external, prefer the one that was received first (the oldest one)
10. Prefer the route that comes from the BGP router with the lowest router ID
11. If the originator or router ID is the same for multiple paths, prefer the path with the minimum cluster list length
12. Prefer the path that comes from the lowest neighbor address

As you can see, the selection process is quite long, although in most cases the selection doesn’t go further than point 8.
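Because each step is only a tie-breaker for the one before it, the whole algorithm behaves like a sort on a tuple of attributes. Below is a rough Python sketch of that idea. The `Path` class and its field names are invented for illustration, and the sketch simplifies on purpose: it skips the "oldest eBGP path" step (that would need timestamps), it compares MED between all paths regardless of neighboring AS, and it treats "locally originated" as a single flag.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Path:
    # Hypothetical container for the attributes used by the decision process
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path_len: int = 0
    origin: int = 0            # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"
    cluster_list_len: int = 0
    neighbor_addr: str = "0.0.0.0"

def sort_key(p: Path):
    """Smaller tuple = better path; 'highest wins' attributes are negated."""
    return (
        -p.weight,                                   # highest weight
        -p.local_pref,                               # highest local preference
        not p.locally_originated,                    # locally originated first
        p.as_path_len,                               # shortest AS_PATH
        p.origin,                                    # lowest origin type
        p.med,                                       # lowest MED (simplified)
        not p.is_ebgp,                               # eBGP over iBGP
        p.igp_metric,                                # lowest IGP metric to next-hop
        int(ipaddress.IPv4Address(p.router_id)),     # lowest router ID
        p.cluster_list_len,                          # shortest cluster list
        int(ipaddress.IPv4Address(p.neighbor_addr)), # lowest neighbor address
    )

def best_path(paths):
    return min(paths, key=sort_key)

a = Path("via-AS100", as_path_len=3, router_id="1.1.1.1", neighbor_addr="10.1.1.9")
b = Path("via-AS300", as_path_len=2, router_id="3.3.3.3", neighbor_addr="10.1.3.9")
c = Path("via-AS400", local_pref=200, as_path_len=5, router_id="4.4.4.4", neighbor_addr="10.1.4.9")

print(best_path([a, b]).name)     # via-AS300 (shorter AS_PATH)
print(best_path([a, b, c]).name)  # via-AS400 (higher LOCAL PREFERENCE beats AS_PATH)
```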

MPLS : The Core: Intermediate System - Intermediate System -- Part...

ISIS was originally designed for the Open System Interconnect (OSI) protocol suite. The Connectionless Network Service (CLNS) is used & ...

ISIS Fast Convergence

In IS-IS, fast convergence is achieved through the following mechanisms:

(1) Incremental SPF (ISPF)

(2) Partial route calculation (PRC)

(3) LSP fast flooding

(4) Intelligent timer (including the SPF intelligent timer and intelligent timer for generating LSPs)

In addition, fast fault detection can be achieved through BFD for IS-IS, and optimizing the IS-IS network can achieve fast convergence.

Configurations of IS-IS Fast Convergence:

isis 1
 is-level level-2
 cost-style wide
 timer lsp-generation 1 50 50 level-2 //Set the interval for generating the same LSP fragment.
 flash-flood 15 level-2 //Configure LSP fast flooding.
 bfd all-interfaces enable //Enable BFD for IS-IS to configure BFD for all IS-IS interfaces.
 network-entity 22.0000.0000.0001.00
 timer spf 1 50 50 //Set the interval for SPF calculation.
 traffic-eng level-2 //Configure IS-IS extension to enable TE on the entire network.

interface GigabitEthernet1/0/0
 undo shutdown
 ip address 22.1.3.1 255.255.255.0
 isis enable 1
 isis bfd enable //Enable BFD for IS-IS for a single interface.

interface GigabitEthernet1/0/1
 undo shutdown
 ip address 22.1.2.1 255.255.255.0
 isis circuit-type p2p //Change the type of a broadcast interface to P2P to avoid DIS election and thus speed up route convergence.
 isis enable 1
 isis bfd enable

Notes: The ISPF and PRC algorithms are the default algorithms of IS-IS and are enabled by default. Therefore, they do not need to be configured.

The timer lsp-generation command is used to set the interval for generating the same LSP fragment.

The parameter init-interval specifies the initial interval for generating the same LSP fragment. The parameter incr-interval specifies the incremental interval for generating the same LSP fragment. One incr-interval is added each time the network topology changes.

When SPF calculation is performed for the first time, the interval is init-interval. Each time the routes change, the interval is increased by incr-interval until it reaches max-interval. After the interval has reached max-interval three times, it is reduced to init-interval again.

In IS-IS, a device recalculates the shortest path when the LSDB changes. Frequent route calculations consume a lot of resources and thus degrade the system performance. Delaying SPF calculation can improve efficiency in route calculation and reduce the consumption of system resources. If the delay in route calculation is too long, route convergence becomes slow.

To speed up route convergence without affecting the efficiency of routers, you can use an intelligent timer in SPF calculation. This timer automatically adjusts the interval according to the frequency of changes in the LSDB.
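As a rough illustration, the back-off behaviour described in these notes can be simulated in Python. This is only a model of the description above (start at init-interval, add incr-interval per change, cap at max-interval, reset after three runs at the maximum); the exact progression on a real device may differ, so check the vendor documentation. The function name and the millisecond values are chosen for the example.

```python
def intelligent_timer(max_ms, init_ms, incr_ms, changes):
    """Return the interval used for each of `changes` consecutive events."""
    intervals = []
    current = init_ms
    times_at_max = 0
    for _ in range(changes):
        intervals.append(current)
        if current >= max_ms:
            times_at_max += 1
            if times_at_max == 3:
                # After three runs at the maximum, fall back to the start
                current = init_ms
                times_at_max = 0
        else:
            current = min(current + incr_ms, max_ms)
    return intervals

# Small numbers so the whole cycle is visible
print(intelligent_timer(max_ms=200, init_ms=50, incr_ms=50, changes=8))
# [50, 100, 150, 200, 200, 200, 50, 100]
```

The point of the design is visible in the output: a single change is handled almost immediately, while a burst of changes is progressively slowed down.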

Parameters of IS-IS Fast Convergence:

How Could MTU affect BGP Sessions

A company using multi-vendor routing platforms (Cisco and Juniper) has an HQ and multiple spoke sites connected by an MPLS provider. Each remote site has a GRE tunnel to the headquarters (HQ) and runs BGP over it.

After attending security training, your Security Team raised concerns about ICMP-based attacks and decided to block ICMP messages on all physical interfaces connected to outside networks, on all border routers, in all sites.
Some time later, all the BGP sessions between Cisco and Juniper devices started flapping up/down, impacting the connectivity between HQ and Juniper-based sites, while the BGP sessions between HQ (Cisco-based) and other Cisco-based sites were ok.

As most of you spotted already, dropping all ICMP messages affects Path MTU Discovery (PMTUD), which in turn impacts end-to-end connectivity (in this case, the BGP session)… but why is there a difference between Cisco and Juniper? We will see that, but first let’s do some simple math:

- by default, Ethernet MTU is 1500 bytes (a full Ethernet frame is 1518 bytes = 1500 bytes payload + 14 bytes Ethernet II header + 4 bytes checksum)
- by default, GRE tunnel MTU is 1476 = 1500 – (20 bytes IP header + 4 bytes GRE header)
- MPLS adds a 4-byte overhead for each label. If the MPLS MTU is not configured, only 1492 bytes are left for the IP packet (accounting for 2 labels)
- by default, the TCP MSS (Maximum Segment Size) is automatically calculated by subtracting 40 (20-byte IP header + 20-byte TCP header) from the MTU of the outgoing interface:
  - TCP MSS is the maximum size of the TCP payload
  - TCP MSS is negotiated (the lower value is chosen) between source and destination during the TCP 3-way handshake, in the SYN & SYN/ACK packets
  - for example: the MSS for a TCP session leaving through an Ethernet interface would be 1500 – 40 = 1460 bytes
  - another example: the MSS for a TCP session leaving through a GRE tunnel interface would be 1476 – 40 = 1436 bytes

Note the entire frame size of 1522 = 1508 (packet with 2 MPLS labels) + 14 (Ethernet II header)

Now, for the BGP sessions, the math is like this:

- the largest TCP segment carrying BGP UPDATE data would be 1436 bytes, which is the TCP MSS for BGP over a GRE tunnel (see above)
- when such a packet reaches the PE, its size would be 1500 bytes = 1436 (BGP payload) + 20 (TCP header) + 20 (IP header) + 4 (GRE header) + 20 (outer IP header)
- the quiz does not make any reference to the MTU size inside the MPLS cloud as there is no MTU configuration on the MPLS links (this is done on purpose to create the quiz) => a packet of 1500 bytes is too large for the MPLS links (because the PE needs to add 2 labels = another 8 bytes)
- as a result, the PE will need to fragment the packet carrying the BGP UPDATE message
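The arithmetic above is easy to get wrong by one header, so here it is as a small Python calculation. The constant names are my own; the values are the header sizes quoted in the text (IPv4 and TCP headers without options).

```python
IP_HEADER = 20      # IPv4 header without options
TCP_HEADER = 20     # TCP header without options
GRE_HEADER = 4      # basic GRE header
MPLS_LABEL = 4      # one MPLS label
ETHERNET_MTU = 1500

# MTU of the GRE tunnel interface: outer IP + GRE headers are subtracted
gre_tunnel_mtu = ETHERNET_MTU - (IP_HEADER + GRE_HEADER)

# MSS for a TCP session (BGP) that leaves through the tunnel
tcp_mss_over_gre = gre_tunnel_mtu - (IP_HEADER + TCP_HEADER)

# Size of a full-MSS packet once it has been GRE-encapsulated and reaches the PE
packet_at_pe = tcp_mss_over_gre + TCP_HEADER + IP_HEADER + GRE_HEADER + IP_HEADER

# The PE pushes two labels before sending it into the MPLS core
labeled_packet = packet_at_pe + 2 * MPLS_LABEL

print(gre_tunnel_mtu)    # 1476
print(tcp_mss_over_gre)  # 1436
print(packet_at_pe)      # 1500
print(labeled_packet)    # 1508
print(labeled_packet <= ETHERNET_MTU)  # False: too big for a 1500-byte core link
```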

Path MTU Discovery (PMTUD)

For completeness, here is Path MTU Discovery in short:

- the source host sets the DF-bit in the IP header to indicate that the packet must not be fragmented in transit
- intermediate routers (the PE in our quiz) will drop these large packets, because they exceed the MTU of the outgoing interface and because the routers are not allowed to fragment them due to the DF-bit setting
- intermediate routers will send an ICMP “Fragmentation Needed and DF set” message back to the source host (the CE router for the BGP session, in our quiz)
- very important: the ICMP “Fragmentation Needed” message also contains the recommended MTU value
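A toy Python model makes it clear why blocking ICMP breaks this mechanism. It is a sketch under simple assumptions: the path is just a list of link MTUs, the DF-bit is always set, and the function name `send_with_pmtud` is hypothetical.

```python
def send_with_pmtud(packet_size, link_mtus, icmp_allowed, max_tries=5):
    """Return the packet size that finally gets through, or None (black hole)."""
    size = packet_size
    for _ in range(max_tries):
        # First link along the path whose MTU is smaller than the packet
        bottleneck = next((mtu for mtu in link_mtus if size > mtu), None)
        if bottleneck is None:
            return size              # packet fits every link, delivered
        # DF-bit is set: the router drops the packet and sends back
        # ICMP "Fragmentation Needed" carrying the MTU of that link
        if not icmp_allowed:
            return None              # sender never learns the MTU, keeps failing
        size = bottleneck            # sender retries with the advertised MTU
    return None

path = [1500, 1492, 1500]            # CE-PE link, MPLS core, PE-CE link

print(send_with_pmtud(1500, path, icmp_allowed=True))   # 1492
print(send_with_pmtud(1500, path, icmp_allowed=False))  # None
```

With ICMP filtered, the sender keeps retransmitting the same oversized packet, which is what eventually makes the BGP hold timer expire.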

Cisco vs. Juniper

The difference between the BGP sessions established between Cisco-only sites (that were not impacted) and Cisco-Juniper ones (sites impacted) lies in the DF-bit setting!

By default, Cisco does not set the DF-bit for GRE tunnels => this means that a 1500-byte packet carrying a BGP UPDATE would be fragmented by the PE before being sent over the 1492-byte MPLS links.
Juniper, on the other hand, sets the DF-bit for GRE tunnels by default => so a 1500-byte packet carrying a BGP UPDATE with the DF-bit set would not fit the 1492-byte MPLS links. The PEs would drop it and send back to the CEs an ICMP “Fragmentation Needed” indicating the MTU of the outgoing link (see above screenshot: 1492).

This is visible on both Cisco PE and Juniper CE:
- debugging ICMP on PE:

R5#
*Mar 1 00:22:32.851: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar 1 00:22:33.747: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
R5#
*Mar 1 00:22:40.291: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2
*Mar 1 00:22:41.699: ICMP: dst (192.168.255.1) frag. needed and DF set unreachable sent to 192.168.255.2

- firewall logs on Juniper CE with the drops:

root@Router-1> show firewall
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                      Bytes   Packets
deny-icmp-ge-0/0/0.0-i    22400   400

root@Router-1> show firewall log
Log :
Time      Filter                  Action Interface   Protocol Src Addr     Dest Addr
23:14:23  DENY_ICMP-ge-0/0/0.0-i  D      ge-0/0/0.0  ICMP     192.168.2.1  192.168.255.2
23:14:07  DENY_ICMP-ge-0/0/0.0-i  D      ge-0/0/0.0  ICMP     192.168.2.1  192.168.255.2
23:13:59  DENY_ICMP-ge-0/0/0.0-i  D      ge-0/0/0.0  ICMP     192.168.2.1  192.168.255.2
...

root@Router-1> show firewall log detail
Time of Log: 2014-01-24 23:14:23 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4
Time of Log: 2014-01-24 23:14:07 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4

If you are curious how the BGP session behaves on each end, here it is:

Cisco CE in HQ:

The BGP session gets established but it does not learn any routes. Notice:
- the 0 counter in the PfxRcd column
- the Up/Down timer never gets above “1:29” = 90 sec (the BGP default holdtime)

CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes
%BGP-5-NBR_RESET: Neighbor 192.168.12.2 reset (BGP Notification sent)
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down BGP Notification sent
CE-HQ#
Jan 24 22:42:18.519: %BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp summary
...
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.12.2    4 65200       2       8     1935    0    0 00:01:27        0
192.168.13.2    4 65300      11      15     1935    0    0 00:04:35      848
CE-HQ#
%BGP-3-NOTIFICATION: sent to neighbor 192.168.12.2 4/0 (hold time expired) 0 bytes

Juniper CE in remote site:

The BGP session gets established and prefixes are learned over it. Notice that the Flaps counter is non-zero.

root@Router-1> show bgp summary
Groups: 1 Peers: 1 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0              1934       1934          0          0          0          0
Peer            AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received...
192.168.12.1    65000       9         16       0      10        1:27 1934/1934/1934/0

root@Router-1> show firewall
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                      Bytes   Packets
deny-icmp-ge-0/0/0.0-i    5152    92

root@Router-1> show firewall log detail
Time of Log: 2014-01-24 22:46:38 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: discard, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4

Solutions

1. Set a higher MTU inside the MPLS cloud

As mentioned above, the MPLS MTU was not set to take into account the labels, for the sake of this quiz. Considering the default MTU of physical interface of 1500, the MPLS MTU would be 1492 (for 2 labels). This value is easily seen in the ICMP “Fragmentation Needed” messages, as shown above.

Usually MPLS providers do provide an MTU of 1500 bytes to their customers. To do this we need to increase the MPLS MTU to at least 1508. Usually you set the MPLS MTU to 1516 (to accommodate 4 labels), but for this quiz we use only 2 MPLS labels:

PE-2#sh mpls interfaces
Interface              IP            Tunnel   Operational
FastEthernet0/1        Yes (ldp)     No       Yes
PE-2#
PE-2#conf t
Enter configuration commands, one per line. End with CNTL/Z.
PE-2(config)#int fa0/1
PE-2(config-if)#mpls mtu 1508
PE-2#
*Mar 1 00:47:45.651: %SYS-5-CONFIG_I: Configured from console by console
PE-2#sh mpls int detail
Interface FastEthernet0/1:
        IP labeling enabled (ldp):
          Interface config
        LSP Tunnel labeling not enabled
        BGP tagging not enabled
        Tagging operational
        Fast Switching Vectors:
          IP to MPLS Fast Switching Vector
          MPLS Turbo Vector
        MTU = 1508
PE-2#

This is, by far, the best solution, because it avoids fragmentation!!

2. Allow ICMP “Fragmentation Needed” into the ACL (on Juniper side)

Another solution to this problem is to modify the access-list / Juniper filter to permit the ICMP messages of type 3 (destination unreachable), code 4 (fragmentation needed), which are used by PMTUD (Path MTU Discovery).
In general, it’s good practice to allow the ICMP “Fragmentation Needed” messages in access-lists whenever the ICMP protocol is filtered.
For this quiz, these ICMP messages need to be allowed by the firewall filter only on the Juniper devices (because it’s the Juniper that sets the DF-bit in the GRE packets).

root@Router-1> show configuration firewall filter DENY_ICMP
interface-specific;
term ALLOW_PMTUD {
    from {
        protocol icmp;
        icmp-type unreachable;
        icmp-code fragmentation-needed;
    }
    then {
        count allow-pmtud;
        log;
        accept;
    }
}
term DENY_ICMP {
    from {
        protocol icmp;
    }
    then {
        count deny-icmp;
        log;
        discard;
    }
}
term ALLOW_ALL {
    then accept;
}

After committing this change, the BGP sessions between Cisco-HQ and the Juniper sites became stable. The firewall counters and logs show that ICMP “Fragmentation Needed” messages are allowed on Juniper:

root@Router-1> show firewall
Filter: __default_bpdu_filter__
Filter: DENY_ICMP-ge-0/0/0.0-i
Counters:
Name                        Bytes   Packets
allow-pmtud-ge-0/0/0.0-i    224     4
deny-icmp-ge-0/0/0.0-i      168     2

root@Router-1> show firewall log detail
Time of Log: 2014-01-22 22:55:30 UTC, Filter: DENY_ICMP-ge-0/0/0.0-i, Filter action: accept, Name of interface: ge-0/0/0.0
Name of protocol: ICMP, Packet Length: 54189, Source address: 192.168.2.1, Destination address: 192.168.255.2
ICMP type: 3, ICMP code: 4

3. Apply the “allow-fragmentation” on the Tunnel Interface (on Juniper)

By default, GRE packets will be dropped if they exceed the MTU of the outgoing physical interface. Instead of dropping them, you can tell the Juniper router to split them into more IP fragments – this is achieved with command “allow-fragmentation” under the gr- (tunnel) interface:

root@Router-1> show configuration interfaces gr-0/0/0
unit 0 {
    tunnel {
        source 192.168.255.2;
        destination 192.168.255.1;
        allow-fragmentation;
    }
    family inet {
        address 192.168.12.2/30;
    }
}

Once you allow fragmentation of the GRE packets, the router no longer sets the DF-bit on them. This is why I consider this solution to be more of a “workaround”: you don’t actually solve the problem, because large BGP UPDATE messages are still sent and they get fragmented on the MPLS PE routers.
A real solution would be to avoid fragmentation!

4. Implement MSS Clamping

Another good solution to avoid fragmentation is to use “MSS Clamping”. This feature modifies (usually decreases) the MSS value in the SYN and SYN/ACK packets to the configured value. As shown above, the MSS value for the BGP sessions that run over GRE tunnels is 1436 (= 1476 (GRE MTU) – 40 (IP+TCP headers)).
On Cisco devices, this is implemented at the global level with “ip tcp mss” or at the interface level with “ip tcp adjust-mss”:

CE-HQ(config)#ip tcp mss 1400
CE-HQ(config)#end
CE-HQ#
CE-HQ#clear ip bgp 192.168.12.2
CE-HQ#
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Down User reset
%BGP_SESSION-5-ADJCHANGE: neighbor 192.168.12.2 IPv4 Unicast topology base removed from session User reset
%BGP-5-ADJCHANGE: neighbor 192.168.12.2 Up
CE-HQ#sh ip bgp s
CE-HQ#sh ip bgp summary
...
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.168.12.2    4 65200      13      12     2835    0    0 00:00:19      999
192.168.13.2    4 65300      83      88     2835    0    0 01:10:53      848
CE-HQ#
CE-HQ#sh ip bgp nei 192.168.12.2 | i max
  Number of NLRIs in the update sent: max 1010, min 0
minRTT: 48 ms, maxRTT: 484 ms, ACK hold: 200 ms
Datagrams (max data segment is 1400 bytes):
CE-HQ#
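To check that an MSS of 1400 is small enough, the earlier arithmetic can be repeated with the clamped value. This is a quick Python sketch; the function names are made up, and the 1492-byte figure is the usable MTU on the MPLS links from this quiz (1500 minus 2 labels).

```python
IP_HEADER, TCP_HEADER, GRE_HEADER = 20, 20, 4

def packet_size_at_pe(mss):
    """Size of a full-MSS TCP segment after GRE encapsulation."""
    return mss + TCP_HEADER + IP_HEADER + GRE_HEADER + IP_HEADER

def fits_mpls_core(mss, usable_mtu=1492):
    """True if the encapsulated packet fits without fragmentation."""
    return packet_size_at_pe(mss) <= usable_mtu

for mss in (1436, 1400):
    print(mss, packet_size_at_pe(mss), fits_mpls_core(mss))
# 1436 1500 False
# 1400 1464 True
```

With these header sizes, any MSS up to 1428 (1492 – 64) would fit; 1400 simply leaves some headroom.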

This is a screenshot of the TCP 3-way handshake for the BGP between HQ and remote-site:

5. Additional tests run on Juniper

On Juniper, I tried several other options that, theoretically, represent solutions to this quiz:

1. use “no-gre-path-mtu-discovery” to disable PMTUD for GRE. This can be applied either on the GRE interface or under “system internet-options”
For unknown reasons (I suspect due to virtual hardware that I used for testing) this solution did not work for me.

2. use “no-path-mtu-discovery” to disable PMTUD for all outgoing TCP connections.
This can also be applied either on the GRE interface or under “system internet-options”.
Although this may look like a solution at first sight, it does not work because it disables PMTUD for the TCP session (the BGP session, in our case), which is the inner header, and not for the outer IP header.

Last but not least, let me mention here that, with current IOS version, BGP performs PMTUD by default:

CE-HQ#sh ip bgp nei 192.168.12.2 | i path-mtu
  Transport(tcp) path-mtu-discovery is enabled
CE-HQ#

- Source : http://www.costiser.ro
