These days I do minimal networking, but my college diploma is in an IP Engineering discipline, and prior to that, I had the opportunity to complete the Cisco Networking Academy curriculum in high school. All of that to say that while I’ve not pursued that field, I have a decent foundation and understanding of how those things operate. At work, I have my hand in firewalls, switches, DNS, DHCP, routing, subnetting and other wonderful things, but it’s not super advanced work.
For about 8 months now, we’ve had a dedicated link set up between our main office and a remote site. It will replace an internet-based VPN tunnel that currently carries the traffic, but it’s still not in use because of problems. More specifically, whenever I switch traffic to route over the dedicated link, service becomes extremely problematic. Ping to any system works, but protocols like RDP, SMB and replication (AD & Exchange) work intermittently at best (RDP), or not at all (SMB, replication). I could RDP to ServerA, but not to ServerB, even though they’re both VMs hosted on the same Hyper-V server at the remote site. And while RDP worked to the DC at the remote site and I could ping all the other DCs, repadmin /replsummary showed that replication was no longer working.
Here’s a simple topology diagram in ASCII art:
Head Office core switch - Head Office firewall - VPN Tunnel over internet - SiteA firewall - SiteA core switch
vs.
Head Office core switch - dedicated L2 link - SiteA core switch
I knew this was going to be a pain to troubleshoot, especially since I couldn’t take the whole DR site offline whenever I wanted during the day to conduct tests. I quintuple-checked the configuration and didn’t find any errors. I compared it line by line to another site where we have this same setup (a dedicated link between core switches), and everything was configured correctly.
I finally decided to bite the bullet and spent a couple of late nights gathering information so that I could quantify the precise problem. Because the physical layer was out of my control, the data link layer seemed fine (also out of my control…mostly), and ping worked at layer 3, I was pretty confident the issue originated at the network layer or higher.
I started by looking at packet sizes. I knew that a ping was a tiny packet, while things like replication and SMB would probably exceed the MTU. I ran some tests with ping -f -l (set the Don’t Fragment bit and specify the payload size) to determine the maximum packet size that could successfully transit the link.
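In practice that was just a series of manual pings with increasing payload sizes, but the same probing can be scripted. Here’s a rough sketch from the cmd prompt (the 192.168.4.3 target is the remote DC from the capture below, and the 1460-1480 window is just illustrative):
rem sweep DF-bit pings across a range of payload sizes; successful sizes print a Reply line
for /l %i in (1460,1,1480) do @(echo Payload %i bytes & ping -f -l %i -n 1 -w 2000 192.168.4.3 | find "bytes=")
The capture below, taken during the actual testing, shows the transition point: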
No. Time Source Destination Protocol Length ICMP Length Info
1 0.000000 10.1.0.51 192.168.4.3 ICMP 1507 1465 Echo (ping) request id=0x006b, seq=37568/49298, ttl=128 (reply in 2)
2 0.005320 192.168.4.3 10.1.0.51 ICMP 1507 1465 Echo (ping) reply id=0x006b, seq=37568/49298, ttl=126 (request in 1)
3 1.005759 10.1.0.51 192.168.4.3 ICMP 1507 1465 Echo (ping) request id=0x006b, seq=37569/49554, ttl=128 (reply in 4)
4 1.011053 192.168.4.3 10.1.0.51 ICMP 1507 1465 Echo (ping) reply id=0x006b, seq=37569/49554, ttl=126 (request in 3)
5 2.011301 10.1.0.51 192.168.4.3 ICMP 1507 1465 Echo (ping) request id=0x006b, seq=37570/49810, ttl=128 (reply in 6)
6 2.016662 192.168.4.3 10.1.0.51 ICMP 1507 1465 Echo (ping) reply id=0x006b, seq=37570/49810, ttl=126 (request in 5)
7 3.017966 10.1.0.51 192.168.4.3 ICMP 1507 1465 Echo (ping) request id=0x006b, seq=37571/50066, ttl=128 (reply in 8)
8 3.023353 192.168.4.3 10.1.0.51 ICMP 1507 1465 Echo (ping) reply id=0x006b, seq=37571/50066, ttl=126 (request in 7)
9 16.070931 10.1.0.51 192.168.4.3 ICMP 1510 1468 Echo (ping) request id=0x006b, seq=37576/51346, ttl=128 (reply in 10)
10 16.076166 192.168.4.3 10.1.0.51 ICMP 1510 1468 Echo (ping) reply id=0x006b, seq=37576/51346, ttl=126 (request in 9)
11 17.075444 10.1.0.51 192.168.4.3 ICMP 1510 1468 Echo (ping) request id=0x006b, seq=37577/51602, ttl=128 (reply in 12)
12 17.080656 192.168.4.3 10.1.0.51 ICMP 1510 1468 Echo (ping) reply id=0x006b, seq=37577/51602, ttl=126 (request in 11)
13 18.082161 10.1.0.51 192.168.4.3 ICMP 1510 1468 Echo (ping) request id=0x006b, seq=37578/51858, ttl=128 (reply in 14)
14 18.087498 192.168.4.3 10.1.0.51 ICMP 1510 1468 Echo (ping) reply id=0x006b, seq=37578/51858, ttl=126 (request in 13)
15 19.087735 10.1.0.51 192.168.4.3 ICMP 1510 1468 Echo (ping) request id=0x006b, seq=37579/52114, ttl=128 (reply in 16)
16 19.093044 192.168.4.3 10.1.0.51 ICMP 1510 1468 Echo (ping) reply id=0x006b, seq=37579/52114, ttl=126 (request in 15)
17 26.901887 10.1.0.51 192.168.4.3 ICMP 1511 1469 Echo (ping) request id=0x006b, seq=37580/52370, ttl=128 (no response found!)
18 31.753228 10.1.0.51 192.168.4.3 ICMP 1511 1469 Echo (ping) request id=0x006b, seq=37581/52626, ttl=128 (no response found!)
19 36.753276 10.1.0.51 192.168.4.3 ICMP 1511 1469 Echo (ping) request id=0x006b, seq=37582/52882, ttl=128 (no response found!)
20 41.753263 10.1.0.51 192.168.4.3 ICMP 1511 1469 Echo (ping) request id=0x006b, seq=37583/53138, ttl=128 (no response found!)
I found that with a payload of 1468 bytes or less, the ping succeeded. At 1469-1472 bytes, there was no response. And at sizes >= 1473, ping helpfully told me "Packet needs to be fragmented but DF set." That last part makes sense: 1473 plus 28 bytes of headers exceeds my own host’s 1500-byte MTU, so the packet gets rejected locally before it ever hits the wire. The 1469-1472 range, on the other hand, was being silently dropped somewhere along the path.
Okay! We’re making progress! …Maybe. Still working on the MTU theory, I took a look at all the network devices that these packets would transit:
- ServerA
- Hyper-V host server
- Remote site core switch … L2 Dedicated Link beyond my control…
- Head Office core switch
Both switches were set with an MTU of 1500 bytes, which is pretty standard. The Hyper-V host server used the Windows default of 1500, as did ServerA. So it looked like the MTUs hadn’t been altered in any meaningful way.
I ran a series of packet captures to see if anything looked unusual. Nothing jumped out at me, but I did notice that whenever the packet size exceeded 1500-ish bytes, there were lots of retransmissions and/or errors.
Wait! How is it that a server with an MTU of 1500 is sending packets that are 2600 bytes?
Wireshark uses WinPcap (or libpcap) to grab the data before it’s handed to the NIC:
Many OS and NIC drivers support TCP Segmentation Offload / Large Segment Offload / Generic Segment Offload which offloads the task of breaking up TCP data into MSS-appropriate pieces. This is handled by the NIC, and saves resource overhead, improving performance. However… with offloading enabled, this task is now completed after Wireshark has grabbed the data, so it’s not seen when capturing from the OS. Using a SPAN port (port mirroring) or TAP would not have this limitation.
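As an aside, if you want to confirm whether segmentation offload is the reason a host-side capture shows oversized frames, the per-adapter Large Send Offload state can be checked (and temporarily disabled while capturing) with the NetAdapter PowerShell cmdlets. A sketch, run from the same cmd prompt and using the Ethernet 2 adapter name that shows up later in this post - yours will likely differ:
powershell -Command "Get-NetAdapterLso -Name 'Ethernet 2'"
rem temporarily turn LSO off on that adapter while capturing, then turn it back on
powershell -Command "Disable-NetAdapterLso -Name 'Ethernet 2'"
powershell -Command "Enable-NetAdapterLso -Name 'Ethernet 2'"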
Back to my regularly scheduled commentary…
Running with the MTU theory, I decided to set the OS MTU to 1496 bytes. That’s 1468, the largest payload that worked with the Don’t Fragment flag set, plus 28 bytes of headers (a 20-byte IP header and an 8-byte ICMP header). Let’s double-check what the MTU is currently set to on my Windows Server 2019 DC at SiteA:
netsh int ipv4 show sub
(the full command is netsh interface ipv4 show subinterfaces, but I generally use the shorthand).
C:\Users>netsh int ipv4 show sub
MTU MediaSenseState Bytes In Bytes Out Interface
------ --------------- --------- --------- -------------
4294967295 1 0 52364 Loopback Pseudo-Interface 1
1500 1 550194231 647147510 Ethernet 2
Let’s update this:
netsh in ipv4 set sub "Ethernet 2" mtu=1496 store=persistent
(netsh interface ipv4 set subinterface "Ethernet 2" mtu=1496 store=persistent)
C:\Users>netsh int ipv4 show sub
MTU MediaSenseState Bytes In Bytes Out Interface
------ --------------- --------- --------- -------------
4294967295 1 0 52364 Loopback Pseudo-Interface 1
1496 1 55025789 647165843 Ethernet 2
On Windows Server 2019, this took effect immediately; if it doesn’t in your environment, a reboot should ensure it applies. AD replication - which had been failing - picked right back up as soon as I entered this command.
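For a quick sanity check after a change like this, the same DF-bit pings from earlier make a handy before/after test (sizes taken from my findings above, target is the remote DC from the captures):
rem should succeed: 1468-byte payload + 28 bytes of IP/ICMP headers = 1496
ping -f -l 1468 192.168.4.3
rem should still fail until the provider addresses the link (1469 + 28 = 1497 > 1496)
ping -f -l 1469 192.168.4.3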
Now that I’ve proved it’s an MTU issue, I’ve turned it over to our L2 direct link provider. Interestingly enough, the 4 bytes (1500 - 1496 = 4) is exactly the size that a QinQ implementation adds. The frame arrives already 802.1Q tagged, and QinQ adds its own outer EtherType and tag fields (another 4 bytes) to it. The resulting QinQ frame is often referred to as ‘double tagged’.
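If that’s the culprit, the arithmetic lines up. A rough sketch, assuming some device in the provider’s path counts the extra outer tag against a standard 1500-byte payload limit:
- 1500-byte IP packet + 4-byte QinQ outer tag = 1504 bytes, which exceeds the limit, so full-size frames get dropped
- 1496-byte IP packet + 4-byte QinQ outer tag = 1500 bytes, which just fits - and that’s why dropping the MTU to 1496 made everything work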
You can read more about QinQ in the Cisco article Inter-Switch Link and IEEE 802.1Q Frame Format.