A colleague and I have wasted a lot of time over the last few days trying to track down an issue: it seemed that a simple TCP stream could have corrupted data in it.
Basically, we would do a test of a web page served from a device (the FireBrick as it happens, with test s/w on it), and the page would be all nulls and spaces, or should be, but random corruption would appear.
To be 100% frank we have not tracked it down yet, but we initially assumed it had to be an issue in the new FireBrick code. We are just launching the FB2900, so any issue in testing is somewhat serious.
Now, if you know anything about networking, TCP as a stream will only work if you get proper packets at both the TCP level and lower levels like Ethernet, with good checksums/CRCs. Whilst the Ethernet CRC is generated afresh on each link, the TCP checksum is end to end (no NAT involved here).
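For anyone who has not looked inside it, the TCP checksum is nothing clever: a 16-bit ones' complement sum (RFC 1071) over a pseudo-header, the TCP header and the payload. A rough sketch in Python, for IPv4 only. This is my illustration, not FireBrick code, and the function names are made up:

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: ones' complement of the ones' complement sum
    of the data taken as 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """Check a received TCP segment (header + payload, checksum field left
    exactly as received) carried over IPv4 between src_ip and dst_ip."""
    # IPv4 pseudo-header: source, destination, zero, protocol 6, TCP length.
    pseudo = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, 6, len(segment))
    )
    # Summing everything, including the transmitted checksum, gives 0xFFFF
    # for an intact segment, so the complemented result is 0.
    return internet_checksum(pseudo + segment) == 0
```

The point is that the receiving end is meant to do this check on every segment and drop any that fail, so corruption anywhere in the middle should never reach the application.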
So, you have a visible, albeit slightly intermittent problem - basically view a specific web page from a specific device and see gibberish in the page. You naturally assume it has to be within the TCP stack / processing at each end, as that is where data could be corrupted without breaking the TCP checksum logic. There are two ends, one is my nice shiny new iMac Pro and the other is a FireBrick somewhere on the end of a DSL line. It has to be one of those two ends with the issue, and well, obviously not the iMac, so look at the newest code, and latest hardware, at the FireBrick end. Corruption anywhere in between would cause a TCP checksum error, and so we would not see it in the final TCP stream.
A lot of time wasted, and a clue pops up - cannot make it happen from the iMac (not the Pro) at the office. Hmmm. Sadly it is a tad intermittent, so not conclusive. I could not make it happen locally either, but assumed the slower uplink and more buffering may be a factor.
Then it gets really really weird, and I can imagine some network engineers would pull their hair out at this point. On my home iMac Pro I did lots of the same test. I tested on wired Ethernet (running only at 1Gb/s), and on WiFi. Every time on wired it got corruption. Every time on WiFi it did not. A dozen tests. Pretty conclusive. WTF?
Seriously, how can that be? Clearly the iMac Pro is not randomly corrupting TCP streams, as I would notice in web pages, images, etc. Even at the levels this seems to be, something like 1 in 1000 packets, you'd notice. So why only the stream from the FireBrick? It cannot just be an iMac bug...
So I used wireshark to pcap on the iMac Pro, and yes, it saw the corrupted content. Odd. Then it occurred to me - tell wireshark to check checksums. Bingo, the TCP checksum is bad. Yet the iMac Pro has accepted the packet and fed the corrupted data in to the TCP stream!
This changes the game massively. It means this is raw data corruption somewhere on the route from the FireBrick to me. It could be ANYTHING now: we are no longer in the TCP realm, as the TCP checksum is wrong. We were (probably) looking in totally the wrong place for this issue. Thanks Apple!
It seems that the iMac Pro wired Ethernet is not checking TCP checksums. The WiFi is fine. The likely cause is that they use h/w TCP checksum logic on the MAC (not Mac, I mean Media Access Control, the Ethernet hardware) and there is either a h/w or s/w bug meaning the checksum check is not actually being done! It is an easy bug to have, and one you will not generally notice, as almost all packets have no TCP level corruption!
Reported to Apple but also said that I am not spending ages doing logs and crap for them unless I am paid my hourly rate.
When debugging you really do not expect two unrelated bugs to be in play at once. This is hard work.
Sadly, at this point, we have confirmed the corruption is definitely upstream of my Mac (dumps on a router in between), but the bug has vanished for now - Heisenbugs we call them. If/when we manage to reproduce it, we can then trace at each step along the way to find whether the issue is in our network, BT, the modem, switches, and so on, or, as seems increasingly unlikely, the FireBrick at the far end. As this happened on both the old FB2700 and the new FB2900 on this DSL line, but not anywhere else, the chance of this being in the FireBrick is getting vanishingly small.
Chasing bugs can be hard!
P.S. Looks like it was the BT supplied VDSL modem all along. Wow.
"When debugging you really do not expect two unrelated bugs to be in play at once."
How long have you been in this game? :)
All my life, and yes, you expect the unexpected. I know.
IME, there's never just two unrelated bugs in play at once. There's N unrelated bugs in play, where N > 0.
I remember seeing those kinds of characters before on the FireBrick: it was the IP address which showed after "Access now" on the login page.
This might be entirely unrelated and utterly unhelpful, but, just in case it helps...
If by "those characters" you mean the question mark on a black diamond, those are U+FFFD, the "replacement character", and are what a Unicode decoder produces when handling nonsense input, such as bytes labelled UTF-8 that aren't valid UTF-8, or a UTF-16 surrogate that occurs without its partner in a UTF-16 stream.
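A quick way to see this for yourself, as a small Python sketch (the byte string is made up for illustration, it is not from the FireBrick page):

```python
# Bytes that are not valid UTF-8: 0xFF and 0xFE never appear in UTF-8,
# and 0x80 is a continuation byte with nothing before it to continue.
raw = b"\xff\xfeabc\x80"

# errors="replace" asks the decoder to substitute U+FFFD for bad input
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="replace")

print(text)
print([f"U+{ord(ch):04X}" for ch in text])
```

A browser does much the same thing when a page claims to be UTF-8 but contains corrupted bytes.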
Ah, yes, I recall that - fixed ages ago Neil, and yes, they are, as tialaramex says, the sort of thing you see for gibberish in UTF-8. Very different bug.
Fair enough! Thanks for the explanation.
Had a similar (but probably different) problem with an SFP module that didn't like long runs of zero bytes, so it caused data dependent errors!
solve with sudo sysctl -w net.inet.tcp.tso=0
see page 2 on https://discussions.apple.com/thread/8253042?start=15&tstart=0
yours apple imac user
Indeed, I don't use Macs but as soon as I read this, I thought of how you sometimes have to disable TCP offloading when doing virtualisation on Linux.