The snag is that they keep falling off the internet! A power cycle fixes it, but it is very frustrating.
I have found the solution though, and I think it points a finger at the cause.
And it is all down to DHCP. Yep, not DNS this time. Not IPv6 even. DHCP!
So what's the problem?
First off, what's the kit?
- FireBrick doing DHCP and Internet gateway
- Aruba APs
- Apple HomePods
The failure did not seem to happen all the time, but it could. Sandra has almost given up using them as they never work. It seems a HomePod can usually renew its DHCP lease without problems, but sometimes it gets stuck. The logs on the FireBrick showed we kept sending a DHCP "Offer" to the HomePod, but it kept asking.
I added lots of debug, and confirmed that the request being sent, the DHCP "Discover", does not request a broadcast reply, which is fine, so we send the reply to the MAC of the HomePod and its "new" IP address. This is normal.
On a whim, I decided to try fudging the code to treat the Discover as if it had asked for a broadcast reply. This then meant a Discover, Offer, Request, and Ack - but the HomePod did not see the Ack and so kept asking. I then forced the broadcast on the Ack as well, and bingo, it worked. So the issue is whether broadcast is used for the Offer and Ack: the unicast replies are not getting through, but the broadcast ones are.
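As a rough sketch of the choice being made here (this is not the FireBrick code; `ClientMessage`, `reply_destination` and `force_broadcast` are names made up for illustration), the server picks a destination for each Offer and Ack roughly like this. It assumes the broadcast flag is the 0x8000 bit of the flags field, and it ignores relays and renewals from an already configured address.

```python
from dataclasses import dataclass

BROADCAST_FLAG = 0x8000  # assumed: top bit of the 16 bit DHCP flags field
LIMITED_BROADCAST_IP = "255.255.255.255"
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


@dataclass
class ClientMessage:
    """Hypothetical, cut-down view of a DHCP Discover or Request."""
    client_mac: str
    flags: int


def reply_destination(msg, offered_ip, force_broadcast=False):
    """Return (destination MAC, destination IP) for an Offer or Ack.

    Relays (giaddr) and renewals from a configured address (ciaddr)
    are left out; a real server has to handle those as well.
    """
    if force_broadcast or (msg.flags & BROADCAST_FLAG):
        # Client asked for broadcast, or we are fudging it.
        return BROADCAST_MAC, LIMITED_BROADCAST_IP
    # Default: unicast to the client's MAC and its "new" IP address.
    return msg.client_mac, offered_ip
```

With `force_broadcast=False` a Discover with flags 0x0000 gets a unicast reply, which is the case that failed; with `force_broadcast=True` the same message gets a broadcast reply.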
This is a massive clue.
So more investigating.
The RFC says the broadcast request is in the left most bit of a 16 bit flag field.
PLEASE DO NOT DO SPECIFICATIONS LIKE THIS!
I fully understand that bits in a byte may be sent "on the wire" low bit first or high bit first. I fully understand that bytes in a word may be ordered big endian or little endian. The above diagram is for a 16 bit "network byte order" value (i.e. big endian).
They number the bits from 0 to 15. Actually they number the gaps between the bits 0 to 15.
In my view there is only one way you should number bits - by their binary power of two value. I would always write that in the way we write numbers, most significant first, so would write that as bits 15 to 0, and it is bit 15 that is the B flag. I don't mind if showing as bits 15 to 8, and 7 to 0 (big endian) or even as 7 to 0, 15 to 8 (little endian), but number each bit by its power of two value, please!
Some people number as order on the wire, starting from 1. So 1 to 8 may be 0 to 7 or 7 to 0, who knows! Please do not do that. But at least if numbering bits 1 to 8, you have some clue that something is wrong.
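To make the two conventions concrete, here is a small sketch (the function names are invented for this example) that turns a bit number into a mask under each scheme. It assumes "bit 0" in the RFC style means the left most, most significant bit, which is the usual convention in IETF diagrams.

```python
def msb0_to_mask(bit, width=16):
    """RFC diagram style: bit 0 is the left most (most significant) bit."""
    if not 0 <= bit < width:
        raise ValueError("bit number out of range")
    return 1 << (width - 1 - bit)


def lsb0_to_mask(bit, width=16):
    """Power of two style: bit n has the value 2**n."""
    if not 0 <= bit < width:
        raise ValueError("bit number out of range")
    return 1 << bit


# The same flag under both numberings of a 16 bit field.
print(hex(msb0_to_mask(0)))   # 0x8000
print(hex(lsb0_to_mask(15)))  # 0x8000
```

So "bit 0" in one scheme and "bit 15" in the other are the same bit, which is exactly why a bare bit number without a stated convention is ambiguous.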
So, to be quite frank, I actually do not know if this is bit 0 or 15 in a network byte order (big endian) 2 byte (16 bit) value. We assumed it is bit 15, i.e. bit 7 in the first byte. But seriously, from bits numbered 0 to 15 and a reference to "left most bit" I don't actually know for sure. I started to doubt we had read the RFC correctly!
Thankfully empirical testing shows the flags as 0x8000 from other devices, so either it is bit 7 of first byte, or other devices have the same fun reading the RFC.
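For anyone wanting to repeat that check on a capture, a minimal sketch of pulling the flags out of a raw DHCP message might look like this. It assumes `packet` is the UDP payload (starting at the `op` byte), that the flags field sits at offset 10 after op, htype, hlen, hops, xid and secs, and that the broadcast flag is 0x8000 as observed above. The function name is made up.

```python
import struct

FLAGS_OFFSET = 10        # op, htype, hlen, hops (4) + xid (4) + secs (2)
BROADCAST_FLAG = 0x8000  # bit 7 of the first flags byte


def wants_broadcast(packet: bytes) -> bool:
    """True if the DHCP message asks for a broadcast reply."""
    if len(packet) < FLAGS_OFFSET + 2:
        raise ValueError("too short to be a BOOTP/DHCP message")
    # "!H" is an unsigned 16 bit value in network byte order (big endian).
    (flags,) = struct.unpack_from("!H", packet, FLAGS_OFFSET)
    return bool(flags & BROADCAST_FLAG)
```

Reading the field as big endian means the first byte on the wire holds bits 15 to 8, so a first flags byte of 0x80 gives 0x8000.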
So who is at fault here?
Well, my son has the same FireBrick and the same HomePods, but different APs. That all works. That is another clue.
My Aruba APs are set up to inject data in the DHCP, which is good. I get details of the AP and SSID, and can even tell the FireBrick to allocate based on SSID even if different SSIDs on the same physical network. All good.
It may be that it is stripping the broadcast bit, but that does not explain why it works after a power cycle. Interestingly the working DHCP renewals did not have the injected AP details, it seems. This points further to the AP being "special".
My son does have different network switches as well, so it is just remotely possible that it is a switch level issue, but that seems unlikely - the DHCP discovers are from the right MAC so all switch learning should be fine.
P.S. Yes, I had changed the filtering to disabled already.
The work around...
FireBricks now have an option to force broadcast reply. And it works. Alpha out soon.
Only last week we had a Netgear PoE switch somehow blocking DHCP traffic and preventing a laptop from obtaining an IP address. APs were Unifi. Router was Virgin Media's freebie. Only affected one laptop. All other devices in the premises were fine.
Unifi APs are notorious for having DHCP issues, you'll find numerous posts on their support forums and I myself have and do suffer with it on random occasions.
It's been a while since I fired up Wireshark, as normally it's less painful to just power cycle the AP, but from memory the Discover and Offer were fine, and possibly the Request, but it was the Ack that didn't happen. It only ever seems to happen on one AP and only on the 5 GHz band, which makes little sense from an interference point of view.
I wonder if this is a wider issue with the 802.11 standards
Interesting.
We have about 200 Unifi APs over perhaps 30 sites spread around the south-east of England. A mix of commercial and residential sites. First ones installed in 2013 and still in place. Have only ever experienced this one DHCP problem.
I’m not saying that you are wrong, nor that I am right.
We have seen some weirdness when using Unifi APs with Nest CCTV cameras though.
I guess that everyone's mileage will vary, and no doubt it depends on your accompanying hardware, but there are more than 700 threads featuring the words DHCP issue in the Unifi forums, and the last two v4 firmware releases both have the following bug fix in their release notes:
"Fix intermittent broadcast and multicast packet drop on gen2 APs, introduced in 4.3.24. This impacted users with non-UniFi DHCP servers which use broadcast for DHCP, along with IoT devices that rely on multicast for discovery"