Ubiquiti have been very helpful trying to get to the cause of a long-standing issue impacting a small number of people, including myself. It is a very frustrating issue which has led me to consider scrapping the Unifi APs on more than one occasion, but I do like the Unifi kit and I would like to get this actually resolved and continue selling it.
What do we think we know?
- This only seems to impact Apple - it is seen on iPhones mostly - not Android.
- This only seems to impact Unifi APs - not seen using other APs yet.
- This almost always seems to involve a FireBrick as the gateway router (there is at least one case without a FireBrick).
- This is a rare situation, with many people using hundreds of Unifi APs with no problem. Similarly lots of people using Apple with no problem. Similarly lots of people using FireBricks with no problem.
- It seems sticky - when a set up has the issue, it stays. When a set up does not have the issue, that stays OK. It is also very intermittent and can seem to take days to be sure if fixed or not.
- This seems to be only where IPv6 is on the network, which is one reason most people don't see it, and may also be a reason why cases where an IPv6 friendly router sold by an IPv6 friendly ISP is the most common case we have seen (i.e. why FireBricks in almost all cases).
As I say, Ubiquiti have been very helpful - they sent us two switches, an edge router and a security gateway. I was only expecting a switch from what was said, so thank you. It has allowed more testing. We sent an FB2700, which has also allowed more testing. The results are interesting, to say the least.
- Brandon has advised that using the FB2700 they see the problem right away. This is good, we have created a set up with the problem. He confirms that using other gateways he does not see it. So something about the network when using a FireBrick seems to be able to trigger this somehow. Oddly he has also seen up to 60 seconds "delay getting an IP" which is not one we have seen. The problem we have seen is permanent - you lose all IPv4 and IPv6 on a roam (intermittently) and do not get an IP even after 60 seconds, all you see is the 169.254 address for when you don't get a reply. I assume that is not what Brandon was seeing, but actually a "delay", which is rather odd. If it is, then that explains the phantom delay and means he has exactly reproduced the problem.
- Here, we tried moving all APs on to a Unifi switch connected to our main LAN (and using an FB6000 as gateway). It did not help. That eliminates the switches I have, which could have been messing with multicast or something.
- So I set up a separate subnet for the APs, connected to a Unifi switch, and that then connected via their EdgeRouter. Sadly I needed help setting up IPv6, but got there, in spite of some of my typos. It seemed to fix things - great.
- So I changed to using an FB2700 on the same separate subnet and same Unifi switch, just swapping one box, and again it is working. I have made the set up as close to the main LAN as I can, same VLANs etc, and the APs are the same config exactly - not changed.
This means the separate subnet appears to be the fix rather than change of router.
It also means a really simple set up of FB2700, switch, and three APs here just worked, but Brandon, with presumably a similarly simple set up, immediately failed. That would be nice to try and compare.
The roaming also seems to happen, apparently as expected, with no interaction with the gateway. No DHCP or anything, just switches over from one AP to another. So it is hard to see how any gateway can be the cause of the problem.
At this point I am wondering if somehow it is a specific configuration of a network that breaks it - I hesitate to suggest the actual IPs in use somehow. I also wonder if it is something else on the LAN causing this - but that does not fit with Brandon's comments.
Unfortunately we have reached an impasse with Ubiquiti - they have been very helpful up until now, and thanks for that. But even though this only happens with their APs, and only happens with Apple products, they have now concluded it must be FireBrick and "So at this point I don't think it's fair for you to ask us to help you resolve this. In doing so your are asking us to help your company make a competing product, for free." and now "So I'm out. Refuse to interact under such disrespectful terms."
We'll continue to look for the issue. I suspect, when we find it, it will not be something where any finger of blame can be pointed at a single bit of kit. But nice to know the spirit of co-operation is alive and well, up to a point. Thanks for your help so far.
FYI, I don't care that Ubiquiti have a "competing product". As an ISP we work with competition all of the time for the greater good. I'd be happy to continue to work together to get to the bottom of this anyway - all of our customers would benefit from that. I will, of course, share our findings, even if we find a bug in something FireBrick is doing.
P.S. My next avenue of investigation is differences in configuration, no matter how small, to try and see if we can find a network set-up difference. It is very likely that a typical (mostly default) FireBrick network will have some notable differences to a typical (mostly default) non FireBrick set-up...
P.P.S. You gotta love it - Brandon has complained to FireBrick about one of their employees (me) swearing at him. This is from the country that actually believes in free speech.
From years of working in tech support, I recognise Ubiquiti's current position. It's a very common, but rather foolish one - "We don't know what the problem is, so therefore it's the other guy's fault."
If you draw up a 2x2 matrix of the two attitudes ("We will investigate" and "Other guy's fault") and the two possible explanations ("Really our fault" and "Really their fault") then the company which takes Ubiquiti's position loses every time.
It's a mistake which has been made countless times before.
The language of competition is interesting in this context.
The FireBrick may be a competitor to one part of Ubiquiti's stack - its range of routers - but, at the same time, FireBrick users may well buy other Ubiquiti products, such as their WAPs, and hope for seamless interoperability. The more routers which work well with their WAPs, the better - but the more competition they have for their routers.
I might have used "Thanks for the opportunity of looking at this. It appears that this is something more to do with the FireBrick than our WAPs and, while we'd love this to be fixed, I'm afraid that we can't justify the cost of the engineering/support time spent on this, rather than on other issues, for a prolonged investigation." :)
To be fair they currently don't have the engineering resources to deal with all the internal work they need to do. I know that's not an excuse but given the firmware/controller is changing (currently) twice a month on average I can understand it.
Someone (can't remember who) was recently moaning about finding a router which "worked" with "AAISPs IPv6 setup". I wasn't aware there was an issue but given you seem to have pretty much nailed it down to FB/IPv6 I wonder whether you'd considered doing a quick test setup for yourself using a dynamic/sticky /56 PD setup (like the mainstream ISPs)?
I can't see why you using static PI space would matter but it's maybe another thing to try?
Did they not send you a USG as well?
Yes, but a USG will be no easier or more helpful than an Edge Switch. I am not on DSL here.
True enough.
You can still set yourself up a test /56 DHCPv6 setup with SLAAC on the LAN though?
I only ask as that's the default Sky setup which works with their router or a USG. I assume you're on a /48 PI block?
Shouldn't make a blind bit of difference I know but it's worth a try as it'll take bugger all time to do.
https://community.ubnt.com/t5/EdgeMAX/ipv6-setup-for-Sky-Fibre-UK/m-p/1696161/highlight/true#M131098
Second post down is the config I used on a USG for Sky. Seemed to work fine over a month or so at my daughter's place but they kept putting stuff on top of the vents so I took it back :)
Do Apple have anything to say about this other than "bugger off & stop bothering us with trivia?" ;)
Hi Brandon - thanks for your work on this so far. As another customer of both A&A and Ubiquiti (but *not* Firebrick), I've been seeing a very similar issue: my iPhone periodically loses association with my Unifi AP. When I have a backup AP nearby (old Netgear I haven't got round to unplugging), it falls back to that instead; when I tried eliminating that from the picture, I just started getting 169.254 IPs instead. Manually disconnecting and reconnecting seems to get a working IP address again though.
I haven't tried opening a support case about it myself yet, or tcpdump, but so far it looks as if something is eating the DHCP traffic in some circumstances.
Would I be right in thinking RevK/FB uses a relatively short DHCP lease time? I saw the problem more often with a one hour lease, less with 24 hours - I've shortened mine to 20 minutes now to see if it recurs soon.
I take it you don't have any known "Unifi sometimes eats DHCP packets if ..." issues at present?
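For anyone comparing lease lengths like this, it may help to see when a client would normally try to renew. The following is only a rough sketch: it assumes the client uses the RFC 2131 default timers (T1 at 50% of the lease, T2 at 87.5%), which a server can override and which may not match what iOS actually does. The function name is made up for illustration.

```python
from datetime import timedelta

def renewal_times(lease: timedelta) -> tuple[timedelta, timedelta]:
    """Return (T1, T2) assuming the RFC 2131 default fractions of the lease."""
    t1 = lease * 0.5    # client starts unicast renewal (RENEWING state)
    t2 = lease * 0.875  # client falls back to broadcast (REBINDING state)
    return t1, t2

# The three lease lengths mentioned in the comment above.
for lease in (timedelta(minutes=20), timedelta(hours=1), timedelta(hours=24)):
    t1, t2 = renewal_times(lease)
    print(f"lease {lease}: renew after {t1}, rebind after {t2}")
```

A shorter lease means more renewals per day, so if something really is dropping DHCP packets now and then, a shorter lease should show the fault sooner.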
Seems to me they have the same attitude to making it work as to making it secure: https://www.theregister.co.uk/2017/03/16/ubiquiti_networking_php_hole/
You've far more patience with this than me - I'd have flashed OpenWrt months ago.
It's not the same kit and was actually patched a month or so before El Reg wrote the article. That was the PTP infrastructure stuff - which wasn't vulnerable with a default config IIRC - not the WAP/switching stuff.
The Unifi stuff is where all the dev time is (currently) going now and it shows. Previous comments about get a fucking move on with IPv6-PD and DHCP options in the GUI still stand but things (currently) progress apace.
OpenWRT isn't quite the same matey ;)
I didn't say it was the same kit, I said it was the same attitude. Using PHP 2.x at all in 2017 is so bone-headed I can't really trust anything else from the same company.
Nor did I claim OpenWRT is the same, but for a home network I'd suggest it's a good enough solution when your users prefer their 3/4G connection to your WiFi.
So it's currently working with a Firebrick yes?
If so then what happens if you restart the Unifi Controller? Does it still work?
The fact it "worked" when you changed subnet makes me wonder if it's something to do with reprovisioning.
You have a Unifi switch in there so changes on subnets should have resulted in reprovisioning Unifi WAPs/Switch via the controller.
Probably dumb and you've done it....
To be honest, my best guess at present is some aspect of the way the network is set up - no idea what. I suspect that FireBricks have a default setting in some aspect of the way they do networking which happens to be the trigger. It could be something daft but I am trying to find it.
I once had an odd case where the FB wouldn't hand out DHCP leases to a device because it thought the device already had one.. and instead of replying with the same/new address just went silent.
That was a config quirk though - I'd just set up a reservation on a device and it had cached the dynamically assigned address previously.. fixed by clearing the cache when I make such changes.
I wonder if a similar oddity is happening with the apple devices.
Indeed, common factors are ONLY fails with Apple, ONLY fails with Unifi APs, and almost only fails with FireBrick as there has been a report of this with another gateway. Indeed, that leaves Apple and Unifi as the only common factors. However, it is clear that FireBricks manage to trigger the problem more easily, by the look of it, which gives me something I can use as a basis to try and find the actual cause.
Brandon,
Your current position seems to be that the issue manifests itself only when three separate factors are all in place:
1) Unifi AP
2) Apple client
3) Firebrick router
Therefore you conclude that the fault must lie with the FB router. You try to back this up by pointing out loads of combinations with a different no. 3. The trouble is that the same argument can be applied to changing item 1 and item 2, and therefore it is flawed reasoning.
You need to realise that RevK is potentially a very useful resource for you. He is of way above average diagnostic ability, and has the true geek's approach to problems like this - just keep drilling away until you find the explanation, regardless of any attempt to direct blame.
By working with him, you gain either way - any true explanation of what is causing the problem will be an asset for both your organisation and his. A properly diagnosed problem is much better for everyone - if a fix is needed to anyone's kit it can be done, and if none is needed then one has a clear diagnosis of the problem for future reference.
Yes, we all get snappy at times, but that too can be an asset.
For what it's worth I experience the same problem, albeit intermittently, only affecting Apple (iOS) devices on Ubiquiti kit with a Mikrotik RB2011UaS router.
I also know someone else with a Viprinet router who has experienced similar issues.
It is *definitely* not only affecting Firebrick.
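The "which factor is really common" argument can be made mechanical. This is a minimal sketch using only the two fully described failing set-ups reported in this thread (the Viprinet case is left out because its AP and client are not stated); the function name and data layout are invented for illustration.

```python
# Failing set-ups as reported in the post and comments above.
reports = [
    {"client": "Apple iOS", "ap": "Unifi", "gateway": "FireBrick"},
    {"client": "Apple iOS", "ap": "Unifi", "gateway": "Mikrotik RB2011UaS"},
]

def common_factors(failing):
    """Return the role/value pairs that appear in every failing report."""
    shared = set(failing[0].items())
    for report in failing[1:]:
        shared &= set(report.items())
    return dict(shared)

print(common_factors(reports))
```

With these two reports the gateway drops out of the intersection, leaving only the client and the AP. Two data points prove very little, of course, but it shows why a single non-FireBrick report matters to the argument.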
Brandon, just a nickel's worth of free advice. I'd suggest completely ignoring the cursing thing when you deal with any future RevKs....
There are good and bad points in how you've both reacted. What is clear from both of you is that you are both passionate creators, and that your respective devices are both your babies. As such you're both going to be defensive about it and you're both going to lose perspective.
His cursing seems to be in response to what he seemed to see as your evasiveness. On one level he doesn't want to believe it's the firebrick any more than you want to believe it's your device.
Yes, I appreciate that the problem doesn't manifest when other devices are connected to yours... but remember RevK can say exactly the same in return. Your refusal to even consider there might be some quirk in your device that only manifests under these specific circumstances isn't really that different from his cursing.
Hopefully you'll both figure it out between you, but I'd suggest avoiding finger pointing until the problem is solved, only then can either of you know whether the "cause" is Apple, you, or the firebrick, or perhaps even none of the above or all of the above.
Being passionate does not excuse the unprofessional behaviour or the ungratefulness after being given not-so-cheap equipment to test with and generally being a pain in the ass to someone who has no obligation to help considering it is as he says a competing product, especially as it was on his own time.
I'm not the same Anonymous but IME USA support is a lot more sensitive to language than they used to be. For example I can't think of any UK network provider (NB not mainstream ISP) who'd even blink if I said "FFS get a clue!".
The same thing is happening in Oz - BOFHs I know who swear like troopers seem to become "special snowflakes" at work now. They never used to be :) Probably no bad thing really.
Anyway hopefully they both sort it out - it's not in Ubiquiti's interest to fall out with one of the more influential UK ISPs, even if it's just in terms of reduced sales through that ISP.
I don't think he's ungrateful for the help. Quite the opposite. I appreciate you see his behaviour as unprofessional, but stonewalling can equally be seen as unprofessional... calling someone a pain in the ass ain't professional either.
@revk how about putting up a public pcap of a session where roaming breaks (with WPA decryption if you could?). Do it in RF monitoring mode and induce a fault and let us all have a gander.
I'm not convinced by the argument that "other routers don't have this problem", mainly because I've seen similar roaming issues with Apple devices on other non-Ubiquiti APs (aruba+infoblox I think, they're not under my control). If I had to guess I'd say it'll be down to either some kind of race condition or two equally valid interpretations of some spec point that makes Apple choke. What I would love to see is someone narrow it down to a specific service: DHCP, RADIUS, 802.11k, etc.
I don't know how to get a pcap on the RF side, that would be good. I am still testing trying to get from a working case to a non working case one step at a time, and taking days for each test. When we are back on non working, I'll do a pcap from gateway.
Yes, Apple is clearly a common factor here.
We tried with static config on IPv4 on iPhone and still broke, so not DHCP.
We tried with no IPv6 on LAN, and that seemed to fix, assuming I waited long enough with that test.
For the pcap, can you not run Wireshark or tcpdump on a laptop connected via WiFi to an AP, then roam to that AP with your phone? I'm sure that's how I did it some years ago.
The Unifi APs are interesting, and I recently purchased one myself based on your blog. I only have one AP (and a FreeBSD gateway) so can't do any testing myself.
Just thinking out aloud here, but as you say it's only Apple devices, could it be related to DHCPv6? Apple devices will do DHCPv6 or SLAAC, but Android will only do SLAAC. I don't think it's related, but it is a difference that could explain why only Apple is affected.
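Whether a client is even invited to use DHCPv6 is signalled by two flag bits in the router advertisement, so this is something that can be checked directly in a capture. Below is a minimal sketch that decodes those flags from the raw ICMPv6 bytes of an RA, following the RFC 4861 layout (type 134, flags in the sixth byte). The function name is made up, and it says nothing about what a FireBrick or any other router actually sends.

```python
def ra_address_config(icmpv6: bytes) -> dict:
    """Decode the M and O flags from a raw ICMPv6 Router Advertisement."""
    # Fixed RA header is 16 bytes: type, code, checksum, hop limit, flags, ...
    if len(icmpv6) < 16 or icmpv6[0] != 134:  # 134 = Router Advertisement
        raise ValueError("not a Router Advertisement")
    flags = icmpv6[5]
    return {
        "managed": bool(flags & 0x80),  # M flag: addresses available via DHCPv6
        "other": bool(flags & 0x40),    # O flag: other config (e.g. DNS) via DHCPv6
    }

# Hand-built example RA with both flags clear, i.e. SLAAC only.
example = bytes([134, 0, 0, 0, 64, 0x00, 0x07, 0x08]) + bytes(8)
print(ra_address_config(example))
```

If both flags are clear the network is SLAAC only, in which case a DHCPv6-capable client and a SLAAC-only client should behave the same way on it.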
Would running tcpdump on the AP itself do the trick?
Should work fine, dump the output into wireshark & away you go.
It's in the link Brandon gave before:
https://help.ubnt.com/hc/en-us/articles/221029967-UniFi-Debugging-Intermittent-Connectivity-Issues-on-your-UAP
Also the DHCPv6 vs SLAAC difference Chris mentioned sounds promising.
As said before Sky allocate a /56 PD via DHCPv6 to the CPE then just SLAAC on the LAN as per page 7 of https://indico.uknof.org.uk/event/35/material/slides/1?contribId=5
Any possibility the FB is doing something unexpected regarding a choice between DHCPv6 and SLAAC on Apple devices?
I hope you get this sorted, not entirely sure Brandon understands he's dealing with a fairly influential UK ISP ;)
My wifi capture knowledge is a bit rusty.
MacBook Pros under OS X support monitor mode just fine with wireshark/tcpdump
Any non-bastardized chipset under Linux should work. Wifi Pineapples are great for this.
Could do capture on the access point itself depending if it has tcpdump or not
Would WPA2 security interfere with the capture of Ethernet frames? Probably.
OK so some guesses here... Firstly, if, like every internet connection in the UK and pretty much every country in which we have sold FireBricks, the connection is PPPoE, then the answers are pretty simple.
AAISP do native IPv6 on the internet connections over PPP which is presented either directly from FTTP NTE, or via a DSL modem as PPPoE.
The PPPoE interface will do IPV6CP and once that completes it will do a DHCPv6 client request asking for PD which is then assigned to the other interfaces. This is default, so odd asking how to turn it on. The PD can be constrained by setting the pd-interface to just be the interface(s) on which you want PD. If IPV6CP is rejected then no IPv6 is used on that interface. It is possible to set log-debug to track the PPP negotiations to confirm IPV6CP is working.
If using an Ethernet subnet/link as WAN, the FireBrick does not currently do a DHCPv6 client - it only does router solicitation / router announcements, so does not do PD. This is because we have not seen anyone that has asked for it as it is not how any internet connections are done anywhere we have sold FireBricks. We can add that if needed, but it may take a little while. It would be interesting if that is the case in the US. Even in China we see PPPoE as the norm.
As for shaping, it depends what you want. A simple shaping of all traffic to 10Mb/s each way means making a shaper with tx and rx set to 10M (i.e. 10000000) and a name like "WAN". This can then be used to shape the traffic. In the case of an ethernet WAN interface, set graph="WAN" in the interface definition. In the case of PPPoE you would need to set graph="WAN" on the PPPoE definition. This also makes a graph showing levels of usage. In PPPoE it will also show loss and latency based on LCP echoes. On an ethernet interface a ping="..." can be set to add loss and latency to the same graph based on ping responses.
There are options for much more fine tuned shaping using the firewalling rules.
Not sure what port reflection is. The firewalling rules allow any sessions to be matched and changes to source/target IP and port. You can map between IPv4 and IPv6 as well. If you want some sort of incoming port mapping to an internal IP on NAT you may want to make a rule-set with source interface of the WAN (e.g. PPPoE or the WAN interface name depending on what is being used), a target interface of "self", assuming for a moment that NAT is being used and the external address of the FireBrick is the only one you have to port map. You can set the no-match-action to continue to other rule sets, and create specific rules matching a target port, and setting a new target-ip and target-port to map to devices on the LAN. You may want to set these using specific protocol, e.g. 6 for TCP.
As for VoIP, the basic premise is that VoIP and NAT do not mix. However, the FireBrick does standard NAT at the IP/port level and a good SIP gateway can recognise that and work with it - we have VoIP servers that can. The UDP timeouts are set as per RFC recommendations to long enough to allow such a VoIP gateway to manage keep-alives. However, we know some gateways and phones do not work well with NAT. However, the FireBrick itself can work as a VoIP client and server and can be set to work as a full PABX or just simply using a back to back config, thus bypassing all NAT - talking private IPs to devices on the LAN and via its public IP to a server via the WAN. This is in the VoIP config. This allows mix of IPv4 and IPv6 operations as well for VoIP.
If the WAN works in some other way (and we have people using L2TP) that will need some different advice.
Some posts may take a while to get approved as I have other work to do today as well!
Brandon, I really hope that you and RevK manage to work together on this and get it fixed.
We love both of your products, but this current issue is becoming increasingly embarrassing for us.
Although it is "his" router, a reasonable number of companies (including us) use them in our network. Our core routers are Ubnt if you're interested as RevK's ones with enough umph to do full BGP are out of our price range and yours do a cracking job.
Please don't take it to heart when RevK gets upset. Firebrick is his baby and he puts his life and soul into this and AAISP. I am sure that he, like all of us, is just extremely frustrated that none of us can work out what the root cause of this is.
@jbsolutios - I'm interested in real world experience of the EdgeMax (Std or Pro) with full BGP table (multiple providers), as looking to replace some old Cisco kit, but the FB 6302 is unfortunately over budget - if you wouldn't mind contacting me off here (a040417 at ramsay dot im) to answer some questions of how it performs under load would be really useful. Thanks
AAISP Customer here with UAP and UAP-Pro. Also support two other sites with UAP-Pros. However, no Firebrick router (yet).
I am glad you think you have found it but I am (again) a little confused. I just looked through email and could not find anything from you guys about this "DHCP throttling" of which you speak. Perhaps you can elaborate. What DHCP throttling thing are you talking about?
Anyway, as I have said before, we tested this with static config on the iPhone, DHCP not in use, and it still failed.
I'll go back to a set-up that does not work shortly, but now I am testing different ways to try and break the set-up here at the moment using a FireBrick. It is interesting that you immediately had problems with an FB2700 and right now I can't make it break with one! When I have exhausted that I'll put the APs back on my main LAN and confirm still broken. Then I can re-do the various tests from before, including a case where the iPhone is set up statically and not using DHCP at all.
Eliminating DHCP as the cause was one of the very first things we did, so it has been some time since I did those tests. If DHCP was the cause we'd see this when only using one AP and not roaming, which we don't.
Also, what puzzles me, is that roaming should not cause any more DHCP traffic, surely? I am also pretty sure that when the roam has failed I have tried telling the iPhone to renew its DHCP, and it has failed, but turning WiFi off and on always works.
Roaming should cause a DHCP request because you may have roamed to something with the same SSID but a different subnet. This is particularly likely with larger public networks like eduroam etc.
I know from experience that Android doesn't do this (we had two eduroam networks in close proximity such that devices would frequently roam one to the other and then break), but it seems iOS and most other OSs do as these all worked fine (as they'd get a NAK from the other network and redo the discover cycle etc).
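The NAK-then-rediscover behaviour described above can be sketched from the server's side. This is a deliberately simplified illustration with a made-up function name: it only checks whether the address the client asks to keep belongs to the local subnet, and ignores the lease database a real DHCP server would also consult.

```python
import ipaddress

def server_reply(requested_ip: str, local_subnet: str) -> str:
    """Reply to a client that re-requests its old address after a roam."""
    addr = ipaddress.ip_address(requested_ip)
    net = ipaddress.ip_network(local_subnet)
    # Address belongs on this network: confirm it.
    # Otherwise: NAK, which sends the client back to a fresh DISCOVER.
    return "DHCPACK" if addr in net else "DHCPNAK"

# Same subnet behind both APs: the client simply keeps its address.
print(server_reply("192.168.1.50", "192.168.1.0/24"))
# Same SSID but a different subnet: the client is told to start again.
print(server_reply("192.168.1.50", "10.20.0.0/16"))
```

On a single flat LAN like the one in the post the first case is the only one that should ever occur, so a client that re-requests on roam should get an ACK straight away.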
I believe (and as I say I am not the WiFi expert) that "proper" roaming does not, but simply changing to another AP on same SSID should sensibly do so.
[University, eduroam, many APs]
We have had problems with Apple devices roaming in the past. I forget the exact details now, but IIRC Apple devices did some sort of weird ARP ping to the gateway to try and detect whether they were on the same network after a roam or wake-from-sleep, rather than a full DHCP DORA.
But it's been a year or two and I can't remember seeing this for a while. Might be worked around or fixed on the Apple side or the Cisco WLCs, but may be something to check. Without any sort of wireless fast reauth (which I think I've only seen Windows do correctly, sadly) you'll generally see a full reauth and DHCP on roam.
Might also be different with PSK. We've only got WPA2-Enterprise with RADIUS, so the whole associate/auth part is different/slower anyway.
I'd probably go for getting a packet capture from the AP to see what the client is doing, and compare Apple with something else.
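Once there is a capture, a quick way to compare an Apple client with something else is to count the address-related packets in each. What follows is a minimal sketch using only the Python standard library. It assumes a classic pcap file (what `tcpdump -w` writes) with Ethernet framing, and it does not handle pcapng, VLAN tags, IPv6 extension headers or monitor-mode captures. The function names are made up for illustration.

```python
import struct
from collections import Counter

def classify(frame: bytes) -> str:
    """Label one Ethernet frame by the address-setup protocol it carries."""
    if len(frame) < 14:
        return "other"
    ethertype = struct.unpack("!H", frame[12:14])[0]
    payload = frame[14:]
    if ethertype == 0x0806:
        return "arp"
    if ethertype == 0x0800 and len(payload) >= 20:
        ihl = (payload[0] & 0x0F) * 4  # IPv4 header length in bytes
        if payload[9] == 17 and len(payload) >= ihl + 4:  # UDP
            sport, dport = struct.unpack("!HH", payload[ihl:ihl + 4])
            if {sport, dport} <= {67, 68}:
                return "dhcpv4"
    if ethertype == 0x86DD and len(payload) >= 40:
        next_header = payload[6]
        # ICMPv6 types 133-137 are neighbour discovery (RS, RA, NS, NA, redirect)
        if next_header == 58 and len(payload) > 40 and 133 <= payload[40] <= 137:
            return "ipv6-nd"
        if next_header == 17 and len(payload) >= 44:  # UDP
            sport, dport = struct.unpack("!HH", payload[40:44])
            if {sport, dport} <= {546, 547}:
                return "dhcpv6"
    return "other"

def summarise_pcap(data: bytes) -> Counter:
    """Count frames per category in a classic pcap file held in memory."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    if struct.unpack(endian + "I", data[20:24])[0] != 1:
        raise ValueError("only Ethernet captures (linktype 1) are handled")
    counts = Counter()
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        incl_len = struct.unpack(endian + "IIII", data[offset:offset + 16])[2]
        offset += 16
        counts[classify(data[offset:offset + incl_len])] += 1
        offset += incl_len
    return counts
```

Usage would be something like `summarise_pcap(open("roam.pcap", "rb").read())` on one capture per client. A failed roam that shows ARP and neighbour discovery going out but nothing coming back would point in a rather different direction from one showing no traffic at all.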
Wi-fi Assist is turned off on the iPhone(s), right? (Also have you tested with an iPod Touch or an iPad that doesn't have cellular at all?)
Yes, but tried both with and without that. We don't have iPads without mobile.
@Brandon - further to what you said, don't iPhones also behave differently depending on whether they're plugged in or not? (e.g. I once tested one on battery and when charging - in the first case it stopped responding to pings after 30 seconds, until you sent a wake on LAN command).
As Adrian's usual testing ground appears to be "in the bath", I really do hope his phone isn't plugged in at the time!
Be careful dealing with Brandon, he is unpredictable
I found that turning off IPv6 fixed everything and had a month of no problems. Upgrading the UniFi APs to 3.7.55.6308 has brought the problem back even when using only IPv4.
Tearing my hair out! Are we any closer to a fix?
Did you ever get a resolution on this? I have seemingly got the same kind of problem with Mikrotik router / Unifi APs, when roaming only and only with IPv6 enabled. I haven't done any additional or very in-depth / targeted testing as of yet and given it's 3 years later it may be completely different, but seems very similar.
I did by changing to Aruba, but someone did suggest there was an imp thing that fixed it.