The cobbler's children have the worst shoes!
It is true in any industry, and to some small extent our office is no different. We actually have loads of connectivity, including half a dozen ADSL lines, FTTC, some fibres, and even a link to a Virgin cable connection.
Sadly, yesterday, the main fibre link we use for everyday connections broke. A fibre is normally very reliable, with nice low latency, making it ideal for VoIP. Obviously our whole office runs on VoIP. This should have been no problem, as we have the FTTC set up to back up all of the connectivity quickly and easily. Indeed, we have a schedule of testing the backup every month.
But this is where the cobbler's shoes come in! When it comes to designing our network for customers we have loads of redundancy: services like Office::1 with multiple lines and fallback, multiple lines into BT and TT, transit and peering, and so on. But when it comes to our own office, things are not quite as tidy as they should be... We have been messing about testing VDSL and ADSL on new routers, TR-069, config, and so on, and have been using some of the lines, including an FTTC line, for that. This meant that the scheduled backup checks had been suspended while we did the testing.
Unfortunately this meant we did not have a clean config all set up to just switch to backup - it was still full of the test routing from the router trials. Also, to be frank, I had forgotten exactly how we had the backup configured. It ended up taking me a good half hour to get the core connectivity for VoIP and other systems sorted, which is plain stupid, and I was kicking myself for forgetting the way we had it planned originally.
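For what it is worth, the sort of thing we should have had sitting ready is only a few lines. This is a minimal sketch, not our actual config: it assumes a Linux router, and the gateway addresses (documentation ranges) are made up for illustration. It pings the fibre gateway and swaps the default route to the backup when the fibre stops answering.

#!/usr/bin/env python3
# Minimal failover sketch (illustrative, not our actual config):
# assumes a Linux router where the fibre and FTTC gateways are as
# below (hypothetical documentation addresses). Pings the primary
# gateway and moves the default route to the backup if it goes quiet.
import subprocess, time

PRIMARY_GW = "192.0.2.1"    # fibre gateway (assumed address)
BACKUP_GW  = "198.51.100.1" # FTTC gateway (assumed address)

def alive(gw: str) -> bool:
    """True if the gateway answers a single ping within 2 seconds."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", gw],
        stdout=subprocess.DEVNULL).returncode == 0

def set_default(gw: str) -> None:
    """Point the default route at the given gateway."""
    subprocess.run(["ip", "route", "replace", "default", "via", gw],
                   check=True)

current = PRIMARY_GW
while True:
    want = PRIMARY_GW if alive(PRIMARY_GW) else BACKUP_GW
    if want != current:
        set_default(want)
        current = want
    time.sleep(10)

The point being that a clean, documented switch-over like this only works if it is not also doubling as a test bed.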
Simply accessing the Internet was not the big issue, of course; the real hassle of losing the main fibre link was getting things like VoIP working again, along with our fixed IP addresses, which are used for lots of things internally.
Of course, several people said "Should have got Office::1". And indeed, if we had, it would have "just worked". I am quite pleased that this is the reaction from people - it should be when your Internet breaks because you did not have the backup set up properly :-)
But then it gets interesting, and this is always a sobering exercise as it really helps everyone in the office appreciate the plight our customers go through when there are faults and delays.
The FTTC line, which the availability checker confidently lists as up to 80Mb/s, was synced at 3Mb/s. Yes, I mean THREE Mb/s. It is not entirely clear why, but the line had been put on a silly low speed profile - possibly DLM reacting to all of the testing we had done. The testing did not rely on speed at all, so we had not noticed until we tried actually using the line. Sadly, even though this is a simple setting in the cabinet DSLAM, BT want an "SFI engineer" to fix it, which is mad. We booked one, somewhat under duress, for this morning. Of course when the time came the engineer did not arrive: they had cancelled the appointment, after the point of no return, for no apparent reason. We finally got them to agree to sort the profile without an engineer "within 2 hours", but what that seems to mean is that after around 1:55 they passed the issue to "Openreach", who will now sit on it for two days, we expect.
No problem - we have a spare ADSL line not being used for the ORG testing, and that will do 20Mb/s, which is more than enough. Except, when we tried it, it turned out to be lossy, even after cranking it up to an interleaved 9dB margin, at which it managed 13Mb/s. The 13Mb/s would have been fine, but the packet loss and dropped sessions were a pain.
Then we found more fun: it looks a lot like these ZyXEL routers have the same bug as the BT ones, where a PPP restart causes VPNs to stop working until you unplug and replug the Ethernet lead. That would not be so bad were it not for the fact that the ADSL kept dropping and reconnecting. So I spent half the morning walking over to the router and replugging the Ethernet.
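If this keeps up, the leg work could at least be scripted. Purely as a sketch, not something we actually deployed: assuming a Linux box behind the router that can drop its own port, a watchdog could notice the VPN endpoint going quiet after a PPP restart and bounce the Ethernet link, which is all the manual replug was doing. The interface name and peer address here are made up.

#!/usr/bin/env python3
# Hypothetical watchdog (a sketch, not the actual fix): when the VPN
# peer stops answering - the symptom after a PPP restart on the
# ZyXEL - drop and restore the local Ethernet link, mimicking the
# manual unplug/replug. Interface name and address are assumptions.
import subprocess, time

IFACE = "eth0"              # port facing the ZyXEL (assumed)
VPN_PEER = "203.0.113.10"   # far end of the VPN (assumed address)

def reachable(host: str) -> bool:
    """True if the host answers any of three pings."""
    return subprocess.run(["ping", "-c", "3", "-W", "2", host],
                          stdout=subprocess.DEVNULL).returncode == 0

def bounce(iface: str) -> None:
    """Take the link down and up, like replugging the lead."""
    subprocess.run(["ip", "link", "set", iface, "down"], check=True)
    time.sleep(2)
    subprocess.run(["ip", "link", "set", iface, "up"], check=True)

while True:
    if not reachable(VPN_PEER):
        bounce(IFACE)
        time.sleep(30)  # give PPP and the VPN time to re-establish
    time.sleep(15)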
But hang on - why were we having issues today at all? The fibre broke late yesterday afternoon, and a fibre is on a nice SLA (something like 7 hours). We reported the fault around 6pm, so why were engineers not on the case overnight?
Well, the answer, it seems, is that BT parked the fault overnight because of confusion over contact details. They wanted a contact for each end of the fibre, and we gave them that, but it was the same contact for both ends. This makes sense: even though the contact is notionally "on site", one end is a data centre, which means calling us so we can call the data centre to let someone in and escort them, so a staff mobile was fine for that end. The same contact for the office end also made sense, as that is where the member of staff was located (or nearby). They did not tell us there was an issue, but just waited until we chased it up at 7am. Having finally restarted the fault, they did fix it just after lunch. It seems it was a bend (broken glass) in a fibre patch link at the exchange!
So now we have the fibre back, yay! But we still have a silly slow FTTC and an iffy ADSL. It does rather show the importance of backup lines used just for backup and kept online all of the time so their condition can be constantly monitored - very much the way we sell the Office::1 service.
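By constant monitoring I mean something as simple as this sketch (illustrative only, not our monitoring system): ping out of the backup interface at regular intervals and log loss and latency, so a 3Mb/s profile or a lossy line shows up long before the backup is needed. The interface name and target address are assumptions.

#!/usr/bin/env python3
# Sketch of continuous backup-line monitoring (illustrative only):
# send regular pings out of the backup interface and log loss and
# average latency. Interface and target are made-up assumptions.
import re, subprocess, time

IFACE = "pppoe-backup"    # backup line interface (assumed)
TARGET = "198.51.100.53"  # stable far-end address (assumed)

while True:
    out = subprocess.run(
        ["ping", "-I", IFACE, "-c", "10", TARGET],
        capture_output=True, text=True).stdout
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    rtt = re.search(r"= [\d.]+/([\d.]+)/", out)  # average RTT
    print(time.strftime("%F %T"),
          f"loss={loss.group(1) if loss else '?'}%",
          f"avg_rtt={rtt.group(1) if rtt else '?'}ms")
    time.sleep(60)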
I'd obviously like to apologise to people trying to call us today - calls were mostly working, but occasionally had some issues.
As ever, lessons to learn.