Basically, using BGP, one can move all of the traffic around whilst the routers still know where to send the packets. We moved all of the traffic to one link at a time of day when there was under 1Gb/s of traffic in total. And when done, we moved the traffic back. This did mean some staff up at 4am, including myself. In fact, the whole process meant two staff in the data centre, me on a computer here, other staff working from home, tech staff at the office in case there were problems (we opened the phone lines early), staff at Talk Talk, and data centre staff, all involved to co-ordinate one fibre move.
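For the curious, the general shape of such a drain is simple. Here is a minimal sketch, assuming an ExaBGP-style process script with hypothetical prefixes and next-hops - an illustration of the technique, not our actual config:

```python
#!/usr/bin/env python3
# Sketch of draining traffic off a link with BGP before a fibre move.
# ExaBGP process scripts drive the BGP speaker by printing API commands
# to stdout. All addresses here are hypothetical.
import sys
import time

PREFIX = "192.0.2.0/24"          # hypothetical prefix carrying the traffic
OLD_LINK_NEXT_HOP = "10.0.0.1"   # next-hop over the fibre being moved
NEW_LINK_NEXT_HOP = "10.0.1.1"   # next-hop over the link taking the load

def command(line: str) -> None:
    # Emit one API command to the BGP speaker.
    sys.stdout.write(line + "\n")
    sys.stdout.flush()

# Step 1: announce the prefix over the remaining link with a better
# (lower) MED, so the neighbour prefers it while the old route is
# still valid - routers always have somewhere to send the packets.
command(f"announce route {PREFIX} next-hop {NEW_LINK_NEXT_HOP} med 50")

# Give BGP time to converge and traffic time to shift.
time.sleep(30)

# Step 2: withdraw the route over the fibre that is about to be moved.
command(f"withdraw route {PREFIX} next-hop {OLD_LINK_NEXT_HOP}")
```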
Now we have done the first fibre, we can do the second any time of day, so that is planned for tomorrow during the day. It should be just as seamless.
But why do this upgrade now? Well, I am actually cross we did not do it a month ago. We forecast the need for more capacity and we started talking to Talk Talk several months ago, but these things seem to always take longer than we expected. To be clear, all of the kit was in place as Talk Talk have 10G ports already. They upgraded their kit in the data centre some time ago. We upgraded our kit to allow 10G ports at the end of last year. The only technical step is config and a change of optics in the switches which, as we saw today, can be done over a couple of hours, with a few days' notice and planning. But all of the paperwork involved in ordering this, at a cost that is more than my first house, is very time consuming. Well done to my team for sorting it though.
Anyway, back to why I am cross. I am cross because our service was not as good as we expect. We don't make contractual promises as so much is out of our hands, but we aim high none the less. It is interesting to see what that actually means. I have picked a line that is on that backhaul. The congestion we saw was happening only on some of the lines, not all. But I can show you what the congestion means in practice.
This is a graph for one of the impacted lines, from our monitoring, which records packet loss and round trip latency for the line every second of every day.
I'll explain the key - the green dots are usage and not relevant here; what we are looking at is the blue/green at the bottom. Normally it is low, and is in fact 7ms round trip latency minimum, average, and maximum. But you can see some green bits, and even a slight hump in the blue. The green is peak, so typically one test in a hundred. The blue is the average, so several samples need to be high for that to increase. So this shows that sometimes there are peaks of 20ms or 30ms and even 50ms just after 8pm, even though the average is still down at 10ms or below.
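To make the min/average/max/peak distinction concrete, here is a minimal sketch of that style of monitoring in Python; the probe function is a hypothetical stand-in for the real per-second test, not how our probes actually work:

```python
#!/usr/bin/env python3
# Sketch of per-second latency monitoring: one round-trip measurement
# per second, summarised per interval into min/average/max and a
# ~99th-percentile "peak" (roughly one test in a hundred).
import random
import statistics

def measure_rtt_ms() -> float:
    # Hypothetical stand-in for a real probe (e.g. one echo per second).
    # Simulates a quiet ~7ms line with the occasional congested sample.
    base = random.gauss(7.4, 0.2)
    spike = random.expovariate(1 / 20) if random.random() < 0.01 else 0.0
    return base + spike

def summarise(samples: list[float]) -> dict[str, float]:
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "avg": statistics.fmean(ordered),
        "max": ordered[-1],
        # "Peak" as roughly one test in a hundred: the 99th percentile.
        "p99": ordered[int(len(ordered) * 0.99)],
    }

# One hour of one-per-second samples, as in the graphs above. Note how
# rare 20-50ms spikes barely move the average, but show up in the peak.
hour = [measure_rtt_ms() for _ in range(3600)]
print({k: round(v, 1) for k, v in summarise(hour).items()})
```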
Now, I know some people who would kill for a line that good and that clean from their ISP, but really, it is not the A&A way. We expect better!
This is what it should look like, and is in fact another line at the same premises on the same backhaul. As I said, only some lines were affected. This is what it will look like now we have upgraded.
This is 7ms latency! In fact, to be technically correct, minimum 7.1ms, average 7.4ms, maximum 7.8ms. That is what an A&A line should look like - though the base latency will depend on line type and interleaving and so on, we are not adding any extra by way of congestion in our network.
And that is what we aim for - not being the bottleneck. That is why we have upgraded some backhaul links to 10Gb/s.
P.S. As someone will ask - the usage dots are different but similar. This is because these are bonded lines and the actual throughput depends on the line speed; the lines have slightly different sync speeds, meaning the throughput of each is biased slightly to match.
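A minimal sketch of that weighting, with hypothetical sync speeds (not the actual lines'):

```python
#!/usr/bin/env python3
# Sketch of why the usage dots differ between bonded lines: traffic is
# shared in proportion to each line's sync speed, so the slightly
# faster line carries a slightly larger share of the same total demand.
SYNC_KBPS = {"line-1": 6784, "line-2": 7168}  # hypothetical uneven syncs

def bonded_share(speeds: dict[str, int], demand_kbps: int) -> dict[str, int]:
    total = sum(speeds.values())
    # Each line carries demand weighted by its share of bonded capacity.
    return {line: demand_kbps * speed // total for line, speed in speeds.items()}

print(bonded_share(SYNC_KBPS, demand_kbps=10000))
# -> {'line-1': 4862, 'line-2': 5137} (integer truncation): the faster
#    line is biased to carry proportionally more.
```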
P.P.S. This is the same line (as the first one above) last night after the upgrade - as predicted, 7ms all the way down...
Out of interest Rev, I would be curious to hear a bit more from you about peering and transit as a function of congestion, as I assume, with the good backhaul you provide, they're our main enemies :).
These are actually the next targets for 10G. We have multiple 1G links to transit and peering, but yes, they are the next enemy. The bulk traffic tends to be streaming and so we see a lot via peering. Making these 10G will help a lot. Obviously in the middle it is relatively easy to throw more and more 2Gb/s LNSs into the mix.
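A minimal sketch of that scaling idea, with hypothetical numbers and a simple least-loaded policy (not our actual steering logic):

```python
#!/usr/bin/env python3
# Sketch of scaling the middle by adding LNSs: steer each new PPP
# session to whichever LNS has the most headroom, so total capacity
# grows roughly linearly with every 2Gb/s box added to the mix.
LNS_CAPACITY_MBPS = 2000

# Current load per LNS (hypothetical).
load_mbps = {"lns-a": 1400, "lns-b": 900, "lns-c": 300}

def pick_lns(load: dict[str, int]) -> str:
    # Place the next session on the LNS with the most spare capacity.
    return max(load, key=lambda lns: LNS_CAPACITY_MBPS - load[lns])

print(pick_lns(load_mbps))  # -> lns-c
```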
I don't see latencies that low on my slow BTW lines, presumably because of interleaving?