Obviously all of the code, and scripts and web applications and so on have been updated over the years in many ways. If nothing else the advances in security best practice alone are reasons to update systems.
One aspect to the way some of the code has been written is "defensive coding". This is basically assuming the worst can happen (either by accident or malice by an attacker) and trying to ensure the outcome of such will always be "safe" in some way.
The way we (A&A) do Direct Debits is one example of this. At every stage in a very complicated set of processes and scripts we try to assume the worse and take the safest action on failure. And one area of this is the notices sent to customers. There are a load of rules we should follow, and (seemingly unlike so many other companies) we try very hard to follow the rules very very carefully.
For example, if we have notified a customer that we will "collect £20 on, or immediately after, the 1st of each month" we record that fact and try to ensure that DDs only go in if that is the case. The rules allow up to 3 working days later, but the second we are not collecting £20 exactly, or it is not exactly on the 1st or within 3 working days, we cannot rely on that notice and so the DD is not allowed. Indeed, if we notify the customer anything else such as cancelling a collection, we discard the record of that notice so a new notice has to be issued.
Another aspect is checking it all worked, this is a key step for defensive coding - scripts and systems to check things that should work have worked. Such a system flashed lights, dinged my phone, and even set off the very loud teletype in my hall in an attempt to make quite sure I knew things were not quite right and that some intervention may be needed. That said, the comments on Mastodon and call from staff were also a clue something had happened!
So what did happen yesterday?
With all of this in mind, shit happens, and did yesterday, and a load of A&A customers were told their 1st June DD collection was cancelled. So what happened?
The root cause was an update to one of our billing systems SQL servers. We do updates to all sorts of servers all the time, and it seems the timing of this was unfortunate. Many updates are to ensure all security patches are applied, and so done as soon as possible.
Once again, defensive code - the systems generating billing records, e.g. for calls, queue them up until the SQL server is back. And the billing system using these databases to make bills will re-run a few hours later if there is any issue accessing the data so just delaying bills slightly.
The problem is that there was, apparently, a small window each month where there is a process of "working out what people will be billed in two working days". This is a test/dummy billing run. This is done to confirm the regular direct debits are as expected and ensure they are submitted to BACS two days in advance, so two working days before the 1st of the month, i.e. yesterday. That was the system that failed to access the database, and unlike normally billing runs, which can run a few hours later, it left a load of customers set up to not be expecting a bill on the 1st.
This is a small window, and as a result it meant that a load of customers were not expected to have a bill on the 1st, even though they will. If the bill on the 1st had an issue it would have run a few hours later and still billed. But this was a test/dummy bill run in advance and run only once a day.
It is not just the 1st, we have billing on 4 week cycles, and every full moon, and all of these work in the same way.
What happened then is the direct debits scheduled for the 1st were cancelled because the test/dummy billing run did not say a bill was expected. The system sending the cancel emails knows to also forget the previous notice of collection. This means that not only were "cancel" emails sent, but once the bills do happen on the 1st they will need a new Direct Debit collection notice email for 5 working days to be sent - so following the DD rules.
It means people will have a DD collections on the 8th, not the 1st, and then from the following month should be back to normal.
Whilst it is a mistake, the system, being defensively coded, has followed the DD rules for notices to the letter, delaying collections, and sending new notices.
What have we done?
We have changed the logic slightly, which means this should not happen in quite the same way, but we have also made sure the ops team know not to do an update in the middle of this test billing run in future.
Sorry for any inconvenience caused by an extra week's credit / time to pay on your bills. As always we try to learn from our mistakes. And well done to my staff handling all the calls and emails today on this.