2024-06-29

TOTSCO correlationID

RESOLVED! See below!

My latest concern is understanding the TOTSCO specification. It may be that I have misread it, or not read enough, and I am fully prepared to accept I have this wrong. It came up because the buddy CP and I read it differently.

Messages each way have a source and destination correlationID. This is necessary to allow a response to be correlated with a request. An initial request does not need a destination correlationID (indeed, should not have one), but needs a source, and the reply needs a destination correlationID matching that source (and arguably maybe not a source of its own, except it is mandatory in §2.1.5, except it is not in §2.1.8).

My initial interpretation was that each message type that was a Request would have a response that is a Confirmation or a Failure. And that the Request/Confirmation or Request/Failure would need matching correlationID so the response could be matched to the request, but that was it.

Indeed, all of the messages and responses that progress a switching order also contain a switchOrderReference, so no actual need for correlationID at all anyway in those.

My code would send a Request and wait for a response, using the correlationID to match the response. This is synchronous in the customer order process where the SLA for a match request is 60 seconds. We make the customer wait for the response up to 61 seconds.
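A rough sketch of that per-request matching, in Python purely for illustration (the real code is in C). The class and method names are hypothetical, and the injectable clock exists only to make the sketch testable. Each Request gets a fresh correlationID, and a reply is accepted only once and only within the 61 second wait.

```python
import time
import uuid


class PendingRequests:
    """Track outstanding requests by the source correlationID we issued."""

    def __init__(self, timeout_seconds=61.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.pending = {}  # correlationID -> time the request was sent

    def new_request(self):
        """Issue a fresh, unique correlationID for one Request."""
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = self.clock()
        return correlation_id

    def match_reply(self, destination_correlation_id):
        """Return True only for the first in-time reply to a known request."""
        sent_at = self.pending.pop(destination_correlation_id, None)
        if sent_at is None:
            return False  # unknown, repeated, or already handled
        return self.clock() - sent_at <= self.timeout_seconds

    def expire(self):
        """Forget requests whose reply window has passed; return their IDs."""
        now = self.clock()
        expired = [cid for cid, sent_at in self.pending.items()
                   if now - sent_at > self.timeout_seconds]
        for cid in expired:
            del self.pending[cid]
        return expired
```

Because a matched correlationID is removed on first use, a repeated or very late reply simply fails to match rather than being applied to the wrong request.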

But then I saw the published TOTSCO test cases, and they all had a destination correlationID for the ongoing messages, the residentialSwitchOrderRequest, for example.

This only made sense if the whole sequence, such as the following, were all a single message flow with a consistent set of correlationIDs each way for the whole sequence.

  • residentialSwitchMatchRequest
  • residentialSwitchMatchConfirmation
  • residentialSwitchOrderRequest
  • residentialSwitchOrderConfirmation
  • residentialSwitchOrderUpdateRequest
  • residentialSwitchOrderUpdateConfirmation
  • residentialSwitchOrderTriggerRequest
  • residentialSwitchOrderTriggerConfirmation

If that is the case I have to hold correlationIDs much longer, and associate with ongoing switch orders. I spent many hours re-working the system to do just that. This had issues with the possibility of delayed/repeated messages, which can happen. A reply may be to an earlier message with the same correlationID. I'd far prefer the previous interpretation where each Request has a new and unique correlationID which has to be quoted in the single corresponding response (Confirmation or Failure). It would be simpler and easier. But the test case examples make it clear that this is not the case, which is messy and a lot more work.

I have now asked TOTSCO to clarify. I have not had a reply yet.

So, even though I did all the extra work, I am happy if they come back and say it is for each message pair distinctly. But they must update the specifications and examples and test case to make that clear, as it is a lot more work to track these over a complete (multiple days, weeks) switch order process than over a simple message pair.

For now my code does both - it tracks and uses a consistent correlationID for the whole sequence of messages, but accepts new correlationIDs for each pair of messages if that is what we get.

Update: "The specification does not call for either option to be a requirement, but our expectation and the behavior [sic] we have seen so far in testing is that the second option is being applied by users. There is nothing to stop a CP from wanting to use the same correlation ID throughout a whole switch journey, but the important thing is that they cannot expect their counterpart CP to follow the same behavior [sic]."

This is typically not helpful. If even one CP can expect / require the destination correlationID for a residentialSwitchOrderRequest to be their source correlationID from the previous residentialSwitchMatchConfirmation, then that means all CPs will have to track correlationIDs through the sequence, else they will not work with that CP. If a CP cannot expect / require that, then no CPs need to do that. The spec needs to say one way or the other. Saying "The specification does not call for either option to be a requirement" is a useless response!

Update: Finally a straight answer - I wasted a day making my code work the same as the test cases, FFS.

"We would like to inform you that, according to the specification, a switch order request is not seen as a response to a match confirmation. Additionally, the TOTSCo hub does not require users to include a destination correlation ID in any request message."
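That answer can be turned into a simple envelope check. The sketch below is illustrative Python, not the NOTSCO code. The envelope layout (envelope.source.correlationID, envelope.destination.correlationID, envelope.routingID) and the idea of classifying a message by the suffix of its routingID are assumptions made for the example, and check_correlation_ids is a hypothetical name.

```python
def check_correlation_ids(message):
    """Return a list of problems with the envelope correlationIDs."""
    envelope = message.get("envelope", {})
    routing_id = envelope.get("routingID", "")
    source_id = envelope.get("source", {}).get("correlationID")
    destination_id = envelope.get("destination", {}).get("correlationID")

    problems = []
    # Treated as mandatory here, although the spec sections disagree on this.
    if not source_id:
        problems.append("missing source correlationID")
    if routing_id.endswith("Request"):
        # A request is never a reply, so it has nothing to quote.
        if destination_id:
            problems.append("request should not carry a destination correlationID")
    elif routing_id.endswith(("Confirmation", "Failure")):
        # A reply must quote the source correlationID of the request.
        if not destination_id:
            problems.append("reply needs the destination correlationID of the request")
    return problems
```

An empty list means the envelope is consistent with the "each Request/response pair stands alone" reading. Whether a stray destination correlationID on a request should be rejected or merely ignored is a policy choice.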

2024-06-28

Will TOTSCO be ready?

The One Touch Switching should be live 12th September. Will the "industry" be ready?

I am not sure.

We are on the pre-production platform now, doing integration testing. There are 47 CPs on the system, including us. And yes, please, any other CPs on there try sending us match requests. And if you need more testing try https://notsco.co.uk/

So I tried sending a match request to each.

The responses were interesting. A lot did respond, which is good, but what is fun is the range of different errors. This is a reflection of how badly the specification has been written. All should have failed to find any service for the name at the address. But the actual error codes and error texts varied a lot. If the specification was good, the response would have been consistent. It is not. Fun!

Quite a few did not respond, fair enough, they may only have their pre-production on line for testing.

Some failed with delivery timeouts, and one with an invalid API Key!

I really am not sure this will all be working. I mean, I think we are 100% ready according to my reading of the spec, and if I have the spec wrong, I am 100% confident I can address that within minutes. But I am not sure of others.

My biggest mistake today was finding apache had a weird 5 second delay. It seems I am not alone if you google that, and there is a simple fix for it (Content-Length). The CP we are working with may have the same issue, but I am not sure they have the means to debug at the right level to see and resolve it. I'm glad we fixed this, and embarrassed it was wrong.

What is fun is today TOTSCO also failed to meet their own SLA on response times to messages. No reply on that yet.

But all of this is "nuts and bolts" of messaging, and nothing close to the high level issues I fully expect to stem from the whole system. CP to CP messages going wrong has a whole new level of possible issues, and I am not sure we are close to tackling those.

Wow, and one replied after 4 minutes, and replied twice!!! (the SLA is 60 seconds). Their reply had incorrect auditData, and incorrect content in the payload!

2024-06-26

TOTSCO, gets worse

Seriously, this is bad.

TOTSCO have specifications for the whole process, but they are made of cheese. They don't even specify such fundamental things like the basic data types for something like an RCPID (Retail Communications Provider ID). I have argued with them, as one spec does say it is "4 alpha characters, not starting A", but they dismiss this as not actually the spec of an RCPID, and seem to have no issue with not having a specification?!?!

To be clear, I would expect it to be something like: "An RCPID is assigned by TOTSCO, and is 4 alpha characters not starting "A", or the 6 character string "TOTSCO". In JSON it is a string type value. By convention it starts with an "R", but this is not a requirement and should not be assumed.", and I would even like them to reserve "TEST" as a special RCPID. I'll help them write a spec if they ask!

A clear specification to which all CPs can refer is essential. Heck, we are used to this with RFCs. The RFC is the reference, and who has got it right or wrong is decided by reference to the RFC.
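With a definition like that, validation becomes a one-liner. A minimal Python sketch of the wording proposed above (which is a suggestion, not anything TOTSCO has published). Treating "alpha" as upper case A-Z only is an assumption, and is_valid_rcpid is a hypothetical name.

```python
import re

# 4 letters not starting with "A", or the literal string "TOTSCO".
# Upper case only is an assumption; the proposed wording just says "alpha".
RCPID_PATTERN = re.compile(r"(?:[B-Z][A-Z]{3}|TOTSCO)")


def is_valid_rcpid(value):
    """True if value is a JSON string matching the proposed RCPID syntax."""
    return isinstance(value, str) and RCPID_PATTERN.fullmatch(value) is not None
```

Note that "TEST" passes this check as an ordinary RCPID, which is why it would need to be explicitly reserved.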

But what is worse is the whole testing and integration process!

There seem to be these steps:-

  • A really simple messaging test (their simulator). It is flawed, but checks basic OAUTH2 at least.
  • A CP to CP integration test using their pre-production platform. <-- WE ARE HERE NOW!
  • Then live!

At no point is anything tested to the specification!!!!

I am not sure there is even a process for reporting and resolving a CP not following what little specification there is!

This is a serious problem, and as a simple example, we are currently going through the integration testing process with a buddy CP that has already done it. I won't name them, it is not their fault.

The first test

The first test was actually pretty good in many ways - they misread the details I provided and sent a residentialSwitchMatchRequest with an invalid account number, and we replied with an error, saying it was an invalid account number format. Yay, a good test.

So, I take that as a huge success of a test.

But no...

Their request was wrong in other ways, and now I see it, I have updated my system. They sent an envelope destination correlationID on an initial message which is not according to the specification. We mistakenly used that in our error reply. Oddly TOTSCO sent us a messageDeliveryFailure even though the other CP got our message, and we then barfed at the correlationID on that, because it was not one we issued!

So why TOTSCO sent the messageDeliveryFailure is unclear. But the other CP got it wrong in the original message anyway. What is worse is at least one messageDeliveryFailure is incorrect as well, according to the specification as it had no source correlationID, which is mandatory.

So at this point, we had a few checks missing, but the other CP had their message slightly wrong. They are the ones that have passed integration testing and are sending a wrong message to us. They fixed it and tried again, but TOTSCO then failed to deliver the message to us, which looks like another TOTSCO error.

Naturally my NOTSCO system picks up this stuff now.

Working with them

To be clear, we are working with the other CP here, we want to make it work.

Update: They had not gone through integration testing, which suggests they have been waiting at least a month for someone to buddy with, which suggests yet another serious problem in the process!

So, the score so far...

  • Other CP, 1 error (minor), fixed.
  • Us, 1 error (not handling their error well), fixed.
  • TOTSCO 2 errors, still awaiting a reply.

Update: Not a peep from TOTSCO all day so far, formal tickets raised.

Update: After raising tickets, I have some replies. They claim we did not respond within 2s, but my logs show no request, so some packet dumping next.

Update: One reply is interesting - their invalid message is apparently correct as two parts of the specification contradict each other.

Update: They said we did not respond in the 2s SLA, but when I asked for the SLA it states 3s (after up to 1s connection time), so no idea where the 2s came from.

Update: and wow...

Don't trust apache!

This may be of use to other CPs here. The SLAs are tight, they want a response (at http level) within 3 seconds.

It is run as an apache CGI function executable. It responds to stdout with Status, Content-Type, and content (JSON), and exits. That should be it. Simples!

My code was responding quickly, indeed, usually well under 100ms.  This was as measured in the code, and measured from an external connection (NOTSCO).

However, TOTSCO were still struggling and saying we were timing out. Very odd indeed, so I did packet dumps to prove them wrong.

To my shock, the packet dump showed a 5 second delay in the middle of the TCP.

After some experimentation, noting TOTSCO send Connection: keep-alive, I eventually found that if I sent a Content-Length, then the JSON, apache no longer fucked about, and responded instantly.

I can only assume some persistent connection thing - which is not usually very good with CGIs like this. But even so, having closed stdout and exited, I expected apache not to wait.

So, heads up, that 3 second timeout SLA can catch you out!
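For anyone else writing a CGI handler, the fix amounts to working out the body first and sending its length in the headers. A minimal Python sketch of the idea (the real handler is C, so this is only an illustration). The function names and the example status are hypothetical.

```python
import json
import sys


def build_cgi_response(status, payload):
    """Build a complete CGI response with an explicit Content-Length."""
    body = json.dumps(payload).encode("utf-8")
    headers = (
        f"Status: {status}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"  # length in bytes, not characters
        "\r\n"
    ).encode("ascii")
    return headers + body


def send_cgi_response(status, payload, out=None):
    """Write the response to stdout (or another binary stream) and flush."""
    out = out if out is not None else sys.stdout.buffer
    out.write(build_cgi_response(status, payload))
    out.flush()
```

The length has to be counted after encoding, as any non-ASCII character in the JSON makes the byte count differ from the character count.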

2024-06-12

NOTSCO (Not TOTSCO) One Touch Switching test platform (now launched)

I posted about how inept TOTSCO seem to be, and the call today with them was no improvement.

It seems they have test stages...

  • A "simulator" to prove basic connectivity, well, sort of. See blog!
  • Pre production (i.e. live with another CP, but not testing against the specification in any way).
  • They may have a wider general pre-production stage as well.

They seem to be missing the obvious, a proper simulator platform that can simulate communications with another CP using TOTSCO, both ways. This has the aspect that the testing is against the spec, not against other CPs and their interpretation of the spec. It would be something to use whilst developing OTS for an ISP, and before going on to preproduction testing.

Missing link

So how do we address this missing link, a platform to test TOTSCO as if talking to another CP, but without actually doing so. Testing against the specification?

Well, we, like other CPs, I am sure, made some simple test systems before going to TOTSCO. But external testing is invaluable. Even if the external systems have it wrong in terms of following the spec (as long as they will fix it), they won't have the same errors as you have. The best external test would be TOTSCO, making a proper CP to CP simulator system.

But it does not exist - so, as you might expect, if you know me, I have made it, for free.

  • No need to book test slot, just sign up and use for as little or as long as you need.
  • Configure the responses you want to a match request.
  • Send a match request with various options.
  • Send and receive the various messages for a switching order.
  • Send deliberately wrong messages to test your error checking.
  • Test as you go, an ideal way to test your code as you develop it.
  • Logging and reporting messages each way, in detail, with errors and warnings detected, quoting the specification and section that applies for anything it finds wrong.

From a privacy perspective, I am not expecting personal data to be stored, but we are deleting all test data at the end of each day anyway. I did wonder about a report download option maybe.

Now launched

It is now launched at https://notsco.co.uk. It took me a few days to create all this, about the same as it took TOTSCO to actually reply when we asked to go on pre-production testing (and they still have not actually set that up). Thank you all for your patience.

Discussions, bugs, feature requests - on GitHub please.

I have told TOTSCO about it as well...

2024-06-10

Working with TOTSCO

This is hopefully going to help other small ISPs that will have the same challenges.

As I explained in my previous post, we have to work with TOTSCO to set up One Touch Switching. Well, we are doing that now that TOTSCO actually exists. The new deadline is September, but we want to ensure we are working well before that.

Specifications

The specifications are not too bad. They have a few inconsistencies, which I have fed back to them. But I was able to code the system reasonably quickly. I created my own test system to act like TOTSCO so I could test my code with messages in and out in advance.

The underlying system is, as I say, just a messaging process between telcos. It uses OAUTH2, which is simple, and involves JSON messages each way, which is also simple. I use C and a load of long standing in-house JSON libraries, but most people would use some other platform with standard JSON libraries, I am sure. It should be pretty simple. Obviously the hard part is integrating with whatever back end systems and processes the ISP uses, oh, and ensuring clean address data for matching services, including UPRNs.

Simulator

TOTSCO have a simulator, which is good. It will allow testing against them. It has been two weeks since I finished coding it all, and we are only just on the simulator, but it is a mess, so far.

  • The token issuing URL had an invalid certificate (wildcard, but one level too high). I ignored that to get further testing.
  • The directory URL did not work (404). This provides (or should provide) the list of ISPs, basically.
  • The messaging URL simply said "Error connecting to the back end".

Well, that is not a good start, but chasing up, after several days they finally want me to check I am using the correct URLs. Good thing to check, but I was, as per the spec.

  • They fixed the token certificate, good, but the reply did not say they fixed it. The new cert now uses a different CA that libcurl does not know, or some such, which is fun. But at least it is valid.
  • They told me to use the directory path but on the token issuing host, which makes no sense. Re-reading the documentation it certainly implies the directory URL is an "API" and so you would expect to use the API host. So that is weird. But it still did not work (404 Not Found). I eventually found it works if I add the optional parameter &identity=all. Well, it is meant to be optional, and is a GET form style argument, so how it was giving 404 is beyond me. Interestingly, with that, it works on token host and API host, so even weirder.
  • They told me to use a path for the messaging that starts /testharness/ which is not as per the specification (which states /letterbox/). So basically the simulator does not follow the specification! Using testharness gets further but a different error this time.
  • Oh, and the directory I get has RCPIDs (Retail Communications Provider IDs) which don't meet the specs, so, of course, my code barfs trying to put them in the database which was set for 4 characters, as per the specification. So again, the simulator does not meet the specification.

Some progress

Well, surprisingly, we have a quick response now.

  • They say that the duff RCPIDs are dummy entries. OK, but surely they should at least have correct syntax, as otherwise it is sensible for my end to reject them.
  • They just say testharness should work, but I have to use specific RCPIDs for testing, good (would be nice if that was documented, maybe I missed something). But they really need to fix it to actually follow the spec and use letterbox.
  • I got as far as testing a match request and them trying to send a reply. They get an OAUTH2 Bearer token, and then try and post a message, but the message they post does not use the same bearer token I issued to them, so is rejected.
  • I can see what they tried to post and it does not have the right source and target RCPIDs or correlationIDs, so again I would reject them if they actually authenticated.
  • Oddly, after more tests, they are using the right bearer now, but still wrong IDs.

The irony here is that part of my coding was to make a simulator for my own testing before going to TOTSCO, and so far my simulator is way better than theirs!

Next steps

I have come to the conclusion that the simulator is actually useless. It does not simulate either the TOTSCO messaging platform (as it does not actually use the right URLs, or provide a sensible directory, or actually do OAUTH2) nor actual end to end messaging (as it does not do source/target RCPID or correlationID correctly).

What really puzzles me is that we know we are not the first to do this, and we know some of the big telcos have done this. So how have other ISPs not ripped TOTSCO to pieces over this stupidity already?

Follow up call

We have had a call. They explain that the simulator is totally dumb, it cannot be told to initiate any messages, and all it does is send one of two fixed replies to a match request (depending on the RCPID to which it is sent). It is meant to test connectivity.

But they want to do more than just two match requests and replies, they want us to send the order, update, trigger, and cancel requests.

This makes no sense, as the match requests test connectivity both ways already. And, of course, my system will not do that, as it has not received a valid match confirmation reply. The fixed text they send is not valid, as it has the wrong RCPID and correlationID, so we don't accept it and don't store the switch order reference. And as such my system does not see a switch order we can place or update or trigger or cancel.

I could fake such messages, but that is not testing my system.

They say that if I email explaining this, they will move us to the pre-production platform. This is the same as live, but with other CPs.

What they seem to lack is any sort of useful simulator that handles messages both ways as if to another CP. This would seem a sensible step before going to pre-production testing.

Pre-production testing

We have moved on. Yay!

But the simulator test is meant to test connectivity, and seriously, does no more than that.

So you would hope and expect it simulates the real system.

But no!

  • The pre-production system has a stupidly big Bearer token, which breaks SQL tinytext. The simulator was way smaller, so not representative of the live system.
  • The pre-production system can't talk to us, as it is not sending an Authorization header, WTF!?

I can confirm we have pre-production testing, and now we have to work with a buddy CP to test. They spent a week not finding one? So we suggested someone, and we are now ready to send and receive messages to complete the integration testing.

This whole process would be literally weeks quicker if they had something like my NOTSCO system.

More challenges

TOTSCO seem to see no issue with the fact they have not defined key data types, such as an RCPID. Well, they do, in one document, but they refuse to follow that spec and insist they have not specified. How they can even start without specifying key data types is beyond me.

2024-06-07

One Touch Switching

OFCOM have come up with a few things that are perhaps a tad questionable in terms of their benefit or practical application (in my personal opinion, of course). Sanity checking CLIs is one which created back scatter and broke useful services, but putting that aside, the latest is "One Touch Switching".

So what is it?

The concept seems relatively simple - a residential/consumer with a fixed location broadband (i.e. internet access) or telephone ("Number Based Interpersonal Communications Service") should be able to easily switch to a new provider. They should be able to do it as "one touch", i.e. their one order with the new provider.

Does this make sense?

Well, maybe. From a consumer point of view, for many people, the fact that moving from one "Openreach back end" broadband provider to another is different to moving from one technology to another may be confusing. Fair enough.

It is different for a reason - if you have a broadband service provided over Openreach based copper (or worse, aluminium) wires, you can change provider by the new provider working with Openreach to change what is attached to those wires and the ISP to which it is routed, and pretty much seamlessly move from one ISP to another. Of course ISPs vary, some don't even have IPv6, and some use CGNAT, and some filter or log stuff, so not really "switching", but OK.

But if it is a different technology, e.g. moving from VDSL on wires, to some radio (WiFi) service, or Starlink, or Virgin, or mobile, or, well, anything that is a different technology, the process is different.

But it is not complicated! It is order new service and cease old service. If you have any sense you arrange an overlap to ensure the new service works well for you before the old service stops, as it is not the same "wires". (And no, you cannot easily arrange that overlap now!)

OFCOM could have mandated that "ceasing" a service has to be simple and easy. That would have made any change of technology simple. They chose a different path.

How does it work?

Well, that's another problem, as OFCOM said to "industry, you have to do this", and expected something magically to happen. It did not, and has been delayed. Eventually some new company called TOTSCO has been created that is co-ordinating it.

This new system is simply a way for one telco to talk to another, with some quick, well defined (ish), messages to handle the process. Spoiler, it is JSON!

Basically the new provider ("gaining" provider) messages the old provider ("losing" provider) to match a customer and address, and if that all works they can start, and then later finish, a "switch". Old provider is expected to email customer with any early termination charges and stuff, good.

What it does not do?

It does not actually change the switching, migrating, or porting systems in place now. It simply adds a new layer.

If the process involves some migration or porting, that happens the same as it ever did. If it does not, e.g. changing broadband from Virgin to Starlink, all it does is coordinate the cease of the old service when the new one starts.

More work for customer!

Our broadband provide and migrate order forms are complex enough, we have to know exact address and what service we can offer, and if migrating from another Openreach service. But now we have an extra layer on top to match the service from the old provider. It saves the customer ceasing the old service if it is a change of technology, but if a migrate then it makes no difference, just adds more that we have to ask and more that can go wrong.

But for some people it may help, especially if ceasing an old service would be hard work. Some ISPs seem to make it hard work. So some good, maybe.

It seems to also stop most "anti-slamming" measures - not allowing losing ISP to cancel a migration now!

The old systems still needed!

However, the new system is only fixed location internet access or telephony, and only consumers. Anything else still has to work as before, business services, and services that are not fixed location. And even for the cases the new system applies, the old systems to migrate and port are still needed to make it happen.

Some hope?

Maybe, just maybe, number porting, which seems to involve a lot of manual work now, could be improved using some new messaging system used for One Touch Switching. If so, that will be good.

The issue here is many VoIP services are not "fixed location", so outside the scheme. We have had lots of issues with people porting numbers to us where the "address" did not match, when in fact the losing provider's idea of "address" is years old, from before it was moved to VoIP. The new system simply does not apply to non "fixed location" services, so that will be no help at all. A system like mobile ports, using a "PAC", may be way better, and not location dependent.

For us, porting a telephone service, from a fixed location, it may help, as it may confirm address match and confirm losing access provider, so ensure porting (which still has to use the same old system) may be more reliable. We hope so.

What's in a surname?

I mentioned a lack of any means to avoid "slamming", forced change of ISP/telco. This could be someone hijacking customers, or some end user being malicious and migrating someone's service for fun or malice or fraud.

The one thing the new system expects is a match of surname. They have a cryptic requirement to remove accents, but that is messy, depending on language and alphabet, simply "removing" an accent is far from "equivalent" to non accented. But we have done that in a crude way. But we do have to match surname.
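To show what a crude approach looks like, here is a minimal Python sketch (not the actual matching code, and not anything the specification prescribes). It decomposes characters with Unicode NFKD, drops combining marks, and compares case-insensitively. The function names are hypothetical.

```python
import unicodedata


def normalise_surname(name):
    """Crudely strip accents and case so surnames can be compared."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (accents) left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def surnames_match(a, b):
    """True if two surnames are equal after crude normalisation."""
    return normalise_surname(a) == normalise_surname(b)
```

This shows the messiness: "Müller" becomes "muller", but letters such as "ø" or "ł" have no Unicode decomposition, so they pass through unchanged and would need their own rules.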

So we have allowed customers to set the surname on their broadband services. This is not for VoIP as our VoIP is not fixed location, so will never match for One Touch Switching anyway, and needs old school porting out.

What I have now put on the web site re slamming is:-

For a long time we have operated an anti-slamming option where you tell us in advance that you do not wish your broadband to be migrated to a new provider. You could then change that at any time.

However, the new One Touch Switching system works differently. We will no longer be able to reject switching. However, to start switching the new provider needs an address and surname to match. They can start a switch process in BT without, but this is less likely as the normal process for consumers, and probably most businesses, will be One Touch Switching.

Because the surname has to match, we now allow you to edit the contact name on each line you have with us. Your name is what you want it to be, so picking any name for any circumstance is your right, and we have to respect that and allow you to change your name under GDPR, even if only on that very specific part of our system - the contact name for a broadband service.

If you change your surname, even if it is to PSJKHGJGEXC, then that is your choice. And any One Touch Switching match request would fail unless using the surname PSJKHGJGEXC.

Obviously this is meant to be for your surname not really as a pseudo password, but, well, it is up to you.

FB9000

I know techies follow this, so I thought it was worth posting and explaining... The FB9000 is the latest FireBrick. It is the "ISP...