RevK®'s ramblings: 💩

2014-06-13

💩

For those that cannot read it, the subject of this post is the Pile of Poo character (U+1F4A9). [Actually, Blogger has messed it up, as even my iPhone won't show it.]

I have managed to get the pile of poo to correctly display on my iPhone as an incoming SMS text (i.e. using normal GSM SMS not iMessage or some such).

This is actually quite a milestone. There are various gateways to send texts but they all seem to have limitations or ways in which they translate to/from the GSM SMS protocol.

We can usually manage to handle multi-part (i.e. very long) texts, just about. Most of the time we can even handle something called a User Data Header (UDH) which is extra binary data sent with the message.

Getting UDH right is actually crucial for iMessage registrations to work at all. Otherwise your iPhone would not believe you had the number you have (it sends a text and expects a response that has a UDH).

Getting those key things to work is hard enough, but character set coding is a nightmare. This is because texts can be sent in one of three character sets.

GSM 7 bit character set. This has 128 characters, which include the normal letters (A-Z, a-z), numbers, punctuation, and a load of accented characters as well as upper case Greek. A text can have 160 characters using this coding. There are then extra characters using ESC (escape) as a prefix to get things like a Euro symbol (using two characters). Even just getting the @ character to work can be a challenge, as it is coded as character 00 and not in its usual place, which breaks some things.
UCS 8 bit characters - the first 256 Unicode characters. You can have 140 of these in a text.
UCS 16 bit characters - the first 65536 Unicode characters. You can have 70 of these in a text.
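
As a rough illustration of why the GSM 7 bit coding trips things up, here is a minimal Python sketch. It covers only a tiny, hand-picked subset of the alphabet (letters, digits, space, @ and the Euro symbol), and the names `GSM_SUBSET`, `GSM_EXTENSION_SUBSET` and `to_gsm7_septets` are made up for this example, not taken from any real gateway.

```python
import string

ESC = 0x1B  # escape septet that switches to the extension table

# Hand-picked subset only; the full default alphabet has 128 entries.
GSM_SUBSET = {"@": 0x00, " ": 0x20}
for ch in string.ascii_letters + string.digits:
    GSM_SUBSET[ch] = ord(ch)  # these happen to sit at their ASCII positions

# Extension characters cost two septets: ESC followed by the code below.
GSM_EXTENSION_SUBSET = {"€": 0x65}


def to_gsm7_septets(text):
    """Return the list of 7-bit values for text, or raise if a character
    is outside the small subset this sketch knows about."""
    septets = []
    for ch in text:
        if ch in GSM_SUBSET:
            septets.append(GSM_SUBSET[ch])
        elif ch in GSM_EXTENSION_SUBSET:
            septets.extend([ESC, GSM_EXTENSION_SUBSET[ch]])
        else:
            raise ValueError(f"{ch!r} is not in this sketch's GSM subset")
    return septets


print(to_gsm7_septets("a@b"))  # @ comes out as 0, not 0x40 as in ASCII
print(to_gsm7_septets("5€"))   # the Euro takes two of the 160 slots
```

Anything that treats a zero as end-of-string, or assumes @ is 0x40, will mangle the first example, which is presumably where a lot of gateways go wrong.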

Each of these can be used for any part of a multipart text, but the whole of each individual text is in one character coding system. A UDH takes up space that would otherwise hold text, and multipart texts need an extra UDH as well. So it is not simple.
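
To put numbers on that, here is a small sketch of the arithmetic. All three codings fill the same 140 octet payload (160 x 7 = 140 x 8 = 70 x 16 = 1120 bits), and any UDH comes out of that. The 6 octet figure used below is an assumption: it is the size of the commonly used concatenation header with an 8 bit reference number. The function name `chars_per_part` is just for this example.

```python
PAYLOAD_OCTETS = 140  # user data available in a single text


def chars_per_part(bits_per_char, udh_octets=0):
    """Characters that fit in one text once the UDH has taken its share."""
    free_bits = (PAYLOAD_OCTETS - udh_octets) * 8
    return free_bits // bits_per_char


CONCAT_UDH_OCTETS = 6  # assumed: length octet plus a 5 octet concatenation element

for name, bits in (("GSM 7 bit", 7), ("8 bit", 8), ("16 bit", 16)):
    print(name, chars_per_part(bits), chars_per_part(bits, CONCAT_UDH_OCTETS))
```

Under that assumption each part of a multipart text carries 153, 134 or 67 characters rather than 160, 140 or 70.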

The big issue is that most text gateways are ASCII or some such, and do not map to/from these character sets. Even when XML is used that handles UTF-8, the systems rarely give enough attention to detail to translate characters correctly. We have taken the view that the only right way to do things is to use UTF-8 coding for our interfaces with customers for texts and for us to do the translations right! For this reason we have been nagging the mobile operator, and they have finally come through for us.

The good news today is that the low level raw interface has been opened up, allowing texts to and from our voice SIMs to use any of these character codings and UDH.

But even with all of that, the Pile of Poo is extra special. It is U+1F4A9, which is too big even for UCS16 coding. The trick is to use UTF-16, which uses two of the UCS16 codes (total 32 bits) to code it. To my utter surprise this actually works and iPhones handle it!
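
For the curious, the UTF-16 trick is simple arithmetic. The sketch below (the function name `to_surrogate_pair` is made up for this example) splits a code point above U+FFFF into the two 16 bit codes that go in the text, and checks the result against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20 bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F4A9)
print(f"{high:04X} {low:04X}")  # D83D DCA9

# Python's built-in encoder produces the same four octets.
assert "\U0001F4A9".encode("utf-16-be") == bytes.fromhex("D83DDCA9")
```

So the 16 bit coded text carries D83D followed by DCA9, using up two of the 70 available characters.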

We are gradually integrating various aspects of our new texting system now. The clean interface to and from our mobile SIMs is a really good start. If we can get other mobiles and even land line numbers all integrated more seamlessly, that will be even better.

9 comments:

Keiji, Friday, 13 June 2014 at 22:35:00 BST
Are those surrogate characters in the title, because I see two ORCs (object replacement characters, but I love the unintentional acronym!) instead of one?

On a side note, if 09F9 was an illegal number, is 1F4A9 a (mildly) profane number now?

Unknown, Saturday, 14 June 2014 at 09:08:00 BST
Yes, the post title seems to be two surrogate characters (which are invalid characters in UTF-8, the page encoding).

If I put a pile of poo into this comment and hit 'Preview' and then 'Edit' then Blogger gives me two surrogate characters so I suspect it might simply be broken, but let me try just posting without editing again.. 💩

Keiji, Saturday, 14 June 2014 at 14:41:00 BST
Upon downloading the page with wget and then looking at a hexdump, it's actually serving up & # 55357; & # 56489; without the spaces, so there's no browser funkiness going on here.

You would think if Blogger's going to go to the effort of replacing Unicode characters with escapes, it would be smart enough to recognise surrogates too!

batfastad, Sunday, 15 June 2014 at 08:49:00 BST
Aha! Well this post spectacularly killed my RSS reader. ttrss failed to insert the record into MySQL so that gives me something to look into!

Owen Shepherd, Sunday, 15 June 2014 at 11:08:00 BST
It arrived in my feed reader correctly encoded...

I suspect what has happened is somehow you've gotten two surrogates UTF-8 encoded in your post. Somewhere along Blogger's E-Mail chain, and somewhere along the route to my feed reader, some software has converted these to UTF-16 using a non-validating parser. At this point, the surrogates have correctly gotten shoved together in UTF-16. When they came back out, well, they came back out as valid UTF-8.

This could quite easily happen if there was, say, some Python or Java in between the two.