Wednesday 26 August 2020

I don't understand why it doesn't work on your computer

A few years ago I was responsible for the technical production at an exhibition venue. One of the artists wanted to present a work that consisted of a digital animation overlaid with subtitles in a number of different languages. The order and appearance of these languages changed every few seconds and included Hindi, Russian, Japanese and Arabic. She had encoded these texts in a plain-text .srt file and failed to include so much as a charset pointer to indicate how the computer should interpret the information. As should be obvious to anybody with basic computing knowledge, this gave rise to a number of problems while installing the work, with many characters not displaying properly or being otherwise scrambled. My suggestion that perhaps the subtitles should be hard-coded into the video file was met with some derision for the amount of 'difficulty' involved. A few days later the subtitles still weren't rendering correctly and in a somewhat desperate voice she exclaimed, 'I just don't understand why it doesn't work on your computers.' The only reply I could think of was that I couldn't understand why it did work on her computer. In the end the 'solution' we implemented was buying a used computer with the same specifications as hers, and even then we ran into some small issues.

This problem has popped into my mind on several occasions since then. Because of the large number of possible issues and causes I still don't know how to readily solve these problems, but I thought it would be worthwhile to at least point out some of the difficulties involved.

This text will get technical rather quickly, and because I'm writing a text on a digital platform about the difficulties of writing text on a digital platform, there are also some meta-issues that I hope you won't notice.

The first thing I want to address is that the default character set a .srt file uses is Windows-1252, an 8-bit extension of ASCII dating from the Windows 3.1 era. While .srt files can be saved as UTF-8 or UTF-16, both theoretically able to handle multiple scripts and reading directions, the format has no header in which to declare this: the only in-file signal is a byte-order mark (BOM) at the very start of the file. While most devices will look for such pointers and interpret them correctly, some devices can read the .srt file but cannot apply the specified charset. The metadata about the specifics of a language, its reading direction and so on, is wholly dependent on the character set in use, so there is no reason to assume that a file such as the one she made would work on any unknown device.
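To make that concrete, here is a minimal sketch of how a subtitle player might guess the encoding of a .srt file; the function name and the fallback choice are my own illustration, not any player's actual code:

```python
def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of subtitle data from a leading byte-order mark."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # UTF-8 with BOM
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"   # UTF-16, little-endian
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"   # UTF-16, big-endian
    # No BOM: the .srt format gives us nothing else to go on,
    # so fall back to the common legacy default.
    return "windows-1252"
```

Note that a UTF-8 file saved without a BOM falls straight through to the Windows-1252 fallback, which is exactly the failure mode described above.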

In simpler terms, the problem is that text displayed on a computer isn't stored as text on that computer. Everything in a computer is a string of 1s and 0s that is interpreted through some logical operations and appropriately converted into other strings of 1s and 0s, which eventually become pixels on screen that humans can read. Different encodings map letters to bits differently. The letter 'a', for example, happens to be 01100001 in ASCII, Windows-1252 and UTF-8 alike, because those encodings agree on the basic Latin range. But a letter like 'é' is the single byte 11101001 in Windows-1252, while in UTF-8 it is the two-byte sequence 11000011 10101001. So you have to remember that a computer doesn't see 'é'; it only sees those bits, and it can only interpret them according to one encoding at a time. The same two bytes that mean 'é' in UTF-8 will, read as Windows-1252, display as the two characters 'Ã©'.
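This mismatch is easy to reproduce; the snippet below is just Python's standard codecs decoding the same two bytes under two different assumptions:

```python
data = "é".encode("utf-8")       # the two bytes 0xC3 0xA9
assert data == b"\xc3\xa9"

as_utf8 = data.decode("utf-8")   # interpreted as intended
as_1252 = data.decode("cp1252")  # same bits, wrong assumption

print(as_utf8)   # é
print(as_1252)   # Ã©
```

That 'Ã©' garbage is the classic signature of UTF-8 text being read as a legacy single-byte encoding, and it is precisely the kind of scrambling the subtitles suffered from.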
At the same time, assuming the computer 'reads' the binary codes left to right, then in order to present a right-to-left script it has to process the binary in the same order as all other text, but then lay the results out on screen in the reverse order. While this doesn't necessarily pose any problems to a computer that simply follows logical instructions, it will very quickly confuse the humans programming those instructions. Indeed, we see that even high-profile companies like Coca-Cola and Google get this wrong all the time, and that is obviously not due to a lack of means. Even InDesign, the industry standard for setting type, didn't officially support such features until the introduction of CS6 in 2012.

Please consider the following example for some further demonstration of the problems at hand:

Wood / 木 / خشب

I hope that it shows up correctly, although if it doesn't then it serves as an example of why all of this can be quite painful.
This text, the word 'wood' rendered in English, Japanese and Arabic, corresponds to the following sequence of code points according to a Unicode code converter:

U+0057 U+006F U+006F U+0064 U+0020 U+002F U+0020 U+6728 U+0020 U+002F U+0020 U+062E U+0634 U+0628

Note that in the Arabic, the first code point listed, U+062E, or the letter khāʾ, is actually the rightmost glyph in the word خشب. So while {U+062E, U+0634, U+0628} is the order in which Unicode, and your computer, stores it, in order to be legible to a human reader it has to be presented on screen in the reverse order, as {U+0628, U+0634, U+062E}. Arabic characters also change in appearance depending on their proximity to other characters, so an additional level of complexity (and thus potential for error) is introduced. For example, these: خ ش ب are the glyphs of خشب when they are taken in isolation.
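You can inspect that stored (logical) order directly with Python's unicodedata module; the loop below prints the code points in the order the computer keeps them, which is the reverse of the order your eyes meet them on screen:

```python
import unicodedata

word = "خشب"  # 'wood' in Arabic, stored first-letter-first
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+062E  ARABIC LETTER KHAH
# U+0634  ARABIC LETTER SHEEN
# U+0628  ARABIC LETTER BEH
```

The reordering into visual right-to-left order, and the joining of the letters into their contextual forms, is left entirely to whatever software renders the text.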
In an attempt to mitigate some of these problems Unicode has a number of characters that can be used to indicate reading direction, and it has updated them over the decades. For example, U+202A and U+202B, the left-to-right and right-to-left embedding characters, can be used, as can the older U+200E and U+200F, the more general left-to-right and right-to-left marks. U+2066 and U+2067, the left-to-right and right-to-left isolates, were introduced in Unicode 6.3.0 in 2013, and Unicode now recommends their use for texts with multiple changes of reading direction. There also exists an 'Arabic letter mark', U+061C, which is generally advised for Arabic, but which naturally does nothing for other right-to-left scripts like Hebrew. The embedding and isolate characters also rely on matching pop (or end) characters, U+202C and U+2069 respectively. As these kinds of technical characters are rendered invisibly in most text editors, there is no practical way to check by eye whether they are actually present in the file you prepared. Even the Unicode code converter I used doesn't recognise these directional characters and simply ignores their instructions.

So these are but a few of the issues that can arise when trying to incorporate different scripts and reading directions into a single file, and I haven't yet gone into the problems of syntax errors or font compatibility, nor the fact that even if a device interprets the correct character set, it's often very difficult to ascertain that it is using a compatible, up-to-date version of that character set.

Computers are extremely limited in what they can do. Some things they can do extremely well, like making the quick calculations required to navigate a rocket into space, but other things they can only do through extremely cumbersome approximations of something that a human child can handle without any problems, like distinguishing the appropriate occasions to use katakana as opposed to hiragana.