Sunday 30 November 2014

Broken Speech

Yes, another Broken Sword post. In previous posts I wrote about my initial work to add support for the mac version of the Broken Sword game in ScummVM and some more work I did to fix graphical glitches with this version. At that point the game was working fine for me. But soon, we got a report on the forum that the speech was not working. Obviously it was working for me, I would have noticed if it wasn't. So what the heck?!

Bug reports are good. They stop me getting bored. And they show that other users have the mac version of Broken Sword and benefit from my work, which is also gratifying. So let's look at that issue. And to do so, first let's rewind to my first post. I wrote that I assumed the files with the same name (including the extension) as the files of the Windows version where in the same format, and in particular used the same endianness. And the files with a different extension were big endian in the mac version and little endian in the Windows version. That proved mostly correct (a few resources had been left as little endian data in files otherwise converted to big endian).

Except that wasn't correct. So why did it work? Because I had been lucky. When I initially worked on supporting that game it looked like my guess was correct, and I therefore made quick progress as I was not distracted by some strange behaviour. But...

Let's start the story from the beginning:

In the Windows version the speech is stored in a file named speech.clu. There are actually two such files, one on each CD, and they store the speech as 16 bits compressed mono wave data. And as you can expect the data is little endian.

In the Mac version, the files have the same name (speech.clu). So in my initial implementation I assumed the speech data was little endian in the mac version as well. And it worked... with the version I have (the French version). Obviously it didn't work with the version of the user reporting the bug (the English version) since the user reported hearing static noise instead of speech.

The two files (Windows and Mac versions, both English) have the same size:


But opening them in an hex editor shows differences:

Do the differences remind you of something?
If not go back and read again the first two posts in this Broken Sword mac support series.

Before looking at the differences, I will give a short explanation of the speech file format.
The speech files are a collection of sound resources. Each resource contains the wave data for spoken sentence and is organised as follow:
4 bytes: 'data' (i.e. hexadecimal values 64 61 74 61)
4 bytes: number of samples
n bytes: wave data (16 bits mono)

So quite simple, but maybe no as simple as you might think. If you are thinking that n is the number of samples multiplied by 2 (since each sample takes 2 bytes) you are wrong. Because the data is compressed. This is not really important for now so I will keep the description of the compression for later.

What is important here is that we can see that the first 4 bytes after 'data' are identical (in the image above hexadecimal values 8E E6 01 00 - which, since we know it is little endian, means 0x0001E68E = 124558 samples). But the values that follow are obviously a series of 2 bytes values for which the bytes have been swapped.

At this points, here is a small reminder in case you are not following me and did not go back to my previous posts: big endian and little endian are conventions used to interpret the bytes making up a data word (more at http://en.wikipedia.org/wiki/Endianness).  For example 42 in hexa is 2A, or when using two bytes 002A. When using the big endian convention, this would be stored as 00 2A. But when using the little endian convention this would be stored as 2A 00.

So this should be obvious to you now that both the Windows and the Mac version store the number of samples of the sound resource as little endian values, but in the mac version the data samples themselves are stored using big endian convention. Why mix endianness in the same file? Why do this for the mac English version but not the mac French version? Don't ask me, I have no idea.

Since some mac version use little endian and others use big endian, we need to know which one it is. Does it depend on the language, i.e. all French versions use little endian and all English versions use big endian? Maybe. But I don't trust statistics on a set of two samples. And what of the German versions?

Therefore we decided to use a heuristic to find out if the mac version the player has uses big endian data or little endian data. The heuristic works by computing the average difference between two consecutive samples (using absolute values). Using the assumption that a sound wave is smoother than picking values at random, the lower average difference is considered to be the correct endianness.

If we take the 13 samples from the example image above, assuming big endian for the mac version gives us the following curve:


The average difference value from one sample to the next is 425.17.

If we assume little endian data the curve is:

And the average difference value is 9344.75.

So in this case the heuristic tells us the data is big endian, which it is. Of course in the actual source code we use more than 13 samples to get a statistically valid heuristic value.
The original patch that adds the heuristic code can be found in the patch tracker: https://sourceforge.net/p/scummvm/patches/956/

But the story does not ends here. A new bug report very similar to the original one (i.e. speech sounds like static noise) was reported a few months ago. It was quite obvious that the heuristic did not work for that user and the wrong endianness was used. Why is that? It turns out there were several issues with the original heuristic code.

And that is where explaining how the speech data compression works will help to understand what was wrong. The data for one sound resource is broken in blocks, each one starting with a number of samples followed by the sample values. When you have consecutive samples with the same value it uses a negative size and the value is stored only once.
So for example the following sequence:

    0 0 0 0 0 1 2 3 4 5

would be stored as (where the brackets indicate the blocks):
    [
-5 0] [5 1 2 3 4 5]
All those numbers (number of samples and sample value) are stored on 2 bytes.

Here is the original source code (if you don't see the source code visit the blog as it may not be visible in RSS feeds).

I will not show the uncompressSpeech() code (yet). The only thing you need to know is that it uses the value of _bigEndianSpeech as either big endian or little endian data. So what the code above does is get the data assuming little endian data and then compute the heuristic value for the samples it gets and for the same samples to which a byte swap is applied, which should be the value we would get assuming big endian data. Right?

Wrong! This heuristic forgets something: the data is compressed, and when uncompressing it always assume little endian when reading the number of samples for each block. But if the data are big endian this number would be different and the blocks would have different sizes, and because the block boundaries would be wrong it would cause number of samples to be interpreted as sound samples and some sound samples to be interpreted as number of samples.

For example let's look at the resource of 10 sample with the compressed data -5 0 5 1 2 3 4 5.
Assuming the data is stored in big endian, in hexadecimal values with two bytes per value this give us: 80 05 00 00 00 05 00 01 00 02 00 03 00 04 00 05
If we read this with the heuristic above, because the number of sample is always read assuming little endian data we get 80 05 = 1408 samples instead of -5 for the first block. So the code will get the following samples: 0 5 1 2 3 4 5 followed by 4 garbage values read beyond the end of the resource instead of getting 0 0 0 0 0 1 2 3 4 5. So if the data are stored using big endian, the heuristic values gets biased. Almost always it still gets a lower score as reading it as little endian though.

The solution here is to call uncompressSpeech() twice, once assuming little endian and once assuming big endian.

But there is still an issue with this heuristic.
When reading with the wrong endianness, since we may read the wrong length, it may for example be a big negative number. Because we are using a relatively small finite number of samples, statistically we could end up with a small heuristic value because it has a lot of consecutive samples with the same value. For this reason I made an additional change to skip consecutive samples with the same value when computing the average difference.
After that commit the value for the heuristic with the wrong endianness is consistently about 21000, i.e. 1/3rd of 16 bits integer range (65 535 / 3 = 21845). As noted by wjp:  the average absolute difference between two random numbers drawn independently from a uniform distribution between 0 and N is indeed N/3. So this is quite reassuring.

So everything was correct after this change? No, that would be too easy. We want complex puzzles spanning several rooms, not some kind of hidden objects game. And the user reporting the bugs confirmed the bug was still present after that change. So let's look at a different room, or rather a different function.

Here is the code from uncompressSpeech():
Note something relevant? No? Look closer. You see it now? Yes, this function always give us the sound data in little endian format, whatever the endianness in which it is stored, and more importantly whatever the endianness of the computer on which the code is run.

And if you look at the heuristic code above, it assumes it gets data in the native endianness (i.e. the endianness of the computer on which the code is run). So when running on a computer using big endian convention the heuristic was wrong. Let's look again at our previous example and how it was interpret on a big endian computer:
   Read with the correct endianess: 0 1 2 3 4 5 (duplicate values have been removed)
   Was interpreted as: 0 256 512 768 1024 1280

   Read with the incorrect endianess: 0 1280 256 512 768 1024 1280
   Was interpreted as 0 5 1 2 3 4

So in the heuristic code, on a big endian computer we need to swap the bytes of the two set of data to get the correct value. This brings us to the final code, in which the heuristic computation was also moved to a separate function to avoid code duplication (since it is done twice):

The user reporting the bug confirmed he was using ScummVM on a big endian computer (a G4 mac) and that the speech was correct after that final change.

No comments:

Post a Comment