Technical Details of the SSE2 on Transmeta Crusoe Issue.

Warning

This page contains technical details and assumes a solid understanding of Intel CPUs (modes, opcodes, instructions etc.), Windows and Microsoft compiler internals. I am therefore not going into much detail explaining Intel CPUs, Windows or the way the Microsoft runtime library works. However, I do provide references to the relevant materials wherever possible. All of the references I have used are gathered together in the References section at the end of this page.

MMX, SSE and SSE2

The first SIMD (Single Instruction, Multiple Data) instruction set from Intel was MMX (introduced with later editions of Pentium I). The opcodes of the MMX instruction set are escaped with an opcode prefix (the 0Fh byte) in order to distinguish them from the opcodes of the usual CPU instructions. Then, with the Pentium III, MMX was updated and extended with bigger registers and new instructions. This new instruction set was called SSE. It was then further upgraded to SSE2 and now even SSE3. Each new SSE extension has added new instructions, and the problem of distinguishing the opcodes was solved by requiring new opcode prefixes for each new extended instruction set. More about prefixes and the various opcodes can be found in [1] or [2].

To provide a big enough code space for the additional opcodes, most of the SSE2 instructions are prefixed with a 66h, F2h or F3h byte. Because these instructions are extensions of the existing MMX ones, an MMX instruction opcode prefixed by one of those bytes becomes the corresponding SSE2 instruction. For example:

The two-byte opcode 0Fh 6Fh represents the MMX MOVQ instruction. The two corresponding SSE2 instructions are MOVDQA and MOVDQU, and their opcodes differ from the MMX MOVQ opcode only in the prefix:
            MOVDQA - 66h 0Fh 6Fh
            MOVDQU - F3h 0Fh 6Fh
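
The relationship between the three encodings above can be sketched in a few lines of C++. This is only an illustration: the function name classify_0f6f is made up for this page, and it looks at nothing beyond the prefix and the two opcode bytes (a real decoder would also have to handle operands and other prefixes).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: name an encoding of the 0Fh 6Fh opcode by its prefix.
// Only the three encodings listed above are covered; anything else is "unknown".
std::string classify_0f6f(const std::uint8_t* bytes, std::size_t len) {
    // No prefix: plain MMX MOVQ.
    if (len >= 2 && bytes[0] == 0x0F && bytes[1] == 0x6F) {
        return "MOVQ";
    }
    // One prefix byte followed by the same two opcode bytes.
    if (len >= 3 && bytes[1] == 0x0F && bytes[2] == 0x6F) {
        if (bytes[0] == 0x66) return "MOVDQA";
        if (bytes[0] == 0xF3) return "MOVDQU";
    }
    return "unknown";
}
```

The point of the sketch is that the last two bytes are identical in all three cases, so a decoder that throws the prefix away can no longer tell the SSE2 forms from the MMX one.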
        

Transmeta Crusoe CPU specifics

The Transmeta Crusoe CPU implements the MMX instruction set only; it does not recognise the extended SSE and SSE2 instruction sets. However, what matters here, and what causes all the problems, is the way in which it fails to recognise SSE/SSE2. When the Crusoe encounters an instruction whose opcode is prefixed by a 66h, F2h or F3h byte, it simply ignores the prefix and treats the rest of the opcode as the instruction to be executed. The reason for this behaviour is that those prefixes are valid and are used with non-MMX and non-SSE instructions as well (although with a different meaning). The same behaviour is also attributed to older Pentium II and Cyrix CPUs (I have seen information about that on the web).

Externally, the result is that the Crusoe implicitly executes a subset of the SSE2 instruction set: all SSE2 instructions that have corresponding MMX instructions. It simply treats those SSE2 instructions as the corresponding MMX instructions. However, for those SSE2 instructions that do not have corresponding MMX instructions (and hence cannot be recognised as valid instructions), the Crusoe raises an "Illegal Opcode" exception. This is essentially where part of the problem lies.

The SSE2 instruction that causes the problem is PSHUFD (66h 0Fh 70h). With the 66h prefix removed, the Crusoe in reality sees it as PSHUFW (0Fh 70h). In this case, however, PSHUFW is itself an SSE instruction and there is no corresponding MMX instruction, so the opcode is rejected.
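
The behaviour described in this section can be modelled roughly as follows. This is a toy model built only from the description above, not Transmeta's actual decoding logic; the names Outcome, is_known_mmx_opcode and crusoe_model are invented for the illustration, and the MMX table is deliberately limited to a handful of opcodes.

```cpp
#include <cstddef>
#include <cstdint>

enum class Outcome { RunsAsMmx, IllegalOpcode, NotModelled };

// Deliberately tiny table of second opcode bytes (after the 0Fh escape)
// that are genuine MMX instructions. A real table would be much longer.
static bool is_known_mmx_opcode(std::uint8_t op) {
    switch (op) {
        case 0x6F:  // MOVQ mm, mm/m64
        case 0x7F:  // MOVQ mm/m64, mm
        case 0xEF:  // PXOR
        case 0xFC:  // PADDB
            return true;
        default:
            return false;
    }
}

// Toy model: drop a 66h/F2h/F3h prefix, then look at what is left.
Outcome crusoe_model(const std::uint8_t* code, std::size_t len) {
    std::size_t i = 0;
    // Step 1: the prefix is silently ignored.
    if (i < len && (code[i] == 0x66 || code[i] == 0xF2 || code[i] == 0xF3)) {
        ++i;
    }
    // Step 2: only two-byte opcodes starting with the 0Fh escape are modelled.
    if (i + 1 >= len || code[i] != 0x0F) {
        return Outcome::NotModelled;
    }
    const std::uint8_t op = code[i + 1];
    // Step 3: 0Fh 70h is PSHUFW, an SSE instruction with no MMX counterpart.
    if (op == 0x70) return Outcome::IllegalOpcode;
    if (is_known_mmx_opcode(op)) return Outcome::RunsAsMmx;
    return Outcome::NotModelled;
}
```

Feeding the model the MOVDQA bytes (66h 0Fh 6Fh) gives RunsAsMmx, while the PSHUFD bytes (66h 0Fh 70h) give IllegalOpcode, which matches the two cases discussed above.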

Code that causes the problem

I will now take a closer look at how the problem arose. I will not refer to Adobe Photoshop and related products, although this is where I initially found the problem (more specifically, in the free Adobe DNG Converter).

When I was investigating the problem, I had an application that was SSE2 optimised. I was told that it was possible to turn off the SSE/SSE2 optimisation via the application settings. This, however, did not work on the Transmeta Crusoe CPU. After some low-level debugging (thanks to Microsoft for the excellent WinDbg), I found out that some of the SSE2 code was being executed even before the application entry point, which could be main(), DllMain(), WinMain() or others depending on the type of the executable (from here on I will call it Main() for simplicity's sake).

After some digging on the net I found the article [4], which explains some internals of the Microsoft C/C++ compilers, the runtime library and runtime initialisation. As it turns out, any C++ code that is used for static initialisation (such as constructors of static or global class objects) is called before the Main() function. This is done by some clever tricks in the Microsoft runtime library and is extensively described in the above article. This was the cause of the problem in the above-mentioned application: although the SSE2 optimisations could be turned off, there was still some SSE2 code that was called before the main program could read its settings and set itself up appropriately.

In cases like this one has to be really careful, especially when SSE/SSE2 optimisations can be turned on and off. This is mainly because some initialisation code will still be called prior to Main(), and the application has to account for this code being SSE2 optimised or not.
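
The ordering problem can be reproduced with a few lines of standard C++. This is a minimal sketch under my own assumptions, not code from the affected application: the names g_use_sse2, g_table_built_with_sse2, LookupTable and sse2_path_ran_before_settings are all hypothetical, and a plain boolean stands in for the real SSE2 code path.

```cpp
// Hypothetical application setting. It is constant-initialised, so it already
// holds its compiled-in default before any constructor runs.
static bool g_use_sse2 = true;

// Records which code path the static initialiser took.
static bool g_table_built_with_sse2 = false;

struct LookupTable {
    LookupTable() {
        // Runs during C++ static initialisation, i.e. before Main().
        // The setting has not been loaded yet, so the default is used.
        g_table_built_with_sse2 = g_use_sse2;
    }
};

// The runtime startup code invokes this constructor before Main() starts.
static LookupTable g_table;

// Intended to be called as the first statement of Main(), with the value
// read from the (hypothetical) application settings.
bool sse2_path_ran_before_settings(bool user_wants_sse2) {
    g_use_sse2 = user_wants_sse2;      // the setting arrives only now
    return g_table_built_with_sse2;    // but the table was built earlier
}
```

Calling sse2_path_ran_before_settings(false) from Main() still returns true: the choice was already made during startup, before the program had any chance to apply the user's setting. One way around this is to defer such initialisation until after the settings have been read.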

Note: I am not entirely sure whether the problem in Adobe DNG Converter and others is caused by Adobe's code or by code from some of the libraries they use (maybe even the Microsoft C++ runtime library). The failing code in Adobe DNG Converter initialises static memory areas with specific patterns (like 0808080808h). This, however, may perfectly well come from any other library that DNG Converter uses and is linked to, for example a math library (for quick sin() and cos() implementations, for instance). I also hope that all this information will be of some help to Adobe developers (if they need it, of course).

Implementation details

The fix is implemented as an alternative handler for the "Illegal Opcode" CPU exception. The handler checks whether the offending instruction is the SSE2 PSHUFD and, if so, emulates it and returns control to the main application. If the offending instruction is anything else, control is passed to the system exception handler (the previous handler of the "Illegal Opcode" CPU exception). The base for this driver was taken from [3]. More technical details about CPU traps and exception handling can be found in [1].
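
The core of such an emulation can be sketched as below. This is not the driver's code, only an illustration of the two steps described above: recognising the 66h 0Fh 70h byte sequence and performing the dword shuffle that PSHUFD defines. The names Xmm, is_pshufd and emulate_pshufd are made up here, and everything a real handler also has to do (decoding the operand bytes, locating the register contents, advancing the instruction pointer) is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One 128-bit value viewed as four 32-bit dwords; element 0 is the lowest dword.
using Xmm = std::array<std::uint32_t, 4>;

// Step 1: does the faulting instruction start with the PSHUFD opcode bytes?
bool is_pshufd(const std::uint8_t* code, std::size_t len) {
    return len >= 3 && code[0] == 0x66 && code[1] == 0x0F && code[2] == 0x70;
}

// Step 2: PSHUFD semantics. Destination dword i receives the source dword
// selected by bits [2i+1 : 2i] of the 8-bit immediate.
Xmm emulate_pshufd(const Xmm& src, std::uint8_t imm8) {
    Xmm dst{};
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t selector = (imm8 >> (2 * i)) & 0x3;
        dst[i] = src[selector];
    }
    return dst;
}
```

For example, with a source of {10, 20, 30, 40}, an immediate of 1Bh reverses the dwords to {40, 30, 20, 10}, an immediate of 00h broadcasts the lowest dword to all four positions, and E4h leaves the order unchanged.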

References

[1] Intel's Pentium 4 Development Manuals - these cover every aspect of the IA-32 architecture in great detail
[2] The sandpile.org website - provides invaluable references to the IA-32 architecture in a short form
[3] The Code Project's article about interrupt handling and implementing interrupt hooks
[4] The Code Guru's article about the internals of the Microsoft runtime library

