To provide a large enough code space for the additional opcodes, most SSE2 instructions are prefixed with a 66h, F2h or F3h byte. Because the instructions were designed as extensions to the already existing MMX set, an MMX instruction opcode preceded by one of those prefixes becomes the corresponding SSE2 instruction. For example:
The two-byte opcode 0Fh 6Fh represents the MMX MOVQ instruction. The two corresponding SSE2 instructions are MOVDQA and MOVDQU, and their opcodes differ from the MMX MOVQ instruction only in the prefix:
MOVDQA - 66h 0Fh 6Fh
MOVDQU - F3h 0Fh 6Fh
When the Transmeta Crusoe CPU encounters an instruction with a 66h, F2h or F3h byte prefix, it simply ignores the prefix and treats the rest of the opcode as the instruction to be executed. The reason for that behaviour is that those prefixes are valid and are also used with non-MMX and non-SSE instructions (although with a different meaning). In fact, the same behaviour is also attributed to older Pentium II and Cyrix CPUs (I have seen information about that on the web).
Externally, the result is that the Transmeta Crusoe CPU implicitly executes a subset of the SSE2 instruction set: every SSE2 instruction that has a corresponding MMX instruction. The Crusoe simply treats those SSE2 instructions as the corresponding MMX instructions. However, for SSE2 instructions that have no corresponding MMX instruction (and hence cannot be recognised as valid instructions), the Crusoe CPU raises an "Illegal Operand" exception. This is essentially where part of the problem lies.
The SSE2 instruction that causes the problem is PSHUFD. With the 66h prefix removed, the Crusoe CPU in reality sees it as the PSHUFW instruction. In this case, however, PSHUFW is really an SSE instruction and there is no corresponding MMX instruction.
When I was investigating the problem, I had an application that was SSE2 optimised. I was told that SSE/SSE2 optimisation could be turned off via the application settings. Nevertheless, this didn't work on the Transmeta Crusoe CPU. After some low-level debugging (thanks to Microsoft for the excellent WinDbg), I found out that some of the SSE2 code gets executed even before the application entry point, which could be main(), DllMain(), WinMain() or another function depending on the type of your executable (further on I will call it Main() for simplicity's sake).
After some digging on the net I found the article [4], which explains some internals of the Microsoft C/C++ compilers, runtime library and runtime initialisation. As it turns out, any C++ code used for static initialisation (such as constructors of static or global objects) is called before the Main() function. This is done by some clever tricks in the Microsoft runtime library and is described extensively in the above article. This was the cause of the problem in the above-mentioned application: although SSE2 optimisations could be turned off, there was still some SSE2 code that was called before the main program could read its settings and set itself up appropriately.
In cases like this, one has to be really careful, especially when SSE/SSE2 optimisations can be turned on and off. This is mainly because some initialisation code will still be called prior to Main(), and the application has to account for this code being SSE2 optimised or not.
Note: I am not entirely sure whether the problem in Adobe DNG Converter and others is caused by Adobe's code or by code from one of the libraries they use (maybe even the Microsoft C++ runtime library). The failing code in Adobe DNG Converter initialises static memory areas with specific patterns (like 0808080808h). This, however, may perfectly well come from any other library that DNG Converter is linked to, for example a math library (for quick sin() and cos() implementations, for instance). I also hope that all this information will be of some help to Adobe developers (if they need it, of course).
[1] Intel's Pentium 4 Development Manuals - these cover every aspect of the IA-32 architecture in great detail
[2] The sandpile.org website - provides invaluable references to the IA-32 architecture in a short form
[3] The Code Project's article about interrupt handling and implementing interrupt hooks
[4] The Code Guru's article about the internals of the Microsoft runtime library