Comments on: A ping from threaders' prison
Just sending out a ping that I am here... but just that...
I'm being held captive in threaders' prison.
You may know what that means. If you don't, here's an example:
Earlier this week, quite by chance during the coding of a port handler, I noticed the single simple line of C code that pushes a value on the stack:
generated this machine code:
004057B5 8B 55 FC mov edx,dword ptr [ebp-4]
004057B8 A1 C4 24 46 00 mov eax,[__tls_index (004624c4)]
004057BD 64 8B 0D 2C 00 00 00 mov ecx,dword ptr fs:[2Ch]
004057C4 8B 04 81 mov eax,dword ptr [ecx+eax*4]
004057C7 8B 0D C4 24 46 00 mov ecx,dword ptr [__tls_index (004624c4)]
004057CD 64 8B 35 2C 00 00 00 mov esi,dword ptr fs:[2Ch]
004057D4 8B 0C 8E mov ecx,dword ptr [esi+ecx*4]
004057D7 8B 89 34 00 00 00 mov ecx,dword ptr [ecx+34h]
004057DD 83 C1 01 add ecx,1
004057E0 8B 35 C4 24 46 00 mov esi,dword ptr [__tls_index (004624c4)]
004057E6 64 8B 3D 2C 00 00 00 mov edi,dword ptr fs:[2Ch]
004057ED 8B 34 B7 mov esi,dword ptr [edi+esi*4]
004057F0 89 8E 34 00 00 00 mov dword ptr [esi+34h],ecx
004057F6 8B 0D C4 24 46 00 mov ecx,dword ptr [__tls_index (004624c4)]
004057FC 64 8B 35 2C 00 00 00 mov esi,dword ptr fs:[2Ch]
00405803 8B 0C 8E mov ecx,dword ptr [esi+ecx*4]
00405806 8B 89 34 00 00 00 mov ecx,dword ptr [ecx+34h]
0040580C C1 E1 04 shl ecx,4
0040580F 8B 80 30 00 00 00 mov eax,dword ptr [eax+30h]
00405815 03 C1 add eax,ecx
00405817 8B 0A mov ecx,dword ptr [edx]
00405819 89 08 mov dword ptr [eax],ecx
0040581B 8B 4A 04 mov ecx,dword ptr [edx+4]
0040581E 89 48 04 mov dword ptr [eax+4],ecx
00405821 8B 4A 08 mov ecx,dword ptr [edx+8]
00405824 89 48 08 mov dword ptr [eax+8],ecx
00405827 8B 52 0C mov edx,dword ptr [edx+0Ch]
0040582A 89 50 0C mov dword ptr [eax+0Ch],edx
Even though this is non-optimized, in a perfect world on a perfect CPU, that should be about 4 or 5 instructions.
It sure got me rethinking the usage of TLS variables, at least on x86 Win32 implementations. I decided not to be held captive by the compiler to any degree (on any OS model) and to recode large parts of the VM and natives to avoid TLS references (caching them SP-relative instead).
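One way to read "caching them SP relative" is to take the address of the thread-local state once per function and keep it in a local variable, so the TLS lookup happens a single time instead of on every reference. The sketch below only illustrates that idea; `VmState`, `vm_tls`, `push3_tls` and `push3_cached` are hypothetical names, not taken from the actual VM, and it assumes a compiler that accepts `__thread` or `__declspec(thread)`.

```c
#include <assert.h>

/* Portable spelling of "thread-local"; all names below are hypothetical. */
#if defined(_MSC_VER)
#  define THREAD_LOCAL __declspec(thread)
#else
#  define THREAD_LOCAL __thread
#endif

#define DS_LEN 64

typedef struct {
    int stack[DS_LEN];
    int top;            /* index of the next free slot */
} VmState;

static THREAD_LOCAL VmState vm_tls;

/* Naive version: every access names the TLS variable directly, so an
   unoptimized build may redo the TLS lookup for each reference. */
static void push3_tls(int a, int b, int c)
{
    assert(vm_tls.top + 3 <= DS_LEN);
    vm_tls.stack[vm_tls.top++] = a;
    vm_tls.stack[vm_tls.top++] = b;
    vm_tls.stack[vm_tls.top++] = c;
}

/* Cached version: take the address once; later accesses go through a
   local pointer that lives in a register or in the stack frame. */
static void push3_cached(int a, int b, int c)
{
    VmState *vm = &vm_tls;
    assert(vm->top + 3 <= DS_LEN);
    vm->stack[vm->top++] = a;
    vm->stack[vm->top++] = b;
    vm->stack[vm->top++] = c;
}
```

Both functions leave the stack in the same state; the difference, if any, only shows up in the generated code, so it is worth checking the listing for the compiler and flags actually in use.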
I really didn't think I'd need to be doing this in the year 2007.
A human-based global flow analysis!? Makes me homesick for the old A5 CPU register, you know what I mean? Or a CPU with a thread base register, or I'd even take a thread-local remap on a VM base page for TLS globals. Or just maybe... cool stuff like that happens when -O2 is enabled? Please say "yes".
Carl, I don't know how DS_PUSH is implemented or which compiler you use, but I expect something like this:
static void ds_push( int d )
{
    if( data_stack_ptr >= DS_LEN )
        return; /* stack full; real code would report an error here */
    data_stack[data_stack_ptr++] = d;
}
Have you tried a different compiler? There are tremendous differences in how code is generated. The Intel compilers are very good for the x86 architecture (of course).
The Digital Mars C compiler has an option to log runtime execution information with a profiler. Its documentation also has a good overview of what can be done to help the compiler: http://www.digitalmars.com/ctg/ctgOptimizer.html
Another approach could be to take a look at: http://en.wikipedia.org/wiki/High_Level_Assembly
My experience is that looking at the generated ASM code for the most heavily used functions and then rewriting them in a more assembler-like C style helps the compiler generate better code.
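To make "C-assemblish" a bit more concrete, here is a small sketch of what such a rewrite might look like: load everything into locals up front, then walk a pointer with one simple step per statement. The function names and the summing task are invented for illustration, and whether the second form actually wins depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

#define DS_LEN 256

/* Hypothetical global state that a hot function might touch. */
static int data_stack[DS_LEN];
static size_t data_stack_ptr;

/* Straightforward version: indexes the global array and compares
   against the global counter on every iteration. */
static long sum_stack_plain(void)
{
    long sum = 0;
    size_t i;
    assert(data_stack_ptr <= DS_LEN);
    for (i = 0; i < data_stack_ptr; i++)
        sum += data_stack[i];
    return sum;
}

/* "Assembler-ish" version: read the globals once into locals, then
   advance a pointer until it reaches a precomputed end address. */
static long sum_stack_asmish(void)
{
    const int *p = data_stack;
    const int *end;
    long sum = 0;
    assert(data_stack_ptr <= DS_LEN);
    end = p + data_stack_ptr;
    while (p != end) {
        sum += *p;
        p++;
    }
    return sum;
}
```

The two functions return the same result; the point of the second form is only that each statement maps onto roughly one machine operation, which leaves the compiler less to figure out.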
And if you start measuring cache-line misses etc., it gets even harder to optimize.
Still, in 2007, your brain is much better at this than any compiler I know... and I bet it will stay that way for a very, very long time.
Let's assume something like the following:
# define __thread __declspec(thread)
__thread int* ds_base;
__thread int ds_top;
#define DS_PUSH(x) ds_base[++ds_top] = x
Then, compiling a simple DS_PUSH(10) with gcc -O2 results in:
movl %gs:0, %eax
movl ds_top@NTPOFF(%eax), %edx
movl %edx, ds_top@NTPOFF(%eax)
movl ds_base@NTPOFF(%eax), %eax
movl $10, (%eax,%edx,4)
Compiling with gcc without -O2 generates 7 instructions.
Using cl /O2 results in:
mov ecx, DWORD PTR fs:__tls_array
mov eax, DWORD PTR __tls_index
mov eax, DWORD PTR [ecx+eax*4]
add DWORD PTR _ds_top[eax], 1
mov ecx, DWORD PTR _ds_top[eax]
mov edx, DWORD PTR _ds_base[eax]
mov DWORD PTR [edx+ecx*4], 10
Compiling with cl without /O2 generates 17 instructions. So obviously /O2 makes things a lot better, here.
 gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) on linux/x86
 Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86 on win32/x86
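For anyone who wants to repeat this comparison with another compiler, a complete translation unit along these lines should do. This is a sketch built around the declarations above; the `#if` guard, the `storage` array and the `test_push` function are additions made up for the example, and the single shared `storage` array is only suitable for a one-thread test.

```c
#include <assert.h>

/* On MSVC, map gcc's __thread keyword to the equivalent declspec. */
#if defined(_MSC_VER)
#  define __thread __declspec(thread)
#endif

__thread int *ds_base;
__thread int ds_top;

/* Same macro as above, parenthesized so it is safe inside expressions. */
#define DS_PUSH(x) (ds_base[++ds_top] = (x))

/* Hypothetical backing store; a real VM would allocate one per thread. */
static int storage[16];

/* The function whose generated code is worth inspecting. */
void test_push(void)
{
    ds_base = storage;
    ds_top = 0;
    DS_PUSH(10);
}
```

Compiling with `gcc -O2 -S` or `cl /O2 /FA /c` writes an assembly listing for `test_push` that can be compared against the output quoted above.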
Thanks for the comments. Yes... we'll be sure to use a variety of compilers for the release stages, and do some tests to pick the best results.
PS: As an OS and language person, I just like complaining about compilers, etc.