What the GCC?

Happy New Year! With a new year comes new revelations and wow do I have a doozy – ARM GCC is just completely broken. I'm not saying it doesn't produce working code but there are cases where it produces obscenely poor performing code.

So the following is a pedagogical example of a direct-threaded interpreter.

#define NEXT goto **ip++

int main(void) {
  static void  *prog[] = {&&next1,&&next2,&&next1,&&next3,&&next1,&&next4,&&next1,&&next5,&&next1,&&loop};
  void **ip=prog;
  int    count = 100000000;
  NEXT;
 next1: NEXT;
 next2: NEXT;
 next3: NEXT;
 next4: NEXT;
 next5: NEXT;
 loop:
  if (count>0) {
    count--;
    ip=prog;
    NEXT;
  }
  exit(0);
}

When compiled with -Os, this produces abysmal code. What you're seeing here is the code for each next# statement which is MOSTLY the NEXT statement. This is loading the address to jump to into register R2, branching to .L2 which then copies R2 into the program counter PC. Okay.

.L2
        mov     pc, r2    @ indirect register jump
.Lnext1:
        ldr     r2, [r3], #4
        b       .L2
.Lnext2:
        ldr     r2, [r3], #4
        b       .L2
        ...

Well, if we compile it with -O3 which should produce faster code and shouldn't produce smaller code, we instead see this:

.Lnext1:
        ldr     r2, [r3], #4
        mov     pc, r2    @ indirect register jump
.Lnext2:
        ldr     r2, [r3], #4
        mov     pc, r2    @ indirect register jump
        ...

See that? We lost a completely useless branch and the code is smaller. But we're not done. While there is NO compiler option that will improve this further, this is what we SHOULD see with either -Os or -O3:

.Lnext1:
        ldr     pc, [r3], #4   @ indirect register jump
.Lnext2:
        ldr     pc, [r3], #4   @ indirect register jump
        ...

That's one instruction per NEXT. Ouch. Not only is this now one instruction instead of three, not only would this save unnecessary abuse on the branch predictor, but we also have avoided using a temporarty register: R2. GCC likes to make code-stubs in -Os even when said stubs are only one instruction and this is very bad, but the other problem is that GCC also doesn't understand that on ARM, the program counter is a general purpose register, and with few exceptions can be treated like any other and that means using LDR directly into PC.

Of course the fix here would be to define NEXT with assembly. But this isn't portable anymore and guess what – it doesn't even work because GCC doesn't see this as a "jump" and will optimize out all the computed goto labels. To date, I've not found a reasonable workaround, portable or not.

#define NEXT asm __volatile("ldr\tpc, [%0], #4" : "+r"(ip))

It's with some bit of irony though, that as much as GCC sometimes likes to pointlessly create these little vestigial code stubs, there's also times when it will create a lot of EXCESS code, even with -Os. Here's PJIT's cache routine.

uint32_t* cache_find_entry(uint32_t m68k_addr) {
    uint32_t tag = ((m68k_addr >> 1) >> (BLOCKLEN + INDEXLEN));
    uint32_t idx = ((m68k_addr >> 1) >> BLOCKLEN) & ((1 << INDEXLEN) - 1);
    uint32_t off = ((m68k_addr >> 1) & ((1 << BLOCKLEN) - 1));

    if ((*cache_tags)[idx] != tag) { // MISS!
        __cache_clear(&(*cache_data)[idx][0], &(*cache_data)[idx][(1 << BLOCKLEN) - 2]);
        (*cache_tags)[idx] = tag;
    }
    return &(*cache_data)[idx][off];
}

Simple enough, right? Well, GCC in it's infinte wisdom basically rolls that return statement into either code path and assembles them both. That is, the end of the function looks like this:

        ...
        add     ip, ip, r6
        strh    r5, [ip, #-255] @ movhi
        add     r0, r3, r1, lsl #7
        lsl     r0, r0, #2
        add     r0, r0, #-1627389952
        add     r0, r0, #14680064
        pop     {r4, r5, r6, r7, pc}
.L12:
        add     r0, r3, r1, lsl #7
        lsl     r0, r0, #2
        add     r0, r0, #-1627389952
        add     r0, r0, #14680064
        bx      lr

Like seriously? This should be four instructions shorter as there is NO performance gain from poping five registers versus popping four then performing BX LR.

        ...
        add     ip, ip, r6
        strh    r5, [ip, #-255] @ movhi
        pop     {r4, r5, r6, r7}
.L12:
        add     r0, r3, r1, lsl #7
        lsl     r0, r0, #2
        add     r0, r0, #-1627389952
        add     r0, r0, #14680064
        bx      lr

My best guess is that the optimizer that folds up the stack frame occurs before the subexpression eliminator, so because the end is POP in one and BX LR in another, it cannot merge these into common code blocks. I've found no way of restructuring code to avoid this.

This isn't a unique special case, these problems are endemic to the ARM GCC compiler.

On a happier note, if you're compiling in -Os and want a big code-shrink, add the option -ftree-vectorize. This will use the NEON registers for short memory copies and drastically reduces code size – for example, the compiler can use ldrd and strd to load/store 64-bit chunks. Honestly, if you specify VFP as a compiler option and -Os, it should be enabled by default.

Too bad -Os doesn't ACTUALLY produce the smallest code.

Written on January 9, 2023