mystrlen vs android-bionic’s strlen on ARM CPU

After my last post I got a request from @trufae asking for a benchmark against Android's bionic library, so here it goes. I started by compiling the available bionic strlen and running the same test as in my previous post.

The first thing I noticed reading bionic's strlen is that it uses a better way to detect the null character, which saves one instruction over my straightforward byte-by-byte check and, of course, should lead to a performance improvement. They also unrolled the loop a bit further, up to 4 ldr instructions per iteration.

# bionic's zero-byte check, as GCC inline assembly
sub     %[t], %[v], %[mask], lsr #7
and     %[t], %[t], %[mask]
bics    %[t], %[t], %[v]

versus mystrlen's original per-byte check:

andnes	v4, v3, v1, LSR #8
andnes	v4, v3, v1, LSR #16
andnes	v4, v3, v1, LSR #24
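
For reference, here is roughly what the three bionic instructions compute. The register names follow the mystrlen convention used later in this post (v1 is the loaded word, v6 holds the 0x80808080 mask); the per-byte commentary is mine:

sub     v5, v1, v6, LSR #7      @ v5 = v1 - 0x01010101; bit 7 of every byte
                                @ of v1 that was zero is now set
and     v5, v5, v6              @ keep only those per-byte top bits
bics    v5, v5, v1              @ drop bytes whose top bit was already set;
                                @ the flags end up NE iff v1 contains a zero byte

This single test replaces the per-byte tests of the original version (four of them, one per byte), which is where the one-instruction saving comes from.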

From now on I will only show the fitted lines to reduce noise due to scheduling and other OS-related delays.

As can be seen, bionic's version outperforms mystrlen by a larger margin than mystrlen outperforms the uClibc version. To test how much each bionic improvement affects performance, I first imported the 3-instruction zero-byte check into mystrlen, which didn't lead to a noticeable performance improvement.

So the key must be in the unrolling. I unrolled mystrlen but, instead of using 4 ldr instructions, I used a block load to reduce the number of instructions (see the sketch below). The results are shown in the next figure:
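
Roughly, the two ways of fetching a 16-byte chunk compare like this (register names follow mystrlen's convention; this is an illustration, not bionic's actual code):

@ four separate loads, one instruction per word
ldr	v1, [v7], #4
ldr	v2, [v7], #4
ldr	v3, [v7], #4
ldr	v4, [v7], #4

@ versus a single block load of the same four words
ldmia	v7!, {v1, v2, v3, v4}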

Surprisingly, mystrlen outperforms bionic's strlen! Here are the numbers for the y = mx + c fits:

  • uClibc: m = 0.014, c = -14.7
  • bionic: m = 0.012, c = -12.4
  • mystrlen: m = 0.010, c = -11.5

Now mystrlen is around 14% faster than bionic’s version and 29% faster than uClibc’s version.

Why? This is a little bit weird, so I had a look at the binary gcc produced for bionic's version and found that the code wasn't even close to the original inline assembly: gcc had simply un-unrolled the code. Even with better optimization flags the generated code wasn't any better:

    86e4: ldr     r6, [r3], #4
    86e8: sub     r4, r4, r3
    86ec: sub     r5, r6, r2, lsr #7
    86f0: and     r5, r5, r2
    86f4: bics    r5, r5, r6
    86f8: ldreq   r6, [r3], #4
    86fc: beq     86ec <bionicstrlen+0x94>

Conclusion
With the new null-checking algorithm and the 4-way unrolled loop, mystrlen is 29% faster than the uClibc version, which now makes a real difference. The processor I have been running the tests on is an ARM926EJ-S, which doesn't even have the PLD (preload data) instruction; I bet that would improve the performance of the algorithm even further.
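
On a core whose memory system actually acts on PLD, the hint could be dropped into the unrolled loop roughly like this; this is an untested sketch and the 64-byte prefetch distance is just a guess that would need tuning:

0:	pld	[v7, #64]		@ ask the memory system to start fetching
					@ data a few blocks ahead
	ldmfd	v7!, {v1, v2, v3, v4}
	@ ... zero-byte checks as in the listing below ...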

Here is the latest version of mystrlen:

.globl mystrlen
mystrlen:
	stmfd	sp!, {v1, v2, v3, v4, v5, v6, v7, lr}	@ save callee-saved registers

	mov	v7, a1			@ v7 walks the string, a1 keeps the start
	ldr	v6, =0x80808080		@ per-byte top-bit mask
 
	##
	## Walk the un-aligned bytes one at a time until
	## the pointer is word aligned
	##
1:	tst	v7, #3			@ aligned yet?
	beq	0f
	ldrb	v1, [v7], #1
	tst	v1, #0xFF
	bne	1b			@ non-zero byte: keep scanning
	sub	v7, v7, #1		@ hit the NUL: undo the post-increment
	bal	5f			@ and go compute the length
 
 
	## Unrolled main loop: fetch four words per iteration
	## with a single block load, then check each word for
	## a zero byte using as few instructions as possible
0:	ldmfd	v7!, {v1, v2, v3, v4}	@ fetch four words at once
	sub	v5, v1, v6, LSR #7	@ zero byte somewhere in v1?
	and	v5, v5, v6
	bics	v5, v5, v1
	bne	1f
	sub	v5, v2, v6, LSR #7	@ ... in v2?
	and	v5, v5, v6
	bics	v5, v5, v2
	bne	2f
	sub	v5, v3, v6, LSR #7	@ ... in v3?
	and	v5, v5, v6
	bics	v5, v5, v3
	bne	3f
	sub	v5, v4, v6, LSR #7	@ ... in v4?
	and	v5, v5, v6
	bics	v5, v5, v4
	bne	4f
	beq	0b			@ no terminator yet, next block
 
	## Dispatch: leave the word containing the NUL in v1
	## and make v7 point 4 bytes past that word's start
4:	mov	v1, v4
	bal	0f
3:	mov	v1, v3
	sub	v7, v7, #4
	bal	0f
2:	mov	v1, v2
	sub	v7, v7, #8
	bal	0f
1:	sub	v7, v7, #12
	## After the loop, back up from the end of the word
	## to the NUL byte and compute the length
	##
0:	sub	v7, v7, #4		@ v7 pointed 4 bytes past the word's start
	tst	v1, #0xFF		@ is byte 0 (lowest address) the NUL?
	addne	v7, v7, #1		@ no: step over it
	tstne	v1, #0xFF00		@ byte 1?
	addne	v7, v7, #1
	tstne	v1, #0xFF0000		@ byte 2?
	addne	v7, v7, #1		@ otherwise the NUL is byte 3

5:	sub	a1, v7, a1		@ length = address of the NUL - start
	ldmfd	sp!, {v1, v2, v3, v4, v5, v6, v7, pc}
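
For completeness, mystrlen takes the string pointer in a1 (r0) and returns the length in a1, and v1 through v7 are r4 through r10 in the APCS register naming that GNU as accepts. A minimal call site could look like this (the greeting label and string are just an example):

	.text
	ldr	a1, =greeting		@ a1 (r0) = pointer to the string
	bl	mystrlen		@ returns the length in a1 (12 here)

	.data
greeting:
	.asciz	"hello, world"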