ARM Optimized strlen()

Few weeks ago I started a personal project to enable lua programming under u-boot to allow easy access of peripherals and create small embeddable initialization scripts which I have missed during professional Embedded Development.

Another goal of this projects is to optimize u-boot’s ARM code so the first optimization go through string manipulation functions, today strlen().

The current strlen code is just a simple C loop looking for 0x0 to then return the number of Bytes read. This is far from optimum since it really miss some interesting ARM features and algorithmic improvements.

Here is the code I have come up after studying different scenarios, it sure isn’t perfect so please drop a line with improvements or fixes!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
.globl mystrlen
mystrlen:
	stmfd	sp!, {v1, v2, v3, v4, lr}
 
	mov	v2, a1
	mov	v3, #255
 
	##
	## un-aligned address till we get aligned
	##
1:	ands	v4, v2, #3
	beq	0f
	ldrb	v1, [v2], #1
	ands	v4, v3, v1
	beq	1f
	bal	1b
 
	## un-rolling strings
	## as few instructions in the loop
	## as possible
	## - Check whether any position equals 0
0:	ldr	v1, [v2], #4
	ands	v4, v3, v1
	andnes	v4, v3, v1, LSR #8
	andnes	v4, v3, v1, LSR #16
	andnes	v4, v3, v1, LSR #24
	bne	0b
 
	## After the loop we calculate the diff
	## between the end and the begining of the str
	## 
	ands	v4, v3, v1
	subeq	v2, v2, #1
	andnes	v4, v3, v1, LSR #8
	subeq	v2, v2, #1
	andnes	v4, v3, v1, LSR #16
	subeq	v2, v2, #1
	andnes	v4, v3, v1, LSR #24
1:	subeq	v2, v2, #1
 
	sub	a1, v2, a1
	ldmfd	sp!, {v1, v2, v3, v4, pc}

The algorithm can be divided in three different stages:

  1. Lines 11-16: When the memory address is un-aligned we go byte-by-byte till we find 0x0 or we reach an 4-Bytes aligned memory so we jump to the unrolling loop
  2. Lines 22-27: When on aligned memory we read a word and mask each Byte looking for 0x0. Thanks to ARM instruction set we can save a few cycles when finding 0x0 in the least significant Byte.
  3. Lines 32-39: Onces finished depending whether the end-of-string Byte is we subtract from 1 to 4 bytes of the current mem address so we can easily return the length

Bencharmk

I have checked the performance of this algorithms against a simple C loop looking for ‘\0’ and uClibc’s implementation. The tests where carried calculating the lengths for strings from 1 to 99999 characters averaged over 3 trials to reduce outliers.

There is a huge difference between uClibc’s strlen/mystrlen and the simple C loop, so my optimization has passed the first test

This figure shows both algorithms are really close in performance, it seems mystrlen is just slightly below uClibc’s strlen but not for being even noticeable.

 

This figure shows the for each string length the time difference, when it is above 0 it tells mystrlen is better than uClibc’s strlen. Given that the time units are micro-seconds we can say both algorithms has the same performance; further data analysis shows mystrlen is slightly better.

Average: 0.88

Std Dev: 13.75

Conclusion

The ARM optimization shown here is at least as good as uClibc’s ARM specific strlen. One funny thing is that for 1 char mystrlen is 5x faster than uClibc’s strlen.