strchr() optimized for ARM and performance analysis

Note: This optimization are for an ARMv5 processor (arm926ejs) further improvement could be achieved in a newer ARM versions

The next function I want to play with is strchr() used for locating a char in a given string; returns a pointer to the first occurrence of the character or NULL if not found.

char *strchr(const char *s, int c);

The code I have come up with has 4 differentiated sections as can be seen in the code listing below:

  • Lines 12 – 18: Address is un-aligned so we look for the given and null character and then leave or we hit an aligned address so we continue to the word-based code
  • Lines 25 – 35: Read a word, check whether there is a 0x0 in any byte and save the shifted data for checking against the given char. If any of these two conditions matches we leave.
  • Lines 39 – 61: After leaving the word-code we must check whether we read a null char and return 0 or the byte read matches the char we are looking for and return the proper address.
  • Lines 67 – 69: Exit code for the un-aligned code, just check whether we left with 0 or not.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
.globl mystrchr
mystrchr:
	stmfd	sp!, { v1, v2, v3, v4, v5, v6, lr }
 
	# a1 : src
	# a2 : char
 
	mov	v6, #255
 
	# When src pointer is un-aligned go Byte by Byte
	# till we reach an aligned address
1:	and	v1, a1, #3
	beq	1f
	ldrb	v1, [a1], #1
	cmp	v1, #0
	cmpne	v1, a2
	beq	exit
	bal	1b
 
	# Aligned mode
	# 1. Read Word
	# 2. Check whether we got a \0 and keep the shifted data for later use
	# 3. Find char
	# 4. If any of the above is true, break
1:	ldr	v5, [a1], #4
	ands	v1, v6, v5
	andnes	v2, v6, v5, LSR #8
	andnes	v3, v6, v5, LSR #16
	andnes	v4, v6, v5, LSR #24
	eornes	v5, a2, v1
	eornes	v5, a2, v2
	eornes	v5, a2, v3
	eornes	v5, a2, v4
	beq	check
	bal	1b
 
	# Char Found?
	# Check for 0x0 and reduce src pointer to the matched char
check: 	cmp	v1, #0
	beq	error
	eors	v5, a2, v1
	subeq	a1, a1, #4
	beq	ret
 
	cmp	v2, #0
	beq	error
	eors	v1, a2, v2
	subeq	a1, a1, #3
	beq	ret
 
	cmp	v3, #0
	beq	error
	eors	v1, a2, v3
	subeq	a1, a1, #2
	beq	ret
 
	cmp	v4, #0
	beq	error
	eors	v1, a2, v4
	subeq	a1, a1, #1
	beq	ret
 
	# We shouldnt reach here, but just in case return NULL
	mov	a1, #0
 
	# If we leave with 0 do not substract	
exit:	cmp	v1, #0
	subne	a1, a1, #1
error:	moveq	a1, #0
 
ret:	ldmfd	sp!, { v1, v2, v3, v4, v5, v6, pc }

Benchmark Analysis

The next figure shows the performance of this algorithm against the one built in uClibc and shows that the longer the string length is the better mystrchr() outperforms the uClibc:

The next plot shows the difference between both times, positive means mystrch() did better job than uClibc’s implementation:

To better understand the data and quantify the performance we need to fit the previous two data-sets as shown in the next figure:

The parameters of the fitted lines (y = mx + c) are as follows:

  • uClibc: m = 0.062, c = -15.9
  • mystrchr: m = 0.051 c = -12
Conclusion

The above analysis shows mystrchr() is around 9% faster than uClibc’s version but it is only noticeable for large input strings which is the less likely scenario. For a common usage both functions show the same performance.

The funny thing is that uClibc’s seems to be kind of partially optimized ASM function but it is not, it is just a simple C loop as follow and it is not even a good compiler job as can be also seen in the listing below, problem might just simply be that the 8 instructions in the loop for checking the exit condition are just too much.

 
Wchar *Wstrchr(register const Wchar *s, Wint c)
{
	do {
		if (*s == ((Wchar)c)) {
			return (Wchar *) s;	/* silence the warning */
		}
	} while (*s++);
 
	return NULL;
}
00000000 <__GI_strchr>:
   0:	e20110ff 	and	r1, r1, #255	; 0xff
   4:	e5d03000 	ldrb	r3, [r0]
   8:	e1530001 	cmp	r3, r1
   c:	012fff1e 	bxeq	lr
  10:	e3530000 	cmp	r3, #0
  14:	12800001 	addne	r0, r0, #1
  18:	1afffff9 	bne	4 <__GI_strchr+0x4>
  1c:	e1a00003 	mov	r0, r3
  20:	e12fff1e 	bx	lr