Wednesday, March 25, 2009

Loop unrolling and speed-up techniques

Today I started some investigation on DTMF detection. After some research I decided to try the Goertzel algorithm, wrote some code from scratch and tried it on a real equipment based on an ARM7 core. It did fine actually, but after some benchmarking I thought it was somewhat slow.

I wrote the code thinking on future optimization but keeping simplicity in mind, specially because I didn't know how it was going to behave. However, as said above, calculations took much longer than what I expected, even though they were quite fast, exceeding the requirements for real-time processing (telephone channel, 8khz, mono). I still wanted to keep the code in C for portability so I tried some tricks.

I wrote the code with fixed-point math in mind, since floating point arithmetic would kill performance.

First, here are some basic definitions:

/* Struct for a single Goertzel calculator */typedef struct{ dsp_t sp1,sp2; //past values (IIR) dsp_t coeff;   //calculated only once} GOERTZ_s;/* Calculate Goertzel 'step' */#define GOERTZ_calc( _s, _val ) \ { \     dsp_t _gs = (_val) + FP_MULT2( _s.coeff, _s.sp1 ) - _s.sp2; \     _s.sp2 = _s.sp1; \     _s.sp1 = _gs; \ }

The first approach was the simplest one, I needed sixteen Goertzel calculations (8 frequencies + 8 second harmonic for the first frequencies). I had a buffer where the incoming PCM data was copied to so I had to process that array for each of the 16 Goertzel 'calculators'.

static inline void GOERTZ_processAll( dsp_t sample ){ int i; for (i=0; i < 16; i++)     GOERTZ_calc( goertzs[i], sample );}/* this is inside another function */int i;for (i=0; i < BUFFER_SIZE; i++) GOERTZ_processAll( bufferData[i] );

Where BUFFER_SIZE is the size of bufferData[] and goertzs[] is a 16-element array of GOERTZ_s structs, already initialized. Notice the inline modifier in GOERTZ_processAll(). It's really important since function inlining will help in execution time, specially when the function is called so frequently.

That code worked alright, but was too slow, at least for my intuition. The progress into a more efficient scheme had many steps, but basically there were three important changes:

• dsp_t was defined as int16_t, but the processor's 'native' word length is 32-bit, so changing dsp_t to be int32_t speeded up the calculations by removing cast and special assignment instructions.
• Loop unrolling was done in GOERTZ_processAll(). That means more flash is used but it's a good price to pay for a good speed up. Besides that, here we are talking about repeating the same thing 16 times so it's not a big problem if we have, to say, a 512k flash microcontroller or dsp.
• Instead of passing a single value to GOERTZ_processAll() a 8-byte array is used. This improves time considerably but also increases code size. As said in the previous point that wasn't an issue this case. Note that a 16-byte array won't necessarily increase performance. It depends on the core architecture, and in this case it made things worse (but better than single value parameter passing).

With these modifications I got a 100% performance increase, which means that there can be twice as much DTMF decoders compared to the non-optimized version.

Here is the code with the modifications:

static inline void GOERTZ_processAll( dsp_t samples ){ #define DO(j) \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); \     GOERTZ_calc( goertzs[j], samples ); DO(0); DO(1); DO(2); DO(3); DO(4); DO(5); DO(6); DO(7); DO(8); DO(9); DO(10); DO(11); DO(12); DO(13); DO(14); DO(15);#undef DO}/* somewhere in another function */int i;for (i=0; i < BUFFER_SIZE; i +=8 ) GOERTZ_processAll( bufferData + i );

But it can do better...

After observing how good performance went after this modifications I tried to do more loop unrolling to see if I could improve it. If the data buffer is large enough then the for loop that process 8 samples at a time wastes useful CPU instructions, so I included it inside a function called GOERTZ_processBuffer() which takes a buffer pointer and its length. There was another 100% performance increase from the previous optimization. This means a 4x speed up from the original code! Note that, once again, more flash is being used for loop unrolling.

static inline void GOERTZ_processBuffer( dsp_t *samples, int num ){ #define DO(j) \   for ( i=0; i < num; i+=32 ) { \       GOERTZ_calc( gss[j], samples[0+i] ); \       GOERTZ_calc( gss[j], samples[1+i] ); \       GOERTZ_calc( gss[j], samples[2+i] ); \       GOERTZ_calc( gss[j], samples[3+i] ); \       ... up to 32 .. ; } DO(0); DO(1); DO(2); DO(3); DO(4); DO(5); DO(6); DO(7); DO(8); DO(9); DO(10); DO(11); DO(12); DO(13); DO(14); DO(15);#undef DO}