Friday, December 26, 2008

GIT and CR/LF-ing

I recently started using git and it's amazing. Since I do much of my work in Windows I use msysgit, which does a nice job.

However, there are some issues regarding the famous <CR><LF> line endings: git will try to convert single <LF>'s to <CR><LF>'s by default. Recalling the previous post about automatic build number generation, git will warn about single <LF> termination on some files, especially the ones generated by sh under Windows.

Fortunately, the git docs discuss every detail about customization and git itself. To force git to leave the files as is and forget about line terminations the file .git/info/attributes should be modified by adding something like this (and created if it doesn't exist yet):

file_to_ignore -crlf
another_file_to_ignore -crlf

That way git will just ignore the single-LF ending inside the specified files. Wildcards are also possible.

Another nice trick that has nothing to do with line termination is the file .git/info/exclude which tells git to exclude certain files from the repository (wildcards allowed too).

Wednesday, December 17, 2008

Automatic build number generation

Version tracking is essential for several reasons. Without a proper versioning system it's difficult to know whether this or that firmware contains a certain update or feature, especially if one forgets to increment the version number.

That's where the build number appears, assigning a unique number per build (or 'make' in this special case). That way it can't go wrong. Version number is still there but there is also a build number, uniquely defining the release.

I usually never reset the build number, so even if the version number jumps (modified manually) the build number will always increment. It's not mandatory, it depends on how you feel about it.


Here is an automatic build number generator. It's a shell script based on sh. I used it with Windows without any problems as long as sh and friends are available on the system path.

version="`sed 's/^ *//' major_version`"
old="`sed 's/^ *//' build.number` +1"
echo $old | bc > build.number.temp
mv build.number.temp build.number
echo "$version`sed 's/^ *//' build.number` - `date`" > version.number
echo "#ifndef BUILD_NUMBER_STR" > build_number.h
echo "#define BUILD_NUMBER_STR \"`sed 's/^ *//' build.number`\"" >> build_number.h
echo "#endif" >> build_number.h

echo "#ifndef VERSION_STR" >> build_number.h
echo "#define VERSION_STR \"$version`sed 's/^ *//' build.number` - `date`\"" >> build_number.h
echo "#endif" >> build_number.h

echo "#ifndef VERSION_STR_SHORT" >> build_number.h
echo "#define VERSION_STR_SHORT \"$version`sed 's/^ *//' build.number`\"" >> build_number.h
echo "#endif" >> build_number.h

In order to make it work a file named major_version has to be created. It should contain something like "1.02." (without the quotes). A file called build.number is also needed; it holds the starting build number, like "1". The version number won't be modified, but the build number will be incremented each time the script is executed. In order to be useful for C programming, the header file build_number.h is updated to reflect both build.number and major_version.

I usually create a makefile dependency to perform automatic build number generation by adding something like this to the Makefile:

# main rule
all: build_number.h $(TARGET).whatever $(TARGET).other
@whatever rules

# rule for build number generation
build_number.h: $(SOURCES)
@echo Generating build number..

A new build number is assigned every time a source file is modified. There can be other approaches too, like generating a new build number even if no files were modified.

IMPORTANT: Under Windows care must be taken with CR and CRLF line endings. The files build.number and major_version have to be terminated with CRLF, otherwise sed and the other linux/unix utils won't be able to parse them.

Finally here is an example of what build_number.h looks like:

#define BUILD_NUMBER_STR "1234"
#define VERSION_STR "1.0.1234 - Fri Dec 19 11:00:23 HSE 2008"
#define VERSION_STR_SHORT "1.0.1234"

The shell script is easy to modify if any other information is needed, such as the build number as a proper integer when some preprocessing is required.
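
As a rough illustration of how firmware might consume the generated header, here is a minimal C sketch. The fallback definitions stand in for build_number.h so the snippet compiles on its own; BUILD_NUMBER (a plain integer macro) and format_banner() are hypothetical names, not something the script above emits.

```c
#include <stdio.h>
#include <stddef.h>

/* In a real project these come from the generated header:
 *   #include "build_number.h"
 * The fallbacks below only exist so this sketch compiles stand-alone. */
#ifndef VERSION_STR_SHORT
#define VERSION_STR_SHORT "1.0.1234"
#endif
#ifndef BUILD_NUMBER
#define BUILD_NUMBER 1234 /* hypothetical integer form of the build number */
#endif

/* An integer macro can be tested by the preprocessor, a string cannot. */
#if BUILD_NUMBER < 1
#error "build number must start at 1"
#endif

/* Write a boot banner such as "fw 1.0.1234 (build 1234)" into buf.
 * Returns the number of characters snprintf wanted to write. */
int format_banner(char *buf, size_t len)
{
    return snprintf(buf, len, "fw %s (build %d)", VERSION_STR_SHORT, BUILD_NUMBER);
}
```

The integer macro would need one extra echo line in the script; the string macros work as generated.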

Thursday, December 4, 2008

Embedded C++ constructs - Part 2

As mentioned in part 1, I'll now get into a UART example, providing the basic functionality through a class. This example works with the LPC ARM7 family from NXP Semiconductors (Philips).

Register abstraction

First here is how I abstract a hardware register. It's nothing special but a simple template class:

namespace HW
{
    /**
     * Individual HW register
     */
    template < unsigned long int BASE, unsigned long int OFF >
    class Reg
    {
    public:
        void operator = (const unsigned long int & val) {
            *((volatile unsigned long int *)(BASE + 4*OFF)) = val;
        }

        unsigned long int & operator = ( const Reg & reg ) {
            return *((volatile unsigned long int *)(BASE + 4*OFF));
        }

        operator unsigned long int () {
            return *((volatile unsigned long int *)(BASE + 4*OFF));
        }
    };
}

(I used C-like casts here but it would be better to use reinterpret_cast<> instead)

Three operators are implemented. It's essential to be able to cast the class to an unsigned int so we can use it transparently.

The template parameters let us provide a base and offset address which will be useful when a peripheral is abstracted through a template class.
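
For comparison, here is a minimal C sketch of the same base-plus-word-offset addressing written as a macro. REG32, fake_peripheral and reg_write_then_read() are names made up for this illustration, and the array only stands in for a real peripheral so the code can run on a PC.

```c
#include <stdint.h>

/* Word-indexed register access: same arithmetic as HW::Reg<BASE, OFF>. */
#define REG32(base, off) (*(volatile uint32_t *)((uintptr_t)(base) + 4u * (off)))

/* Stand-in for a peripheral's register block so the sketch runs on a PC.
 * On the target, the base would be the peripheral address from the datasheet. */
uint32_t fake_peripheral[4];

/* Write a value to register `off` and read it back through the macro. */
uint32_t reg_write_then_read(unsigned off, uint32_t val)
{
    REG32(fake_peripheral, off) = val;
    return REG32(fake_peripheral, off);
}
```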

UART peripheral abstraction

The UART peripheral in the LPC2xxx family consists of a collection of registers, starting from a base address. There are some interesting bits lying around on the microcontroller too that control peripheral power, the peripheral's clock and associated interrupts. Here all of them are encapsulated in a single class, except for the interrupts (to do).

namespace HW
* UART Module Abstraction
template<unsigned int BASE, unsigned int PCON_BIT, unsigned int PCLK_SEL>
class RegUART
void powerOn() {

void setClkDiv( unsigned int divv ) {

HW::Reg<BASE, 0> RBR;
HW::Reg<BASE, 0> THR;
HW::Reg<BASE, 0> DLL;
HW::Reg<BASE, 1> DLM;
HW::Reg<BASE, 1> IER;
HW::Reg<BASE, 2> IIR;
HW::Reg<BASE, 2> FCR;
HW::Reg<BASE, 3> LCR;
HW::Reg<BASE, 5> LSR;
HW::Reg<BASE, 7> SCR;
HW::Reg<BASE, 8> ACR;
HW::Reg<BASE, 9> ICR;
HW::Reg<BASE, 10> FDR;
HW::Reg<BASE, 12> TER;


PCONP and PCLOCK_SELECT() are macros defined in another file, nothing special.

HWUARTx are macros that help when specifying a certain port. They will be used in the next piece of code.

#include <string.h>

namespace Drivers
* LPC2xxx UART Peripheral
* Polled.
* @param T Hw::RegUART
template< class T >
class PolledUart


* Init UART
* @param brate desired baudrate
void init(unsigned int brate) {

UART.setClkDiv( PCLK_DIV_1 );

UART.LCR = 0x80; //DLAB = 1

UART.DLL = (Fpclk/16)/brate & 0xFF;
UART.DLM = (((Fpclk/16)/brate) >> 8) & 0xFF;

UART.LCR = 3 | //8 bit char length
(0<<2) | // 1 stop bit
(0<<3) | //no parity
(0<<4) | //parity type
(0<<6) | //disable break transmission
(0<<7) ; //DLAB = 0: disable access to divisor latches

UART.FCR = (0x07); //FIFO ENABLE, Rx & Tx

* Transmit a single byte
* @param c
void tx( unsigned char c ) {
while ( !( UART.LSR & (1<<5) ) )

* Transmit several bytes
* @param data data to transmit
* @param length length of 'data'
void tx( const unsigned char *data, unsigned int length ) {
while( length-- > 0 )
tx( *(data++) );

* Transmit a C string
* @param str
void tx( const char *str ) {
tx( (const unsigned char *) str, strlen(str) );

* Byte receive
* @return unsigned char received char
unsigned char rx(void)
while ( !( UART.LSR & (1<<0) ) )

return UART.RBR;

* Returns if there is data to be read in the FIFO
* @return bool true if there is data available
bool isDataAvailable(void)
if ( UART.LSR & (1<<0) )
return true;
return false;

PolledUart implements UART basic functionality. No constructor is used to avoid undesired code creation, so init() must be called before using any other function.
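
The divisor arithmetic inside init() can be checked on a PC with a small C sketch like the one below. The struct and function names are invented for this example, and the clock values in the usage note are assumptions, not figures from the original project.

```c
#include <stdint.h>

/* Divisor latch values for the 16x oversampling UART clock. */
struct uart_divisor {
    uint8_t dll; /* low byte  */
    uint8_t dlm; /* high byte */
};

/* Same integer arithmetic as init(): divisor = (pclk / 16) / baud.
 * pclk_hz plays the role of Fpclk; the result is truncated, not rounded. */
struct uart_divisor uart_compute_divisor(uint32_t pclk_hz, uint32_t baud)
{
    uint32_t div = (pclk_hz / 16u) / baud;
    struct uart_divisor d;
    d.dll = (uint8_t)(div & 0xFFu);
    d.dlm = (uint8_t)((div >> 8) & 0xFFu);
    return d;
}
```

Assuming a 14.7456 MHz peripheral clock, 115200 bps gives a divisor of 8 (DLL = 8, DLM = 0). With other clocks the truncation introduces a baud rate error that is worth checking.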

Even though this class contains a RegUART object as a private member it won't take any extra memory, since the compiler (at least arm-elf-gcc here) will optimize the template away. I did some tests and there is no difference in code size compared with a simple C function executing the same statements directly.

Finally, to see how it works, if we wanted to use UART0 through this class we could instantiate it like this:

Drivers::PolledUart< HWUART0 > Uart0;

void testUart0()
Uart0.init(115200); //init @ 115200 bps

Uart0.tx("Hello there!\n");

If the class is to be used by many C++ files one should consider declaring it extern inside a header file and implementing it in a single cpp one. That would save code by avoiding function inlining each time it's used.


This is just a basic version to demonstrate how easy and clean code can get. Here are some modifications that will make the class more useful:

  • If there is an RTOS it would be useful to protect this class from concurrent access by using semaphores. It's not a difficult task: a mutex should be declared and initialized (maybe inside a constructor).
  • An interrupt-based UART is nice too, even better if the RTOS's message queues are used. This shouldn't be a problem either.

Tuesday, December 2, 2008

Embedded C++ constructs - Part 1

There has been an explosion of interest in using C++ within embedded systems. Recently David sent me some interesting papers and info about Embedded C++, so here I present what I've been doing so far (or at least a small portion of it).


First I have to say that templates don't cause code bloat if they're managed with care. In fact the C++ compiler is supposed to optimize them at compile time.

Here are some base templates I've coded to ease pin mapping on the LPC23xx ARM family.

#include "lpc24xx.h" /* LPC23/24xx register definitions */
#include "static_assert.h" /* static assert for non-C++0x compilers */

namespace HW
{
    /**
     * Output Pin Abstraction
     */
    template<unsigned int port, unsigned int pin>
    class OutPin
    {
        STATIC_ASSERT(pin <= 31, PinMustBeLessThan32);

    public:
        OutPin() {
            *(&FIO0DIR + 8*port) |= (1<<pin); //configure pin as output
        }
        void Set(void) { *(&FIO0SET + 8*port) = (1<<pin);}
        void Clr(void) { *(&FIO0CLR + 8*port) = (1<<pin);}
    };

    /**
     * Input Pin Abstraction
     */
    template<unsigned int port, unsigned int pin>
    class InPin
    {
        STATIC_ASSERT(pin <= 31, PinMustBeLessThan32);

    public:
        InPin() {
            *(&FIO0DIR + 8*port) &= ~(1<<pin); //configure pin as input
        }

        operator bool () {
            if ( (*(&FIO0PIN + 8*port)) & (1<<pin) )
                return true;
            return false;
        }
    };
}

This may look like a waste of code at first glance. However, these classes avoid many mistakes and provide a hardware-independent interface to pin outputs and inputs; it's just a question of redefining these templates to fit the platform.

Also note that the compiler will throw an error if pin's value is not within the allowed range. This is also a protection and won't waste any processor instructions or any other memory. It would be useful to limit the port range too. The above templates use the static assert method described earlier in this post.

The constructors take care of configuring the port pin as output or input. If the pin is defined globally then the constructor will be called before entering the main function, configuring the pin as it should be. If it is defined inside a function or created with the new operator (which I would try to avoid) then the constructor will be called when it is instantiated.

Here is an example:

// Pin 0.25
HW::OutPin< 0,25 > myLed;
// Pin 1.20
HW::InPin< 1,20 > mySwitch;

void invert(void)
if ( mySwitch )

Beautiful, isn't it? The generated code (arm-elf-gcc 4.2.x) is exactly the same as if it were done manually by writing/reading the corresponding registers. There is no loss in performance compared to the equivalent C code.

In the next post I will discuss a UART implementation by using similar code constructions.

What I do not like about C++

C++ lacks many useful C99 features and g++ doesn't implement them either. I could live without most of them, but what really hurts is having to avoid designated initializers, especially on structs. It isn't a big problem if the struct's data is in RAM, but it's disappointing when structs are in ROM (const).

If we want a struct to be in ROM while using C++ it has to be declared const, but in addition constructors can't be used or the const-declared struct will be placed in RAM (cRazY). It is an understandable limitation but that means that the only way to initialize a struct is by passing each element one by one and in the exact order as they're defined in the struct. That is error prone, particularly when a new element is added in the middle of the struct. So when we try to switch to C++ to avoid errors we become prone to issues that were solved with C99.

So, that's the only thing I don't like about C++. The solution? Use C for const struct initializers and C++ for the rest? Don't know.

Compile-time assertions

The new C++0x standard contains a neat feature called static_assert which does a compile-time assertion. It's really useful in cases where we can't use the #if directive (preprocessor) to throw a compile-time error. This happens when the preprocessor finds a C construct or expression it can't resolve, such as sizeof, even if it is known at compile time (actually it's only known to the compiler).

After searching the web I found this webpage that contains a nice trick to do these assertions in C. However, the macros defined there do not provide a description of the assertion, which might be useful, so here is my modified version:

* -- based on
#define ASSERT_CONCAT_(a, b, c) a##b##c
#define ASSERT_CONCAT(a, b, c) ASSERT_CONCAT_(a, b, c)
#define STATIC_ASSERT(e,descr) enum { ASSERT_CONCAT(descr,_Line_,__LINE__) = 1/(!!(e)) }

As an example, this would require the size of the struct myStruct to be less than 20 in order to compile (this is only an example, remember 'magic numbers' should be avoided when programming):

STATIC_ASSERT( sizeof( struct myStruct ) < 20, MyStructIsTooLarge );

If there is an assertion error at compile time this error will be thrown:

file.c:40: error: division by zero
file.c:40: error: enumerator value for 'MyStructIsTooLarge_Line_40' is not an integer constant
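
To try the macro in isolation, a complete translation unit could look like the sketch below. The struct fields and my_struct_size() are made up for illustration; only the three macros come from the text above.

```c
#include <stddef.h>

#define ASSERT_CONCAT_(a, b, c) a##b##c
#define ASSERT_CONCAT(a, b, c) ASSERT_CONCAT_(a, b, c)
#define STATIC_ASSERT(e, descr) \
    enum { ASSERT_CONCAT(descr, _Line_, __LINE__) = 1 / (!!(e)) }

/* Example payload; the fields are made up for illustration. */
struct myStruct {
    unsigned char id;
    unsigned char payload[15];
};

/* Passes: 16 < 20. Raising payload to 19 bytes or more breaks the build. */
STATIC_ASSERT(sizeof(struct myStruct) < 20, MyStructIsTooLarge);

/* Run-time view of the same quantity, only used to exercise the sketch. */
size_t my_struct_size(void)
{
    return sizeof(struct myStruct);
}
```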

On the other hand, if C++ is being used and there is access to an (at least partial) C++0x compiler you can use static_assert() in a similar way:

static_assert( sizeof( struct myStruct ) < 20, "My Struct Is Too Large" );

Currently GCC 4.3 supports static_asserts as mentioned here.

Wednesday, November 26, 2008

Frame pointers and Function Call Tracing

Function call tracing is really helpful when tracking down a problem, especially on an embedded system. On the ARM architecture, data abort events and the like provide useful data about where the problem was found, such as the instruction where it happened and the value each register had at that moment. However, that usually isn't enough, especially when the fault occurs in functions like memcpy() that are called from many places in our program. Even worse: imagine running an RTOS. If we knew where memcpy() was called from, the problem would probably be a lot easier to debug and trace down. Yesterday I was talking with David and he mentioned the function __builtin_return_address that comes with GCC. He uses it together with C++ (i386) to detect memory leaks; however, it's not available for the ARM architecture.

GCC and frame pointers

GCC implements a nice concept called Frame Pointer. One of the CPU's registers is reserved and used as a frame pointer. Each time a function starts its execution the frame pointer is set to point exactly after the return address the caller has placed on the stack. Apart from giving the compiler an easier way to refer to function arguments, it can also be used to deduce who called us, and who called the one who called us, and so on. A nice explanation can be found here.

However GCC turns on the -fomit-frame-pointer flag for some optimization levels so if we need the frame pointer we need to force it by adding -fno-omit-frame-pointer.
GCC provides a function called __builtin_return_address() with which you can trace up the function calls but, as mentioned, it's not available for ARM. However, I found a nice piece of code in the Linux kernel for the ARM architecture here. I just stripped this part (remember it's published under the GPL license by Russell King):

.align 0
.type arm_return_addr %function
.global arm_return_addr

arm_return_addr:
mov ip, r0
mov r0, fp
3:
cmp r0, #0
beq 1f //@ frame list hit end, bail
cmp ip, #0
beq 2f //@ reached desired frame
ldr r0, [r0, #-12] // else continue, get next fp
sub ip, ip, #1
b 3b
2:
ldr r0, [r0, #-4] //@ get target return address
1:
mov pc, lr //get back to caller

The 'magic' values are -4 and -12, which indicate the relative position of the previous function's link register (return address) and frame pointer. These values come from analysing the push instruction executed at every function entry, which pushes, apart from other registers, {pc, lr, ip, fp} in that precise order onto the stack.
The C prototype for arm_return_addr would be:

void * arm_return_addr( unsigned int num );

Where num is the number of frames to search back. Take care and remember that if you reach the last frame its value will be 0 and you should stop there.
What I did was to show the call trace once I get a data abort or prefetch abort exception, so I know what caused it and it's easier to track back.
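
A minimal C sketch of such a trace loop follows. It is only an illustration: collect_call_trace() and fake_return_addr() are hypothetical names, and the address lookup is passed in as a function pointer so the loop can be exercised on a PC; on the target one would pass arm_return_addr itself.

```c
#include <stddef.h>

/* Signature of arm_return_addr(); passed in as a pointer so the walker
 * can be exercised on a host with a fake implementation. */
typedef void *(*return_addr_fn)(unsigned int num);

/* Fill `trace` with up to `max` return addresses, innermost caller first.
 * Stops at the first 0, which marks the end of the frame list.
 * Returns the number of entries stored. */
size_t collect_call_trace(return_addr_fn get_addr, void **trace, size_t max)
{
    size_t n = 0;
    while (n < max) {
        void *addr = get_addr((unsigned int)n);
        if (addr == NULL)
            break;
        trace[n++] = addr;
    }
    return n;
}

/* Host-side stand-in: pretends the call chain is three frames deep. */
void *fake_return_addr(unsigned int num)
{
    static int frames[3];
    return (num < 3u) ? (void *)&frames[num] : NULL;
}
```

The collected addresses can then be printed from the abort handler and decoded offline.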

Give me something I can understand: decrypting addresses

With the function discussed above we can get the return addresses but where to go from there? You can generate a listing with arm-elf-objdump and find the address, but there is a nice tool called arm-elf-addr2line which will do it for you, thanks to David for pointing this out. Just do something like:

    arm-elf-addr2line --exe=yourelf.elf

And it will output something like:


Which means you can find it in main.c, line 90. Quite nice, isn't it?


There is a stack penalty when using frame pointers. In one product I'm developing I saw a stack penalty of between 20 and 30 words when using frame pointers, compared to the same program compiled without them. That means at least 20*4 = 80 bytes of stack. Since that product runs an RTOS with more than 10 tasks, that number multiplies and yields an important increase in total stack usage. That can be a problem if RAM is scarce, not to mention that a tightly tuned program will probably crash the first time it's compiled with frame pointers because of stack overflows.

To see what causes this behaviour let's look at a simple C function:

int sumNumbers( int a, int b ) {
    return a + b;
}

When compiled with -fomit-frame-pointer we get:

00007f6c <sumNumbers>:
7f6c: e0810000 add r0, r1, r0
7f70: e12fff1e bx lr

But if we force frame pointers with -fno-omit-frame-pointer we obtain:

0000818c <sumNumbers>:
818c: e1a0c00d mov ip, sp
8190: e92dd800 push {fp, ip, lr, pc}
8194: e0810000 add r0, r1, r0
8198: e24cb004 sub fp, ip, #4 ; 0x4
819c: e89da800 ldm sp, {fp, sp, pc}

By using frame pointers GCC is obliged to push the registers fp, ip, lr and pc and set up the frame pointer, which means bigger code and higher stack usage, at least for small or medium sized functions. Now you may see why the GCC documentation says "-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging."
Using frame pointers or not depends on whether you give priority to debugging or small code footprint (and smaller ram/stack footprint too).

It's important to remember that if the program is compiled without frame pointers then the arm_return_addr function must not be called, since the frame pointer register will contain other information, likely unrelated to frame pointers.

Alternative methods

There is another method we can use with GCC. It involves the -finstrument-functions compiler flag, which forces GCC to call user-defined functions when entering and exiting functions, so an array holding the call tree can be kept in RAM. However, that could slow down the whole program excessively, and care must be taken with multithreaded designs. For more information here is the GCC documentation.
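
As a rough sketch of this approach, the two hooks GCC calls are __cyg_profile_func_enter and __cyg_profile_func_exit. Everything else below (the array, its depth and the two accessor functions) is an assumption made for this example, and it is single-threaded: under an RTOS each task would need its own array.

```c
#include <stddef.h>

#define CALL_STACK_DEPTH 32 /* arbitrary limit chosen for this sketch */

static void *call_stack[CALL_STACK_DEPTH];
static size_t call_depth;

/* Hooks GCC calls on entry/exit when built with -finstrument-functions.
 * They must not be instrumented themselves, or they would recurse. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
size_t trace_depth(void) __attribute__((no_instrument_function));
void *trace_top(void) __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    if (call_depth < CALL_STACK_DEPTH)
        call_stack[call_depth] = this_fn;
    call_depth++; /* keep counting so enter/exit stay balanced */
}

void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (call_depth > 0)
        call_depth--;
}

/* Current nesting depth, e.g. for a fault handler. */
size_t trace_depth(void) { return call_depth; }

/* Innermost recorded function, or NULL if nothing usable is recorded. */
void *trace_top(void)
{
    if (call_depth == 0 || call_depth > CALL_STACK_DEPTH)
        return NULL;
    return call_stack[call_depth - 1];
}
```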

Tuesday, November 25, 2008

Signal conditioning: Mean might be too mean


  • adjective- unkind or spiteful: a mean trick
  • noun- Mathematics: The average value of a set of numbers.

Many times measured analog signals must be filtered before they can be used. Sometimes it's important to remove high frequency components, with anything from a slight low pass filter to something more drastic like constant estimation. We could say we're doing noise reduction by oversampling in many cases.

The first method and probably the most intuitive one is by summing several samples and then calculating the mean.

Moving average

The conservative method would be to implement a FIR filter, leading to a moving average filter. This gives us a filtered sample for each incoming measurement. The simplest method is a non-weighted averaging filter, where each sample gets equal importance. Here is the Bode plot of a 20-point moving average filter with a sampling frequency of 10 kHz.

It's clear this is not a typical lowpass filter, even though it helps to remove high frequency components. Besides that, the filter has deep notches at multiples of Fsample / N (500 Hz, 1 kHz and so on for the 20-point, 10 kHz example). This can be useful in some cases but it can also be an unwanted effect. The low-pass cutoff frequency (-3 dB) is roughly 0.44 * Fsample / N, slightly below Fsample / (2*N).
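
A minimal C sketch of such a non-weighted moving average is shown below. It keeps a running sum so each new sample costs one subtraction, one addition and one division. The names, the 16-bit sample type and the zero-filled start-up history are choices made for this example.

```c
#include <stdint.h>

#define MA_LEN 20u /* N: number of samples averaged */

struct moving_avg {
    uint16_t buf[MA_LEN]; /* last N samples, circular */
    uint32_t sum;         /* running sum of buf[] */
    unsigned idx;         /* slot that holds the oldest sample */
};

/* Start with all-zero history. */
void ma_init(struct moving_avg *f)
{
    for (unsigned i = 0; i < MA_LEN; i++)
        f->buf[i] = 0;
    f->sum = 0;
    f->idx = 0;
}

/* Push one sample, return the average of the last N samples. */
uint16_t ma_update(struct moving_avg *f, uint16_t sample)
{
    f->sum -= f->buf[f->idx]; /* drop the oldest sample */
    f->sum += sample;         /* add the newest */
    f->buf[f->idx] = sample;
    f->idx = (f->idx + 1u) % MA_LEN;
    return (uint16_t)(f->sum / MA_LEN);
}
```

Because the history starts at zero, the first N-1 outputs ramp up towards the real value and should be ignored or pre-filled.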

Averaging and subsampling

Another way is to take N samples and calculate their mean once they've all been acquired. The whole buffer is then cleared and measurements are accumulated again until N new samples are received and a new mean is calculated. This is similar to reducing the sampling rate, but no lowpass filter is being applied to the signal, so expect disastrous results for a non-constant signal, especially when its zero-frequency component is comparable to its harmonics. Since the filter works on chunks of N samples and then discards all the previous data, the result depends on how the input signal is synchronized with respect to the accumulator in the averaging algorithm.

This method works nicely when estimating a constant value, or a really slowly varying one. However, in many cases we can get better results with a Kalman filter.

What makes this method useful is its simplicity in both calculations and firmware implementation. All it takes is an accumulator, a count variable and a division once it's full of data.
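
The accumulator, counter and division just described fit in a few lines of C. This is only a sketch: the names and the block length of 16 are arbitrary choices for the example.

```c
#include <stdint.h>
#include <stdbool.h>

#define AVG_LEN 16u /* N: samples per output value */

struct block_avg {
    uint32_t acc;   /* accumulator */
    unsigned count; /* samples accumulated so far */
};

/* Feed one sample. Returns true and writes *mean once N samples are in,
 * then clears the accumulator for the next block. */
bool block_avg_feed(struct block_avg *a, uint16_t sample, uint16_t *mean)
{
    a->acc += sample;
    if (++a->count < AVG_LEN)
        return false;
    *mean = (uint16_t)(a->acc / AVG_LEN);
    a->acc = 0;
    a->count = 0;
    return true;
}
```

Choosing N as a power of two lets the compiler turn the division into a shift.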

Kalman Filter - Constant estimation

Using a Kalman filter for constant estimation may look too complicated for such a simple application, but if the measurement noise can be modeled as white and its variance is known we can obtain very good results. The variance can also be measured, and even better: it could be recalculated periodically, or at start-up if the signal is known to be constant for some time. Floating point calculations might be needed, although a fixed-point approach can be used too.

There is a good example on Kalman filter constant estimation in this paper by Greg Welch and Gary Bishop.
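
For reference, the scalar case reduces to a handful of multiplications and one division per sample. The C sketch below follows the usual predict/update equations for a constant state; the struct layout, the names and the use of double are assumptions for this example, and q and r have to be chosen for the actual signal.

```c
/* Scalar Kalman filter for a constant: no dynamics, so the prediction
 * step only inflates the error variance by the process noise q. */
struct kalman1d {
    double x; /* current estimate */
    double p; /* estimate error variance */
    double q; /* process noise variance (small, tuning value) */
    double r; /* measurement noise variance */
};

/* Fold one measurement z into the estimate and return the new estimate. */
double kalman1d_update(struct kalman1d *k, double z)
{
    double p_prior = k->p + k->q;             /* time update */
    double gain = p_prior / (p_prior + k->r); /* Kalman gain */
    k->x += gain * (z - k->x);                /* measurement update */
    k->p = (1.0 - gain) * p_prior;
    return k->x;
}
```

A typical start is x = 0 and p = 1 (a deliberately uncertain initial guess), so the first measurements are weighted heavily and the gain then shrinks as p falls.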

Low Passing

The other popular and classic method is to apply a low-pass filter to the signal, preserving frequency components up to a certain value. FIR and IIR filters come to mind but I won't discuss them here since it's a whole topic in itself. Using fixed point with these filters is quite easy and straightforward as long as we've chosen the right resolution for the fixed point calculations. Otherwise stability problems can arise, especially with IIR ones.

Friday, November 21, 2008

To volatile or not to volatile - Part 2

Hamlet did know about volatile. He just ignored it.

Recalling Hamlet's action on volatileness and the previous post we can 'trick' the compiler when working with ISRs and no nested interrupts enabled.
We can't undeclare a variable or remove the volatile specifier once it has been written before on the same file. However we can write our ISR code in a different source file and declare the variables as global, volatile in the file that uses it outside the ISR and non volatile inside the file containing the ISR.
Here is the resulting code:

// ----- FILE: non_isr.c ------- //
volatile unsigned int var1,var2;

/** Just to use the variables as volatile
* If Enter_Critical() and Exit_Critical() are
* global functions then the volatile specifier above
* could be avoided too
unsigned int getIntSum(void)
unsigned int temp;
temp = var1 - var2;

return temp;


// ----- FILE: isr.c ------- //
unsigned int var1,var2;

void ISR_Handler(void)
// do something with var1 and var2
// nested interrupts not enabled,
// thus it's secure to use them as normal variables

A big disadvantage when doing this is that variables will be global and we can't declare them static since they have to be shared between several source files.
As said in the previous post, it's only a matter of what you need in that specific situation. In some cases the ISR will take less time to execute, but in others it might be the same or give only a small performance boost.

Thursday, November 20, 2008

To volatile or not to volatile

Hamlet never had to worry about volatile specifiers, only existence.

UPDATE:There is a continuation on this topic here

The volatile keyword is something every developer has to deal with when working with interrupts or multiple threads. Its nature is quite simple: when present in a variable declaration it tells the compiler that the currently executing code is not the only one that may change its value, so the compiler can't rely on a cached value (in a register, for instance) or make any other assumption based on past values. On one hand this means that we should use this keyword if we expect certain variables to behave that way. On the other hand it also means that much more code and many more memory accesses will be generated than when using a normal (non-volatile) variable.

When first using volatile there is a myth that everything shared between interrupts/threads and the main execution path or thread must be declared volatile. Of course it will work, but it can lead to slower and larger code and it may not be necessary, especially when coding inside individual functions with critical sections or mutexes/semaphores.

Here is a piece of code, supposed to be a dumb interrupt handler:

extern volatile unsigned int x,y;
void __attribute__((interrupt ("IRQ"))) ISR_test(void)
{
    if( x > 0 ) //optimized to ==0 since x is unsigned
        x = x + y;
    else
        x = 2*y + x;
}

And the corresponding assembly listing for ARM7, compiling with gcc 4.2.2 optimization level 1:

2d7c: push {r1, r2, r3}
2d80: ldr r1, [pc, #68] // r1 = &x;
2d84: ldr r3, [r1] // r3 = x; READ X
2d88: cmp r3, #0 // r3 == 0 ?
2d8c: beq 2da8 // decision
2d90: ldr r2, [r1] // r2 = x; READ X
2d94: ldr r3, [pc, #52] // r3 = &y
2d98: ldr r3, [r3] // r3 = y; READ Y
2d9c: add r3, r3, r2 // r3 = r3 + r2;
2da0: str r3, [r1] // x = r3
2da4: b 2dc4 // ready to return
2da8: ldr r3, [pc, #32] // r3 = &y
2dac: ldr r3, [r3] // r3 = y READ Y
2db0: ldr r1, [pc, #20] // r1 = &x
2db4: ldr r2, [r1] // r2 = x READ X
2db8: lsl r3, r3, #1 // r3 = 2*r3
2dbc: add r3, r3, r2 // r3 = r3 + r2
2dc0: str r3, [r1] // x = r3
2dc4: pop {r1, r2, r3}
2dc8: subs pc, lr, #4 // 0x4
2dcc: .word 0x40004900 // &x
2dd0: .word 0x40004904 // &y

There are three instructions to read x's value and two to read y's. GCC did as we told it: do not make any assumptions about the values. However, we may know certain constraints which can make the volatile keyword redundant, like knowing that nested interrupts are not enabled. That way nothing will interrupt our ISR. The same conclusion comes to mind if we use semaphores, mutexes or critical sections whenever a thread tries to access the variables. There are some exceptions I will comment on near the end. If we remove the volatile specifiers from both x and y we get:

2d7c: push {r1, r2, r3}
2d80: ldr r1, [pc, #48] // r1 = &x
2d84: ldr r2, [r1] // r2 = x
2d88: cmp r2, #0 // r2 == 0?
2d8c: ldrne r3, [pc, #40] // only if neq
2d90: ldrne r3, [r3] // only if neq
2d94: addne r3, r3, r2 // only if neq
2d98: strne r3, [r1] // only if neq
2d9c: ldreq r3, [pc, #24] // only if eq
2da0: ldreq r3, [r3] // only if eq
2da4: lsleq r3, r3, #1 // only if eq
2da8: ldreq r2, [pc, #8] // only if eq
2dac: streq r3, [r2] // only if eq
2db0: pop {r1, r2, r3}
2db4: subs pc, lr, #4
2db8: .word 0x40004900
2dbc: .word 0x40004904

Looks like GCC is doing some black magic! The conditional store, add and load instructions help the compiler to avoid branches. The total number of instructions was reduced from 20 to 15. This is a simple example, but in a complex one the register pushing/popping caused by volatile variables can slow things down even more. As said before, we may need the volatile specifier in certain situations, but if we know when to avoid it the final code can be much more concise.
Recalling the mutexes here is another example:

extern void TakeMutex(void);
extern void GiveMutex(void);

int x,y;

int getSum( void ) {
    int temp;
    /* This should be in the critical section,
     * however it's here to show what happens */
    if ( x < 0 )
        return 0;
    if ( y < 0 )
        return 0;

    TakeMutex();
    temp = x + y;
    GiveMutex();

    return temp;
}

Note that line 10 and line 12 have atomic reads since this is a 32-bit architecture and int is 32-bit for this compiler. Either one or both of them are unlikely to be true for 8-bit and 16-bit processors, so it's not a good practice if we're looking for portability. Assembly output results in:

2e14: push {r4, r5, lr}
2e18: ldr r5, [pc, #64]
2e1c: ldr r3, [r5]
2e20: cmp r3, #0 ; 0x0
2e24: blt 2e50
2e28: ldr r4, [pc, #52]
2e2c: ldr r3, [r4]
2e30: cmp r3, #0 ; 0x0
2e34: blt 2e50
2e38: bl 2d7c (takemutex)
2e3c: ldr r2, [r4]
2e40: ldr r3, [r5]
2e44: add r4, r2, r3
2e48: bl 2d80 (givemutex)
2e4c: b 2e54
2e50: mov r4, #0 ; 0x0
2e54: mov r0, r4
2e58: pop {r4, r5, lr}
2e5c: bx lr
2e60: .word 0x40004900
2e64: .word 0x40004904

Since TakeMutex() and GiveMutex() are proper functions (defined somewhere else) GCC doesn't know what they will or won't do to x and y, so the code will read them again after the function calls. The only values that were cached are the variables' addresses, which of course won't change.
However, if TakeMutex() and GiveMutex() are macros we may get into trouble:

#define TakeMutex() asm volatile("nop")
#define GiveMutex() asm volatile("nop")

int getSum( void ) {
    int temp;
    /* This should be in the critical section,
     * however it's here to show what happens */
    if ( x < 0 )
        return 0;
    if ( y < 0 )
        return 0;

    TakeMutex();
    temp = x + y;
    GiveMutex();

    return temp;
}

I accept that a nop won't protect anything, it's just a snippet. Here is the resulting assembly code:

2dcc: ldr r3, [pc, #48]
2dd0: ldr r2, [r3]
2dd4: cmp r2, #0
2dd8: blt 2dfc
2ddc: ldr r3, [pc, #36]
2de0: ldr r0, [r3]
2de4: cmp r0, #0 ; 0x0
2de8: blt 2dfc
2dec: nop
2df0: add r0, r0, r2
2df4: nop
2df8: bx lr
2dfc: mov r0, #0 ; 0x0
2e00: bx lr
2e04: .word 0x40004900
2e08: .word 0x40004904

Values were cached so it will be a mess, a difficult error to track too. We can avoid this by declaring the variables as volatile.
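
Another option with GCC, not covered by the listings above, is to keep the variables non-volatile and add a "memory" clobber to the inline assembly, which tells the compiler not to keep memory values cached in registers across that statement. The sketch below reuses the macro names from the example; the empty asm body is an assumption for illustration and, like the nop, provides no real locking.

```c
/* Compiler barrier: the "memory" clobber tells GCC that memory may have
 * changed, so values cached in registers must be reloaded afterwards.
 * It emits no instruction and, like the nop above, protects nothing. */
#define TakeMutex() __asm__ volatile("" ::: "memory")
#define GiveMutex() __asm__ volatile("" ::: "memory")

int x, y;

int getSum(void)
{
    int temp;

    TakeMutex();
    temp = x + y; /* x and y are read here, after the barrier */
    GiveMutex();

    return temp;
}
```

This only constrains the compiler; a real critical section still has to disable interrupts or take an actual lock.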

We could avoid all these problems by declaring everything 'suspicious' as volatile, but if we want to optimize the code and make it really tight while still programming in C then the non-volatile approach is valid too, provided we know what we are doing and which assumptions hold for the variables' types and values.

UPDATE:There is a continuation on this topic here

Wednesday, November 19, 2008

Writing portable Embedded code - Pin portability

EDIT: I've posted a C++ alternative using templates here.

Portability can be hard on embedded systems. The fact that we can code in C most of the time doesn't mean code is portable. Even worse, it means that you may have to rewrite many functions/macros in order to get the 'same' code to work on another platform/chip.

A nice approach is to write a simple yet powerful HAL (Hardware Abstraction Layer), sometimes called a driver. There are many protocols and peripherals whose functions are nearly the same between different vendors and chips, such as I2C, SPI, UART and MCI controllers. For SPI we would code at least three functions: SPI_init(), SPI_tx() and SPI_rx(). Actually SPI_tx() and SPI_rx() might be the same due to SPI's full duplex capability. There might be other functions or macros to control the chip select line. This approach works nicely for standard peripherals, but we may need to access port pins individually to perform some task or to communicate with a device by bit-banging.

Most LCD character displays work the same way and use the same protocol and control lines. A 'universal' C module for character LCD control sounds great, but pin compatibility should be addressed first: such a module is not attractive if we need to change tens of lines to port it.

We'll first define some useful token-concatenating macros:

#define _CAT3(a,b,c)  a## b ##c
#define CAT3(a,b,c)   _CAT3(a,b,c)

#define _CAT2(a,b)    a## b
#define CAT2(a,b)     _CAT2(a,b)

Each macro goes through two levels so that arguments which are themselves macros are expanded before the tokens are pasted together. Kernighan and Ritchie's wonderful C book has good information about how that works. These macros can be defined in a global header file so they can be included whenever they're needed.
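
To see why the extra level matters, here is a small sketch that can be compiled on a PC. LETTER is a hypothetical macro holding a port letter, and PORTB and PORTLETTER are plain variables standing in for registers; none of them come from a real header.

```c
#define _CAT2(a,b)    a## b
#define CAT2(a,b)     _CAT2(a,b)

#define LETTER B      /* hypothetical: the port letter is chosen elsewhere */

/* Plain variables standing in for hardware registers (assumption). */
static unsigned char PORTB;
static unsigned char PORTLETTER;

void paste_demo(void)
{
    CAT2(PORT, LETTER)  = 1;  /* LETTER is expanded first: writes PORTB   */
    _CAT2(PORT, LETTER) = 2;  /* pasted as written: writes PORTLETTER     */
}
```

Calling the inner macro directly pastes the argument before it is expanded, which is almost never what we want. As a side note, C reserves identifiers that begin with an underscore followed by a capital letter, so a name like CAT2_ would be a safer choice for the helper in new code.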

Basic operations on port pins include setting a pin as output or input, clearing a bit, setting a bit and reading its value when configured as input. With that in mind I defined another header file, specially made for the Philips LPC23xx family (ARM7), which looks like this:

/* Set bit */
#define FPIN_SET(port,bit) CAT3(FIO,port,SET) = (1<<(bit))
#define FPIN_SET_(port_bit) FPIN_SET(port_bit)

/* Clear bit */
#define FPIN_CLR(port,bit) CAT3(FIO,port,CLR) = (1<<(bit))
#define FPIN_CLR_(port_bit) FPIN_CLR(port_bit)

/* Set as input */
#define FPIN_AS_INPUT(port,bit) CAT3(FIO,port,DIR) &=~(1<<(bit))
#define FPIN_AS_INPUT_(port_bit) FPIN_AS_INPUT(port_bit)

/* Set as output */
#define FPIN_AS_OUTPUT(port,bit) CAT3(FIO,port,DIR) |= (1<<(bit))
#define FPIN_AS_OUTPUT_(port_bit) FPIN_AS_OUTPUT(port_bit)

/* when used as input */
#define FPIN_ISHIGH(port,bit) ( CAT3(FIO,port,PIN) & (1<<(bit)) )
#define FPIN_ISHIGH_(port_bit) FPIN_ISHIGH(port_bit)

/* returns !=0 if pin is LOW */
#define FPIN_ISLOW(port,bit) (!( CAT3(FIO,port,PIN)& (1<<(bit)) ))
#define FPIN_ISLOW_(port_bit) FPIN_ISLOW(port_bit)

With this done we can set bit 2.1 by issuing FPIN_SET(2,1), or clear it with FPIN_CLR(2,1). The macros ending with an underscore are meant to be used when the pin position is given as a #define macro, such as:

#define LEDA 2,1
#define LEDB 2,1


I agree this may sound complicated, but by defining all these macros it's possible to manipulate all port pins easily and in a portable way. If we want to change the pin or port LEDA is using we only need to change it in one place and the macros will take care of the rest.
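
A short usage sketch of the underscore variants follows. The FIO2DIR, FIO2SET and FIO2CLR names match what the macros generate for port 2, but here they are plain variables so the example builds on a PC; on the real chip they come from the device header. The leda_* function names are hypothetical.

```c
#include <stdint.h>

#define _CAT3(a,b,c)  a## b ##c
#define CAT3(a,b,c)   _CAT3(a,b,c)

/* Stand-ins for the fast GPIO registers (assumption, host build only). */
static volatile uint32_t FIO2DIR, FIO2SET, FIO2CLR;

#define FPIN_SET(port,bit)         CAT3(FIO,port,SET) = (1<<(bit))
#define FPIN_SET_(port_bit)        FPIN_SET(port_bit)
#define FPIN_CLR(port,bit)         CAT3(FIO,port,CLR) = (1<<(bit))
#define FPIN_CLR_(port_bit)        FPIN_CLR(port_bit)
#define FPIN_AS_OUTPUT(port,bit)   CAT3(FIO,port,DIR) |= (1<<(bit))
#define FPIN_AS_OUTPUT_(port_bit)  FPIN_AS_OUTPUT(port_bit)

#define LEDA 2,1   /* port 2, bit 1 */

/* LEDA expands to "2,1" and is then split into port and bit. */
void leda_init(void) { FPIN_AS_OUTPUT_(LEDA); }  /* FIO2DIR |= (1<<1) */
void leda_on(void)   { FPIN_SET_(LEDA); }        /* FIO2SET  = (1<<1) */
void leda_off(void)  { FPIN_CLR_(LEDA); }        /* FIO2CLR  = (1<<1) */
```

This relies on the preprocessor rescanning the expanded text, so that the comma inside LEDA separates two arguments on the second pass. GCC behaves this way; it is worth checking with other compilers before relying on it.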

If we were to do the same on an AVR, it's just a matter of changing the macros as shown below. Don't forget that AVR ports are named with letters (A,B,C,D...) rather than numbers.

/* Set bit */
#define FPIN_SET(port,bit) CAT2(PORT,port) |= (1<<(bit))
#define FPIN_SET_(port_bit) FPIN_SET(port_bit)

/* Clear bit */
#define FPIN_CLR(port,bit) CAT2(PORT,port) &=~(1<<(bit))
#define FPIN_CLR_(port_bit) FPIN_CLR(port_bit)

/* Set as input */
#define FPIN_AS_INPUT(port,bit) CAT2(DDR,port) &= ~(1<<(bit))
#define FPIN_AS_INPUT_(port_bit) FPIN_AS_INPUT(port_bit)

/* Set as output */
#define FPIN_AS_OUTPUT(port,bit) CAT2(DDR,port) |= (1<<(bit))
#define FPIN_AS_OUTPUT_(port_bit) FPIN_AS_OUTPUT(port_bit)

/* when used as input */
#define FPIN_ISHIGH(port,bit) ( CAT2(PIN,port) & (1<<(bit)) )
#define FPIN_ISHIGH_(port_bit) FPIN_ISHIGH(port_bit)

/* returns !=0 if LOW */
#define FPIN_ISLOW(port,bit) (!( CAT2(PIN,port) & (1<<(bit))) )
#define FPIN_ISLOW_(port_bit) FPIN_ISLOW(port_bit)

Now the LCD routines are really portable. Minor changes might be needed if there are other pin registers to modify, but the basic pin functionality is covered by the macros defined above.

Tuesday, November 18, 2008

FreeRTOS - Coding _real_ real-time interrupts

Ever since I discovered the RTOS world I've been amazed at how fast, simple and beautiful coding can become.

Using an RTOS for devices with hard real-time constraints can get quite difficult, especially if the developer doesn't know how the RTOS works internally. This is particularly true for real-time interrupts, where latency plays an important role and has to be kept as low as possible.

Successful semaphore and queue synchronization (among other RTOS facilities) requires interrupts to be disabled during some processing. Interrupt latency will vary depending on how often those resources are checked. If we want to make sure certain interrupts are serviced as fast as possible we must give them a way to be processed independently of how the RTOS and our application are behaving.

Things get worse when a TCP/IP stack interacts with the RTOS tasks. As an example, lwIP is able to work with an RTOS like FreeRTOS. To do so we need to let lwIP define critical sections; if we don't, we put system stability at risk. Other data processing tasks or code might need critical sections too. If that involves disabling interrupts then interrupt latency will increase, and we may lose the first two letters of 'RTOS'.

A trick to overcome this is to use interrupt priorities (if available) so that time-critical interrupts are placed in one priority group, while normal (i.e. RTOS) interrupts are in another. As an example, the ARM7 LPC2xxx family from Philips has an interrupt priority mask register where individual interrupt priority groups can be masked or unmasked.

There is an important issue to consider: don't use RTOS queues/semaphores/etc. from interrupts that are not disabled by the RTOS (the time-critical interrupts). There won't be any atomic protection or critical sections for them.

This might look as if we were losing the great advantages of using an RTOS, but usually that can be solved by implementing manual queues where they're needed. Also, if real time is such an issue for those interrupts, context switching is very likely to be a problem too. That means the action should be taken directly from the interrupt, if possible.

If the interrupt belongs to an encoder the only action to take is to increment or decrement a variable. If you need to provide audio data to a DAC or another peripheral you just send data from an array (usually a double-buffered one) to the corresponding register.
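
A manual queue of that kind can be as small as a ring buffer with one writer and one reader. The sketch below assumes a single-core chip (as with the ARM7 parts discussed here), exactly one producer (the time-critical ISR) and one consumer (a task). All names are hypothetical, and a multi-core system would additionally need memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 16u   /* must be a power of two */

typedef struct {
    volatile uint16_t head;            /* written only by the ISR  */
    volatile uint16_t tail;            /* written only by the task */
    volatile int16_t  data[RB_SIZE];
} ringbuf_t;

/* Called from the time-critical ISR; never blocks. */
bool rb_put(ringbuf_t *rb, int16_t value)
{
    uint16_t head = rb->head;

    if ((uint16_t)(head - rb->tail) == RB_SIZE)
        return false;                          /* full: sample dropped */

    rb->data[head & (RB_SIZE - 1u)] = value;   /* store the data first */
    rb->head = (uint16_t)(head + 1u);          /* then publish it      */
    return true;
}

/* Called from a task; no RTOS call or critical section involved. */
bool rb_get(ringbuf_t *rb, int16_t *out)
{
    uint16_t tail = rb->tail;

    if (tail == rb->head)
        return false;                          /* empty */

    *out = rb->data[tail & (RB_SIZE - 1u)];
    rb->tail = (uint16_t)(tail + 1u);
    return true;
}
```

Because each index is written by only one side, neither the ISR nor the task has to disable interrupts to use the buffer. The indices run freely and are masked on access, which is why the size has to be a power of two.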

The only change to be made to FreeRTOS is inside the port's code, in particular the macros responsible for disabling and re-enabling interrupts in portmacro.h. If Thumb mode is needed, portISR.c needs to be changed too.

#define portDISABLE_INTERRUPTS()    do { VICSWPrioMask = (1<<1); } while(0)
#define portENABLE_INTERRUPTS()     do { VICSWPrioMask = 0xFFFF; } while(0)

It's important to remember that we don't need to save or restore any context information in our critical interrupt handlers, since they won't interact with the RTOS (at least not directly). Such an ISR is coded like any ISR without an RTOS, which makes it faster than the ISRs that need queue or semaphore management. Of course, real critical sections are still needed when sharing information with our ISR, but that can be done by disabling all the interrupts, just like portDISABLE_INTERRUPTS() did before we changed it.
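
As a rough sketch of such a real critical section, the task below copies two related values written by the critical ISR while every interrupt is masked. irq_save_all() and irq_restore() are hypothetical hooks, stubbed here so the example builds on a PC; on the target they would mask and restore all interrupts at the CPU level. The encoder structure is also an assumption.

```c
#include <stdint.h>

/* Hypothetical hooks, stubbed for a host build. */
static int irq_masked;

static uint32_t irq_save_all(void)
{
    uint32_t previous = (uint32_t)irq_masked;
    irq_masked = 1;                  /* target: mask every interrupt */
    return previous;
}

static void irq_restore(uint32_t previous)
{
    irq_masked = (int)previous;      /* target: restore the old mask */
}

/* Written only by the time-critical ISR. */
typedef struct {
    int32_t  position;
    uint32_t timestamp;
} encoder_state_t;

static volatile encoder_state_t encoder;

/* Task side: read both fields as one consistent pair. */
encoder_state_t encoder_snapshot(void)
{
    encoder_state_t copy;
    uint32_t previous = irq_save_all();

    copy.position  = encoder.position;
    copy.timestamp = encoder.timestamp;

    irq_restore(previous);
    return copy;
}
```

The masked region is only two loads long, so the extra latency seen by the critical interrupt stays small and bounded.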

UPDATE: Later I found that portDISABLE_INTERRUPTS() and portENABLE_INTERRUPTS() are not the only macros used for enabling/disabling interrupts. There are also functions named vPortEnterCritical() and vPortExitCritical() which are used extensively throughout the FreeRTOS code, so those should be changed too. However, I haven't tried this yet.