Tuesday, January 27, 2009

Quick text compression

Recently I faced a problem related to short text compression. The idea was to reduce the space needed to save SMS messages. Since each SMS contains up to 160 characters the classic LZW or Huffman methods won't work out-of-the-box, mostly because of the size of the resulting dictionary. It is even worse if we consider that most messages are less than 80 characters long.

Finally I decided to use a fixed dictionary and a sort of hybrid Huffman code compression (which isn't Huffman at all, but it retains some similarity -somehow). Then I started looking for letter, digraph and trigraph probability in words and texts. There are many resources on the net like this and this one where symbol probability is listed for different languages, including Spanish which was the one I used.

The compression algorithm tries to use the minimum number of bits to encode frequent letters/digraphs/trigraphs. It is not optimal since the dictionary is fixed but it does a good job, reducing message size to 60-70% on most cases. If the right text is picked then it can be reduced to 30% but that is cheating, of course! The space character is one of the most used ones along with the 'e' letter (Spanish at least). Trigraphs and digraphs also play an important role in compression.

Lower/uppercase letters is another issue, but since SMS messages are mostly written in lowercase or uppercase that is not a problem. A trick is to invert the whole text to see which text gets the best compression ratio. This is quite fast since the algorithm is simple and the strings are short. Another trick would be to provide more than one dictionary (maybe three or four) and see which one does better with the desired message. The resulting space overhead is about two or three bits which should be acceptable for long messages.

Another possibility is to compress many messages together with Huffman or any other compression method. The drawback is that the message won't be a unit itself and then message management becomes messy.

No comments:

Post a Comment