Sunday, November 18, 2007

Compress Binary Messages Using SQL

Compress Binary Messages Using SQL

Data compression is a great method for maximizing data storage space and making data communication faster. However, compression and decompression of binary data sometimes can be quite tricky. Learn a few useful data-compression techniques. 

Binary data is natural for computers. Everything running on the computer's physical layer, in the registers, ALU, memory, and so on, is just a result of manipulating a set of 0's and 1's, where 0 and 1 are the logical representations of two different voltage levels. While very convenient for machine processing, base-two numeral systems have some drawbacks. For one, each binary digit can keep only two values: 0 or 1. That means large binary numbers will produce very long binary messages, which cause problems with data transfer, processing, and storage.

One solution to the problem of long binary messages is data encoding that shortens the original message—in another words, data compression. You can employ any number of compression algorithms to compress data, but what happens under the hood? Rather than discuss the fairly large number of compression algorithms out there, this article focuses on the technical aspects of data compression, namely binary data manipulations that can be useful for any compression algorithm.

Author's Note: This article developed as a SQL/Transact-SQL solution to a problem called "Bit Compressor" from the ACM International Collegiate Programming Contest (ACM-ICPC) 2006 World Finals (click here to view the problem). While the solution technique(s) can serve as an approach to completing practical tasks (generating and manipulating binary numbers, encrypting data using patterns, and detecting and fixing data transmission errors), the problem primarily serves as a brainteaser.

How to Generate Binary Numbers in SQL
While binary data is native for computers, using binary data directly in the programming layers—above assemblers and C—is very inconvenient because the human eye and brain are much more familiar with decimal numbers. In addition, binary numbers can occupy a lot of storage space when represented in long character strings. This may be why SQL and SQL Server don't have programming interfaces that allow direct manipulation of binary data. If you want to generate binary numbers with either technology, you need to find your own way to do it.

One approach for producing binary numbers is generating decimal numbers (which is very easy to do in SQL Server or SQL) and then converting them to their binary equivalents. To convert data from an n-base (decimal) to an m-base (binary) numeral system, you can use the following well-known algorithm:

  1. Take the number in the n-base system and repeatedly divide it by the radix of the m-base numeral system.
  2. Taken in order from last to first, the remainders of your divisions will give you the converted number.

For example, if you want to convert 8710 to its binary equivalent, you need to take 87 and repeatedly divide it by 2 as follows:


87/2 = 43, remainder 1
43/2 = 21, remainder 1
21/2 = 10, remainder 1
10/2 = 5, remainder 0
5/2 = 2, remainder 1
2/2 = 1 remainder 0

When you attach the remainders of all the intermediate results to the result of the last division, you get 1010111 = 26 + 24 + 22 + 21 + 20 = 64 + 16 + 4 + 2 + 1 = 8710. This approach works, but it may become very inefficient if you need to generate lots of binary numbers.

If you have hexadecimal numbers, you can use another simple conversion approach: just replace each hexadecimal digit by its binary equivalent. For example, you can convert E7016 = 14 * 162 + 7 * 161 = 369610 to 1110011100002 = 211 + 210 + 29 + 26 + 25 + 24 = 369610.

Listing 1 shows a more interesting method for generating binary numbers with SQL cross joins.

Listing 1. How to Generate Binary Numbers Using Cross Joins


SET NOCOUNT ON
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.bin')
AND type in (N'U'))
DROP TABLE dbo.bin

CREATE TABLE bin(colBin varchar(50));
INSERT INTO bin VALUES('0');
INSERT INTO bin VALUES('1');

SELECT (b1.colBin + b2.colBin) as binNumber
FROM bin b1 CROSS JOIN bin b2
ORDER BY 1;

Result:

binNumber
----------------------------------------
00
01
10
11


Incrementing the number of cross joins by one and adding one more addend to the select list in the query from Listing 1, you will get eight binary numbers (see Listing 2).

Listing 2. How to Generate More Binary Numbers Using Cross Joins


SELECT (b1.colBin + b2.colBin + b3.colBin) as binNumber
FROM bin b1 CROSS JOIN bin b2
CROSS JOIN bin b3
ORDER BY 1;

Result:

binNumber
------------------------------------------------------------
000
001
010
011
100
101
110
111


The number of rows in the result (the amount of generated binary numbers) is 2 to the power of the number of tables participating in the cross joins (in this case, 23 = 8). If you continue incrementing the number of tables participating in cross joins together with the number of addends in the select list of the query, you theoretically can get any amount of binary numbers.

The problem with this approach is the growing query. If, for example, you need to generate all possible binary numbers in 20 digits, you need to produce the query with 20 cross joins and 20 addends in the select list. The script in Listing 3 helps solve the problem of a growing query.

I ran the script for the @binLen = 20 and generated all possible binary numbers in 20 digits: 220 – 1 = 1,048,575 numbers. To make it more convenient, I loaded the result into the table binFinal. That table stores all binary numbers together with their decimal equivalents.

Data Compression
To apply what you've learned about binary numbers to data compression, follow the example in this section, which compresses binary numbers by replacing each maximal sequence of n 1's with the binary number n. The example does this only when the replacement makes the original message shorter. For example, for the original message 10111111111111111000111011 (26 bits), you will get the compressed message 10111100011011 (14 bits). (You actually can apply the same algorithm to a sequence of 0's.)

Replacing a maximal sequence of n 1's with the binary number n means you can't replace just a subset of 1's from the sequence of 1's. For example, you can't take five 1's from the 15 in the above example. You need to replace an entire piece, which will be flanked by 0's on both sides (as is the case when 1's are in the middle of a binary sequence) or bounded by a zero on one side (as is the case when 1's are at the very beginning or at the very end of a binary sequence).

Because this example mandates that replacement has to shorten the message, there is a lower limit to the sequences of 1's that you can replace. That limit is equal to three 1's. Indeed, 111 can be replaced by 11, which shortens the length of the binary message by one digit. Conversely, 11 can be replaced by 10, which will not shorten the length of the message.

Even though the smallest unit of information is one bit, processing data bit by bit is very inconvenient. Usually data is stored, transferred, or processed by bytes or by words that consist of 1, 2, 4, or 8 bytes. For my tests, I used 20-bit binary numbers generated in the previous example. You should align byte operations on those numbers to the nearest byte boundary (24 bits). However, for the purposes of data compression, the data doesn't need to be aligned to the byte boundary. It doesn't even need to keep the leading zeros, which I've deleted to make the remaining transformations lighter (see Listing 4).

Listing 4. Deleting Leading Zeros


SET NOCOUNT ON;
DELETE binFinal WHERE decNum = 0;
UPDATE binFinal
SET binNum = RIGHT(binNum, LEN(binNum) - PATINDEX('%[^0]%',binNum) + 1);

SELECT * FROM binFinal;

Result:

decNum binNum
----------- ----------------------------
1 1
2 10
3 11
4 100
5 101
6 110
7 111
8 1000

. . . . . . . . . . . . . . . . . . . .

11219 10101111010011
11220 10101111010100

. . . . . . . . . . . . . . . . . . . .

1048573 11111111111111111101
1048574 11111111111111111110
1048575 11111111111111111111


Now it's time to try compressing the generated binary numbers in accordance with the example rules. First of all, create and load a new table similar to binFinal but with one more column that will store compressed data (see Listing 5).

Listing 5. Create and Prepare Table for Data Compression


SET NOCOUNT ON
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.binWithCompr')
AND type in (N'U'))
DROP TABLE dbo.binWithCompr;

CREATE TABLE dbo.binWithCompr(
decNum int PRIMARY KEY,
originSeq varchar(50),
compresSeq varchar(50))
INSERT INTO binWithCompr
SELECT decNum, binNum, binNum FROM binFinal;

SELECT * FROM dbo.binWithCompr;

Result:

decNum originSeq compresSeq
----------- ----------------------- --------------
1 1 1
2 10 10
3 11 11
4 100 100

. . . . . . . . . . . . . . . . . .


Next, generate all the patterns of 1's presented in the binary sequences of the binFinal table (see Listing 6). The patterns can vary from 111 to 111111111111111111111.

Listing 6. Generate Patterns of 1's


DECLARE @maxLen int, @i int;
SELECT @maxLen = LEN(originSeq) FROM binWithCompr
WHERE decNum = (SELECT MAX(decNum) FROM binWithCompr)
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.patterns')
AND type in (N'U'))
DROP TABLE dbo.patterns

CREATE TABLE dbo.patterns(c1 varchar(100) PRIMARY KEY, decLength int);
SET @i = 3;
WHILE(@i <= @maxLen)
BEGIN
INSERT INTO dbo.patterns VALUES(REPLICATE('1', @i), @i);
SELECT @i = @i + 1;
END
SELECT * FROM dbo.patterns;

Result:

c1 decLength
-------------------------------- ------------
111 3
1111 4
11111 5
111111 6
1111111 7
11111111 8

. . . . . . . . . . . . . . . . . .

1111111111111111111 19
11111111111111111111 20


Next, replace each maximal pattern of 1's you found with its binary equivalent. The entire compression requires two steps:


  1. Replace all maximal sequences of n 1's with the decimal version of n, surrounded by '*'.
  2. Replace the decimal numbers with their binary equivalents and remove '*'.

Listing 7 describes the first compression step. Listing 8 is the final compression step.

Message Decompression
When you have a table with original and compressed values, you can easily decompress data. All you need to do is query the table and find the original message corresponding to what you just received. Listing 9 is an example of message decompression.
Listing 9. Message Decompression


DECLARE @compresSeq VARCHAR(20);
SELECT @compresSeq = '10001011';
SELECT * FROM binWithCompr
WHERE compresSeq = @compresSeq;

Result:

decNum originSeq compresSeq
----------- --------------------------------- ---------------
139 10001011 10001011
279 100010111 10001011
491 111101011 10001011
983 1111010111 10001011
18431 100011111111111 10001011
63487 1111011111111111 10001011
1048571 11111111111111111011 10001011


Oops, there's a problem. Seven different original messages can be compressed to the same binary number: 10001011. That means you need to get more information in order to find the correct answer. If you knew the number of 1's in the original message, it would definitely help. Listing 10 shows how you can find the number of 1's and/or 0's in all binary numbers in the table using only one query.

Listing 10. How to Find Number of 1's and/or 0's in Binary Messages


DECLARE @compresSeq VARCHAR(20);
SELECT @compresSeq = '10001011';

SELECT COUNT(*) AS num_of_1s,
SUBSTRING(t2.originSeq, t1.decNum,1) AS binVvalue,
t2.originSeq,
t2.compresSeq
FROM (SELECT decNum FROM binWithCompr
WHERE decNum <=
(SELECT MAX(LEN(originSeq)) FROM binWithCompr
WHERE compresSeq = @compresSeq)) t1
CROSS JOIN
(SELECT originSeq, compresSeq FROM binWithCompr
WHERE compresSeq = @compresSeq) t2
WHERE SUBSTRING(t2.originSeq, t1.decNum,1) = '1'
GROUP BY SUBSTRING(t2.originSeq, t1.decNum,1), t2.originSeq, t2.compresSeq;

Result:

num_of_1s binVvalue originSeq compresSeq
----------- -------------- ------------------- --------------------
7 1 111101011 10001011
12 1 100011111111111 10001011
19 1 11111111111111111011 10001011
8 1 1111010111 10001011
4 1 10001011 10001011
5 1 100010111 10001011
15 1 1111011111111111 10001011


As you can see, the information about the number of 1's in the original message received with the compressed message can help you uniquely identify the original message. But this is not always correct. In some cases, different original messages with the same number of 1's can be converted into the same compressed message. For example, if you run the script from Listing 10 for @compresSeq = '1010101', you will get the following result:


num_of_1s binValue originSeq compresSeq
----------- --------- ------------ -------------
7 1 111110101 1010101
10 1 11111011111 1010101
7 1 101011111 1010101
7 1 101111101 1010101
4 1 1010101 1010101

As you can see, three original messages have seven 1's and two 0's, which can be compressed into the message 1010101. Therefore, there is no way to uniquely identify the original message (unless you have some additional information about it). All that means is you should use this type of compression only in situations where some data loss is acceptable (lossy data compression/decompression such as in image processing and transmission).

Message Size Matters
The example uses a hash table I created to make the decompression process easier. For obvious reasons, the described approach can be applied only to relatively short (2-4 bytes) binary messages. The number of binary combinations in 32 bits is 232 = 4,294,967,296. Generating that amount of binary numbers is difficult but doable. Generating all binary combinations for messages that are hundreds or thousands of bits in length, however, is impossible.

Nevertheless, you can use the techniques shown in this article in many data compression implementations and/or when you need to manipulate binary data.