I was thinking about data privacy issues one evening, and gave some thought to how privacy might be achieved. One thought led to another, and eventually I reached the conclusion that, for highest efficiency, a logical precursor step would be to increase the information density of the data object (the file) before even attempting encryption. What follows here is some of what was written to effect that data compression.
My primary interest was in ASCII text files, but I wanted an approach that might also work for other data formats. I went back to basics and looked at the input file as a sequence of 8-bit bytes with, in the case of ASCII text files, one text character stored in each byte.
Wouldn’t it be nice, I joked to myself, if I could simply remove all the zero bits and pack all the ones together? That, of course, doesn’t really work - but I realized that some of the zeros can be removed during compression and restored during expansion. If the file consisted only of ASCII text, then the high-order bit of each byte could be discarded, and groups of eight seven-bit characters could be packed into groups of seven eight-bit bytes to effect a 12.5% compression.
Not bad, I thought, and took another look at my assumptions/premises. Suppose I only use some of the 128 available codes in my text? Suppose I use 64 or fewer codes - could I then come up with a scheme that allowed me to pack eight six-bit codes into six eight-bit bytes to effect a 25% compression? And what if my file uses 32 or fewer codes - could I pack eight five-bit codes into five bytes for a 37.5% compression?
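The arithmetic generalizes: a file that uses n unique codes needs ceil(log2(n)) bits per code, and the savings come to 1 - bits/8. A quick sketch of that relation (the function name is mine, not from the original programs):

```python
import math

def code_bits(n_unique):
    """Smallest bit width that can represent n_unique distinct codes."""
    return max(1, math.ceil(math.log2(n_unique)))

# The cases discussed above: 128, 64, and 32 unique codes.
for n in (128, 64, 32):
    bits = code_bits(n)
    print(f"{n:3d} codes -> {bits} bits/code -> {1 - bits / 8:.1%} compression")
# -> 128 codes -> 7 bits/code -> 12.5% compression
# ->  64 codes -> 6 bits/code -> 25.0% compression
# ->  32 codes -> 5 bits/code -> 37.5% compression
```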
The approach I followed was to inventory all the codes used in the input file to discover which and how many unique codes were used, and I wrote this program to produce inventories of the byte values:
And here is sample console output for an 1175-byte ASCII text file named file.dat:
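The original listing and its console output aren't reproduced here, but the inventory pass can be sketched as follows (a hypothetical reconstruction - the names and the report format are mine):

```python
def inventory(data):
    """Count occurrences of each of the 256 possible byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts

data = b"ABRACADABRA"          # stand-in for the contents of file.dat
counts = inventory(data)
used = [v for v in range(256) if counts[v]]
print(f"{len(data)} bytes, {len(used)} unique codes used")
for v in used:
    print(f"  {v:3d} ({chr(v)!r}): {counts[v]}")
```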
[ It occurred to me later that I should have extended this downward to files of four or fewer unique codes to accommodate, for example, files consisting only of the characters A, C, T, and G - which would provide 75% compression of text DNA sequences. Please feel free to do that. ]
The next step was to assign each of the inventoried codes a substitution value starting with zero for the first inventoried code, one for the next, and so on until each of the original codes had a substitution value less than the total number of unique codes.
Once the inventory was complete, the number of unique codes determined, and the substitution values known, it was a simple matter to replace each character in the original file with its substitution value.
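In Python terms, that substitution step might look like this (a sketch with a stand-in input, not the original code):

```python
data = b"ABRACADABRA"                       # stand-in input file contents
used = sorted(set(data))                    # the inventoried codes: A B C D R
subst = {v: i for i, v in enumerate(used)}  # original code -> substitution value
translated = bytes(subst[b] for b in data)
print(list(translated))   # -> [0, 1, 4, 0, 2, 0, 3, 0, 1, 4, 0]
```

Five unique codes here, so each substitution value fits in three bits.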
Next, the number of unique codes was used to select the appropriate packing, and the entire file was packed into an output buffer.
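A general bit-packer covering all of the cases above might be sketched like this (MSB-first bit order is my assumption; the original may have ordered the bits differently):

```python
def pack(codes, bits):
    """Pack `bits`-wide codes into bytes, MSB-first, zero-padding the last byte."""
    acc = nbits = 0
    out = bytearray()
    for c in codes:
        acc = (acc << bits) | c    # append the code's bits to the accumulator
        nbits += bits
        while nbits >= 8:          # emit every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                      # flush any leftover bits, zero-padded
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

# Eight 5-bit codes pack into exactly five bytes (37.5% smaller):
print(len(pack([1, 2, 3, 4, 5, 6, 7, 8], 5)))   # -> 5
```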
The remainder of the activity comes under the category of “Housekeeping”. I wrote some header information (number of inventory bytes and substitution code bit size), the low-order byte of the file size, the bit-mapped code inventory, and the output buffer to disk.
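Assembled from that description, the on-disk layout might be sketched like so (the field order and the 32-byte bitmap are my reading of the text, not the original code):

```python
def build_output(used, bits, file_size, packed):
    """Header + bit-mapped code inventory + packed data."""
    bitmap = bytearray(32)             # one bit per possible byte value (256 bits)
    for v in used:
        bitmap[v >> 3] |= 1 << (v & 7)
    # Header: number of inventory bytes, code bit size, low-order byte of file size.
    header = bytes([len(bitmap), bits, file_size & 0xFF])
    return header + bytes(bitmap) + packed

blob = build_output(used=[65, 66, 67, 68, 82], bits=3, file_size=11,
                    packed=b"\x06\x04\x18\x30\x00")
print(len(blob))   # -> 40 (3 header bytes + 32 bitmap bytes + 5 data bytes)
```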
The code below actually outputs more information than is needed to do the job, but will serve to illustrate the method...
For test input I used my file.dat ASCII text file:
which the encode program compressed into:
It is interesting to note that the compressed data bears no resemblance to any particular data type – it could as easily have been floating point data or image data as text. Even though no attempt was made to encrypt the input, the data type already cannot be determined from the compressed data itself...
To reconstruct the original (pre-compression) files, I wrote this short program:
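That program isn't reproduced here, but its core - assuming the same MSB-first packing as the sketches above - would unpack the codes and map the substitution values back through the inventoried code table:

```python
def unpack(data, bits, count):
    """Extract `count` codes of `bits` width from MSB-first packed bytes."""
    acc = nbits = 0
    out = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= bits and len(out) < count:  # stop at count: last byte is padded
            nbits -= bits
            out.append((acc >> nbits) & ((1 << bits) - 1))
    return out

def decode(packed, used, bits, count):
    """Map substitution values back to the original inventoried codes."""
    return bytes(used[c] for c in unpack(packed, bits, count))

# Round trip of an "ABRACADABRA" example: 11 three-bit codes in five bytes.
print(decode(b"\x06\x04\x18\x30\x00", [65, 66, 67, 68, 82], 3, 11))
# -> b'ABRACADABRA'
```

Note that the code count is needed to know where the zero-padding in the final byte begins - which is why the header records the low-order byte of the original file size.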
It occurred to me later that I might also explore dividing files into segments, where each segment’s size would be determined by the maximum number of bits used to represent each of its data bytes. If I have time and if that appears to be a productive approach, I’ll add to this page.
If you’re interested in the privacy aspects, click on the “Documents” link below and take a look at my “Building Blocks for Data Privacy” page.