Huffman tree compressing/decompressing in C
In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.
My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.
My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.
Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.
Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).
- An exception is the last two functions
print_help
andmain
. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.
In order to simplify the idea, during designing and coding, I have assumed that
- the program will not be told to uncompress an invalid file, so there's no file validity check in the code
- file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way
- C library functions does not fail for environmental reasons (e.g. host is short of RAM for
malloc(3)
, target disk out of space forfwrite(3)
and consequentlywrite(2)
, orfread(3)
as said above) - reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome
so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.
I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make
output)
gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c
The last option is to suppress the warning about unused result from fread(3)
.
During my coding process, I run clang-format
occasionally and diff
the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.
* The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.
// File: huffman.c
// Author: iBug
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
typedef unsigned char byte;
typedef struct _HuffNode {
unsigned data;
struct _HuffNode *left, *right, *parent;
} HuffNode;
void count_frequency(FILE* fp, unsigned* freq) {
size_t orig_pos = ftell(fp);
int ch;
while (1) {
ch = fgetc(fp);
if (ch < 0)
break;
freq[ch]++;
}
fseek(fp, orig_pos, SEEK_SET);
}
void construct_huffman(unsigned* freq_in, HuffNode* tree) {
int count = 256;
unsigned freq[256];
HuffNode *node[256];
// Initialize data
for (int i = 0; i < 256; i++) {
freq[i] = freq_in[i];
tree[i].data = i;
tree[i].left = tree[i].right = NULL;
node[i] = &tree[i];
}
// Sort by frequency, decreasing order
/* WARNING: Although this Quick Sort is an unstable sort,
* it should at least give the same result for the same input frequency table,
* therefore I'm leaving this code here
*/
{
unsigned lower[256], upper[256], top = 1;
lower[0] = 0, upper[0] = 256;
while (top > 0) {
top--;
int left = lower[top], right = upper[top];
int i = left, j = right - 1, flag = 0;
if (i >= j) // Nothing to sort
continue;
while (i < j) {
if (freq[i] < freq[j]) {
unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
flag = !flag;
}
flag ? i++ : j--;
}
lower[top] = left, upper[top] = i;
lower[top + 1] = i + 1, upper[top + 1] = right;
top += 2;
}
}
// Construct tree
while (count > 1) {
int pos = 512 - count;
HuffNode *parent = &tree[pos];
// Select lowest 2 by freq
int i = count - 2, j = count - 1;
// Create tree, lower freq left
parent->left = node[j]; parent->right = node[i];
node[j]->parent = node[i]->parent = parent;
node[i] = parent;
freq[i] += freq[j];
// Insert
for (; i > 0 && freq[i] > freq[i - 1]; i--) {
unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
}
count--;
}
// Now HEAD = node[0] = tree[511]
node[0]->parent = NULL;
}
void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
int n;
byte ch;
byte buf = 0, nbuf = 0;
HuffNode *p;
byte code[256];
while (1) {
n = fgetc(fin);
if (n < 0)
break;
ch = n;
// Encode
p = &tree[ch];
n = 0;
while (p->parent) {
if (p == p->parent->left) {
// Left is 0
code[n] = 0;
} else if (p == p->parent->right) {
code[n] = 1;
}
p = p->parent;
n++;
}
// Write
for (int i = n - 1; i >= 0; i--) {
buf |= code[i] << nbuf;
nbuf++;
if (nbuf == 8) {
fputc(buf, fout);
nbuf = buf = 0;
}
}
}
fputc(buf, fout);
*padding = 8 - nbuf;
}
void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
size_t startpos = ftell(fin); // should be 1028
fseek(fin, 0L, SEEK_END);
size_t endpos = ftell(fin); // last byte handling
fseek(fin, startpos, SEEK_SET);
int count = endpos - startpos;
byte buf = 0, nbuf = 0, bit;
HuffNode *p;
while (count > 0 || nbuf > 0) {
// Start from tree top
p = tree + 510;
while (p->left || p->right) {
// Prepare next bit if needed
if (nbuf == 0) {
if (count <= 0)
return;
buf = fgetc(fin);
if (count == 1) {
// Last bit
nbuf = 8 - padding;
if (nbuf == 0) {
return;
}
} else {
nbuf = 8;
}
count--;
}
// p has child
bit = buf & 1;
buf >>= 1;
nbuf--;
if (bit == 0)
p = p->left;
else
p = p->right;
}
fputc(p->data, fout);
}
}
void compress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
size_t padding_pos;
count_frequency(fin, freq);
construct_huffman(freq, tree);
rewind(fin);
for (int i = 0; i < 256; i++)
fwrite(freq + i, 4, 1, fout);
// Write a placeholder for the padding
padding_pos = ftell(fout);
fwrite(&padding, 4, 1, fout);
encode_stream(fin, fout, tree, &padding);
// Write the padding to the placeholder
fseek(fout, padding_pos, SEEK_SET);
fwrite(&padding, 4, 1, fout);
fclose(fin);
fclose(fout);
}
void uncompress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
for (int i = 0; i < 256; i++) {
fread(&padding, 4, 1, fin);
freq[i] = padding;
}
fread(&padding, 4, 1, fin);
construct_huffman(freq, tree);
decode_stream(fin, fout, tree, padding);
fclose(fin);
fclose(fout);
}
void print_help(void) {
puts("Usage: huffman (-c|-d) input output");
puts(" -c Compress file from input to output");
puts(" -d Uncompress file from input to output");
puts("nCreated by iBug");
}
int main(int argc, char** argv) {
if (argc != 4) {
print_help();
return 1;
}
if (!strcmp(argv[1], "-c")) {
compress_file(argv[2], argv[3]);
} else if (!strcmp(argv[1], "-d")) {
uncompress_file(argv[2], argv[3]);
} else {
print_help();
return 1;
}
return 0;
}
In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.
On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times. Thanks in advance!
algorithm c
New contributor
add a comment |
In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.
My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.
My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.
Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.
Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).
- An exception is the last two functions
print_help
andmain
. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.
In order to simplify the idea, during designing and coding, I have assumed that
- the program will not be told to uncompress an invalid file, so there's no file validity check in the code
- file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way
- C library functions does not fail for environmental reasons (e.g. host is short of RAM for
malloc(3)
, target disk out of space forfwrite(3)
and consequentlywrite(2)
, orfread(3)
as said above) - reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome
so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.
I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make
output)
gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c
The last option is to suppress the warning about unused result from fread(3)
.
During my coding process, I run clang-format
occasionally and diff
the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.
* The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.
// File: huffman.c
// Author: iBug
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
typedef unsigned char byte;
typedef struct _HuffNode {
unsigned data;
struct _HuffNode *left, *right, *parent;
} HuffNode;
void count_frequency(FILE* fp, unsigned* freq) {
size_t orig_pos = ftell(fp);
int ch;
while (1) {
ch = fgetc(fp);
if (ch < 0)
break;
freq[ch]++;
}
fseek(fp, orig_pos, SEEK_SET);
}
void construct_huffman(unsigned* freq_in, HuffNode* tree) {
int count = 256;
unsigned freq[256];
HuffNode *node[256];
// Initialize data
for (int i = 0; i < 256; i++) {
freq[i] = freq_in[i];
tree[i].data = i;
tree[i].left = tree[i].right = NULL;
node[i] = &tree[i];
}
// Sort by frequency, decreasing order
/* WARNING: Although this Quick Sort is an unstable sort,
* it should at least give the same result for the same input frequency table,
* therefore I'm leaving this code here
*/
{
unsigned lower[256], upper[256], top = 1;
lower[0] = 0, upper[0] = 256;
while (top > 0) {
top--;
int left = lower[top], right = upper[top];
int i = left, j = right - 1, flag = 0;
if (i >= j) // Nothing to sort
continue;
while (i < j) {
if (freq[i] < freq[j]) {
unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
flag = !flag;
}
flag ? i++ : j--;
}
lower[top] = left, upper[top] = i;
lower[top + 1] = i + 1, upper[top + 1] = right;
top += 2;
}
}
// Construct tree
while (count > 1) {
int pos = 512 - count;
HuffNode *parent = &tree[pos];
// Select lowest 2 by freq
int i = count - 2, j = count - 1;
// Create tree, lower freq left
parent->left = node[j]; parent->right = node[i];
node[j]->parent = node[i]->parent = parent;
node[i] = parent;
freq[i] += freq[j];
// Insert
for (; i > 0 && freq[i] > freq[i - 1]; i--) {
unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
}
count--;
}
// Now HEAD = node[0] = tree[511]
node[0]->parent = NULL;
}
void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
int n;
byte ch;
byte buf = 0, nbuf = 0;
HuffNode *p;
byte code[256];
while (1) {
n = fgetc(fin);
if (n < 0)
break;
ch = n;
// Encode
p = &tree[ch];
n = 0;
while (p->parent) {
if (p == p->parent->left) {
// Left is 0
code[n] = 0;
} else if (p == p->parent->right) {
code[n] = 1;
}
p = p->parent;
n++;
}
// Write
for (int i = n - 1; i >= 0; i--) {
buf |= code[i] << nbuf;
nbuf++;
if (nbuf == 8) {
fputc(buf, fout);
nbuf = buf = 0;
}
}
}
fputc(buf, fout);
*padding = 8 - nbuf;
}
void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
size_t startpos = ftell(fin); // should be 1028
fseek(fin, 0L, SEEK_END);
size_t endpos = ftell(fin); // last byte handling
fseek(fin, startpos, SEEK_SET);
int count = endpos - startpos;
byte buf = 0, nbuf = 0, bit;
HuffNode *p;
while (count > 0 || nbuf > 0) {
// Start from tree top
p = tree + 510;
while (p->left || p->right) {
// Prepare next bit if needed
if (nbuf == 0) {
if (count <= 0)
return;
buf = fgetc(fin);
if (count == 1) {
// Last bit
nbuf = 8 - padding;
if (nbuf == 0) {
return;
}
} else {
nbuf = 8;
}
count--;
}
// p has child
bit = buf & 1;
buf >>= 1;
nbuf--;
if (bit == 0)
p = p->left;
else
p = p->right;
}
fputc(p->data, fout);
}
}
void compress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
size_t padding_pos;
count_frequency(fin, freq);
construct_huffman(freq, tree);
rewind(fin);
for (int i = 0; i < 256; i++)
fwrite(freq + i, 4, 1, fout);
// Write a placeholder for the padding
padding_pos = ftell(fout);
fwrite(&padding, 4, 1, fout);
encode_stream(fin, fout, tree, &padding);
// Write the padding to the placeholder
fseek(fout, padding_pos, SEEK_SET);
fwrite(&padding, 4, 1, fout);
fclose(fin);
fclose(fout);
}
void uncompress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
for (int i = 0; i < 256; i++) {
fread(&padding, 4, 1, fin);
freq[i] = padding;
}
fread(&padding, 4, 1, fin);
construct_huffman(freq, tree);
decode_stream(fin, fout, tree, padding);
fclose(fin);
fclose(fout);
}
void print_help(void) {
puts("Usage: huffman (-c|-d) input output");
puts(" -c Compress file from input to output");
puts(" -d Uncompress file from input to output");
puts("nCreated by iBug");
}
int main(int argc, char** argv) {
if (argc != 4) {
print_help();
return 1;
}
if (!strcmp(argv[1], "-c")) {
compress_file(argv[2], argv[3]);
} else if (!strcmp(argv[1], "-d")) {
uncompress_file(argv[2], argv[3]);
} else {
print_help();
return 1;
}
return 0;
}
In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.
On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times. Thanks in advance!
algorithm c
New contributor
add a comment |
In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.
My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.
My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.
Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.
Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).
- An exception is the last two functions
print_help
andmain
. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.
In order to simplify the idea, during designing and coding, I have assumed that
- the program will not be told to uncompress an invalid file, so there's no file validity check in the code
- file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way
- C library functions does not fail for environmental reasons (e.g. host is short of RAM for
malloc(3)
, target disk out of space forfwrite(3)
and consequentlywrite(2)
, orfread(3)
as said above) - reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome
so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.
I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make
output)
gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c
The last option is to suppress the warning about unused result from fread(3)
.
During my coding process, I run clang-format
occasionally and diff
the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.
* The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.
// File: huffman.c
// Author: iBug
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
typedef unsigned char byte;
typedef struct _HuffNode {
unsigned data;
struct _HuffNode *left, *right, *parent;
} HuffNode;
void count_frequency(FILE* fp, unsigned* freq) {
size_t orig_pos = ftell(fp);
int ch;
while (1) {
ch = fgetc(fp);
if (ch < 0)
break;
freq[ch]++;
}
fseek(fp, orig_pos, SEEK_SET);
}
void construct_huffman(unsigned* freq_in, HuffNode* tree) {
int count = 256;
unsigned freq[256];
HuffNode *node[256];
// Initialize data
for (int i = 0; i < 256; i++) {
freq[i] = freq_in[i];
tree[i].data = i;
tree[i].left = tree[i].right = NULL;
node[i] = &tree[i];
}
// Sort by frequency, decreasing order
/* WARNING: Although this Quick Sort is an unstable sort,
* it should at least give the same result for the same input frequency table,
* therefore I'm leaving this code here
*/
{
unsigned lower[256], upper[256], top = 1;
lower[0] = 0, upper[0] = 256;
while (top > 0) {
top--;
int left = lower[top], right = upper[top];
int i = left, j = right - 1, flag = 0;
if (i >= j) // Nothing to sort
continue;
while (i < j) {
if (freq[i] < freq[j]) {
unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
flag = !flag;
}
flag ? i++ : j--;
}
lower[top] = left, upper[top] = i;
lower[top + 1] = i + 1, upper[top + 1] = right;
top += 2;
}
}
// Construct tree
while (count > 1) {
int pos = 512 - count;
HuffNode *parent = &tree[pos];
// Select lowest 2 by freq
int i = count - 2, j = count - 1;
// Create tree, lower freq left
parent->left = node[j]; parent->right = node[i];
node[j]->parent = node[i]->parent = parent;
node[i] = parent;
freq[i] += freq[j];
// Insert
for (; i > 0 && freq[i] > freq[i - 1]; i--) {
unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
}
count--;
}
// Now HEAD = node[0] = tree[511]
node[0]->parent = NULL;
}
void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
int n;
byte ch;
byte buf = 0, nbuf = 0;
HuffNode *p;
byte code[256];
while (1) {
n = fgetc(fin);
if (n < 0)
break;
ch = n;
// Encode
p = &tree[ch];
n = 0;
while (p->parent) {
if (p == p->parent->left) {
// Left is 0
code[n] = 0;
} else if (p == p->parent->right) {
code[n] = 1;
}
p = p->parent;
n++;
}
// Write
for (int i = n - 1; i >= 0; i--) {
buf |= code[i] << nbuf;
nbuf++;
if (nbuf == 8) {
fputc(buf, fout);
nbuf = buf = 0;
}
}
}
fputc(buf, fout);
*padding = 8 - nbuf;
}
void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
size_t startpos = ftell(fin); // should be 1028
fseek(fin, 0L, SEEK_END);
size_t endpos = ftell(fin); // last byte handling
fseek(fin, startpos, SEEK_SET);
int count = endpos - startpos;
byte buf = 0, nbuf = 0, bit;
HuffNode *p;
while (count > 0 || nbuf > 0) {
// Start from tree top
p = tree + 510;
while (p->left || p->right) {
// Prepare next bit if needed
if (nbuf == 0) {
if (count <= 0)
return;
buf = fgetc(fin);
if (count == 1) {
// Last bit
nbuf = 8 - padding;
if (nbuf == 0) {
return;
}
} else {
nbuf = 8;
}
count--;
}
// p has child
bit = buf & 1;
buf >>= 1;
nbuf--;
if (bit == 0)
p = p->left;
else
p = p->right;
}
fputc(p->data, fout);
}
}
void compress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
size_t padding_pos;
count_frequency(fin, freq);
construct_huffman(freq, tree);
rewind(fin);
for (int i = 0; i < 256; i++)
fwrite(freq + i, 4, 1, fout);
// Write a placeholder for the padding
padding_pos = ftell(fout);
fwrite(&padding, 4, 1, fout);
encode_stream(fin, fout, tree, &padding);
// Write the padding to the placeholder
fseek(fout, padding_pos, SEEK_SET);
fwrite(&padding, 4, 1, fout);
fclose(fin);
fclose(fout);
}
void uncompress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
for (int i = 0; i < 256; i++) {
fread(&padding, 4, 1, fin);
freq[i] = padding;
}
fread(&padding, 4, 1, fin);
construct_huffman(freq, tree);
decode_stream(fin, fout, tree, padding);
fclose(fin);
fclose(fout);
}
void print_help(void) {
puts("Usage: huffman (-c|-d) input output");
puts(" -c Compress file from input to output");
puts(" -d Uncompress file from input to output");
puts("nCreated by iBug");
}
int main(int argc, char** argv) {
if (argc != 4) {
print_help();
return 1;
}
if (!strcmp(argv[1], "-c")) {
compress_file(argv[2], argv[3]);
} else if (!strcmp(argv[1], "-d")) {
uncompress_file(argv[2], argv[3]);
} else {
print_help();
return 1;
}
return 0;
}
In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.
On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times. Thanks in advance!
algorithm c
New contributor
In a past course one of the assignments was to write a program that can compress files using Huffman Tree algorithm, and uncompress the files that the program generates.
My design is to count the byte occurrences first, then construct a HT based on the counted byte frequency.
My compressed file format is 256*4 bytes of "header" that stores the counted frequency, so it can be used to construct the tree when decompressing the file. Then there's a 4-byte integer that indicates how many bits of the last byte is real data. The rest is the real (compressed) data.
Here is this specific version* of code that I want some feedback. Later versions introduced many messy changes (like GUI and buffered I/O) that is not necessary.
Specifically, I'm looking for feedback on my algorithm and data structure implementation, including but not limited to code style, best practices, potential flaws and defects (see below).
- An exception is the last two functions
print_help
andmain
. They're intended to be as simple as possible, so they contain the bare minimum amount of code to work in a reasonable way. Data validation and error checking etc. are omitted on purpose.
In order to simplify the idea, during designing and coding, I have assumed that
- the program will not be told to uncompress an invalid file, so there's no file validity check in the code
- file availability is ensured by the environment. It will always be a regular file, with no chance of generating a read error mid-way
- C library functions does not fail for environmental reasons (e.g. host is short of RAM for
malloc(3)
, target disk out of space forfwrite(3)
and consequentlywrite(2)
, orfread(3)
as said above) - reading/writing byte-by-byte is fine, because a later version of this code introduced chunk I/O and got a bit messier (I think). Suggestions on making the code run faster without implementing chunk I/O is welcome
so I'm also not looking for feedbacks regarding the above things that I have assumed / intentionally ignored.
I have ensured that the code is working properly, with no warnings when compiled with this command (taken from make
output)
gcc -O3 -std=c11 -Wall -Wno-unused-result -o huffman huffman.c
The last option is to suppress the warning about unused result from fread(3)
.
During my coding process, I run clang-format
occasionally and diff
the output and my written code to check for potentially bad indentation / styling issues. I am not confident if it can solve everything.
* The link points to my GitHub repo. The code on that page is identical to the code submitted below verbatim.
// File: huffman.c
// Author: iBug
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
typedef unsigned char byte;
typedef struct _HuffNode {
unsigned data;
struct _HuffNode *left, *right, *parent;
} HuffNode;
void count_frequency(FILE* fp, unsigned* freq) {
size_t orig_pos = ftell(fp);
int ch;
while (1) {
ch = fgetc(fp);
if (ch < 0)
break;
freq[ch]++;
}
fseek(fp, orig_pos, SEEK_SET);
}
void construct_huffman(unsigned* freq_in, HuffNode* tree) {
int count = 256;
unsigned freq[256];
HuffNode *node[256];
// Initialize data
for (int i = 0; i < 256; i++) {
freq[i] = freq_in[i];
tree[i].data = i;
tree[i].left = tree[i].right = NULL;
node[i] = &tree[i];
}
// Sort by frequency, decreasing order
/* WARNING: Although this Quick Sort is an unstable sort,
* it should at least give the same result for the same input frequency table,
* therefore I'm leaving this code here
*/
{
unsigned lower[256], upper[256], top = 1;
lower[0] = 0, upper[0] = 256;
while (top > 0) {
top--;
int left = lower[top], right = upper[top];
int i = left, j = right - 1, flag = 0;
if (i >= j) // Nothing to sort
continue;
while (i < j) {
if (freq[i] < freq[j]) {
unsigned t = freq[i]; freq[i] = freq[j]; freq[j] = t;
HuffNode *p = node[i]; node[i] = node[j]; node[j] = p;
flag = !flag;
}
flag ? i++ : j--;
}
lower[top] = left, upper[top] = i;
lower[top + 1] = i + 1, upper[top + 1] = right;
top += 2;
}
}
// Construct tree
while (count > 1) {
int pos = 512 - count;
HuffNode *parent = &tree[pos];
// Select lowest 2 by freq
int i = count - 2, j = count - 1;
// Create tree, lower freq left
parent->left = node[j]; parent->right = node[i];
node[j]->parent = node[i]->parent = parent;
node[i] = parent;
freq[i] += freq[j];
// Insert
for (; i > 0 && freq[i] > freq[i - 1]; i--) {
unsigned t = freq[i]; freq[i] = freq[i - 1]; freq[i - 1] = t;
HuffNode *p = node[i]; node[i] = node[i - 1]; node[i - 1] = p;
}
count--;
}
// Now HEAD = node[0] = tree[511]
node[0]->parent = NULL;
}
void encode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned* padding) {
int n;
byte ch;
byte buf = 0, nbuf = 0;
HuffNode *p;
byte code[256];
while (1) {
n = fgetc(fin);
if (n < 0)
break;
ch = n;
// Encode
p = &tree[ch];
n = 0;
while (p->parent) {
if (p == p->parent->left) {
// Left is 0
code[n] = 0;
} else if (p == p->parent->right) {
code[n] = 1;
}
p = p->parent;
n++;
}
// Write
for (int i = n - 1; i >= 0; i--) {
buf |= code[i] << nbuf;
nbuf++;
if (nbuf == 8) {
fputc(buf, fout);
nbuf = buf = 0;
}
}
}
fputc(buf, fout);
*padding = 8 - nbuf;
}
void decode_stream(FILE* fin, FILE* fout, HuffNode* tree, unsigned padding) {
size_t startpos = ftell(fin); // should be 1028
fseek(fin, 0L, SEEK_END);
size_t endpos = ftell(fin); // last byte handling
fseek(fin, startpos, SEEK_SET);
int count = endpos - startpos;
byte buf = 0, nbuf = 0, bit;
HuffNode *p;
while (count > 0 || nbuf > 0) {
// Start from tree top
p = tree + 510;
while (p->left || p->right) {
// Prepare next bit if needed
if (nbuf == 0) {
if (count <= 0)
return;
buf = fgetc(fin);
if (count == 1) {
// Last bit
nbuf = 8 - padding;
if (nbuf == 0) {
return;
}
} else {
nbuf = 8;
}
count--;
}
// p has child
bit = buf & 1;
buf >>= 1;
nbuf--;
if (bit == 0)
p = p->left;
else
p = p->right;
}
fputc(p->data, fout);
}
}
void compress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
size_t padding_pos;
count_frequency(fin, freq);
construct_huffman(freq, tree);
rewind(fin);
for (int i = 0; i < 256; i++)
fwrite(freq + i, 4, 1, fout);
// Write a placeholder for the padding
padding_pos = ftell(fout);
fwrite(&padding, 4, 1, fout);
encode_stream(fin, fout, tree, &padding);
// Write the padding to the placeholder
fseek(fout, padding_pos, SEEK_SET);
fwrite(&padding, 4, 1, fout);
fclose(fin);
fclose(fout);
}
void uncompress_file(const char* filename, const char* newname) {
FILE *fin = fopen(filename, "rb"),
*fout = fopen(newname, "wb");
unsigned freq[256], padding;
HuffNode tree[512];
for (int i = 0; i < 256; i++) {
fread(&padding, 4, 1, fin);
freq[i] = padding;
}
fread(&padding, 4, 1, fin);
construct_huffman(freq, tree);
decode_stream(fin, fout, tree, padding);
fclose(fin);
fclose(fout);
}
void print_help(void) {
puts("Usage: huffman (-c|-d) input output");
puts(" -c Compress file from input to output");
puts(" -d Uncompress file from input to output");
puts("nCreated by iBug");
}
int main(int argc, char** argv) {
if (argc != 4) {
print_help();
return 1;
}
if (!strcmp(argv[1], "-c")) {
compress_file(argv[2], argv[3]);
} else if (!strcmp(argv[1], "-d")) {
uncompress_file(argv[2], argv[3]);
} else {
print_help();
return 1;
}
return 0;
}
In addition to the mandatory CC BY-SA 3.0 license by posting on Stack Exchange, the code itself also has a MIT license.
On a side note: Although the course has ended and this code is not maintained anymore, it's still one of the programs that I have written with maximum attention and carefulness, so I believe that any feedback to this code is highly valuable and I will remember them in my future C-coding times. Thanks in advance!
algorithm c
algorithm c
New contributor
New contributor
New contributor
asked 1 hour ago
iBug
1165
1165
New contributor
New contributor
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Header size
256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:
- Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).
- Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)
- Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)
Encoding process
Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf
in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++
turns into nbuf += codelength
. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.
Decoding process
Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.
Some possible relevant sources for these techniques are the one in the previous paragraph and:
- Introduction to table based Huffman decoding
- An efficient algorithm of Huffman decoder with nearly constant decoding time
- Huffman revisited - Part 2 : the Decoder
A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
iBug is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f210654%2fhuffman-tree-compressing-decompressing-in-c%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Header size
256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:
- Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).
- Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)
- Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)
Encoding process
Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf
in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++
turns into nbuf += codelength
. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.
Decoding process
Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.
Some possible relevant sources for these techniques are the one in the previous paragraph and:
- Introduction to table based Huffman decoding
- An efficient algorithm of Huffman decoder with nearly constant decoding time
- Huffman revisited - Part 2 : the Decoder
A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)
add a comment |
Header size
256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:
- Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).
- Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)
- Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)
Encoding process
Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf
in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++
turns into nbuf += codelength
. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.
Decoding process
Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.
Some possible relevant sources for these techniques are the one in the previous paragraph and:
- Introduction to table based Huffman decoding
- An efficient algorithm of Huffman decoder with nearly constant decoding time
- Huffman revisited - Part 2 : the Decoder
A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)
add a comment |
Header size
256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:
- Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).
- Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)
- Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)
Encoding process
Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf
in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++
turns into nbuf += codelength
. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.
Decoding process
Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.
Some possible relevant sources for these techniques are the one in the previous paragraph and:
- Introduction to table based Huffman decoding
- An efficient algorithm of Huffman decoder with nearly constant decoding time
- Huffman revisited - Part 2 : the Decoder
A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)
Header size
256*4 bytes is very big for a header. The size could be reduced substantially by using one or several of these common techniques:
- Store the code length instead of symbol frequency. These definitely won't need 32 bits each, 8 would already be a lot. You can pack them in 4 bits each if you set a length limit of 15. Storing lengths is not ambiguous because you can use canonical Huffman codes (there is an easy algorithm to generate them from your table of code lengths, discarding the code itself).
- Compress the header with delta encoding: storing the length difference between subsequent codes, using a variable-length encoding. Small differences tend to be more common. (seen in eg DEFLATE)
- Remove most zero-lengths from the header, by first storing a sparse bitmap that indicates which symbols occur in the file. (seen in eg bzip2)
Encoding process
Walking up the tree for every byte of the file is needlessly inefficient. You could precompute an array of codes and lengths once in advance and then use the array during encoding. The code could be represented as a single unsigned integer, no array necessary (it won't be that long, and you will want to limit the code lengths anyway for decoding and header reasons). It can be prepended to buf
in a couple of simple bitwise operations, similar to how you currently add a single bit, but nbuf++
turns into nbuf += codelength
. Together this lets the encoding process take a constant number of operations instead of scaling linearly with the code length.
Decoding process
Currently your code implements bit-by-bit tree walking, which is (as one source puts it) dead slow. The alternative is decoding several bits at the same time by using an array lookup. There are a lot of subtly different ways to do that, but the basis of all of them is that part of the buffer is used to index into a table. Limiting the maximum length of the codes is very useful, because with a limited length you can guarantee that decoding is a single-step process, resolving one symbol from the buffer in a constant number of operations, with no looping.
Some possible relevant sources for these techniques are the one in the previous paragraph and:
- Introduction to table based Huffman decoding
- An efficient algorithm of Huffman decoder with nearly constant decoding time
- Huffman revisited - Part 2 : the Decoder
A Fast and Space - Economical Algorithm for Length - Limited Coding (for a way to generate the code lengths with a length limit)
answered 52 mins ago
harold
1,02857
1,02857
add a comment |
add a comment |
iBug is a new contributor. Be nice, and check out our Code of Conduct.
iBug is a new contributor. Be nice, and check out our Code of Conduct.
iBug is a new contributor. Be nice, and check out our Code of Conduct.
iBug is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Code Review Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f210654%2fhuffman-tree-compressing-decompressing-in-c%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown