Recent Question/Assignment
its coding in c programming
3 Your Task
Your task is to implement two programs called encode and decode which perform LZ78 compression and decompression, respectively. The requirements for your programs are as follows:
1. encode can compress any file, text or binary.
2. decode can decompress any file, text or binary, that was compressed with encode.
3. Both operate on both little and big endian systems. Interoperability is required.
4. Both use variable bit-length codes.
5. Both perform read and writes in efficient blocks of 4KB.
6. Both encode and decode must interoperate with the provided binaries — not just your code.
4 Specifics
You will need to implement some new ADTs for this assignment: an ADT for tries and an ADT for words. In addition to these new ADTS, you will need to be concerned about variable-length codes, I/O, and endianness.
4.1 Tries
The most costly part of compression is checking for existing prefixes, or words. You could utilize a hash table, or just an array to store words, but that wouldnt be optimal, as many of the words you need to store are prefixes of other words. Instead you choose to utilize a trie
A trie is an efficient information re-trie-val data structure, commonly known as a prefix tree. Each node in a trie represents a symbol, or a character, and contains n child nodes, where n is the size of the alphabet you are using. In most cases, the alphabet used is the set of ASCII characters, so n = 256. You will use a trie during compression to store words.
I Edward Fredkin, -Trie memory.- Communications ofthe ACM 3, no. 9 (1960): 490—499.
Above is an example ofa trie containing the following words: -She-, -sells-, -sea-, -shells-, -by-, -the-, -sea-, - s hore-. Searching for a word in the trie means stepping down for each symbol in the word, starting from the root. Stepping down the trie is simply checking if the current node we have traversed to has a child node representing the symbol we are looking for, and setting the current node to be the child node if it does exist. Thus, to find - sea-, we would start from the tries root and step down to s, then e, then a. If any symbol is missing, or the end of the trie is reached without fully matching a word, while stepping through the trie, then the word is not in the trie. You must follow the specification shown below when implementing your trie ADT.
The TrieNode struct will have the two fields shown above. Each trie node has an array of 256 pointers to trie nodes as children, one for each ASCII character. It should be easy to see how this simplifies searching for the next character in a word in the trie. The code field stores the 16-bit code for the word that ends with the trie node containing the code. This means that the code for some word -abc- would be contained in the trie node for c. Note that there isnt a field that indicates what character a trie node represents. This is because the trie nodes index in its parents array of child nodes indicates what character it represents. The trie_step function will be repeatedly called to check if a word exists in the trie. A word only exists if the trie node returned by the last step corresponding to the last character in the word isnt NULL.
4.1.2 TrieNode *trie_node_create (uint16_t code)
Constructor for a TrieNode. The nodes code is set to code. Make sure each of the children node pointers are NULL.
4.1.3 void trie_node_delete (TrieNode *n)
Destructor for a TrieNode. Note that only a single pointer is passed here. The destructors you have written in the past have taken double pointers in order to NULL the pointer by dereferencing it. For this assignment, a single pointer is much more manageable and simplifies the code you have to otherwise write.
4.1.4 TrieNode *trie_create (void)
Initializes a trie: a root TrieNode with the code EMPTY_CODE. Returns the root, a TrieNode *, if successful, NULL otherwise.
4.1.5 void trie_reset (TrieNode *root)
Resets a trie to just the root TrieNode. Since we are working with finite codes, eventually we will arrive at the end of the available codes (MAX_CODE). At that point, we must reset the trie by deleting its children so that we can continue compressing/ decompressing the file. Make sure that each of the roots children nodes are NULL.
4.1.6 void trie_delete (TrieNode *n)
Deletes a sub-trie starting from the trie rooted at node n. This will require recursive calls on each of ns children. Make sure to set the pointer to the children nodes to NULL after you free them with trie_node_delete ( ).
4.1.7 TrieNode *trie_step (TrieNode *n, uint8_t sym)
Returns a pointer to the child node reprsenting the symbol sym. If the symbol doesnt exist, NULL is returned.
4.2 Word Tables
Although compression can be performed using a trie, decompression still needs to use a look-up table for quick code to word translation. This look-up table will be defined as a new struct called a WordTab1e. Since we can only have 2 16 — 1 codes, one ofwhich is reserved as a stop code, we can use a fixed word table size ofUINT16_MAX, where UINT16_MAX is a macro defined in inttypes. h as the maximum value of a unsigned 16-bit integer. Hint: this has the exact same value as MAX_CODE, which we defined earlier.
Why arent we using hash tables to store words? Because there is a fixed number of codes. Each index of this word table will be a new struct called a Word. You will store words in byte arrays, or arrays of uint8_t. This is because strings in C are null-terminated, and problems with compression occur if a binary file is being compressed and contains null characters that are placed into strings. Since we need to know how long a word is, a Word will also have a field for storing the length of the byte array, since we cant use string. h functions like strlen on byte arrays. You must use the following specification for the new Word ADT.
4.2.1 Word
1 struct Word { uint8_t *syms; uint32_t len;
A Word holds an array of symbols, syms, stored as bytes in an array. The length of the array storing the symbols a Word represents is stored in len.
4.2.2 WordTab1e
1 type def Word * WordTab1e
To make things easier to reason about, we can define an array of Words as a WordTab1e.
4.2.3 Word *word_create (uint8_t *syms, uint32_t len)
Constructor for a word where sysms is the array of symbols a Word represents. The length of the array of symbols is given by len. This function returns a Word * if successful or NULL otherwise.
4.2.4 Word (Word *w, uint8_t sym)
Constructs a new Word from the specified Word, w, appended with a symbol, sym. The Word specified to append to may be empty. If the above is the case, the new Word should contain only the symbol. Returns the new Word which represents the result of appending.
4.2.5 void word_delete (Word *w)
Destructor for a Word, w. Like with trie_node_create() in S4.1.3, a single pointer is used here to reduce the complexity of memory management, thus reducing the chances of having memory-related errors.
4.2.6 WordTab1e *vt_create (void)
Creates a new WordTab1e, which is an array ofWords. A WordTab1e has a pre-defined size ofMAX_CODE, which has the value UINT16_MAX. This is because codes are 16-bit integers. A WordTab1e is initialized with a single Word at index EMPTY_CODE. This Word represents the empty word, a string of length of zero.
4.2.7 void vt_reset (WordTab1e *vt)
Resets a WordTab1e, wt, to contain just the emptyWord. Make sure all the other words in the table are NULL.
4.3 Codes
It is a requirement that your programs, encode and decode, be able to read and write pairs containing variable bit-length codes. As an example, assume we are going to write the pair (13, a). What is the minimum number of bits needed to represent 13? We can easily calculate the minimum number of bits needed to represent any integer x 1 using the formula 110b + 1. The only edge case is with 0, whose bit-length is 1. Thus, we calculate that the minimum number of bits needed to represent 13, or the bit-length of 13, to be 4. That being said, in practice, the bit-length of the code in the pair youre going output wouldnt be the bit-length of the code itself, but rather the bit-length of the next available code assignable, as described in the earlier compression example. The reason for this is because decompression must know at all times the bit-length of the code it is going to read. Because the dictionary constructed by decompression contains exactly the words and codes contained by the trie constructed by compression, it is evident that, by using the bit-length of the next available code, that compression and decompression will agree on the variable bit-lengths of codes. Note that only codes have variable bit-lengths: symbols in a pair are always 8 bits.
As an example, assume we are going to output the pair (13, a), and the next available code assignable in our dictionary is 64. We will need to convert this pair to binary, starting with the code. As stated earlier, the bit-length of the code is the bit-length of the next available code. We calculate that the bit-length of 64 is 7. We start from the least significant bit of our code, 13, and construct the code portion of the pairs binary representation: LSB 1011000 MSB. Note that the ISB is on the left and the MSB is on the right. Zeroes are padded because the bitlength of64 is 7, while the bit-length of 13 is 4, which is why there are three padded zeroes. In the same fashion, we start from the least significant bit of our symbol, a, and add the symbol portion of the pairs binary representation to the previous code portion, which yields LSB 101100010000110 MSB, since the ASCII value ofa is 97.
Now assume that we are reading in the binary LSB 101100010000110 MSB and converting that back into a pair. We la-low that the next available code assignable by compression and decompression are the same at every step. Thus, we know that compression must have output the pairs code with the bit-length of 64, which is 7. We go through the first 7 bits ofthe binary: 1, O, 1, 1, O, O, and O, summing up the bits as we go, simulating how positional numbering systems work. This means we perform 1 • 20 + O • 2 1 + 1 • 22 + 1 • 23 + O • 24 +0 • 25 + 0 • 26 = 13, exactly the code output by compression. We do the same for the remaining 8 bits to reconstruct the symbol.
agma once
4.4 1/0
It is also a requirement that your programs, encode and decode, perform efficient I/O. Reads and writes will be done 4KB, or a block, at a time, which implicitly requires that you buffer I/O. Buffering is the act of storing data into a buffer, which you can think of as an array of bytes. You will be implementing an I/O module for this assignment, and its API is explained in the following sections.
4.4.1 FileHeader
This is the struct definition for the file header, which contains the magic number for your program and the protection bit mask for the original file. The file header is the first thing that appears in a compressed file. The magic number field, magic, serves as a unique identifier for files compressed by encode. decode should only be able to decompress files which have the correct magic number. This magic number is OxBAADBAAC.
Before writing the file header to the compressed file using wri te_header ( ) , you must swap the endianness of the fields if necessary since interoperability is required If your program is run on a system using big endian, the fields must be swapped to little endian, since little endian is canonical. A module specifically for handling endianness will be provided.
4.4.2 int read_bytes (int infile, uint8_t *buf, int to_read)
This will be a useful helper function to perform reads. As you may know, the read () syscall does not always guarantee that it will read all the bytes specified. For example, a call could be issued to read a block of bytes, but it might only read half a block. So, we write a wrapper function to loop calls to read() until we have either read all the bytes that were specified (to_read), or there are no more bytes to read. The number of bytes that were read are returned. You should use this function whenever you need to perform a read.
4.4.3 int write_bytes (int outfile, uint8_t *buf, int to_write)
This function is very much the same as read_bytes () , except that it is for looping calls to write (). As you may imagine, write () isnt guaranteed to write out all the specified bytes (to_write), and so we loop until we have either written out all the bytes specified, or no bytes were written. The number of bytes written out is returned. You should use this function whenever you need to perform a write.
4.4.4 void read_header(int infile, FileHeader *header)
This reads in sizeof (FileHeader) bytes from the input file. These bytes are read into the supplied header. Endianness is swapped ifbyte order isnt little endian. Along with reading the header, it must verify the magic number.
4.4.5 void write_header (int outfile, FileHeader *header)
Writes sizeof (FileHeader) bytes to the output file. These bytes are from the supplied header. Endianness is swapped if byte order isnt little endian.
4.4.6 bool read_sym (int infile, uint8_t *sym)
An index keeps track of the currently read symbol in the buffer. Once all symbols are processed , another block is read. Ifless than a blockis read , the end ofthe buffer is updated. Returns true if there are symbols to be read , false otherwise.
4.4.7 void write_pair (int outfile, uint16_t code, uint8_t sym, int bitlen)
-Writes- a pair to outfile. In reality, the pair is buffered. A pair is comprised of a code and a symbol. The bits of the code are buffered first, starting from the ISB. The bits of the symbol are buffered next, also starting from the LSB. The code buffered has a bit-length of bitlen. The buffer is written out whenever it is filled.
4.4.8 void flush_pairs (int outfile)
Writes out any remaining pairs of symbols and codes to the output file.
4.4.9 bool read_pair(int infile, uint16_t *code, uint8_t *sym, int bitlen)
-Reads- a pair (code and symbol) from the input file. The -read- code is placed in the pointer to code (e.g. *code val) The -read- symbol is placed in the pointer to sym (e.g. *sym val). In reality, a block of pairs is read into a buffer. An index keeps track of the current bit in the buffer. Once all bits have been processed, another block is read. The first bitlen bits are the code, starting from the ISB. The last 8 bits of the pair are the symbol, starting from the LSB. Returns true if there are pairs left to read in the buffer, else false. There are pairs left to read if the read code is not STOP_CODE.
4.4.10 void write_word(int outfile, Word *w)
-Writes- a pair to the output file. Each symbol of the Word is placed into a buffer. The buffer is written out when it is filled.
4.4.11 void flush_words (int outfile)
Writes out any remaining symbols in the buffer to the outfile.
Note that the output file in which you write to must have the same protection bits as the original file. Like in assignment 4, you will make use of f stat () and fchmod ( ).
All reads and writes in this program must be done using the system calls read and write () , which means that you must use the system calls open() and close() to get your file descriptors. As stated earlier, all reads and writes must be performed in efficient blocks of 4KB. You will want to use two static 4k8 uint8_t arrays to serve as buffers: one to store binary pairs and the other to store characters. Each of these buffers should have an index, or a variable, to keep track of the current byte or bit that has been processed.
5 Program Options
Your encode program must support the following getopt () options:
• -v : Print compression statistics to stderr.
• -i : Specify input to compress (stdin by default)
• -o output : Specify output of compressed input (stdout by default)
Your decode program must support the following getopt () options:
• -v : Print decompression statistics to stderr.
• -i : Specify input to decompress (stdin by default)
• -o output : Specify output of decompressed input (stdout by default)
The verbose option enables a flag to print out informative statistics about the compression or decompression that is performed. These statistics include the compressed file size, the uncompressed file size, and space saving. The formula for calculating space saving is:
compressed size
100x 1-
uncompressed size The verbose output of both encode and decode must match the following:
Compressed file size: X bytes Uncompressed file size: X bytes Space saving: XX.
6 Compression
The following steps for compression will refer to the input file descriptor to compress as infile and the compressed output file descriptor as outfile.
1. Open infile with open(). If an error occurs, printa helpful message and exit with a status code indicating that an error occurred. inf ile should be stdin if an input file wasnt specified.
2. The first thing in outfile must be the file header, as defined in the file io .h. The magic number in the header must be OxBAADBAAC. The file size and the protection bit mask you will obtain using fstat See the man page on it for details.
3. Open outfile using open(). The permissions for outfile should match the protection bits as set in your file header. Any errors with opening outfile should be handled like with infile. outfile should be stdout if an output file wasnt specified.
4. Write the filled out file header to outfile using write_header(). This means writing out the struct itself to the file, as described in the comment block of the function.
5. Create a trie. The trie initiallyhasnochildren and consists solely ofthe root. The code stored by this root trie node should be EMPTY_CODE to denote the empty word. You will need to make a copy of the root node and use the copy to step through the trie to check for existing prefixes. This root node copy will be referred to as curr_node. The reason a copy is needed is that you will eventually need to reset whatever trie node youve stepped to back to the top of the trie, so using a copy lets you use the root node as a base to return to.
6. You will need a monotonic counter to keep track of the next available code. This counter should start at START_CODE, as defined in the supplied code .h file. The counter should be a uint16_t since the codes used are unsigned 16- bit integers. This will be referred to as next _ code.
7. You will also need two variables to keep track of the previous trie node and previously read symbol. We will refer to these as prev_node and prev_sym, respectively.
8. Use read_sym() inaloop to readin all the symbols from infile. Your loop should breakwhenread_sym() returns false. For each symbol read in, call it curr_sym, perform the following:
(a) Set next_node to be trie_step (curr_node, curr_sym), stepping down from the current node to the currently read symbol.
(b) Ifnext_node is not NULL, that means we have seen the current prefix. Set prev_node to be curr_node and then curr_node to be next_node.
(c) Else, since next_node is NULL, weknowwe have not encountered the current prefix. We write the pair
(curr_node- code, curr_sym) , where the bit-length of the written code is the bit-length ofnext_code. We now add the current prefix to the trie. Let curr_node- children[curr_sym] be a new trie node whose code is next_code. Reset curr_node to point at the root of the trie and increment the value of next _ code.
(d) Check ifnext_code is equal to MAX_CODE. Ifitis, use trie_reset to reset the trie to just having the root node. This reset is necessary since we have a finite number of codes.
(e) Update prev_sym to be curr_sym.
9. After processing all the characters in infile, check if curr_node points to the root trie node. If it does not, it means we were still matching a prefix. Write the pair (prev_node- code, prev_sym). The bit-length ofthe code written should be the bit-length of next_code. Make sure to increment next_code and that it stays within the limit of MAX_CODE. Hint: use the modulo operator.
10. Write the pair (STOP_CODE, 0) to signal the end of compressed output. Again, the bit-length of code written should be the bit-length of next _ code.
11. Make sure to use flush_pairs to flush any unwritten, buffered pairs. Remember, calls to write_pair() end up buffering them under the hood. So, we have to remember to flush the contents of our buffer.
12. Use close() to close infile and outfile.
7 Decompression
The following steps for decompression will refer to the input file to decompress as infile and the uncompressed output file as outfile.
1. Open infile with open(). If an error occurs, print a helpful message and exit with a status code indicating that an error occurred. inf ile should be stdin if an input file wasnt specified.
2. Read in the file header with , which also verifies the magic number. If the magic number is verified then decompression is good to go and you now have a header which contains the original protection bit mask.
3. Open outfile using open(). The permissions for outfile should match the protection bits as set in your file header that you just read. Any errors with opening outfile should be handled like with inf ile. outf ile should be stdout if an output file wasnt specified.
4. Create a new word table with wt_create() and make sure each of its entries are set to NULL. Initialize the table to have just the empty word, a word of length 0, at the index EMPTY_CODE. We will refer to this table as table.
5. You will need two uint16_t to keep track of the current code and next code. These will be referred to as curr_code and next_code, respectively. next_code should be initialized as START_CODE and functions exactly the same as the monotonic counter used during compression, which was also called next _ code.
6. Use read_pair() in aloop to read all the pairs from infile. We will refer to the code and symbol from each read pair as curr_code and curr_sym, respectively. The bit-length of the code to read is the bit-length of next _ code. The loop breaks when the code read is STOP_CODE. For each read pair, perform the following:
(a) As seen in the decompression example, we will need to append the read symbol with the word denoted by the read code and add the result to table at the index next_code. The word denoted by the read code is stored in table [curr_code]. We will append table [curr_code] and curr_sym using word_ append_sym ( ) .
(b) Write the word that we just constructed and added to the table with This word should have been stored in table [next _ code].
(c) Increment next_code and check ifit equals MAX_CODE. If it has, reset the table using wt_reset () and set next _ code to be START_CODE. This mimics the resetting of the trie during compression.
7. Flush any buffered words using flush_words(). Like with write_pair(), write_word() buffers words under the hood, so we have to remember to flush the contents of our buffer.
8. Close infile and outfile with close().
8 Deliverables
You will need to turn in:
1. Makefile:
• CFLAGS=-Wa11 -Wextra -Werror -Wpedantic must be included.
• CC=c1ang must be specified.
• make clean must remove all files that are compiler generated.
• make encode should build your encode program.
• make decode should build your decode program.
• make should build both encode and decode, as should make all.
• Your programs should have no memory leaks.
2. Your programs must have the following source and header files:
• encode. c : contains the main function for the encode program.
• decode. c : contains the main function for the decode program.
• trie. c: the source file for the Trie ADT.
• trie. h: the header file for the Trie ADT. You must not modify this file.
• word. c: the source file for the Word ADT.
• word. h: the header file for the Word ADT. You must not modify this file.
• io. c: the source file for the I/O module.
• io. h: the header file for the I/O module. You must not modify this file.
• endian. h: the header file for the endianness module. You must not modify this file.
• code. h: the header file containing macros for reserved codes. You must not modify this file.
3. You may have other source and header files, but do not try to be overly clever.
4. Running scan-build yields no bugs. Ifthere are any false positives, make sure to explain why they are false positives in your README. md.
5. README. md: This must be in Markdown. This should describe how to use your program and Makefile. This also contains any explanations for complaints generated by scan-build.
6. DESIGN. pdf: This must be a PDF. The design document should describe your design for your program with enough detail that a sufficiently knowledgeable programmer would be able to replicate your implementation. This does not mean copying your entire program in verbatim. You should instead describe how your program works with supporting pseudo-code.
7. Working encode and decode programs, along with test files, and other source files will be supplied in the resources repository.