Do1e

Do1e

github
email

JPEG Encoding Details

This article is synchronized and updated to xLog by Mix Space
For the best browsing experience, it is recommended to visit the original link
https://www.do1e.cn/posts/others/JPEG-detail


Preface#

This blog was originally published on 2021-08-22 on CSDN, and has been copied here with some formatting issues corrected.

Recently, I have been learning how to perform JPEG encoding. I found many articles online, but very few explain every detail clearly, leading to many pitfalls while programming. Therefore, I intend to write an article that covers the details, incorporating Python code as much as possible. The specific program can be referenced in my open-source project on GitHub.

Do1e/JPEG-encode

Of course, my introduction and code are not very complete and may even contain some errors; they can only serve as a beginner's guide, and I hope for your understanding.

Various Markers in JPEG Files#

Many articles introduce the markers in JPEG files. I have also uploaded a document that annotates an actual image (click to download) for reference.

All markers start with 0xff (hexadecimal 255), followed by the byte count representing this block of data and the data describing this block's information, as shown in the figure below:

image

At this point, we have only the image data part left to write, but how the image data part is encoded, as well as how the quantization and Huffman encoding mentioned above are implemented, please refer to the next section.

JPEG Encoding Process#

Since the JPEG encoding process requires the image to be divided into 8*8 blocks, the height and width of the image must be multiples of 8. Therefore, we can use image interpolation or subsampling methods to make slight changes to the image, transforming it into an image with both height and width as multiples of 8. For an image with thousands of pixels, this operation will not significantly change the overall aspect ratio of the image.

Color Space Conversion#

JPEG images uniformly use the YCbCr color space because the human eye is more sensitive to brightness than to chromaticity. Therefore, we selectively increase the compression of the Cb and Cr components, which can ensure that the visual quality of the image remains unchanged while significantly reducing the image size. After transforming to the YCbCr space, we can sample the Cb and Cr color components to reduce their number of points, achieving greater compression. Common sampling formats include 4:4:4, 4:2:2, and 4:2:0. This corresponds to the horizontal and vertical sampling factors in the SOF0 marker. For simplicity, all sampling factors in this article are set to 1, meaning no sampling is performed, with one Y component corresponding to one Cb and Cr component (4:4:4). In contrast, 4:2:2 means two Y components correspond to one Cb and Cr component, while 4:2:0 means four Y components correspond to one Cb and Cr component. As shown in the figure below, each cell corresponds to a Y component, while the blue cells are the sampled pixels of the Cb and Cr components.

image

The formulas for color space conversion are:

Y=0.299R+0.587G+0.114BY = 0.299*R + 0.587*G + 0.114*B
Cb=0.1687R0.3313G+0.5B+128Cb = -0.1687*R - 0.3313*G + 0.5*B + 128
Cr=0.5R0.4187G0.0813B+128Cr = 0.5*R - 0.4187*G - 0.0813*B + 128

The above calculations are all rounded to integers. In a 24-bit RGB BMP image, the ranges of R, G, and B components are all 0-255. Through simple mathematical relationships, we can find that the ranges of Y, Cb, and Cr components are also 0-255. In JPEG images, we usually need to subtract 128 from each component to allow for both positive and negative ranges.

In Python, we can use functions from the OpenCV library to perform color space transformations:

8*8 Block Division#

In JPEG encoding, each 8*8 block is processed sequentially from top to bottom and left to right. The data from each block is then combined together. For each block's Y, Cb, and Cr color components, the same operations are performed (the quantization table and Huffman table used may differ).

DCT Transformation#

DCT (Discrete Cosine Transform) converts spatial domain data into frequency domain data, allowing us to selectively reduce the data of high-frequency components without significantly affecting the visual quality of the image. Compared to the Discrete Fourier Transform, DCT operates in the real number domain, making it more advantageous for computer calculations. The formula for the Discrete Cosine Transform is as follows:

F(u,v)=2MNx=0M1y=0N1f(x,y)C(u)C(v)cos(2x+1)uπ2Mcos(2y+1)vπ2NF(u,v)=\frac2{\sqrt{MN}}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}f(x,y)C(u)C(v)\cos\frac{(2x+1)u\pi}{2M}\cos\frac{(2y+1)v\pi}{2N}

where $C(u)=\begin{cases}\frac{1}{\sqrt{2}}&u=0\\1&u\neq0\end{cases}$ . In JPEG, $M=N=8$.

Of course, we can also use functions from the OpenCV library:

Quantization#

After the DCT transformation, the DC component is the first element of the 88 block, with low-frequency components concentrated in the top left corner and high-frequency components concentrated in the bottom right corner. To selectively discard high-frequency components, we can perform quantization, which essentially involves dividing each element in the 88 block by a fixed value. The elements in the quantization table are smaller in the top left corner and larger in the bottom right corner. An example of a quantization table is as follows (the Y component and Cb Cr components use different quantization tables):

Quantization process code:

After quantization, many zeros appear in the lower right corner of the 8*8 block. To concentrate these zeros and allow run-length encoding to produce less data, we will perform a zigzag scan next.

Zigzag Scan#

The so-called zigzag scan actually converts the 8*8 block into a 64-item list in the following order.

image

Ultimately, we obtain a list of length 64: (41, -8, -6, -5, 13, 11, -1, 1, 2, -2, -3, -5, 1, 1, -5, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0). The subsequent operations will use this list as an example.

It is important to note that when storing the quantization table, we also need to perform a zigzag scan on the quantization table to store it in this form so that the image viewer can decode the correct image (I spent a lot of debugging time on this detail at the beginning), as seen in the code that writes the identifiers at the beginning of this article.

Differential Encoding (DC Component)#

The values of DC components are often large, and the DC components of adjacent 8*8 blocks are usually very similar. Therefore, using differential encoding can save space to a greater extent. Differential encoding means storing the difference between the current block's DC component and the previous block's DC component, while the first block stores its own value. It is important to note that the Y, Cb, and Cr components are differentially encoded correspondingly, meaning each component is subtracted from its corresponding previous value. In the following section, I will introduce how to encode and store the DC component now_block_dc.

Run-Length Encoding of Zeros (AC Component)#

After the zigzag scan, many zeros are concentrated together. The AC component list is: (-8, -6, -5, 13, 11, -1, 1, 2, -2, -3, -5, 1, 1, -5, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0).

Run-length encoding of zeros stores two numbers each time; the second number is a non-zero number, and the first number indicates how many zeros precede this non-zero number. Finally, two zeros are added as an end marker (especially note that when the input data does not end with zero, two zeros are not needed as an end marker; this bug took me a long time to find, see line 23 of the code below). After run-length encoding, the above list becomes: (0, -8), (0, -6), (0, -5), (0, 13), (0, 11), (0, -1), (0, 1), (0, 2), (0, -2), (0, -3), (0, -5), (0, 1), (0, 1), (0, -5), (0, 1), (3, -1), (6, 1), (0, 1), (0, -1), (27, 1), (0, 0). The length of this data is 42, which is a slight reduction compared to the original 63. Of course, this is a special case; actual data will have more zeros, even all zeros, and the encoded size can be smaller.

You may have noticed that the data (27, 1) is highlighted in red because in the encoding of the eighth part, the first number is stored as a 4-bit number, so the range is 0-15. Here, it clearly exceeds that, so we need to split it into (15, 0) and (11, 1), where (15, 0) represents 16 zeros, and (11, 1) represents 11 zeros followed by a 1.

Special Binary Encoding in JPEG#

After the above groundwork, this section will truly introduce how the encoded DC and AC components are written into the file in data stream form.

In JPEG encoding, there are the following binary encoding forms:

For a number to be stored, we need to obtain the required bit length and the actual binary value to be stored according to the above form. Observing the pattern, it is not difficult to find that the stored value of a positive number is its actual binary, and the bit length is also its actual bit length. The same applies to negative numbers, where the binary value is the bitwise negation of the number. Zero does not need to be stored.

For the DC component, assuming the value after differential encoding is -41, we can find that its bit length is 6, and the stored binary data stream is 010110. For the data 6, we need to use the normalized Huffman coding to store its binary data stream; this part will be introduced in section 9. For now, let's assume the stored binary data stream for 6 is 100, then the DC value for a color component of this 8*8 block is stored as 100010110.

After writing the binary data stream for the DC component into the file, we then encode the AC values for this color component of the 8*8 block. After run-length encoding, the values are: (0, -8), (0, -6), (0, -5), (0, 13), (0, 11), (0, -1), (0, 1), (0, 2), (0, -2), (0, -3), (0, -5), (0, 1), (0, 1), (0, -5), (0, 1), (3, -1), (6, 1), (0, 1), (0, -1), (15, 0), (11, 1), (0, 0).

First, we store (0, -8). For the second number, we perform the same operation and obtain 4 bits and 0111. However, unlike the DC component, we need to perform normalized Huffman coding on 0x04, where the high four bits represent the first number of (0, -8), and the fourth bit represents the bit length of the second number. Assuming the normalized Huffman coding for 0x04 results in 1011, then (0, -8) is stored as 10110111. Next, we perform the same operations for (0, -6) and so on, writing the resulting data streams into the file.

To give another example, (6, 1), where 1 is stored as 1, 1 bit, so for 0x61, we perform normalized Huffman coding, assuming it results in 1111011, then (6, 1) is stored as 11110111. (15, 0) only stores the normalized Huffman coding value of 0xf0.

After writing the data for one color component (assumed to be Y) of an 88 block, we then write the data for the Cb color component of this block, followed by the Cr component data. Using the same method, we write the data for each 88 block from left to right and top to bottom, and finally write the EOI marker (0xffd9) to indicate the end of the image.

Note: During the data writing process, we need to check if we are writing 0xff. To prevent marker conflicts, we need to append 0x00 afterward.

Normalized Huffman Coding#

In this article, four coding tables for normalized Huffman coding are introduced, which are used for luminance DC components, chrominance DC components, luminance AC components, and chrominance AC components.

In the above code, std_huffman_DC0, etc., are the actual values stored in the identifiers. The first 16 numbers (0, 0, 7, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) represent how many numbers there are for each length of 1-16 bits after encoding, followed by 12 numbers that exactly equal the sum of the previous 16 numbers. std_huffman_DC0 describes the following figure:

image

Now we only know the length of the encoded data for each original data, but we do not know what the actual values are.

Normalized Huffman coding has its own set of rules:

  1. The first number of the minimum encoding length is encoded as 0;
  2. The encoding of the same length is continuous;
  3. The first number of the next encoding length (assumed to be j) depends on the last number of the previous encoding length (assumed to be i), i.e., a=(b+1)<<(j-i).

According to rule 1, we know that the encoding for 4 is 000. According to rule 2, the encoding for 5 is 001, the encoding for 3 is 010, the encoding for 2 is 011..., and the encoding for 0 is 110. According to rule 3, the encoding for 7 is 1110, the encoding for 8 is 11110...

The resulting Huffman dictionary is quite long; you can check it in my GitHub project. By finding the pattern, you can understand how I derived the index of the dictionary in the write_num function.

Do1e/JPEG-encode

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.