Imaging Basics

Basics of Images and Videos:

Images: Images can be black & white or color. An image is formed by having a different light intensity at each pixel. Black & white images just have varying degrees of white light at different pixels, while color images use varying intensities of the primary color (RGB) lights. Each pixel's light intensity is stored as an integer from 0 to 255 (if 8 bits are used per value).

For a black & white picture, each pixel only needs to store 1 number indicating how bright that pixel is (0 for total black, the maximum value for total white).

For color images, we need to store 3 numbers for each pixel (as each pixel carries all 3 primary color values). We can either store the color info for each pixel as RGB directly, or we can split it into 2 portions: one portion storing the black/white (brightness) info, and the other portion storing the color info. When stored in this 2-portion format, instead of RGB we store it as Y'CbCr (aka YUV or YCC). The YUV image ultimately needs to be transformed back into RGB for display by the monitor (as RGB values are what all modern display devices use). The reason we still use this 2-portion format is that everything transferred to your TV over the air, or read from your Blu-ray disc, is sent in YUV. High definition video transfer requires a lot of data, and YUV can be compressed a lot more than RGB with no noticeable loss in quality. If it weren't for this bandwidth reduction, we would just stick with RGB, since it's simple and gives accurate images. Anyway, for YUV the 2 portions of the image are:

1. achromatic (without color) portion: This is called luma (represented by Y'). It represents the brightness of the image, i.e. the black and white portion. Black means no brightness, while white means full brightness. Luma is the weighted sum of the gamma-compressed R'G'B' components of the color video (the primes on RGB denote gamma compression): Y' = a*R' + b*G' + c*B', where a, b, c are coefficients between 0 and 1. These coefficients vary, and different Y' definitions correspond to different coefficient sets. It's hard to capture the achromatic info perfectly with any one set of coefficients, so different standards use different ones (there has been a lot of debate over which coefficients are correct). The weighted sum of linear RGB (without primes) is called relative luminance (or just luminance, Y), and is used in color science: Y = a*R + b*G + c*B. Ideally luma would represent luminance exactly, but that isn't possible, as errors in chroma bleed into luma. (A code sketch of the RGB to Y'CbCr conversion follows item 2 below.)

2. chromatic (color) portion: This is called chroma. It represents the color/hue/phase info of the image. It is further separated into Cb and Cr components, which are the blue-difference and red-difference components respectively. Just like luma, Cb and Cr are linear combinations of R, G, B. And just as luma differs from luminance, chroma differs from chrominance.
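As a concrete example, here is a minimal sketch of the full-range conversion used by JFIF (BT.601 weights a=0.299, b=0.587, c=0.114, with Cb/Cr centered at 128 for 8-bit values); other standards such as BT.709 use different coefficients:

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion as used by JFIF (8-bit values, 0-255).
    Other standards (e.g. BT.709) use different a, b, c coefficients."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b          # luma Y'
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128    # blue-difference chroma
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128    # red-difference chroma
    return y, cb, cr

print(rgb_to_ycbcr(255, 255, 255))   # white -> approx (255, 128, 128): full luma, neutral chroma
```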

In an image, luma is typically paired with chroma. The human eye is more sensitive to luma than to chroma (i.e. it perceives black/white differences more easily than color differences). This characteristic is exploited in chroma subsampling: instead of storing a unique color for each pixel, we share one color across every 2 or every 4 pixels, so the bandwidth required for color drops by a factor of 2 or 4. Chroma subsampling is represented in this notation:

J:a:b => the subsampling scheme is described on a reference grid of pixels J pixels wide by 2 pixels high (i.e. J*2 pixels in total): "a" is the number of chroma (CbCr) samples in the 1st (top) row of pixels, while "b" is the number of changes of chroma samples between the 1st and 2nd rows. NOTE: luma is not subsampled, and is present for every pixel.

These are common subsampling ratios:

4:4:4 => no subsampling, i.e. every pixel keeps its own unique CbCr. Effectively equivalent to RGB (since Y'CbCr was transformed from RGB, if each pixel retains unique CbCr then all color info is retained for all pixels). This mode is used when a TV serves as a computer monitor, since text starts to look blurry if any subsampling is done.

4:2:2 => the chroma components (CbCr) are sampled at half the rate of luma. Two adjoining pixels on a horizontal line (2 pixels total) share the same color.

4:2:0 => the chroma components (CbCr) are sampled at one fourth the rate of luma. Two adjoining pixels on a horizontal line and the 2 pixels directly below them on the next line (4 pixels total) share the same color, so color resolution is halved in both the horizontal and vertical directions. This is the mode most widely used in 4K transmission, since cutting the chroma bandwidth to 1/4 results in almost no visible loss in quality (see the code sketch after this list).
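A toy numpy sketch of 4:2:0 subsampling, assuming the chroma plane has even dimensions; real encoders may filter differently when choosing the shared chroma sample:

```python
import numpy as np

def subsample_420(chroma):
    """4:2:0 - keep one chroma sample per 2x2 block of pixels (here, the block average)."""
    h, w = chroma.shape
    blocks = chroma[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def upsample_420(sub):
    """Nearest-neighbour reconstruction: repeat each stored sample over its 2x2 block."""
    return np.repeat(np.repeat(sub, 2, axis=0), 2, axis=1)

cb = np.arange(16, dtype=float).reshape(4, 4)   # a fake 4x4 Cb plane
print(subsample_420(cb).shape)                  # (2, 2): 1/4 as many chroma samples
```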

Each component of a pixel's luma/color info is called a channel. So for RGB, the 3 channels are R, G, B; for YUV, the 3 channels are Y, U, V.

Image Formats:

Images may be in raster or vector form. Both types of format need to be rasterized (converted into pixels) to be displayed on a monitor. Raster images use a bitmap, which represents a rectangular grid of pixels, with each pixel's color specified by a number of bits (for example, a color may be 24 bits: 8 bits for red, 8 for green, 8 for blue). This grid of pixels forms the colored image. It can be stored as a file (a bitmap image file), or in the computer's video memory for display on a monitor. Modern displays are bit-mapped, where each on-screen pixel directly corresponds to a small number of bits in memory. Vector formats store images as geometric descriptions, and need to be converted into a grid of pixels for display on a monitor; they are less common today. Data storage within an image can be pixel oriented (the color values for each pixel are clustered and stored consecutively) or planar oriented (the color values are stored in separate planes, so in essence each color component is stored as a separate array), as sketched below.
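A small numpy sketch contrasting pixel-oriented (interleaved) and planar storage for a tiny RGB image (the shapes here are purely illustrative):

```python
import numpy as np

h, w = 2, 3                                        # a tiny 2x3 image, just for illustration
interleaved = np.zeros((h, w, 3), dtype=np.uint8)  # pixel oriented: R, G, B stored together per pixel
interleaved[0, 0] = (255, 0, 0)                    # make the top-left pixel pure red

planar = interleaved.transpose(2, 0, 1).copy()     # planar: one contiguous plane per channel, shape (3, h, w)

print(interleaved.shape, planar.shape)             # (2, 3, 3) (3, 2, 3)
print(planar[0, 0, 0], planar[1, 0, 0])            # 255 0 -> top-left value in the red plane vs the green plane
```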

An image file format may store data in uncompressed or compressed forms.

Compressed images: there are 2 classes of compression algorithms: lossless and lossy.

1. Lossy algorithms lose info during compression, so the decompressed image is not an exact replica of the original. The most popular lossy compression method is based on the DCT (discrete cosine transform). A DCT is similar to a Fourier transform in the sense that it produces a kind of spatial frequency spectrum.

2. Lossless algorithms preserve all the info of the original image even in compressed form. However, losslessly compressed images are larger than lossy ones. LZW is the most popular lossless compression method.

A. JPEG (Joint Photographic Experts Group) = stored with a .jpeg or .jpg extension. The most popular image format. It's lossy. It supports 8-bit grayscale and 24-bit color images. A JPEG image consists of a sequence of segments. Codecs are required to encode/decode JPEG images. These are the basic steps of encoding (a detailed example is on Wikipedia: https://en.wikipedia.org/wiki/JPEG):

  • The image is converted from RGB to Y'CbCr, with Y' (luma) representing brightness and CbCr (chroma) representing color. The conversion from RGB to Y'CbCr is specified in the JFIF standard.
  • The resolution of the chroma data is reduced by a factor of 2 to 3, since the eye is less sensitive to fine color detail than to fine brightness detail.
  • However, some implementations do not convert, and instead keep RGB itself, which results in less efficient compression.
  • The image is split into 8×8 blocks for each channel, and each 8×8 block of Y, Cb and Cr data undergoes a DCT. (The minimum coded unit, or MCU, also called a macroblock, spans 8×8 pixels for 4:4:4 with no subsampling; for 4:2:2 it spans 16×8 pixels, since chroma is shared across 2 pixels.) Before the DCT, values are level-shifted from the 0 to 255 range to a range centered on zero (-128 to 127 for 8-bit data).
  • The amplitudes of the frequency components are quantized. Quantization divides each DCT value by a constant and then rounds it. This causes most of the high-frequency components to be rounded to 0, and the remaining values to become small positive/negative numbers, which take fewer bits to represent. This rounding is the only lossy operation in the whole pipeline. A quantization matrix (as specified by JPEG) is used, and this matrix controls the compression ratio; see the sketch after this list.
  • The resulting data for all 8×8 blocks is further compressed with a lossless algorithm, a variant of Huffman encoding.
  • Decoding a JPEG image consists of doing all the above steps in reverse.
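A minimal numpy sketch of the DCT-and-quantize steps above. The quantization matrix here is made up for illustration (real JPEG encoders use standard tables scaled by the quality setting), and the Huffman/entropy-coding step is omitted:

```python
import numpy as np

def dct2(block):
    """Naive orthonormal 8x8 2-D DCT-II (the transform JPEG applies to each block)."""
    n = 8
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))  # basis[u, x]
    basis *= np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)                   # DC row gets the smaller normalization
    return basis @ block @ basis.T

q = 16 + 8 * np.add.outer(np.arange(8), np.arange(8))   # made-up matrix: coarser steps at high frequencies

block = np.random.randint(0, 256, (8, 8)).astype(float) - 128   # level shift 0..255 -> -128..127
coeffs = dct2(block)
quantized = np.round(coeffs / q)                                # rounding is the only lossy step
print(np.count_nonzero(quantized), "of 64 coefficients are non-zero after quantization")
```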

B. GIF (Graphics Interchange Format) = very popular because of its simplicity and age. However, it only supports 8-bit color (256 colors per image). GIF is patent free now, as the patents have expired.

C. PNG (Portable Network Graphics) = created as a free, open-source alternative to GIF.

Ex: For a 3 megapixel camera with 24 bits/pixel, storing a picture in raw format would require 3M * 24 bits = 72 Mbits / 8 = 9 MB of memory. However, stored in JPEG format, it can be reduced in size anywhere from 10X to 100X, depending on how much quality loss is acceptable. A size of 0.3 MB in compressed format still offers pretty good quality.
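Working through those numbers (the 10X to 100X ratios are the assumed range above):

```python
pixels = 3_000_000                    # 3 megapixel sensor
raw_bytes = pixels * 24 // 8          # 24 bits/pixel -> 9,000,000 bytes, i.e. ~9 MB raw
print(f"raw: {raw_bytes / 1e6:.0f} MB")
for ratio in (10, 30, 100):           # assumed JPEG compression ratios
    print(f"{ratio}x compression -> {raw_bytes / ratio / 1e6:.2f} MB")
```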

 

Video:

Video is basically a series of still image frames. It contains both spatial redundancy (within the same frame, as in a still image) and temporal redundancy (across time). Video can be compressed much more effectively than stills, since successive images in a video differ only by small amounts, so only the relative differences between successive frames need to be stored. Spatial compression is called intra-frame compression (within a single frame), while temporal compression is called inter-frame compression (across multiple frames). Temporal compression works best where frames change in a simple manner, so that short commands can tell the decoder to just shift, rotate, lighten or darken a copy of the previous frame; in areas of the video with more motion, more data has to be stored to maintain quality. Various prediction techniques are applied to predict the data of a new frame, and various filters can be applied during both encoding and decoding to soften blurring artifacts and further improve compression. Varying bitrates are used for encoding/decoding, depending on whether more or fewer bits need to be stored for a given sequence of frames. I-frames (intra frames, essentially baseline JPEG images) and P-frames (predicted frames, computed from I-frames) are the names commonly used in video formats. A toy illustration of temporal redundancy follows.
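A toy numpy illustration of temporal redundancy: when only a small region changes between two frames, the frame-to-frame difference is mostly zeros and compresses far better than a full frame (real codecs use block-based motion compensation rather than raw differencing):

```python
import numpy as np

prev = np.zeros((720, 1280), dtype=np.int16)   # previous luma frame (all black)
curr = prev.copy()
curr[100:140, 200:260] += 50                   # brighten a small 40x60 region in the new frame

diff = curr - prev                             # conceptually what a P-frame encodes
changed = np.count_nonzero(diff)
print(f"{changed} of {diff.size} samples changed ({100 * changed / diff.size:.2f}%)")
# ~0.26% of the frame changed; the rest is temporal redundancy the encoder can skip
```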

Video is almost always stored in a lossy format; very high compression can be achieved this way. Storing 1 sec of HD video (1280 columns x 720 rows = 921.6K pixels, approx 1M pixels) at 30 frames/sec with 24 bits/pixel uncompressed would need roughly 3 MB * 30 = 90 MB/sec. If each frame is compressed by at least 10X, that drops to 9 MB/sec, yet 1 hour of such video would still require 3600 * 9 MB ≈ 32 GB of space. A DVD, however, stores 2 hours of movie on a 4 GB disc, which works out to an average of roughly 4-5 Mbit/s. That implies each frame is compressed by well over 100X, and that predicted frames need very little additional data. The arithmetic is worked out below.
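The arithmetic behind that comparison, using the same approximate figures (note that DVD video is standard definition, so this is only a rough upper bound on the compression ratio):

```python
raw_bps = 1280 * 720 * 24 * 30            # raw 720p30 at 24 bits/pixel ~ 664 Mbit/s
dvd_bps = 4e9 * 8 / (2 * 3600)            # 4 GB spread over 2 hours ~ 4.4 Mbit/s on average
print(f"raw: {raw_bps / 1e6:.0f} Mbit/s, DVD average: {dvd_bps / 1e6:.1f} Mbit/s, "
      f"ratio ~{raw_bps / dvd_bps:.0f}x")
```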

Various video coding standards have emerged since the 1980s. A few of the most popular are shown below:

1. H.261: One of the first video coding standards was H.120, created in 1984 and based on DPCM (differential pulse code modulation). It wasn't popular, as its performance was poor. The H.261 standard was then developed, based on DCT lossy compression. It proved to be very popular and was a precursor to subsequent video coding standards such as H.262, H.263, H.264/AVC and H.265/HEVC.

2. H.262: aka MPEG-2 Part 2, as it was developed jointly with MPEG (discussed below).

3. H.263: developed by ITU-T for low-bitrate video communication; the MPEG-4 Part 2 video codec is based on it.

4. H.264: developed as MPEG-4 Part 10, and also known as AVC (Advanced Video Coding). It's the most commonly used video compression standard, and supports resolutions up to 8K UHD. Most streaming video services, such as Netflix, use this standard.

5. H.265: developed in 2013 as MPEG-H Part 2, and also known as HEVC (High Efficiency Video Coding). It is the successor to H.264, and offers much better compression than H.264 for the same video quality. It is mostly targeted at high resolution video up to 8K, and competes with the royalty-free AV1 codec. NOTE: all the above standards require royalty payments, except those whose patents have expired.

6. AV1: developed by the Alliance for Open Media as an open and royalty-free alternative to the MPEG codecs. It builds on VP9, developed by Google, which itself extended VP8. VP10 was planned as the successor to VP9, but its development was folded into AV1. AV1 is intended for use in HTML5 web video.

Audio + Video: Most of the video we watch has associated audio. So files for storing video are containers: a container holds video data in some video coding format, audio data in some audio coding format, plus some metadata. However, the file extension does not uniformly determine which audio/video formats are inside. Containers like Windows Media Video (.wmv) and Flash Video (.flv) have well defined video/audio formats they support, while more general container types like AVI (.avi) and QuickTime (.mov) can contain audio/video in almost any format, making it hard for the end user to determine which codecs are needed to play a given file.

FFmpeg is free software whose project libraries support a huge range of video/audio file formats. The free, open-source VLC media player uses FFmpeg libraries, and so can play almost all video files. A small sketch of probing a container with FFmpeg's ffprobe tool is shown below.
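A hedged sketch of calling FFmpeg's ffprobe tool from Python to see which codecs a container holds (this assumes ffprobe is installed and on the PATH; the file name is a placeholder):

```python
import json
import subprocess

def probe(path):
    """Ask ffprobe (part of the FFmpeg project) which codec each stream in a container uses."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True)
    info = json.loads(result.stdout)
    return [(s.get("codec_type"), s.get("codec_name")) for s in info["streams"]]

# e.g. an .mp4 file might report [("video", "h264"), ("audio", "aac")]
# print(probe("movie.mp4"))   # "movie.mp4" is a hypothetical file name
```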

MPEG (Moving Picture Experts Group) was established in 1988 as a working group of ISO/IEC. It defines standards consisting of different parts (video compression, audio compression, etc.). The various standards it has developed are as follows:

MPEG-1 (1993): designed to compress video and audio down to about 1.5 Mbit/s, though the format can support bitrates up to roughly 100 Mbit/s. It compresses video about 26:1 and audio about 6:1 without excessive quality loss. Used on Video CD (VCD). It introduced the popular MPEG-1 Audio Layer III (MP3) audio compression format, after the earlier Layer I (MP1) and Layer II (MP2) formats. It supported only 2 audio channels (stereo). For video compression, it was based on the H.261 standard. All patents related to MPEG-1 had expired by 2017, so MPEG-1 codecs can now be developed royalty free.

MPEG-2 (1995): based on H.262 for video compression, and on MP3 and the newly introduced AAC standard for audio coding. Moreover, it allowed coding of audio programs with more than 2 channels, up to 5.1 channels.

MPEG-3: found to be redundant, and so merged with MPEG-2 standard.

MPEG-4 (1998): MPEG-4 includes more advanced compression algorithms, resulting in higher computational requirements. It's the most popular video standard. It consists of several standards (called parts, i.e. MPEG-4 Part 2, MPEG-4 Part 10, etc.). MPEG-4 doesn't define a single audio/video compression standard, but lets one choose among various profiles. It has a complex toolbox to perform a wide range of audio compression, from low bit rate (2 Kbit/s) to high quality audio (64 Kbit/s), and similarly for video compression. MPEG-4 was initially targeted at low bit rate video communication, but was later expanded to HD content via Advanced Video Coding (AVC) in Part 10. MPEG-4 Part 2 codecs are used in DivX, Xvid and QuickTime, while MPEG-4 Part 10 (Advanced Video Coding, or MPEG-4 AVC) is used by Nero, HD discs, Blu-ray discs, etc. Part 10 is the same standard as H.264 video compression. MPEG-4 also standardized DRM (digital rights management). MPEG-4 contains patented tech, so royalty fees are required.

Containers: the standards above are packaged into containers, often named after them.

MPEG containers: video files based on MPEG-1/MPEG-2 go in .mpg or .mpeg containers, MPEG-4 in .mp4, .m4v or .m4p containers, etc.

Flash containers: the Adobe Flash Video format (.flv) carries audio/video such as MP3 and H.264, while the newer .f4v format is based on the ISO base media file format. Flash video was for a long time the de facto standard for web-based streaming video.

Real Media containers: the RealMedia container (.rm) uses proprietary video and audio formats from RealNetworks, and can be played by their player, RealPlayer.

VOB container: (.vob) most commonly seen in the VIDEO_TS folder of a DVD; it contains MPEG-1/MPEG-2 video and MP2/AC-3 audio.

WebM container: uses VP9 video and Opus audio (earlier VP8 and Vorbis) in a .webm container. It is one of the formats served by YouTube. VP9 was developed by Google and is royalty free.