Towards Ultra-Low-Bitrate Video Conferencing Using Facial Landmarks

Pin-Chun Wang, Ching-Ling Fan, Chun-Ying Huang, Kuan-Ta Chen, and Cheng-Hsin Hsu



Providing a high-quality video conferencing experience over the best-effort Internet and wireless networks is challenging, because 2D videos are bulky. In this paper, we exploit the common structure of conferencing videos to build an ultra-low-bitrate video conferencing system. In particular, we design, implement, optimize, and evaluate a video conferencing system that: (i) extracts facial landmarks, (ii) transmits the selected facial landmarks and 2D images, and (iii) warps the untransmitted 2D images at the receiver. Several optimization techniques are adopted to minimize the running time and maximize the video quality, e.g., the image and warped frames are optimally determined based on network conditions and video content. The experimental results from real conferencing videos reveal that our proposed system: (i) outperforms the state-of-the-art x265 by up to 11.05 dB in PSNR (Peak Signal-to-Noise Ratio), (ii) adapts to different video content and network conditions, and (iii) runs in real time at about 12 frames per second.
Keywords: Data compression; facial landmarks; facial models; video conferencing; video codecs; optimization
Figure 1: Illustrations of our low-bitrate video conferencing system. Only one-way video streaming is shown for brevity. Sample reconstructed video frames from our and image-based (x265) systems are given.

1  Introduction

Recent market research [19] projects that the market for video conferencing systems will grow from 3.31 billion USD in 2013 to 6.40 billion USD in 2020, at a compound annual growth rate of 9.3%. Providing a high-quality video conferencing experience, however, is challenging, because video conferencing, like many other real-time interactive applications, is resource demanding. For example, Skype recommends bitrates between 700 Kbps and 2.5 Mbps for 1280x720 video calls using H.264 codecs [13]. Guaranteeing such high end-to-end bandwidth is no easy task in the best-effort Internet; doing so in shared wireless networks, such as WiFi and cellular networks, is even more difficult due to congestion, channel fading, shadowing, and interference. In this paper, we design an ultra-low-bitrate high-quality video conferencing system for commodity computers by analyzing the structures of typical conference video frames and aggressively skipping redundant information. By typical conference videos, we mean a talking head in a conference room, although our system can be generalized to multiple participants in the same conference room. We make a crucial observation: in these videos, the major movements come from talking heads. To leverage this observation for bitrate reduction, we may: (i) transmit a snapshot image of talking heads at the beginning of each video conferencing session as a reference image, which is referred to as a base image, and (ii) describe the talking heads using facial landmarks [17], in order to synthesize facial expressions without sending (bulky) 2D images in all video frames. In particular, we propose such a system (one-way for brevity) in Fig. 1.
We divide all video frames from the Webcam into two groups: (i) image frames that are transmitted as regular video frames encoded with video codecs and (ii) warped frames that are synthesized using image frames and facial landmarks, such as key feature points on the edges of the eyes, nose, and mouth. Selected landmarks and image frames are sent to the receiver. The receiver reconstructs the warped frames and sequentially plays all video frames. Fig. 1 also presents sample reconstructed frames from our proposed system and those from a conventional (image-based) video codec at 25 Kbps. The blocking artifacts in the sample frames from the image-based video codec are clearly visible, leading to degraded video conferencing quality.
We emphasize that our goal is very aggressive, as we aim to provide acceptable video conferencing quality at 30 Kbps, which is even lower than the bitrate of some audio codecs, such as G.711. To cope with this challenge, we carefully design and implement the individual components in the proposed system. The crux of the whole system is the frame type selector, which analyzes the expected video quality of sending the current video frame as: (i) an image frame or (ii) a warped frame under the bandwidth constraints. The frame type selector then makes the decision based on the analysis results, so as to maximize the overall video quality and thus the conferencing experience. We implement the proposed system and conduct experiments with real conferencing videos from several subjects. The experimental results show the merits of our system, for example, it: (i) outperforms the state-of-the-art x265 by up to 11.05 dB in PSNR (Peak Signal-to-Noise Ratio), (ii) adapts to different video content and network conditions, and (iii) runs in real time at about 12 fps (frames per second).

2  Related Work

Qi et al. [16] propose to skip transmitting some video frames at the sender, and employ 2D interpolation to synthesize the skipped video frames at the receiver. Different from our system, their solution ignores the fact that facial expressions of the talking head are the most crucial content in video conferences, and may lose the opportunity to exploit the redundancy among faces in adjacent frames. Facial models may be used to convey more facial details, e.g., Allen et al. [2] extract facial model parameters for higher compression ratios, and Zeng et al. [23] propose a solution to emphasize the appearance of mouths and eyes during video conferences. These studies [2,23] transmit potentially duplicated texture parameters across adjacent video frames. In contrast, we aggressively avoid sending repeated information by using image frames and landmark coordinates. An early study [10] transmits images from three different angles of a face, and uses manually selected landmark coordinates for frame interpolation. The authors find that the facial expressions are not clear, and apply Principal Component Analysis (PCA) to transmit the regions of the eyes and mouth. Our work also leverages base images, similar to Koufakis and Buxton [10]; however, their system requires human intervention, and thus is not suitable for interactive video conferences. MPEG-4 [14] provides coding tools for face animation. In particular, high-level expressions, such as joy, anger, and sadness, can be encoded with a given facial model. However, extracting these high-level expressions is time-consuming (3.72 fps at 352x288 on smartphones [18]), rendering it unsuitable for real-time video conferences. MPEG-4 also supports low-level expressions, which are encoded using six parameters, such as the distance between upper and lower eyelids. Six parameters are too few, and lead to degraded facial expressions, as reported in Zeng et al. [23].
Compared to the high- and low-level MPEG-4 coding tools, our system runs faster and encodes more facial details, respectively.

3  Facial Landmark-Based Video Conferencing System

3.1  Overview

Figure 2: The architecture of our proposed system.
Fig. 2 presents the architecture of our ultra-low-bitrate video conferencing system. Different from conventional video conferencing systems, the proposed system consists of five components: (i) Landmark Detector, (ii) Video Codec, (iii) Landmark Codec, (iv) Image Warper, and (v) Frame Type Selector. The interactions among those components are as follows. The original video frames are captured by a Webcam, and sent to the landmark detector, which outputs the coordinates of detected landmarks. Afterwards, the landmark codec quantizes and compresses the landmarks and transmits them to the image warper. The image warper synthesizes the latest image frame based on the quantized landmarks and passes the image to the frame type selector. We assume the very first base image is sent to the receiver right before each video conference, which is used to bootstrap frame warping. The frame type selector compares the video quality between the warped frame and the image frame to decide the frame type. Following the decision, the image frames and facial landmarks are transmitted as needed. At the receiver side, the image frames are decoded by the video codec and the warped frames are reconstructed from the decompressed landmarks and the last received image frame. Among the five components, the frame type selector hosts the intelligence of maximizing video quality. We present the designs of the other four components in Sec. 3.2, and the details of the frame type selector in Sec. 3.3.

3.2  Component Designs (Except Frame Type Selector)

Landmark Detector. We use facial models to derive the landmarks on each frame. Facial models can be roughly classified into 3D [3,5] and 2D [7,8] ones. We adopt 2D facial models since the illumination and talking head features are rather static in video conferencing. In particular, we consider the two most popular 2D facial models: Active Appearance Model [7] (AAM) and Constrained Local Model [8] (CLM). The goal of AAM is to fit a static shape and appearance model to a new image. Through iterations, AAM fits the models and computes the coordinates of the landmarks. In contrast, CLM builds shape models from labeled training data. Generally, AAM performs better on face alignment and rotated faces, and CLM captures more delicate changes across different expressions. Considering the properties of conferencing videos (more facial expressions, fewer changes in face orientation), we choose CLM with 68 landmarks [4] for better video quality.
Image Warper. The image warper synthesizes frames based on landmarks and prior image frames. We first use the Delaunay triangulation algorithm [11] to divide faces (excluding the background region) into triangles based on the landmarks. Then, an affine transformation is applied to each triangle. This process can be viewed as projecting those triangles onto new surfaces pixel by pixel. Moreover, the affine function is well suited to image mapping when the transformed regions are small, due to its geometric properties. In our application, the affine transformation not only solves the deformation problem on facial regions, but also incurs lower computational complexity because of its linearity.
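The per-triangle mapping described above can be illustrated with a minimal sketch. It assumes that each Delaunay triangle is given by three landmark positions in the image frame (the source) and the same three landmarks in the frame to be synthesized (the destination), and it solves for the six affine coefficients with Cramer's rule. The function names affine_from_triangles and apply_affine are hypothetical and are not taken from the paper's implementation, which relies on OpenCV.

```python
def affine_from_triangles(src, dst):
    """Solve for the affine map that sends the three src vertices to dst.

    Returns ((a, b, c), (d, e, f)) such that
        x' = a*x + b*y + c
        y' = d*x + e*y + f
    """
    (x1, y1), (x2, y2), (x3, y3) = src
    # Determinant of [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]].
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source triangle is degenerate (collinear vertices)")

    def solve(v1, v2, v3):
        # Cramer's rule for one output coordinate (v1, v2, v3 are targets).
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r

    row_x = solve(dst[0][0], dst[1][0], dst[2][0])
    row_y = solve(dst[0][1], dst[1][1], dst[2][1])
    return row_x, row_y


def apply_affine(coeffs, point):
    """Map one (x, y) point with the coefficients returned above."""
    (a, b, c), (d, e, f) = coeffs
    x, y = point
    return a * x + b * y + c, d * x + e * y + f
```

Because the map is linear in the pixel coordinates, every pixel inside a triangle is moved with two multiply-add pairs, which is consistent with the low complexity noted above. A full warper would additionally rasterize each triangle and interpolate pixel values, which this sketch omits.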
Video Codec. We adopt the state-of-the-art H.265 codec, which achieves higher coding efficiency than earlier codecs, such as H.264 and MPEG-4. For example, Ohm et al. [12] report that H.265 achieves 50% bitrate reduction compared to H.264 in objective tests. Their subjective tests show even larger gaps. If codec availability is a concern, any other low-complexity 2D video codec can also be used in our system.
Landmark Codec. Since coordinates of the same landmark exhibit temporal redundancy across neighboring frames, we encode the landmark coordinates as deltas instead of raw numbers, as detailed below. First, we send the number of landmarks in each video frame, which is an 8-bit unsigned integer. This is followed by a series of video frames. Each frame starts with a timestamp as a 16-bit unsigned integer, followed by a series of landmarks. The landmarks are first normalized to the width/height of the video resolution (i.e., between 0 and 100%), and then quantized into units of one ten-thousandth. For the first frame, we store each landmark as a 14-bit integer. The landmarks in the following frames are stored as deltas, each of which is either a short or a long delta, as indicated by a flag bit. Using real conferencing videos (detailed in Sec. 4), we analyze the distributions of landmark deltas. We find that 4-bit short deltas cover 76.2% of all landmark deltas, and 7-bit long deltas cover an additional 24.5% of them. Since there are only 0.3% remaining landmark deltas, we decide to encode them in 7 bits, and carry over the residues to the next frame in the worst case. With the proposed representation format, we reduce the size of landmark coordinates by about 30%. The resulting landmark representations are compressed by the 7z [1] algorithm before being transmitted.
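The delta representation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' bitstream: it assumes deltas are signed two's-complement values (so a 4-bit short delta spans -8..7 and a 7-bit long delta spans -64..63), and it realizes the residue carry-over by having the encoder track the values the decoder will reconstruct, so that any clamped remainder reappears in the next frame's delta. The timestamp, landmark count, and 7z stage are omitted, and all function names are hypothetical.

```python
SHORT_BITS = 4   # short delta width, from the text
LONG_BITS = 7    # long delta width, from the text
RAW_BITS = 14    # first-frame coordinate width, from the text


def quantize(pixel, extent):
    """Normalize a pixel coordinate by the frame extent, then scale to 0..10000."""
    return round(pixel / extent * 10000)


def signed_range(bits):
    """Inclusive range of a signed two's-complement field (an assumption)."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def encode_frame(recon, curr):
    """Encode one frame of quantized coordinates against the decoder state.

    recon: values the decoder currently holds; curr: true quantized values.
    Returns (symbols, new_recon, bits), where each symbol is (flag, delta)
    with flag 0 for a short delta and 1 for a long delta.
    """
    s_lo, s_hi = signed_range(SHORT_BITS)
    l_lo, l_hi = signed_range(LONG_BITS)
    symbols, new_recon, bits = [], [], 0
    for r, c in zip(recon, curr):
        delta = c - r
        if s_lo <= delta <= s_hi:
            flag, sent, width = 0, delta, SHORT_BITS
        else:
            # Clamp oversized deltas; the remainder is carried to the next frame.
            flag, sent, width = 1, max(l_lo, min(l_hi, delta)), LONG_BITS
        symbols.append((flag, sent))
        new_recon.append(r + sent)
        bits += 1 + width  # one flag bit plus the delta field
    return symbols, new_recon, bits


def decode_frame(recon, symbols):
    """Apply received deltas to the decoder state."""
    return [r + sent for r, (_, sent) in zip(recon, symbols)]
```

For instance, a coordinate at pixel 640 in a 1280-pixel-wide frame quantizes to 5000, which fits in the 14-bit first-frame field because 2^14 = 16384 exceeds 10000. A jump of 100 units is sent as a clamped long delta of 63, and the remaining 37 units are sent in the following frame.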

3.3  Maximizing Video Quality Using Frame Type Selector

We design the frame type selector to be content dependent and network adaptive. It dynamically instructs: (i) the sender to transmit an image frame or (ii) the receiver to synthesize a warped frame, so as to maximize the video quality of every single frame. In particular, the sender keeps track of the available bitrate, or bit budget, of each video frame. The sender encodes the current video frame using the bit budget and also simulates the warping procedure at the receiver side. The frame type selector then chooses the type (image or warped frame) that leads to higher video quality. Notice that we only transmit the landmarks whenever needed. This is because warping each frame only requires the landmarks of the latest image frame and the current frame. By deferring the transmission of the landmarks of image frames, we never transmit the landmarks that will not be used. We note that our frame type selector implicitly determines the frequency of sending an image frame based on the video content characteristics and network resource availability, which is the core research problem to optimize the proposed ultra-low-bitrate video conferencing system.
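The per-frame decision can be sketched as below, assuming PSNR is the quality metric and frames are given as flat sequences of 8-bit pixel values. The sender is assumed to already hold two candidates for the current frame: the image frame encoded and decoded under the bit budget, and the warped frame simulated locally. The names psnr and select_frame_type are hypothetical, and breaking ties in favor of the warped frame (which costs fewer bits) is an assumption not stated in the paper.

```python
import math


def psnr(reference, candidate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def select_frame_type(original, encoded_image, warped):
    """Return ("image" | "warped", quality_dB) for the better candidate.

    encoded_image: the frame after encoding/decoding within the bit budget.
    warped: the frame the receiver would synthesize from landmarks.
    Ties go to the warped frame, which is an assumption of this sketch.
    """
    q_image = psnr(original, encoded_image)
    q_warped = psnr(original, warped)
    if q_image > q_warped:
        return "image", q_image
    return "warped", q_warped
```

Under this rule, a tight bit budget lowers the quality of the encoded image frame and therefore shifts more decisions toward warped frames, which is how the selector becomes network adaptive without an explicit schedule of image frames.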

4  Experiments

4.1  Setup

We have implemented our proposed video conferencing system in C++ using the CLM [4] and OpenCV [6] libraries. Our system has five components: (i) landmark detector, which analyzes video frames and generates the landmarks for each video frame, (ii) frame type selector, which decides whether each frame is sent as an image frame or a warped frame under different total bitrate constraints, (iii) encoder, which compacts and compresses landmarks, and invokes x265 [22] to encode image frames according to the frame type and the per-frame bitrate constraints, (iv) decoder, which decodes the images and landmarks, and (v) warper, which generates intermediate synthesized images with OpenCV [6]. The base images are compressed into JPEG, and the x265 configurations are ultrafast preset, zero latency tuning, and IPPP … structure, if not otherwise specified.
Figure 3: The video quality of a sample video under different bitrates.
Figure 4: Our system outperforms the image-based system in terms of conferencing video quality, in: (a) PSNR and (b) SSIM.
Figure 5: Our system automatically adapts to diverse: (a) network conditions and (b) video content.
Figure 6: Better base image quality leads to higher conferencing video quality. Sample results in PSNR are shown.
We recruit nine subjects in our university, and record nine videos at 1280x720 using commodity Webcams and computers for our experiments (see footnote 1). When recording the videos, we ask the subjects to talk as if they are in video conferences. Each video has 300 frames and lasts for 10 seconds. We adopt the following performance metrics: video quality in PSNR and SSIM, warping ratio, and running time.
We conduct the experiments with the above-mentioned conferencing videos in our ultra-low-bitrate system (denoted as Our System in figures). Our system adaptively sends encoded images and landmarks according to the expected video quality. To the best of our knowledge, our system is the first complete system of its kind, e.g., although Zeng et al. [23] also use facial models, they do not propose a rate control mechanism, and thus their work cannot serve as the baseline system. For comparison, we also run the same experiments with the x265 [22] codec, which is the state-of-the-art image-based codec (denoted as Image-Based in figures). For each conferencing video, we vary the bitrate among {25, 30, 40, 50, 60, 100} Kbps, and the base image size among {20, 40, 80} KB. We let the bitrate be 40 Kbps and the image size be 80 KB if not otherwise specified. We run the experiments on a Linux workstation with an Intel i7 CPU at 3.6 GHz and 8 GB RAM. Our system works with different video quality metrics. Results from optimizing video quality in PSNR are shown by default; only some SSIM results are presented due to space limitations.

4.2  Results

Our system outperforms the image-based system. Fig. 3 plots the PSNR values of a sample video under different bitrates. In both systems, higher bitrates lead to better conferencing video quality. However, our system consistently outperforms the image-based system. For example, our system achieves 30 dB in PSNR at merely 25 Kbps (see footnote 2), while the image-based system requires 60 Kbps (2.4 times as much). Next, we calculate the quality improvements in PSNR and SSIM of all conferencing videos and plot the mean improvements with 95% confidence intervals in Fig. 4. This figure shows that our system outperforms the image-based system at all bitrates. The mean PSNR improvement of all conferencing videos is up to 7.5 dB, while the mean SSIM improvement is up to 0.08. For individual conferencing videos, the maximal PSNR improvement is 11.05 dB, and the maximal SSIM improvement is 0.15.
Our system adapts to network conditions and video content. We plot the average warping ratios at different bitrates in Fig. 5(a). In this figure, the average warping ratio increases as the total bitrate decreases. This shows the effectiveness of our system under different network conditions, because warped frames consume smaller bit budgets. Fig. 5(b) plots the ratio of the warped frames of all conferencing videos under different bitrates. This figure shows that our system results in diverse warping ratios with different video content. A closer look indicates that our system selects more warped frames when sending more active conferencing videos; and it selects fewer warped frames when sending more static ones. Hence, our system is effective under different video content.
Implications of base image quality. We plot the average PSNR values across all conferencing videos with 95% confidence intervals in Fig. 6. We observe that, compared to the image-based system, our system achieves higher average video quality with all considered base image sizes, and the gap becomes slightly smaller if we reduce the base image size from 80 to 20 KB. More precisely, among all conferencing videos, our system outperforms the image-based system in 100%, 91%, and 83% of the conferencing videos, with the base image size of 80, 40, and 20 KB.
Running time. We measure and report the per-component running time. On average, the running times of landmark detection, landmark encoding/decoding, image warping, image encoding/decoding, and video quality assessment are 38, 1, 28, 13, and 5 ms, respectively. We note that some components are not invoked for every single video frame and some components may be pipelined. Our analysis reveals that the current (unoptimized) prototype system achieves up to 12 fps. Several optimization techniques can be applied to further increase the frame rate, e.g., hardware H.265 codecs, which may soon become commodity, run much faster than the x265 software used in our experiments.
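As a back-of-the-envelope check of how the per-component times relate to the reported frame rate, the sketch below sums the five averages under the assumption that every component runs once per frame, one after another. It also computes a hypothetical upper bound for a fully pipelined design, in which throughput is limited only by the slowest stage. Both models are simplifications introduced here and are not necessarily the analysis performed by the authors.

```python
# Average per-component running times in milliseconds, as reported above.
STAGE_MS = {
    "landmark detection": 38,
    "landmark encoding/decoding": 1,
    "image warping": 28,
    "image encoding/decoding": 13,
    "video quality assessment": 5,
}


def serial_fps(stage_ms):
    """Frame rate if all stages run back to back for every frame (assumption)."""
    return 1000.0 / sum(stage_ms.values())


def pipelined_fps_bound(stage_ms):
    """Upper bound if stages overlap perfectly: limited by the slowest stage."""
    return 1000.0 / max(stage_ms.values())


if __name__ == "__main__":
    print(f"serial: {serial_fps(STAGE_MS):.2f} fps")
    print(f"pipelined bound: {pipelined_fps_bound(STAGE_MS):.2f} fps")
```

The serial total is 85 ms per frame, or roughly 11.8 fps, which is consistent with the reported figure of about 12 fps. Landmark detection at 38 ms dominates, so it caps the idealized pipelined rate at roughly 26 fps.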

5  Conclusion and Future Work

In this paper, we designed, implemented, optimized, and evaluated an ultra-low-bitrate video conferencing system that extracts facial landmarks, compresses and transmits facial landmarks and images, and warps the untransmitted images. Various optimization techniques were proposed to maximize the video quality. Using real conferencing videos, we conducted experiments to quantify the performance and limitations of our proposed system. The results show that our system: (i) outperforms the state-of-the-art x265 by up to 11.05 dB in PSNR, (ii) adapts to different video content and network conditions, and (iii) runs in real-time at 12 fps. We believe the lessons learned when developing the proposed system will stimulate future research in this direction.
The presented work can be extended in several directions. First, multiple base images, e.g., with different facial expressions, may be sent and cached, so that the frame type selector may choose the base image that produces the highest warped frame quality. Second, a compression algorithm specifically designed for landmarks can be developed. Currently, landmarks are encoded using a generic compression algorithm, leading to a bitrate of 21 Kbps on average. A customized compression algorithm that takes the landmark structure into consideration, or even drops some less-critical landmarks, may achieve a lower landmark bitrate. Last, we plan to further speed up the individual components, especially the warper, using multi-threading, GPUs, and techniques proposed in the literature [9,15].

6  Acknowledgements

This work was partially supported by the Ministry of Science and Technology of Taiwan under the grants: 103-2221-E-001-023-MY2, 103-2221-E-009-230-MY2, 102-2221-E-007-062-MY3.


[1] 7-zip official site, 2015.
[2] N. Allen, B. Naidoo, and S. McDonald. Model-based compression for low-bitrate comms: A statistical approach to facial video encoding. In Proceedings of Southern Africa Telecommunication Networks and Applications Conference (SATNAC), Sep. 2006.
[3] A. Ansari and A. Mohamed. 3D face modeling using two views and a generic face model with application to 3D face recognition. In Proceedings of IEEE Advanced Video and Signal Based Surveillance (AVSS), Jul. 2003.
[4] T. Baltrušaitis, P. Robinson, and L. Morency. 3D constrained local model for rigid and non-rigid facial tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2012.
[5] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), Aug. 1999.
[6] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[7] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):681-685, Jun. 2001.
[8] D. Cristinacce and T. Cootes. Feature detection and tracking with constrained local models. In Proceedings of the British Machine Vision Conference (BMVC), Sep. 2006.
[9] V. Fuetterling, C. Lojewski, and F. Pfreundt. High-performance delaunay triangulation for many-core computers. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics (HPG), Aug. 2014.
[10] I. Koufakis and B. Buxton. Very low bit rate face video compression using linear combination of 2D face views and principal components analysis. Image and Vision Computing, 17(14):1031-1051, Jan. 1999.
[11] D. Lee and B. Schachter. Two algorithms for constructing a delaunay triangulation. International Journal of Computer and Information Sciences, 9(3):219-242, Feb. 1980.
[12] J. Ohm, G. Sullivan, H. Schwarz, T. Tan, and T. Wiegand. Comparison of the coding efficiency of video coding standards including high efficiency video coding HEVC. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1669-1684, Dec. 2012.
[13] Plan network requirements for Skype for business 2015, Sep. 2015.
[14] A. Puri and A. Eleftheriadis. MPEG-4: An object-based multimedia coding standard supporting mobile applications. Mobile Networks and Applications, 3(1):5-32, Jun. 1998.
[15] M. Qi, T. Cao, and T. Tan. Computing 2D constrained delaunay triangulation using the GPU. IEEE Transactions on Visualization and Computer Graphics, 19(5):736-748, May. 2013.
[16] X. Qi, Q. Yang, D. Nguyen, G. Zhou, and G. Peng. LBVC: towards low-bandwidth video chat on smartphones. In Proceedings of ACM Multimedia System Conference (MMSys), Mar. 2015.
[17] D. Rathod, A. Vinay, S. Shylaja, and S. Natarajan. Facial landmark localization - a literature survey. International Journal of Current Engineering and Technology, 4(3):1901-1907, Jun. 2014.
[18] M. Suk and B. Prabhakaran. Real-time facial expression recognition on smartphones. In Proceedings of the IEEE Applications of Computer Vision (WACV), Jan. 2015.
[19] Video conferencing market to expand at 9.3% CAGR to 2020 thanks to increasing usage in healthcare and defense, Jul. 2015.
[20] Y. Wang, J. Ostermann, and Y. Zhang. Video Processing and Communications. Prentice Hall, 2001.
[21] Z. Wang, L. Lu, and A. Bovik. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication, 19(2):121-132, Feb. 2004.
[22] x265 HEVC Encoder official site.
[23] W. Zeng, M. Yang, and Z. Cui. Ultra-low bit rate facial coding hybrid model based on saliency detection. Journal of Image and Graphics, 3(1):25-29, Jun. 2015.


1. To be realistic, we consider users who feel comfortable talking in diverse ways, e.g., some of them constantly move their heads, and others are rather static.
2. 30 dB is considered as good quality [20].

Sheng-Wei Chen (also known as Kuan-Ta Chen) 
Last Update August 01, 2019