Smart Beholder: An Open-Source Smart Lens for Mobile Photography

Chun-Ying Huang, Chih-Fan Hsu, Tsung-Han Tsai, Ching-Ling Fan,
Cheng-Hsin Hsu, and Kuan-Ta Chen



Smart lenses are detachable lenses connected to mobile devices via wireless networks. They are not constrained by the small form factor of mobile devices, and have the potential to deliver better photo (video) quality. However, the viewfinder previews of smart lenses on mobile devices are difficult to optimize, due to the strict resource constraints on smart lenses and fluctuating wireless network conditions. In this paper, we design, implement, and evaluate an open-source smart lens, called Smart Beholder. It achieves three design goals: (i) cost effectiveness, (ii) low interaction latency, and (iii) high preview quality by: (i) selecting an embedded system board that is just powerful enough, (ii) minimizing per-component latency, and (iii) dynamically adapting the video coding parameters to maximize Quality of Experience (QoE), respectively. Several optimization techniques, such as an anti-drifting mechanism for video frames and a QoE-driven resolution/frame rate adaptation algorithm, are proposed in this paper. Our extensive measurement study indicates that Smart Beholder outperforms Altek Cubic and Sony QX100 in terms of lower bitrate, lower latency, slightly higher frame rate, and better preview quality. We also demonstrate that Smart Beholder adapts to network dynamics. Smart Beholder has been made public as an experimental platform for researchers and developers working on optimized smart lenses and other embedded real-time video streaming systems.

1  Introduction

The popularity of smartphones has grown dramatically in the past few years, and the growth rate shows no sign of slowing down. For example, a recent report [34] indicates that more than 1 billion smartphones were shipped in 2013, a 38.4% increase compared to 2012. Smartphones come with cameras, and casual photographers use them in place of digital cameras. In fact, we observe clear drops in the number of shipped digital cameras starting from 2012, and the projected shipment volume of digital cameras is only 54 million in 2014 [14]. This replacement effect may be attributed to the convenience of smartphones and the shrinking performance gap between smartphone cameras and digital cameras. Nevertheless, there are still well-known reasons that differentiate digital cameras from smartphone cameras [1]. First, smartphones must be compact in order to fit into users' pockets. Therefore, most smartphone cameras are not equipped with bulky optical zoom lenses, and users have to resort to suboptimal digital zoom. Second, smartphones often come with smaller optical sensors due to space concerns, which leads to inferior photo quality, especially under low-light conditions. Third, smartphones do not support interchangeable lenses, and cannot adopt long-focus, macro, fish-eye, and wide-angle lenses for high-quality and more extreme needs. Last, smartphone cameras are not normally adjustable in terms of, e.g., ISO, aperture, and shutter speed, which makes them inflexible. These limitations prevent mobile photographers from producing high-quality photos using their smartphones.
Smart lenses, such as Sony QX100 [11], Kodak SL10 [30], and Altek Cubic [2], are detachable lenses connected to mobile devices via wireless networks. Mobile photographers use smartphones (or tablets) to access the smart lenses for: (i) previewing photos (or videos) in live viewfinders, (ii) adjusting various lens configurations, (iii) capturing photo (or video) shots, and (iv) applying digital effects. Since smart lenses are no longer embedded in smartphones, the form factor of smart lenses is not limited by that of smartphones. Therefore, smart lenses are capable of addressing the aforementioned limitations, closing the equipment gap between mobile and professional photographers. Moreover, some special shooting angles, such as low-angle and close-up shots, are easier to take with smart lenses as they are detached from the viewfinders on smartphones. While smart lenses offer such new opportunities to mobile photographers, delivering a good photo-taking experience is not an easy task because smart lenses are connected to smartphones via wireless networks, which are sensitive to fading, shadowing, and interference. In addition, users have two expectations: (i) low interaction delay and (ii) high graphics quality, which conflict with each other for several reasons. For example, while complex motion estimation algorithms lead to good graphics quality, they also result in long interaction delay. Last, smart lenses are often implemented on resource-scarce embedded systems, which further complicates the design, development, and implementation of smart lenses for diverse applications. However, existing commercial smart lenses are proprietary and closed, and do not allow the customization and parameter tuning needed to explore the design space.
Figure 1: The working Smart Beholder prototype.
In this paper, we design, implement, and evaluate an open-source smart lens, called Smart Beholder, or Beholder for short. The term "beholder" normally refers to a fictional flying orb with a large eye (commonly seen in AD&D games) [9], in analogy to very portable smart lenses with powerful (but potentially bulky) optical lenses. We carefully design and implement Smart Beholder for cost effectiveness, low interaction delay, and high viewfinder preview quality. This is done by (i) selecting an embedded system board that is just powerful enough, (ii) minimizing per-component latency, and (iii) dynamically adapting the video coding parameters to maximize Quality of Experience (QoE), respectively. Several optimization techniques, such as an anti-drifting mechanism for video frames and a QoE-driven resolution/frame rate adaptation algorithm, are proposed and implemented.
Smart Beholder is built to be an open platform for researchers and developers to evaluate different design alternatives, so as to make educated, if not optimal, design decisions. Furthermore, we conduct real experiments to compare the performance of Smart Beholder and two commercial products [11,2], which were the only two smart lenses available when we started this project in 2014. Our experiment results reveal some limitations of the commercial products, e.g., the preview video quality in viewfinders is low, which leaves room for improvement. We also thoroughly evaluate the performance of a complete Smart Beholder platform as illustrated in Figure 1. In this picture, the server runs on a Raspberry Pi board [29] on the left; the client runs on an Android tablet showing what the webcam on the server points to. Our evaluation results show the practicality and efficiency of the Smart Beholder platform compared with the considered commercial products.
This paper makes the following contributions:

2  Related Work

Smart lenses are loosely related to mobile photography and camera sensor networks. Mobile photography refers to using smartphones for photo taking, which has attracted considerable attention in several application domains, such as health care [33] and ethnography [15]. Existing mobile photography studies rely on built-in cameras of smartphones, and can be extended by attaching smart lenses. Camera sensor networks consist of motes with camera sensors and network interfaces, and transmit captured videos over multi-hop wireless networks to one or multiple clients [12,36]. Camera sensor networks focus more on multi-hop routing, while smart lenses support single-hop transmission of high-quality photos (videos) to smartphones (tablets).
Remote screen sharing systems impose similar requirements as smart lenses: (i) low interaction delay and (ii) high video quality, but Chang et al. [6] show that earlier screen sharing systems [4,35,22] fail to concurrently achieve these two goals. To cope with this limitation, several companies offer streaming-based cloud gaming platforms [26,13,32], and multiple research groups also develop open-source cloud gaming [19] and screen sharing [5] platforms. Some platforms adopt adaptive video streaming technology to optimize QoE under various network conditions [17]. These remote screen sharing systems are not designed for resource-scarce smart lenses, and they assume the Internet infrastructure is always available. Our proposed Smart Beholder platform is, in contrast, tailored for smart lenses.
Although the performance evaluations of smart lenses have never been done, similar measurement methodologies have been proposed for remote screen sharing systems [31,21], cloud gaming platforms [8,7], and screencast technologies [16,18]. Nonetheless, the existing measurement methodologies work on videos captured from the frame buffer, while the methodology proposed in this paper considers real-time videos captured from camera sensors. The measurement methodology is useful in its own right, e.g., to evaluate the commercial smart lens products which are proprietary and closed.
Figure 2: The server and client architecture of Smart Beholder.
Table 1: Candidate Embedded System Boards

| | Arduino | Raspberry Pi (B) | UDOO | BeagleBoard | Pandaboard | Jetson TK1 |
|---|---|---|---|---|---|---|
| CPU | ATmegaAVR, ARM Cortex-M3 | ARM1176JZF-S | ARM Cortex-A9, ARM Cortex-M3/M4 | ARM Cortex-A8 | ARM Cortex-A9 | ARM Cortex A15 |
| GPU | None | Broadcom VideoCore IV | Integrated graphics | PowerVR SGX530 | SGX540 graphics | 192 SM3.2 CUDA cores |
| I/O port | Regular USB | USB 2.0 | USB 2.0 | USB 2.0 | USB 2.0 | USB 3.0 |
| HW encoder | None | H.264 | H.264 | H.264, MPEG4 | H.264, MPEG4 | H.264, VC-1, VP8 |
| Memory | 16 - 512 KB | 512 MB | 512 MB - 1 GB | 256 - 512 MB | 1 GB | 2 GB |
| Price | $13 - $60 | $35 | $135 | $49 - $149 | $174 - $182 | $192 |
| Camera module | Yes | Yes | Yes | Yes | Yes | No |
Smart Beholder is released with two types of software packs: all-in-one and pre-compiled binary packs. In addition, the source code and complete documentation are available on our project website. Users may extend Smart Beholder to support other hardware platforms running embedded Linux. Banana Pi-D1 [3] is an open IP camera project based on a different embedded system board. Compared to Banana Pi-D1, Smart Beholder has been optimized by solving various research problems described in the paper, e.g., constructing a QoE model for preview adaptation, and minimizing latency by reducing memory copies. These optimization techniques can also be applied to Banana Pi-D1 and other similar projects.

3  Proposed System Architecture

The server and client architecture of the proposed Smart Beholder is given in Figure 2. Smart Beholder is inspired by cloud gaming and screen sharing platforms [19,5], but concentrates on solving the unique challenges of smart lenses, including (i) resource constraints of embedded system boards, (ii) uncertainty of single-hop short-range networks, and (iii) high overhead of external camera modules. The Smart Beholder server runs on an embedded system board, and consists of three software components: AP (Access Point) service, DHCP (Dynamic Host Configuration Protocol) server, and video streamer. The AP service turns the server into an access point, allowing Smart Beholder clients to connect to the server via Wi-Fi (or other wireless networks). The DHCP server assigns IP addresses to connected mobile clients. Meanwhile, the video streamer: (i) captures videos using a camera, (ii) encodes videos using software/hardware codecs, and (iii) streams encoded videos via the RTSP (Real-Time Streaming Protocol) and RTP (Real-time Transport Protocol) servers.
The Smart Beholder client runs on mobile devices and consists of two components: UI (User Interface) and video streamer. The UI component is composed of the viewfinder and camera controller. The viewfinder renders the live videos received from the Smart Beholder server, and the camera controller sends camera control commands to the server. Possible camera control commands include taking photo, recording video, setting white balance, applying image effects, configuring exposure, and tuning sensitivity. The video streamer contains hardware/software decoders, controller client, and RTSP/RTP client.

4  Design Objectives

Smart Beholder aims to provide an open platform for researchers and developers to study and build real-time mobile photography applications. The design objectives of the proposed Smart Beholder platform include:
We emphasize that concurrently achieving all design goals is no easy task. For example, we have to optimize individual components in the video processing pipeline to minimize the system-wide latency. In addition, we need to consider multiple user-perceived quality metrics, such as graphics quality and interactivity, which further complicates the design of Smart Beholder. We present our approaches to achieve the design goals in the next few sections.

5  Hardware Platform

We present the options of main hardware components, and our design decisions.

5.1  Embedded System Boards

Table 1 summarizes the candidate boards. While Arduino is the least expensive board, it does not have enough resources (such as memory) to host a Linux OS, which significantly increases the implementation complexity. Moreover, Arduino is not equipped with a GPU, which real-time video encoding requires. Hence, we adopt Raspberry Pi, which has a GPU, supports Linux, and is just powerful enough for Smart Beholder.

5.2  Camera Modules

There are two ways to attach cameras to Raspberry Pi: USB and Camera Serial Interface (CSI). We have experimentally integrated cameras via both interfaces, as detailed below. We adopt the Video4Linux API to access USB cameras. The API supports UVC (USB Video Class) compatible cameras [25]. The slower read system call supports all UVC cameras, while the more efficient mmap system call only supports some UVC cameras. Modern webcams like Logitech C525 are supported by mmap, but many of them can only capture raw video frames in YUYV (YUV422) format. YUYV frames are not supported by some encoders, and have to be converted into YUV420 format. We find that with USB cameras Raspberry Pi only achieves 6 to 9 fps (frames per second) at 720p resolution. Hence, USB cameras are less suitable for live previews. CSI cameras, such as Omnivision OV5647, support Video4Linux and OpenMAX IL. Different from user-space Video4Linux, OpenMAX IL abstracts a set of multimedia hardware components for developers to use in an efficient way. Therefore, we employ OpenMAX IL to access CSI cameras. Doing so increases the 720p frame rate to 15 fps, which is still lower than acceptable. A closer look indicates that this inferior frame rate is partially due to expensive (and redundant) memory copies, which are further optimized in Section 6.2.

6  Software Design Decisions

We minimize the latency of several software components, to minimize the overall latency.

6.1  Hardware Encoder

To reduce the encoding latency, we leverage Raspberry Pi's H.264 hardware encoder via OpenMAX IL. This encoder supports various configuration options, and we exercise the following options: profile, bitrate, frame rate, GoP size, and B frames. One minor issue of Raspberry Pi's hardware encoder is the lack of a mechanism to retrieve the SPS (Sequence Parameter Set) and PPS (Picture Parameter Set) parameters associated with an encoder, which are mandatory for correctly setting up the RTSP/RTP server. To cope with this limitation, we first initialize the hardware encoder with a set of parameters P, use it to encode some dummy frames, and then retrieve the SPS and PPS parameters from the encoded video frames. We use the retrieved SPS and PPS parameters to set up the RTSP/RTP server. Next, we re-initialize the encoder with the same set of parameters P in order to ensure that the encoded video frames carry SPS and PPS parameters identical to those configured in the RTSP/RTP server. This allows us to use the hardware encoder for lower latency.
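The retrieval step above can be illustrated with a minimal sketch. It assumes, as our own simplification rather than something stated in the paper, that the dummy frames come out of the encoder as an H.264 Annex B byte stream (NAL units separated by 00 00 01 or 00 00 00 01 start codes). The function names are hypothetical; NAL unit types 7 (SPS) and 8 (PPS) come from the H.264 specification.

```python
def split_annexb(stream: bytes):
    """Yield NAL units (start codes removed) from an Annex B byte stream."""
    start_code = b"\x00\x00\x01"
    i = stream.find(start_code)
    while i != -1:
        begin = i + len(start_code)
        nxt = stream.find(start_code, begin)
        end = len(stream) if nxt == -1 else nxt
        # A NAL unit never ends in 0x00, so trailing zeros belong to the
        # next 4-byte start code (or to padding) and can be stripped.
        nal = stream[begin:end].rstrip(b"\x00")
        if nal:
            yield nal
        i = nxt


def extract_sps_pps(stream: bytes):
    """Return the first SPS and PPS NAL units found, or None for a missing one."""
    sps = pps = None
    for nal in split_annexb(stream):
        nal_type = nal[0] & 0x1F  # low 5 bits of the NAL header byte
        if nal_type == 7 and sps is None:
            sps = nal
        elif nal_type == 8 and pps is None:
            pps = nal
        if sps is not None and pps is not None:
            break
    return sps, pps
```

In such a design, the bytes returned by `extract_sps_pps` would be cached once after the dummy encode and handed to the RTSP/RTP server before the encoder is re-initialized with the same parameters P.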

6.2  Reduce the Number of Memory Copies

Figure 3(a) presents the software components that are not optimized for low latency. The camera capturer and hardware encoder both span over software and hardware, and thus several memory copies, such as arrows 1 and 3, incur unnecessary overhead. We propose an optimized design in Figure 3(b), which directly passes raw video frames from camera to hardware encoder. By doing so, we significantly increase the capture and encoding rates to 60 fps at 720p and 30 fps at 1080p. This leads to much smoother previews and shorter latency. In addition, the optimized server components are simpler and easier to implement.
Figure 3: Server software components: (a) unoptimized and (b) optimized.

6.3  Software Decoder

We have experimentally implemented both software and hardware decoders. Intuitively, hardware decoders run faster than software ones. However, in our experiments, using Android's MediaCodec framework to access hardware decoders on several Sony/HTC mobile devices incurs an additional delay between 80 and 100 ms, which is independent of frame resolution. Some preliminary tests indicate that the official Java-based hardware decoder APIs always buffer a couple of frames. The buffer size, however, is not configurable via the MediaCodec framework. Therefore, we adopt the ffmpeg software decoder, which achieves 24 fps at 720x480. This is sufficient for live preview on mobile clients. By adopting software decoders, we have full control over the decoding and buffering mechanisms. For minimum delay, we decode a frame whenever we see an end-of-frame mark; consequently, we achieve a buffering time of ≤ 10 ms unless the network condition is highly unstable. Currently we adopt the software decoder for shorter latency, but future versions of Smart Beholder may switch to hardware decoders if the extra buffering time can be controlled and eliminated.
(a) Good network condition.
(b) Bad wireless network condition.
(c) Recovered from bad wireless network condition.
Figure 4: An illustrative example of time-drifted video frames.

6.4  Time-Drifted Video Frames

Our early experiments indicate that the playout times of video frames may drift. Figure 4 presents an example of how time-drifted video frames arise. When the network condition is good (Figure 4(a)), the server sends video frames at a fixed rate, and the mobile client renders video frames at the same rate. When network transmission is stalled due to weak signals or wireless interference (Figure 4(b)), video frames are queued on the server (at the IP or MAC layer). Once the network condition recovers (Figure 4(c)), the queued video frames are sent in a burst. The decoder at the mobile client may fail to keep up with the bursty video frames, and renders some video frames too late. This results in time-drifted video frames.
We propose to drop late video frames at mobile clients to address this issue. For this purpose, we attach two timestamps to each video frame: (i) the receiving time at the client and (ii) the sending time at the server. For frame $i$, we denote the receiving timestamp as $t^r_i$ and the sending timestamp as $t^s_i$, and the time offsets as $\delta t^r_i = t^r_i - t^r_1$ and $\delta t^s_i = t^s_i - t^s_1$. The frame delay is $\Delta f_i = \delta t^r_i - \delta t^s_i$. We drop frame $i$ iff $\Delta f_i > L$, where $L$ is a user-configurable threshold. $L$ is typically on the order of tens of ms, and we use 50 ms unless otherwise specified.
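The dropping rule can be sketched in a few lines. This is an illustrative sketch only; the class and method names are hypothetical and not taken from the Smart Beholder source. Because both offsets are measured relative to the first frame, the server and client clocks do not need to be synchronized.

```python
class LateFrameDropper:
    """Decide whether a frame is too late to render, using relative timestamps."""

    def __init__(self, threshold_ms: float = 50.0):
        self.threshold_ms = threshold_ms  # L in the text
        self._first_send = None           # sending timestamp of frame 1
        self._first_recv = None           # receiving timestamp of frame 1

    def should_drop(self, send_ms: float, recv_ms: float) -> bool:
        if self._first_send is None:
            # The first frame defines the reference point for both clocks.
            self._first_send, self._first_recv = send_ms, recv_ms
        recv_offset = recv_ms - self._first_recv   # receiving-side offset
        send_offset = send_ms - self._first_send   # sending-side offset
        frame_delay = recv_offset - send_offset    # frame delay
        return frame_delay > self.threshold_ms
```

A client loop would call `should_drop` once per received frame and skip decoding or rendering when it returns `True`.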

7  Adaptive Live Preview

We develop empirical models and then propose an efficient adaptation algorithm to dynamically maximize the QoE of live previews.

7.1  Single-Hop Wi-Fi Network Model

Estimating the available bandwidth of an ongoing live preview session is extremely challenging, although several attempts have been made in wired [28,20] and wireless [24] networks. These approaches send extra probing packets, which incur additional overhead on the already tight network resources. In contrast, we develop a customized network model that leverages existing video packets to estimate the available bandwidth. Our core idea, inspired by WBest [24], is to keep track of the size and receiving timestamp of each packet $p$ as $s_p$ and $t^r_p$. We then compute the dispersion time of every pair of adjacent packets (belonging to the same video frame), and estimate the instantaneous capacity $c_p$ as $c_p = s_{p-1} / (t^r_p - t^r_{p-1})$. We can use video packets as probing packets, because: (i) the Smart Beholder server sends a video frame every 33 ms (assuming a 30-fps configuration), and thus the instantaneous sending bitrate is much higher than the coding bitrate; and (ii) each video frame is composed of several back-to-back packets due to the limited network MSS (Maximum Segment Size). Furthermore, the single-hop Wi-Fi network is dedicated to Smart Beholder, and thus the available bandwidth is the same as the estimated capacity. Our initial experiments indicate that $c_p$ fluctuates quite a bit. Therefore, we adopt a sliding window of $W+1$ packets for de-noising. This is similar to prior studies [24,28,20], which employ diverse aggregation approaches, such as mean, median, and maximum. To be more general and adaptive, we sort all instantaneous capacity values within the sliding window ($c_{p-W}, c_{p-W+1}, \ldots, c_p$) in increasing order. We let $\bar{c}^{\alpha}_p$ be the $\alpha$-percentile capacity, and use it to estimate the network capacity (available bandwidth).
We have instrumented Smart Beholder and conducted experiments to determine the $W$ and $\alpha$ parameters that best match the estimated capacity with the ground truth given by (intrusive) tools, like iperf. We place the server and client in a hallway, and vary the distance between them from 1 to 40 meters. We measure the network capacity using Smart Beholder and iperf at each distance for 1 minute. We then derive the best $\alpha$ parameter based on the ground truth. We first vary the sliding window size $W \in \{375, 750, 1500, 3000\}$ and repeat the experiments 5 times to check the consistency of the best $\alpha$ parameters. We compute the variance of $\alpha$ and find that the variance becomes negligible (at most $8 \times 10^{-4}$) with $W = 3000$. Hence, we set $W$ to 3000. In our experiments, we find that the $\alpha$ parameter depends on the Wi-Fi signal strength, denoted as $g$. Most operating systems, including Android, constantly report the Wi-Fi signal strength in dBm, which may be readily used by our adaptation algorithm. Therefore, we conduct additional experiments and log the $g$ values, in order to model $\alpha$ as a function of $g$. The empirical results reveal that $\alpha$ can be modeled as a piecewise linear function [23], as illustrated in Figure 5. Using adaptive $\alpha$ parameters allows us to better approximate the ground truth from iperf without the excessive network overhead. Last, we note that $\alpha$ values are currently derived offline, while online training of $\alpha$ is also possible.
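To make the estimator concrete, here is a minimal sketch of the packet-dispersion idea with a sliding window and a percentile pick. It is not the authors' implementation: the class name, the per-packet `frame_id` argument, the nearest-rank percentile rule, and treating α as a fraction in [0, 1] are all assumptions made for illustration.

```python
from collections import deque


class CapacityEstimator:
    """Estimate link capacity from the dispersion of back-to-back video packets."""

    def __init__(self, window: int = 3000, alpha: float = 0.5):
        # Keep the most recent W + 1 instantaneous capacity samples.
        self.samples = deque(maxlen=window + 1)
        self.alpha = alpha   # percentile as a fraction in [0, 1] (assumption)
        self._prev = None    # (frame_id, size_bytes, recv_time_s) of last packet

    def on_packet(self, frame_id: int, size_bytes: int, recv_time_s: float) -> None:
        prev, self._prev = self._prev, (frame_id, size_bytes, recv_time_s)
        # Only adjacent packets of the same video frame are sent back to back.
        if prev is None or prev[0] != frame_id:
            return
        dispersion = recv_time_s - prev[2]
        if dispersion <= 0:
            return  # identical or reordered timestamps carry no information
        # Size of the previous packet over the dispersion time, in bits/s.
        self.samples.append(prev[1] * 8 / dispersion)

    def estimate(self):
        """Return the alpha-percentile of the windowed samples, or None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        index = min(len(ordered) - 1, int(self.alpha * len(ordered)))
        return ordered[index]
```

The client would call `on_packet` for every received video packet and read `estimate()` once per adaptation period; the sort inside `estimate()` is the O(W log W) step.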
Figure 5: Piecewise linear model of α.

7.2  Quality of Experience Model

Figure 6: Mean MOS scores from a Smart Beholder testbed.
Smart Beholder supports dynamic adjustment of the bitrate $b$, frame rate $f$, and resolution $r$ of viewfinder previews on-the-fly. However, determining these encoding parameters for high QoE is challenging, and therefore we conduct a user study to derive the model for our adaptation algorithm as follows. We run the server on a Raspberry Pi and the client on an Xperia tablet in our lab. We vary the encoding configurations in terms of bitrate $b \in \{0.5, 1, 2\}$ Mbps, frame rate $f \in \{10, 20, 30\}$ fps, and resolution $r \in \{$160x120, 352x288, 544x288, 640x480, 864x480$\}$. We recruit subjects on campus and online for the user study. Each subject has at most 3 minutes to use Smart Beholder under each configuration. For each configuration, a subject gives three quality scores between 1 (worst) and 7 (best) on: (i) graphics quality, (ii) interactivity, and (iii) overall satisfaction. Subjects are free to terminate the 3-minute experiments earlier. We have 30 subjects (63% male) and perform 89 sessions (45 rounds per session) in total. Each session lasts for 31 minutes on average, and the total user study time is almost 46 hours.
We make two observations on the overall MOS scores given in Figure 6. First, when the bitrate is ≥ 1 Mbps, higher frame rates and resolutions lead to higher MOS scores. Second, when the bitrate is lower (0.5 Mbps), higher resolutions (such as 864x480) may result in lower MOS scores (than, e.g., 544x288). These two observations show the importance of the QoE model because higher bitrates, frame rates, and resolutions do not guarantee better QoE. Based on Figure 6, we let $Q_b$ be the overall MOS table at bitrate $b$, and $q_b(f,r)$ be the MOS score at $b$, $f$, and $r$. The table $Q_b$ is given in the figure if $b \in \{0.5, 1, 2\}$, and is interpolated/extrapolated otherwise. Then, to get $q_b(f,r)$, we look up table $Q_b$, again with interpolation/extrapolation where needed. The presented QoE model enables us to pick the encoding parameters for optimal user experience.
Figure 7: Photos of the considered smart lenses: Smart Beholder (left), Altek Cubic (middle), and Sony QX100 (right).

7.3  Preview Adaptation Algorithm

We develop an efficient algorithm to dynamically adjust the encoding parameters, in order to avoid QoE degradation due to network impairments such as insufficient bandwidth and high packet loss rate. The algorithm runs periodically, evaluating the available bandwidth $\bar{c}^{\alpha}_p$ (using the network model developed above) and the packet loss rate $\rho$ on the mobile client once every $T$ seconds. $T$ is a system parameter, which is set to 10 seconds by default. $\bar{c}^{\alpha}_p$ and $\rho$ are sent back to the server, which makes decisions on encoding parameters based on four other system parameters: minimum loss rate $\rho_l$, maximum loss rate $\rho_h$, bitrate increment step $\delta_i$, and bitrate restoration factor $\gamma_r$. Unless otherwise specified, we set $\rho_l = 5\%$, $\rho_h = 20\%$, $\delta_i = 0.2$, and $\gamma_r = 0.7$.
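Algorithm 1: Preview Adaptation Algorithm
 1: for every $T$ seconds do
 2:   Compute $\bar{c}^{\alpha}_p$ and $\rho$
 3:   if $b < \bar{c}^{\alpha}_p$ and $\rho < \rho_l$ then
 4:     $b = b + (\bar{c}^{\alpha}_p - b) \times \delta_i$
 5:   else if $b > \bar{c}^{\alpha}_p$ or $\rho > \rho_h$ then
 6:     $b = \gamma_r \times \bar{c}^{\alpha}_p$
 7:   end if
 8:   Given $b$, look up $q_b(f^*, r^*)$ for the highest MOS score
 9:   Reconfigure video encoder with $b$, $f^*$, $r^*$
10: end for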
Algorithm 1 gives the pseudocode of our algorithm. Line 3 checks whether our encoding bitrate is lower than the available bandwidth and the packet loss rate is low; if so, line 4 increases the encoding bitrate. Lines 5 and 6 are similar, but reduce the encoding bitrate by setting it to a certain ratio (i.e., $\gamma_r$) of the measured available bandwidth. At line 7, we know the target encoding bitrate $b$, and employ the QoE model to get the best $f^*$ and $r^*$. Line 9 reconfigures the video encoder. It is easy to see that our preview adaptation algorithm runs in constant time at the server. At the mobile client, sorting the instantaneous capacity values when deriving $\bar{c}^{\alpha}_p$ dominates the time complexity, which is $O(W \log W)$. Given that $W$ is at most a few thousand, the computational cost is negligible on modern smartphones.
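One iteration of the adaptation loop can be sketched as follows. This is a hedged illustration rather than the released code: the function names are hypothetical, bitrates are assumed to be in Mbps, loss rates are fractions, and the MOS table for the current bitrate is assumed to be supplied by the caller as a dictionary keyed by (frame rate, resolution), since the measured MOS values live in Figure 6 and are not reproduced here.

```python
def adapt_bitrate(b, capacity, loss,
                  rho_l=0.05, rho_h=0.20, delta_i=0.2, gamma_r=0.7):
    """Return the next encoding bitrate given the estimated capacity and loss rate."""
    if b < capacity and loss < rho_l:
        # Headroom and low loss: move a fraction delta_i toward the capacity.
        return b + (capacity - b) * delta_i
    if b > capacity or loss > rho_h:
        # Overshoot or heavy loss: fall back to a fraction gamma_r of the capacity.
        return gamma_r * capacity
    return b  # otherwise keep the current bitrate


def best_config(mos_table):
    """Pick the (frame_rate, resolution) key with the highest MOS score.

    mos_table is a caller-supplied {(fps, resolution): mos} dict for the
    current bitrate; any interpolation across bitrates is assumed to have
    been done before this call.
    """
    return max(mos_table, key=mos_table.get)
```

The server would call `adapt_bitrate` with the values reported by the client every T seconds, then pass the MOS table for the resulting bitrate to `best_config` and reconfigure the encoder with the chosen frame rate and resolution.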
Table 2: Average Accuracy of Capacity Measured by Beholder

| Distance | 1.25 m | 2.5 m | 5 m | 10 m | 20 m |
|---|---|---|---|---|---|
| Beholder | 44.89 Mbps | 39.33 Mbps | 33.76 Mbps | 31.29 Mbps | 10.02 Mbps |
| iperf | 39.47 Mbps | 36.22 Mbps | 31.77 Mbps | 29.38 Mbps | 10.06 Mbps |
| Deviation | 13.7 % | 8.5 % | 6.3 % | 6.5 % | 0.4 % |
Last, we evaluate the accuracy of the proposed adaptation algorithm by separating the server and client by 1.25, 2.5, 5, 10, and 20 meters. We use Smart Beholder and iperf to measure the network capacity for 1 minute at each distance, and repeat the experiments 5 times. Table 2 summarizes the estimated capacity from Smart Beholder and iperf. This table shows that Smart Beholder achieves a small deviation from the ground truth given by iperf. The deviation is higher at shorter distances, which however is not a big issue because the available bandwidth is sufficient (e.g., ∼40 Mbps at 1.25 m) for all practical frame rates and resolutions.
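As a worked example, the 1.25 m entry of Table 2 is reproduced if the deviation is taken as the absolute difference relative to the iperf measurement. This definition is our assumption; the text does not spell out the formula.

```latex
\text{Deviation}_{1.25\,\mathrm{m}}
  = \frac{\lvert 44.89 - 39.47 \rvert}{39.47}
  = \frac{5.42}{39.47}
  \approx 13.7\%
```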

8  Performance Evaluation

We conduct real experiments to compare Smart Beholder against commercial smart lenses.
Figure 8: Experiment setup for inferring the preview resolutions of commercial smart lenses.
Figure 9: Live preview screenshots of line intensity testing region on the resolution test chart.
Table 3: Considered Smart Lenses
Altek Cubic Sony QX100 Smart Beholder
Altek Cubic C01 Sony DSC-QX100 Raspberry Pi Model B
2 MP (1600x1200)
13 MP (4160x3120)
5 MP (2592x1944)
18 MP (4864x3648)
5 MP (2592x1944)
1920x1080 @ 30 fps 1920x1080 @ 30 fps 1920x1080 @ 30 fps

8.1  Preview Resolutions of Commercial Smart Lenses

We consider two commercial smart lens products: Altek Cubic and Sony QX100. They both offer in-house mobile apps on Android devices. Figure 7 shows the three smart lenses and Table 3 presents their specifications. Our Smart Beholder is fully configurable, and supports different preview resolutions. The two commercial products only support fixed preview resolutions, which are not disclosed to users.
Figure 10: Viewable line ratios of different smart lenses under diverse line densities.
We use a PIMA/ISO 12233 Resolution Test Chart [10] to infer the preview resolutions. The experiment setup is presented in Figure 8. We place the smart lens about 30 centimeters away from the test chart, so that the test chart on viewfinder approximately spans the viewable area, and stream preview videos to the tablet on the right. We use only the lower-middle part of the test chart, where 10 numbered blocks with vertical lines in different densities (and widths) are arranged into a row. The block numbers indicate how dense these vertical lines are: from 1 (fewest, thickest lines) to 10 (most, thinnest lines), and we refer to the block number as line density. To count the number of lines in blocks with higher line densities, video previews with higher resolutions are needed. We define viewable line ratio as the fraction of distinguishable vertical lines over all vertical lines in the test chart.
Figure 11: Testbed setup.
Figure 12: Experimental procedure.
We then execute the following experiments with Altek Cubic, Sony QX100, and Smart Beholder under 5 different resolutions. For each smart lens (and resolution), we take screenshots of the live preview of the resolution test chart and crop the line intensity testing region, as shown in Figure 9. We then convert the regional screenshots to binary (black and white) using a threshold of 128 (with gray levels ranging from 0 to 255) and programmatically count the number of vertical lines in individual blocks (from 1 to 10) in the video previews in order to calculate the viewable line ratios. We plot the results in Figure 10. This figure shows that Altek Cubic and Sony QX100 achieve viewable line ratios over different line densities that are very similar to those of Smart Beholder at 320x240 and 640x480 resolutions, respectively. Hence, we conclude that the preview resolutions of these two commercial products are approximately 320x240 and 640x480, respectively. Even though the quality of camera lenses may be very different, we believe that the (relatively low) resolutions of preview video would dominate how distinguishable the thin lines are.
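The counting step can be illustrated with a short sketch that works on a single horizontal scanline of gray levels taken across one block. It is a simplification under our own assumptions: pixels below the threshold are treated as black, one scanline per block is used, and the per-block ratio is capped at 1. The function names are hypothetical.

```python
def count_dark_lines(scanline, threshold=128):
    """Count runs of dark pixels along one horizontal scanline of gray levels."""
    count = 0
    inside_line = False
    for value in scanline:
        dark = value < threshold       # binarize: below threshold means black
        if dark and not inside_line:
            count += 1                 # a new dark run starts here
        inside_line = dark
    return count


def viewable_line_ratio(scanline, expected_lines, threshold=128):
    """Fraction of the expected vertical lines that remain distinguishable."""
    seen = min(count_dark_lines(scanline, threshold), expected_lines)
    return seen / expected_lines
```

When a low-resolution preview blurs two thin lines into one dark run, `count_dark_lines` returns a smaller count and the ratio for that block falls below 1.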

8.2  Setup

We create two video/image datasets for objective and subjective performance metrics. For objective metrics, we use a Canon EOS 600D camera to capture eight segments of 25-sec videos at 720p. Half of the segments are taken indoors and the other half outdoors; all segments are taken under typical smart lens usage scenarios. We concatenate the eight segments into a 216-sec video, in which we insert a 2-sec white screen between any two consecutive videos to reset the video codecs for minimum interference across videos. This dataset represents typical viewfinder previews, and is suitable for objective metrics. It is, however, less suitable for subjective metrics due to the relatively low resolution. For subjective metrics, we collect 9 high-resolution (1080p) popular Creative Commons (CC) photos from Flickr. We play each photo for 10 seconds, and record the viewfinder previews using different smart lenses.
Figure 13: Overall performance comparisons among smart lenses: (a) bitrate, (b) latency, (c) frame rate, and (d) preview quality.
Figure 11 shows the testbed used in our lab. We play the videos on the video source display on the right, put a Smart Beholder (or other smart lens) server in front of the video source display, and send the previews to the corresponding smart lens client running on a tablet (Sony Xperia). The server and the mobile client are placed 1 meter apart. The tablet is connected to an external monitor on the left. Last, we use a Canon EOS 600D camera to capture videos of the two side-by-side displays at 60 fps. The captured video is then used to derive the performance results. We also run tcpdump on the tablet to capture the traffic and calculate the transmitted bitrate. Figure 12 summarizes the measurement procedure.
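As an illustration of how a per-second bitrate can be derived from a packet capture, the sketch below bins packets by whole seconds. It assumes the capture has already been reduced to (timestamp in seconds, packet size in bytes) pairs; how those pairs are extracted from the tcpdump trace is not shown, and the function name is hypothetical.

```python
from collections import defaultdict


def per_second_bitrate_mbps(packets):
    """Return the bitrate (Mbps) for every whole second covered by the trace.

    packets: iterable of (timestamp_sec, size_bytes) pairs (assumed format).
    Seconds without any packet are reported as 0.0.
    """
    bytes_per_second = defaultdict(int)
    for timestamp, size in packets:
        bytes_per_second[int(timestamp)] += size
    if not bytes_per_second:
        return []
    first, last = min(bytes_per_second), max(bytes_per_second)
    return [bytes_per_second.get(second, 0) * 8 / 1e6
            for second in range(first, last + 1)]


# Hypothetical trace: two packets in second 0, none in second 1, one in second 2.
trace = [(0.1, 125000), (0.9, 125000), (2.5, 250000)]
rates = per_second_bitrate_mbps(trace)
```

Averaging the returned list gives the mean bitrate of a run, while the list itself corresponds to a bitrate-over-time curve.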
For a subjective evaluation of preview quality, we conduct a crowdsourcing-based user study over the Internet via a web interface. We present the original images (from Flickr) on the left half of the web page, and the degraded images (extracted from the viewfinder previews) on the right half. For each comparison, a subject gives a DMOS (Differential Mean Opinion Score) between 0 (un-degraded) and 6 (seriously degraded and unacceptable). We convert the DMOS score to an MOS score by MOS = 7−DMOS and use the resulting MOS score as the image quality metric. We recruit 52 subjects and perform 117 sessions with a total of 14,410 comparison rounds. The total study duration is 30 hours, and each session lasts 15 minutes on average.
We consider the following performance metrics:
The first four metrics are objective and the last one is subjective. We give mean results with 95% confidence intervals if applicable.
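The score conversion MOS = 7−DMOS and the reporting of means with 95% confidence intervals can be sketched as below. The interval here uses a normal approximation (1.96 standard errors), which is an assumption for illustration; the text does not say how the intervals were computed, and the function names are hypothetical.

```python
import math
import statistics


def dmos_to_mos(dmos):
    """Convert a DMOS rating in [0, 6] to an MOS score via MOS = 7 - DMOS."""
    if not 0 <= dmos <= 6:
        raise ValueError("DMOS must lie between 0 and 6")
    return 7 - dmos


def mean_with_ci95(values):
    """Return (mean, half-width) of a 95% interval, normal approximation."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical ratings from three comparison rounds.
mos_scores = [dmos_to_mos(d) for d in (1, 2, 3)]
mean, half_width = mean_with_ci95(mos_scores)
```

With this mapping, an un-degraded preview (DMOS 0) scores 7 and an unacceptable one (DMOS 6) scores 1.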

8.3  Results

We first present the results with static Smart Beholder configurations, followed by the results with the preview adaptation algorithm enabled.
Smart Beholder outperforms other smart lenses. We plot the overall performance in Figure 13. Figure 13(a) shows that Smart Beholder consumes as little as half of the bitrate of the commercial smart lenses. This conforms to our expectation, as we configure Smart Beholder to use an average bitrate of 3 Mbps. Figure 13(b) reveals that Smart Beholder achieves at least 50 ms shorter latency, which in turn leads to a more responsive user experience. Figure 13(c) shows that Smart Beholder achieves a comparable, in fact slightly higher, frame rate than the two commercial smart lenses. In summary, Smart Beholder outperforms the two commercial products in all considered objective performance metrics. We report the preview image quality in Figure 13(d), which shows that Smart Beholder achieves better MOS scores than the two commercial smart lenses. More importantly, this higher preview quality does not come at the cost of higher network or system load, as Figures 13(a) and 13(b) show.
Table 4: The System Parameters
Parameter     Values
Frame Rate    6 fps, 12 fps, 24 fps
Resolution    160x120, 320x240, 640x480, 1280x720, 1920x1080
Bitrate       1 Mbps, 2 Mbps, 3 Mbps
Figure 14: Beholder performance with different target bitrates: (a) achieved bitrate, (b) latency, and (c) frame rate.
Configurability of Smart Beholder. We vary the configurations of Smart Beholder following the parameter values in Table 4, with the default values highlighted in boldface. With each configuration, we measure the system performance using the 216-sec preview video. Two sets of sample results are given below. First, we adjust the target encoding bitrate and present the results in Figure 14. This figure shows that when the target bitrate is increased, the achieved bitrate (Figure 14(a)) and the latency (Figure 14(b)) increase, while the frame rate (Figure 14(c)) slightly decreases. We believe the slightly increased latency and slightly decreased frame rate are due to the higher complexity and workload of decoding the preview videos at the client.
Figure 15: Beholder performance under different target frame rates: (a) bitrate, (b) latency, and (c) frame rate.
Next, we adjust the target frame rate and give the results in Figure 15. Figure 15(c) shows that Smart Beholder always achieves the target frame rate, which reveals the efficiency of its implementation. Figures 15(a) and 15(b) show that higher target frame rates lead to higher bitrate and lower latency, which is consistent with our intuition, as a 6 fps live preview incurs at least 1000 ms÷6 ≈ 167 ms latency. In summary, Figures 14 and 15 demonstrate the configurability of our implementation.
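The lower bound mentioned above is simply the time between two consecutive frames. A short sketch of the arithmetic for the frame rates listed in Table 4 follows; the function name is hypothetical.

```python
def min_frame_interval_ms(frame_rate_fps):
    """Time between consecutive frames, a lower bound on preview latency."""
    if frame_rate_fps <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / frame_rate_fps


for fps in (6, 12, 24):
    print(f"{fps} fps -> at least {min_frame_interval_ms(fps):.1f} ms")
# 6 fps  -> at least 166.7 ms
# 12 fps -> at least 83.3 ms
# 24 fps -> at least 41.7 ms
```

Doubling the frame rate halves this bound, which matches the trend of lower latency at higher target frame rates.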
Figure 16: Needs of adaptation algorithm: (a) bitrate, (b) latency, and (c) frame rate over time.
Needs of adaptation algorithm. We zoom into a sample run of the 216-sec video, and report the per-second bitrate, latency, and frame rate in Figure 16. Figures 16(a) and 16(b) show that Smart Beholder consistently achieves lower bitrate and lower latency than the two commercial smart lenses. Figure 16(c) shows that Sony QX100 suffers from severe frame rate drops at the 150th, 175th, and 205th seconds, due to bad network conditions. Smart Beholder also suffers from a frame rate drop at the 110th second, which demonstrates the need for a preview adaptation algorithm.
Figure 17: A sample of 3-minute experiment: (a) distance affects signal strength and (b) reconfiguration (bitrate) decisions are driven by capacity estimation.
Effectiveness of preview adaptation algorithm. We conduct experiments to quantify the performance of our proposed preview adaptation algorithm. We fix the position of the server and move the client following a 3-minute movement pattern, with the distance ranging from 5 to 30 m, while configuring Smart Beholder to record the signal strength, capacity, and bitrate. We plot the results in Figure 17. Figure 17(a) shows that the signal strength is inversely proportional to the distance, which is in line with our intuition. At the beginning of the experiment, we set the configuration to 〈1 Mbps, 24 fps, 640x480〉, and trigger the adaptation algorithm once every 10 seconds. Figure 17(b) shows the estimated capacity and sample bitrate decisions of our adaptation algorithm throughout the experiment. The capacity decreases with the decreasing signal strength. At the 140th second, the signal strength increases a little because the client stays at 20 meters for a while, which makes the network condition more stable. The adaptation algorithm increases the encoding bitrate until it approaches the estimated capacity (at the 70th second). At that moment, our algorithm reconfigures f to 60 fps and r to 1280x720 for the highest MOS score based on the QoE model (Section 7.2). Several other reconfiguration samples are also annotated in Figure 17(b). Last, we note that the resulting bitrate may be lower than the configured bitrate, e.g., at f=60 fps and r=1280x720, the highest resulting bitrate is 9 Mbps in our experiments.
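To make the decision step more concrete, the sketch below shows one way a capacity-driven reconfiguration could be structured: pick the largest bitrate that fits under the estimated capacity, then pick the frame rate and resolution that a QoE model scores highest at that bitrate. This is not the algorithm of Section 7; the headroom factor, the function names, and in particular `toy_mos` (a made-up stand-in for the QoE model) are assumptions for illustration only.

```python
def choose_configuration(capacity_mbps, bitrates_mbps, frame_rates,
                         resolutions, predict_mos, headroom=0.9):
    """Return (bitrate, frame_rate, resolution) for the estimated capacity.

    predict_mos(bitrate, frame_rate, resolution) is supplied by the caller.
    headroom is a hypothetical safety margin below the estimated capacity.
    """
    budget = capacity_mbps * headroom
    feasible = [b for b in bitrates_mbps if b <= budget]
    # Fall back to the lowest bitrate when nothing fits under the budget.
    bitrate = max(feasible) if feasible else min(bitrates_mbps)
    frame_rate, resolution = max(
        ((f, r) for f in frame_rates for r in resolutions),
        key=lambda fr: predict_mos(bitrate, fr[0], fr[1]),
    )
    return bitrate, frame_rate, resolution


def toy_mos(bitrate_mbps, frame_rate, resolution):
    """Made-up QoE model: favors higher f and r as the bitrate grows."""
    width = resolution[0]
    return (-abs(frame_rate - 8 * bitrate_mbps)
            - abs(width - 400 * bitrate_mbps) / 100)


BITRATES = [1, 2, 3]
FRAME_RATES = [6, 12, 24]
RESOLUTIONS = [(320, 240), (640, 480), (1280, 720)]

good_link = choose_configuration(3.5, BITRATES, FRAME_RATES, RESOLUTIONS, toy_mos)
weak_link = choose_configuration(0.5, BITRATES, FRAME_RATES, RESOLUTIONS, toy_mos)
```

In a running system such a function would be invoked periodically (every 10 seconds in the experiment above) with a fresh capacity estimate.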
Figure 18: MOS scores of different configurations and preview adaptation algorithm.
We next conduct a user study for a subjective evaluation of the preview adaptation algorithm. We compare our algorithm against two configurations: static #1, which is 〈2 Mbps, 10 fps, 864x480〉; and static #2, which is 〈0.5 Mbps, 30 fps, 544x288〉. We recruit 12 subjects and carry out 60 sessions in total. In each session, we randomly select a configuration and ask the subject to use Smart Beholder for at most 3 minutes, and then score the preview quality. 8 subjects think static #1 outperforms static #2 in graphics quality, and 10 subjects feel static #2 outperforms static #1 in interactivity. Overall, the majority of subjects prefer our preview adaptation algorithm, as summarized in Figure 18. Figures 17 and 18 reveal the effectiveness of our preview adaptation algorithm in both objective and subjective metrics.
Energy efficiency. We encode the 216-sec preview video at different frame rates using software and hardware encoders on Raspberry Pi. We repeat each experiment 3 times and clear the cache each time for fair comparisons. The average results reported in Table 5 show that using the hardware encoder saves at least 86.5% in energy compared to using the software encoder. In addition, we use PowerTutor [27], a popular tool for measuring energy consumption on Android devices, to measure the per-application energy consumption of the Smart Beholder client running on an HTC One X. The Smart Beholder client consumes about 0.97 W on average, of which the LCD display alone is responsible for 0.89 W. This shows that the Smart Beholder client is energy efficient.
Table 5: The Energy Consumption of Beholder Server
Frame Rate    S/W Encoder    H/W Encoder    Saving of H/W Encoder
30 fps        883.44 J       104.98 J       88.1 %
20 fps        530.06 J       71.50 J        86.5 %
15 fps        421.63 J       42.55 J        89.9 %
10 fps        304.78 J       28.51 J        90.6 %
5 fps         176.04 J       18.36 J        89.6 %
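The saving column of Table 5 follows from the two energy columns as (1 − H/W ÷ S/W) × 100. The short check below recomputes it from the reported values; the variable and function names are chosen for this example.

```python
# Energy in joules per frame rate: (software encoder, hardware encoder),
# copied from Table 5.
ENERGY_J = {
    30: (883.44, 104.98),
    20: (530.06, 71.50),
    15: (421.63, 42.55),
    10: (304.78, 28.51),
    5: (176.04, 18.36),
}


def saving_percent(software_j, hardware_j):
    """Relative energy saved by the hardware encoder, in percent."""
    return (1 - hardware_j / software_j) * 100


savings = {fps: round(saving_percent(sw, hw), 1)
           for fps, (sw, hw) in ENERGY_J.items()}
smallest_saving = min(savings.values())
```

The smallest value over all frame rates is the 86.5% quoted in the text as the minimum saving.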

9  Conclusion

In this paper, we have proposed an open-source smart lens platform called Smart Beholder, which is designed with three objectives in mind: cost effectiveness, low latency, and high preview quality. We have designed, implemented, and evaluated Smart Beholder using off-the-shelf components. Several optimization techniques have been proposed and implemented in this paper. We have compared the performance of Smart Beholder against two commercial smart lens products, and we have found that Smart Beholder achieves lower bitrate, lower latency, slightly higher frame rate, and better preview quality. The measurement methodology presented in this paper will remain useful as more commercial smart lenses hit the market. More importantly, since Smart Beholder is an open-source project, it can be leveraged by researchers and developers in real experiments to quantify the performance resulting from different design alternatives. We believe that Smart Beholder will lead to optimized smart lenses and other real-time video streaming systems in the future.


This work was partially supported by the Ministry of Science and Technology of Taiwan under the grants 103-2221-E-019-033-MY2 and 102-2221-E-007-062-MY3.


[1] 5 areas where cameras still beat smartphones if you want great photo quality.
[2] Altek Cubic: Perfect to selfies.
[3] Banana pi, 2014.
[4] R. Baratto, L. Kim, and J. Nieh. Thinc: a virtual display architecture for thin-client computing. In Proc. of ACM Symposium on Operating Systems Principles, (SOSP '05), pages 277-290, Brighton, UK, Oct 2005.
[5] S. Chandra, J. Boreczky, and L. Rowe. High performance many-to-many Intranet screen sharing with DisplayCast. ACM Transactions on Multimedia Computing, Communications, and Applications, 10(2):19:1-19:22, Feb 2014.
[6] Y. Chang, P. Tseng, K. Chen, and C. Lei. Understanding The Performance of Thin-Client Gaming. In Proc. of IEEE International Conference on Communications Quality and Reliability Workshop (CQR'11), pages 1-6, Naples, FL, May 2011.
[7] K. Chen, Y. Chang, H. Hsu, D. Chen, C. Huang, and C. Hsu. On The Quality of Service of Cloud Gaming Systems. IEEE Transactions on Multimedia, 16(2):480-495, Feb 2014.
[8] M. Claypool, D. Finkel, A. Grant, and M. Solano. Thin to win? network performance analysis of the OnLive thin client game system. In Proc. of ACM Workshop on Network and Systems Support for Games (NetGames'12), pages 1-6, Venice, Italy, Nov 2012.
[9] D. Z. Cook. Advanced Dungeons & Dragons-Player's Handbook. TSR, 1989.
[10] Digital camera resolution test procedures.
[11] DSC-QX100 lens-style camera with 1.0-type sensor.
[12] M. Farooq and T. Kunz. Wireless multimedia sensor networks testbeds and state-of-the-art hardware: A survey. In Communication and Networking, volume 265 of Communications in Computer and Information Science, pages 1-14. Springer Berlin Heidelberg, 2012.
[13] Gaikai web page.
[14] Global digital camera market decline slowing down in 2014, predicts new report.
[15] M. Halpern and L. Humphreys. Iphoneography as an emergent art world. SAGE New Media and Society, 2014.
[16] Y. He, K. Fei, G. Fernandez, and E. Delp. Video quality assessment for Web content mirroring. In Proc. of Imaging and Multimedia Analytics in a Web and Mobile World, pages 90270C-1-90270C-8, San Francisco, CA, Mar 2014.
[17] H. Hong, C. Hsu, T. Tsai, C. Huang, K. Chen, and C. Hsu. Enabling Adaptive Cloud Gaming in an Open-Source Cloud Gaming Platform. IEEE Transactions on Circuits and Systems for Video Technology, Jun 2015. Accepted to appear.
[18] C. Hsu, T. Tsai, C. Huang, C. Hsu, and K. Chen. Screencast Dissected: Performance Measurements and Design Considerations. In Proc. of ACM Conference on Multimedia Systems (MMSys'15), pages 177-188, Portland, OR, Mar 2015.
[19] C. Huang, K. Chen, D. Chen, H. Hsu, and C. Hsu. GamingAnywhere: The First Open Source Cloud Gaming System. ACM Transactions on Multimedia Computing, Communications, and Applications, 10(1):36-47, Jan 2014.
[20] R. Kapoor, L. Chen, L. Lao, M. Gerla, and M. Y. Sanadidi. Capprobe: A simple and accurate capacity estimation technique. In Proc. of SIGCOMM'04, pages 67-78, Portland, OR, Aug 2004.
[21] H. Lagar-Cavilla, N. Tolia, E. de Lara, M. Satyanarayanan, and D. O'Hallaron. Interactive resource-intensive applications made easy. In Proc. of the ACM/IFIP/USENIX International Conference on Middleware (Middleware'07), pages 143-163, Newport Beach, CA, Nov 2007.
[22] A. Lai and J. Nieh. On the performance of wide-area thin-client computing. ACM Transactions on Computer Systems, 24:175-209, May 2006.
[23] D. M. Leenaerts and W. M. V. Bokhoven. Piecewise linear modeling and analysis. Kluwer Academic Publishers, 1998.
[24] M. Li, M. Claypool, and R. Kinicki. Wbest: a bandwidth estimation tool for ieee 802.11 wireless networks. In Proc. of IEEE Conference on Local Computer Networks (LCN'08), pages 374-381, Montreal, Canada, Oct 2008.
[25] Linux UVC driver and tools.
[26] OnLive web page, 2014.
[27] PowerTutor.
[28] V. Ribeiro, R. Riedi, R. Baraniuk, J. Navratil, and L. Cottrell. Pathchirp: Efficient available bandwidth estimation for network paths. In Proc. of Passive and Active Monitoring Workshop (PAM'03), volume 4, San Diego, CA, Apr 2003.
[29] M. Richardson and S. Wallace. Getting Started with Raspberry Pi. O'Reilly Media, Inc., 2012.
[30] Sl10 smart lens camera sl10.
[31] N. Tolia, D. Andersen, and M. Satyanarayanan. Quantifying interactive user experience on thin clients. Computer, 39(3):46-52, 2006.
[32] Ubitus web page.
[33] K. Wac. Smartphone as a personal, pervasive health informatics services platform: Literature review. IMIA Yearbook 2012: Personal Health Informatics, 7(1):83-93, 2012.
[34] Worldwide smartphone shipments top one billion units for the first time, according to IDC.
[35] S. Yang, J. Nieh, M. Selsky, and N. Tiwari. The performance of remote display mechanisms for thin-client computing. In Proc. of USENIX Annual Technical Conference (ATC'02), pages 131-146, Monterey, CA, Jun 2002.
[36] S. Yoon, H. Oh, D. Lee, and S. Oh. Virtual lock: A smartphone application for personal surveillance using camera sensor networks. In Proc. of IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'11), pages 77-82, Toyama, Japan, Aug 2011.


1. We note that the protocol overhead is considered in our implementation when comparing c̄α. p and ρ. We omit this technical detail in our descriptions for brevity.

Sheng-Wei Chen (also known as Kuan-Ta Chen) 
Last Update July 20, 2017