Toward an Understanding of the Processing Delay of Peer-to-Peer Relay Nodes

Kuan-Ta Chen and Jing-Kai Lou
Institute of Information Science, Academia Sinica,



Peer-to-peer relaying is commonly used in real-time applications to cope with NAT and firewall restrictions and provide better quality network paths. As relaying is not natively supported by the Internet, it is usually implemented at the application layer. Also, in a modern operating system, the processor is shared, so the receive-process-forward process for each relay packet may take a considerable amount of time if the host is busy handling some other tasks. Thus, if we happen to select a loaded relay node, the relaying may introduce significant delays to the packet transmission time and even degrade the application performance.
In this work, based on an extensive set of Internet traces, we pursue an understanding of the processing delays incurred at relay nodes and their impact on the application performance. Our contribution is three-fold: 1) we propose a methodology for measuring the processing delays at any relay node on the Internet; 2) we characterize the workload patterns of a variety of Internet relay nodes; and 3) we show that, serious VoIP quality degradation may occur due to relay processing, thus we have to monitor the processing delays of a relay node continuously to prevent the application performance from being degraded.
E-Model, Internet Measurement, Peer-to-Peer Systems, QoS, VoIP

1  Introduction

Voice communication over IP is becoming one of the most profitable Internet businesses. It has been shown that VoIP users are willing to pay for value-added services, such as intercommunication with a PSTN phone (dialing to a PSTN phone and vice versa), voice mails, and call forwarding. Although one of the major players, Skype [27], was not the first company to provide VoIP service, it did pioneer the delivery of such services to an unprecedentedly wide range of end-users. From a technical point of view, we believe that three factors are responsible for Skype's popularity: the user-friendly interface, the high quality audio codecs, and the sophisticated peer-to-peer network infrastructure.
Skype is well-known, or perhaps notorious, for its capability to "steal" computation and communication resources from a computer with a Skype instance running. This is because Skype employs a technique called peer-to-peer relaying, where the network communications between two call parties can be made through a third intermediate node, commonly called a relay node. Peer-to-peer relaying brings the following advantages to VoIP applications: 1) the voice quality can often be improved by detouring traffic via a better quality network path, which can be achieved by using a relay node [23]; 2) a relay node can help establish connections if both call parties are behind NATs or firewalls [1,6,24]; and 3) relaying enables data aggregation, which reduces network bandwidth usage when more than two parties are involved in a conference call. For these reasons, peer-to-peer relaying is widely employed by VoIP services, such as Skype and Google Talk [7], as well as by video streaming services, such as AnySee [13] and PPLive [10].
Even though peer-to-peer relaying is beneficial to real-time applications in many aspects, we argue that its dark side might be easily overlooked. As relaying is not natively supported by the Internet, it is currently implemented by deploying an overlay network at the application layer. In this way, a packet "forwarded" by a relay node is actually a brand new IP packet that the node "clones" from the packet to be relayed. Also, since relay applications are usually run at a normal priority, their execution could be deferred due to high-priority tasks, such as system threads, device I/O request handlers, or foreground jobs. Thus, the time needed for a relay node to receive, process, and regenerate a relay packet could be unpredictably long because it depends on the workload of the node. For these reasons, the extra delays incurred at a relay node could be considerable and even detrimental to the application's performance.
In this paper, based on an extensive Internet measurement, we consider whether peer-to-peer relaying leads to substantial or even detrimental extra delays. Our analysis is divided into three parts. First, we describe how we collect the processing delays of relayed packets from a large set of relay nodes on the Internet. Second, we analyze and characterize the processing delays of the relay nodes, and show that the delays are closely related to the host's activity. Third, we investigate whether the relay process degrades the quality of VoIP calls and whether such degradation is a general or occasional phenomenon. To the best of our knowledge, this paper is the first large-scale study to measure the processing delays introduced by peer-to-peer relaying and their impact on an application's performance. Our contribution is three-fold:
  1. We propose a methodology that can measure the processing delays at any relay node on the Internet, without modifying the existing network and system infrastructure.
  2. We collect the relay processing delays from a large set of Internet nodes distributed all over the world. In addition, we analyze the sampled processing delays to gain an understanding of the workload pattern of Internet nodes, the effect of packet size on relay processing, and the instability of processing delays.
  3. We show that the processing delays incurred at relay nodes may significantly degrade VoIP quality based on ITU-T E-model [12] and the delays at relay nodes are generally unstable.
We consider this paper as a first step in focusing on the extra processing delays introduced by peer-to-peer relaying. From our results, we suggest that all real-time peer-to-peer systems keep track of the workload on relay nodes to prevent unexpected performance degradation.
The remainder of this paper is organized as follows. Section II describes related works. We then propose our processing delay inference methodology in Section III. In Section IV, we describe the trace collection methodology and discuss the quality of our collected traces. In Section V, we analyze a number of characteristics of the collected relay processing delays, such as typical workload patterns and stability. In Section VI, we evaluate the impact of the processing delays of relay nodes on VoIP quality. Finally, in Section VII, we present our conclusions.

2  Related Work

To the best of our knowledge, this paper is the first to study the processing delays of peer-to-peer relay nodes and their impact on the application performance. Thus, there are no earlier studies directly related to our work. In [14], Liu and Zimmermann mentioned that the average processing delay at each overlay node of AudioPeer [31], a commercial voice chat system, was approximately 30 ms. However, they did not report how the measurement was conducted, or how large and representative the data set was. The most closely related line of research is probably that on peer-to-peer relay node selection. A number of studies [23,5,4,11,3,29,15] have been devoted to selecting relay nodes from a set of candidates to obtain a good quality network path. However, the selection criteria are mainly based on the network latency and loss rate, and do not consider the processing delays introduced by relay nodes. Our work complements these network-quality-based relay node selection studies by emphasizing that relay processing delays should also be considered when selecting the best relay node.

3  Processing Delay Inference

In this section, we propose a methodology for measuring the processing delays induced by relaying packets at an intermediate node. The processing delay is defined as the time an intermediate node takes to send a data packet to its destination on behalf of a source host. Delays can be due to a variety of operations, such as receiving a packet from the source host, passing it to the relay application (mostly in user mode) for processing, and forwarding the packet to the target host.
The processing delay at a relay node is generally unmeasurable unless we place a sniffer to monitor the incoming and outgoing traffic of the node. Note that even the relay application cannot measure the processing delays of relayed packets directly because it has no information about when a packet arrives at or departs from the system. To measure processing delays exactly, an application must exploit a kernel-mode packet filter, such as BPF [17], or raise its thread priority to a high value so that it will always be served immediately. However, either approach is inadequate for a large-scale deployment, as they incur additional resource overhead and may disturb the execution of the user's own tasks.
Our methodology is designed to measure the processing delays experienced by relayed packets at a relay node without any modification to the existing network infrastructure and peer-to-peer software. In the following, we first define the terms used throughout this paper. We then explain the basic rationale behind our inference methodology, and elucidate the IPID filtering scheme for improving the estimation accuracy. Finally we evaluate both the basic and improved methods through an experiment.

3.1  Term Definition

In this paper, we assume a two-hop relaying scenario in which every packet from a source host transits an intermediate node before reaching the destination host.

3.2  The Basic Method

Our method is based on the following assumptions, which are generally observable across current peer-to-peer implementations:
  1. The relay node forwards a relay packet to the destination as soon as it receives a source packet, i.e., no intentional delays are introduced in relay packet forwarding.
  2. The source transmits packets to the relay node by TCP; therefore, the relay node will respond with TCP ack packets upon the receipt of every packet (or every two packets if the delayed ack option is enabled [18]).
  3. The TCP implementation (which generates ack packets) is running at a high priority, while the relay application is running at a relatively low priority. We require that ack packets are elicited with an approximately constant delay so that they can serve as a reference for "relaying without processing delay."
Our basic concept is that, on the arrival of a TCP source packet, the relay node will respond with two packets: 1) a TCP ack packet sent back to the source, and 2) a relay packet forwarded to the destination. These packets are generated by different parts of the relay node system. An ack packet is generated by the TCP implementation, which is part of modern operating systems and normally runs at a high scheduling priority. On the other hand, a relay packet is generated by the relay application, such as Skype or PPLive, which is developed by a third-party vendor and runs at a normal scheduling priority. As a result, an ack packet can always be elicited promptly, while it usually takes some time to generate a relay packet because the relay application only gets its quantum after high-priority threads have completed their tasks. In this way, a relay packet's additional processing delay can be computed as the difference between the instant the relay packet left the relay node and the instant its corresponding ack packet did.
However, this technique requires us to monitor the incoming and outgoing traffic of the relay node, which is not practical for large-scale measurements. To overcome this restriction, we tactically place the source and the destination hosts at the same location, which ensures that the ack packet and the relay packet traverse the same network path to the target. By so doing, we can estimate the processing delay as the difference between the instant the relay packet reaches the destination host and the instant the corresponding ack packet arrives back at the source. This strategy renders large-scale measurements feasible because we only need to monitor the traffic of the source and destination hosts.
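With the source and destination hosts colocated, the per-packet estimate reduces to subtracting two timestamps captured at the monitoring point, since the shared network path largely cancels out. A minimal Python sketch (the data layout is an illustrative assumption, not part of our toolchain):

```python
# Estimate per-packet processing delays when the source and destination
# hosts are colocated, so the ack and relay packets traverse the same
# network path and the network delay (mostly) cancels out.
# PD_i ~= t_relay_i - t_ack_i, both timestamps captured locally.

def estimate_processing_delays(samples):
    """samples: list of (t_ack, t_relay) pairs per source packet,
    both in seconds; returns per-packet estimates in milliseconds."""
    return [(t_relay - t_ack) * 1000.0 for t_ack, t_relay in samples]

# Example: two source packets; the second one hit a busy relay node.
delays = estimate_processing_delays([(0.100, 0.105), (0.133, 0.180)])
```

Note that negative values can appear when the two packets experience different network delays; removing such samples is exactly the job of the IPID filtering described later.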
Figure 1: Experiment setup for measuring ack packet generation delays and evaluating the accuracy of processing delay estimation.

3.2.1  Constancy of ACK Generation Delay

Our method only works if the TCP implementation generates ack packets with a constant delay. To verify this assumption, we conduct experiments on the network topology depicted in Fig. 1. In the experiment, all the computers are commodity PCs equipped with a Pentium III CPU, 1 GB of RAM, and Windows XP, except the traffic monitor, which is a SuSE Linux machine running tcpdump. During the experiments, the source keeps sending 200-byte data packets at 30 pkt/sec to the relay node, which then forwards the packets to their destination. To emulate a heavy workload on the relay node, 10 different movie clips are played simultaneously, which generates a considerable CPU workload (for video decoding and playout) and I/O workload (for reading audio/video data from the disk).
Because the source/destination nodes and the relay node are on the same Ethernet LAN, the network delay between them is negligible. Fig. 2 plots the relationship between the data delivery time (DDT) and the ack response time (ART) of each source packet (note that the x-axis is log-scaled). The leftmost dense area indicates that a linear relationship between DDT and ART exists when no other threads are competing with the relay application for computation cycles. However, when the relay node is busy handling other tasks, the DDT increases by orders of magnitude (spread between 0 and 500 ms), while the ART is always less than 0.3 ms. The reason for such a small ART is that the TCP/IP stack gives a high priority to serving incoming TCP packets and generates ack packets at the first opportunity [28]. In contrast, the execution of a relay application can be deferred for a long time because the processor first serves high-priority jobs, such as kernel-related threads, device I/O requests, and foreground tasks.
Figure 2: The scatter plot of data delivery time and ack response time.

3.3  Sample Filtering based on IPID

Although we can calculate the processing delay of a relayed packet by simply subtracting the ART from the DDT in a LAN scenario like that in Fig. 1, the result will be less accurate if the relay node is on the Internet because of network delays. The problem is that, in the Internet scenario, the DDT and ART are subject to network dynamics and network delays independently. Thus, the result of subtracting the ART from the DDT will be affected by the network delays of the relay packet and the ack packet, and deviate from the true processing delay at the relay node.
We cope with network variability by filtering out packets that lead to inaccurate processing delay estimates based on the IPID field. We call this the IPID filtering method. Since the IPID field is strictly increasing for consecutive IP packets generated by a host, we can use this information to determine the release order of packets from the relay node. The rationale of our IPID filtering method is as follows: "If a set of packets sent by a node are reordered in the network, then at least one of them must have experienced unusual network delays and should be filtered out."
As we do not have direct access to every relay node on the Internet, the following rules are employed to determine the packet ordering at the node. 1) For packets from the source host, we detect whether they arrived at the relay node sequentially by analyzing the IPIDs of their corresponding ack packets: a smaller IPID in an ack packet indicates that its corresponding source packet arrived at the relay node earlier, and vice versa. 2) For packets from the relay node, we determine the sequence in which they departed the relay node from their IPIDs.
Fig. 3 illustrates three possible packet-reordering scenarios. Note that this is not an exhaustive listing of possible reordering cases. For example, a relay packet for a source packet x may exchange its order with an ack packet for a source packet y, which forms another category of reordering. Any occurrence of packet sequence rearrangement indicates that at least one packet is experiencing unusual network delays. Suppose a source packet i that departed from the source host at time ts,i elicits an ack packet with IPID idack,i and a relay packet with IPID idr,i, which arrive at their destinations at time tack,i and tr,i respectively. We detect packets with unusual network delays as follows:
  1. For all source packets, we form the sequence of pairs (ts,i, idack,i) sorted by the departure time ts,i. We then apply the longest increasing subsequence (LIS) algorithm [26] to the resulting IPID sequence IDack and obtain its longest increasing subsequence L. L contains the IPIDs of the packets that kept their order in the network; thus, we remove the packets whose IPIDs are in IDack − L, because they are the packets with unusual delays that lead to inaccurate processing delay estimates.
  2. We merge the pairs (tack,i, idack,i) and (tr,i, idr,i) into a single sequence sorted by arrival time, and apply the LIS algorithm to the resulting IPID sequence IDack,r to obtain its longest increasing subsequence L. Once again, the packets whose IPIDs are in IDack,r − L are those that experienced unusual network delays, and are removed.
After removing packets with inconsistent network delays (i.e., inconsistent relative to the remaining packets), the estimated processing delays will all be greater than zero, as the DDT must be longer than the ART for a source packet. Also, the error of a processing delay estimate is bounded by the interarrival times of the packets surrounding each ack/relay packet pair.
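Rule 1 above can be sketched with a standard O(n log n) longest-increasing-subsequence computation; the Python sketch below (with a data layout of our own choosing) drops every packet whose ack IPID falls outside the LIS:

```python
from bisect import bisect_left

def lis_indices(seq):
    """Indices of one longest strictly increasing subsequence of seq."""
    tail_vals, tail_idx = [], []   # smallest tail value / its index, per LIS length
    prev = [-1] * len(seq)         # predecessor links for reconstruction
    for i, x in enumerate(seq):
        pos = bisect_left(tail_vals, x)
        if pos > 0:
            prev[i] = tail_idx[pos - 1]
        if pos == len(tail_vals):
            tail_vals.append(x); tail_idx.append(i)
        else:
            tail_vals[pos] = x; tail_idx[pos] = i
    out, i = [], (tail_idx[-1] if tail_idx else -1)
    while i != -1:
        out.append(i); i = prev[i]
    return out[::-1]

def ipid_filter(packets):
    """packets: (send_time, ipid) pairs already sorted by send_time.
    Keeps only the packets whose IPIDs preserved their order in the network."""
    keep = set(lis_indices([ipid for _, ipid in packets]))
    return [p for i, p in enumerate(packets) if i in keep]
```

Rule 2 is the same computation applied to the merged ack/relay sequence sorted by arrival time. (This sketch ignores 16-bit IPID wrap-around, which a production implementation would have to unwrap first.)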
Figure 3: Some common network reordering scenarios

3.4  Accuracy Evaluation

We verify the effectiveness of the proposed processing delay inference methods with a series of experiments. The network setup of the experiments is identical to that described in Section III-B1 and shown in Fig. 1. However, the experiment described here is much more realistic because it takes account of network delay variability and packet reordering. We use a trace-driven approach to emulate the network dynamics: the network delay of each packet is sequentially assigned a measured ART from the set of Internet traces described in Section IV. This introduces a great deal of randomness into the network delay of each packet, so simply subtracting the ART from the DDT might even yield a negative processing delay estimate.
We run the experiment for 500 flows, each of which lasts for 10 minutes. We first evaluate the performance of both methods by the average and maximum of the absolute estimation errors, computed as |Real PD − Estimated PD|. Fig. 4 shows the absolute errors of the estimated processing delays versus the real delays. We observe that the average absolute errors are only 4 ms and 3 ms for the basic method and the IPID filtering method respectively, so the difference between the two methods is not significant in this respect. However, the benefit of IPID filtering is significant in terms of the maximum absolute errors; e.g., the maximum error is reduced by 20 ms with IPID filtering when the real processing delays are around 200 ms.
Figure 4: The average and maximum absolute errors of processing delay estimates vs. the real processing delays.
Another way to assess the estimation accuracy is by the number of inaccurate samples, where the accuracy of a sample is defined by the magnitude of the absolute error of the estimate compared to the real value. Fig. 5 plots the percentage of inaccurate samples with respect to a range of error thresholds. We observe that the proportion of inaccurate samples declines rapidly as the error threshold increases. While the basic method already yields a low error rate (< 10%) for thresholds higher than 10 ms, IPID filtering always further reduces the number of inaccurate samples by a significant amount.
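Curves like those in Fig. 5 can be computed directly from paired real and estimated delays; a small sketch (names and data are illustrative):

```python
def inaccurate_ratio(real_pds, est_pds, thresholds_ms):
    """Fraction of samples whose absolute estimation error
    |real - estimated| exceeds each threshold (all values in ms)."""
    errors = [abs(r - e) for r, e in zip(real_pds, est_pds)]
    n = len(errors)
    return {th: sum(err > th for err in errors) / n for th in thresholds_ms}

# Example: three samples, one badly estimated.
ratios = inaccurate_ratio([10, 20, 30], [12, 20, 50], [1, 5, 10])
```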
Figure 5: The ratio of inaccurate processing delay estimates using a variety of accuracy thresholds.

4  Large-Scale Measurement

In this section, we describe how we measure the processing delays of relayed packets for a large set of Internet relay hosts. We begin with a description of our trace collection procedures and a summary of the traces. We then discuss the anomalous behavior patterns identified and their causes. The section concludes with an accuracy assessment of the estimated processing delays derived from the Internet measurements.

4.1  Trace Collection Methodology

Our trace collection platform is based on Skype for several reasons.
The trace collection mechanism comprises three commodity PCs deployed as shown in Fig. 6. One serves as the VoIP caller, one serves as the callee, and the third is a monitor that collects information about the traffic of both call parties. The collection procedures are as follows:
  1. When the measurement program is initialized, we block the caller from directly reaching the callee with a firewall program ipfw.
  2. The caller initiates a VoIP call to the callee (via the callee's Skype name). Because of the firewall setting, the caller will be connected to the callee host via one of its super nodes.
  3. If a VoIP call is established, we know that Skype has found a super node to relay voice packets between the caller and the callee.
    Occasionally, after a few retries, a VoIP call cannot be established because the caller fails to find a usable relay node from the candidate list. In this case, our daemon restarts the Skype program and re-attempts to establish a VoIP call. The Skype program retrieves a new list of super nodes from the central server at startup, so further VoIP calls should be successful.
    Also, we sometimes find that voice packets from the caller to callee and vice versa are relayed through different super nodes. In this case, we simply drop the call, block both relay nodes, and re-dial.
  4. To simulate a conversation, a WAV file comprised of all the English recordings in the Open Speech Repository is played continuously for both parties during a call.
  5. After a call has lasted 10 minutes, we block the current relay node at the caller by ipfw and terminate the call. We then wait for 30 seconds before re-iterating the loop from Step 2.
Figure 6: Network setup for measuring processing delays at any relay node on the Internet.
We ran the trace collection procedures between April 9th and April 20th, 2007, and collected 1,115 calls, which are summarized in Table I. The observed relay nodes are spread across the world, as shown in Fig. 7. On average, each call lasted 10 minutes and had an average packet rate of 32 pkt/sec. The overall average ack response time was 203 ms, while the average data delivery time was 212 ms. The difference between the ART and the DDT shows that, on average, forwarding a relay packet takes 9 ms longer than generating a TCP ack packet at a relay node. This observation strongly supports our conjecture that the processing delays at relay nodes are not negligible and are therefore worthy of our attention.
Figure 7: Geographical locations of observed relay nodes in our trace.
Table 1: Summary of collected traces
# Calls   Time     # Pkts/call    ART      DDT
1,115     10 min   19,210 pkts    203 ms   212 ms

4.2  Anomalous Behavior

Figure 8: Scatter plots of DDT and ART for selected calls in which anomalous behavior was rarely observed.
Since we have collected all the packets sent to and received from the relay node, the processing delays at relay nodes can be inferred using the method described in Section III. To demonstrate the observed anomalous behavior, we plot the scatter plots of the DDT and ART for four calls in Fig. 8. On each graph, we plot a 45° diagonal line through the origin to represent the boundary case where the DDT is equal to the ART. Thus, a point to the right of the diagonal line indicates a positive PD sample, and a point to the left stands for a negative PD sample. While it may not be clear from the figure, more than 95% of the PD estimates in each call are positive. Obviously, a negative PD is impossible and must be due to some anomalous behavior. We classify the anomalous behavior into two categories: host reordering and network reordering.

4.2.1  Host Reordering

We find that certain hosts occasionally send out a relay packet earlier than its corresponding ack packet. We call this host reordering. Of the 1,115 traces, only 2% have more than 10% of the samples host-reordered, and only 1% have more than 50% of the samples host-reordered. This behavior is uncommon because Skype's thread runs at a normal priority (15-17), while the TCP/IP stack runs at a high priority (in the context of svchost.exe). When a data packet arrives at a relay host, the TCP/IP stack normally gets a very early opportunity to process the packet and generate an ack packet (if appropriate). Normal-priority threads like those of Skype only get a chance to execute when higher-priority threads have completed their tasks.
There are two reasons why a relay packet is sent earlier than its corresponding ack packet:
  1. Some traffic shaping software, e.g., NetLimiter and cFosSpeed, is installed on the relay node. Such tools rearrange the incoming and/or outgoing packet sequence so as to provide better QoS. It is plausible that Skype voice packets are assigned a higher priority than TCP ack packets, so that the relay packets are sent earlier than the corresponding ack packets. We believe that such packet prioritization tools are responsible for the relay nodes where host reordering consistently occurs.
  2. On a heavily loaded computer, a thread may not get a chance to execute for a long period. To prevent starvation, the operating system normally boosts the priority of all ready-to-run threads at every time tick, so even the lowest-priority threads can preempt a high-priority thread after a long wait. In this way, if a source packet arrives at the relay node when a Skype thread finally gets a chance to preempt the system threads after waiting some time, the packet will be served immediately; consequently, the relay packet will be generated prior to its ack packet. We believe that this behavior explains why packets are rarely reordered at some relay nodes, say less than 5%.

4.2.2  Network Reordering

In contrast to host reordering, network reordering, which is caused by variable queueing delays, is much more frequent. Network reordering occurrences can be classified into two categories: 1) the relay packet is sent earlier than the ack packet (due to host reordering), but arrives at the destination later; and 2) the ack packet is sent earlier than the relay packet, but arrives later. We call these two categories relay packet reordering and ack packet reordering respectively. Histograms of the network reordering rates for both categories are shown in Fig. 9. While the density of calls decreases for higher relay packet reordering rates, the distribution of ack packet reordering rates looks irregular: several hundred calls have a reordering rate of approximately 50%. We believe this behavior is due to network appliances with a traffic shaping capability, such as Packeteer and NetEqualizer, which prioritize the packets in the output queue according to the traffic type. The 50% ack packet reordering rate is very likely due to a traffic shaper configured to treat VoIP packets and TCP ack packets fairly.
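Per sample, these categories can be told apart by comparing the send order at the relay node (recovered from IPIDs) with the arrival order observed at our hosts. A hedged Python sketch (the field names are our own):

```python
def classify_reordering(s):
    """s: dict with the IPIDs (send order at the relay node) and arrival
    times of one source packet's ack and relay packets. Returns which
    reordering category, if any, the sample falls into."""
    relay_sent_first = s['ipid_relay'] < s['ipid_ack']
    relay_arrived_first = s['t_relay'] < s['t_ack']
    if relay_sent_first and not relay_arrived_first:
        return 'relay packet reordering'   # sent first, arrived later
    if not relay_sent_first and relay_arrived_first:
        return 'ack packet reordering'     # ack sent first, arrived later
    if relay_sent_first:
        return 'host reordering'           # reordered by host, preserved in transit
    return 'none'
```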
Figure 9: Histograms of the host and network packet reordering rates observed in the collected calls
In Fig. 10, we depict the scatter plots of DDT and ART for selected calls with high packet reordering rates. The packet reordering phenomenon is prominent because a significant proportion of samples are in the top-left area. We observe that the host-reordered samples (red crosses) tend to have high ack response times, while the network-reordered samples (blue circles) tend to be close to the diagonal line. According to the graphs, a significant number of positive PD samples are detected as network-reordered, which are then removed to ensure the accuracy of estimating processing delays. This shows that the IPID filtering is not equivalent to simply removing negative PD estimates.
Figure 10: Scatter plots of DDT and ART for selected calls with significant reordering (red crosses: host reordered packets; blue circles: network reordered packets).

4.3  PD Estimation Results

Table 2: Summary of Estimated Processing Delays
# Samp.           Samp. dens.      PD (Avg / Max / SD)
15,739 per call   26 samp. / sec   5 ms / 239 ms / 7 ms
We apply the IPID filtering method to the collected traces and compute the processing delays at each relay node. A summary of the processing delay estimates is given in Table II. On average, each call contains 15,739 processing delay estimates, which corresponds to 26 samples per second. Overall, the average processing delay is 5 ms and the maximum processing delay is 239 ms, which yields a large maximum-to-mean ratio of 48.

4.4  PD Estimation Accuracy Assessment

Even though we evaluated the accuracy of our processing delay inference scheme in Section III-D, we now provide further accuracy checks of the estimated processing delays of Internet relay nodes. We focus on large processing delay estimates, as they have more impact on the application performance.
We use the above tests to confirm the estimation accuracy of the processing delays in our large-scale measurement. Having verified the validity of the estimated processing delays, we further analyze the characteristics of processing delays at relay nodes and their impact on VoIP quality in subsequent sections.

5  Processing Delay Characterization

In this section, we analyze the relay node processing delays obtained from the Internet measurement (Section IV). We begin with a classification of the observed processing delays, and the possible causes of each type of delay, and then analyze the effect of packet size on processing delays. Finally, we characterize the stability of the processing delays and investigate the relationship between delay stability and host activity.

5.1  Classification of Processing Delays

Since processing delays at a relay node are closely related to the workload on the node, we can consider a pattern of processing delays as an indication of the workload structure. We find that the processing delay patterns can be classified into five categories, as shown in Fig. 20. The categories and the possible causes of each type of delay are as follows:
Fig. 11(a) and Fig. 11(b) plot the distributions of the average and maximum PDs of each call. It can be seen that the average PDs are generally shorter than 20 ms. The maximum PDs, however, have a much broader range, spreading over 0 to 500 ms with a mean of around 100 ms. We use the ratio between the maximum and average PDs to quantify the degree of maximum PD variation in a call, and plot its distribution in Fig. 11(c). The median of the ratios is 30, while the 90th percentile is around 200. The high variability of the ratios indicates that the variation of PDs can be extremely large, even within a 10-minute period.
Figure 11: The distributions of (a) average processing delays (b) maximum processing delays (c) the ratio of the maximum processing delay versus the average processing delay in a call.

5.2  Effect of Packet Size

It is well known that the packet size impacts the processing time of network routing devices [21]. Here, we investigate how the packet size affects the processing delays at a relay node. Intuitively, it takes longer to process a larger packet and transfer it between the buffers of the I/O device, the TCP/IP stack, and the relay application. To verify this conjecture, we depict the scatter plots of packet size and processing delays for selected traces in Fig. 12. In each figure, a point stands for a processing delay sample, and the squares mark the windowed averages of processing delays with a window size of 10 bytes.
Figure 12: The scatter plots of packet size and processing delays for selected calls.
We find that, in most calls, the average processing delays are significantly higher for larger packets. To determine how the processing delays vary with the packet size, we apply ordinary linear regression to quantify their relationship. We define the base processing time as the average processing delay of a theoretical zero-size packet, and the processing speed as the increase in the average processing delay for a one-byte increase in packet size. The distributions of the base processing time and processing speed are plotted in Fig. 13. Interestingly, rather than being bell-shaped, the distribution of baseline delays is bi-modal and that of processing speeds is tri-modal. We believe the peaks in both distributions are due to different transmission speeds of I/O buses, different implementations of operating systems, and different versions of Skype, which might be exploited as a signature for a particular hardware/software combination.
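The per-call regression described above can be sketched as a simple ordinary-least-squares fit of processing delay against packet size. This is a minimal illustration; the function name, input layout, and example values are ours, not the paper's tooling.

```python
# Fit processing delay (ms) against packet size (bytes) by OLS.
# The intercept estimates the base processing time and the slope
# estimates the processing speed, as defined in the text.

def fit_processing_model(sizes, delays):
    """Return (base_time, speed): intercept and slope of an OLS fit.

    base_time ~ average delay of a theoretical zero-size packet (ms);
    speed     ~ extra delay per additional byte (ms/byte).
    """
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_d = sum(delays) / n
    cov = sum((s - mean_s) * (d - mean_d) for s, d in zip(sizes, delays))
    var = sum((s - mean_s) ** 2 for s in sizes)
    speed = cov / var
    base_time = mean_d - speed * mean_s
    return base_time, speed

# Example: synthetic packets whose delay grows ~0.01 ms/byte over a 2 ms base.
sizes = [60, 80, 100, 120, 150]
delays = [2.0 + 0.01 * s for s in sizes]
base, speed = fit_processing_model(sizes, delays)
print(round(base, 3), round(speed, 4))  # → 2.0 0.01
```

On noisy real traces the windowed averages shown in Fig. 12 would be the natural input to such a fit.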
We identify two peculiar phenomena in the traces. 1) In some calls, the average processing delays remain nearly constant regardless of the packet size. This can probably be explained by the very fast processing speed of the relay node; however, the calls' base processing times are not particularly small, like Call 539 in Fig. 12. 2) The relationship between the processing delay and the packet size is not exactly linear for 52% of calls. In those calls, the processing delays increase abruptly when the packet size reaches 80 and 130 bytes, as shown by Call 1042 in the figure. We believe the two behavior patterns can be attributed to different codecs or encryption schemes used by Skype. As Skype's implementation details are beyond the scope of this paper, we simply point out these peculiar patterns and leave them for future research.
Figure 13: The effect of packet size on processing delays: (a) the distribution of baseline delays, (b) the distribution of processing speeds.

5.3  Stability Analysis

We now turn to the stability of processing delays, which is closely related to the workload structure of a relay node. We begin our analysis with a definition of the metric busy level, which is designed to capture the level of the workload on the relay node. The busy level (BL) is defined as the 95% quantile of processing delays within a window of 10 seconds.
We choose the 95% quantile, rather than the more intuitive 50% quantile, because even on a heavily loaded machine, the processing delays incurred by relay packets are not always large due to the thread scheduling mechanism. More specifically, when a relay node is lightly loaded, a source packet is always serviced by the relay application as soon as it arrives at the node. In contrast, on a heavily loaded node, a source packet sometimes has to wait a long time before it is serviced. The processing delay of a packet depends on the exact time it arrives at the relay node. If a source packet arrives at a node just before the relay application's scheduled execution time, it will experience almost no processing delay no matter how heavy the workload is. Otherwise, the packet will be postponed by a load-dependent period before being processed. For these reasons, the busy level is deliberately biased toward high processing delays in order to capture the true workload.
After computing the busy levels for each second in a call, we detect change points where BLs differ significantly. Rather than employ a mathematical definition of change points [30,9], we identify a BL change from an operational point of view as follows. The time t is deemed a change point if the BLs of two consecutive seconds, t and t+1, differ by more than 10 ms. Based on the detected change points, we classify each relay node into two categories: stable and unstable. A relay node is considered stable if its PD series contains no change points, and unstable otherwise. In our traces, 75% of relay nodes are classified as stable and 25% as unstable.
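The busy-level and change-point computations can be sketched as follows. For brevity this sketch uses non-overlapping 10-second windows (the paper computes a BL per second); helper names and the flat `(timestamp, delay)` input format are ours.

```python
# Busy level (BL) = 95% quantile of processing delays in a 10-second window;
# a change point occurs wherever consecutive BLs differ by more than 10 ms.

def quantile(xs, q):
    """Nearest-rank q-quantile of a non-empty list."""
    xs = sorted(xs)
    idx = min(len(xs) - 1, int(q * len(xs)))
    return xs[idx]

def busy_levels(samples, window=10):
    """samples: list of (timestamp_sec, delay_ms); returns one BL per window."""
    by_window = {}
    for t, d in samples:
        by_window.setdefault(int(t) // window, []).append(d)
    return [quantile(by_window[w], 0.95) for w in sorted(by_window)]

def change_points(bls, threshold=10.0):
    """Indices i where the BL jumps by more than `threshold` ms from i to i+1."""
    return [i for i in range(len(bls) - 1)
            if abs(bls[i + 1] - bls[i]) > threshold]

# A node that is idle for 20 s, then loaded for 20 s: one change point.
trace = [(t, 2.0) for t in range(20)] + [(t, 30.0) for t in range(20, 40)]
bls = busy_levels(trace)
print(bls, change_points(bls))  # → [2.0, 2.0, 30.0, 30.0] [1]
```

A node whose series yields an empty change-point list would be classified as stable under the definition above.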

5.3.1  Host Activity Inference

The stability of relay nodes can also be seen as an indicator of the activity of the relay host. In other words, if the busy levels of a relay node change significantly during a call, it is likely that the computer was in use during that period. We verify the correctness of the inferred stability of relay nodes based on the intuition that a host is more likely to be active during working hours. To do so, we compute the local time of each call by mapping a relay node's IP address to its geographical location and time zone. The ratios of all hosts and of unstable hosts for each hour of the day are plotted in Fig. 14. We observe that, while the number of observed relay nodes is roughly constant, the number of unstable nodes is significantly higher from 8 AM to 4 PM in each node's local time. These results strongly support the contention that the measured processing delays reflect the true workload on relay nodes and can indicate whether a computer is currently being used by a person or running a workload-varying application.
Figure 14: The average ratios of all relay nodes and unstable relay nodes for each hour of a day, where the ratios are relative to the number of all relay nodes and unstable relay nodes respectively.
Table 3: Summary of Busy Levels of Stable and Unstable Relay Nodes
            Stable RN    Unstable RN    Ratio
BL Min      5 ms         6 ms           1.26
BL Mean     6 ms         17 ms          2.75
BL Max      8 ms         120 ms         14.17
BL SD       1 ms         16 ms          24.72
We summarize the busy levels of stable and unstable relay nodes in Table III. The table also lists the ratios of the BL statistics of unstable and stable relay nodes. The results show that the busy levels of unstable relay nodes are not only significantly higher than those of stable nodes (the ratio of BL maximums is 14), but are also much more variable (the ratio of BL standard deviations is 24).

5.3.2  Change Frequency

We investigate how frequently the busy levels change during a call, and find that the change frequency is low for most unstable relay nodes. Among the 25% of nodes classified as unstable, only 12% were unstable for more than 10% of the time, and only 8% for more than 20% of the time. However, 4% of the relay nodes changed constantly, for more than 50% of the time. Given an average BL of 20 ms and a change-detection threshold of 10 ms, the fluctuation of this top 4% of relay nodes is both frequent and large.

5.3.3  Change-Free Regions

A change-free region (CFR) is a continuous time period in which no BL changes occur. For the 25% of unstable relay nodes, the mean CFR length was 146 seconds, and the CFR inter-arrival time was 36 seconds on average. We find that the busy levels in CFRs are significantly lower and less variable than those outside CFRs. Specifically, the mean of the busy levels in CFRs is only 60% of that in non-CFRs, and the standard deviation of busy levels in CFRs is only 28% of that in non-CFRs.
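Given the per-second change points detected earlier, CFR extraction reduces to splitting the call timeline at those points. A minimal sketch, with illustrative names and input layout:

```python
# A CFR is a maximal run of seconds containing no change point.

def change_free_regions(num_secs, change_points):
    """Split [0, num_secs) into maximal (start, end) runs with no change point.

    `change_points` holds times t where BL(t) -> BL(t+1) changed, so each
    region boundary falls between t and t+1.
    """
    regions, start = [], 0
    for t in sorted(change_points):
        regions.append((start, t + 1))
        start = t + 1
    regions.append((start, num_secs))
    return regions

# Change points after seconds 99 and 139 split a 600-second call in three.
print(change_free_regions(600, [99, 139]))  # → [(0, 100), (100, 140), (140, 600)]
```

Statistics such as the mean CFR length then follow directly from the `(start, end)` pairs.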

5.3.4  Processing Delay Spikes

For both stable and unstable relay nodes, we occasionally observe PDs that are orders of magnitude larger than the average (see Fig. 20). To quantify such outliers, we assume the PDs are normally distributed and define the outlier-detection threshold as the mean plus 10 times the standard deviation. By this definition, PD spikes occur in 95% of the calls, with an average of 17 spikes per call, or 0.03 spikes per second. In addition, the maximum PD spike is 133 ms and the mean PD spike is 38 ms on average. To examine whether the occurrences of delay spikes can be modeled as a Poisson process, we apply the statistical tests described in [20], which comprise an Anderson-Darling test for exponential distributions and auto-correlation tests for independence. We find that 70% of the calls pass the independence test, but only 28% of the calls have exponentially distributed inter-spike times. Hence, we conclude that the occurrence of processing delay spikes cannot be modeled as a Poisson process, even on a 10-minute time scale.
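The spike definition above can be sketched directly; the inter-spike times computed here are what the Poisson tests in [20] would operate on. Helper names and the 33 pkt/s spacing assumption are ours.

```python
# A PD is a spike when it exceeds mean + 10 * standard deviation of the call.

from statistics import mean, pstdev

def find_spikes(delays, k=10.0):
    """Return indices of delays above mean + k * std."""
    mu, sigma = mean(delays), pstdev(delays)
    threshold = mu + k * sigma
    return [i for i, d in enumerate(delays) if d > threshold]

def inter_spike_times(spike_indices, pkt_interval=0.03):
    """Gaps between consecutive spikes, in seconds (33 pkt/s => ~0.03 s/pkt)."""
    return [(b - a) * pkt_interval
            for a, b in zip(spike_indices, spike_indices[1:])]

# 1000 packets at ~3 ms, with two 300 ms spikes injected.
delays = [3.0] * 1000
delays[100] = delays[700] = 300.0
idx = find_spikes(delays)
print(idx, [round(t, 2) for t in inter_spike_times(idx)])  # → [100, 700] [18.0]
```

Testing whether these gaps are exponentially distributed and independent would then decide the Poisson question per call.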

6  Impact of Processing Delay on VoIP Quality

Currently, peer-to-peer relaying technology is adopted by VoIP applications for the following reasons: 1) relaying helps bypass the connection restrictions of NATs and firewalls, and communication availability is an important consideration for widely used VoIP services; 2) VoIP requires a high degree of real-time interactivity, and relaying can help reduce network latency; 3) the bandwidth requirement of VoIP communication is low, e.g., 32 Kbps, which imposes only a modest overhead on a relay node. In many cases, if we choose a relay node carefully, we are likely to find a relayed path that yields better network quality than the native Internet routing path. Thus, a number of studies have been devoted to finding a good relay node that yields a better-quality path in terms of network latency or packet loss rate [23,5,4,11,3,29,15,2]. However, without considering the workload on relay nodes, the processing of relay packets may introduce significant delays to the relayed data stream and therefore degrade application performance.
Having shown that the processing delays at a relay node are not always negligible, we now consider the impact of relay processing delays on VoIP quality. Via trace-driven simulations, we show that such delays may have a negative impact on VoIP quality. We also perform a characterization of busy periods on relay nodes and show that the busy status of relay nodes is generally quite unstable.

6.1  Methodology

We use trace-driven simulations to assess the impact of processing delays on VoIP quality. To measure the effect of processing delays under various network conditions, in each run we combine a network delay trace and a processing delay trace to simulate a VoIP call with and without packet relaying. We use the ack response times extracted from the collected calls as the input network delays. As each of our 1,115 traces provides a series of network delays and a series of processing delays separately, our simulation comprises a total of 1,243,225 runs. The simulation time of each run is set to 250 sec and the packet rate to 33 pkt/sec. During each simulation run, we compute the end-to-end delay and packet loss rate based on a given pair of network delay and processing delay traces. The end-to-end delay is determined by the playout buffer size, which also determines the end-to-end loss rate because a packet is considered lost if it cannot meet the playout schedule. There are two common schemes for adjusting the playout buffer size, namely static and adaptive, both of which are included in our simulations. For the static buffer strategy, we use a fixed 60-ms buffer following Skype's setting [25]. Our adaptive buffer is computed by the following equations:
pi = di + 4 × vi,
di = α × di−1 + (1 − α) × ni,
vi = β × vi−1 + (1 − β) × |di − ni|,
where pi is the buffer size for packet i+1, ni is the network delay of packet i, di and vi are the running estimates of the mean and variation of the network delay, and α = β = 0.998002 according to [22].
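The adaptive playout buffer of Ramjee et al. [22] can be sketched as a pair of exponentially weighted estimates; variable names are ours and the parameter values follow [22].

```python
# Adaptive playout buffer: running estimates of network delay (d) and its
# variation (v); the buffer for packet i+1 is d_i + 4 * v_i.

ALPHA = BETA = 0.998002  # smoothing factors from [22]

def playout_buffer_sizes(network_delays):
    """Yield the buffer size chosen after each observed packet delay (ms)."""
    d = network_delays[0]   # running mean-delay estimate
    v = 0.0                 # running delay-variation estimate
    sizes = []
    for n in network_delays:
        d = ALPHA * d + (1 - ALPHA) * n
        v = BETA * v + (1 - BETA) * abs(d - n)
        sizes.append(d + 4 * v)
    return sizes

# Constant 50 ms network delay: the estimates converge and the buffer
# settles at the network delay itself, since the variation term is zero.
sizes = playout_buffer_sizes([50.0] * 100)
print(round(sizes[-1], 6))  # → 50.0
```

Spiky processing delays inflate the variation estimate v, which is why the adaptive buffer grows under relay processing, as discussed in Section 6.2.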
We use the ITU-T E-model [12] to quantify the voice quality of each simulated call. Essentially, the E-model transforms a set of transmission impairments into a psychological satisfaction level. The main computation equation of the E-model is
R = Ro − Is − Id − Ie + A,
where Ro represents the fundamental signal-to-noise ratio, Is represents the impairments that occur simultaneously with the voice signal, Id represents the impairments related to delays, and Ie represents the impairments related to information loss. The advantage factor A compensates for other advantages of access available to the user. Since the computation of Ro and Ie is codec-dependent, for simplicity, we assume the voice codec is the widely deployed G.711. The output of the E-model is the R-score, which represents the overall voice quality on a 100-point scale. In our settings, the R-score reaches a maximum of 93.2 without any end-to-end delay or loss. Generally, an R-score higher than 80 implies satisfactory VoIP quality, while an R-score lower than 70 implies a quality that many users find unacceptable [19]. Following Table 1 in [19], we use a reduction of 10 points in the R-score to denote a significant degradation in VoIP conversation quality.
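For G.711, a widely used simplification of the E-model (due to Cole and Rosenbluth) reduces the R-score to delay and loss terms. This sketch is an assumption for illustration and may differ in detail from the paper's implementation; it does reproduce the 93.2 maximum quoted above.

```python
# Simplified E-model for G.711 with random loss (assumed form, not the
# paper's exact code): R = 93.2 - Id - Ie, with Id driven by one-way
# delay d (ms) and Ie by the end-to-end loss rate e.

import math

def r_score(delay_ms, loss_rate):
    """Approximate R-score: higher is better, max 93.2."""
    d = delay_ms
    # Delay impairment: linear term plus a penalty above 177.3 ms.
    i_d = 0.024 * d + 0.11 * (d - 177.3) * (1 if d > 177.3 else 0)
    # Loss impairment for G.711 under random loss.
    i_e = 30.0 * math.log(1.0 + 15.0 * loss_rate)
    return 93.2 - i_d - i_e

# No delay or loss gives the maximum of 93.2, matching the text.
print(round(r_score(0.0, 0.0), 1))  # → 93.2
# A loaded relay adding 250 ms delay and 2% loss drops quality below 80.
print(r_score(250.0, 0.02) < 80)    # → True
```

The simulation then compares R-scores computed with and without the relay processing delay trace applied.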

6.2  Performance Degradation

6.2.1  Transmission Delay and Loss

For each pair of network delay and relay processing delay traces, we assess the quality degradation by evaluating the network performance separately, i.e., end-to-end delay and loss, as well as voice quality, with and without relay processing. First, we consider the network performance degradation shown in Fig. 15. From the figure, we observe that the average delay increase caused by relay processing is not large because the playout buffer absorbs the variability introduced by processing delays. However, a fraction of calls still experience significant end-to-end delay increases, say, greater than 50 ms. The reason for the larger delay increase in the adaptive buffer scheme is that the buffer size is proportional to the mean deviation of network delays (Eqn. 3); thus, it becomes larger in the face of highly spiky processing delays. Moreover, the adaptive buffer does not seem resilient enough to absorb highly variable processing delays, as it leads to both longer delays and higher packet loss rates than the static buffer scheme. In addition, the static buffer always absorbs delay deviations shorter than 60 ms such that only 1% of calls have an end-to-end loss rate higher than 1%, compared to 10% of calls under the adaptive buffer strategy.
Figure 15: The distributions of (a) average increase of end-to-end delays, (b) increase of packet loss rate in each call.

6.2.2  VoIP Quality

With respect to VoIP quality degradation, we can see from Fig. 16(a) that the average R-score decrease caused by relay processing is generally small; most calls have an average R-score decrease smaller than 10. A few calls with adaptive buffering do have better voice quality, i.e., an increased R-score. Our analysis shows that this counter-intuitive behavior is due to a lower packet loss rate. The lower rate is a consequence of a larger playout buffer, which in turn is caused by a higher degree of delay variation due to relay processing. At the same time, the adaptive buffer scheme also leads to poor conversation quality, as about 20% of calls have an average R-score decrease of more than 10, which we consider a significant degradation in quality.
The performance degradation is much more obvious if we check the distribution of maximum R-score decreases in Fig. 16(b). The adaptive buffer scheme still performs much worse than the static buffer scheme. However, both schemes are subject to significant quality degradation due to relay processing; 40% and 58% of calls experienced considerable quality degradation for the static and adaptive buffer strategies respectively.
Figure 16: The distributions of (a) average R-score decrease, (b) maximum R-score decrease in each call.

6.2.3  Original Quality vs Degradation

We note that the impact of processing delay is not equivalent for all calls. According to the E-model, a better-quality call is more prone to performance degradation [12,16]. We verify this effect by mapping the original R-score of each call to its degraded R-score, as shown in Fig. 17. Generally speaking, calls with a high R-score incur more quality degradation than those with a low R-score. We observe that the adaptive buffer scheme causes more degradation than the static buffer scheme for calls with an original R-score higher than 40. The reason is that a call with an initially high R-score is associated with short network delays; in this case, the adaptive buffer size tends to be decided by the spiky relay processing delays. On the other hand, the static buffer scheme is not susceptible to delay variations, so the quality degradation remains relatively constant.
Figure 17: The relationship between the original R-scores and the averaged degraded R-scores.

6.2.4  Summary

Here, we formally define that a call is "degraded" if it experiences an R-score decrease greater than 10 points. Among the simulated calls, we find that the static buffer scheme causes 31% of the calls to be degraded, while the adaptive buffer scheme causes 54% to be degraded. In addition, for the two schemes, the average degradation time ratio within a call is 10% and 18% respectively.

6.3  Impact of Busy Levels

Having evaluated the overall VoIP quality degradation caused by relay processing delays, we now examine the relationship between workload levels and the degree of voice quality degradation. Recall that we defined the busy level to quantify the workload at a relay node in Section V-C. For each call, we first divide the busy levels and R-score decreases into 10 groups by their ranks. Then, we plot the average R-score decreases for a variety of busy levels, as shown in Fig. 18.
The figure clearly shows that higher busy levels lead to more serious quality degradation. On average, the R-score decreases are higher than 10 points when the busy level of the relay node is higher than 20 ms. This result implies that we should avoid a relay node with a busy level higher than 20 ms, as transmitting VoIP packets through this node would lead to significant quality degradation.
Figure 18: The relationship between busy levels and the corresponding average R-score decreases. The dashed lines are the 95% confidence bands of the averages.
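The rank-based grouping behind Fig. 18 can be sketched as follows: pair each window's busy level with its R-score decrease, sort by busy level, split into 10 equal-rank groups, and average the decrease per group. Names and the synthetic data are illustrative.

```python
# Average R-score decrease per busy-level rank group (deciles).

def rank_groups(busy, decreases, n_groups=10):
    """Return the mean decrease for each of n_groups equal-rank BL groups."""
    pairs = sorted(zip(busy, decreases))      # sort by busy level
    size = len(pairs) // n_groups
    out = []
    for g in range(n_groups):
        chunk = pairs[g * size:(g + 1) * size]
        out.append(sum(d for _, d in chunk) / len(chunk))
    return out

# Synthetic data in which the decrease grows with the busy level.
busy = list(range(100))
dec = [b / 4 for b in busy]
g = rank_groups(busy, dec)
print(g[0], g[-1])  # → 1.125 23.625
```

A monotone trend across the groups, as in the synthetic example, is exactly the pattern Fig. 18 shows for the real traces.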

6.4  Busy Period Characterization

Based on the analysis results, we define a relay node as busy when its busy level is higher than 20 ms. In our traces, 23% of relay nodes were ever in a busy state during a 10-minute call. We also define a busy period of a relay node as a continuous time span during which the node is busy.
To understand the patterns of busy periods on relay nodes, we extract all the busy periods in each call and plot the distributions of their length and interarrival times in Fig. 19. We find that a busy period lasts for 18 seconds on average, where 65% of busy periods are shorter than 10 seconds. Also, the interarrival time of busy periods is 25 seconds on average. These statistics suggest that the busy status of relay nodes is quite unstable because the nodes tend to switch between busy and non-busy states frequently.
Figure 19: The distributions of (a) busy period length, (b) busy period interarrival time in a call.

7  Conclusion

In this paper, we consider a hidden aspect of peer-to-peer relaying: the processing delays at relay nodes. Existing related works mostly focus on how to improve current peer-to-peer systems by data relaying, but seldom discuss its adverse effects. Through the trace collection in Section IV, the statistical analysis in Section V, and the network simulations in Section VI, we show that relaying is a double-edged sword in that it can also be detrimental to VoIP quality if an inappropriate relay node is used. The degradation cannot be avoided completely, as a lightly loaded relay node may start running a load-intensive application at any time. Thus, we have to monitor the processing delays of a relay node continuously, as we usually do for network conditions, to prevent the application performance from being degraded. We hope this study will motivate the designers of future peer-to-peer systems to consider this negative aspect of the application-layer relaying technique.
Figure 20: A classification of common processing delay patterns. We list five categories: typical, variable, level-shifted, periodic, and loaded.


[1] S. Baset and H. Schulzrinne, "An analysis of the Skype peer-to-peer internet telephony protocol," in INFOCOM.    IEEE, 2006.
[2] K.-T. Chen, C.-Y. Huang, P. Huang, and C.-L. Lei, "Quantifying Skype User Satisfaction," in Proceedings of ACM SIGCOMM 2006, Pisa, Italy, Sep 2006.
[3] C.-M. Cheng, Y.-S. Huan, H. T. Kung, and C.-H. Wu, "Path probing relay routing for achieving high end-to-end performance," in Global Telecommunications Conference, 2004. GLOBECOM '04. IEEE, vol. 3, 2004, pp. 1359-1365.
[4] T. Fei, S. Tao, L. Gao, and R. Guérin, "How to select a good alternate path in large peer-to-peer systems?" in INFOCOM.    IEEE, 2006.
[5] T. Fei, S. Tao, L. Gao, R. Guérin, and Z.-L. Zhang, "Light-weight overlay path selection in a peer-to-peer environment," in INFOCOM.    IEEE, 2006.
[6] B. Ford, P. Srisuresh, and D. Kegel, "Peer-to-peer communication across network address translators," in USENIX Annual Technical Conference, 2005, pp. 179-192.
[7] Google, Inc.,
[8] S. Guha and N. Daswani, "An experimental study of the Skype peer-to-peer VoIP system," Cornell University, Tech. Rep., Dec. 16 2005.
[9] F. Gustafsson, Adaptive Filtering and Change Detection.    John Wiley & Sons, September 2000.
[10] X. Hei, C. Liang, J. Liang, Y. Liu, and K. Ross, "A Measurement Study of a Large-Scale P2P IPTV System," in IPTV Workshop, International World Wide Web Conference, 2006.
[11] X. Hei and H. Song, "Stochastic relay routing in peer-to-peer networks," in Proceedings 41st IEEE International Conference on Communications, 2006.
[12] ITU-T Recommendation, "G.107: The E-Model, a Computational Model for Use in Transmission Planning," International Telecommunication Union, Geneva, 2002.
[13] X. Liao, H. Jin, Y. Liu, L. M. Ni, and D. Deng, "Anysee: Peer-to-peer live streaming," in INFOCOM.    IEEE, 2006.
[14] L. Liu and R. Zimmermann, "Adaptive low-latency peer-to-peer streaming and its application," Multimedia Systems, vol. 11, no. 6, pp. 497-512, 2006.
[15] Y. Liu, Y. Gu, H. Zhang, W. Gong, and D. Towsley, "Application level relay for high-bandwidth data transport," in The First Workshop on Networks for Grid Applications (GridNets), San Jose, October 2004.
[16] A. Markopoulou, F. A. Tobagi, and M. J. Karam, "Assessment of VoIP quality over internet backbones," in Proceedings of INFOCOM, 2002.
[17] S. McCanne and V. Jacobson, "The BSD packet filter: A new architecture for user-level packet capture," in Proceedings of USENIX'93, 1993, pp. 259-270.
[18] J. Nagle, "Congestion control in IP/TCP internetworks," Computer Communication Review, vol. 14, no. 4, pp. 11-17, Oct. 1984.
[19] M. Narbutt, A. Kelly, L. Murphy, and P. Perry, "Adaptive VoIP playout scheduling: Assessing user satisfaction," IEEE Internet Computing, vol. 9, no. 4, pp. 28-34, 2005.
[20] V. Paxson and S. Floyd, "Wide-area traffic: The failure of poisson modeling," in SIGCOMM, 1994, pp. 257-268.
[21] R. Ramaswamy, N. Weng, and T. Wolf, "Characterizing network processing delay," in Proceedings of IEEE Global Communications Conference (GLOBECOM), Dallas, TX, Nov. 2004.
[22] R. Ramjee, J. F. Kurose, D. F. Towsley, and H. Schulzrinne, "Adaptive playout mechanisms for packetized audio applications in wide-area networks," in INFOCOM, 1994, pp. 680-688.
[23] S. Ren, L. Guo, and X. Zhang, "ASAP: an AS-aware peer-relay protocol for high quality VoIP," in Proceedings of ICDCS, 2006, pp. 70-79.
[24] J. Rosenberg, R. Mahy, and C. Huitema, "Traversal Using Relay NAT (TURN)," draft-rosenberg-midcom-turn-05 (work in progress), July, 2004.
[25] B. Sat and B. W. Wah, "Playout scheduling and loss-concealments in VoIP for optimizing conversational voice communication quality," in Proceedings of Multimedia'07.    New York, NY, USA: ACM, 2007, pp. 137-146.
[26] C. Schensted, "Longest increasing and decreasing subsequences," Canad. J. Math., vol. 13, pp. 179-191, 1961.
[27] Skype Limited,
[28] D. A. Solomon and M. Russinovich, Inside Microsoft Windows 2000.    Microsoft Press Redmond, WA, USA, 2000.
[29] H. Zhang, L. Tang, and J. Li, "Impact of Overlay Routing on End-to-End Delay," in Proceedings of ICCCN, 2006, pp. 435-440.
[30] Y. Zhang and N. G. Duffield, "On the constancy of internet path properties," in Proceedings of Internet Measurement Workshop, V. Paxson, Ed.    San Francisco, California, USA: ACM, Nov 2001, pp. 197-211.
[31] R. Zimmermann, B. Seo, L. Liu, R. Hampole, and B. Nash, "Audiopeer: A collaborative distributed audio chat system," Distributed Multimedia Systems, San Jose, CA, 2004.


1. This work was supported in part by Taiwan Information Security Center (TWISC), National Science Council of the Republic of China under the grants NSC 96-2219-E-001-001, NSC 96-2219-E-011-008, and NSC 96-2628-E-001-027-MY3.
2. This statement is somewhat simplified. Even if high-priority tasks have not been completed, the scheduler will regularly give lower-priority tasks a chance to be executed in order to avoid starvation.
4. We take Microsoft Windows as an example, but other modern operating systems work similarly.

Sheng-Wei Chen (also known as Kuan-Ta Chen) 
Last Update September 19, 2017