Inferring Speech Activity from Encrypted Skype Traffic

Yu-Chun Chang, Kuan-Ta Chen, Chen-Chi Wu, and Chin-Laung Lei
Department of Electrical Engineering, National Taiwan University
Institute of Information Science, Academia Sinica


Abstract

Normally, voice activity detection (VAD) refers to speech processing algorithms for detecting the presence or absence of human speech in segments of audio signals. In this paper, however, we focus on speech detection algorithms that take VoIP traffic instead of audio signals as input. We call this category of algorithms network-level VAD.
Traditional VAD usually plays a fundamental role in speech processing systems because of its ability to delimit speech segments. Network-level VAD, on the other hand, can be quite helpful in network management, which is the motivation for our study. We propose the first real-time network-level VAD algorithm that can extract voice activity from encrypted and non-silence-suppressed Skype traffic. In addition, we demonstrate that speech activity can be helpful in two potential applications for network management, namely, VoIP flow identification and user perception assessment.
Keywords: QoS Provisioning, Traffic Classification, User Satisfaction, VoIP, Voice Activity Detection

1  Introduction

Traditionally, voice activity detection (VAD) refers to speech processing algorithms for detecting the presence or absence of human speech in segments of audio signals. One of the most well-known applications of VAD is called silence suppression. To implement silence suppression, a speech coder needs to incorporate a VAD module so that it only outputs sound signals when human speech is present and the signal length is therefore reduced. By so doing, silence suppression can reduce the network bandwidth used by voice packets and achieve higher communication channel utilization. In addition, VAD has many applications in speech processing systems, such as speech encoding, echo cancellation, and speech recognition. Hereafter, we call this category of speech detection algorithms source-level VAD, as they operate on audio signals directly.
In this paper, we focus on speech detection algorithms that take VoIP traffic instead of audio signals as input. We call this category of algorithms network-level VAD. The location where the algorithm is implemented depends on the type of signal that each category of VAD algorithms processes. Because source-level VAD deals with audio signals, the VAD module usually resides in end-users' PCs or phones. In contrast, network-level VAD infers speech activity from network traffic, so it can run on any network node. From the perspective of applications, source-level VAD usually plays a fundamental role in speech processing systems because of its ability to delimit speech segments. Network-level VAD, on the other hand, can be helpful in network management. We discuss some of its applications below and detail them in Section 6. The differences between source-level and network-level VAD are summarized in Table 1.
Table 1: The differences between source-level and network-level voice activity detection

            source-level                            network-level
input       audio signal                            network traffic
location    speaker's host                          network node
purpose     silence suppression, echo cancellation  traffic management, QoS measurement
Motivation. Our proposed network-level VAD scheme, which infers speech activity from network traffic, is motivated by its potential applications in network management. In this paper, we consider two of those applications, namely VoIP flow identification and user perception assessment.
  1. VoIP Flow Identification: Flow identification is an essential component in network traffic and QoS management. In business enterprises, there is often a need to manage VoIP flows due to institutional policies, such as restricting calls to certain destinations, or blocking calls at certain times. Providing better QoS is another motivation for identifying VoIP flows. While such flows are difficult to recognize due to proprietary protocols, non-standard port numbers, and encrypted payloads, the human conversation pattern embedded in the traffic can be a unique signature of VoIP flows. The discrepancy arises because the conversation activity between two people often occurs on a multi-second time scale, while the traffic patterns of other applications, e.g., web traffic or online game traffic, are usually on a sub-second time scale [6].
  2. User Perception Assessment: An efficient, objective, and accurate method of assessing a user's perception of a call is fundamental to ensuring a satisfying VoIP user experience. Previous studies have indicated that the conversational interactivity and smoothness of a call can reflect the user's perception of quality [17], [19]. When the network quality is not good, the conversation tends to be intermittent rather than continuous. Moreover, the level of interactivity is lower as time is wasted waiting for a response, asking for something to be repeated, slowing the pace, repeating sentences, and thinking. Thus, a network node can determine the user's perception of a VoIP call's quality by extracting speech activity from the network traffic and taking appropriate action if the user perceives that the quality is unsatisfactory.
Challenges. Intuitively, it should be easy to achieve network-level VAD by simply examining the payload of VoIP packets. In ideal circumstances, we would be able to extract speech signals from the packets' payloads, and then apply source-level VAD to the signals. However, payload encryption is becoming a common design feature to preserve privacy, so that parties other than call participants cannot know the content of a conversation. Even if a VoIP application does not encrypt its packets, the packets' payloads may be inaccessible because obtaining such information would be a violation of privacy. An alternative solution is to determine speech activity based on the packet rate, as VoIP applications normally use silence suppression for channel multiplexing and more efficient communications [1]. However, an increasing number of VoIP applications, such as Skype and UGS [2], do not support silence suppression in order to obtain better voice quality and maintain UDP bindings at the NAT. This indicates that we cannot rely on either the packet payload or the packet rate to determine the presence or absence of speech bursts. Specifically, the traffic generated by Skype, one of the most popular VoIP applications, raises both challenges: Skype's traffic is encrypted, and the system generates packets regardless of whether a speaker is talking. Thus, we need a more sophisticated approach than these intuitive methods to infer speech activity from encrypted and non-silence-suppressed VoIP traffic.
Contributions. Given the above difficulties, in this paper, we propose a scheme that can infer speech activity from encrypted Skype traffic with no silence suppression. We chose Skype as our study subject because its traffic exhibits both difficulties: encryption and a constant packet rate. The proposed scheme is based on our observation that, in Skype traffic, speech activity is highly correlated to packet size, as more information will be encoded in a voice packet while a user is speaking. To demonstrate this point, we show the sound volume of an audio segment and the sizes of the voice packets corresponding to the segment in Fig. 4. The graph reveals that the packet size and speech volume are highly correlated as they fluctuate in tandem.
The contribution of this paper is three-fold. 1) Traditional VAD algorithms work on audio signals that are only available on the speakers' devices, whereas we propose using network-level VAD to infer speech activity from VoIP traffic. 2) We propose the first real-time network-level VAD algorithm, which is able to deal with encrypted and non-silence-suppressed VoIP traffic. 3) We demonstrate that speech activity can be helpful in network management by applying it in two potential applications: VoIP flow identification and measurement of users' perceptions of conversation quality.
The remainder of this paper is organized as follows. Section 2 summarizes earlier studies on speech detection. We describe our trace collection methodology and summarize our traces in Section 3. In Section 4, we propose a network-level VAD algorithm for encrypted and non-silence-suppressed VoIP traffic. The speech detection accuracy of the proposed algorithm is evaluated in Section 5. In Section 6, we propose and discuss two network management applications that are made feasible by network-level VAD. We then summarize our conclusions in Section 7.

2  Related Work

When designing source-level VAD algorithms, it is difficult to determine whether a speech burst is present because of background noise. As a result, a large number of source-level VAD algorithms have been proposed to reduce the influence of background sound. For example, in [9], Hoyt and Wechsler used concave or convex formant patterns to detect the presence of speech. The algorithm proposed in [7] is based on a pattern recognition approach, in which the matching phase follows six fuzzy rules and is trained by a new hybrid learning tool. In [15], Prasad, Vijay, and Shankar used an information theoretic measure, called spectral entropy, to differentiate silence segments from speech segments. In [16], Rabiner and Sambur proposed classifying a speech segment as a silence-, noise-, or speech-period based on linear prediction coding (LPC) parameters and the energy distance. In [11], Jongseo and Wonyong adopted the generalized likelihood ratio test based on the assumption that noise statistics are known a priori before speech detection is performed. Meanwhile, the approach in [21] employed adaptive modeling of voice signals to detect and remove noise signals.
Our approach differs from earlier studies in a number of respects, as shown in Table 1. We consider that detecting speech activity based on network traffic is more difficult than that based on audio signals. This is because network traffic takes the form of audio signals compressed by an encoder, which may adjust the coding level, redundancy level, quantization levels, and packetization period at any time to achieve more efficient voice transmission. All these factors introduce more randomness into VoIP traffic and thus make the design of network-level VAD algorithms more challenging.

3  Data Description

In this section, we describe the data set we used to develop, test, and validate our proposed VAD algorithm. The data set is comprised of two traces: 1) VoIP traffic, and 2) the corresponding speech signals, to verify the correspondence of VoIP traffic and the speech activity inferred by our scheme. In the following, we first explain the experiment setup and the method used to collect the traffic and speech signal traces. We then describe how we extract speech activity from audio recordings for use as the ground-truth in subsequent performance evaluations.

3.1  Experiment Setup

exp_setup.png
Figure 1: Network setup for measuring processing delays at any relay node on the Internet.
We conducted a number of Internet experiments to collect Skype traffic because it mimics the real-life traffic that a network node would encounter. The reason is that the structure and pattern of network traffic may change due to network impairment, i.e., network delay and packet loss, or the application itself. For example, Skype may adjust its encoding factor, redundancy level, and packetization intervals to adapt to the host and network load.
Our trace collection mechanism comprises three commodity PCs deployed in the way shown in Fig. 1. One serves as the VoIP sender, one serves as the receiver, and the third is a relay node that collects information about the traffic of both call parties. The collection procedure is as follows:
  1. When the measurement program is initialized, we block the sender from reaching the receiver directly with a firewall program ipfw.
  2. The sender initiates a VoIP call to the receiver (via the receiver's Skype name). Because of the firewall setting, the sender will be connected to the receiver host via one of its relay nodes.
  3. If a VoIP call is established, we know that Skype has found a relay node to relay voice packets between the sender and the receiver. Occasionally, after a few retries, a VoIP call still cannot be established because the sender fails to find a usable relay node from the candidate list. In this case, our daemon will restart the Skype program and re-attempt to establish a VoIP call. The Skype program will retrieve a new list of relay nodes from the central server at the startup, so further VoIP calls should be successful.
    In addition, we sometimes find that voice packets from the sender to receiver and vice versa are relayed through different relay nodes. In this case, we simply drop the call, block both relay nodes, and re-dial.
  4. To simulate a conversation, a WAV file comprised of all the English recordings in the Open Speech Repository (see footnote 1) is played continuously for both parties during a call. At the same time, we record the received voice data in MP3 files at the receiver by using a Skype plugin program Pamela.
  5. After a call has lasted 10 minutes, we block the current relay node at the sender by ipfw and terminate the call. We then wait for 30 seconds before re-iterating the loop from Step 2.
Table 2: Summary of VoIP traces

Total # of traces   # TCP              # UDP
1839                1427               412

# Relay nodes       Mean packet size   Mean time period
1677                109.6 bytes        612.5 sec
We ran the trace collection procedures over a two-month period in mid-2007 and collected 1,839 calls, which are summarized in Table 2. The observed relay nodes are spread around the world, as shown in Fig. 2. Of the 1,839 traces, 1,427 were based on TCP, and 412 were based on UDP.
map1.png
Figure 2: Geographical locations of observed relay nodes in our trace.

3.2  Extracting Speech Activity from Audio Recordings

To evaluate the performance of our VAD algorithm, in addition to network traffic, we need the "true" speech activity that was "heard" by the receiver. Therefore, we recorded the audio signals that were played by the receiver host's output device into WAV files and applied a source-level VAD algorithm to extract the speech activity from the sound recordings.
Our source-level VAD scheme is based on the sound volume, which is a commonly used indicator of voice activity. We apply the function
$$\mathrm{volume} = 10 \cdot \log\Bigl(\sum_{i} S_i^2\Bigr) \tag{1}$$
in [10] to compute the volume of each sound sample, $S_i$, in a segment of audio signals. The quantity computed is referred to as the log energy in units of decibels and the series we obtain is called the volume process.
To determine whether speech is present at a given time, we apply a static thresholding method to the volume process. First, we construct a window of 5 samples in the volume process and observe whether the difference between a local extreme and any other sample is larger than 50 dB. Then, we collect all the local maxima and local minima in the volume process, where the former indicate the volumes of speech bursts, and the latter indicate the volume when speech is not present. The graph in Fig. 3 plots the density of the extreme volumes detected in our trace. In the figure, the volumes that represent speech and silence form two separate clusters. Based on the graph, we took the half-way point between the two peaks as the threshold for determining whether a speech burst is present, and obtained 183 dB as our static threshold.
Based on the computed static threshold, we classify each speech sample as speech or silence depending on whether the volume of the sample is higher than the threshold. We define an ON period as a segment of speech signals whose volume is higher than the threshold, and define an OFF period as a segment of speech signals that is considered silence. In Section 5, we use the inferred ON and OFF periods to evaluate the accuracy of our VAD algorithm, which infers speech activity from network traffic.
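The volume computation and static thresholding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a base-10 logarithm in Eq. (1), non-overlapping frames, and hypothetical values for `frame_len` and `threshold_db` (the 183 dB threshold reported above depends on the sample scale of the original recordings).

```python
import math

def log_energy(frame):
    """Eq. (1): 10*log10 of the sum of squared samples (base 10 is an assumption)."""
    energy = sum(s * s for s in frame)
    # Floor the energy so an all-zero frame does not trigger log(0) (assumed guard).
    return 10.0 * math.log10(max(energy, 1e-12))

def volume_process(samples, frame_len):
    """Split the samples into non-overlapping frames and compute each frame's volume."""
    return [log_energy(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def classify_on_off(volumes, threshold_db):
    """True (ON) where the volume exceeds the static threshold, False (OFF) otherwise."""
    return [v > threshold_db for v in volumes]

# Toy signal: one loud frame followed by one quiet frame.
samples = [1000, -1000, 1000, -1000, 1, -1, 1, -1]
volumes = volume_process(samples, frame_len=4)
states = classify_on_off(volumes, threshold_db=30.0)
```

In this toy input the first frame is classified as ON and the second as OFF; with real recordings the threshold would be derived from the density of extreme volumes, as in Fig. 3.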
volume_density.png
Figure 3: The distributions of sound volumes in speech and silence periods.

4  The Proposed Scheme

In this section, we present our VAD algorithm for extracting conversation activity from encrypted and non-silence-suppressed VoIP traffic. We first explain our choice of packet size as the indicator of voice activity, after which we discuss the design of our voice activity detection scheme, which comprises two phases: smoothing and dynamic thresholding.
wav_pkt_bit.png
Figure 4: The volume process of a human speech recording and the packet size and bit rate process of the corresponding VoIP traffic.
From our preliminary analysis, we find that both the packet size and the bit rate are indicators of speech activity, even though the packet payload is encrypted. Fig. 4 shows a comparison of the volumes, packet sizes, and bit rates that correspond to a segment of speech signals. The graph reveals that both the packet size and the bit rate correlate, to some extent, with the volume process. We use correlation coefficients to quantify the strength of their relationships. On average, the correlation coefficient between the volume and the packet size process is 0.78. Between the volume and the bit rate process, it is 0.59; and between the volume and the packet rate process, it is only −0.02. The result suggests that the packet rate is nearly independent of user speech. Although the bit rate exhibits a reasonable correlation with the volume process, it may contain randomness due to packet retransmission or congestion control mechanisms. Therefore, we adopted the packet size process as the basis for inferring voice activity.
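As an illustration of how such a coefficient can be computed, the sketch below calculates the Pearson correlation between two equally sampled series. The text does not state which correlation coefficient was used, so Pearson is an assumption here, and the toy `volume` and `packet_size` values are invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined when either series is constant (zero variance).
    return cov / math.sqrt(var_x * var_y)

# Hypothetical series sampled at the same time units.
volume = [10.0, 60.0, 65.0, 12.0, 58.0]        # volume process (dB)
packet_size = [40.0, 120.0, 130.0, 45.0, 110.0]  # packet size process (bytes)
r = pearson(volume, packet_size)
```

Note that a perfectly constant packet rate has zero variance, in which case the coefficient is undefined; real traces fluctuate slightly, which is why a value close to zero can still be reported.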
As there is a strong correlation between packet size and speech activity, intuitively, a static threshold should be sufficient to determine whether speech is present. Note that this is the same as the method used for speech signals in Section 3.2. However, we find that a static threshold is not feasible because the packet size process is not stationary: its mean may change over time. This may occur because Skype adjusts the encoding bit rate, redundancy factor, and the packetization delay according to the host's CPU load, the congestion level, and the bandwidth of the network path. To address these challenges, we apply a smoothing procedure to remove high-frequency variations in the packet size process, and then apply an adaptive thresholding mechanism to deal with the non-stationarity of the packet sizes.

4.1  Smoothing

We apply an exponentially weighted moving average (EWMA) on the packet size process to remove high-frequency fluctuations. The EWMA is defined as
$$P_i = \lambda Y_i + (1 - \lambda) P_{i-1}, \tag{2}$$
where $Y_i$ denotes the observed packet size in the $i$th time unit (a time unit of 0.1 second is used in this study) of the process, and $P_i$ denotes the smoothed packet size in the $i$th time unit. We find that setting the weight $\lambda$ to 0.2 achieves the best performance (in terms of voice activity detection accuracy, which is discussed in the next section), while $\lambda$ within a range of 0.1 to 0.4 yields a similar performance.
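A minimal sketch of the smoothing step in Eq. (2) follows. The initial value of the smoothed process is not specified in the text; seeding it with the first observation is an assumption made here for illustration.

```python
def ewma(sizes, lam=0.2):
    """Exponentially weighted moving average of a packet size process (Eq. 2)."""
    if not sizes:
        return []
    smoothed = []
    prev = sizes[0]  # assumption: the process is seeded with the first observation
    for y in sizes:
        prev = lam * y + (1.0 - lam) * prev
        smoothed.append(prev)
    return smoothed

# Packet sizes (bytes) per 0.1-second time unit; the values are illustrative.
smoothed = ewma([100.0, 100.0, 200.0], lam=0.2)
```

With `lam=0.2`, the jump from 100 to 200 bytes moves the smoothed value only to 120 bytes, which shows how isolated spikes are damped while sustained changes are still followed.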

4.2  Adaptive Thresholding

We now introduce the adaptive thresholding algorithm, which tries to find a reasonable threshold for determining the presence of speech given a non-stationary packet size process. The steps of the algorithm are as follows:
  1. In the smoothed packet size process, we first find all the local maximums and local minimums within a window of 5 samples. In a window, if the difference between a local extreme and any other sample is greater than 25 bytes, we call it a "peak" if it is a local maximum, and a "trough" if it is a local minimum. The detected peaks and troughs are collected in a peak list and a trough list, respectively.
  2. We denote each peak and trough as (ti, si), where ti means the occurrence time of the peak or trough and si refers to the smoothed packet size. For each pair of adjacent troughs, (ta, sa) and (tb, sb), on the trough list, if there are one or more peaks on the peak list between these two troughs, we take the peak with the largest packet size and denote the packet size as sp. We then draw an imaginary line from (ta, (sa + sp)/2) to (tb, (sb + sp)/2) as an adaptive threshold.
  3. We determine the state of each voice sample as ON or OFF by checking whether the smoothed packet size is greater than any of the adaptive thresholds defined at the time the sample was obtained.
We summarize the voice activity algorithm in Algorithm 1.
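Algorithm 1: Voice Activity Detection Algorithm
    perform packet size smoothing
    compute dynamic thresholds
    for each sample in the smoothed packet size process do
        if the smoothed packet size is greater than the adaptive threshold at that time then
            define this time unit as an ON period
        else
            define it as an OFF period
        end if
    end for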
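The three steps of the adaptive thresholding procedure can be sketched as below. This is one possible reading of the description rather than the authors' code: a sample is treated as a peak (trough) if it is the maximum (minimum) of the 5-sample window centred on it and differs from at least one other sample in that window by more than 25 bytes; samples not covered by any threshold line are treated as OFF; and where two threshold lines meet at a shared trough, the later one is used. The function names are hypothetical.

```python
def find_extremes(p, half_window=2, min_diff=25.0):
    """Step 1: collect (time, size) peaks and troughs of the smoothed process p."""
    peaks, troughs = [], []
    for i in range(half_window, len(p) - half_window):
        window = p[i - half_window:i + half_window + 1]
        if p[i] == max(window) and p[i] - min(window) > min_diff:
            peaks.append((i, p[i]))
        elif p[i] == min(window) and max(window) - p[i] > min_diff:
            troughs.append((i, p[i]))
    return peaks, troughs

def adaptive_thresholds(p, peaks, troughs):
    """Step 2: draw a threshold line between each pair of adjacent troughs."""
    thresholds = [None] * len(p)
    for (ta, sa), (tb, sb) in zip(troughs, troughs[1:]):
        between = [s for t, s in peaks if ta < t < tb]
        if not between:
            continue  # no peak between these troughs, so no threshold is defined
        sp = max(between)
        ya, yb = (sa + sp) / 2.0, (sb + sp) / 2.0
        for t in range(ta, tb + 1):
            # Linear interpolation from (ta, ya) to (tb, yb).
            thresholds[t] = ya + (yb - ya) * (t - ta) / (tb - ta)
    return thresholds

def detect_voice_activity(p):
    """Step 3: ON where the smoothed size exceeds the threshold defined at that time."""
    peaks, troughs = find_extremes(p)
    thresholds = adaptive_thresholds(p, peaks, troughs)
    return [th is not None and size > th for size, th in zip(p, thresholds)]

# Toy smoothed packet size process (bytes) containing a single talk burst.
smoothed = [50, 50, 40, 50, 90, 120, 90, 50, 40, 50, 50]
activity = detect_voice_activity(smoothed)
```

In the toy series, the troughs at time units 2 and 8 and the peak of 120 bytes give a flat threshold of 80 bytes, so only time units 4 to 6 are marked ON.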

5  Performance Evaluation

smooth_original.png
Figure 5: The packet size process and its smoothed process, along with the computed adaptive thresholds and the true and estimated speech activity.
In this section, we evaluate our proposed VAD algorithm based on the collected traces described in Section 3. We begin by explaining how the algorithm works. Fig. 5 shows a comparison of the observed packet size process and the smoothed packet size process. On the upper graph, the blue squares mark the times a speech burst is present. On the lower graph, the lines formed by black crosses are the adaptive thresholds computed by our algorithm and taken as the boundary between speech and silence periods. From the figure, we observe that packet size smoothing removes high-frequency fluctuations without affecting the correlation between packet size and speech activity. The lower graph shows the level of agreement between the estimated speech activity (red circles) and the ground truth (blue squares). The graph indicates that the extracted ON/OFF periods reflect the true voice activity.
We use three metrics to evaluate the performance of our VAD algorithm:
We believe that the detection accuracy of our method is quite good given the high amount of randomness injected into network traffic by the Skype application and network dynamics.

6  Applications of The Proposed Algorithm

In this section, we consider two network management applications that are made possible or enhanced by network-level VAD. First, we explain how we assess the conversation quality of a VoIP call by analyzing users' conversation patterns. We then demonstrate how inferred speech activity is applied to the VoIP flow identification problem.

6.1  Conversational Interactivity Assessment

The first application measures user perception of VoIP calls. QoS provisioning is an important issue in real-time interactive applications. In the past decade, numerous attempts have been made to provide better network and application quality through mechanisms like admission control, resource reservation, and traffic prioritization. Generally, networking researchers are concerned about one question: "Are users satisfied when using an application with certain network/application QoS settings?" Therefore, measuring users' satisfaction when they use network applications is an important topic [3], [13]. Measurement of user satisfaction with VoIP calls is especially in demand because it can help operators provide the required QoS level by selecting a better network path or switching to a more appropriate bit or packet rate [5].
In a VoIP call, if the transmitted voice signals are degraded due to network impairment, such as network delay or loss, user interactivity tends to be lower as the response times are longer or the sound from the other side is unclear or intermittent. Thus, the quality of a call can be determined, to some extent, by examining the conversation activity. This indicates that, by inferring speech activity from VoIP traffic with network-level VAD, we can gauge the conversation quality of a call on any network node and take appropriate action to maintain a satisfactory QoS level.
In the following, we discuss how different quality levels can be obtained for different conversation patterns with the conversation quality model proposed in [8].
qos_states.png
Figure 9: Brady's 4-state conversation model.

6.1.1  Conversation Interactivity Model

To measure the interactivity of a conversation, we employ the model based on the thermodynamic theory in [8], [17] and Brady's 4-state conversation model [4], which is illustrated in Fig. 9. As shown in the graph, a conversation between two parties, A and B, is in one of four states at any time: state A, where speaker A is active and speaker B is silent; state B, where speaker A is silent and speaker B is active; state D, where both speakers are active; and state M, where both speakers are silent (mutual silence).
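The mapping from two ON/OFF sequences to Brady's four states, and the average sojourn time per state, can be sketched as follows. The function names are hypothetical; the default time unit of 0.1 second matches the one used for the packet size process, and the input sequences are invented for illustration.

```python
from collections import defaultdict

def brady_states(a_on, b_on):
    """Label each time unit with state A, B, D, or M from the speakers' ON/OFF flags."""
    labels = {(True, False): "A", (False, True): "B",
              (True, True): "D", (False, False): "M"}
    return [labels[(bool(a), bool(b))] for a, b in zip(a_on, b_on)]

def mean_sojourn_times(states, unit=0.1):
    """Average duration (in seconds) of the consecutive runs of each state."""
    runs = defaultdict(list)
    start = 0
    for i in range(1, len(states) + 1):
        # A run ends at the end of the sequence or when the state changes.
        if i == len(states) or states[i] != states[start]:
            runs[states[start]].append((i - start) * unit)
            start = i
    return {state: sum(d) / len(d) for state, d in runs.items()}

# Hypothetical ON/OFF sequences (1 = speaking) for speakers A and B.
a_on = [1, 1, 0, 0, 1, 1]
b_on = [0, 0, 0, 1, 1, 0]
states = brady_states(a_on, b_on)
sojourn = mean_sojourn_times(states, unit=1.0)
```

Here state A occurs in two runs of lengths 2 and 1, so its average sojourn time is 1.5 time units; each of the other states occurs once for a single time unit.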
Assume $t_I$ is the average sojourn time spent in state $I$, $I \in \{A,B,M,D\}$. We define $\bar{t}_I$ as the average sojourn time in state $I$ for a norm conversation defined in [20]. The norm conversation is defined to be at a standard room temperature, $\bar{\tau} = 20$°C. The conversation temperature of state $I$ is computed as follows [8]:
$$\tau_I = \frac{\bar{\tau}}{\ln(t_I) - \ln(\bar{t}_I) + 1}, \quad I \in \{A,B,M,D\}. \tag{3}$$
For an arbitrary conversation whose "temperature" we want to infer, we compute $\hat{t}_I$ as the average sojourn time in each of the four states and compute an overall index of interactivity using a least-square fitting approach as follows:
$$\hat{\tau} = \arg\min \sum_{I} \Bigl( \bar{t}_I \cdot \exp\Bigl(\frac{\bar{\tau}}{\tau_I} - 1\Bigr) - \hat{t}_I \Bigr)^2, \quad I \in \{A,B,M,D\}, \tag{4}$$
where $\hat{\tau}$ is the final scalar interactivity temperature (in units of °C) associated with the conversation.
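To make the two formulas concrete, the sketch below evaluates the per-state temperature of Eq. (3) and approximates the least-squares fit of Eq. (4) by a simple grid search. Several points are assumptions: the minimisation is taken over a single temperature shared by all four states, the search range and step are arbitrary, and the norm sojourn times in `T_NORM` are placeholders, not the values defined in [20].

```python
import math

TAU_NORM = 20.0  # temperature of the norm conversation, in degrees Celsius

# Placeholder norm sojourn times in seconds (NOT the values from [20]).
T_NORM = {"A": 1.0, "B": 1.0, "M": 0.5, "D": 0.25}

def state_temperature(t, t_norm, tau_norm=TAU_NORM):
    """Eq. (3): temperature of one state from its average sojourn time t."""
    return tau_norm / (math.log(t) - math.log(t_norm) + 1.0)

def fit_temperature(t_hat, t_norm, tau_norm=TAU_NORM, lo=1.0, hi=100.0, step=0.01):
    """Eq. (4): grid search for the temperature minimising the squared error."""
    def squared_error(tau):
        return sum((t_norm[s] * math.exp(tau_norm / tau - 1.0) - t_hat[s]) ** 2
                   for s in t_norm)
    count = int(round((hi - lo) / step))
    candidates = [lo + k * step for k in range(count + 1)]
    return min(candidates, key=squared_error)

# A conversation whose sojourn times are half the (placeholder) norm values.
t_hat = {state: 0.5 * value for state, value in T_NORM.items()}
tau_hat = fit_temperature(t_hat, T_NORM)
```

Under these assumptions, a conversation that matches the norm sojourn times is fitted at about 20°C, while shorter sojourn times (faster turn-taking) yield a higher temperature.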

6.1.2  An Example

We select two real-life VoIP calls from the Skype traffic captured according to the procedures detailed in [5]. Fig. 10 shows the conversation patterns of the two calls extracted from the network traffic. We observe that two participants, A and B, talked to each other with high interactivity in Call 1, while B kept talking and A kept silent most of the time in Call 2. In Fig. 11, we plot the conversational temperature for both calls, computed using Eq. (4). The graph indicates that the temperature of Call 1 is higher than that of Call 2 most of the time. This supports our contention that the temperature computed by the interactivity model can capture the interactivity in a conversation. The computed interactivity index can be employed in QoS management applications to monitor the quality of VoIP calls and reallocate network resources whenever necessary.
qos_onoff.png
Figure 10: The speech activity of two calls: Call 1 is highly interactive and Call 2 is somewhat unidirectional.
qos_twoline.png
Figure 11: The conversational temperature computed for the two calls shown in Fig. 10.

6.2  VoIP Flow Identification

Traffic classification is now a hot topic in the computer networking field [12], [14], [18]. Among the various kinds of traffic, identifying VoIP traffic is one of the most highly demanded techniques in traffic management. One reason is that corporations often need to identify VoIP flows in order to block or limit VoIP calls due to institutional policies, or boost the quality of VoIP calls to improve user satisfaction. However, VoIP flows are not easily recognized because of 1) non-standard signaling protocols (and thus non-standard port numbers), 2) non-standard audio codecs, and 3) encrypted payloads. While it may be easy to identify VoIP traffic generated by certain kinds of VoIP applications, it is much more difficult to classify all kinds of VoIP traffic in general.
The human conversation patterns embedded in VoIP traffic could be a key to the VoIP traffic identification problem, as they are dissimilar to the communication patterns of other network applications. Therefore, following this intuition, we propose a simple VoIP flow identification scheme based on the voice activity inferred by our VAD algorithm.
We apply a standard machine learning framework to the flow identification problem. For each segment of a network flow, we extract ON/OFF periods from the network traffic, and derive the following features from the estimated ON/OFF periods: 1) the number of ON periods; 2) the mean length of ON periods; and 3) the standard deviation of the ON-period length. We extract additional features from the conversation pattern based on Brady's state definition, as shown in Fig. 9. For each state I ∈ {A,B,M,D}, we derive the following features: 1) the number of occurrences of state I; 2) the mean length of state I; and 3) the standard deviation of the state-I length. We expect the derived features of real VoIP flows to be similar because normal human conversations have similar patterns. On the other hand, the derived features of non-VoIP flows should differ markedly from those of VoIP flows because their ON/OFF patterns are not "human-like."
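The three ON-period features can be derived from an ON/OFF sequence as sketched below. The function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator was used), and the same routine could be applied to runs of each Brady state.

```python
import statistics

def on_period_lengths(states, unit=0.1):
    """Lengths (in seconds) of the consecutive ON runs in an ON/OFF sequence."""
    lengths, run = [], 0
    for on in states:
        if on:
            run += 1
        elif run:
            lengths.append(run * unit)
            run = 0
    if run:  # close a run that lasts until the end of the segment
        lengths.append(run * unit)
    return lengths

def on_off_features(states, unit=0.1):
    """Number, mean length, and standard deviation of the ON periods."""
    lengths = on_period_lengths(states, unit)
    if not lengths:
        return {"count": 0, "mean": 0.0, "std": 0.0}
    return {"count": len(lengths),
            "mean": sum(lengths) / len(lengths),
            "std": statistics.pstdev(lengths)}  # population std (assumption)

# Hypothetical ON/OFF sequence for one flow segment (1 = ON).
features = on_off_features([1, 1, 0, 1, 1, 1, 1, 0, 0, 1], unit=1.0)
```

For the toy sequence the ON runs have lengths 2, 4, and 1, giving three ON periods with a mean length of 7/3 time units.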
classify.png
Figure 12: The error rate of classifying VoIP flows from non-VoIP flows in different time periods.
A set of real-life Internet traffic traces is used to verify the performance of the simple VoIP flow identification scheme. We choose Skype to represent VoIP applications, and TELNET and World of Warcraft, two real-time interactive applications, to represent non-VoIP applications. The collection procedures for these traces were as follows: 1) Skype traffic was captured according to the procedures detailed in [5]. 2) TELNET traffic was captured on a gateway router for all TCP flows using port 22 (SSH) and port 23 (telnet). 3) World of Warcraft traffic was captured on a gateway router for all TCP flows with port number 3274 and either the source or destination address in the network 203.66 (where the World of Warcraft server in Taiwan resides).
We use a naive Bayesian classifier to perform a supervised classification of the flows. All the classification results reported were computed with 10-fold cross-validation. The detection accuracy for traffic segments of different lengths is summarized in Fig. 12. The blue line represents the error rate of VoIP flows, and the red line represents that of non-VoIP flows. The graph shows that VoIP flows can be identified with an error rate of 5% within 15 seconds. While this real-time VoIP flow identification scheme is simple, general, and effective, we believe that the human-behavior-based approach will play a more important role in future traffic classification research.
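For readers unfamiliar with the classifier, the sketch below shows a small naive Bayes classifier over flow features. It assumes Gaussian class-conditional densities, which the text does not specify, omits cross-validation for brevity, and uses invented feature vectors (number of ON periods, mean ON length, standard deviation of ON length) and hypothetical labels.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class log prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2.0 * math.pi * var) - (x - m) ** 2 / (2.0 * var)
            for x, m, var in zip(row, means, variances))
        return log_prior + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))

# Invented training data: [number of ON periods, mean ON length, std of ON length].
X = [[5, 1.2, 0.6], [6, 1.0, 0.5], [4, 1.5, 0.7],
     [40, 0.1, 0.05], [35, 0.2, 0.1], [50, 0.1, 0.04]]
y = ["voip", "voip", "voip", "non-voip", "non-voip", "non-voip"]
model = fit_gaussian_nb(X, y)
label = predict(model, [5, 1.3, 0.6])
```

The toy data encode the intuition stated earlier: VoIP segments contain a few multi-second ON periods, whereas other interactive traffic produces many sub-second bursts.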

7  Conclusion

In this paper, we propose the concept of network-level VAD, which infers speech activity from network traffic instead of audio signals, which are used in source-level VAD. Extracting voice activity from network traffic is more difficult because VoIP traffic can be seen as a compressed audio signal with additional randomness injected, such as redundancy, network congestion, and retransmissions. However, if network-level VAD is feasible, it will have many applications in network management.
Besides proposing network-level VAD, we make two contributions in this paper. First, we propose a VAD algorithm that can extract voice activity from encrypted and non-silence-suppressed VoIP network traffic. Taking Skype as the subject of our study, we show that inferring voice activity from Skype traffic achieves a reasonable level of detection accuracy. Second, we show that network-level VAD has some useful applications in network management because VAD can now be run on any network node. We demonstrate the effectiveness of a real-time VoIP flow identification scheme and explain how to measure the quality of a VoIP call using a conversation interactivity model based on the inferred speech activity.

References

[1] "Fine-tuning voice over packet services," http://www.protocols.com/papers/voip2.htm.
[2] IEEE Std 802.16-2004, "IEEE standard for local and metropolitan area networks part 16: Air interface for fixed broadband wireless access systems," Oct 2004.
[3] R. Beuran, M. Ivanovici, B. Dobinson, N. Davies, and P. Thompson, "Network quality of service measurement system for application requirements evaluation," in SPECTS'03. New York, NY, USA: ACM, July 2003, pp. 380-387.
[4] P. T. Brady, "A statistical analysis of on-off patterns in 16 conversations," Bell System Technical Journal, vol. 47, no. 1, Jan 1968.
[5] K.-T. Chen, C.-Y. Huang, P. Huang, and C.-L. Lei, "Quantifying Skype user satisfaction," in Proceedings of ACM SIGCOMM 2006, Pisa, Italy, Sep 2006.
[6] W. C. Feng, F. Chang, W. C. Feng, and J. Walpole, "A traffic characterization of popular on-line games," IEEE/ACM Transactions on Networking, vol. 13, no. 3, pp. 488-500, June 2005.
[7] B. Francesco, C. Salvatore, and C. Alfredo, "A robust voice activity detector for wireless communications using soft computing," IEEE Journal on Selected Areas in Communications, vol. 16, no. 9, 1998.
[8] F. Hammer, P. Reichl, and A. Raake, "The well-tempered conversation: Interactivity, delay and perceptual VoIP quality," in Proceedings IEEE ICC'05, May 2005.
[9] J. Hoyt and H. Wechsler, "Detection of human speech in structured noise," in Proceedings of ICASSP '94, vol. ii. ACM Press, 1994, pp. 237-240.
[10] J.-S. R. Jang, "Audio signal processing and recognition," http://www.cs.nthu.edu.tw/ jang.
[11] S. Jongseo and S. Wonyong, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proceedings of ICASSP '98, 1998, pp. 365-368.
[12] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," SIGCOMM Comput. Commun. Rev., vol. 35, no. 4, pp. 229-240, 2005.
[13] P. V. Marsden, "Network data and measurement," Annual Review of Sociology, vol. 16, no. 1, pp. 435-463, 1990.
[14] A. W. Moore and D. Zuev, "Internet traffic classification using Bayesian analysis techniques," in SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems. New York, NY, USA: ACM, 2005, pp. 50-60.
[15] V. Prasad, M. R., S. Vijay, H. Shankar, P. Pawelczak, and I. Niemegeers, "Voice activity detection for VoIP: an information theoretic approach," in Proceedings of GLOBECOM '06, vol. ii. ACM Press, 2006.
[16] L. Rabiner and M. Sambur, "Voiced-unvoiced-silence detection using the Itakura LPC distance measure," in Proceedings of ICASSP '77, May 1977, pp. 323-326.
[17] P. Reichl and F. Hammer, "Hot discussion or frosty dialogue? Towards a temperature metric for conversational interactivity," in ICSLP/INTERSPEECH 2004, Oct 2004.
[18] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, "Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification," in IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM, 2004, pp. 135-148.
[19] L. Sun and E. Ifeachor, "Prediction of perceived conversational speech quality and effects of playout buffer algorithms," in Proceedings of IEEE ICC '03, vol. 1, pp. 1-6, May 2003.
[20] International Telecommunication Union, "Artificial conversational speech," ITU-T Recommendation P.59, March 1993.
[21] N. B. Yoma, F. McIness, and M. Jack, "Robust speech pulse-detection using adaptive noise modeling," Electron. Lett., vol. 32, July 1996.

Footnotes:

1. http://www.voiptroubleshooter.com/open_speech/


Sheng-Wei Chen (also known as Kuan-Ta Chen)
http://www.iis.sinica.edu.tw/~swc 
Last Update September 28, 2019