Crowdsourcing Multimedia QoE Evaluation: A Trusted Framework

Chen-Chi Wu, Kuan-Ta Chen, Yu-Chun Chang, and Chin-Laung Lei


Abstract

Crowdsourcing has emerged in recent years as a potential strategy to enlist the general public to solve a wide variety of tasks. With the advent of ubiquitous Internet access, it is now feasible to ask an Internet crowd to conduct QoE (Quality of Experience) experiments on their personal computers in their own residences rather than in a laboratory. The considerable size of the Internet crowd allows researchers to crowdsource their experiments to a more diverse participant pool at a relatively low economic cost. However, as participants carry out experiments without supervision, the uncertain quality of their results is a challenging problem.
In this paper, we propose a crowdsourceable framework to quantify the QoE of multimedia content. To overcome the aforementioned quality problem, we employ a paired comparison method in our framework. The advantages of our framework are: 1) trustworthiness, due to its support for cheat detection; 2) a rating procedure that is simpler than that of the commonly-used mean opinion score (MOS) and thus places less burden on participants; 3) economic feasibility, since reliable QoE measures can be acquired with less effort than with MOS; and 4) generalizability across a variety of multimedia content. We demonstrate the effectiveness and efficiency of the proposed framework through a comparison with MOS. Moreover, the results of four case studies support our assertion that the framework can provide reliable QoE evaluation at a lower cost.
Index Terms: Crowdsourcing, Mean Opinion Score, Paired Comparison, Probabilistic Choice Model, Quality of Experience, Subjective Test.

1  Introduction

In recent years, multimedia services have become immensely popular and have been widely accessed via not only desktop and laptop computers but also mobile devices. Because of the heterogeneity of hardware capabilities and network environments, providing users with a satisfying experience no matter what platform they are using is an ultimate goal of service providers. However, measuring the quality of multimedia content efficiently and reliably has long been a real challenge. Here by "quality" we refer to Quality of Experience (QoE) [39], which reflects the degree of a user's subjective satisfaction. It should not be confused, however, with the more commonly-used Quality of Service (QoS) concept, which refers to an objective system performance metric, such as the bandwidth, delay, and loss rate of a communication network.
There have been numerous methods proposed to evaluate the QoE of multimedia content employing objective or subjective methods. Objective methods, such as PESQ (Perceptual Evaluation of Speech Quality) [37] and PEVQ (Perceptual Evaluation of Video Quality) [36], are convenient because they do not have the expenses associated with participants; however, their QoE assessment is not as factually accurate as that of subjective methods, such as the widely-used MOS rating method [34] for multimedia QoE assessments. In this method, each participant is asked to grade the quality of multimedia content on a five-point scale, which ranges from 1 (Bad) to 5 (Excellent), with the MOS score as the arithmetic mean of all the participants' scores. Although the MOS rating method has a long history of pervasive use, it has three fundamental problems:
  1. Rating scale mapping. As the five scale labels, i.e., Bad, Poor, Fair, Good, and Excellent, cannot be concretely defined and explained, participants may have difficulty mapping their perception onto the scale.
  2. Rating scale heterogeneity [61]. Participants may have dissimilar interpretations of the scales according to their idiosyncratic preferences and strategies. Therefore, they may give different ratings even if they have had similar experiences with the same stimulus.
  3. Lack of a cheat detection mechanism. Participants may not pay full attention to experiments or determine ratings cautiously. To the best of our knowledge, there is no established mechanism for verifying whether a participant gives false ratings either intentionally or carelessly. Consequently, it is hard to detect untrustworthy inputs, which would degrade the accuracy of assessment.
Apart from the above intrinsic problems, another issue that warrants consideration is the high economic cost, which is an impediment to all subjective methods. It is expensive to recruit the large number of participants required for reliable QoE assessments. In addition, because QoE experiments are traditionally conducted in laboratory settings, it is difficult to collect a large number of experiment results within a short time frame.
In this paper, we use the concept of crowdsourcing in order to take advantage of the power of the masses to achieve efficient and reliable QoE assessments. The term "crowdsourcing" is a neologistic portmanteau of "crowd" and "outsourcing" for describing the act of outsourcing tasks to an undetermined crowd of people rather than employees or contractors. The advent of ubiquitous Internet access has opened the door to Internet crowds who can be asked to conduct experiments on their personal computers, thus freeing QoE experiments from the limits of closed laboratories. Because of the considerable size of the Internet crowd, we believe that crowdsourcing allows researchers to conduct experiments with a more diverse set of participants at a lower economic cost than is possible under laboratory conditions.
Nonetheless, an obvious challenge of crowdsourcing QoE evaluations is the fact that not every Internet user is trustworthy. Due to the lack of supervision when subjects perform experiments, they may provide erroneous responses perfunctorily, carelessly, or dishonestly. Moreover, if tasks are not intuitive and simple enough, participants may also give problematic answers because they do not fully realize how or what they should do. This is a problem because erroneous feedback increases the uncertainty of the evaluation results and leads to biased conclusions. Although one may argue that we could compensate for untrustworthy inputs by conducting more experiments than necessary, this would be a valid course of action only if untrustworthy users comprise a small proportion of an experiment's participants. Further, since participants are paid wages for each experiment, there is an incentive for dishonest users to participate in as many experiments as possible by giving random answers quickly without obeying the instructions. It is therefore important to find a way to prevent or detect untrustworthy inputs to ensure reliable and high-quality evaluation results. In summary, while the MOS method is widely used, we consider it unsuitable for crowdsourcing, not only because its five-point scale is not simple enough to be mastered by an inexperienced crowd of users, but also because of its inability to detect untrustworthy results.
To address the above problems, we propose a trusted crowdsourceable framework, based on paired comparison [17], for multimedia QoE evaluations. In a paired comparison experiment, a participant is simply asked to compare two stimuli simultaneously, and decide which one has the better quality based on his perception. In this task, the dichotomous decision of perceptual quality is clearly much simpler than the five-point rating in the MOS method. There are five key features of the paired comparison method:
  1. It is generalizable across various types of multimedia content without any modification.
  2. It is simple for participants since they are only asked to make intuitive comparative judgements instead of mapping their perception on a categorical or numerical scale. The scale heterogeneity problem of MOS ratings [61] is thus eliminated.
  3. Its comparative judgement results can be analyzed by probabilistic choice models [17] to obtain QoE measures on an interval scale [69], which allows us to compile an arithmetically computable index for QoE management purposes [33,10].
  4. It is trustworthy, thanks to its support for cheat detection mechanisms, as outlined in Section IV. Our framework relies on the transitivity property, which means that if a participant prefers A to B and B to C, then he should also prefer A to C. By employing this property, we can effectively check the consistency of judgements and remove untrustworthy inputs.
  5. It is efficient and economical, since a reliable QoE assessment can be obtained with less effort compared with the MOS approach (cf. Section VI-E).
To demonstrate the efficacy and generalizability of our framework, we conducted four case studies that targeted audio and visual multimedia QoE evaluations. To compare the cost and performance of laboratory and crowdsourcing strategies, experiments were conducted by part-time employees under supervision and by anonymous Internet users. The results show that, overall, the quality of raw inputs obtained from the crowdsourced experiments was slightly lower than that derived from laboratory experiments. Even so, because of our approach's ability to detect untrustworthy inputs, we can obtain comparable evaluation results at a lower economic cost and with wider participant diversity.
There are three important contributions of our work:
  1. We propose a trusted crowdsourcing framework, which comprises paired comparison, probabilistic choice modeling, consistency checking, and cheat detection mechanisms, to quantify the QoE of multimedia content. The advantages of our framework over traditional MOS ratings are that 1) it facilitates crowdsourcing because it supports systematic verification of the participants' inputs; 2) the rating procedure is simpler than that of the MOS method; and 3) it exhibits high intra- and inter-subject reliability and provides precise QoE estimates in less time (cf. Section VI).
  2. Our crowdsourceable framework not only enables detection of untrustworthy inputs, but also makes "differentiated rewards" possible. That is, the wage paid for an experiment can be based on the quality (i.e., consistency) of a participant's inputs. This design gives participants an incentive to make judgements honestly, carefully, and seriously.
  3. To demonstrate the efficacy of our framework, we conduct four case studies involving audio and visual multimedia content. The results of the laboratory and crowdsourced experiments indicate that we can obtain comparable evaluation results at a lower economic cost and with wider participant diversity.
The remainder of this paper is organized as follows. Section II contains a review of related works. We describe the proposed framework in Section III, and introduce the cheat detection mechanism in Section IV. In Section V, we demonstrate the effectiveness of the proposed framework by applying it to the evaluation of the QoE of audio and visual content. Section VI provides a side-by-side comparison between the paired comparison and MOS approaches, and we discuss issues related to paired comparison and crowdsourcing in Section VII. Finally, in Section VIII, we present our conclusions.

2  Related Work

2.1  QoE Assessment Methods

QoE of multimedia content can be measured by subjective methods or objective methods. Subjective methods require participants to indicate their opinions in an evaluation process [34,38,12,14]. The absolute category rating (ACR) approach [38] and the degradation category rating (DCR) approach [38] are two representative examples of subjective assessment methodologies. While ACR requires subjects to grade the quality of individual multimedia content, DCR requests that subjects grade the quality difference between a pair of multimedia content. Both approaches take the average of subjects' ratings to quantify the QoE of each content item. The mean opinion score (MOS) approach [34], which has been widely used in quality assessment studies [68,60,59,30], is an instance of ACR with a five-level scale (i.e., Bad, Poor, Fair, Good, and Excellent). Paired comparison [38], which can be considered a variation of DCR, has received increased attention in recent years due to its capacity for quality assessment [48,32,45,7,65,56,71,16].
Objective methods can be divided into two categories: signal-based methods and parameter-based methods. PESQ (Perceptual Evaluation of Speech Quality) [37], an example of a signal-based method for assessing the QoE of speech content, takes the original signal and a degraded signal as inputs, and evaluates the quality of the degraded signal based on noise and audible distortion. The E-Model [35] is an example of a parameter-based method used for evaluating the QoE of VoIP conversations; it takes impairment factors such as network delay and quantization distortion as inputs. A hybrid model [66] that integrates PESQ and the E-Model has been proposed to predict VoIP QoE more accurately by leveraging the advantages of both methods.
Whereas objective methods are indeed convenient to use, subjective methods nonetheless provide factual assessments of users' experiences. No matter how sophisticated objective assessment methods may be, intrinsically they cannot capture every QoE dimension that may affect users' experiences. For example, PESQ yields inaccurate predictions when used in conjunction with factors like listening levels, echo, and sidetone [51]. Meanwhile, E-Model does not consider the variability of network delays and loss rates, and the interaction of factors, such as the interplay of network delay and listening quality [18]. Therefore, to obtain factual QoE evaluation results, subjective methods are still required, even though the cost is higher.

2.2  Paired Comparison

Paired comparison takes advantage of simple comparative judgements to prioritize a set of stimuli, where subjects' preferences for the stimuli can be quantified via probabilistic choice modeling [17,72]. This technique is used in various domains, notably decision making and psychometric testing. The Analytic Hierarchy Process (AHP) [62], a well-known application of paired comparison, uses the preference priorities extracted from paired comparison results to construct a hierarchical framework that can help individuals and enterprises make complex decisions. Paired comparison has also been employed in the ranking of universities [19], the rating of celebrities [44], and various subjective sensation measurements, such as pain [52], sound quality [15], and the taste of food [55].

2.3  Crowdsourcing

Crowdsourcing is a distributed model that assigns tasks traditionally undertaken by employees or contractors to an undefined crowd [20,31,5], achieving its goal of mass collaboration via Web 2.0 technologies. The main difference between crowdsourcing and ordinary outsourcing is that a task is carried out by an unspecified Internet crowd rather than a specified group of people. Performing psychological experiments on the Internet, for example, is one academic application of the crowdsourcing strategy. In [3], the authors thoroughly discussed the pros and cons of Internet psychological experiments based on a number of case studies and proposed some solutions to address the data validity issues of such experiments.
A number of crowdsourcing platforms have emerged in recent years. For example, InnoCentive enables organizations to utilize the intellect of the global scientific community to find innovative solutions to challenging research and development problems. Amazon's Mechanical Turk (MTurk) is probably the most popular crowdsourcing platform; it provides a marketplace for a variety of tasks, and anyone who wishes to seek help from the Internet crowd can post their task requests on the website. On the platform, tasks are known as human intelligence tasks (HITs) and can involve any kind of effort, such as participating in surveys, performing experiments, or answering certain specialized questions. Researchers have adopted MTurk to conduct user studies on tasks such as image annotation [64,70], document relevance [1], and document evaluation [43]. Because of MTurk's popularity, we also crowdsourced our QoE evaluation experiments on the platform and found that the results were satisfactory with the proposed framework. We will discuss our experiment results in Section V.

2.4  QoE Assessment Studies based on Paired Comparison and/or Crowdsourcing

In the past few years, paired comparison and crowdsourcing have received increasing attention, and a number of studies have employed either or both of these strategies in multimedia QoE assessment. We now summarize the latest research, categorized by the strategies adopted.

2.4.1  Paired Comparison

Paired comparison alone can be seen as an alternative to the commonly-used MOS approach due to its simple and intuitive judgements. In [48], Lee et al. investigated how the different layers defined in Scalable Video Coding (SVC) affected viewers' perceived quality of video clips. Their results indicate that paired comparison is a robust methodology for externalizing users' opinions even when the QoE of video clips is simultaneously affected by multi-dimensional factors. In [32], Huang et al. investigated participants' quality perceptions when they were interacting with remote parties via a tele-immersive remote conferencing platform, in which paired comparison was shown to be able to capture users' opinions about the interactivity experience. In a different context, Lan et al. [45] employed paired comparison to evaluate whether their infrastructure-less VoIP communication protocol performs better than others in terms of user-perceived voice quality.

2.4.2  Crowdsourcing

The crowdsourcing strategy alone can reduce the cost of hiring subjects and increase the efficiency (i.e., shorten the turn-around time) of QoE assessment experiments. In [30], the authors asked an Internet crowd to report their experience of watching YouTube videos on an MOS rating scale. They adopted the "gold standard" approach to detect untrustworthy inputs, in which a few "trap" questions, whose correct answers are commonly known, were randomly spread over the questionnaire for the voluntary participants. If a participant failed to provide correct answers to any of the trap questions, his inputs were considered untrustworthy and removed from the dataset. However, although the gold standard approach is considered effective, the "trap" questions still need to be manually designed for each experiment, and the approach can be defeated by automation attacks [2].
Ribeiro et al., on the other hand, adopted an outlier-detection approach in their crowdsourced QoE assessment studies for audio clips [60] and images [59]. A subject's inputs were removed if his own MOS estimates (i.e., the MOS scores estimated from his ratings) had a Pearson's correlation coefficient with the global MOS estimates lower than 0.25. However, whereas this strategy ensures that the global QoE estimates represent the opinions of the majority, it can only be applied after a significant amount of user inputs have been collected and when the ratio of malicious inputs is small.

2.4.3  Paired Comparison and Crowdsourcing

In light of the above issues, conducting QoE assessments using paired comparison in a crowdsourcing context is a reasonable strategy, as crowd participants tend to prefer simple tasks such as the dichotomous judgements required by paired comparison. The advantage of such integrated use of paired comparison and crowdsourcing has been recognized and exploited in a number of QoE assessment studies, such as 1) evaluations of users' preferences over text-to-speech (TTS) systems [7], 2) studies of users' perceptions of image quality [65,56], 3) evaluations of QoS adaptation schemes for 3D video processing [71], and 4) assessments of viewers' preferences of American Sign Language (ASL) videos [16]. Among these studies, Buchholz and Latorre [7] adopted the gold standard strategy and Sprow et al. [65] used a control group (i.e., non-crowdsourced experiments performed in a laboratory) to ensure the validity of the user inputs from the Internet crowd, while the other studies [56,71,16] did not include quality control efforts for crowdsourced tasks.
We see the inconsistent uses of data assurance mechanisms as an indication that 1) quality control of crowdsourcing tasks is needed as unreliable inputs will affect the credibility of QoE assessment results, and 2) there is a strong demand for a general, systematic input validation framework for crowdsourced QoE assessments, which is exactly the goal of this present work.

2.5  Reward and Punishment Mechanisms

In crowdsourced user studies, it is important to provide proper incentives so that participants are motivated to give high-quality, representative answers. In [29], Horton and Chilton investigated the relationship between wages and task difficulty, and proposed a model to estimate participants' reservation wage in crowdsourcing tasks. A number of reward and punishment mechanisms [57,40,46] have also been proposed to encourage users to provide quality contributions in the context of peer production systems. Lee et al. [46], for example, introduced a voting-based reward mechanism for online Q&A forums, where not only the contributor of the winning answer but also the users who voted for the answer were rewarded. Game theory has also been applied to investigate the rationality of incentives in human computation games [27,28]. While reward and punishment mechanisms are not the focus of this study, our proposed framework provides a trust index for each experiment taken by an Internet participant. The "boss" (i.e., the person who crowdsources tasks) is therefore free to decide how to distribute rewards and punishments to task workers who perform exceptionally well and poorly, respectively.

3  QoE Assessment Framework

In this section, we present a crowdsourceable framework to quantify the QoE of multimedia content. We describe the experiment designs for evaluating audio and visual multimedia content, and explain how to estimate the QoE score for the evaluated multimedia content by applying a probabilistic choice model to the paired comparison inputs.

3.1  Experiment Design

Suppose we have n algorithms for processing audio content. The algorithms can be designed for format-conversion purposes, e.g., audio encoding, or for handling impairments due to errors in storage or transmission, such as error correction and loss concealment. Using audio content as an example, we now present our experiment design for evaluating the effect of different audio processing algorithms on the QoE of a given audio recording.
First, we need to select an audio clip, which we call the "source clip," as the evaluation target. We apply the n audio processing algorithms to the source clip and generate n different versions of the clip, which we call the "test clips." Since all the test clips are processed, e.g., encoded, from the same source clip, their content is accurately synchronized. That is, with the exception of their presentation quality, every second of audio in each of the n test clips is semantically equivalent.
Next, these n test clips are fed into an evaluation platform. For an n-clip experiment, the total of $m=\binom{n}{2}$ possible pairs of test clips requires m rounds of comparison. In each of the m rounds, the system randomly picks a pair that has not appeared yet, and randomly assigns one clip in the pair to the Pressed state and the other to the Released state. Once a round starts, the participant hears one of the test clips depending on whether or not the SPACE key is pressed. The test clip associated with the Pressed state is heard while the SPACE key is pressed and held down; otherwise, the clip associated with the Released state is heard. Whenever the participant switches the playout state, say, from the Pressed state to the Released state at second t, the clip associated with the Released state takes over seamlessly and starts playing from second t onward. This design makes participants feel that the quality level of the source clip is controllable, and allows them to carefully listen to the difference in quality between the two states by switching back and forth at any time.
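To make the round setup concrete, the following minimal Python sketch (our own illustration; the function and variable names are not part of the authors' platform) enumerates the $m=\binom{n}{2}$ pairs, shuffles their presentation order, and randomly assigns each clip of a pair to the Pressed or Released state:

import itertools
import random

def schedule_rounds(n_clips):
    """Enumerate all C(n,2) pairs of test clips, shuffle the presentation
    order, and randomly map each pair onto the Pressed/Released states."""
    pairs = list(itertools.combinations(range(n_clips), 2))
    random.shuffle(pairs)                              # random round order
    rounds = []
    for a, b in pairs:
        pressed, released = random.sample([a, b], 2)   # random state assignment
        rounds.append({"pressed": pressed, "released": released})
    return rounds

# Example: 6 test clips yield C(6,2) = 15 comparison rounds.
assert len(schedule_rounds(6)) == 15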
Figure 1: The user interface for the audio QoE evaluation experiment: (a) Released state; (b) Pressed state.
Figure 2: The user interface for the visual QoE evaluation experiment: (a) Released state; (b) Pressed state.
Figure 1 shows an exemplar user interface of the evaluation platform. The large upper pane provides state indicators in two colors (blue vs. red) and glyphs (key released vs. key pressed) to indicate the current state (Released vs. Pressed). Participants are allowed unrestricted time in each round, and the test clips are played repeatedly. If the quality of two test clips differs significantly, participants should be able to tell the difference easily and make a quick decision. Sometimes the differences in quality are quite subtle, so participants may require a longer time to make a decision. Once participants are ready to make a decision, they can press the LEFT arrow key to indicate that the quality is better in the Released state or the RIGHT arrow key to indicate that the quality is better in the Pressed state. The system proceeds to the next round automatically after the participant has voted, and informs the participant that the experiment is finished after all m paired comparisons have been made. Performing experiments on the system is quite simple in that participants only need to use three keys, namely the SPACE key, the LEFT arrow key, and the RIGHT arrow key.
Our experiment design for evaluating the QoE of visual multimedia content, such as video clips, is similar to that used for audio content. We also generate n video clips from a source clip with n processing algorithms and conduct $m=\binom{n}{2}$ rounds of paired comparison for each experiment. In each round, participants need to decide (vote) which state (Released or Pressed) yields a better visual quality. An exemplar user interface for visual QoE evaluation is shown in Figure 2. Similar to the audio interface, users can watch the video clips repeatedly until they can determine the quality difference between the two states, and then vote by pressing the LEFT arrow key or RIGHT arrow key accordingly.
Since the participants are allowed to switch back and forth between two stimuli at any time, one may voice the concern that such a user interface could cause the so-called temporal masking effect, i.e., the visibility (or audibility) of one stimulus being temporarily reduced by the presence of another stimulus. According to [25,53], a visual temporal masking effect could be present only in the first 30-100 ms after a scene change, while the auditory temporal masking effect could be present only in the first 100-200 ms [8]. Such a short duration indicates that the temporal masking effect has minimal, if any, impact on quality assessment, because users tend to spend much longer, around 2 to 10 seconds according to our user study, on a stimulus after each switch. Another potential concern is that participants may overlook quality degradations of a stimulus if they switch to the other stimulus right before artifacts occur. Even though this could happen at times, we consider that it would not significantly impact the results of quality assessment, because the chance of overlooking the quality degradations of one stimulus is equal to that of the other stimulus. It is thus unlikely that participants overlook all the quality degradations of a "lucky" stimulus while not overlooking any of the quality degradations of the other, "unlucky" stimulus. If the participants watch or listen to each stimulus for a sufficient time before making judgements, the risk of overlooking quality degradations is eliminated.
Note that we do not fix the user interface component in the proposed framework, i.e., how stimuli should be presented to participants and how participants should report their judgements. We believe that the user interface presented in Figure 1 and Figure 2 is suitable for audio and video QoE assessments, and for this reason adopt it in our case studies. However, researchers are free to develop any user interface suitable to their research, e.g., watching two video clips side-by-side synchronously, while still being able to employ the proposed framework for QoE assessments without any modification.

3.2  Overall Consistency Checks

After collecting inputs from a number of participants, we can assess the overall consistency of judgements among different experiments and participants by checking the stochastic transitivity properties [67] or computing Kendall's u-coefficient [42]. The stochastic transitivity approach involves checking three variants of the stochastic transitivity (ST) property, namely weak (WST), moderate (MST), and strong (SST) stochastic transitivity [67]. Let $\hat{P}_{ij}$ be the empirical probability that the quality level $T_i$ is considered better than the quality level $T_j$. The three transitivity variants imply that if $\hat{P}_{ij} \ge 0.5$ and $\hat{P}_{jk} \ge 0.5$, then

$$\hat{P}_{ik} \ge 0.5 \quad \text{(WST)},$$
$$\hat{P}_{ik} \ge \min\{\hat{P}_{ij}, \hat{P}_{jk}\} \quad \text{(MST)},$$
$$\hat{P}_{ik} \ge \max\{\hat{P}_{ij}, \hat{P}_{jk}\} \quad \text{(SST)},$$

for all quality levels $T_i$, $T_j$, and $T_k$, where $i \ne j \ne k$ and $1 \le i,j,k \le n$. WST is the least restrictive of the three properties. Systematic violations of WST indicate that the paired comparison results from different experiments cannot be integrated into a global preference ordering. Violations of MST or SST are less severe, but they can help decide whether probabilistic choice modeling is suitable for analyzing the choice frequencies.
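As an illustration only (not the authors' code), the sketch below counts violations of the three transitivity variants given a matrix P of empirical preference probabilities, where P[i][j] is the fraction of judgements preferring T_i over T_j:

def transitivity_violations(P):
    """Count WST/MST/SST violations over all ordered triplets (i, j, k) of
    distinct quality levels, where P[i][j] = Pr(T_i judged better than T_j)."""
    n = len(P)
    counts = {"WST": 0, "MST": 0, "SST": 0}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                if P[i][j] >= 0.5 and P[j][k] >= 0.5:  # transitivity premise holds
                    if P[i][k] < 0.5:
                        counts["WST"] += 1
                    if P[i][k] < min(P[i][j], P[j][k]):
                        counts["MST"] += 1
                    if P[i][k] < max(P[i][j], P[j][k]):
                        counts["SST"] += 1
    return counts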
Kendall's u-coefficient [42] is defined as follows:

$$u = \frac{2\sum_{i \ne j} \binom{a_{ij}}{2}}{\binom{m}{2}\binom{n}{2}} - 1,$$

where $a_{ij}$ is an element of the matrix of choice frequencies (cf. Section III-C) and m denotes the number of participants. If the m participants are in complete agreement, there will be $\binom{n}{2}$ elements equal to m and $\binom{n}{2}$ elements equal to zero in the matrix of choice frequencies, so u = 1. As the number of agreements decreases, u also decreases. The minimum agreement occurs when each element equals m/2 if m is even, or (m±1)/2 if m is odd; accordingly, the minimum value of u is −1/(m−1) if m is even, and −1/m if m is odd.
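The coefficient can be computed directly from the choice-frequency matrix {a_ij} introduced in Section III-C. The following sketch is our own illustration (with m denoting the number of participants, as in the formula above) and assumes each participant judges every pair exactly once:

from math import comb

def kendall_u(a, m):
    """Kendall's coefficient of agreement u, given the choice-frequency
    matrix a (a[i][j] = number of participants preferring T_i over T_j)
    and the number of participants m."""
    n = len(a)
    sigma = sum(comb(a[i][j], 2) for i in range(n) for j in range(n) if i != j)
    return 2 * sigma / (comb(m, 2) * comb(n, 2)) - 1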

3.3  QoE Score Estimation

If the consistency of the collected inputs is confirmed, we can proceed to infer the QoE scores for the quality levels being evaluated. The n quality levels in experiments are denoted as T1, ..., Tn and the number of comparisons for the pair (Ti, Tj) is denoted as nij, where nij = nji. The results of paired comparisons can be summarized by a matrix of choice frequencies, represented as {aij}, where aij denotes the number of choices that participants prefer Ti over Tj. Note that aij + aji = nij.
Table 1: A matrix of choice frequencies for four quality levels.
T1 T2 T3 T4
T1 - a12 a13 a14
T2 a21 - a23 a24
T3 a31 a32 - a34
T4 a41 a42 a43 -
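For illustration (our own sketch, not the authors' code), the choice-frequency matrix can be accumulated directly from the individual comparative judgements:

def choice_frequency_matrix(judgements, n):
    """Build the matrix {a_ij} from a list of judgements, where each judgement
    is a pair (winner, loser) of quality-level indices in 0..n-1.
    a[i][j] counts how often T_i was preferred over T_j."""
    a = [[0] * n for _ in range(n)]
    for winner, loser in judgements:
        a[winner][loser] += 1
    return a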
By applying a probabilistic choice model [17] to the paired comparison results, we can extract an interval-scale score for each quality level. One of the most widely-used models for this purpose is the Bradley-Terry-Luce (BTL) model [6,50], which predicts $P_{ij}$, the probability of choosing $T_i$ over $T_j$, as a function associated with the "true" ratings of the two quality levels:

$$P_{ij} = \frac{\pi(T_i)}{\pi(T_i) + \pi(T_j)} = \frac{e^{u(T_i)-u(T_j)}}{1 + e^{u(T_i)-u(T_j)}}, \qquad (1)$$

where $\pi(T_i)$ represents the relative preference probability for $T_i$, which can be obtained by maximum likelihood estimation, and $u(T_i) = \log \pi(T_i)$ is the estimated QoE score of the quality level $T_i$. Note that $\pi(T_i) \ge 0$, $i = 1, \ldots, n$, and $\sum_{i=1}^{n} \pi(T_i) = 1$. We treat $u(T_i)$ rather than $\pi(T_i)$ as the QoE score because it has the properties required by interval-scale metrics, whereas $\pi(T_i)$ does not. For example, if the probability of choosing $T_i$ over $T_j$ is equal to that of choosing $T_j$ over $T_k$, that is, $P_{ij} = P_{jk}$, then the difference in QoE scores between $T_i$ and $T_j$ is equal to that between $T_j$ and $T_k$, i.e., $u(T_i) - u(T_j) = u(T_j) - u(T_k)$.
To evaluate the BTL model's goodness of fit with the choice frequencies, we compare the likelihood L of the given model and the likelihood $L_U$ of the unrestricted model, which fits the frequencies perfectly. The test statistic $-2\log(L/L_U)$ is approximately $\chi^2$-distributed with n−1 degrees of freedom.
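As a minimal sketch of how the scores can be obtained in practice (our own illustration using SciPy, not the authors' implementation), the BTL scores u(T_i) can be estimated by maximizing the log-likelihood of the observed choice frequencies, and the goodness of fit can be tested with the likelihood-ratio statistic described above:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def fit_btl(a):
    """Fit the BTL model to a choice-frequency matrix a
    (a[i][j] = number of times T_i was preferred over T_j).
    Returns the estimated QoE scores u(T_i) and the maximized log-likelihood."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]

    def neg_log_lik(u_free):
        u = np.concatenate(([0.0], u_free))    # fix u(T_1) = 0 for identifiability
        ll = 0.0
        for i in range(n):
            for j in range(n):
                if i != j and a[i, j] > 0:
                    p_ij = 1.0 / (1.0 + np.exp(u[j] - u[i]))   # Eq. (1)
                    ll += a[i, j] * np.log(p_ij)
        return -ll

    res = minimize(neg_log_lik, np.zeros(n - 1), method="BFGS")
    return np.concatenate(([0.0], res.x)), -res.fun

def btl_goodness_of_fit(a, log_lik_btl):
    """Likelihood-ratio test against the unrestricted model, which
    reproduces the observed choice frequencies exactly."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    ll_u = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            n_ij = a[i, j] + a[j, i]
            for x in (a[i, j], a[j, i]):
                if x > 0:
                    ll_u += x * np.log(x / n_ij)
    stat = -2 * (log_lik_btl - ll_u)
    return stat, chi2.sf(stat, df=n - 1)       # degrees of freedom follow the paper

The returned scores are defined only up to an additive constant (the sketch anchors u(T_1) = 0); this is harmless on an interval scale and disappears after the normalization discussed next.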

Model Interpretation

The computed u(Ti) for the quality level Ti from the fitted BTL model conforms to the relationship in Eq. 1. It must be negative since u(Ti) = logπ(Ti) and π(Ti) is a positive real number smaller than 1. To extract interpretable QoE scores, we can normalize all the QoE scores between 0 and 1. By so doing, the quality level with the highest QoE always has a score of 1, and that with the lowest QoE always has a score of 0. Thus, it is more reasonable to include a "perfect," or at least "near-perfect," quality level in the experiment if this normalization approach is adopted. The rationale is that this allows us to compare the QoE scores of different quality levels, assuming that the perfect scheme achieves a QoE score of 1 and the worst scheme achieves a score of 0.
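A minimal sketch of this normalization step (our own illustration) simply rescales the fitted scores so that the best quality level maps to 1 and the worst to 0:

def normalize_scores(u):
    """Linearly rescale QoE scores so that max -> 1 and min -> 0
    (assumes at least two distinct scores)."""
    lo, hi = min(u), max(u)
    return [(x - lo) / (hi - lo) for x in u]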

4  Cheat Detection Mechanism

Since our framework aims to facilitate the crowdsourcing strategy for multimedia QoE assessment, an inevitable issue involves the possibility that participants may provide erroneous inputs which would cause inaccuracy and bias in the estimated QoE scores. As such, a cheat detection mechanism is required to ensure reliable QoE assessment. In this section, we introduce a metric for quantifying the input consistency of individual participants, describe the derivation of a threshold for detecting untrustworthy inputs, and conclude this section with a discussion on rewards and punishments.

4.1  Transitivity Satisfaction Rate

In an experiment with n stimuli, we collect $m=\binom{n}{2}$ paired comparison inputs. A participant's preferences in the m rounds of comparison are expected to be transitive: if he prefers A to B and B to C, then he should also prefer A to C. Based on this transitivity property, we define a metric called the Transitivity Satisfaction Rate (TSR) to quantify the consistency of a participant's judgements within an experiment. The TSR is computed as the number of triplets satisfying the transitivity property divided by the number of triplets to which the transitivity rule may apply; thus, the value of the TSR must be between 0 and 1. The algorithm for computing the TSR is shown in Algorithm 1. The TSR is 1 if and only if a participant's judgements are consistent throughout all the rounds in an experiment.
Algorithm 1: TSR Calculation
Input: an n × n matrix m, where m[i, j] = 1 indicates that i is considered better than j; otherwise m[i, j] = 0.
  n_test ← 0; n_pass ← 0
  for every ordered triplet (i, j, k) of distinct stimuli with m[i, j] = 1 and m[j, k] = 1 do
      n_test ← n_test + 1
      if m[i, k] = 1 then n_pass ← n_pass + 1
  TSR ← n_pass / n_test
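The following Python sketch is our own rendering of Algorithm 1 (not the authors' code); it iterates over all ordered triplets of distinct stimuli and counts how often the transitivity rule applies and is satisfied:

from itertools import permutations

def tsr(m):
    """Transitivity Satisfaction Rate for a single experiment.
    m is an n x n 0/1 matrix with m[i][j] = 1 iff stimulus i was judged
    better than stimulus j."""
    n = len(m)
    n_test = n_pass = 0
    for i, j, k in permutations(range(n), 3):
        if m[i][j] == 1 and m[j][k] == 1:      # the transitivity rule applies
            n_test += 1
            if m[i][k] == 1:                   # ... and is satisfied
                n_pass += 1
    return n_pass / n_test if n_test else 1.0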

4.2  Trust Thresholding

Having defined the TSR metric for quantifying the consistency of a participant's inputs in an experiment, we proceed to derive a TSR threshold to determine whether a participant's inputs in a particular experiment are trustworthy. We call the TSR value that best discriminates between trustworthy and untrustworthy experiments the "trust threshold." In the following, we describe how we derive the trust threshold using an empirical approach.

4.2.1  Why Use Real Traces?

To derive the trust threshold, we need to obtain a set of trustworthy inputs and a set of untrustworthy inputs, and then search for the TSR value that best discriminates between the two types of user inputs. However, obtaining such inputs is challenging because it is difficult, if not impossible, to define what trustworthy and untrustworthy inputs look like. Trustworthy inputs are not perfectly correct inputs, as people make mistakes even when they pay full attention to the QoE experiments. Also, not all the inputs from a "trustworthy" participant are necessarily trustworthy; even if a participant is carefully making judgements, his decisions can still be erroneous due to distractions or limits in recognition capability. For the above reasons, we consider that real traces best fit our purpose, as they contain real trustworthy and untrustworthy inputs, so the inferred trust threshold would be the most realistic.

4.2.2  Principles for Detecting Suspicious User Inputs

We use the user inputs from our case studies (presented in Section V) to serve as the real trace. Our next step is to distinguish trustworthy and untrustworthy inputs in the trace. For this purpose, we identify suspicious inputs by devising two heuristic rules: the distinct pairs heuristic and the decision time heuristic.
  2.A) The distinct pairs heuristic. The authors of [26] indicated that in a paired comparison, the greater the difference between two stimuli, the less likely they are to be judged incorrectly. Based on this intuition, we believe that distinct pairs, i.e., stimuli pairs with a significant quality difference, tend to receive consistent judgements across participants if the participants are competent and careful. On the other hand, dishonest or indiscreet participants who choose without careful examination or make random choices would not give judgements that are consistent with others'. Therefore, if the comparative judgements on distinct pairs in an experiment are not consistent with those from other experiments, we deem the inputs from the experiment suspicious. In our case, we use the top five distinct pairs (out of 15 possible pairs) and consider an experiment to be problematic if any of the distinct pairs is judged inconsistently with the consensus, i.e., the QoE scores estimated based on all user inputs using the BTL model (Section III-C).
  2.B) The decision time heuristic. One problem with the distinct pairs heuristic is that it cannot handle problematic inputs that comprise correct judgements for pairs that are easier to distinguish (i.e., distinct pairs) and incorrect judgements for the remaining pairs. Such inputs may be given by incompetent participants who cannot recognize subtle differences in content quality, or by dishonest participants who make the easy judgements faithfully but cheat on difficult judgements by giving a random answer. For this reason, we devise the decision time heuristic. The intuition behind the heuristic is that the decision time of comparative judgements would be shorter in problematic experiments, as dishonest and careless participants tend not to spend the time needed to make careful judgements. In our case, we deem an experiment problematic if the decision time of its last two judgements (out of 15) is shorter than a certain threshold.
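For illustration, a sketch of the two heuristics is given below (our own reading of the rules; the data structures are hypothetical, and whether the decision-time rule requires both or only one of the last two judgements to be fast is our assumption):

def flag_by_distinct_pairs(judgements, distinct_pairs, consensus_scores):
    """Distinct pairs heuristic: flag an experiment whose judgement on any of
    the pre-selected distinct pairs contradicts the consensus ordering.
    judgements: dict mapping (i, j) -> 1 if i was preferred over j, else 0
    distinct_pairs: list of (i, j) pairs with a large quality difference
    consensus_scores: dict mapping quality level -> BTL QoE score"""
    for i, j in distinct_pairs:
        prefers_i = judgements[(i, j)] == 1
        if prefers_i != (consensus_scores[i] > consensus_scores[j]):
            return True
    return False

def flag_by_decision_time(decision_times, threshold_seconds):
    """Decision time heuristic: flag an experiment whose last two judgements
    were both made faster than the threshold (assumed interpretation)."""
    return all(t < threshold_seconds for t in decision_times[-2:])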

4.2.3  Suspicious User Inputs in Real Trace

Figure 3: The cumulative distribution functions of TSRs from plausible and problematic experiments (classified using the distinct pairs heuristic).
The user inputs from our case studies came from 388 participants in 1,094 experiments. Since the results from the four case studies are similar, we only present the results of the MP3 bitrate case study (Section  V-A1) and the video codec study (Section  V-B1) here, though all the case studies are included in the derivation for the trust threshold. For the sake of brevity, we denote the former study as the mp3 study (262 experiments taken by 124 participants) and the latter as the vcodec study (300 experiments taken by 141 participants). In both studies, 6 stimuli are compared with each other, thus yielding a total of 15 paired comparisons in each experiment.
3.A) Applying the distinct pairs heuristic. By using the distinct pairs heuristic, we detected that 82 out of 262 experiments in the mp3 study and 71 out of 300 experiments in the vcodec study are problematic. We plot the cumulative distribution function (CDF) of the TSRs of both plausible and problematic experiments in Figure 3, which shows that the TSRs of the two categories are clearly different. The Wilcoxon rank-sum test suggests that the TSRs of plausible and problematic experiments are significantly different (p-value smaller than 0.001), which indicates the effectiveness of the heuristic in detecting suspicious inputs.
Figure 4: The median decision time of the ith judgement in the plausible and problematic experiments, respectively.
Figure 5: The cumulative distribution functions of decision time of the first two and the last two judgements in our experiments.
3.B) Applying the decision time heuristic. We begin by checking whether any pattern exists in the judgement decision times. Figure 4 shows the median decision time spent on the ith judgement in the mp3 and vcodec studies, respectively. We can make two observations from the graph. The first is that the decision time of comparative judgements tends to decline over time, which echoes the earlier finding reported in [54]. This phenomenon should be due to the participants becoming more adept at quality comparison, and thus requiring less time to tell the subtle difference in each pair of stimuli. The second observation is that the decision time of the last few rounds in the problematic experiments is generally shorter than that in the plausible experiments. While the time for the former tends to be around 2-3 seconds, for the latter it is on average 5-10 seconds.
To quantify the difference between plausible and problematic experiments in terms of judgement decision time, we plot the cumulative distribution functions of the decision time for the first two and the last two rounds in Figure 5. According to the Wilcoxon rank-sum test, the decision times of plausible and problematic experiments in the first two rounds are not significantly different at the 0.01 level for both case studies. On the other hand, the decision times of the two types of experiments in the last two rounds are significantly different, with a p-value smaller than 0.001 in both studies. This confirms our intuition that the judgement time in the late rounds is an indicator of whether the participants are serious about an experiment. As a consequence of this confirmatory finding, we adopt a binary classification approach to detect suspicious inputs based on the decision time of the last two comparisons in an experiment. The classifier simply searches for a decision time threshold that yields the largest TPR−FPR, where TPR stands for the true positive rate and FPR for the false positive rate. The decision time thresholds we obtained are 4.6 seconds and 5.2 seconds for the mp3 and vcodec studies, respectively.
Note that even though we use the decision time heuristic to detect suspicious inputs for the purpose of trust threshold derivation, we still do not consider these heuristics to be reliable enough to be used directly in cheat detection. The reason for this is that we are aware that dishonest participants can deliberately lengthen the decision time without paying attention to the content quality if they know of this rule. In contrast, the TSR, i.e., the judgement consistency, is robust to countermeasures (cf. Section  IV-C). Therefore, the heuristic is only used in deriving the trust threshold for TSR.
Figure 6: The histograms of TSRs from the trustworthy and untrustworthy inputs (classified using both distinct pairs and decision time heuristics).

4.2.4  Trust Threshold Derivation

Figure 7: Searching for the trust threshold that best discriminates between trustworthy and untrustworthy inputs.
We partition each trace into trustworthy and untrustworthy experiments based on both of the previously-mentioned heuristics, such that an experiment is considered untrustworthy if it is detected by either heuristic rule. Figure 6 shows the histogram of TSRs for the inputs from trustworthy and untrustworthy experiments in the four case studies. We can observe that the histogram is clearly bimodal: the TSRs of trustworthy inputs are mostly greater than 0.7, while those of untrustworthy inputs concentrate around 0.4-0.5.
We search for the best discrimination threshold by maximizing TPR−FPR over all possible TSR thresholds. Figure 7 plots TPR−FPR against the TSR threshold, together with the corresponding ROC curve. From the left graph we can see that the most discriminative TSR threshold is 0.75, which yields a TPR of 97% and an FPR of 14%. We therefore set the trust threshold to 0.75. In other words, we consider an experiment trustworthy if its TSR is higher than 0.75; otherwise, it is considered untrustworthy and is discarded in the QoE score estimation.
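A sketch of this threshold search follows (our own illustration; we treat an untrustworthy experiment as a "positive" and flag experiments with a TSR at or below the candidate threshold, which is our assumed convention for the reported TPR and FPR):

def best_tsr_threshold(tsr_trustworthy, tsr_untrustworthy):
    """Return the TSR threshold that maximizes TPR - FPR, where a positive
    is an untrustworthy experiment flagged by TSR <= threshold."""
    candidates = sorted(set(tsr_trustworthy) | set(tsr_untrustworthy))
    best_thr, best_gap = None, -1.0
    for thr in candidates:
        tpr = sum(t <= thr for t in tsr_untrustworthy) / len(tsr_untrustworthy)
        fpr = sum(t <= thr for t in tsr_trustworthy) / len(tsr_trustworthy)
        if tpr - fpr > best_gap:
            best_thr, best_gap = thr, tpr - fpr
    return best_thr, best_gap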

4.3  Reward and Punishment

For a crowdsourced experiment, we suggest announcing the logic of the TSR and the punishment rules for participants who consistently produce TSR scores lower than the trust threshold. More specifically, a participant is only paid a reward if the TSR score of his experiment is higher than 0.75. We enforced such a policy in our case studies and did not receive any complaints about the rule. We maintain that certain rules for rewards and/or punishments are required to ensure that resources are not wasted on untrustworthy inputs. Such rules also maintain the quality of the QoE assessment, as untrustworthy inputs are excluded at the outset.
We believe that there is no systematic way for participants to cheat our system by submitting "trick" answers to achieve high TSR scores. First, the presentation order of each pair and the order within each pair (i.e., which clip corresponds to Pressed state or Released state) are totally random in each experiment, and the information about the ordering is not available outside the system. Therefore, the only way for participants to achieve a high TSR is to pay attention to the difference in the quality between the states and make judgements that are as consistent as possible. Second, although a malicious participant can still achieve a high TSR by making consistently "wrong" judgements (i.e., by always claiming that the state with the lower quality is the better one), such extreme cases can be detected easily by applying the distinct pairs heuristic rule.

5  Framework Evaluation

In this section, we present four case studies based on our framework for audio and visual QoE evaluations. The experiments were conducted by participants from three sources: laboratory, MTurk, and community.
In the following, we first present the experiment setup of the four case studies and their evaluation results, which were inferred from the combined inputs from the three sources. We then compare the performance of the laboratory and crowdsourced experiments from the perspectives of data quality, participant diversity, and monetary cost.

5.1  Audio QoE Evaluation

5.1.1  Effect of MP3 Compression Level

Figure 8: QoE scores of MP3-compressed songs at different bitrates.
In this case study, we investigated the QoE of MP3-compressed audio clips with different compression levels. We selected an English song named "Garden of Graves" as the source clip. To obtain test clips, we converted the song into MP3 constant-bit-rate format with six bitrate levels, namely, 32, 48, 64, 80, 96, and 128 Kbps. Consequently, we had six test clips with different quality levels and required $\binom{6}{2}=15$ paired comparisons in each experiment.
There were a total of 124 participants who performed 262 experiments. For each paired comparison, the participant used the interface shown in Figure 1 to indicate which quality level yielded a better listening experience. Based on our cheat detection mechanism (cf. Section IV), 128 experiments achieved TSRs lower than the trust threshold of 0.75 and were therefore excluded from the QoE estimation. Using the BTL model described in Section III-C, we estimated the QoE scores of the six compression levels and plotted them in Figure 8, where the vertical bars denote the 95% confidence intervals of the scores. It can be observed that a higher bitrate consistently leads to a higher QoE score. The law of diminishing marginal utility is also in evidence: the marginal gain in perceived audio quality declines as the compression bitrate increases. This phenomenon is often seen in the relationship between a system's quality and users' perception because there must be an upper limit beyond which increasing the system quality will not enhance the users' experience any further.

5.1.2  Effect of Packet Loss on VoIP QoE

Figure 9: QoE scores of VoIP speeches encoded by different codecs at different packet loss rates.
This case study investigates the effect of the packet loss rate on VoIP speech quality. The source clip was a three-minute speech recording made by concatenating uncompressed speech segments from the Open Speech Repository. We compressed the clip into voice packets with two speech codecs, G722.1 and G728. Then, we simulated packet loss events in a Gilbert-Elliott channel [24,22], where the loss rates were set at 0%, 4%, and 8%. The combination of two speech codecs and three loss rates yielded six test clips.
A total of 66 participants performed 135 experiments, among which 35 experiments were excluded based on the trust threshold. Figure 9 shows the estimated QoE scores for the six test clips. From the graph, it is evident that a higher packet loss rate leads to a lower QoE score. While G722.1 and G728 achieve similar QoE scores when the loss rate is zero, their robustness to packet loss is significantly different. Specifically, the QoE of G722.1 at the 8% loss rate is much better than that of G728 at the 4% loss rate. The result conforms to our expectation because G722.1 operates at 32 Kbps, while G728 operates at 16 Kbps. As a result, G722.1 can use a higher encoding bitrate and a higher sampling rate than G728, and thus is more capable of maintaining speech information under a lossy situation.

5.2  Visual QoE Evaluation

5.2.1  Comparison of Video Codecs

In video compression, the codec and the compression level both play important roles in users' perceptions of video quality. Here, we assess the impact of codecs and compression levels on the QoE of video clips. From the database provided by the Video Quality Experts Group, we selected two 12-second raw video clips: the fast-motion "Cheerleaders," which contains a cheerleader dance performance, and the slow-motion "Mobile Calendar," which consists of a wall calendar, a toy train, and a ball, all moving slowly. We compressed both source clips with three codecs, H.264, WMV3, and XVID, at two bitrates, 400 Kbps and 800 Kbps. In other words, for each source clip, we obtained six test clips with different codec-and-bitrate combinations.
Figure 10: QoE scores of video clips compressed by different codecs at different bitrates.
A total of 141 participants, both part-time employees and Internet volunteers, performed 300 experiments, of which 27% were excluded based on the cheat detection mechanism. Figure 10 shows the QoE scores of each test clip for "Cheerleaders" and "Mobile Calendar." Generally, the quality of the 800-Kbps clips is much better than that of the 400-Kbps clips because more information is encoded. For the "Cheerleaders" video, we find that H.264 performs better than WMV3 and XVID. While WMV3 is better than XVID at 400 Kbps, the quality of these two codecs is comparable at 800 Kbps. Interestingly, for the "Mobile Calendar" video, WMV3 performs significantly better than XVID at the same bitrate, which indicates that WMV3 is generally better than XVID. Also for the "Mobile Calendar" video, H.264 at 400 Kbps performs even better than WMV3 and XVID at 800 Kbps. This surprising result indicates that, by using H.264, we can compress slow-motion videos at a low bitrate and obtain better perceptual quality than with WMV3 or XVID at a high bitrate. The study demonstrates that H.264 significantly outperforms the other two codecs at the same compression level, while XVID yields the worst overall ratings.

5.2.2  Comparison of Loss Concealment Schemes

In the design of a high-quality IPTV system, one of the most challenging issues is how to deal with video packet loss caused by network loss or excessive variations in network delay. A large number of loss concealment schemes have been proposed, ranging from the intuitive frame copy method to more sophisticated error resilient coding approaches. In this case study, we evaluated two loss concealment schemes, frame copy (FC) and frame copy with frame skip (FCFS) [68], under different degrees of packet loss. The FC scheme conceals errors in a video frame by replacing a corrupted block with the block in the corresponding position in the previous frame. In contrast, FCFS is a hybrid scheme that integrates frame copy with the frame skip technique, which simply drops a frame that is corrupted due to packet loss. In our implementation of FCFS, if the percentage of corrupted slices in a frame exceeds 10%, the frame is dropped; otherwise, the errors in the frame are concealed by the frame copy method. We continued to use the "Cheerleaders" and "Mobile Calendar" video clips in this study. We compressed both clips using the JM H.264/AVC reference software, and simulated packet loss at rates of 1%, 5%, and 8% to obtain degraded test clips. During the decoding process, we applied FC and FCFS for loss concealment. Since there are three packet loss rates and two loss concealment schemes, we obtained six test clips.
Figure 11: QoE scores of video clips decoded with different loss concealment schemes at different packet loss rates.
Overall, a total of 173 participants performed 397 experiments, of which 54% were considered untrustworthy. Figure 11, which shows the QoE score of each test clip for "Cheerleaders" and "Mobile Calendar," indicates that FCFS performs slightly better than FC on "Cheerleaders" when the loss rate is moderate ( ≤ 5%). A possible explanation is that FCFS skips seriously corrupted frames, so the participants perceive better spatial quality. However, when the loss rate is high (8%), a large number of frames are seriously corrupted and thus dropped by FCFS, making the QoE of FCFS inferior to that of FC. Interestingly, the situation is reversed in the case of the "Mobile Calendar" clip: FCFS outperforms FC at moderate to high loss rates ( ≥ 5%). We believe that this is because the dropping of frames in a slow-motion video does not lead to significant freezing effects. On the other hand, FC provides better QoE at the 1% loss rate. This is reasonable given that, when the damage caused by packet loss is small in a slow-motion video, FC can easily repair most of the corrupted blocks. This case study demonstrates that the effect of loss concealment largely depends on the joint characteristics of the target video clips and the network conditions.

5.3  Cost and Performance Analysis

Table 2: A comparison of laboratory and crowdsourced experiments in terms of their cost and performance.
Case Study        Source      Total Cost  # Rounds  # Persons  Qualified Rate  Cost/Round (cent)  Avg. TSR  WST Viol.  MST Viol.  SST Viol.  Kendall's u
MP3 bitrate       Laboratory  $46.47      1,200     13         59%             6.59               0.95      0          0          0.00       0.62
MP3 bitrate       MTurk       $4.20       1,035     26         41%             1.00               0.95      0          0          0.00       0.61
MP3 bitrate       Community   $0.62       1,695     85         52%             0.07               0.96      0          0          0.25       0.60
VoIP packet loss  Laboratory  $30.45      990       10         67%             4.61               0.98      0          0          0.05       0.78
VoIP packet loss  MTurk       $2.85       390       17         73%             1.00               0.98      0          0          0.10       0.78
VoIP packet loss  Community   $0.39       645       39         86%             0.07               0.98      0          0          0.15       0.80
Video codec       Laboratory  $28.19      1,860     10         80%             1.90               0.98      0          0          0.15       0.57
Video codec       MTurk       $4.80       750       28         64%             1.00               0.98      0          0          0.35       0.57
Video codec       Community   $0.93       1,890     103        71%             0.07               0.97      0          0          0.30       0.57
Loss concealment  Laboratory  $66.59      1,800     13         69%             5.35               0.96      0          0          0.30       0.60
Loss concealment  MTurk       $5.70       1,620     38         35%             1.00               0.97      0          0          0.20       0.60
Loss concealment  Community   $0.62       2,535     122        35%             0.07               0.96      0          0          0.30       0.58
Overall           All         $191.80     16,410    388        58%             2.02               0.97      0          0          0.18       0.64
As our case studies were conducted using both the laboratory and crowdsourcing strategies, we are able to inspect how much cost the crowdsourcing strategy saved and evaluate its performance. In this subsection, we present an analysis comparing both approaches in terms of economic cost, outcome quality, and participant diversity.
Economic cost: In total, we spent US$191.80 on 1,094 experiments, which were performed by 388 participants and involved 16,410 rounds of paired comparison. The cost and performance of all the participant sources in the case studies are summarized in Table 2. The laboratory experiments accounted for 89% of the total monetary cost. Since the number of experiments performed by participants from each source was different, the economic cost of each source is compared in terms of wage per round, as shown in Table 2. The cost per round was not constant in the laboratory experiments because the part-time employees were paid an hourly rate but the number of experiments they performed varied. On average, the cost per round of laboratory experiments was 4.6 cents; for the crowdsourced MTurk and community experiments it was 1 and 0.07 cents respectively, which yields respective ratios with the laboratory experiments of 4.6:1 and 66:1.
Figure 12: The cumulative distribution functions of TSRs from the laboratory, MTurk, and community experiments.
Outcome quality: Figure 12 plots the cumulative distribution functions of TSRs for the experiments from the three participant sources. Nearly 70% of the laboratory experiments were qualified, i.e., had a TSR higher than the trust threshold of 0.75 (cf. Section IV), while only 46% and 54% of the MTurk and community experiments, respectively, were qualified. Although the proportion of suspicious inputs is larger with the crowdsourcing strategy, we can remove such inputs to ensure the accuracy of the estimated QoE scores. In addition, it is common practice in crowdsourcing tasks to withhold payment when a worker's output does not meet a certain quality standard, so we can assume that no wage needed to be paid for experiments with TSRs lower than the trust threshold. In other words, our framework makes crowdsourced QoE assessment studies immune to dishonest participants, because such participants affect neither the quality of the QoE assessment nor the economic cost.
We define the Qualified Rate as the ratio of experiments that yield a TSR higher than the trust threshold. As listed in Table 2, the laboratory experiments achieved the highest qualified rates in all cases except the VoIP case study. Moreover, in the study of loss concealment schemes, the qualified rate of the laboratory experiments was as high as 69%, compared to approximately 35% for both crowdsourcing sources. The low rate of 35% suggests that it is hard to differentiate the quality of video clips produced by different loss concealment schemes. We attribute the superiority of the laboratory experiments in this case mainly to the higher proficiency of the participants: on average, the laboratory and crowdsourcing participants performed 138 and 26 comparisons, respectively, in this study, so the former had more opportunities to learn to distinguish the subtle quality differences between the video clips. After removing unqualified experiments, we computed the average TSRs of the experiments from the three sources and found that they were all above 0.95.
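A minimal sketch of this screening step follows, assuming a simple list-of-records layout for the experiment log; the field names and TSR values are illustrative, and the TSR computation itself is the one described in Section IV.

```python
# Keep only experiments whose TSR exceeds the trust threshold and report
# the Qualified Rate per participant source.  The data layout is illustrative.
TRUST_THRESHOLD = 0.75


def qualified_rate(experiments):
    qualified = [e for e in experiments if e["tsr"] > TRUST_THRESHOLD]
    return len(qualified) / len(experiments)


experiments = [
    {"source": "Laboratory", "tsr": 0.96},
    {"source": "MTurk",      "tsr": 0.58},  # rejected (and, in practice, unpaid)
    {"source": "Community",  "tsr": 0.81},
]

for source in ("Laboratory", "MTurk", "Community"):
    subset = [e for e in experiments if e["source"] == source]
    print(source, qualified_rate(subset))
```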
We also checked the overall consistency (cf. Section  III-B) of paired comparison results from three participant sources. The stochastic transitivity checks reveal that no WST and MST violations occurred in any of the datasets. SST violations were observed in all case studies, but the numbers of violations were moderate. It is worth noting that the laboratory experiments had the fewest SST violations. This is reasonable because stochastic transitivity checks assess the consistency of judgements among different participants and the laboratory experiments involved the fewest participants. In all the case studies, the Kendall's u-coefficients were higher than 0.5, which indicates that the judgements provided by all three sources were reasonably consistent.
Participant diversity: As in many other user studies, participant diversity is crucial to QoE assessment. Since the purpose of such studies is to understand people's perception of certain multimedia content, a more diverse set of experiment participants enables us to collect a broader range of opinions. From this perspective, crowdsourcing is clearly a more appropriate and efficient strategy for QoE assessment because it substantially diversifies the participant pool. Quantitatively, the crowdsourced experiments accounted for only 11% of the total cost, yet they contributed 372 of the 388 participants (96%) in our case studies.

6  Paired Comparison and MOS: A Side-by-Side Comparison

Thus far, we have presented a framework that utilizes paired comparison in a crowdsourcing context to quantify the QoE of multimedia content. Our case studies indicate that the framework yields reliable QoE estimates at a lower cost than traditional in-laboratory MOS experiments. Nevertheless, as some researchers adopt the MOS methodology and others adopt paired comparison for the same purpose, namely QoE estimation, one may wonder: are the QoE estimates from the two methodologies compatible and mutually consistent? It would be meaningless to compare QoE estimates from different methodologies if the methodologies do not generate comparable results.
In this section, motivated by these comparability concerns, we present a comparative analysis of the two widely-used QoE assessment methodologies, the mean opinion score (MOS) and the paired comparison (PC). Although we have demonstrated the qualitative advantages of PC over MOS, most notably in its capability to validate a single participant's inputs (Section IV), the compatibility of their results and a comparison of their quantitative characteristics, such as reliability and efficiency, have not yet been discussed. In the following, we first investigate the rating consistency between PC and MOS, and then compare the two methodologies in terms of three important properties, namely, their intra-subject reliability, inter-subject reliability, and how fast the QoE estimates converge.

6.1  Experiment Design

To facilitate a fair comparison between PC and MOS, we re-executed the vcodec study (cf. Section V-B1) solely in the laboratory with another group of 13 subjects, who participated in 45 experiments in total using both the PC and MOS methodologies.
The setup of the PC experiment was identical to that described in Section III-A. The MOS experiment started with a preview phase, in which participants previewed, one by one, the test clips that would be rated later; the purpose of this phase was to give participants an overall impression of the quality of the test clips. In the rating phase, which immediately followed the preview phase, participants watched each test clip and rated its quality on the MOS scale, which consists of the five levels Bad (1), Poor (2), Fair (3), Good (4), and Excellent (5). The presentation order of the test clips was independently randomized in both the preview and rating phases. The order of the PC and MOS experiments was also randomized for each participant, with a five-minute break in between, to prevent any systematic bias.

6.2  Consistency between PC and MOS

Figure 13: A comparison of QoE scores estimated using PC and MOS methodologies.
We now investigate whether the QoE scores estimated from the PC experiments are consistent with the MOS scores obtained by averaging the opinion scores (1-5) from the MOS experiments. For brevity, we denote the QoE scores estimated using the BTL model based on comparative judgements as "PC scores." Since the two sets of scores are not on the same scale, we normalize them before comparison by aligning the minimum and maximum of the PC scores with those of the MOS scores. In Figure 13, we plot the PC and MOS scores for each test clip, where the vertical bars denote the 95% confidence intervals of the estimated scores. The graph shows that the PC and MOS scores are generally consistent, with their 95% confidence intervals mutually overlapping. The Pearson correlation coefficient between the PC and MOS scores is 0.99, which indicates a strong consistency between the two methodologies.
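The alignment and correlation steps can be sketched as follows; the six score values are placeholders rather than our measured data, and numpy is assumed to be available.

```python
import numpy as np

# Min-max align the PC (BTL) scores to the MOS scale, then compute
# Pearson's correlation, mirroring the comparison shown in Figure 13.
pc_scores = np.array([0.02, 0.08, 0.15, 0.30, 0.55, 0.90])   # illustrative BTL estimates
mos_scores = np.array([1.6, 2.1, 2.8, 3.5, 4.2, 4.6])         # illustrative MOS values


def align(x, target):
    """Linearly map x so that its minimum and maximum coincide with target's."""
    return (x - x.min()) / (x.max() - x.min()) * (target.max() - target.min()) + target.min()


pc_aligned = align(pc_scores, mos_scores)
r = np.corrcoef(pc_aligned, mos_scores)[0, 1]   # Pearson's correlation coefficient
print(pc_aligned, r)
```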
Figure 14: Tukey mean-difference plot for inspecting systematic differences (if any) between PC and MOS scores.
We further examine whether any systematic difference exists between the PC and MOS scores using the Tukey mean-difference plot (also called the Bland-Altman plot) [4], as shown in Figure 14. We inspect only four of the six quality levels because, given that the minimum and maximum scores of PC and MOS are aligned, their differences at those two levels are always zero. To reduce the effect of anomalous inputs (if any), we adopt a re-sampling approach and re-estimate the PC and MOS scores from a random 80% subset of the raw inputs, repeated 20 times. Hence, there are a total of 80 points (20 points for each of the 4 quality levels) in Figure 14. The blue line in the graph denotes the mean difference of the scores, while the black lines delimit the 95% confidence interval of the mean difference. The graph indicates that, although differences exist between the estimated PC and MOS scores, they are mostly smaller than 0.5. Moreover, there is no systematic bias, as the magnitude and polarity of the differences do not depend on the scores, which further supports the consistency between the PC and MOS outcomes.
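A sketch of the resampling procedure is given below; estimate_pc_scores and estimate_mos_scores are hypothetical stand-ins for the BTL fit and the per-clip opinion-score averaging, respectively, and the caller is assumed to supply them.

```python
import numpy as np

# Re-estimate PC and MOS scores from random 80% subsets of the raw inputs,
# 20 times, and collect the per-level score differences used in Figure 14.
def resampled_differences(pc_inputs, mos_inputs, estimate_pc_scores,
                          estimate_mos_scores, n_rep=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_rep):
        pc_idx = rng.choice(len(pc_inputs), int(frac * len(pc_inputs)), replace=False)
        mos_idx = rng.choice(len(mos_inputs), int(frac * len(mos_inputs)), replace=False)
        pc = estimate_pc_scores([pc_inputs[i] for i in pc_idx])      # aligned PC scores
        mos = estimate_mos_scores([mos_inputs[i] for i in mos_idx])  # per-clip MOS
        diffs.append(np.asarray(pc) - np.asarray(mos))               # one value per level
    return np.vstack(diffs)   # shape: (n_rep, number of quality levels)
```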

6.3  Intra-Subject Reliability

Ideally, a QoE assessment methodology should elicit users' preferences reliably and consistently when the preference among the same stimuli is repeatedly solicited. For this reason, we evaluate which methodology, PC or MOS, leads to higher intra-subject reliability.
We used the Intra-class Correlation Coefficient (ICC) [63] to quantify the agreement of repeated measures from a single participant. The ICC takes n repeated measures as the input and outputs a coefficient ranging from 0 to 1, where 0 indicates a complete lack of agreement and 1 indicates a perfect agreement between the repeated measures. We use the ranks of stimuli as the input measures. The ranks of the stimuli in the PC experiments are obtained by sorting the number of preferred votes for each stimulus, while those in MOS experiments are obtained by sorting the opinion scores.
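Since the exact ICC variant is not essential to the argument, the following sketch assumes the one-way form, ICC(1,1) in the Shrout-Fleiss taxonomy [63]; rows are stimuli, columns are one participant's repeated sessions, and the entries are the ranks that participant assigned.

```python
import numpy as np

# Simplified one-way intra-class correlation, ICC(1,1), computed from the
# ranks a single participant assigned to each stimulus in repeated sessions.
def icc_1_1(ranks):
    ranks = np.asarray(ranks, dtype=float)
    n, k = ranks.shape                       # n stimuli, k repeated sessions
    grand = ranks.mean()
    row_means = ranks.mean(axis=1)
    msb = k * ((row_means - grand) ** 2).sum() / (n - 1)              # between-stimuli
    msw = ((ranks - row_means[:, None]) ** 2).sum() / (n * (k - 1))   # within-stimuli
    return (msb - msw) / (msb + (k - 1) * msw)


# Example: ranks of 6 stimuli from three repeated PC sessions of one participant.
print(icc_1_1([[1, 1, 2], [2, 2, 1], [3, 3, 3], [4, 5, 4], [5, 4, 5], [6, 6, 6]]))
```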
Among the 13 participants we recruited, 7 participated in both the PC and MOS experiments at least three times. The ICCs of these 7 participants are shown in Figure 15. Except for Bill, all of them achieved higher judgement consistency in the PC experiments than in the MOS experiments. While the ICCs of all 7 participants are higher than 0.8 in the PC experiments, in the MOS experiments the ICCs of two participants are lower than 0.7 and that of one participant, Supin, is even lower than 0.4. Generally speaking, PC elicited higher intra-subject reliability than MOS, which we attribute to the simpler and more intuitive comparative preference judgements in paired comparisons.
Figure 15: A comparison of intra-subject reliability of PC and MOS methodologies.

6.4  Inter-Subject Reliability

Similar to the intra-subject reliability consideration, we believe that a good QoE assessment methodology should also be able to solicit similar or even identical opinions from different participants, unless the quality levels of stimuli are contentious.
We use the ICC as well as Kendall's W (also known as the Kendall coefficient of concordance) [41] to quantify the agreement between the opinions of different participants. The inputs to both the ICC and Kendall's W are the ranks of the stimuli according to each participant, and both measures output a statistic ranging from 0 to 1, where 1 denotes complete mutual agreement. For each participant, we first aggregate all the inputs from all the experiments he or she participated in. The ranks of the stimuli in the PC experiments are then obtained by sorting the aggregated number of preferred votes for each stimulus, while those in the MOS experiments are obtained by sorting the averaged opinion scores. As we had 13 participants in this experiment, 13 rankings are fed into the ICC and Kendall's W computations.
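For reference, Kendall's W can be computed from the participants' rank matrix as follows; tie correction is omitted in this sketch, and the example rankings are illustrative.

```python
import numpy as np

# Kendall's coefficient of concordance (W) [41] for m participants ranking n stimuli.
def kendalls_w(ranks):
    """ranks: m x n array; ranks[p][s] is the rank participant p gave to stimulus s."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    col_sums = ranks.sum(axis=0)                     # total rank of each stimulus
    s = ((col_sums - m * (n + 1) / 2) ** 2).sum()    # squared deviations from the mean rank sum
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Example: 3 participants ranking 4 stimuli (1 = best).
print(kendalls_w([[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]]))
```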
Based on our dataset, the inter-subject ICCs of PC and MOS are 0.85 and 0.72, respectively, while the Kendall's W of PC and MOS are 0.85 and 0.73, respectively. Both statistics suggest that PC tends to yield more consistent judgements across participants than MOS. We consider the difficulty in rating scale mapping and the rating scale heterogeneity issue (cf. Section I) of MOS to be the major causes of its relatively lower inter-subject reliability. Our results indicate that paired comparison constitutes a reliable QoE assessment methodology due to its simple and unambiguous comparative judgements.

6.5  Convergence Rate

Figure 16: A comparison of the convergence rate of the estimated QoE scores using PC and MOS methodologies.
Another important property of a subjective QoE assessment methodology is whether it can provide confident QoE estimates with as little human effort as possible. We therefore evaluate the reliability of the estimated QoE scores under both PC and MOS for a given amount of invested human effort. We use the standard error of the estimated QoE scores to represent score reliability, and the total person-minutes spent in the experiments to measure human effort. A comparison of score reliability versus human effort is depicted in Figure 16. The graph shows that, for a similar amount of human effort, PC generally provides more reliable QoE estimates than MOS. We attribute this to the higher intra-subject and inter-subject reliability of PC, as detailed in the previous subsections.
In sum, our side-by-side comparison of the PC and MOS QoE assessment methodologies shows that PC yields results similar to those of MOS. Moreover, PC performs even better in terms of intra-subject reliability, inter-subject reliability, and the convergence rate of its outcomes, owing to the simpler and less demanding judgements required of participants.

7  Discussion

In this section, we discuss a variety of issues related to the proposed framework. We start with a discussion of the scope of QoE supported by the framework, and then elaborate on several issues related to paired comparison and crowdsourcing.

7.1  QoP or QoE?

As defined in [23], the quality of perception (QoP) reflects "a user's detectability of a change in quality or the acceptability of a quality level," while the quality of experience (QoE) is primarily concerned with "the overall acceptability of an application or service as perceived subjectively by the end-user." Strictly speaking, the case studies presented in Section V focus more on the QoP aspect of multimedia content (e.g., user perceptions of certain quality levels of the MP3 compression algorithm). However, our framework can also be applied to evaluate a multimedia system's QoE, given proper adjustments to the raters' user interface to facilitate the paired comparison of two stimuli. For example, Huang et al. [32] evaluated participants' experiences in tele-immersive interaction using paired comparison; Lan et al. [45] used paired comparison to evaluate the quality of interactive VoIP conversations; and Chang et al. [9] evaluated the impact of network delay, packet loss rate, and delay jitter on players' online gaming experiences. In any scenario where paired comparison is applicable, the proposed trusted framework can likewise be applied to detect untrustworthy user inputs and ensure the quality of the QoE estimates.

7.2  Issues with Paired Comparison

Having shown that paired comparison has several advantages over traditional methodologies, we acknowledge that it comes at a price. The main disadvantage of paired comparison is that the number of judgements required to evaluate the quality of n stimuli is on the order of O(n^2), in contrast to O(n) for the MOS methodology. Fortunately, a number of studies have been devoted to resolving this complexity issue. In [21], for example, Eichhorn et al. proposed a scheme in which each subject only needs to respond to a unique random subset of pairs instead of all the pairs; they demonstrated that their scheme can provide reliable QoE estimates based on users' judgements on merely 29% of all possible pairs. In addition, Xu et al. [72] further explored how to efficiently select the pairs that most require users' preference judgements. They proposed a random partial paired comparison approach based on random graph theory and Hodge theory, and showed that the O(n^2) complexity of "traditional" paired comparisons can be reduced to O(n^1.5) without significantly sacrificing the accuracy of the estimated quality.
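The subset-sampling idea can be illustrated with the following sketch, which simply assigns each subject a random fraction of all pairs; it is a simplification in the spirit of [21], not a reproduction of either scheme, and the 29% fraction is used only as an example.

```python
import random
from itertools import combinations

# Draw a random subset of all n*(n-1)/2 pairs for one subject to judge.
def sample_pairs(stimuli, fraction=0.29, seed=None):
    all_pairs = list(combinations(stimuli, 2))
    k = max(1, round(fraction * len(all_pairs)))
    return random.Random(seed).sample(all_pairs, k)


# Example: each subject judges roughly 29% of the 10 possible pairs of 5 clips.
print(sample_pairs(["clip_A", "clip_B", "clip_C", "clip_D", "clip_E"], seed=1))
```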
Note that although paired comparison requires more judgements than MOS, the effort required per judgement is not identical. While PC requires a dichotomous decision for each judgement, MOS requires a multi-choice decision, which inevitably incurs longer thinking time and higher mental overhead. This is one of the reasons why PC can outperform MOS given the same number of person-minutes (cf. Figure 16). For all these reasons, we believe that the numerous benefits of paired comparison, especially when n is not large and the method is combined with crowdsourcing, can compensate for the larger number of judgements it requires.
Tie Handling. There may be situations where the two stimuli in a paired comparison exhibit similar quality levels, and subjects may not know how to judge which stimulus is better. To cope with such ties, Rao and Kupper [58] generalized paired comparison so that a participant can report the quality levels of two stimuli as "equivalent" if he or she considers them similar. An example of such an extended use of paired comparison is [47], in which Lee et al. evaluated the QoE of 3D images using a ternary-response paired comparison methodology.
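For completeness, the Rao-Kupper model augments the BTL choice probabilities with a tie parameter θ ≥ 1; the notation π_i for the latent quality of stimulus i is our shorthand, following the usual BTL convention:

\[
P(i \succ j) = \frac{\pi_i}{\pi_i + \theta\,\pi_j}, \qquad
P(i \equiv j) = \frac{\pi_i \pi_j\,(\theta^2 - 1)}{(\pi_i + \theta\,\pi_j)(\pi_j + \theta\,\pi_i)},
\]

so that θ = 1 recovers the standard tie-free BTL model.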

7.3  Issues with Crowdsourcing

While our case studies indicate that crowdsourcing is a suitable approach to large-scale, lower-cost QoE evaluation, the strategy has several limitations that may affect its applicability in certain scenarios.
Environment control. In crowdsourced experiments, participants may view media content under varying conditions of, for example, lighting, screen size, and headset quality. In contrast, laboratory experiments are normally conducted in a controlled environment that equalizes the experiment conditions. This lack of control can be viewed as either an advantage or a disadvantage. On the one hand, it is an advantage because users' perceptions can be assessed in real-life scenarios, in which, for example, users' headsets may not be as good as those in a laboratory and ambient noise may be unavoidable. While it is difficult to define and simulate a "typical" user environment in a laboratory, crowdsourcing allows us to assess how people actually experience media content in their daily lives. On the other hand, it can be a disadvantage if the purpose of the experiments is to measure the quality of multimedia content in a specific scenario.
Experiment devices. Since most people connect to the Internet with personal computers, the crowdsourcing strategy is most suitable for evaluating the media content on such platforms. That is, it could be a problem if evaluations are conducted on other input/output devices, such as HDTV or e-book devices. Fortunately, since these non-PC devices are gradually becoming Internet-capable, the problem may be resolved in the near future.
Demographic factors. The demographic make-up of participants is essential for certain types of QoE experiments. However, as we are unable to confidently identify each crowd worker, it is difficult for us to relate the assessment results to demographic factors, such as gender and age. For example, we cannot investigate the effect of age on the perceptions of certain colors in video clips, because the ages self-reported by the participants may not be trustworthy.
Even though the crowdsourcing strategy has the above limitations, we believe that it is sufficiently general for evaluating the QoE of multimedia content and systems in a variety of applications. It is especially helpful for assessing the effect of techniques related to the coding, processing, and transmission of media content, as we have shown in the four case studies. Moreover, with the rapid advance of rich user interface technologies such as HTML5, we expect the framework to become even more convenient for assessing user experiences with interactive content such as computer games [49].

8  Conclusion

In this paper, we have proposed a trusted crowdsourceable framework for assessing the QoE of multimedia content and systems. The rating procedure is simple for participants, since they only need to make comparative judgements throughout the experiments. The support for detecting participants' problematic inputs is particularly essential because not all members of the Internet crowd are trustworthy. Moreover, the cheat detection mechanism makes "differentiated rewards" possible, so that participants can be paid according to the quality of their inputs. By using our framework, researchers can outsource QoE evaluation experiments to a diverse labor pool without compromising the quality of the QoE assessment.
We have demonstrated the efficacy of our framework with four case studies that involve audio and visual QoE evaluations, and have shown that the monetary cost is relatively lower for the crowdsourcing strategy than in laboratory experiments. A comparison with the commonly-used MOS methodology reveals that our framework provides comparable QoE ratings while yielding high intra- and inter-subject consistency. In summary, we expect that the proposed crowdsourceable framework for QoE evaluations will be helpful to researchers in multimedia signal processing, content analysis, and system development.

Acknowledgements

This work was supported in part by the National Science Council under the grants NSC101-2221-E-001-012-MY3 and NSC101-2221-E-002-190-MY3.

References

[1] O. Alonso, D. E. Rose, and B. Stewart, "Crowdsourcing for relevance evaluation," ACM SIGIR Forum, vol. 42, no. 2, pp. 9-15, 2008.
[2] A. Basso and S. Sicco, "Preventing massive automated access to web resources," Computers and Security, vol. 28, no. 3, pp. 174-188, 2009.
[3] M. H. Birnbaum, Psychological Experiments on the Internet.    Academic Press, 2000.
[4] J. M. Bland and D. G. Altman, "Statistical methods for assessing agreement between two methods of clinical measurement," The Lancet, vol. 327, no. 8476, pp. 307-310, 1986.
[5] D. C. Brabham, "Crowdsourcing as a model for problem solving: An introduction and cases," Convergence: The International Journal of Research into New Media Technologies, vol. 14, no. 1, pp. 75-90, 2008.
[6] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, pp. 324-345, 1952.
[7] S. Buchholz and J. Latorre, "Crowdsourcing preference tests, and how to detect cheating," in Proceedings of Interspeech 2011, August 2011, pp. 3053-3056.
[8] B. Carnero and A. Drygajlo, "Perceptual speech coding and enhancement using frame-synchronized fast wavelet packet transform algorithms," IEEE Transactions on Signal Processing, vol. 47, no. 6, pp. 1622-1635, 1999.
[9] Y.-C. Chang, K.-T. Chen, C.-C. Wu, C.-J. Ho, and C.-L. Lei, "Online Game QoE Evaluation using Paired Comparisons," in Proceedings of IEEE CQR 2010, June 2010, pp. 1-6.
[10] K.-T. Chen, C.-Y. Huang, P. Huang, and C.-L. Lei, "Quantifying Skype User Satisfaction," in Proceedings of ACM SIGCOMM 2006, Pisa, Italy, Sep 2006.
[11] K.-T. Chen, J.-W. Jiang, P. Huang, H.-H. Chu, C.-L. Lei, and W.-C. Chen, "Identifying MMORPG Bots: A Traffic Analysis Approach," EURASIP Journal on Advances in Signal Processing, vol. 2009, 2009.
[12] K.-T. Chen, C. C. Tu, and W.-C. Xiao, "OneClick: A Framework for Measuring Network Quality of Experience," in Proceedings of IEEE INFOCOM 2009, April 2009.
[13] K.-T. Chen, C.-C. Wu, Y.-C. Chang, and C.-L. Lei, "A Crowdsourceable QoE Evaluation Framework for Multimedia Content," in Proceedings of ACM Multimedia 2009, 2009.
[14] --, "Quantifying QoS Requirements of Network Services: A Cheat-Proof Framework," in Proceedings of the second annual ACM Conference on Multimedia systems, February 2011, pp. 81-92.
[15] S. Choisel and F. Wickelmaier, "Evaluation of multichannel reproduced sound: Scaling auditory attributes underlying listener preference," The Journal of the Acoustical Society of America, vol. 121, no. 1, pp. 388-400, 2007.
[16] F. M. Ciaramello and S. S. Hemami, "Quality versus intelligibility: Studying human preferences for American sign language video," in Proceedings of SPIE Vol. 7865, Human Vision and Electronic Imaging XVI, January 2011.
[17] H. A. David, The Method of Paired Comparisons, 1988.
[18] L. Ding and R. Goubran, "Speech quality prediction in VoIP using the extended E-Model," in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM 2003), December 2003, pp. 3974-3978.
[19] R. Dittrich, R. Hatzinger, and W. Katzenbeisser, "Modelling the effect of subject-specific covariates in paired comparison studies with an application to university rankings," Journal of the Royal Statistical Society (Series C): Applied Statistics, vol. 47, no. 4, pp. 511-525, 1998.
[20] A. Doan, R. Ramakrishnan, and A. Y. Halevy, "Crowdsourcing systems on the World-Wide Web," Communications of the ACM, vol. 54, no. 4, pp. 86-96, April 2011.
[21] A. Eichhorn, P. Ni, and R. Eg, "Randomised pair comparison: An economic and robust method for audiovisual quality assessment," in Proceedings of International Workshop on Network and Operating Systems Support for Digital Audio and Video, June 2010, pp. 63-68.
[22] E. O. Elliott, "Estimates of error rates for codes on burst-noise channels," Bell System Technical Journal, vol. 42, no. 5, pp. 1977-1997, 1963.
[23] ETSI TR 102 643, "Quality of experience (QoE) requirements for real-time communication services," 2010.
[24] E. N. Gilbert, "Capacity of a burst-noise channel," Bell System Technical Journal, vol. 39, no. 5, pp. 1253-1265, 1960.
[25] B. Girod, "The information theoretical significance of spatial and temporal masking in video signals," in Proceedings of SPIE Vol. 1077, Human Vision, Visual Processing, and Digital Displays, 1989, pp. 178-187.
[26] R. J. Hill, "A note on inconsistency in paired comparison judgments," American Sociological Review, vol. 18, no. 5, pp. 564-566, 1953.
[27] C.-J. Ho, T.-H. Chang, and J. Y.-J. Hsu, "Photoslap: A multi-player online game for semantic annotation," in Proceedings of the 22nd Conference on Artificial Intelligence, July 2007, pp. 1359-1364.
[28] C.-J. Ho and K.-T. Chen, "On formal models for social verification," in Proceedings of Human Computation Workshop 2009 (affiliated to ACM KDD 2009), June 2009, pp. 62-69.
[29] J. J. Horton and L. B. Chilton, "The labor economics of paid crowdsourcing," in Proceedings of ACM Conference on Electronic Commerce, June 2010, pp. 209-218.
[30] T. Hossfeld, M. Seufert, M. Hirth, T. Zinner, P. Tran-Gia, and R. Schatz, "Quantification of YouTube QoE via crowdsourcing," in Proceedings of the 2011 IEEE International Symposium on Multimedia, 2011, pp. 494-499.
[31] J. Howe, "The rise of crowdsourcing," Wired Magazine, vol. 14, no. 6, pp. 176-183, 2006.
[32] Z. Huang, A. Arefin, P. Agarwal, K. Nahrstedt, and W. Wu, "Towards the understanding of human perceptual quality in tele-immersive shared activity," in Proceedings of the ACM Multimedia Systems Conference, February 2012, pp. 29-34.
[33] Y. Ito and S. Tasaka, "Quantitative assessment of user-level QoS and its mapping," IEEE Transactions on Multimedia, vol. 7, no. 3, pp. 572-584, 2005.
[34] ITU-R Recommendation P.800, "Methods for subjective determination of transmission quality," 1996.
[35] ITU-T Recommendation G.107, "The E-model, a computational model for use in transmission planning," 2005.
[36] ITU-T Recommendation J.247, "Objective perceptual multimedia video quality measurement in the presence of a full reference," 2008.
[37] ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," 2001.
[38] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," 2008.
[39] R. Jain, "Quality of experience," IEEE Multimedia, vol. 11, no. 1, pp. 96-97, 2004.
[40] S. Jain, Y. Chen, and D. C. Parkes, "Designing incentives for online question and answers forums," in Proceedings of ACM Conference on Electronic Commerce, July 2009, pp. 129-138.
[41] M. G. Kendall and B. B. Smith, "The problem of m rankings," The Annals of Mathematical Statistics, vol. 10, no. 3, pp. 275-287, 1939.
[42] --, "On the method of paired comparisons," Biometrika, vol. 31, no. 3/4, pp. 324-345, 1940.
[43] A. Kittur, E. H. Chi, and B. Suh, "Crowdsourcing user studies with Mechanical Turk," in Proceedings of ACM CHI 2008, April 2008, pp. 453-456.
[44] C. L. Knott and M. S. James, "An alternate approach to developing a total celebrity endorser rating model using the analytic hierarchy process," International Transactions in Operational Research, vol. 11, no. 1, pp. 87-95, 2004.
[45] K.-C. Lan and T.-H. Wu, "Evaluating the perceived quality of infrastructure-less VoIP," in Proceedings of IEEE Workshop on Streaming and Media Communications, July 2011.
[46] C. T. Lee, E. M. Rodrigues, G. Kazai, N. Milic-Frayling, and A. Ignjatovic, "Model for voter scoring and best answer selection in community Q&A services," in Proceedings of IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, September 2009, pp. 116-123.
[47] J.-S. Lee, L. Goldmann, and T. Ebrahimi, "A new analysis method for paired comparison and its application to 3D quality assessment," in Proceedings of ACM Multimedia 2011, 2011, pp. 1281-1284.
[48] J.-S. Lee, F. D. Simone, and T. Ebrahimi, "Subjective quality evaluation via paired comparison: Application to scalable video coding," IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 882-893, 2011.
[49] Y.-T. Lee, K.-T. Chen, H.-I. Su, and C.-L. Lei, "Are All Games Equally Cloud-Gaming-Friendly? An Electromyographic Approach," in Proceedings of IEEE/ACM NetGames 2012, October 2012.
[50] R. D. Luce, Individual Choice Behavior: A Theoretical Analysis, 1959.
[51] T. Manjunath, "Limitations of perceptual evaluation of speech quality on VoIP systems," in Proceedings of the IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, May 2009, pp. 1-6.
[52] J. N. S. Matthews and K. P. Morris, "An application of Bradley-Terry-type models to the measurement of pain," Journal of the Royal Statistical Society (Series C): Applied Statistics, vol. 44, no. 2, pp. 243-255, 1995.
[53] R. R. Pastrana-Vidal, J. C. Gicquel, C. Colomes, and H. Cherifi, "Temporal masking effect on dropped frames at video scene cuts," in Proceedings of SPIE Vol. 5292, Human Vision and Electronic Imaging IX, 2004, pp. 194-201.
[54] G. L. Peterson and T. C. Brown, "Economic valuation by the method of paired comparison, with emphasis on evaluation of the transitivity axiom," Land Economics, vol. 74, no. 2, pp. 240-261, 1998.
[55] N. L. Powers and R. M. Pangborn, "Paired comparison and time-intensity measurements of the sensory properties of beverages and gelatins containing sucrose or synthetic sweeteners," Journal of Food Science, vol. 43, no. 1, pp. 41-46, 1978.
[56] G. Qiu, Y. Mei, and J. Duan, "Evaluating HDR photos using Web 2.0 technology," in Proceedings of SPIE Vol. 7867, Image Quality and System Performance VIII, 2011.
[57] D. G. Rand, A. Dreber, T. Ellingsen, D. Fudenberg, and M. A. Nowak, "Positive interactions promote public cooperation," Science, vol. 325, no. 5945, pp. 1272-1275, 2009.
[58] P. V. Rao and L. L. Kupper, "Ties in paired-comparison experiments: A generalization of the Bradley-Terry model," Journal of the American Statistical Association, vol. 62, no. 317, pp. 194-204, 1967.
[59] F. Ribeiro, D. Florencio, and V. Nascimento, "Crowdsourcing subjective image quality evaluation," in Proceedings of IEEE International Conference on Image Processing, September 2011, pp. 3097-3100.
[60] F. Ribeiro, D. Florencio, C. Zhang, and M. Seltzer, "crowdMOS: An approach for crowdsourcing mean opinion score studies," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, May 2011, pp. 2416-2419.
[61] P. E. Rossi, Z. Gilula, and G. M. Allenby, "Overcoming scale usage heterogeneity: A Bayesian hierarchical approach," Journal of the American Statistical Association, vol. 96, no. 453, pp. 20-31, 2001.
[62] T. L. Saaty, "A scaling method for priorities in hierarchical structures," Journal of Mathematical Psychology, vol. 15, no. 3, pp. 234-281, 1977.
[63] P. E. Shrout and J. L. Fleiss, "Intraclass correlations: Uses in assessing rater reliability," Psychological Bulletin, vol. 86, no. 2, pp. 420-428, 1979.
[64] A. Sorokin and D. Forsyth, "Utility data annotation with Amazon Mechanical Turk," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, June 2008, pp. 1-8.
[65] I. Sprow, Z. Baranczuk, T. Stamm, and P. Zolliker, "Web-based psychometric evaluation of image quality," in Proceedings of SPIE Vol. 7242, Image Quality and System Performance VI, 2009.
[66] L. Sun and E. C. Ifeachor, "Voice quality prediction models and their application in VoIP networks," IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 809-820, 2006.
[67] P. Suppes, D. H. Krantz, R. D. Luce, and A. Tversky, Foundations of measurement, Volume II: Geometrical, Threshold, and Probabilistic Representations, December 2006.
[68] S. Tasaka, H. Yoshimi, A. Hirashima, and T. Nunome, "The effectiveness of a QoE-based video output scheme for audio-video IP transmission," in Proceedings of ACM Multimedia 2008, 2008, pp. 259-268.
[69] A. Watson and M. A. Sasse, "Measuring perceived quality of speech and video in multimedia conferencing applications," in Proceedings of ACM Multimedia 1998, 1998, pp. 55-60.
[70] P. Welinder and P. Perona, "Online crowdsourcing: Rating annotators and obtaining cost-effective labels," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, June 2010, pp. 25-32.
[71] W. Wu, A. Arefin, G. Kurillo, P. Agarwal, K. Nahrstedt, and R. Bajcsy, "A psychophysical approach for real-time 3D video processing," in Proceedings of ACM Multimedia 2011, November 2011, pp. 683-686.
[72] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, "HodgeRank on random graphs for subjective video quality assessment," IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 844-857, June 2012.

Footnotes:

1. A preliminary version of this paper [13] appeared in the Proceedings of the 17th ACM International Conference on Multimedia (ACM Multimedia 2009).
2. Manuscript received February 19, 2012; revised June 17, 2012; accepted November 02, 2012. Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
3. Chen-Chi Wu, Yu-Chun Chang, and Chin-Laung Lei are with the Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: bipa@fractal.ee.ntu.edu.tw; congo@fractal.ee.ntu.edu.tw; lei@cc.ee.ntu.edu.tw)
4. Kuan-Ta Chen is with the Institute of Information Science and the Research Center for Information Technology Innovation, Academia Sinica, Taipei 115, Taiwan (e-mail: ktchen@iis.sinica.edu.tw).
5. It is foreseeable that certain automation schemes, e.g., bots [11], would be used to perform such crowdsourcing tasks repeatedly if it is considered a profitable "business" by malicious attackers.
6. http://www.innocentive.com
7. http://www.mturk.com
8. Although we did not specify a time limit, each round normally took between 5 and 25 seconds in our experiments.
9. http://www.voiptroubleshooter.com/open_speech
10. http://www.its.bldrdoc.gov/vqeg
11. http://iphome.hhi.de/suehring/tml

