A transformation between visual and auditory perception
based on emotion mapping
OVERVIEW
Psychological studies have demonstrated that the content of visual perception (including color, image and video) and auditory perception (including sound and music) can evoke different emotions in us, from sadness to joy. Emotion recognition and classification of visual and auditory content have become widespread techniques with a wide range of applications, such as product and environmental color design [1-3], image classification and retrieval [4], acoustic environment design, and music genre detection, organization, retrieval and recommendation [5-8]. Inspired by these studies and applications, computer scientists have begun to predict people's emotional reactions to visual and auditory content, including color emotion and harmony quantification [9-15], image emotion recognition and classification [4, 16-25], sound classification [26-28], music visualization [5, 29-31], and music emotion recognition and classification [25, 32-38].
However, most of this work focuses on emotion analysis of a single perceptual modality (visual or auditory) rather than on a transformation and integration between visual and auditory perception based on emotion mapping. It overlooks the fact that different forms of perception can evoke the same or similar emotions, as well as synesthesia, which literally means cross-sensing: one sense being triggered by another. A synesthete might hear colors, taste sounds, see smells, or experience any other combination of the senses. Hence, we propose a transformation between visual and auditory perception based on emotion mapping to realize this synesthesia, as shown in Figure 1, including color vocalization and image musicalization (indicated by the red arrows), and sound colorization and music visualization and picturization (indicated by the black arrows).
Compared with emotion analysis of a single modality, the transformation and integration between visual and auditory perception is more human-centered and supports more applications:
1. Color and image vocalization for blind users, users with protanopia, and children. Hearing colors is a common experience among synesthetes. With a color-vocalization device, blind users can distinguish colors and image content by hearing them. Furthermore, children can learn musical pitches with the help of color recognition, arrange colors to play a song, and create their own instruments by coloring sketches or cutting out construction paper.
2. Sound colorization for deaf users and sound-and-light designers. Deaf users can see sounds with the help of a sound-colorization device. Furthermore, sound-and-light designers can adjust the light color according to the sound, evoking stronger feelings and giving a more touching presentation, as long as light and sound are synchronized in emotion.
3. Image musicalization for automatic music generation. Image and video musicalization can quickly compose personalized music from a single image or video, which is significant for film scoring and individual short-video production.
4. Music picturization for music video production. Music picturization is the reverse of image and video musicalization: based on the music, a series of images and photos can be generated quickly to make a music video.
Figure 1. Framework of the proposed transformation model between visual and auditory perception based on emotion mapping.
The proposed transformation between visual and auditory perception based on emotion mapping is shown in Figure 1. It consists of five main objectives: color vocalization, image and video musicalization, sound colorization, and music visualization and picturization. The emotion mapping model is also shown in Figure 1, including feature extraction, emotion detection, and mapping of visual and auditory inputs. More specifically, different objectives have to be realized by different methods because of their different input contents.
1. Color vocalization and sound colorization. First, training data of color patches with emotion annotations can be collected through a psychological experiment, and the color emotions can be quantified by establishing the relationships between color values in the CIE L*a*b* and CIE L*C*h color spaces and the assessed emotion annotations. Second, training data of sounds with emotion annotations can be collected in the same way, and the sound emotions can be quantified by establishing the relationships between sound features (loudness, pitch and timbre) and the assessed emotion annotations. Finally, a transformation model can be built on top of these quantified color and sound emotions (see the sketch after this list).
2. Image vocalization. Recent advances in deep learning [39] have enabled a wide range of applications, such as image detection and classification [40-42], object detection [43, 44], human pose estimation [45-47] and clothing parsing [48, 49]. We will adopt a convolutional neural network (CNN) to extract features and recognize the specific emotion of an image. With the availability of large-scale datasets, CNNs can learn representations of the data at multiple levels of abstraction. Image vocalization can then be realized by mapping between image and sound emotions.
3. Image musicalization and music picturization. Generative adversarial networks (GANs) [50-52] are deep-learning frameworks that achieve state-of-the-art performance in generative tasks. Sequence generative adversarial networks (SeqGAN) [53, 54] are among the first models to combine reinforcement learning and GANs for learning from discrete sequence data. Hence, we will adopt and modify SeqGAN to realize image musicalization and music picturization. The SeqGAN model consists of an RNN as a sequence generator and a CNN as a discriminator that identifies whether a given sequence is real or fake. SeqGAN successfully learns from artificial and real-world discrete data and can be used for language modeling and monophonic music generation. Music picturization is the reverse of image and video musicalization.
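To make the mapping step in objective 1 concrete, the following is a minimal Python sketch of how a quantified color emotion could be matched to the sound whose annotated emotion is closest in the valence-arousal plane. The linear coefficients and the example sound library are purely hypothetical placeholders, not measured values; in the proposal they would come from the psychological experiments described above.

```python
import math

# Hypothetical linear color-emotion model: maps CIE L*C*h values to
# (valence, arousal). The coefficients are placeholders; they would be
# fitted from the color-emotion experiment.
def color_emotion(L, C, h_deg):
    valence = 0.01 * (L - 50) + 0.005 * C * math.cos(math.radians(h_deg))
    arousal = 0.008 * C + 0.004 * C * math.sin(math.radians(h_deg))
    return valence, arousal

# Hypothetical sound library annotated with (valence, arousal) from the
# sound-emotion experiment.
SOUND_LIBRARY = {
    "soft_chime.wav":  ( 0.6, -0.3),
    "alarm_buzz.wav":  (-0.5,  0.8),
    "low_drone.wav":   (-0.4, -0.6),
    "bright_bell.wav": ( 0.7,  0.5),
}

def vocalize_color(L, C, h_deg):
    """Return the sound whose annotated emotion is closest to the color's."""
    v, a = color_emotion(L, C, h_deg)
    return min(SOUND_LIBRARY,
               key=lambda name: (SOUND_LIBRARY[name][0] - v) ** 2 +
                                (SOUND_LIBRARY[name][1] - a) ** 2)

if __name__ == "__main__":
    print(vocalize_color(L=70, C=60, h_deg=40))  # a warm, saturated color
```

Sound colorization would simply invert the lookup direction: predict the sound's (valence, arousal) and search a palette of quantified colors for the nearest match.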
1. BACKGROUND
Psychological studies have demonstrated that the content of visual perception (including color, image and video) and auditory perception (including sound and music) can evoke different emotions in us, from sadness to joy. From these empirical studies, a great variety of emotion models have been proposed, most of which follow one of two approaches to emotion conceptualization: the categorical approach and the dimensional approach.
(1) The categorical model assumes that there are discrete emotions which are believed to be distinguishable by an individual's facial expression and biological processes. Each emotion category is characterized by a set of emotion patterns or structures that sets it apart from other categories. Plutchik [55] added trust and anticipation to Ekman's six basic emotions [56]; Plutchik's eight emotions are organized into four bipolar pairs: trust vs. disgust, joy vs. sadness, anger vs. fear, and surprise vs. anticipation. Izard [57] defined ten basic emotion categories: anger, contempt, disgust, distress, fear, guilt, interest, joy, shame and surprise.
(2) The dimensional model describes emotions along one or more dimensions. The circumplex model of affect [58] is one such dimensional approach; it identifies two main dimensions, valence and arousal. Valence, also referred to as polarity, measures whether an emotion is pleasant or unpleasant, and arousal measures the degree of activation, ranging from calm to excited.
Emotion recognition and classification of visual and auditory content have become widespread techniques with a wide range of applications, such as product and environmental color design [1-3], image classification and retrieval [4], acoustic environment design, and music genre detection, organization, retrieval and recommendation [5-8]. Inspired by these studies and applications, computer scientists have begun to predict people's emotional reactions to visual and auditory content, including color emotion and harmony quantification [9-15], image emotion recognition and classification [4, 16-25], sound emotion classification [26-28], music visualization [5, 29-31], and music emotion recognition and classification [25, 32-38].
Emotion recognition can be viewed as a multiclass, multilabel classification or regression problem in which each piece of content (e.g., a music piece) is annotated with a set of emotions; the overall model is shown in Figure 2. After extracting and reducing features of the input content, different supervised learning methods are used to recognize and classify its emotions, including Gaussian mixture models (GMM) [59], support vector machines (SVM) [59], convolutional neural networks (CNN) [20-22, 37, 60, 61] and generative adversarial networks (GAN) [54].
Figure 2. Overall model of emotion recognition and classification systems.
(TF-IDF: term frequency-inverse document frequency; POS Tagging: Part-Of-Speech Tagging)
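As an illustration of the generic pipeline in Figure 2, the sketch below trains a multilabel emotion classifier with scikit-learn on pre-extracted feature vectors. The feature matrix and emotion tags are synthetic placeholders, and the SVM is only one of the learners listed above; the point is the "extract, reduce, classify" structure, not a particular result.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for extracted audio/visual features (n_samples x n_features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
# Multilabel emotion annotations: columns = [happy, sad, angry, calm].
Y = (rng.random((200, 4)) > 0.7).astype(int)

# Feature scaling + reduction + one-vs-rest SVM, mirroring Figure 2.
model = OneVsRestClassifier(
    make_pipeline(StandardScaler(), PCA(n_components=10), SVC(probability=True)))
model.fit(X, Y)
print(model.predict(X[:3]))  # predicted emotion tag sets for three samples
```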
1.1 Color Emotion Quantification
Besides the textures and patterns of products and environments, the color perception process normally induces an associated feeling or emotion in our brains. The series of studies by L. C. Ou et al. clarifies the relationship between color emotion and color preference for solid colors and two-color combinations [9-11, 15]. The studies of J. H. Xin et al. investigate and compare color emotions across three different regions and establish a direct relationship between the color emotions of subjects from these regions using color planners [13, 14]. However, these studies focus on only three basic color-emotion scales, Warm-Cold, Heavy-Light and Active-Passive, and ignore other color emotions such as sadness, happiness, fear and anger.
1.2 Image Emotion Detection
The most important element to be captured in an image is the emotion, which may be the key connection between the author and the viewer. Unlike psychological research, most computer vision work tries to predict the emotional reaction to a particular image. This task concerns high-level semantic inference: it attempts to infer the content of an image and associate high-level semantics, here emotions, with it. The main difficulty is bridging the gap between low-level features extracted from images and high-level semantic concepts. To recognize different emotions, features should be designed to carry sufficient information; they can be extracted by two main methods: handcrafted [18, 19, 62, 63] and automatic [20-22, 61] extraction. The handcrafted extraction method designs a series of principle-of-art features, namely color, texture, composition, balance, emphasis, harmony, variety, gradation and movement, to classify and score the emotion of an image [18, 19, 62, 63]. Four typical low-level representation features are shown in Figure 3.
Figure 3. Typical low-level representation features. (a) Color. (b) Line. (c) Texture. (d) Shape.
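As a hedged sketch of how a few such low-level features might be computed, the snippet below uses scikit-image to derive simple color and line/texture descriptors; the chosen statistics (mean Lab color, edge density) are illustrative simplifications of the principle-of-art features in [18, 19], not the exact feature set used there.

```python
import numpy as np
from skimage import color, filters, io

def low_level_features(path):
    """Crude color/line/texture descriptors as stand-ins for handcrafted features."""
    rgb = io.imread(path)[..., :3] / 255.0
    lab = color.rgb2lab(rgb)                      # color: mean L*, a*, b*
    gray = color.rgb2gray(rgb)
    edges = filters.sobel(gray)                   # line/texture proxy
    return {
        "mean_L": lab[..., 0].mean(),
        "mean_a": lab[..., 1].mean(),
        "mean_b": lab[..., 2].mean(),
        "edge_density": float((edges > 0.1).mean()),
        "edge_strength": float(edges.mean()),
    }

# Usage (hypothetical path):
# print(low_level_features("example.jpg"))
```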
The automatic extraction method uses convolutional neural networks (CNNs) to extract features instead of handcrafted features, typically solving a binary classification problem [20-22, 61]. With the availability of large-scale datasets, CNNs can learn representations of the data at multiple levels of abstraction.
1.3 Sound and Music Emotion Recognition
Music plays an important role in human history, even more so in the digital age: never before has such a large collection of music been created and accessed daily. Music is now ubiquitous, accompanying even the most routine activities of life: waking up, eating, housekeeping, shopping, studying, exercising, driving, and so forth [64]. Emotions can be influenced by attributes such as tempo, timbre, harmony, and loudness (to name only a few), and much prior work in music emotion recognition has focused on developing informative acoustic features, including dynamics, timbre, harmony, register, rhythm and articulation [33]. Compared with handcrafted acoustic features, automatic extraction again relies on convolutional neural networks (CNNs) [37, 65-67] or convolutional recurrent neural networks (CRNNs) [68].
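On the acoustic side, the sketch below extracts a few of the commonly used descriptors (dynamics, timbre, brightness, rhythm) with librosa. The file path is a placeholder, and this feature set is indicative rather than the exact one surveyed in [33].

```python
import numpy as np
import librosa

def acoustic_features(path):
    """Extract simple dynamics / timbre / rhythm descriptors from an audio file."""
    y, sr = librosa.load(path, mono=True)
    rms = librosa.feature.rms(y=y)                            # dynamics (loudness)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # timbre
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # brightness
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)            # rhythm
    return np.hstack([rms.mean(), mfcc.mean(axis=1), centroid.mean(), tempo])

# Usage (hypothetical path):
# features = acoustic_features("clip.wav")
```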
However, most of this work focuses on emotion analysis of a single perceptual modality (visual or auditory) rather than on a transformation and integration between visual and auditory perception based on emotion mapping. It overlooks the fact that different forms of perception can evoke the same or similar emotions, as well as synesthesia, which literally means cross-sensing: one sense being triggered by another. Synesthesia is a perceptual phenomenon in which stimulation of one sensory or cognitive pathway leads to automatic, involuntary experiences in a second sensory or cognitive pathway [69]. A synesthete might hear colors, taste sounds, see smells, or experience any other combination of the senses.
Therefore, we propose a transformation between visual and auditory perception based on emotion mapping to realize this synesthesia, as shown in Figure 1, including color vocalization, image and video musicalization, sound colorization, and music visualization and picturization.
Compared with emotion analysis of a single modality, the transformation and integration between visual and auditory perception is more human-centered and supports more applications:
1. Color and image vocalization for blind users, users with protanopia, and children. Hearing colors is a common experience among synesthetes. With a color-vocalization device, blind users can distinguish colors and image content by hearing them. Furthermore, children can learn musical pitches with the help of color recognition, arrange colors to play a song, and create their own instruments by coloring sketches or cutting out construction paper.
2. Sound colorization for deaf users and sound-and-light designers. Deaf users can see sounds with the help of a sound-colorization device. Furthermore, sound-and-light designers can adjust the light color according to the sound, evoking stronger feelings and giving a more touching presentation, as long as light and sound are synchronized in emotion.
3. Image musicalization for automatic music generation. Image and video musicalization can quickly compose personalized music from a single image or video, which is significant for film scoring and individual short-video production.
4. Music picturization for music video production. Music picturization is the reverse of image and video musicalization: based on the music, a series of images and photos can be generated quickly to make a music video.
2. RESEARCH METHODOLOGY
The proposed transformation between visual and auditory perception based on emotion mapping is shown in Figure 1. It consists of five main objectives: color vocalization, image and video musicalization, sound colorization, and music visualization and picturization. The emotion mapping model is also shown in Figure 1, including feature extraction, emotion detection, and mapping of visual and auditory inputs. More specifically, different objectives have to be realized by different methods because of their different input contents.
(1) Color vocalization and sound colorization. First, training data of color patches with emotion annotations can be collected through a psychological experiment, and the color emotions can be quantified by establishing the relationships between color values in the CIE L*a*b* and CIE L*C*h color spaces and the assessed emotion annotations. Second, training data of sounds with emotion annotations can be collected in the same way, and the sound emotions can be quantified by establishing the relationships between sound features (loudness, pitch and timbre) and the assessed emotion annotations. Finally, a transformation model can be built on top of these quantified color and sound emotions.
(2) Image vocalization. Recent advances in deep learning [39] have enabled a wide range of applications, such as image detection and classification [40-42], object detection [43, 44], human pose estimation [45-47] and clothing parsing [48, 49]. We will adopt a convolutional neural network (CNN) to extract features and recognize the specific emotion of an image. With the availability of large-scale datasets, CNNs can learn representations of the data at multiple levels of abstraction. Image vocalization can then be realized by mapping between image and sound emotions.
(3) Image musicalization and music picturization. Sequence generative adversarial networks (SeqGAN) [53, 54] are among the first models to combine reinforcement learning and GANs for learning from discrete sequence data. Hence, we will adopt and modify SeqGAN to realize image musicalization and music picturization. The SeqGAN model consists of an RNN as a sequence generator and a CNN as a discriminator that identifies whether a given sequence is real or fake. SeqGAN successfully learns from artificial and real-world discrete data and can be used for language modeling and monophonic music generation. Music picturization is the reverse of image and video musicalization (a structural sketch follows this list).
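The following is a heavily simplified structural sketch of the SeqGAN components in PyTorch, showing only the shapes of an RNN generator over note tokens and a CNN discriminator over embedded sequences. The vocabulary size and layer widths are placeholders; the policy-gradient training loop and the image conditioning needed for musicalization are omitted and would follow [53, 54].

```python
import torch
import torch.nn as nn

VOCAB = 128          # e.g. MIDI pitch tokens (placeholder size)
EMB, HID, SEQ = 64, 128, 32

class Generator(nn.Module):
    """GRU that emits a distribution over the next token at each step."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens, hidden=None):
        h, hidden = self.gru(self.emb(tokens), hidden)
        return self.out(h), hidden               # logits over the vocabulary

class Discriminator(nn.Module):
    """1-D CNN that scores a token sequence as real or generated."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.conv = nn.Conv1d(EMB, 64, kernel_size=5, padding=2)
        self.head = nn.Linear(64, 1)

    def forward(self, tokens):
        x = self.emb(tokens).transpose(1, 2)          # (batch, EMB, SEQ)
        x = torch.relu(self.conv(x)).max(dim=2).values
        return torch.sigmoid(self.head(x))            # probability "real"

if __name__ == "__main__":
    fake = torch.randint(0, VOCAB, (4, SEQ))
    g, d = Generator(), Discriminator()
    logits, _ = g(fake)
    print(logits.shape, d(fake).shape)   # (4, 32, 128) and (4, 1)
```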
2.1 Emotion Model Selection
From these empirical studies, a great variety of emotion models have been proposed, most of which belong to one of the following two approaches to emotion conceptualization: the categorical approach [55-57] and the dimensional approach [58, 70].
While the categorical approach focuses mainly on the characteristics that distinguish emotions from one another, the dimensional approach focuses on identifying emotions based on their placement on a small number of emotion dimensions with named axes, which are intended to correspond to internal human representations of emotion. These internal emotion dimensions are found by analyzing the correlation between affective terms. In the seminal work of Russell [58], the circumplex model of emotion is proposed. The model consists of a two-dimensional, circular structure involving the dimensions of valence and arousal, as shown in Figure 4. Within this structure, emotions that are inversely correlated are placed across the circle from one another.
One of the strengths of the circumplex model is that it suggests a simple yet powerful way of organizing different emotions in terms of their affect appraisals (valence) and physiological reactions (arousal), and it allows for direct comparison of different emotions on two standard and important dimensions. Hence, we use the dimensional approach to select emotions.
Figure 4. The 2D valence-arousal emotion space [58] (the positions of the affective terms are approximate, not exact).
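To show how the dimensional model is used in practice, the snippet below places a handful of affective terms at rough valence-arousal coordinates and maps a predicted (valence, arousal) pair to the nearest term. The coordinate values are approximate readings of a circumplex layout like Figure 4 and are illustrative assumptions, not calibrated data.

```python
# Approximate (valence, arousal) coordinates in [-1, 1]; illustrative only.
CIRCUMPLEX = {
    "excited": ( 0.6,  0.7), "happy": ( 0.8,  0.3),
    "calm":    ( 0.5, -0.6), "tired": (-0.3, -0.7),
    "sad":     (-0.7, -0.3), "angry": (-0.6,  0.7),
}

def nearest_term(valence, arousal):
    """Return the affective term closest to a predicted (valence, arousal) pair."""
    return min(CIRCUMPLEX, key=lambda t: (CIRCUMPLEX[t][0] - valence) ** 2 +
                                         (CIRCUMPLEX[t][1] - arousal) ** 2)

print(nearest_term(0.7, 0.6))   # -> "excited"
```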
Thumfart and colleagues [16] proposed a hierarchical feed-forward model of aesthetic texture perception consisting of three layers: an affective layer (how can the texture be described?), a judgment layer (what does the object say about itself?) and an emotional layer (what do I feel when interacting with the texture?). Hence, we should add some aesthetic properties as affective and judgment scales, such as Warm-Cold, Heavy-Light, Active-Passive, Simple-Complex and Unharmonious-Harmonious.
2.2 Sound Colorization: Sound Emotion Recognition
Figure 5. A typical sound.
2.3 Color Vocalization: Color Emotion Quantification
This experiment consists of four steps. First, we select aesthetic and emotional properties of colors as semantic scales for evaluating the inherent features of colors. Second, we select discrete solid colors distributed uniformly in the CIE L*C*h color space as the evaluation objects, as shown in Figure 6. Third, the color patches are assessed on the semantic scales by a number of observers (with roughly equal numbers of males and females). Finally, the relationships between the color values in the CIE L*a*b* and CIE L*C*h color spaces and the assessed values of the semantic scales are analyzed with data analysis and visualization methods.
Figure 6. 36 Solid Colors distributed in CIE L*a*b color space.
It is possible and significant to quantify the aesthetic emotion of solid colors and to confirm the relationship between color values and aesthetic emotions. In our studies, CIELAB values are used to describe the color of the patches. The visible gamut plotted within the CIELAB color space is shown in Figure 7(a). The CIELAB color space (also known as CIE L*a*b* or simply "Lab") is a color space defined by the International Commission on Illumination (CIE) in 1976. It expresses color as three numerical values: L* for lightness, and a* and b* for the green-red and blue-yellow color components. CIELAB was designed to be perceptually uniform with respect to human color vision, meaning that the same amount of numerical change in these values corresponds to about the same amount of visually perceived change.
The CIELCh color space is a cylindrical representation of the CIELAB color space: instead of the Cartesian coordinates a* and b*, the cylindrical coordinates C* (chroma, relative saturation) and h° (hue angle in the CIELAB color wheel) are specified, as shown in Figure 7(b). The CIELAB lightness L* remains unchanged. The conversion of a* and b* to C* and h° uses the following formulas:
$C^{*} = \sqrt{(a^{*})^{2} + (b^{*})^{2}}$    (5)
$h^{\circ} = \arctan(b^{*}/a^{*})$    (6)
Figure 7. Color space. (a) The visible gamut plotted within the CIELAB color space: a* and b* are the horizontal axes, L* is the vertical axis (D65 white point). (b) The visible gamut plotted within the CIELCh color space: L* is the vertical axis, C* is the radius, and h is the angle around the circumference.
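Equations (5) and (6) can be implemented directly; the sketch below uses numpy and the two-argument arctangent so that the hue angle falls in [0°, 360°). The sample Lab triple is arbitrary.

```python
import numpy as np

def lab_to_lch(L, a, b):
    """Convert CIELAB (L*, a*, b*) to CIELCh (L*, C*, h in degrees), Eqs. (5)-(6)."""
    C = np.sqrt(a ** 2 + b ** 2)                # Eq. (5): chroma
    h = np.degrees(np.arctan2(b, a)) % 360.0    # Eq. (6): hue angle
    return L, C, h

print(lab_to_lch(60.0, 40.0, -30.0))   # e.g. (60.0, 50.0, 323.13...)
```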
The internal relationships among aesthetic emotions can be confirmed by correlation and regression analysis. Based on the hierarchical feed-forward theory of aesthetic texture perception [16, 19], a hierarchical feed-forward model of aesthetic perception for solid colors will be developed.
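A minimal sketch of the intended regression analysis: fitting a linear model from (L*, C*, h°) coordinates to the mean ratings on one semantic scale. The stimuli and ratings below are random placeholders standing in for the experimental data; the hue angle is encoded by cosine and sine because it is circular.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Placeholder stimuli: 36 solid colors described by (L*, C*, h in degrees).
LCh = np.column_stack([rng.uniform(20, 90, 36),
                       rng.uniform(10, 80, 36),
                       rng.uniform(0, 360, 36)])
# Encode circular hue by cos/sin before linear regression.
X = np.column_stack([LCh[:, 0], LCh[:, 1],
                     np.cos(np.radians(LCh[:, 2])),
                     np.sin(np.radians(LCh[:, 2]))])
# Placeholder mean observer ratings on one scale (e.g. Warm-Cold), in [-3, 3].
y = rng.uniform(-3, 3, 36)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_, model.score(X, y))
```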
2.4 Image Vocalization: Image Emotion Detection
Traditionally, handcrafted image features [29], including color, line, texture and shape, have been used for image emotion recognition. Recently, efforts have been made to employ CNNs to extract features for visual emotion analysis [20-22]. Features extracted by CNNs have proved better than manually designed features, as CNNs can learn representations tailored to the task from the samples [71, 72]. Figure 9 illustrates the architecture of AlexNet [73], a very successful model that achieved breakthrough results in image recognition [40-42].
Figure 8. Example images from the IAPS (International Affective Picture System) dataset [74]. Affect becomes more positive from left to right, and arousal increases from bottom to top.
Figure 9. Architecture of a typical convolutional neural network. Each plane is a feature map, i.e. a set of units whose weights are constrained to be identical.
In our proposal, a convolutional neural network (CNN) and a bidirectional recurrent neural network (RNN) will be considered for image emotion recognition:
(1) In the CNN-based method, the convolutional neural network serves as a multi-level feature extractor and emotion classifier: it first classifies the image as positive or negative (valence) and high or low (arousal), and then recognizes the specific emotion based on these binary results.
(2) In the RNN-based method, there are two main parts: a CNN feature extractor and bidirectional GRU (Bi-GRU) feature fusion. The CNN model extracts multiple levels of features at different branches, which represent different aspects of the image, such as line, color, texture and objects. The Bi-GRU model then integrates the different levels of features, which are concatenated for visual emotion classification (a minimal sketch is given below).
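The following is a minimal PyTorch sketch of the second method: a small CNN whose intermediate feature maps are pooled into a sequence of branch features, which a bidirectional GRU then fuses for emotion classification. The layer sizes and the number of emotion classes (eight) are placeholders, not the final architecture.

```python
import torch
import torch.nn as nn

class CnnBiGruEmotion(nn.Module):
    """CNN branches -> Bi-GRU fusion -> emotion logits (sizes are placeholders)."""
    def __init__(self, n_emotions=8):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                          nn.ReLU())
            for c_in, c_out in [(3, 32), (32, 64), (64, 128)]])
        self.proj = nn.ModuleList([nn.Linear(c, 128) for c in (32, 64, 128)])
        self.gru = nn.GRU(128, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, n_emotions)

    def forward(self, img):                              # img: (batch, 3, H, W)
        feats, x = [], img
        for block, proj in zip(self.blocks, self.proj):
            x = block(x)
            feats.append(proj(x.mean(dim=(2, 3))))       # global-average-pool each branch
        seq = torch.stack(feats, dim=1)                  # (batch, n_branches, 128)
        fused, _ = self.gru(seq)
        return self.head(fused[:, -1])                   # logits over emotion classes

if __name__ == "__main__":
    logits = CnnBiGruEmotion()(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                  # torch.Size([2, 8])
```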
2.5 Image Musicalization
2.6 Music Picturization
REFERENCES
1. Singh, S., Impact of color on marketing. Management Decision, 2006. 44(6): p. 783-789.
2. Deng, X., S.K. Hui, and J.W. Hutchinson, Consumer preferences for color combinations: An empirical analysis of similarity-based color relationships. Journal of Consumer Psychology, 2010. 20(4): p. 476-484.
3. Hsiao, S.W., M.H. Yang, and C.H. Lee, An aesthetic measurement method for matching colours in product design. Color Research & Application, 2017. 42(5): p. 664-683.
4. Solli, M. and R. Lenz, Color emotions for multi-colored images. Color Research & Application, 2011. 36(3): p. 210-221.
5. Bogdanov, D., et al., Semantic audio content-based music recommendation and visualization based on user preference examples. Information Processing & Management, 2013. 49(1): p. 13-33.
6. Wang, J.-C., et al. The acoustic emotion Gaussians model for emotion-based music annotation and retrieval. in Proceedings of the 20th ACM international conference on Multimedia. 2012. ACM.
7. Li, T., M. Ogihara, and Q. Li. A comparative study on content-based music genre classification. in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. 2003. ACM.
8. Malandrino, D., et al. A Color-based Visualization Approach to understand harmonic structures of Musical Compositions. in Information Visualisation (iV), 2015 19th International Conference on. 2015. IEEE.
9. Ou, L.C., et al., A study of colour emotion and colour preference. Part I: Colour emotions for single colours. Color Research and Application, 2004. 29(3): p. 232-240.
10. Ou, L.C., et al., A study of colour emotion and colour preference. part II: Colour emotions for two-colour combinations. Color Research and Application, 2004. 29(4): p. 292-298.
11. Ou, L.C., et al., A study of colour emotion and colour preference. Part III: Colour preference Modeling. Color Research and Application, 2004. 29(5): p. 381-389.
12. Gao, X.P. and J.H. Xin, Investigation of human's emotional responses on colors. Color Research & Application, 2006. 31(5): p. 411-417.
13. Xin, J.H., et al., Cross-regional comparison of colour emotions part I: Quantitative analysis. Color Research And Application, 2004. 29(6): p. 451-457.
14. Xin, J., et al., Cross-regional comparison of colour emotions Part II: Qualitative analysis. Color Research & Application, 2004. 29(6): p. 458-466.
15. Ou, L.C., et al., Universal models of colour emotion and colour harmony. Color Research & Application, 2018. 43(5): p. 736-748.
16. Thumfart, S., et al., Modeling Human Aesthetic Perception of Visual Textures. Acm Transactions on Applied Perception, 2011. 8(4).
17. Leder, H. and M. Nadal, Ten years of a model of aesthetic appreciation and aesthetic judgments: The aesthetic episode - Developments and challenges in empirical aesthetics. British Journal of Psychology, 2014. 105(4): p. 443-464.
18. Zhao, S., et al. Exploring principles-of-art features for image emotion recognition. in Proceedings of the 22nd ACM international conference on Multimedia. 2014. ACM.
19. Liu, J.L., E. Lughofer, and X.Y. Zeng, Could linear model bridge the gap between low-level statistical features and aesthetic emotions of visual textures? Neurocomputing, 2015. 168: p. 947-960.
20. Yang, J., D. She, and M. Sun. Joint image emotion classification and distribution learning via deep convolutional neural network. in Int. J. Conf. Artif. Intell. 2017.
21. Zhu, X., et al. Dependency exploitation: a unified CNN-RNN approach for visual emotion recognition. in Proceedings of the Internal Joint Conference on Artificial Intelligence (IJCAI 2017). 2017.
22. He, X. and W. Zhang, Emotion recognition by assisted learning with convolutional neural networks. Neurocomputing, 2018. 291: p. 187-194.
23. Alarcão, S.M. and M.J. Fonseca, Identifying emotions in images from valence and arousal ratings. Multimedia Tools and Applications, 2017: p. 1-23.
24. Redies, C., Combining universal beauty and cultural context in a unifying model of visual aesthetic experience. Frontiers in Human Neuroscience, 2015. 9.
25. Dunker, P., et al. Content-based mood classification for photos and music: a generic multi-modal classification framework and evaluation approach. in Proceedings of the 1st ACM international conference on Multimedia information retrieval. 2008. ACM.
26. Piczak, K.J. Environmental sound classification with convolutional neural networks. in 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). 2015.
27. Schuller, B., et al. Automatic recognition of emotion evoked by general sound events. in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. 2012. IEEE.
28. Nielsen, A.B., L.K. Hansen, and U. Kjems. Pitch Based Sound Classification. in ICASSP (3). 2006.
29. Chen, C.-H., et al. Emotion-based music visualization using photos. in International Conference on Multimedia Modeling. 2008. Springer.
30. Ciuha, P., B. Klemenc, and F. Solina. Visualization of concurrent tones in music with colours. in Proceedings of the 18th ACM international conference on Multimedia. 2010. ACM.
31. Mardirossian, A. and E. Chew. Visualizing Music: Tonal Progressions and Distributions. in ISMIR. 2007. Citeseer.
32. Juslin, P.N. and D. Västfjäll, Emotional responses to music: The need to consider underlying mechanisms. Behavioral and Brain Sciences, 2008. 31(5): p. 559-575.
33. Kim, Y.E., et al. Music emotion recognition: A state of the art review. in Proc. ISMIR. 2010. Citeseer.
34. Koelsch, S., Towards a neural basis of music-evoked emotions. Trends in cognitive sciences, 2010. 14(3): p. 131-137.
35. Yang, Y.-H. and H.H. Chen, Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology (TIST), 2012. 3(3): p. 40.
36. Jiang, L., et al., Emotion categorization based on probabilistic latent emotion analysis model. Proceedings of the 2016 IEEE 13th International Conference on Signal Processing (ICSP 2016), 2016: p. 618-623.
37. Delbouys, R., et al., Music Mood Detection Based On Audio And Lyrics With Deep Neural Net. arXiv preprint arXiv:1809.07276, 2018.
38. Tong, H., et al. Music Mood Classification Based on Lifelog. in China Conference on Information Retrieval. 2018. Springer.
39. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. Nature, 2015. 521(7553): p. 436-444.
40. He, K.M., et al., Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: p. 770-778.
41. Szegedy, C., et al. Going deeper with convolutions. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
42. Krizhevsky, A., I. Sutskever, and G.E. Hinton, ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017. 60(6): p. 84-90.
43. Ren, S.Q., et al., Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 39(6): p. 1137-1149.
44. He, K., et al., Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
45. Newell, A., K.U. Yang, and J. Deng, Stacked hourglass networks for human pose estimation. Computer Vision - ECCV 2016, Part VIII, 2016. 9912: p. 483-499.
46. Chen, Y., et al., Cascaded pyramid network for multi-person pose estimation. arXiv preprint arXiv:1711.07319, 2017.
47. Papandreou, G., et al., Towards accurate multi-person pose estimation in the wild. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: p. 3711-3719.
48. Yamaguchi, K., et al. Parsing clothing in fashion photographs. in Computer Vision and Pattern Recognition. 2012.
49. Yamaguchi, K., M.H. Kiapour, and T. Berg, Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items. IEEE, 2013: p. 3519-3526.
50. Goodfellow, I., et al. Generative adversarial nets. in Advances in neural information processing systems. 2014.
51. Goodfellow, I., NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
52. Creswell, A., et al., Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine, 2018. 35(1): p. 53-65.
53. Yu, L., et al. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. in AAAI. 2017.
54. Lee, S.-g., et al., A seqgan for polyphonic music generation. arXiv preprint arXiv:1710.11418, 2017.
55. Plutchik, R., The emotions. 1991: University Press of America.
56. Ekman, P., An argument for basic emotions. Cognition & emotion, 1992. 6(3-4): p. 169-200.
57. Izard, C.E., The face of emotion. 1971.
58. Russell, J.A., A circumplex model of affect. Journal of personality and social psychology, 1980. 39(6): p. 1161.
59. Bischoff, K., et al. Music Mood and Theme Classification-a Hybrid Approach. in ISMIR. 2009.
60. Medhat, F., D. Chesmore, and J. Robinson. Masked conditional neural networks for environmental sound classification. in International Conference on Innovative Techniques and Applications of Artificial Intelligence. 2017. Springer.
61. Rao, T., M. Xu, and D. Xu, Learning multi-level deep representations for image emotion classification. arXiv preprint arXiv:1611.07145, 2016.
62. Liu, J., E. Lughofer, and X. Zeng, Toward Model Building for Visual Aesthetic Perception. Computational intelligence and neuroscience, 2017. 2017.
63. Machajdik, J. and A. Hanbury. Affective image classification using features inspired by psychology and art theory. in Proceedings of the 18th ACM international conference on Multimedia. 2010. ACM.
64. Juslin, P.N. and J.A. Sloboda, Music and emotion: Theory and research. 2001: Oxford University Press.
65. Choi, K., G. Fazekas, and M. Sandler, Explaining deep convolutional neural networks on music classification. arXiv preprint arXiv:1607.02444, 2016.
66. Choi, K., Deep Neural Networks for Music Tagging. 2018, Queen Mary University of London.
67. Lee, J., et al., Samplecnn: End-to-end deep convolutional neural networks using very small filters for music classification. Applied Sciences, 2018. 8(1): p. 150.
68. Choi, K., et al. Convolutional recurrent neural networks for music classification. in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017. IEEE.
69. Cytowic, R.E., Synesthesia: A union of the senses. 2002: MIT press.
70. Lu, X., et al. On shape and the computability of emotions. in Proceedings of the 20th ACM international conference on Multimedia. 2012. ACM.
71. You, Q., et al. Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark. in AAAI. 2016.
72. Kavukcuoglu, K., et al. Learning convolutional feature hierarchies for visual recognition. in Advances in neural information processing systems. 2010.
73. Krizhevsky, A., I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. in Advances in neural information processing systems. 2012.
74. Lang, P.J., International affective picture system (IAPS): Affective ratings of pictures and instruction manual. Technical report, 2005.