Understanding and correctly identifying human emotional states is important for mental health providers. Can artificial intelligence (AI) machine learning approach the human capacity for cognitive empathy? A new peer-reviewed study shows how AI can detect emotions from audio clips as short as 1.5 seconds with accuracy on par with human performance.
“The human voice serves as a powerful means of expressing emotional states because it provides understandable clues about the sender's situation and can be transmitted over long distances,” wrote the study's lead author, Hannes Diemerling of the Max Planck Institute for Human Development's Center for Lifespan Psychology, in collaboration with Germany-based psychology researchers Leonie Stresemann, Tina Braun, and Timo von Oertzen.
In AI deep learning, the quality and quantity of training data are critical to algorithm performance and accuracy. The audio data used in this study consisted of over 1,500 unique audio clips drawn from open-source emotion databases in English and German: the English recordings came from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the German recordings from the Berlin Database of Emotional Speech (Emo-DB).
“Emotion recognition from voice recordings is a rapidly evolving field with significant implications for artificial intelligence and human-computer interaction,” the researchers wrote.
For the purpose of this study, the researchers narrowed down emotional states into six categories: joy, fear, neutral, anger, sadness, and disgust. Voice recordings were segmented into 1.5-second clips, and various features were quantified. The quantified features include pitch tracking, amplitude, phase, spectral bandwidth, MFCCs, chroma, tonnetz, spectral contrast, spectral rolloff, fundamental frequency, spectral centroid, zero-crossing rate, root mean square, HPSS, spectral flatness, and the raw audio signal.
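To make the preprocessing step concrete, below is a minimal Python sketch of how recordings might be cut into 1.5-second segments. It uses the librosa library; the file name and segmentation details are illustrative assumptions, not specifics taken from the study.

```python
import librosa

# Illustrative only: the path and the non-overlapping segmentation are assumptions.
AUDIO_PATH = "speech_sample.wav"
SEGMENT_SECONDS = 1.5

# Load the recording at its native sampling rate.
y, sr = librosa.load(AUDIO_PATH, sr=None)

# Split the waveform into non-overlapping 1.5-second segments.
samples_per_segment = int(SEGMENT_SECONDS * sr)
segments = [
    y[start:start + samples_per_segment]
    for start in range(0, len(y) - samples_per_segment + 1, samples_per_segment)
]

print(f"{len(segments)} segments of {SEGMENT_SECONDS} s each at {sr} Hz")
```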
Psychoacoustics is the psychology of sound and the science of human sound perception. The frequency (pitch) and amplitude (loudness) of audio have a significant impact on how people experience sound. In psychoacoustics, pitch corresponds to the frequency of a sound, which is measured in hertz (Hz) and kilohertz (kHz); the higher the frequency, the higher the pitch. Amplitude corresponds to the loudness of a sound and is measured in decibels (dB); the greater the amplitude, the louder the sound.
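As a rough illustration of how pitch (Hz) and amplitude (dB) can be quantified from one such segment, the sketch below uses librosa's pYIN pitch tracker and its amplitude-to-decibel conversion. It assumes the `segments` list and sampling rate `sr` from the previous sketch; the pitch-range bounds are illustrative choices.

```python
import librosa
import numpy as np

# One 1.5-second clip from the previous sketch.
segment = segments[0]

# Track the fundamental frequency (pitch) frame by frame.
f0, voiced_flag, voiced_prob = librosa.pyin(
    segment,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz lower bound (assumption)
    fmax=librosa.note_to_hz("C7"),  # ~2093 Hz upper bound (assumption)
    sr=sr,
)
mean_pitch_hz = np.nanmean(f0)  # average pitch over voiced frames

# Express amplitude in decibels relative to the loudest sample.
amplitude_db = librosa.amplitude_to_db(np.abs(segment), ref=np.max)

print(f"Mean pitch: {mean_pitch_hz:.1f} Hz, "
      f"level range: {amplitude_db.min():.1f} to {amplitude_db.max():.1f} dB")
```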
Spectral bandwidth (spectral spread) is the range between the upper and lower frequencies of a signal and is derived from the spectral centroid. The spectral centroid characterizes the spectrum of an audio signal as its center of mass. Spectral flatness measures how uniformly energy is distributed across frequencies, distinguishing noise-like from tone-like signals. Spectral rolloff identifies the frequency below which most of the signal's energy is concentrated.
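A hedged sketch of how these spectral descriptors can be computed follows, again with librosa; the specific functions and the mean-over-time summary are assumptions about tooling, not the study's exact pipeline.

```python
import librosa
import numpy as np

# Assumes `segment` and `sr` as before; each feature is a per-frame curve
# that can be summarized (e.g., by its mean) for a classifier.
centroid  = librosa.feature.spectral_centroid(y=segment, sr=sr)   # center of mass of the spectrum
bandwidth = librosa.feature.spectral_bandwidth(y=segment, sr=sr)  # spread around the centroid
rolloff   = librosa.feature.spectral_rolloff(y=segment, sr=sr)    # frequency below which most energy lies
flatness  = librosa.feature.spectral_flatness(y=segment)          # noise-like vs. tone-like
contrast  = librosa.feature.spectral_contrast(y=segment, sr=sr)   # peak-to-valley contrast per band

spectral_summary = np.hstack([
    centroid.mean(), bandwidth.mean(), rolloff.mean(),
    flatness.mean(), contrast.mean(axis=1),
])
print(spectral_summary.shape)
```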
MFCCs (Mel-frequency cepstral coefficients) are widely used features in audio processing that compactly describe the overall shape of a sound's spectrum.
Chroma, or pitch class profile, represents the distribution of energy across pitch classes (usually the 12 semitones of an octave) and is commonly used to analyze the key of a piece of music.
In music theory, a Tonnetz (German for “tone network”) is a visual representation of the relationships between chords in neo-Riemannian theory, named after the German music scholar Hugo Riemann (1849-1919), one of the founders of modern musicology.
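MFCCs, chroma, and the Tonnetz representation described above can each be extracted in a single librosa call. The snippet below is a minimal sketch under the same assumptions as the earlier ones; averaging over time to get a fixed-length descriptor is an illustrative choice.

```python
import librosa
import numpy as np

# Assumes `segment` and `sr` as before.
mfccs   = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)  # spectral-envelope shape
chroma  = librosa.feature.chroma_stft(y=segment, sr=sr)      # energy per pitch class (12 semitones)
tonnetz = librosa.feature.tonnetz(y=segment, sr=sr)          # tonal centroid (Tonnetz) coordinates

# Average each feature over time to obtain a fixed-length descriptor per clip.
feature_vector = np.concatenate([
    mfccs.mean(axis=1),    # 13 values
    chroma.mean(axis=1),   # 12 values
    tonnetz.mean(axis=1),  # 6 values
])
print(feature_vector.shape)  # (31,)
```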
A common acoustic feature for audio analysis is the zero-crossing rate (ZCR). For a frame of an audio signal, the zero-crossing rate measures the number of times the signal amplitude changes sign and crosses the x-axis.
In audio production, root mean square (RMS) measures the average loudness or power of an audio waveform over time.
HPSS (harmonic-percussive source separation) is a method of decomposing an audio signal into its harmonic and percussive components.
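The zero-crossing rate, root mean square, and harmonic-percussive separation can likewise be computed with librosa, as sketched below under the same assumptions as the earlier snippets.

```python
import librosa
import numpy as np

# Assumes `segment` and `sr` as before.
zcr = librosa.feature.zero_crossing_rate(segment)  # fraction of sign changes per frame
rms = librosa.feature.rms(y=segment)                # root-mean-square energy per frame

# Split the signal into harmonic (tonal) and percussive (transient) parts.
harmonic, percussive = librosa.effects.hpss(segment)

print(f"Mean ZCR: {zcr.mean():.4f}, mean RMS: {rms.mean():.4f}, "
      f"harmonic/percussive energy ratio: {np.sum(harmonic**2) / np.sum(percussive**2):.2f}")
```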
The researchers combined Python, TensorFlow, and Bayesian optimization to implement three different AI deep learning models for classifying emotions from short audio clips and benchmarked the results against human performance. The models evaluated include a deep neural network (DNN) that processes the quantified features, a convolutional neural network (CNN) that analyzes spectrograms, and a hybrid model that combines both. The objective was to see which model performed best.
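For a sense of what such models look like in code, here is a minimal TensorFlow/Keras sketch of a small DNN over a feature vector and a small CNN over a spectrogram. The layer sizes and input shapes are illustrative assumptions, and the Bayesian hyperparameter optimization the researchers used is omitted for brevity.

```python
import tensorflow as tf

NUM_EMOTIONS = 6           # joy, fear, neutral, anger, sadness, disgust
NUM_FEATURES = 193         # illustrative length of the quantified feature vector
SPEC_SHAPE = (128, 65, 1)  # illustrative mel-spectrogram shape for a 1.5 s clip

# DNN: classifies a fixed-length vector of quantified audio features.
dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

# CNN: classifies a spectrogram image of the same clip.
cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=SPEC_SHAPE),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

for model in (dnn, cnn):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```

A hybrid model along the lines described above would feed both inputs through parallel branches and merge them before a final softmax layer.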
The researchers found that the AI models' emotion classification accuracy was overall better than chance and comparable to human performance. Among the three AI models, the deep neural network and the hybrid model performed better than the convolutional neural network.
The combination of artificial intelligence and data science, applied to features drawn from psychology and psychoacoustics, demonstrates the potential for machines to perform speech-based cognitive empathy tasks that rival human-level performance.
“This interdisciplinary research, bridging psychology and computer science, highlights the potential for advances in automatic emotion recognition and a wide range of applications,” the researchers concluded.
Copyright © 2024 Cami Rosso. All rights reserved.