Hi @andy-connory - This is exactly what we do! It’s a complex methodology, but in essence: Audio recordings are first processed to separate speech from noise or silence and then to determine who spoke when. Next, we extract a rich set of interaction-level and speaking-style-related features, such as pause duration, speaking rate, and intonation-related characteristics, at various temporal resolutions. Overall, more than 600 features are used to represent each utterance in the recording in the emotional and behavioral domain. Based on these features, we then train a set of deep models to identify various emotional and behavioral states. Finally, a separate set of models is trained on these emotional and behavioral outputs to predict the domain-specific KPIs.
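
To make the two-stage idea concrete, here is a minimal, self-contained sketch of that shape of pipeline: utterance-level prosodic proxies feed per-state classifiers, and the classifier outputs feed a KPI model. This is purely illustrative and not our production code; the feature function, the "frustrated"/"engaged" labels, and the KPI target are all hypothetical stand-ins, and the real system uses diarization and 600+ features rather than the handful shown here.

```python
# Illustrative sketch only; all names and labels below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def extract_prosodic_features(y, sr=16000, frame=0.025, hop=0.010):
    """Toy utterance-level features: energy stats, pause ratio, rate proxy."""
    frame_len, hop_len = int(frame * sr), int(hop * sr)
    frames = np.lib.stride_tricks.sliding_window_view(y, frame_len)[::hop_len]
    energy = frames.std(axis=1)
    voiced = energy > 0.5 * energy.mean()          # crude speech/silence split
    pause_ratio = 1.0 - voiced.mean()              # share of silent frames
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.array([energy.mean(), energy.std(), pause_ratio,
                     zcr.mean(), zcr.std()])

# Stage 1: acoustic features -> emotional/behavioral state models
rng = np.random.default_rng(0)
X = np.stack([extract_prosodic_features(rng.standard_normal(16000 * 3))
              for _ in range(200)])                # 200 synthetic 3 s utterances
y_frustrated = rng.integers(0, 2, size=200)        # placeholder labels
y_engaged = rng.integers(0, 2, size=200)

frustrated_clf = LogisticRegression(max_iter=1000).fit(X, y_frustrated)
engaged_clf = LogisticRegression(max_iter=1000).fit(X, y_engaged)

# Stage 2: emotional/behavioral outputs -> domain-specific KPI
Z = np.column_stack([frustrated_clf.predict_proba(X)[:, 1],
                     engaged_clf.predict_proba(X)[:, 1]])
kpi = rng.normal(size=200)                         # placeholder KPI values
kpi_model = GradientBoostingRegressor().fit(Z, kpi)
print("predicted KPI for first utterance:", kpi_model.predict(Z[:1])[0])
```

The point of the sketch is the layering: the KPI model never sees raw audio or raw features, only the intermediate emotional/behavioral predictions, which is what lets the same state models be reused across different domain KPIs.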