Special Journal Issue on “Automatic Speech Recognition and Understanding in Air Traffic Management” concluded

By summer 2023, the Aerospace journal special issue on Automatic Speech Recognition and Understanding (ASRU) in Air Traffic Management (ATM) already comprised six papers. By the end of 2023, it was complemented by six further scientific articles. The completed special issue is available at: https://www.mdpi.com/journal/aerospace/special_issues/COQJVT00W7.

Voice communication between ATCos and pilots is not always split into separate voice streams. Khalil et al. present a pipeline that deploys (i) speech activity detection to identify speech segments, (ii) a speech-to-text system to generate transcriptions for the audio segments, (iii) text-based speaker role classification to detect whether the ATCo or the pilot is speaking, and (iv) unsupervised speaker clustering to group the obtained utterances by individual pilot speaker.
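
A minimal sketch of how these four stages could be wired together, with toy stand-ins for every component (the actual VAD, ASR, classifier, and clustering models of the paper are not reproduced here):

```python
import math
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float        # segment start [s]
    end: float          # segment end [s]
    feats: tuple        # toy stand-in for a speaker embedding
    text: str = ""      # ASR transcript
    role: str = ""      # "ATCO" or "PILOT"
    cluster: int = -1   # pilot speaker cluster id

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / ((na * nb) or 1.0)

def run_pipeline(stream, asr, threshold=0.9):
    # (i) Speech activity detection: this toy "stream" is assumed to be
    # already segmented into (start, end, feats) tuples.
    utts = [Utterance(s, e, f) for s, e, f in stream]
    centroids = []
    for u in utts:
        # (ii) Speech-to-text: 'asr' is an injected stub for a real model.
        u.text = asr(u.feats)
        # (iii) Text-based role classification; a keyword heuristic stands
        # in here for a trained text classifier.
        u.role = "PILOT" if any(w in u.text for w in ("roger", "wilco")) else "ATCO"
        # (iv) Unsupervised clustering of pilot utterances: assign to the
        # most similar centroid, or open a new cluster for a new voice.
        if u.role == "PILOT":
            sims = [cosine(u.feats, c) for c in centroids]
            if sims and max(sims) >= threshold:
                u.cluster = sims.index(max(sims))
            else:
                centroids.append(u.feats)
                u.cluster = len(centroids) - 1
    return utts
```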

Xu et al. show that air traffic control speech data can also be used for purposes where the recognition of words is not relevant. They combine speech and gaze data from a laboratory environment to detect fatigue, fusing the indicators with the entropy weight method. The authors compare the automatic fatigue-state recognition with the controllers' self-ratings on the Karolinska Sleepiness Scale and achieve an accuracy of 86%.
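
The entropy weight method itself is compact enough to sketch: indicators whose values vary more across samples carry lower entropy and therefore receive higher weights. The snippet below uses the standard formulation on invented indicator values; the paper's actual speech and gaze indicators and preprocessing are not reproduced.

```python
import math

def entropy_weights(X):
    """Standard entropy weight method for an m-samples x n-indicators
    matrix X of non-negative values (assumed already normalized)."""
    m, n = len(X), len(X[0])
    weights = []
    for j in range(n):
        col = [row[j] for row in X]
        total = sum(col)
        p = [v / total for v in col]
        e = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(m)
        weights.append(1.0 - e)      # low entropy -> high weight
    s = sum(weights)
    return [w / s for w in weights]

# Invented per-sample indicators, e.g. [speech pause rate, blink rate,
# eyelid closure] -- purely illustrative values:
X = [[0.12, 0.30, 0.25],
     [0.40, 0.32, 0.55],
     [0.35, 0.60, 0.50]]
w = entropy_weights(X)
fused_scores = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
print(w, fused_scores)
```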

The development of machine-learning-based ASR systems demands large-scale annotated datasets, which are currently lacking in the field. The ATCO2 project aimed to develop a unique platform to collect, preprocess, and transcribe large amounts of ATC audio data from airspace in real time. The paper by Zuluaga et al. reviews (i) robust ASR, (ii) natural language processing, (iii) English language identification, and (iv) contextual ASR biasing with surveillance data.
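
As an illustration of (iv), contextual biasing can be pictured as reranking n-best ASR hypotheses so that hypotheses containing a callsign currently visible in surveillance data receive a score bonus. The scoring scheme below is a simplified assumption for illustration, not the ATCO2 implementation.

```python
def rescore_with_context(nbest, active_callsigns, boost=2.0):
    """Rerank (hypothesis, acoustic_score) pairs, boosting hypotheses
    that contain the spoken form of a callsign known from surveillance
    data. Simplified illustration only."""
    def bonus(hyp):
        return boost * sum(cs in hyp for cs in active_callsigns)
    return sorted(nbest, key=lambda hs: hs[1] + bonus(hs[0]), reverse=True)

# Spoken forms of callsigns currently in the sector (from surveillance):
context = {"lufthansa three two alfa", "speedbird one seven"}
nbest = [
    ("loft answer three two alfa descend", -12.3),  # acoustically best
    ("lufthansa three two alfa descend", -12.9),    # contextually right
]
best, score = rescore_with_context(nbest, context)[0]
print(best)  # -> "lufthansa three two alfa descend"
```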

The work of Park and Na investigates on-board ASR under severely limited computing resources. The authors propose that variable Hidden Markov Models (HMMs) are sufficient for this setting, although Deep Neural Networks (DNNs) are known to outperform HMMs in complex application domains, e.g., noisy environments with complex, unstructured grammar. In the considered application, a pilot flies and simultaneously controls multiple UAVs via voice, and the average sentence length is less than four words. The variable HMM achieves a word error rate of 0.86% in a simulation environment, with the DNN only slightly better at 0.80%. However, the recognition speed on the same hardware differs by a factor of more than 100, i.e., 60 ms versus 6500 ms. In the complex real-life pilot-controller communication environment, the performance gap would likely be much larger, as the best engines currently achieve word error rates between 5% and 10% for pilot speech. Moreover, the speech understanding task there is far more complex, because it cannot be mapped to a simple decision tree.
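
To illustrate why HMM decoding stays cheap on constrained hardware, the sketch below shows a generic textbook Viterbi pass, whose cost is O(T·N²) for T observations and N states; with a small command grammar and sentences under four words, both factors remain tiny. This is not the authors' variable-HMM formulation, just the standard decoding algorithm underlying HMM recognizers.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations.
    Generic textbook Viterbi, O(T * N^2)."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Toy two-word command grammar: states are words, observations are
# acoustic symbols (grossly simplified, values invented).
states = ("climb", "hold")
start = {"climb": 0.5, "hold": 0.5}
trans = {"climb": {"climb": 0.6, "hold": 0.4},
         "hold": {"climb": 0.4, "hold": 0.6}}
emit = {"climb": {"c": 0.7, "l": 0.3}, "hold": {"c": 0.2, "l": 0.8}}
print(viterbi(("c", "l"), states, start, trans, emit))
```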

Bringing ASRU technology from the laboratory to the ops room requires a safety assessment. The safety assessment process consists of defining design requirements for ASR technology in normal, abnormal, and degraded modes of ATC operations. Pinska-Chauvin et al. identified eight functional hazards based on the analysis of four use cases. The safety assessment was supported by top-down and bottom-up modelling and analysis of the causes of the hazards, in order to derive system design requirements that mitigate them. Evidence that the specified design requirements were achieved came from two real-time simulations with pre-industrial ASR prototypes for pre-filling radar labels and callsign highlighting in approach and en-route operational environments. It was demonstrated that the use of ASR does not increase safety risks. This paper was selected for the cover of the November issue of the MDPI Aerospace journal.

The task of callsign highlighting in situation displays requires reliable detection of callsigns in radio telephony utterances. This topic is also addressed in the paper of Saïd Kasttet et al. They use Automatic Dependent Surveillance-Broadcast (ADS-B) data, which contains flight numbers, to match the words recognized from ground-recorded air traffic control utterances against existing callsigns.
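
The core matching step can be pictured as follows: expand each callsign from the surveillance picture into its spoken form (airline telephony designator plus spelled-out characters), then score it against the recognized word sequence. The telephony table, spoken-alphabet table, and similarity scoring below are illustrative assumptions, not the method of the paper.

```python
import difflib

# Illustrative lookup tables (abridged, not from the paper):
TELEPHONY = {"DLH": "lufthansa", "BAW": "speedbird", "AFR": "airfrans"}
SPOKEN = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
          "A": "alfa", "B": "bravo", "C": "charlie", "D": "delta"}

def spoken_form(icao_callsign):
    """Expand e.g. 'DLH32A' -> 'lufthansa three two alfa'."""
    airline, tail = icao_callsign[:3], icao_callsign[3:]
    words = [TELEPHONY.get(airline, airline.lower())]
    words += [SPOKEN.get(ch, ch.lower()) for ch in tail]
    return " ".join(words)

def best_callsign(transcript, adsb_callsigns, min_ratio=0.7):
    """Match the recognized words against the spoken forms of callsigns
    currently present in the ADS-B picture."""
    scored = []
    for cs in adsb_callsigns:
        form = spoken_form(cs)
        # Compare against a transcript prefix roughly as long as the
        # spoken callsign, since callsigns usually open the utterance.
        ratio = difflib.SequenceMatcher(None, form,
                                        transcript[:len(form) + 10]).ratio()
        scored.append((ratio, cs))
    ratio, cs = max(scored)
    return cs if ratio >= min_ratio else None

print(best_callsign("lufthansa three two alfa descend flight level eight zero",
                    ["DLH32A", "BAW17"]))  # -> DLH32A
```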