The first run of manual transcriptions finished

The HAAWAII project plan expects three runs of manual transcription (or manual correction of automatic transcriptions) of recordings of the communication between ATC and pilots. For two of the ATC consortium members – NATS (London) and Isavia (Iceland) – the three runs should produce 1, 4, and 15 hours of transcribed speech, respectively, for each member. In fact, it may end up being slightly more: the first run, finished in February 2021, resulted in about 2 hours of speech from NATS and more than 1.75 hours from Isavia.

The transcription was preceded by manual correction of the automatic splitting of long speech segments and of their classification into air traffic controller or pilot speech. The feedback, questions, and comments from subject matter experts at NATS and Isavia have already shown what it means to cooperate with professionals and why it is extremely valuable and interesting to have them in a project. “Ordinary mortals”, i.e. speech recognition experts without a deep ATC background, could never understand some recordings, and the question of whether the background conversation of the second pilot should be transcribed or annotated as cross-talk – while everybody else in the consortium identified it as just noise – amused us for a long time.

An interesting research question also emerged from this phase: there seem to be no easy mechanisms or tools to support splitting a WAV file into partially overlapping speech segments. If one does not want to burden end users like air traffic controllers with complex interfaces such as that of Audacity, there is no way to produce cleanly segmented parts where one segment contains the end of one speaker’s transmission and another contains the full recording of the second speaker’s, when the two partially overlap. The HAAWAII controllers will continue to use the SpokenData web interface in the project, and cross-talk will be handled at the transcription level only.
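
Cutting such segments programmatically is, of course, not the hard part – the difficulty lies in exposing it to non-expert users. As a minimal sketch, assuming hypothetical file names and segment boundaries, two partially overlapping segments could be extracted from one WAV file with nothing but the Python standard library:

```python
# A minimal sketch of cutting two partially overlapping speaker segments out
# of one WAV file. The file names and segment times are hypothetical; in
# HAAWAII the overlap is resolved at the transcription level instead.
import wave

def cut_segment(src_path: str, dst_path: str, start_s: float, end_s: float) -> None:
    """Copy the samples between start_s and end_s (seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start_s * rate))
        frames = src.readframes(int((end_s - start_s) * rate))
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(src.getparams())  # header is patched on close
            dst.writeframes(frames)

# Two segments overlapping between 4.2 s and 5.0 s: the controller's
# transmission and the pilot's read-back that starts before it ends.
cut_segment("recording.wav", "controller.wav", 1.0, 5.0)
cut_segment("recording.wav", "pilot.wav", 4.2, 8.5)
```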

Of course, the transcribers were not happy with the quality of the output produced by the general (not yet adapted) automatic speech recognizer employed to pre-transcribe the first hours. They had to manually correct all the recurring names of waypoints and other location-specific words, and even the most frequent words, appearing in almost every utterance (such as Reykjavik), had to be corrected many times during transcription. The technical teams from IDIAP and BUT immediately started their work on adapting the recognition model, so that the transcription of the additional 4 hours of communication should go much more smoothly. Stay tuned…
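
As a rough illustration of one ingredient of such an adaptation – a sketch, not the actual IDIAP/BUT pipeline – the manually corrected transcripts could be mined for domain-specific words (waypoint names, Reykjavik, call signs) that are missing from the recognizer’s general vocabulary; the file names below are hypothetical:

```python
# Harvest words that are frequent in the corrected transcripts but absent
# from a general vocabulary, as candidates for lexicon and language-model
# adaptation. Inputs are hypothetical plain-text files, one per transcript.
from collections import Counter

def domain_words(transcript_paths, general_vocab_path, min_count=3):
    """Return words frequent in the transcripts but missing from the general vocabulary."""
    with open(general_vocab_path, encoding="utf-8") as f:
        general = {line.strip().lower() for line in f}
    counts = Counter()
    for path in transcript_paths:
        with open(path, encoding="utf-8") as f:
            counts.update(f.read().lower().split())
    return [w for w, c in counts.most_common() if c >= min_count and w not in general]

# e.g. domain_words(["run1_nats.txt", "run1_isavia.txt"], "general_vocab.txt")
```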