If your task is to listen to ATC communication and write down what was said word for word, the challenge is to keep transcriptions consistent, independent of who transcribed them.
Is it “klm604”, “k_l_m six oh four”, “KLM six o for”, “~k~l~m 6 ohh 4 ”, “kay el am six ouh four”?
HAAWAII has already refined and extended a set of rules for the transcription (and annotation) of ATC utterances that was agreed on by 22 European partners from ATM research and industry.
However, feedback from air traffic controllers who actively transcribed, together with the results of automatic and manual checks, showed a need for further improvements.
The refined definitions are meant to make transcriptions consistent, less ambiguous, and better suited for the automatic training of models such as the acoustic model, language model, and command extraction model.
Hence, the HAAWAII team (especially the partners DLR, BUT, and Idiap) updated deliverable D3.1, the “Transcription and Annotation Handbook” (for excerpt click here).
The updates include, for example:
- Replacing hyphens and underscores with a blank “ ”, and special characters from other languages “ä, ö, ü, ß, é” with the closest English expression “ae, oe, ue, ss, e”, respectively
- Consistent splitting of meaningful words into multiple parts, e.g., “german air force” instead of “german_airforce” and “korean air” instead of “koreanair”
- Clarification on the use of upper-case letters when English spelling is used instead of ICAO spelling, e.g., “KLM”, “QNH”, and “Rnav” (in “Rnav”, the English letter “r” is spoken, whereas “nav” is pronounced as a word)
- Further replacements of frequently misspelled words, e.g., “until” instead of “untill”, “victor” instead of “viktor”, and “ILS” instead of “ils”
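
The replacement rules above lend themselves to a small normalization routine. The following Python sketch is only an illustration under stated assumptions: the function name `normalize_transcription` and the two lookup tables are hypothetical and contain only the examples mentioned in this article, not the full rule set of the handbook or the project's actual implementation.

```python
import re

# Assumed mapping, limited to the special characters named in the list above.
SPECIAL_CHARS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "é": "e"}

# Assumed excerpt of word-level fixes, limited to the examples in the list above.
WORD_FIXES = {
    "untill": "until",
    "viktor": "victor",
    "ils": "ILS",
    "airforce": "air force",
    "koreanair": "korean air",
}


def normalize_transcription(text: str) -> str:
    """Apply the example replacement rules to one transcription line."""
    # Replace special characters with their closest English expression.
    for char, replacement in SPECIAL_CHARS.items():
        text = text.replace(char, replacement)
    # Hyphens and underscores become blanks.
    text = re.sub(r"[-_]", " ", text)
    # Word-level fixes; the lookup is case-sensitive, so "KLM" or "ILS" stay as they are.
    words = [WORD_FIXES.get(word, word) for word in text.split()]
    # Joining on single blanks also collapses repeated whitespace.
    return " ".join(words)


if __name__ == "__main__":
    for raw in ["german_airforce", "koreanair", "untill viktor ils"]:
        print(f"{raw!r} -> {normalize_transcription(raw)!r}")
```

The sketch handles lower-case special characters only and knows nothing about the upper-case rule for spelled abbreviations such as “KLM” beyond leaving them untouched; a real tool would need the complete word lists from the handbook.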
All of these updates help to ensure consistently high-quality transcriptions for the derived models and make it easier to exchange transcriptions between partners.
In the course of applying the airline designator rule and the spelling updates, several tens of thousands of transcription files from different ASR projects were modified, along with the required changes to implementation source code and unit tests.
…and the answer to the opening question: under the defined transcription and annotation rules, “KLM six O four” is the correct transcription and “KLM604” is the correct annotation.