Transcription accuracy is the bedrock on which data-driven strategies must be built: part 1
The results of our recent survey on the transformative potential of transcription tech – which canvassed the views of nearly 100 financial markets professionals spanning trading, sales, compliance, risk management, operations and IT – are in. And they are conclusive: the industry remains uneasy over the level of accuracy transcription engines offer.
When asked what concerns they had around the use of transcription technology in the financial sector, as many as half of respondents cited accuracy and reliability. When asked which specific features or improvements would make transcription services more appealing to their specific role, the largest group (33%) opted for greater accuracy and less need for manual correction of transcripts.
Anxiety over accuracy is well founded, particularly given the complex and jargon-heavy nature of investment markets. The most pressing questions we must now ask ourselves as an industry are the following: where are we in our quest for transcription perfection, and how impactful could the realization of this goal be for financial institutions?
With major advancements in machine learning and artificial intelligence making headlines on an almost weekly basis, and the rapid development of the infrastructure underpinning these technologies showing no signs of slowing, the answers to these questions are beyond exciting. And the rewards could materialize much sooner than you might expect.
Pinpointing accuracy
Before we delve into the current state of play with regards to transcription accuracy, it is important to get to grips with how it is measured, as well as the recent improvements in the technologies underpinning transcription.
The most common metric used to measure transcription accuracy is word error rate (WER). While alternative methods like precision and recall are gaining traction, the popularity of this metric allows us to compare the accuracy of various transcription technologies more easily.
It is calculated by comparing the transcription generated by a machine to that of a human, using the same audio file. It counts the number of errors in the transcription, including words added to the transcript that are not in the recording (insertions), words missing from the transcript that are spoken in the recording (deletions), and words transcribed incorrectly (substitutions). WER is then expressed as a percentage, with a lower WER indicating higher accuracy. For instance, a WER of 5% means there are five errors for every 100 words transcribed.
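As a rough sketch, the WER calculation described above amounts to a word-level edit distance between the human reference transcript and the machine's output. The function below is illustrative (the function name and the sample sentences are our own, not drawn from any particular product):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference,
    computed via word-level Levenshtein distance (dynamic programming)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j - 1] + sub,  # substitution (or exact match)
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
            )
    return d[len(ref)][len(hyp)] / len(ref)

ref = "we filled the order at ninety eight basis points"
hyp = "we filled order at ninety eight basic points again"
# One deletion ("the"), one substitution ("basis" -> "basic"),
# one insertion ("again"): 3 errors over 9 reference words.
print(f"WER = {word_error_rate(ref, hyp):.0%}")  # prints WER = 33%
```

Note that WER can exceed 100% if the engine inserts many words that were never spoken, which is why a "percentage accuracy" marketing figure (100% minus WER) should be read with some care.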
When transcribing a typical everyday conversation – such as a customer chatting with a call centre agent – most off-the-shelf automated transcription products are highly accurate, often scoring well below 10% in terms of their WER. This means they can be marketed as over 90% accurate – a dramatic improvement compared to the accuracy of automated transcription services just a couple of years ago.
This jump in accuracy is driven by developments in the underlying technology of transcription engines. Advancements in the supporting AI technologies, together with the leap in computing power enabled by ever more powerful GPUs, have allowed the creation of bigger, more accurate transcription models that remain more efficient than the previous CPU-based approaches.
Automated transcription technologies have subsequently reached parity with humans when transcribing general conversations in English, and several specialist technologies have even proven more accurate than humans if the audio quality is poor or the conversation littered with jargon – like most capital markets conversations.
Markets convo complexity
Given popular automated transcription services like Otter are often free and can transcribe reams of text instantly, a growing number of financial institutions are adopting off-the-shelf solutions to capture staff conversations. After all, transcribing employee conversations is essential for many regulatory reporting processes, as well as increasingly helping to inform front-office trading and investment strategies – but more on this in part 2. The issue is that the level of accuracy offered by these solutions is more nuanced than it may seem – particularly in the context of capital markets.
We recently conducted an analysis of transcription accuracy among several common automatic speech recognition products, looking specifically at how well they interpret complex financial markets conversations. We used our own dataset of 58 real trader conversations in English, many of which had poor audio quality. The results were somewhat alarming. In terms of their WER, AWS Transcribe scored 29%, Azure Speech to Text 37.51%, and Google Chirp 26.01%.
Consider the risks of relying on a transcription tool where as many as a third of the words transcribed are incorrect. For a compliance officer, there is the risk of vastly inaccurate regulatory reports, which could result in penalties from watchdogs and a dent in the firm's reputation. For the front office, the potential implications of feeding erroneous data into models are harder to predict – but they are just as severe.
Insights are about integrity
It pays to view transcription accuracy as a bedrock or, better still, the foundations on which sophisticated structures like skyscrapers – or data-driven investment strategies, in our case – are constructed. Before you can empower the front office to deploy shiny new AI or machine learning tools to enrich decision making, you must first ensure the input data is accurate and uniform.
A building crafted with the most modern construction materials will still crack in an earthquake if its foundations are shoddy, just as an investment strategy will not hold up to scrutiny if it is informed by chaotic or inaccurate data. With this in mind, it is essential that firms are extremely selective with the transcription technology they implement. The most accurate will be those trained on real-life capital markets conversations, like VoxSmart's Scribe®. Its accuracy considerably outshone AWS Transcribe, Azure Speech to Text and Google Chirp, scoring 90.11% when interpreting intricate financial markets conversations. Indeed, AI-based transcription engines trained on sector-specific audio have proven more accurate than human transcription, largely due to the proliferation of jargon and industry-specific slang in fields like trading. As more asset class-specific transcription models are developed, this level of accuracy will only increase.
Only by adopting technology capable of reliably interpreting these conversations with near 100% accuracy can firms use this data to inform front office strategies. Once this degree of sophistication is integrated, however, the strategic opportunities are many. Tune in for part two of this transcription blog series for an analysis of the opportunities on offer to firms that arm themselves with highly accurate transcription technology.