AI is at the forefront of today’s technological advancements. In 2022, McKinsey found that the proportion of companies adopting AI had more than doubled since 2017, and research projects the AI market will reach $1,345 billion by 2030. This transformation is revolutionizing industries and profoundly reshaping the way we live and work.
Among the most significant breakthroughs in AI are large language models (LLMs), such as OpenAI's GPT-3.5 and GPT-4, the models behind ChatGPT. These models have shown impressive capabilities in natural language processing and text generation. At Trint, we have been utilizing language models and AI since our inception in 2014, allowing journalists, podcasters, and content creators to save time transcribing their media, making it easier to access and analyze recorded interviews, lectures, meetings, podcasts, and more.
Today, nearly every industry has started using AI to reduce repetitive tasks like transcribing long audio files, summarizing lengthy texts, or extracting keywords. Businesses are also applying AI to more complex tasks, such as identifying unusual patterns for fraud detection, running virtual assistants for customer service, or automatically screening CVs to improve recruiting.
It’s essential to understand how AI models work, and particularly how language models are trained. Training a model can be compared to a young child learning to speak; both involve acquiring language skills through exposure to a large amount of data. We show them some new words and check whether they use them correctly in a sentence. If they don't, we keep giving them examples until they do. The more data we feed them, the more they learn.
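The "exposure to examples" idea can be made concrete with a toy sketch. Real LLMs learn by adjusting billions of parameters to predict the next word, but the simplest possible stand-in is a bigram model that just counts which word tends to follow which. This is purely illustrative, not how GPT-style models are actually trained:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count how often each word follows another -- a toy stand-in
    for the next-word prediction objective real LLMs are trained on."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(model, word):
    """Return the continuation seen most often during training."""
    followers = model.get(word.lower())
    if not followers:
        return None  # never saw this word -- the model cannot generalize
    return followers.most_common(1)[0][0]

# The 'exposure' phase: the more examples we feed in, the better the guesses.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the cat sat on the sofa",
]
model = train_bigram_model(corpus)
print(predict_next(model, "cat"))  # "sat" -- seen twice vs "chased" once
```

Note that the model can only echo patterns present in its training data, which is exactly why the composition of that data matters so much.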
“While we are rightly concerned about how children learn certain words, the same level of concern has not always been applied to the data used to train new AI services.”
Knowing the data used for training is vital because it significantly influences what the AI will generate. Unlike children, LLMs operate more like black boxes; they learn from the data provided during training, but interpreting or explaining their specific decision-making process and internal workings can be challenging.
The training data of LLMs often consists of a large amount of publicly available text from the internet, which can contain stereotypes or cultural prejudices that inadvertently make their way into the training set. This can lead to biases in the AI's generated text, favoring certain perspectives or displaying discriminatory attitudes.
AI researchers and developers are actively working on mitigating bias by employing techniques like careful dataset curation, or fine-tuning with additional human review. However, a clear solution is yet to be found. Another challenge for AI researchers is the problem of ‘hallucinations’, where the model generates responses or outputs that are incorrect, nonsensical, or unrelated to the given input.
“If we ask an AI to summarize a text, for example, sometimes the summary produced can contain details like dates, places, or people that are entirely made up and were not mentioned in the original text.”
Hallucinations can occur for various reasons, including limitations and biases in the training data and inherent limitations of the model's architecture. Addressing hallucinations in LLMs is an ongoing research challenge, so it is currently impossible to guarantee that AI-generated text will be entirely free of them.
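One simple mitigation for the summarization case described above is to automatically flag details in a summary that never appear in the source text. The sketch below uses a naive pattern for dates, numbers, and capitalized names; a production system would use proper entity recognition, and the function name and texts here are hypothetical:

```python
import re

def flag_unsupported_details(source, summary):
    """Naive guardrail: flag dates, numbers, and capitalized names that
    appear in an AI summary but nowhere in the source text.
    Sentence-initial capitals can cause false positives in this sketch."""
    pattern = r"\b(?:[A-Z][a-z]+|\d[\d,./-]*)\b"
    source_tokens = set(re.findall(pattern, source))
    return [tok for tok in re.findall(pattern, summary)
            if tok not in source_tokens]

source = "The meeting covered the quarterly budget and hiring plans."
summary = "The meeting on 14 March was led by Alice and covered the quarterly budget."
print(flag_unsupported_details(source, summary))  # ['14', 'March', 'Alice']
```

A non-empty result doesn't prove the summary is wrong, but it tells a human reviewer exactly which claims to verify before trusting the output.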
When it comes to AI at Trint, we recognize the importance of keeping the human element in the loop. We have adopted a design that makes it clear whether or not AI-generated text has been reviewed by a human. This way, users can trust the content they access and utilize, knowing it has been produced by AI algorithms and validated by human review.
When using AI, it's crucial to know if the data provided will be used for training the model. To revisit the child analogy, we are extremely careful about what we say around our kids, especially avoiding swear words, because we don't know when they might repeat them. Similarly, enterprises should exercise caution when sending confidential data to a third-party service that uses customer data to train AI models, as it may pose security risks and potential breaches of privacy.
Journalists, in particular, have to be careful to protect the identity of sources and whistleblowers. If their confidential data is used to train a language model accessible to others, it diminishes the value of their exclusive information and compromises their sources and competitive advantage.
“At Trint, we prioritize privacy and security, offering the same level of security to both our self-serve and enterprise customers. One key aspect of our commitment to privacy is our strict policy of not using customer data to train our AI models.”
We understand the significance of maintaining confidentiality and the trust our customers place in us. As a result, any confidential or proprietary information shared with Trint stays private and is never used to train our AI algorithms. By implementing this practice, we ensure that our customers' data is not shared in any way that could compromise their privacy.
Furthermore, we deploy our AI models inside our infrastructure, granting us greater control over the security measures implemented. We make sure that each file uploaded to Trint is processed exclusively within our secure systems. By keeping the AI operations within our infrastructure, we can also assure our customers that none of their data is shared with third parties, providing an extra layer of security.
At Trint, we believe we can deliver powerful new AI features without compromising our high standard of security. We are committed to releasing new AI-powered features in the near future – in fact, we recently launched our new Mobile Live feature in the mobile app, which provides live transcription streaming that can cope with an unstable internet connection. We also believe language shouldn't be a barrier to efficient transcription, and we are launching a new multi-language transcription service that can automatically detect and transcribe multiple languages in real time.
On top of that, we have many more ideas about how we can use AI to improve the efficiency of Trint and increase the productivity of our users in various ways. Stay tuned!