Generative AI is a headline act in many industries, but the data powering these AI tools plays the lead role backstage. Without clean, curated, and compliant data, even the most ambitious AI and machine learning (ML) initiatives will falter.
Today, enterprises are moving quickly to integrate AI into their operations. According to McKinsey, in 2024, 65% of organizations reported regularly using generative AI, marking a twofold increase from 2023.
However, the true potential of AI and ML in the enterprise won’t come from surface-level content generation. It will come from deeply embedding models into decision-making systems, workflows, and customer-facing processes where data quality, governance, and trust become central.
Additionally, simply incorporating AI and ML features and functionality into foundational applications won’t do an enterprise any good. Organizations must leverage all aspects of their data to create strategic advantages that help them stand out from the competition.
To do this, the data powering their applications must be clean and accurate to mitigate bias, hallucinations, and/or regulatory infractions. Otherwise, they risk issues in training and output, ultimately negating the benefits that the AI and ML projects were initially meant to create.
The importance of good, clean data
Data is the foundation of any successful AI initiative, and enterprises need to raise the bar for data quality, completeness, and ethical governance. However, this isn’t always as easy as it sounds. According to Qlik, 81% of companies still struggle with AI data quality, and 77% of companies with over $5 billion in revenue expect poor AI data quality to cause a major crisis.
In 2021, for example, Zillow shut down Zillow Offers because it failed to accurately value homes due to faulty algorithms, leading to massive losses. This case highlights a critical point: AI and ML projects must operate on good, clean data to produce accurate results.
Today, AI and ML technologies rely on data to learn patterns, make predictions and recommendations, and help enterprises drive better decision-making. Techniques like retrieval-augmented generation (RAG) pull from enterprise knowledge bases in real time, but if those sources are incomplete or outdated, the model will generate inaccurate or irrelevant answers.
Agentic AI’s ability to act reliably hinges on consuming accurate, timely data in real time. For example, an autonomous trading algorithm reacting to faulty market data could trigger millions in losses within seconds.
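One way to picture the real-time requirement is a guard that sits between a data feed and an automated agent and rejects data that is stale or implausible before the agent acts on it. The following is a minimal Python sketch under stated assumptions, not a production control: the `Tick` type, the `is_usable` function, the two-second age limit, and the 10% price-jump limit are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Tick:
    """A single market data point (hypothetical shape, for illustration only)."""
    symbol: str
    price: float
    observed_at: datetime  # expected to be timezone-aware


def is_usable(
    tick: Tick,
    last_price: Optional[float],
    now: datetime,
    max_age: timedelta = timedelta(seconds=2),  # assumed freshness limit
    max_jump: float = 0.10,                     # assumed plausibility limit (10%)
) -> bool:
    """Return True only if the tick is fresh and its price looks plausible."""
    age = now - tick.observed_at
    # Reject data that is too old, or timestamped in the future.
    if age > max_age or age < timedelta(0):
        return False
    # Reject prices that cannot be valid.
    if tick.price <= 0:
        return False
    # Reject sudden jumps relative to the last accepted price.
    if last_price is not None and abs(tick.price - last_price) / last_price > max_jump:
        return False
    return True


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    tick = Tick("ACME", 101.0, now - timedelta(seconds=1))
    print(is_usable(tick, last_price=100.0, now=now))
```

In this sketch, a rejected tick would simply not be passed to the agent. What should happen next (pausing, alerting a human, falling back to another source) is a governance decision rather than a coding one.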
Establishing and maintaining an environment of good data
To establish and maintain an environment of good data that can be leveraged for AI and ML, enterprises should consider three key elements:
1. Build a comprehensive data collection engine
Effective data collection is essential for successful AI and ML projects, and enterprises need modern data platforms and tools, such as those for integration, transformation, quality monitoring, cataloging, and observability, to support the demands of their AI development and output. These tools help ensure the organization is getting the right data.
Whether the data is structured, semi-structured, or unstructured, it should be collected from a variety of sources and through a variety of methods so that model training and testing cover the different user scenarios the models may encounter upon deployment. Additionally, companies must ensure they follow ethical data collection standards. Whether the data is first-, second-, or third-party, it must be sourced correctly and with consent given for its collection and use.
2. Ensure high data quality
High-quality, fit-for-purpose data is imperative for the performance, accuracy, and reliability of AI and ML models. Given that these technologies introduce new dimensions, the data used must be specifically aligned with the requirements of the intended use case. However, 67% of data and analytics professionals say they don’t have complete trust in their organizations’ data for decision-making.
To address this, it’s essential that enterprises have data that is representative of real-world scenarios, monitor for missing data, eliminate duplicate data, and maintain consistency across data sources. Furthermore, recognizing and addressing biases in training data is critical, as biased data can compromise outcomes and fairness and negatively impact customer experience and credibility.
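The checks described above (missing data, duplicates, and consistency across sources) can be sketched in a few lines of Python. This is an illustrative sketch using only the standard library, not a substitute for a data quality platform. The function names `quality_report` and `find_conflicts`, and the `id` and `email` fields in the usage example, are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

Record = Dict[str, Any]


def quality_report(records: List[Record], required_fields: List[str], key_field: str) -> Dict[str, Any]:
    """Count missing required values and list keys that appear more than once."""
    missing: Counter = Counter()
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # Treat None and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    key_counts = Counter(
        record[key_field] for record in records if record.get(key_field) is not None
    )
    duplicate_keys = sorted(key for key, count in key_counts.items() if count > 1)
    return {"rows": len(records), "missing": dict(missing), "duplicate_keys": duplicate_keys}


def find_conflicts(source_a: List[Record], source_b: List[Record], key_field: str, field: str) -> List[Any]:
    """Return keys present in both sources whose values for `field` disagree."""
    b_by_key = {r[key_field]: r for r in source_b if r.get(key_field) is not None}
    conflicts = []
    for record in source_a:
        key = record.get(key_field)
        other = b_by_key.get(key) if key is not None else None
        if other is not None and record.get(field) != other.get(field):
            conflicts.append(key)
    return sorted(conflicts)


if __name__ == "__main__":
    # Hypothetical records from two systems that should agree with each other.
    crm = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": ""},
        {"id": 2, "email": "b@example.com"},
        {"id": 3, "email": None},
    ]
    billing = [
        {"id": 1, "email": "a@example.com"},
        {"id": 3, "email": "c@example.com"},
    ]
    print(quality_report(crm, ["id", "email"], "id"))
    print(find_conflicts(crm, billing, "id", "email"))
```

Checks like these cover only completeness, uniqueness, and consistency. Whether the data is representative of real-world scenarios, or biased, cannot be established by a script alone and still requires review against the intended use case.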
3. Implement trust and data governance frameworks
The push for responsible AI has placed a spotlight on data governance. With 42% of data and analytics professionals saying their organization is unprepared to handle the governance of legal, privacy, and security policies for AI initiatives, it’s critical that there is a shift from traditional data governance frameworks to more dynamic frameworks.
In particular, with agentic AI gaining prominence, it’s crucial to be able to explain why agents make specific decisions or take specific actions. Enterprises must focus sharply on explainable AI techniques to build trust, assign accountability, and ensure compliance. Trust in AI outputs begins with trust in the data behind them.
In summary
AI and ML projects will fail without good data because data is the foundation that enables these technologies to learn. Data strategies and AI and ML strategies are intertwined. Enterprises must make an operational shift that puts data at the core of everything they do – from technology infrastructure investment all the way to governance.
Those that take the time to put data first will see projects flourish. Those that don’t will face ongoing struggles, with competitors nipping at their heels.
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro