Bad Data: The $3 Trillion-Per-Year Problem That’s Actually Solvable
In 2016, IBM released a report estimating that bad data costs U.S. businesses and organizations $3.1 trillion per year. Much of that money is lost to, among other things, the time knowledge workers (such as IT staff) spend digitizing or updating older sources, finding and fixing errors while organizing, and simply hunting for information and for confirmed sources for data they are hesitant to trust. An additional point of critical concern is that companies have not leveraged the age of Big Data equally; even very successful and well-established businesses hold assorted data in different places and formats, but might be powerless to use it because it is unstructured or semi-structured. If the full potential of artificial intelligence (AI) is to be realized, data has to be available for use in meaningful ways.
A few obvious frontrunners, such as Google or Amazon, set precedents for data management, but most businesses are not like these, and don’t deal with nearly the same volume or speed when it comes to data. For every other company in the world, including those in startup or scale-up mode, a solution is needed.
Straddling digital and paper-based worlds
Many businesses of varying types and sizes straddle the digital and paper-based worlds. Some of their most potentially helpful data is in documents — file types like PDFs, images and scanned documents — practically unavailable for informing high-level decisions or arriving at decisive conclusions. If positive outcomes are going to be possible, these types of information sources need to be organized, and AI and machine learning (ML) can provide the tools to do it.
AI for managing data
For many entrepreneurs in launch mode, three potential barriers stand in the way. First is the idea that AI requires immense amounts of data to produce precise results; second, that data often comes in numerous formats (structured, semi-structured and unstructured); and third, that data management might not have been an inherent part of existing business operations, so course correcting would require too much effort before better results can be achieved.
First, the quantity challenge. If businesses have less data, how can they hope to gain the same level of insight or train algorithmic models as fast as larger competitors? The solution is the same as in any sphere of software development: Do it incrementally. With one-shot learning, a model can learn from a single example, or from just a handful of them. This tech already exists and is on display whenever someone unlocks a smartphone with facial recognition, for example. The system needs very little data and can quickly learn to adapt even if small feature changes occur. Many open-source models for data don’t operate this way, but could.
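To make the idea of learning from one example concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the class name OneShotClassifier, the labels and the three-number feature vectors are all hypothetical, and a production system would typically compare learned embeddings (for instance from a neural network) rather than hand-made numbers.

```python
import math


class OneShotClassifier:
    """Nearest-example classifier: stores one feature vector per label."""

    def __init__(self):
        self.examples = {}  # label -> feature vector (hypothetical features)

    def learn(self, label, features):
        # A single example is enough to register (or update) a label.
        self.examples[label] = list(features)

    def predict(self, features):
        # Return the label whose stored example is closest (Euclidean distance).
        if not self.examples:
            raise ValueError("no examples learned yet")
        return min(
            self.examples,
            key=lambda label: math.dist(self.examples[label], features),
        )


clf = OneShotClassifier()
clf.learn("invoice", [0.9, 0.1, 0.0])  # one example per document type
clf.learn("receipt", [0.1, 0.8, 0.1])
prediction = clf.predict([0.8, 0.2, 0.1])
print(prediction)
```

Each call to learn adds or replaces a single stored example, which is what lets the model grow incrementally as new document types show up.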
Second is the challenge of data in many formats. Especially in well-established industries, the digital transformation remains incomplete. This means that any and every type of historical data exists in file cabinets, on hard drives and in hard-to-access places or hard-to-match formats. This is where the power of machine learning comes in.
Data hygiene is a method of processing data to ensure it is relatively error free. There is a cycle to this: import, normalization, verification and export. Depending on the nature of the data (for instance, whether it is encrypted or anonymized), the method of cleaning may vary. Machine learning can support a system in which objective data components are measured against one another, issues are quickly identified, irrelevant parts are removed and the resulting data is made reliable. This can be automated and, once set up, significant amounts of existing information can be processed right away. After that, every new data point can be run through the same process as it arrives.
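The import, normalization, verification and export cycle can be sketched in a few lines of Python. This is a simplified, rule-based illustration under stated assumptions: the three-column schema (name, email, amount), the function names and the sample rows are all hypothetical, and in a machine-learning setup the verify step is where a trained model could take over from fixed rules.

```python
import csv
import io

FIELDS = ("name", "email", "amount")  # hypothetical schema for this sketch


def import_rows(text):
    # Import: parse raw CSV text into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))


def normalize(row):
    # Normalize: trim whitespace, unify casing, strip currency formatting.
    return {
        "name": " ".join((row.get("name") or "").split()).title(),
        "email": (row.get("email") or "").strip().lower(),
        "amount": (row.get("amount") or "").replace("$", "").replace(",", "").strip(),
    }


def verify(row):
    # Verify: return a list of problems; an empty list means the row passes.
    problems = []
    if not row["name"]:
        problems.append("missing name")
    if row["email"].count("@") != 1:
        problems.append("invalid email")
    try:
        float(row["amount"])
    except ValueError:
        problems.append("invalid amount")
    return problems


def export_rows(rows):
    # Export: write the clean rows back out as CSV text.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def run_pipeline(text):
    clean, rejected = [], []
    for raw in import_rows(text):
        row = normalize(raw)
        problems = verify(row)
        if problems:
            rejected.append((row, problems))
        else:
            clean.append(row)
    return export_rows(clean), rejected


SAMPLE = (
    "name,email,amount\n"
    '  ada   lovelace ,ADA@Example.com,"$1,200.50"\n'
    "bob,not-an-email,12\n"
)
cleaned_csv, rejected = run_pipeline(SAMPLE)
print(cleaned_csv)
print(rejected)
```

Rejected rows are kept alongside the reason they failed rather than silently dropped, so the person who knows the data can review them.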
Third is making data management part of regular business operations. The key challenge here isn’t necessarily in deploying the manpower to manage it, but in the setup. Most data-management systems use proprietary algorithms and require skilled coders or technologists to implement and maintain them. The same dilemma appears in other contexts, and it is one where the no-code movement is making a difference.
Innovators in this space realize that the people who know the data, and understand it intuitively, are not data scientists. Rather, they are the business owners and operators who have worked the issue and found it critical that they operate the data-management platforms. They will do the labeling and searching and actual using, so need to have self-service options — otherwise, the chaos of unorganized or inaccessible data will only be replaced by the stress of a long-term vendor contract to keep a model up and running, one that could be unsustainable.
What about data privacy?
Inevitably, any consideration of how data management is changing due to emerging technology must focus on privacy. Many of the companies seeking solutions in this space hold encrypted, medical or financial data, but a broad range of ordinary company documents also contains sensitive proprietary and customer information. At the most basic level, all of it should be protected, both at rest and in transit, ideally under more stringent standards such as SOC 2 Type II and HIPAA.
Machine learning can also help this effort, as it is capable of providing a layered approach. If algorithms are pre-trained using anonymized data, there should be no need to use real customer data to refine them. The aforementioned one-shot learning model takes data from an individual user and learns the document structure of that user’s data, without sharing it with other users to train their models.
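As a rough illustration of preparing text before it is used for training, the sketch below replaces email addresses with salted hash tokens. The function name pseudonymize, the token format and the salt value are hypothetical, and the email pattern is deliberately simple. Strictly speaking this is pseudonymization rather than full anonymization, and it covers only one kind of identifier.

```python
import hashlib
import re

# Simple email pattern for illustration; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text, salt):
    """Replace each email address with a short salted SHA-256 token."""

    def replace(match):
        value = match.group(0).lower()  # same address, same token
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        return f"<email:{digest[:8]}>"

    return EMAIL_RE.sub(replace, text)


masked = pseudonymize(
    "Contact ada@example.com or ADA@example.com.", salt="demo-salt"
)
print(masked)
```

Because the same address always maps to the same token, a model can still learn that two documents refer to the same contact without ever seeing the address itself.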
While entrepreneurial businesses are trying to achieve digital transformation and data accuracy, there are ways in which they can improve efficiency and reduce errors through intelligent document processing powered by machine learning. While there are many things machine learning can’t yet do, data management is something well within the scope of what’s possible now.