Platt Perspective on Business and Technology

Big data 5: accuracy and error correction from the business perspective

Posted in business and convergent technologies by Timothy Platt on January 10, 2013

This is my fifth installment in a series on big data, an emerging capability that has become surrounded by hype even as it has emerged as a powerfully disruptive societal force (see Ubiquitous Computing and Communications – everywhere all the time, postings 177 and following). In this posting I begin considering the issues of data accuracy, and of error identification and correction, that arise in big data accumulation and use, as viewed:

• From the perspective of the businesses that accumulate and use this information, and
• From the perspective of the consumers individually identified by it: the people this information comes from and seeks to predictively describe.

The business perspective: The general name for the processes a business uses to maintain data accuracy is data cleansing. In its widest sense, this means identifying and correcting (or deleting) corrupt or inaccurate records in a database. More specifically, it means identifying incomplete, inaccurate, irrelevant or otherwise problematic entries in the accumulated data, and then replacing, modifying or deleting what has been found to be dirty data.
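To make the identification step concrete, here is a minimal Python sketch of how records might be flagged as dirty. The field names (name, street, postal_code) and the US ZIP code format are assumptions chosen purely for illustration, and the function names are hypothetical; a real customer database would have its own fields and validity rules.

```python
import re

# Hypothetical field names; a real customer database will differ.
REQUIRED_FIELDS = ("name", "street", "postal_code")
# Assumes US ZIP codes (5 digits, optional +4) purely for illustration.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")


def find_problems(record):
    """Return a list of reasons a record looks dirty (empty list = clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat absent, None, and whitespace-only values as incomplete.
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    postal_code = str(record.get("postal_code") or "").strip()
    if postal_code and not ZIP_PATTERN.match(postal_code):
        problems.append("malformed postal_code")
    return problems


def cleanse(records):
    """Split records into (clean, flagged) without deleting anything."""
    clean, flagged = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            flagged.append((record, problems))
        else:
            clean.append(record)
    return clean, flagged
```

The sketch only flags problem records and leaves the choice of replacing, modifying or deleting them to a later step, since that decision usually depends on what corrected information is available.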

The primary reason data cleansing is needed is very simple. With time, people move, and their addresses and phone numbers change. People change names or use more than one name variant. They change jobs, and their income levels can go up or down. Their purchasing levels and preferences can change. In fact, virtually any type of data entry that might be collected about them can become obsolete, with multiple, perhaps successively once-accurate versions maintained side by side in the same databases, along with outright data errors. The only fields not likely to change for an individual are ones such as a social security number, which are assigned as unique identifiers for life. On top of that, people sometimes intentionally enter faulty information into forms, particularly where they see questions about issues such as their personal income as overly intrusive. Unintentional data entry errors and other forms of faulty data accumulation occur too. So with time, absent effective data cleansing, any accumulated data records degrade in value and effectiveness, certainly at the level of the specific individuals that all of this data is collected about.
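The degradation described above can be illustrated with a deliberately simple model. The Python sketch below assumes that every record has the same, independent chance of going out of date in any given year, so that the accurate fraction of an uncleansed database shrinks geometrically. Both that assumption and the 10 percent annual rate in the usage example are hypothetical; real rates would have to be measured for a specific database.

```python
import math


def fraction_still_accurate(annual_change_rate, years):
    """Fraction of records still accurate after `years` with no cleansing.

    Assumes a constant, independent yearly chance that a record goes stale.
    """
    if not 0.0 <= annual_change_rate <= 1.0:
        raise ValueError("annual_change_rate must be between 0 and 1")
    return (1.0 - annual_change_rate) ** years


def years_until_fraction(annual_change_rate, target_fraction):
    """Years until only `target_fraction` of records remain accurate."""
    if not 0.0 < annual_change_rate < 1.0:
        raise ValueError("annual_change_rate must be strictly between 0 and 1")
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    return math.log(target_fraction) / math.log(1.0 - annual_change_rate)


if __name__ == "__main__":
    # Illustrative only: a made-up 10% yearly change rate.
    for year in range(0, 6):
        print(year, round(fraction_still_accurate(0.10, year), 3))
```

Under these assumptions, a list that loses a tenth of its accuracy each year is down to roughly 59 percent accurate after five years, which is the kind of steady loss that cleansing is meant to roll back.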

For large enough accumulations of data, on large enough numbers of individuals to support statistical analysis, it is even possible to predict quite accurately the rate at which overall data accuracy and quality decay over time, absent any data correcting effort. This closely parallels the information theory concept of entropy. From the larger perspective of overall data quality, think of data cleansing as an effort to roll back the loss of obtainable value that would come with accumulating information entropy. To take that out of the abstract, consider one seemingly stable type of data field, the name fields for first and last names, and consider retail businesses that do a lot of their business and bring in a lot of their revenue through mailed catalog sales.

I am writing this shortly before Christmas 2012, to go live in early January 2013, so this example is very timely for me and my family. We do at least occasionally make catalog purchases, mostly using a combination of print and online catalogs to decide what to buy and then making our purchases online or by phone. The problem is that we get so many duplicate copies of so many printed catalogs. Minor differences in how our address is specified contribute to this, but much of this unnecessary and wasteful duplication comes from differences in how our names are listed: with a full first name or an initial only, with or without my PhD included, and so on. The worst offenders here are actually a couple of businesses that market themselves the most actively as being environmentally conscious and responsible. With these unwanted duplicate catalogs clogging mailboxes, they are in effect marketing themselves as causing the wasteful cutting down of large numbers of trees, which only end up in customer trash or paper recycling bins without ever serving a useful purpose. I write “large numbers of trees” because I am certain that my wife and I are not the exception in receiving uselessly duplicated catalogs in large numbers. Absent any real, meaningful effort to cleanse their catalog mailing lists, these companies must send out vast numbers of unneeded and unwanted catalog copies, with all of the up-front cost that adds, and with the additional significant if less easily quantified cost of this waste undercutting their intended, positive marketing message about who they are as a responsible business.
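The kind of duplication described above, the same person listed with a full first name or an initial, with or without a degree, at an address written in slightly different ways, can often be caught with fairly simple normalization. The Python sketch below is one possible approach under stated assumptions, not a description of how any actual retailer works: the record fields, the title, suffix and street abbreviation lists, and all function names are hypothetical, and the names in the usage example are invented.

```python
import re
from collections import defaultdict

# Small illustrative vocabularies; real lists would be much longer.
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"phd", "md", "jr", "sr"}
STREET_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}


def _tokens(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()


def name_key(name):
    """Reduce a name to 'lastname:first-initial', ignoring titles/degrees."""
    tokens = [t for t in _tokens(name) if t not in TITLES | SUFFIXES]
    if not tokens:
        return ""
    return f"{tokens[-1]}:{tokens[0][0]}"


def address_key(address):
    """Normalize an address so 'Street' and 'St.' compare equal."""
    return " ".join(STREET_ABBREVIATIONS.get(t, t) for t in _tokens(address))


def group_duplicates(records):
    """Return groups of records that share a normalized name and address."""
    groups = defaultdict(list)
    for record in records:
        key = (name_key(record["name"]), address_key(record["address"]))
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    mailing_list = [
        {"name": "Jane Q. Doe, PhD", "address": "12 Elm Street"},
        {"name": "J. Doe", "address": "12 Elm St."},
        {"name": "John Roe", "address": "9 Oak Avenue"},
    ]
    for group in group_duplicates(mailing_list):
        print("Possible duplicates:", [r["name"] for r in group])
```

Matching on last name plus first initial is a deliberate trade-off: it catches the initial-versus-full-name variants, but it would also group two different people, say a Jane Doe and a John Doe, living at the same address. For that reason the sketch only reports candidate groups for review and does not merge or delete anything on its own.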

This is a simple example. When accumulated data errors and duplications are considered across the whole range of data fields collected, with all of the potential for business inefficiency and waste that they create, data cleansing becomes a vitally important information management activity, and doing it efficiently becomes a significant strategic goal, or at least it should.

In my next series installment I am going to turn to data accuracy and error correction from the consumer perspective, considering who gets to see what, who gets to correct what and delete inaccuracies, and how. Meanwhile, you can find this and related postings at Ubiquitous Computing and Communications – everywhere all the time, and also see Business Strategy and Operations and its continuation page, Business Strategy and Operations – 2.
