Platt Perspective on Business and Technology

Mining and repurposing of raw data into new types of knowledge – 1

Posted in business and convergent technologies, macroeconomics by Timothy Platt on February 19, 2012

I have written a number of times in this blog about raw, unprocessed data and about knowledge – data that has been processed, organized and analyzed. In a fundamental sense, this posting is an examination of the two, with a goal of more clearly articulating both their differences and their distinctive sources of value.

The immediate impetus for this posting is an email conversation I have been having with a friend and colleague in Europe. His hands-on work is in information management, and his focus has increasingly been on helping large organizations to repurpose large, even tremendously vast accumulations of data for new uses – and, where this data is commoditized, for new users and types of users.

Businesses, and organizations of all sorts, accumulate data and raw information the way organisms breathe – automatically, continuously and usually without explicitly thoughtful attention. True, many data fields on customer-facing forms and other entry points are added to address specific process-based needs, but many forms also have fields that are simply there, for no other reason than that they have always been there. And then there is all of that structured, less structured and unstructured raw data and information that automatically flows in and gets stored. This includes web site activity log files and a lot of other data streams that come in through automated processes and that might only become noticeable if their accumulated volume reaches a point where they risk maxing out a web server or other storage resource – I have definitely seen that happen. But even the data that is collected for specific functional and operational purposes generally persists in systems well beyond the point where it was needed, and it too becomes data landfill if left as is.

• Even when raw data is collected with specific intended uses in mind, and certainly when it is simply collected and saved, it can best be thought of as raw potential, and as being unformed as far as specific usage is concerned.
• When data is selected out of this accumulated pool, processed and analyzed to address specific questions, and to provide specific insight or verification, that potential collapses down to some specific realized point of value. And this is the process of converting raw data into processed knowledge.
• But the same data points, combined perhaps with different data from that same initial pool, and processed and analyzed differently and to different ends, might feed into and support completely different types of realized, processed knowledge and different points of realized potential.

My friend’s company has been in business for over 140 years in Europe, and globally as well and they have accumulated truly vast storehouses of raw data. Many if not most of the businesses they work with, as partners and as clients, have also accumulated vast pools of raw data in this way.

This data, stripped of source-identity information where needed, to protect privacy and meet confidentiality due diligence concerns, can be viewed as a marketable resource. And effectively managed, this has become the basis of a new and emerging industry, and one of great potential value.

This is my first posting in what I at least initially see as a new series on the commercialization of data repurposing as an industry. And I want to start by discussing how raw data might be preliminarily organized and cataloged so as to effectively match the right data with the right customer.

• Start by cataloging data sources within the overall data pool as being fully structured, partly structured and unstructured.
• Where this data is structured, identify any and all labels that specific data fields are organized with, and re-label them according to a standard format. When, as a simple example using personally identifiable information, you are merging data tagged with the field identifier Last_Name and similar data identified as name.family, do so under a single field identifier that you would use in your own raw data cloud.
• When data is partly structured, you need ways to break out more elemental data fields for treatment as above. As an extension of the same example, if Last_Name is also capturing suffixes such as Jr., Sr., MD and PhD, these and their variants would be separated out into a new data field, so that your family/last name field holds only that data, with the correct last name always showing.
• The above two bullet points at least touched on the issues of characterizing, parsing and meta-tagging data, though in a relatively simple context.
• When data is unstructured, you face a much more daunting challenge and particularly where you are dealing with large enough pools of data so that all filtering and cataloging, and parsing and meta-tagging has to be carried out by strictly automated means.
• And even relatively modest data accumulations can require essentially completely automated processes for their basic management to work, let alone for their effective repurposing.
• And as a final point here, the real value in this data will depend, in many if not most cases, on matching data from one field to another across systematic records, using relational databases or specific alternatives to them – and others have to be able to connect to these systems and effectively use data sets from them too.
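The field relabeling and suffix separation described in the bullet points above can be sketched in code. This is a minimal illustration only, not production data management tooling; the canonical field names, the source-system labels, and the suffix list here are all hypothetical stand-ins:

```python
# Hypothetical mapping from source-system field labels to one canonical
# schema, as in the Last_Name / name.family example above.
CANONICAL_FIELDS = {
    "Last_Name": "family_name",
    "name.family": "family_name",
    "First_Name": "given_name",
    "name.given": "given_name",
}

# A sample of common name suffixes that may be embedded in a
# family-name field in partly structured data.
SUFFIXES = {"Jr.", "Sr.", "II", "III", "MD", "PhD"}

def normalize_record(record):
    """Relabel fields to the canonical schema and split out suffixes."""
    out = {}
    for label, value in record.items():
        # Unknown labels pass through unchanged for later review.
        out[CANONICAL_FIELDS.get(label, label)] = value
    family = out.get("family_name")
    if family:
        parts = [p.strip(",") for p in family.split()]
        found = [p for p in parts if p in SUFFIXES]
        if found:
            # Move suffixes into their own field so the family-name
            # field holds only the actual last name.
            out["name_suffix"] = " ".join(found)
            out["family_name"] = " ".join(p for p in parts if p not in SUFFIXES)
    return out

# Records from two different source systems now merge under one schema:
a = normalize_record({"Last_Name": "Smith, Jr."})
b = normalize_record({"name.family": "Smith"})
```

Real systems would of course need far larger label mappings and suffix dictionaries, and fuzzier matching, but the principle – one canonical field identifier per elemental data item – is the same.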

Once you know what you have – certainly for your structured and largely structured data – and once you know how to organize it as a source of commoditizable product, you have to cleanse it as a core due diligence requirement.

• On one level, data cleansing simply means de-duplication and cleaning out data errors. But it also refers to the management and control of legally restricted sensitive information, and the development of cleaned alternatives that can be more publicly shared.
• That means identifying data types that have to be anonymized – converted, for example, from personally identifiable information as to individual source into general demographics-level data – and carrying out that conversion for any data that would be marketed, barring specific permissions to share personally identifiable information with specific allowed-for and approved recipients.
• That means reaching a level of assurance that any partly structured data that you so share does not contain personally identifiable or confidential information within it, or at least that it does not contain such data that would violate your due diligence and risk remediation guidelines and requirements.
• And for unstructured data, that might contain anything in it, you need to understand the inherent risk that sharing this might carry and certainly for sharing data that was offered to your business under assumption of confidentiality.
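The de-duplication and anonymization steps above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the list of PII fields is hypothetical, and "anonymization" is approximated here by one-way hashing of those fields – strictly speaking pseudonymization, which on its own would not satisfy every privacy regime:

```python
import hashlib

def cleanse(records, pii_fields=("given_name", "family_name", "email")):
    """De-duplicate records and pseudonymize assumed PII fields.

    A minimal sketch: a real cleansing pipeline would also validate
    values, normalize formats, and apply policy-specific rules.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # De-duplicate on the full record contents.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        out = {}
        for field, value in rec.items():
            if field in pii_fields:
                # Replace the identifying value with a truncated one-way
                # hash, so records can still be linked to each other
                # without directly exposing individual identity.
                out[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            else:
                out[field] = value
        cleaned.append(out)
    return cleaned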

So far I have touched on a few basic areas of activity and concern, and I freely acknowledge here that I have only touched the surface of this topic with a simplistic cartoon representation of the issues and opportunities involved.

The data repurposing industry is one that will be built upon fleshing out this type of cartoon beginning with commercially viable, due diligence- and risk remediation-robust details. And that only begins with the incorporation of current search engine technologies and our emerging suite of tools and approaches for implementing web 3.0 and the intelligent web, with their automated management of data flows.

My basic points here in this posting are that:

• Raw and unprocessed data can be collected and accumulated into large collections, and these can be combined into still larger ones.
• That is happening in a seemingly endless number of repositories of all sorts and in an incredible number of places.
• Much of this is potentially commoditizable and with due diligence and risk remediation processes in place.
• These data accumulations can be mined in numerous ways and repurposed to address new and emerging questions and needs, and by new users and types of users not anticipated when this data was originally collected and developed.
• And an industry is arising, both from a growing capacity for developing and accumulating access to pools of this data, and from an improving capacity to intelligently mine it so that the right data can be marketed to the right customers, in the right forms and formats.
• In that, data might derive from a single organizational source with that business mining and commoditizing its own raw data pools. Or a data repurposing business might serve in effect as a broker, helping other businesses to capitalize on some of the value inherent in their data pools and managing sale of access to it to others.

I am going to pick up from that last bullet point in my next series installment, where I will at least start a discussion of data sources, data ownership, data commoditization and sale, and data repurposing business models.

You can find this posting and series at Ubiquitous Computing and Communications – everywhere all the time and at Macroeconomics and Business.
