Platt Perspective on Business and Technology

Big data and the assembly of global insight out of small scale, local and micro-local data 10: Bayesian statistics and the power of predictive modeling 1

Posted in business and convergent technologies, reexamining the fundamentals by Timothy Platt on February 25, 2015

This is my tenth installment in a series on big data and on how wide-ranging and even globally significant insight can be developed out of small-scale local and even micro-local data (see Ubiquitous Computing and Communications – everywhere all the time 2, postings 265 and loosely following, for Parts 1-9).

I wrote in Part 9 of this series about how “Big” in big data can be both a defining strength and a defining weakness. In that regard, I cited a working example, based on US national security big data initiatives that are current as of this writing, to illustrate both of those possibilities. A key point that I at least touched upon there, and one that makes elements of both the positive and the negative sides of big data inevitable, is that:

• While data might be developed, organized and accumulated from a deep and even profound understanding of what is, and of what has led up to any current now,
• It is also going to have to be applied to the unpredictably emergent and novel and to the disruptively unexpected too, as well as to the more linearly predictable and expected.

I find myself thinking back to a data collection distinction that I first encountered as a child, when first learning about the scientific method. The distinction hinged on the meaning and function of the scientific hypothesis and on what constitutes scientifically meaningful data. Two people accumulate and carefully record data over an extensive period of time. One does so very systematically, with the goal of analytically addressing and resolving a specific hypothesis. The other does so with at least equal ardor and rigor, but essentially at random as to what is collected: indoor and outdoor temperature readings at systematically set times of the day and night, the first word they hear on their radio every morning and the last they hear in the evening before turning it off for the night, observational data from their garden and from their nearby woods, and a wide range of other, at least seemingly disconnected, data types.

The argument made was that because the former of these data sets is collected with an organizing purpose and according to a systematic design, it can legitimately be considered scientific data. The latter, collected without such organizing purpose or design and more at random, would not qualify as scientific data according to this conceptual perspective. But what if it is discovered, after the systematically organized, hypothesis-driven data set has been collected, that at least one crucial variable was not accounted for and that this variable would be essential for actually testing the hypothesis? Is this still scientific data, even if critically incomplete and inconclusive scientific data?

And let’s consider that other data set too. An ecologist finds this trove of at-first seemingly randomly collected information and begins looking through it to see if any of it might be of value. Within it they find consistently recorded flows of data for sets of variables that are crucial to testing a hypothesis that they have been seeking to address: the possible effect of late autumn day-to-day temperature fluctuations on the over-winter survival of certain types of usually-annual and perennial plants. All of the data they require is precisely and consistently recorded in this flood of information. Suddenly, at least specific subsets of this overall data set can be pulled out of the whole as valid and informative scientific data, for meaningfully addressing and even resolving a specific scientific hypothesis.

And this brings me to “Big” in big data: why it is big, and why it tends to be more open-ended in its collection. You cannot necessarily know in advance all of what you will need in order to address specific hypotheses, or what can be brought together into meaningful patterns in addressing completely unexpected but nevertheless crucially important ones.

And with this in mind, I come to the issues that I noted at the end of Part 9, as topics for this posting, where I stated that I would:

• Continue its discussion in a next series installment with a look into how Bayesian analysis and approaches that stem from it can be used to more effectively manage and work with large-scale data.
• And in anticipation of that, and I add in keeping with my ongoing focus on innovation and the disruptively novel in this blog,
• I said that I would discuss how the unexpected and the emergent, and even just the possibility of them, impact on any use of big data and on the capacity to make effective use of data resources of any scale.

I begin this with the fundamentals and by explicitly dividing data analysis and use into two fundamental domains: descriptive and predictive analyses.

Descriptive analyses, as exemplified by descriptive statistical methodologies and approaches, offer an essentially static understanding of the nature of groups and of their correlational relationships. This categorical type of analysis can explore predetermined, a priori known criteria of likely data organization in testing description-predicting hypotheses. Or it can be expanded, certainly in a big data context, to include a search for unpredicted but nevertheless statistically significant new correlations and functional relationships too.
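As one small illustration of a descriptive measure of this kind, the sketch below computes a Pearson correlation coefficient between two recorded variables. The variable names and the readings are hypothetical, invented here only to echo the ecologist example above, and Pearson correlation is just one of many descriptive statistics that could fill this role.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical readings, invented for illustration only:
# day-to-day temperature swing versus number of plants surviving the winter.
temperature_swing = [1.0, 2.0, 3.0, 4.0, 5.0]
plants_surviving = [10.0, 8.0, 6.0, 4.0, 2.0]

r = pearson_r(temperature_swing, plants_surviving)
print(f"correlation: {r:.2f}")
```

A coefficient near -1 or +1 describes a strong linear relationship in the data already collected. By itself it says nothing about what will happen next, which is where predictive analysis comes in.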

But the real power of big data comes when descriptive modeling, as touched upon above, serves as a starting point for predictive analyses. And this brings me to predictive statistical analysis, and more specifically to Thomas Bayes: an 18th century Presbyterian minister who is considered the father of propositional logic-based probability theory, in which successive predictive refinements are made on the basis of the truth or falsity of an iterative series of progressively more refined, logically stated and tested hypotheses.

Put more simply, Bayes developed a new interpretation of probability theory, creating tools that can be used to filter and predict, step by step, consequential and highly correlated relationships and outcomes that can be described through testable hypotheses. And the suite of tools that he initially developed has grown from that early work to become, in the early 21st century, among other things the killer app of predictive analytics for big data.
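The step-by-step refinement described above can be sketched in a few lines. This is a minimal illustration of Bayes' rule, P(H | E) = P(E | H) P(H) / P(E), applied repeatedly so that each posterior becomes the prior for the next observation. The two hypotheses, the coin probabilities and the sequence of observations are all hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def bayes_update(prior, likelihoods):
    """Apply Bayes' rule once.

    prior:       dict mapping hypothesis -> P(H)
    likelihoods: dict mapping hypothesis -> P(E | H) for the new observation E
    Returns a dict mapping hypothesis -> P(H | E).
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(E), summed over all hypotheses
    if evidence == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical setup: is a coin fair, or biased toward heads?
p_heads = {"fair": 0.5, "biased": 0.8}   # assumed P(heads | hypothesis)
belief = {"fair": 0.5, "biased": 0.5}    # assumed starting prior
observations = ["H", "H", "T", "H"]      # invented data

for obs in observations:
    likelihoods = {h: (p if obs == "H" else 1 - p) for h, p in p_heads.items()}
    belief = bayes_update(belief, likelihoods)  # posterior becomes the next prior
    print(obs, {h: round(p, 3) for h, p in belief.items()})
```

Each pass through the loop narrows or shifts the belief as new evidence arrives. Real predictive models work with far richer hypotheses and data, but the update step has this same shape.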

By some accounts, Bayes himself initially developed his analytical approach with a goal of mathematically proving the existence of God. While he might not have succeeded in that quest, his successors, using the probabilistic tools and approaches that he initiated, have gone on to analyze and predictively model data and presumptive evidence from as wide a range of areas of enquiry and experience as they have been able to imagine, and often with profound success.

I am going to continue my discussion of predictive data analysis in a next series installment, where I will return to the three to-do points that I listed above, focusing on the second and third of them. Meanwhile, you can find this series and other related material at Ubiquitous Computing and Communications – everywhere all the time 2 and also in my first Ubiquitous Computing and Communications directory page. And I also include this series in my directory: Reexamining the Fundamentals.
