Platt Perspective on Business and Technology

Big data and the assembly of global insight out of small scale, local and micro-local data 11: Bayesian statistics and the power of predictive modeling 2

Posted in business and convergent technologies, reexamining the fundamentals by Timothy Platt on March 29, 2015

This is my eleventh installment to a series on big data and how wide-ranging and even globally significant insight can be developed out of small-scale local and even micro-local data (see Ubiquitous Computing and Communications – everywhere all the time 2, postings 265 and loosely following for Parts 1-10.)

In Part 10 of this series I offered an explicit, if selectively brief, discussion of descriptive and predictive data analyses, as approaches that can be used in testing essentially static descriptive hypotheses and more dynamic predictive hypotheses respectively. And as a core component of my discussion of predictive analysis and hypothesis testing there, I cited and began discussing Bayesian analysis as an essential analytical approach, and certainly so in a big data context.

At the end of that installment I added that I would continue its narrative here with a more detailed discussion of how unexpected and unpredictably emergent events can be identified and analyzed, and of their impact on big data requirements, if these data resources are to offer value in making those predictive analyses.

• Data collection and analysis can and do face challenges, even when collected data is only used to address hypotheses that are readily anticipated and that would resolve correlational and causal relationships arising under conditions of ongoing, predictable, linear evolutionary change.
• Effective data collection and data analyses become a great deal more complex, and their capacity to resolve hypotheses a great deal more uncertain, when disruptively novel change has to be accounted for as well.

I offered two test case scenarios in Part 10 of this series that I return to here, as a foundation for this discussion (suggesting that you review that posting for its relevant details):

• A pre-considered data collection enterprise that was planned out and designed, a priori to any actual data gathering, to empirically resolve a specific hypothesis, with the specific goal of enabling that analysis, and
• A more disorganized and open-ended data collection effort that was simply carried out without any a priori, hypothesis-driven need.

As noted in Part 10, I originally heard of this data-gathering distinction as a child, and the point in question was which, if either, could be considered a gathering of scientific data. The answer that I was offered at the time was that the former did in fact qualify as the gathering of scientific data and the latter did not. So I raised variations of the follow-up contingencies that I noted in Part 10, where the pre-planned data collection proved insufficient to the task of actually addressing its hypothesis, but where the data more randomly collected could be used to resolve a specific real-world hypothesis. The resolution to these confounding scenarios that I offered then, and that I still find most valid, is that:

• The scientific nature of scientific data does not, strictly speaking, reside in the data itself, except insofar as it is carefully, consistently and accurately gathered and recorded, at least to usably reasonable standards.
• It resides in how that data is utilized, where that use can legitimately be organized and planned out either a priori or a posteriori to the initial data collection itself in resolving specific hypotheses, provided only that sufficient data of the right types, and of sufficient accuracy of measurement, have been gathered.

In my first scenario, as briefly outlined in Part 10, if the data collectors find, when seeking to use their wealth of empirical evidence, that they had not gathered some specific type of data that they would need for their initial hypothesis to be tested, then their data might not be of significant scientific value for that purpose. But if they fall back to consider a different, perhaps more predictively limited alternative hypothesis for which they do have all necessary data, then their data trove does become empirically sufficient for scientific study and hypothesis-driven analysis.

And with this noted, I turn explicitly to consider big data, and as a working example I turn to consider the vast reaches of individually tagged surveillance data that the US National Security Agency (NSA) has been gathering about essentially everyone through its open-ended surveillance programs, as I have been discussing in my series: Learnable Lessons from Manning, Snowden and Inevitable Others (see Ubiquitous Computing and Communications – everywhere all the time 2, postings 227 and loosely following for Parts 1-29.)

And I approach that complex of technical and political issues with a goal of addressing one basic question: why?

• Why would agencies of the United States government actively seek to gather this type and level of data, so indiscriminately and widely and in accordance with systematic government policy and doctrine, when it is at the root of the American experience that that nation and its government should actively pursue and support individual liberties and democratic rights and principles?

I suggested one key element of an answer to that question in my just-cited national security-oriented series, and specifically cite as of particular relevance here Part 26, Part 27 and Part 28 of that series, where I explicitly discuss the emerging Obama cyber-security doctrine per se. The Obama administration, its national security agencies and their leaders seek to achieve absolute security – which means being able to identify absolutely every possible question, every possible threat assessment hypothesis that would have to be considered in anticipating and resolving any possible source of emerging national security danger, and being able to answer and resolve them all, proactively and with absolute, complete certainty, so as to neutralize those sources of risk.

When put that way, the two scenarios that I recounted here from my childhood education take on a new relevance. And as a paradigmatic example, the open-endedness of the George W. Bush and now Barack Obama administrations’ national security surveillance programs becomes more important, and even arguably essential. If the data that is carefully collected for analysis is insufficient to address specific identified risk and threat assessment hypotheses, then its gaps represent a failure in securing and maintaining national security as a whole. This mandates wider-ranging gathering of data in and of itself. And when data that is gathered has to be usable for addressing risk and threat assessment, and for the analysis of hypotheses that could not even be anticipated in principle, let alone in detail, when that data was initially being gathered, then the reliable recurrence of the novel and unexpected would demand even wider-ranging data gathering. Absolute security and its requirements mean that absolutely all possible surveillance data needs to be gathered.

I am not in any way trying to argue that the underlying reasoning behind this search for absolute security is, or even can be, sound. I am presenting a case that this governmentally presumed-sound rationale is why these big data gathering exercises are so open-ended and all-inclusive in what they vacuum in and store about all of us.

Stepping back from the particulars of this example to consider big data requirements per se:

• The more disruptively novel and unexpected the hypotheses that will have to be effectively addressed and resolved when using accumulated data as a source of testable evidence, the wider the range of data types that have to be gathered in and made available for that analysis.
• And the finer the granularity and the greater the precision and detail that are required in resolving hypotheses in general, and the greater the certainty that is needed in the utility and accuracy of conclusions reached from this hypothesis testing, the more data is required, across all data types collected.

I have framed this posting’s discussion up to here in terms of a non-business, government policy and program example because it is so big-data demanding, and because it specifically highlights the two general-principle points that I cite above as to what types of data, and how much of it, would be required in a big data repository in order to meet specific categorical types of hypothesis testing needs, as those needs are goal- and policy-driven. When I write of granularity in general in this type of context, I refer both to:

• The fineness of causal or correlational detail that a hypothesis under consideration would seek to resolve as empirically valid or not,
• And the level of statistical confidence that a conclusion has to satisfy in order to determine the validity of that hypothesis.

Finer detail in what has to be tested for validity and higher minimum standards of proof to validate a hypothesis both mean finer granularity overall.
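To make the second of these granularity considerations concrete in Bayesian terms: under a simple Beta-Binomial model (a minimal sketch of Bayesian updating, not any specific agency’s or vendor’s method, and the 30% event rate used here is an invented illustration), the posterior uncertainty around an estimated event rate shrinks only as the volume of observed data grows. Demanding tighter, higher-confidence conclusions therefore directly demands more data.

```python
import math

def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) after observing Bernoulli outcomes.

    Starts from a uniform Beta(1, 1) prior by default and returns the
    posterior mean and standard deviation of the underlying event rate.
    """
    a = prior_a + successes
    b = prior_b + failures
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std

# Same observed event rate (30%), increasing data volume: the posterior
# standard deviation -- the "granularity" achievable -- shrinks roughly
# as 1 / sqrt(n), so each extra digit of precision costs ~100x the data.
for n in (10, 100, 10_000):
    k = int(0.3 * n)
    mean, std = beta_posterior(successes=k, failures=n - k)
    print(f"n={n:6d}  posterior mean={mean:.3f}  posterior std={std:.4f}")
```

The point of the sketch is the scaling, not the specific numbers: a hypothesis that must be resolved to a finer causal detail, or validated at a higher confidence standard, sits further down this curve and so pulls in correspondingly more raw data.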

For a perhaps correspondingly fine-grained business model example, I would cite a fully individualized and customized marketing model approach that big data is coming to make possible, as discussed for example in: Big Data 1: the emergence of the demographic of one.

I am going to continue this posting’s discussion with at least the start of an analysis of the costs and benefits dynamics of big data collection and use, and particularly where the overriding goal is to be able to address hypotheses from it with as fine a granularity as possible. And I note in that context, and in anticipation of discussion to come, that this of necessity raises both operational and strategic cost and risk management issues and challenges.

Meanwhile, you can find this series and other related material at Ubiquitous Computing and Communications – everywhere all the time 2 and also in my first Ubiquitous Computing and Communications directory page. And I also include this series in my directory: Reexamining the Fundamentals.
