Platt Perspective on Business and Technology

Big data and the assembly of global insight out of small scale, local and micro-local data 1: assembling big pictures out of little pieces in three examples

Posted in business and convergent technologies, reexamining the fundamentals by Timothy Platt on December 16, 2013

How can you best determine both when a flu season has started, and the incidence level of this disease as it maps out across a geographic region? The traditional answer would involve conducting a detailed, and I add time-consuming and expensive, epidemiological study. And with the lag times involved, any findings arrived at would be out of date and primarily of historical value by the time the requisite data was collected, analyzed, and developed into a report. Evidence strongly suggests that you can develop at least as accurate an estimate of the disease rate, with details as to numbers of infections as geographically distributed, simply by looking to see where people conduct online searches from for key search terms and phrases such as “flu”, “flu vaccine” and “flu treatment”, with appropriate alternative wordings added in (e.g. where “influenza” is used instead of “flu”). And if you add in search queries for the self-treatment options that people who would not go to a physician or clinic would pursue, you might very well capture data from demographic groups that a more standard epidemiological study might underrepresent or even miss – and this would be much closer to real-time data than would be possible with a compiled epidemiological report.

For references in support of this, I would cite:

• An article from the February 19, 2009 issue of Nature: Google-Driven Epidemiology
• And for a specifically flu-related reference I cite Google Flu Trends. (This is a link to United States flu incidence findings, but you can switch to see corresponding data from other countries with a pull-down menu search tool to the left. Or you can switch to look at incidence rate findings within a country with a second pull-down selector for finer-grained geographic distribution data.)
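As a rough illustration of the search-query approach described above, the following Python sketch tallies what share of each region's queries mention a flu-related term. This is only a minimal sketch under stated assumptions: the term list, the function names, and the sample log are all made up for illustration, and it does not represent how Google Flu Trends or any other real service actually works.

```python
from collections import Counter

# Hypothetical term list, loosely based on the search terms named in the text.
FLU_TERMS = ("flu", "influenza")


def is_flu_query(query: str) -> bool:
    """Return True if any whole word in the query is a flu-related term."""
    words = query.lower().split()
    return any(term in words for term in FLU_TERMS)


def flu_query_share(log):
    """Given (region, query) pairs, return {region: share of flu-related queries}."""
    totals = Counter()
    flu = Counter()
    for region, query in log:
        totals[region] += 1
        if is_flu_query(query):
            flu[region] += 1
    return {region: flu[region] / totals[region] for region in totals}


# Made-up sample data; a real log would hold vastly more queries per region.
sample_log = [
    ("NY", "flu vaccine near me"),
    ("NY", "pizza delivery"),
    ("NY", "influenza symptoms"),
    ("NY", "weather"),
    ("TX", "flu treatment at home"),
    ("TX", "football scores"),
    ("TX", "movie times"),
    ("TX", "car repair"),
]

shares = flu_query_share(sample_log)
for region, share in sorted(shares.items()):
    print(region, share)
```

Dividing by each region's total query count, rather than reporting raw counts, keeps heavily populated regions from looking sicker simply because they search more.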

How can you most effectively identify new fads, and as close to real-time as possible? A traditional answer to that might mean gathering marketing data from age-defined and otherwise demographically defined groups that would be expected to pick up on and promote new fads among their ranks. Once again, aggregate analysis of search engine results can offer value, but as fads become fads from the way they are spread through social networks, a best approach would probably have to include ongoing efforts to identify trending message topics as shared through sites such as Twitter. And if you look for possible fad leads from active Twitter posters who have large numbers of followers, and who have wide reach in what they say and suggest, and if you mesh this data with at least first-round retweeting numbers for tweets shared from these posters, that should offer real insight of marketable value.
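One possible way to mesh follower counts with first-round retweet numbers, as suggested above, is sketched here in Python. The follower threshold, the function name, the topic tags, and the sample figures are all assumptions made up for illustration; the text does not prescribe any particular weighting or cutoff.

```python
from collections import defaultdict


def rank_fad_leads(tweets, min_followers=10_000):
    """Rank topics by total first-round retweets from widely followed posters.

    tweets: iterable of (topic, poster_followers, first_round_retweets).
    min_followers: hypothetical cutoff for "large numbers of followers".
    """
    retweets_by_topic = defaultdict(int)
    for topic, followers, retweets in tweets:
        if followers >= min_followers:
            retweets_by_topic[topic] += retweets
    # Highest retweet totals first.
    return sorted(retweets_by_topic.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data for illustration only.
sample_tweets = [
    ("#looms", 50_000, 1_200),
    ("#looms", 20_000, 300),
    ("#cronuts", 80_000, 900),
    ("#looms", 150, 5_000),      # small account: below the cutoff, so ignored
    ("#planking", 12_000, 40),
]

ranking = rank_fad_leads(sample_tweets)
for topic, total in ranking:
    print(topic, total)
```

Filtering on followers first and then ranking by retweets is just one design choice; a weighted score combining both numbers would be an equally reasonable starting point.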

And an analysis of patterns in search engine queries over time, coupled with data on trending topics in Twitter and other short message services and other online channels, can and does offer insight as to how a candidate for political office is faring in their campaign leading up to an election.

One finding coming out of the 2008 US presidential election was that Republicans were more likely to do searches for “Barack Hussein Obama” on Google, where Democrats were more likely to search for information on “Barack Obama” – no middle name. And people were more likely to search for information on this candidate through sites such as Google if they felt involved enough to be more likely to actually vote. Non-voters were much less likely to make that effort. So search engine queries as aggregated across populations with time can provide real insight as to levels of candidate support and certainly when meshed with more traditional polling data. And search results data can provide insight into likely voter preference for members of demographic groups who are less likely to answer polling questions too. And at least demographically, this can indicate which way people who are undecided are leaning in their candidate selection, where if asked directly they would more likely simply say they are still undecided. Search engine query word choice indicates who a prospective voter is listening to for news coverage and opinions, insofar as many if not most people pick up on the phrasing of those they more actively and positively pay attention to.

For data from sites such as Twitter, and following through on the middle name or no middle name example from above, consider the value of tracking #hashtags that identify a candidate like Obama by name, but with or without the middle name. Or look for levels and trends in hashtag activity for any topic identifiers that can be used to identify political preferences or opinions, where people of differing political persuasion would use different but identifiable key wording.
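The hashtag tracking idea above can be sketched in a few lines of Python that count, per day, how often each name variant appears. This is a minimal sketch: the two hashtag spellings, the labels, and the sample posts are hypothetical, chosen only to mirror the middle name or no middle name example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical hashtag variants mapped to a label for each wording choice.
VARIANTS = {
    "#barackobama": "no_middle_name",
    "#barackhusseinobama": "middle_name",
}
HASHTAG = re.compile(r"#\w+")


def variant_counts_by_day(posts):
    """Given (day, text) pairs, return {day: {label: count}} for tracked hashtags."""
    counts = defaultdict(Counter)
    for day, text in posts:
        for tag in HASHTAG.findall(text.lower()):
            label = VARIANTS.get(tag)
            if label:
                counts[day][label] += 1
    return {day: dict(counter) for day, counter in counts.items()}


# Made-up sample posts for illustration only.
sample_posts = [
    ("2008-10-01", "Rally tonight #BarackObama"),
    ("2008-10-01", "Debate notes #BarackHusseinObama"),
    ("2008-10-01", "#BarackObama on the economy"),
    ("2008-10-02", "#barackobama #vote"),
]

daily = variant_counts_by_day(sample_posts)
for day in sorted(daily):
    print(day, daily[day])
```

Comparing the daily counts for each label over time gives the levels and trends in hashtag activity that the paragraph above describes.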

In this, and to make sure you were collecting as representative a sample for polling analysis as possible, you would want to capture numbers and trends from a wide range of online and social media sources, and you would want to be able to correlate this data with data on which demographic groups are most likely to be posting where. But bottom line, this type of approach can offer insight not otherwise as readily available through traditional polling tools and channels. To add one more example of the type of data that can be used in this mix, and a type that is readily trackable, how many people friend a candidate’s Facebook page or post to their wall? How many of these people post favorable, positive messages or express interest in helping out with their campaign? This is still an early stage resource, but it is already a powerful, election results shaping capability, particularly when ongoing findings from this type of data collection and analysis are used at least close to real-time leading up to an election to fine-tune campaign efforts. As one of many possible reference citations here, I note Voting 2.0 with their tagline “the future of voting, participation and community in an online world.”

I have just noted three seemingly separate and distinctive data collection and usage scenarios here, but they all very clearly have at least one core element in common.

• They are all examples of how big-picture understandings that are actionable can be developed through assembly of massive amounts of small-picture data, and from data shared online by vast numbers of people who do not for the most part know each other or even specifically of each other.

This is all about the assembly of big data insight from small-scale and even micro-localized data. I write this as a first installment to a new series in which I will look at how big data is assembled and used. For specific background information related to big data per se, I cite my already posted series: Big Data (see Ubiquitous Computing and Communications – everywhere all the time, postings 177 and following for its Parts 1-7, and Page 2 of that directory, postings 207 and loosely following for Parts 8-11.) I am going to turn in my next series installment to more specifically consider local and micro-local data. Meanwhile, you can find this series and other related material at that second directory page. And I also include this series in my directory: Reexamining the Fundamentals.
