Platt Perspective on Business and Technology

Usage challenges that drive the evolution and growth of information technology – 2

Posted in business and convergent technologies by Timothy Platt on January 9, 2012

• When you sufficiently change any system quantitatively you reach a point where you of necessity change it qualitatively too.
• Put somewhat differently but to the same point, linear scalability is always limited for any real world system – including those designed to be linearly scalable.

This is my second installment in a series that I have in a fundamental sense been thinking about since I first saw and worked on a computer, writing programming code for an early generation mainframe in the late 1960s (see Part 1 of this series). And the topic here is one of what drives innovation:

• The technology itself as the innovation driver, or
• New and novel use of that technology with emerging application needs driving innovation and forcing development of new technology to meet it,
• Or some synergistic combination with both playing a significant, intertwined role.

At least to this point in this series I have focused on the second of those options. In Part 1 I touched upon the by now well known example of computer-based and online gaming as a user-side driver for advancing information technology, and I then turned to a new and emerging innovation driver of that same user-side type, one that joins the still advancing pressures of gaming and of a range of other usage and application areas. That new working example, still just coming into its own, is bioinformatics.

And I continue here from that with more on bioinformatics as a technology driver, focusing in this posting on its role in the advancement of the technology infrastructure for the internet as a whole.

Consider the implications for internet scalability when genomics, and the sequencing and analysis of complete genomes, becomes a basic tool in biology, with essentially every species and subspecies sequenced and analyzed as part of its basic identification; and when this technology becomes a part of mainstream medicine too, with every individual person fully DNA sequenced, and perhaps sequenced many times where they have diseases such as cancer that cause somatic cell mutations.

Consider the numbers in setting a scale for this information gathering, processing and transmitting activity. For the sake of argument, let's consider an accumulated pool of 10 billion (thousand million) genomes, each an average of 3 billion base pairs in extent (the approximate size of the human genome). That would be some 30 quintillion base pairs of data, and at even one byte per base pair, and without accompanying identifying or analytical information, that would mean some 30 quintillion bytes of data. Adding in detailed analytical and other metadata would probably bring this up to over 100 quintillion bytes: 100 million million million bytes of information. And that is when only DNA sequences are considered, out of the still larger pool of potential bioinformatics information. In computer storage terms that represents approximately 100 exabytes, and in fact bioinformatics data as a whole can be expected to rapidly expand into the multiple zettabyte range and beyond. But the important point here is not the scale of this data, but the fact that all of this information is going to be actively in motion through our information networks, and all of the time, not just sitting in fixed locations in local storage somewhere. And it will move in packages measured in multiple gigabytes.
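The back-of-envelope numbers above can be checked in a few lines. This is a minimal sketch, assuming the post's own figures of 10 billion genomes, roughly 3 billion base pairs each, stored at about one byte per base pair, with metadata inflating the total to the post's ~100 quintillion byte estimate:

```python
# Back-of-envelope check of the data-volume figures in the text.
genomes = 10_000_000_000                 # 10 billion genomes (post's assumption)
base_pairs_per_genome = 3_000_000_000    # ~size of the human genome

# At roughly one byte per base pair, raw sequence data alone:
raw_bytes = genomes * base_pairs_per_genome
print(f"raw sequence data: {raw_bytes:.2e} bytes")  # 3.00e+19 = 30 quintillion

EXABYTE = 10**18
with_metadata = 100 * EXABYTE            # the post's rough metadata-inclusive estimate
print(f"with metadata: {with_metadata // EXABYTE} exabytes")
```

The raw figure alone, 3 x 10^19 bytes, already lands at 30 exabytes before any metadata is counted.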

Data compression will ease the bandwidth requirements for this but only to a limited extent. And this data will simply join the flow of ever-increasing information storage, processing and transfer requirements that are arising from all sources and going online and through the internet – and in increasingly larger package sizes.

It should be noted that the data transfers that I write of here are not going to be smoothly, homogeneously distributed throughout the internet, as to either sending sources or receiving sites. Just considering data sources: even when the cost of sequencing DNA drops to $100 or less per complete human genome (or its equivalent in scale), the ease of online access and the tremendous cost of large scale sequencing infrastructure make it likely that much if not most of this work will still be carried out by businesses dedicated to that work.

• If a sequencing center processes a hundred thousand or even a million sequences a day, or their equivalent in information volume with raw sequence data and analysis data included (metadata about those sequences that will quickly come to greatly exceed the raw sequence data in volume), it is going to have to process and transfer on the order of petabytes of data daily through the internet and through its local access and connection points.
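The petabytes-per-day claim in that bullet point follows directly from the earlier per-genome figure. A quick sketch, assuming a hypothetical center at the high end of the post's range (a million genome-equivalents per day, ~3 GB of raw data each, before metadata):

```python
# Rough daily output of a hypothetical large sequencing center,
# using the post's own per-genome size assumption.
genome_bytes = 3 * 10**9        # one raw human-scale genome, ~3 GB
sequences_per_day = 1_000_000   # high end of the post's range

daily_raw = genome_bytes * sequences_per_day
PETABYTE = 10**15
print(f"daily raw output: {daily_raw // PETABYTE} PB")  # 3 PB, before metadata
```

Three petabytes of raw sequence alone, daily, with metadata expected to multiply that figure further.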

With that I turn to our present broad bandwidth connectivity systems and options. As of this writing, the United States is behind much of the world in bandwidth access for connecting into the internet, with 50 million bits per second (bps) common as a high speed connection. At 8 bits per byte, that translates to 6.25 million bytes per second, and a single raw data-only human genome sequence would take 8 minutes to send at this rate. (Note: that 50 million bps rate is for download; connectivity rates for uploads tend to be many times slower, meaning that a home connection might take an hour to send that single genome sized data load, even with an optimized system.)

Singapore currently claims the title of offering every citizen and every business the broadest bandwidth connectivity of any country, with a standard offering of one billion bits per second internet access. True, two-way connectivity at that rate would drop transmission and reception of a complete, standard-unit genome sequence to 24 seconds, which does not seem very long. But such a system working full time, and with no need to ever repeat-send any single genome or part thereof, would be limited to 3,600 genomes per day, and this would require a special dedicated line. With associated metadata accompanying raw sequence data this would probably drop to 1,200 or less, especially where large numbers of data transmissions would primarily be of this processed metadata, with its information about the meaning and significance of raw data findings. And as of this writing there are a few facilities that already generate more raw genomic data than that 3,600 genome benchmark level on their peak production days.
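The transfer-time figures in the last two paragraphs reduce to the same arithmetic: bits to send divided by bits per second. A sketch, using the post's ~3 GB raw genome and its two connection speeds:

```python
# Transfer-time arithmetic from the post: a ~3 GB raw genome over a
# 50 Mbps link vs. a 1 Gbps link, and the per-day ceiling on a fully
# saturated, dedicated 1 Gbps line with no protocol overhead.
genome_bits = 3 * 10**9 * 8     # ~3 GB of raw sequence, in bits

for label, bps in [("50 Mbps", 50 * 10**6), ("1 Gbps", 10**9)]:
    seconds = genome_bits // bps
    print(f"{label}: {seconds} s per genome")
# 50 Mbps -> 480 s (the post's 8 minutes); 1 Gbps -> 24 s

genomes_per_day = 86_400 // 24  # seconds per day / seconds per genome
print(f"saturated 1 Gbps ceiling: {genomes_per_day} genomes/day")  # 3600
```

Real-world throughput would of course be lower still, given protocol overhead and the metadata load the post describes.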

This posting is all about new and emerging potential information flow bottlenecks, and about internet scalability as challenged by bioinformatics as a test case, and perhaps a useful-in-itself stress test. And it is applications like this that are pushing current internet development past the structural and conceptual simplicity of linear expandability and quantitative-only change.

I am going to continue this discussion in my next series installment, there looking at bioinformatics data processing per se, and the emerging goal of exascale computing as a computational benchmark. You can find this and related postings at Ubiquitous Computing and Communications – everywhere all the time.
