THE ICARUS EFFECT
A funny thing happened on the way from the router.
We are all such good students of Moore's Law, the notion that
processor speeds will double every year and a half or so, that in
any digital arena, we have come to treat it as our 'c', our measure
of maximum speed. Moore's Law, the most famously accurate prediction
in the history of computer science, is treated as a kind of
inviolable upper limit: the implicit idea is that since nothing can
grow faster than chip speed, and chip speed is doubling every 18
months, that necessarily sets the pace for everything else we do.
Paralleling Moore's Law is the almost equally rapid increase in
storage density, with the amount of data accessible on any square
inch of media growing by a similar amount. These twin effects are
constantly referenced in 'gee-whiz' articles in the computer press:
"Why, just 7 minutes ago, a 14 MHz chip with a 22K disk cost
eleventy-seven thousand dollars, and now look! 333 MHz and a 9 gig
drive for $39.95!"
All this breath-taking period doubling makes these measurements
into a kind of physics of our world, where clock speeds and disk
densities become our speed of light and our gravity - the boundaries
that determine the behavior for everything else. All this is well
and good for stand-alone computers, but once you network them, a
funny thing happens on the way from the router: this version of the
speed of light is exceeded, and from a most improbable quarter.
It isn't another engineering benchmark that is outstripping the
work at Intel and IBM, it's the thing that often gets shortest shrift
in the world of computer science - the users of the network.
Chip speeds and disk densities may be doubling every 18 months,
but network population is doubling roughly annually, half again as
fast as either of those physical measurements. Network
traffic, measured in packets, is doubling semi-annually (last
year MAE-East, a major East Coast Internet interconnect point, was
seeing twice the load every 4 months, or an 8-fold annualized
increase).
There have always been internal pressures for better, faster
computers - weather modelling programs and 3-D rendering, to name
just two, can always consume more speed, more RAM, more disk - but
the Internet, and particularly the Web and its multi-media cousins
of Java applications and streaming media, presents the first
external pressure on computers, one where Moore's Law simply
can't keep up and will never catch up. The network can put more
external pressure on individual computers than they can handle, now
and for the foreseeable future.
IF YOU SUCCEED, YOU FAIL.
This leads to a curious situation on the Internet, where any new
service risks the usual failure if there is not enough traffic, but
also risks failure if there is too much traffic. In a literal update
of Yogi Berra's complaint about a former favorite hang-out, "Nobody
goes there anymore. It's too crowded", many of the Web sites covering
the 1996 US Presidential election crashed on election night, the
time when they would have been most valuable, because so many people
thought they were a good idea. We might dub this the 'Icarus Effect'
- fly too high and you crash.
What makes this 'Icarus Effect' more than just an engineering
oversight is the relentless upward pressure on both population and
traffic - given the same scenario in the 2000 election, computers
will be roughly 8 times better equipped to handle the same traffic,
but they will be asked to handle roughly 16 times the traffic. (More
traffic than that even, much more, if the rise in number of users is
accompanied by the same rise in time spent on the net by each user
that we're seeing today.)
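The arithmetic behind these figures is easy to check. What follows is
a minimal sketch in Python, assuming each quantity doubles at a
perfectly steady rate; the function name growth_factor and the
variable names are illustrative only. Note that an 18-month doubling
time works out to a little over 6 times the power across four years,
somewhat less than the "roughly 8" used above, which only widens the
gap.

```python
def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Growth multiple for a quantity that doubles every doubling_months."""
    return 2 ** (elapsed_months / doubling_months)


FOUR_YEARS = 48  # months between the 1996 and 2000 elections

chips = growth_factor(18, FOUR_YEARS)   # Moore's Law pace
users = growth_factor(12, FOUR_YEARS)   # network population, doubling annually
packets = growth_factor(6, FOUR_YEARS)  # traffic, doubling semi-annually
mae_east_annual = growth_factor(4, 12)  # a 4-month doubling, annualized

print(f"chips x{chips:.1f}, users x{users:.0f}, packets x{packets:.0f}")
print(f"MAE-East annualized x{mae_east_annual:.0f}")
```

On these assumptions the user population alone grows 16-fold, and
packet traffic grows far faster than that, while the hardware grows
by a single-digit multiple.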
This is obviously an untenable situation - computing limits can't
be allowed to force entrepreneurs and engineers to hope for only
middling success, and yet everywhere I go, I see companies
exercising caution whenever they are contemplating any move
that will increase traffic, even if it would make for a better
site or service.
FIRST, FIX THE PROBLEM. NEXT, EMBRACE FAILURE.
We know what happens when the need for computing power outstrips
current technology - it's a two-step process. The first step beefs up
the current offering by improving performance and fighting off
failure. Then, when that line of development hits a wall (as it
inevitably does), the second step embraces the imperfection of
individual parts and adopts parallel development to fill the gap.
Ten years ago, Wall St. had a similar problem to the Web today,
except it wasn't web sites and traffic, it was data and disk
failure. When you're moving trillions of dollars around the world in
real time, a disk drive dying can be a catastrophic loss, and a
backup that can get online 'in a few hours' does little to soften
the blow. The first solution is to buy bigger and better disk
drives, moving the Mean Time Between Failure from say, 10,000 hours
to 30,000 hours. This is certainly better, but in the end, the
result is simply spreading the pain of catastrophic failure over a
longer average period of time. When the failure does come, it is the
same catastrophe as before.
Even more disheartening, the price/performance curve is
exponential, putting the necessary order-of-magnitude improvements
out of reach. It would cost far more to go from a 30,000-hour MTBF
to 90,000 hours than it did to go from 10,000 to 30,000, and going
from 90,000 to 270,000 would be unthinkably expensive.
Enter the RAID, the redundant array of inexpensive disks. Instead
of hoping for the Platonic 'ideal disk', the RAID accepts that each
disk is prone to failure, however rarely, and simply groups them
together in such a way that the failure of any one disk isn't
catastrophic, because the remaining disks hold all of the failed
disk's data in a matrix shared among them. As long as a new working
disk is put in place of each failed drive, data is lost only if two
disks fail at precisely the same time, and the theoretical MTBF of a
RAID made of ordinary disks would be something like 900 million
hours.
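The figure above comes without its arithmetic, so here is one hedged
way to get a feel for it: a minimal Python sketch of a common
back-of-envelope model for two mirrored disks. It assumes that disks
fail independently at a constant rate and that data is lost only if
the second disk dies while the first is being replaced. The function
name and the replacement windows are hypothetical, chosen for
illustration.

```python
def mirrored_pair_mttdl(disk_mtbf_hours: float, repair_hours: float) -> float:
    """Approximate mean time to data loss for two mirrored disks.

    Assumes independent failures at a constant rate, with data lost only
    when the second disk fails during the replacement window of the first.
    """
    return disk_mtbf_hours ** 2 / (2 * repair_hours)


DISK_MTBF = 30_000  # hours, the "better disk" figure from the text

for repair_hours in (0.5, 24, 72):  # hypothetical replacement windows
    mttdl = mirrored_pair_mttdl(DISK_MTBF, repair_hours)
    print(f"repair window {repair_hours:>4} h -> about {mttdl:,.0f} hours")
```

Under this model a half-hour window gives 900 million hours, and even
a three-day window leaves the pair hundreds of times more reliable
than a single 30,000-hour disk. Real arrays fall short of this because
failures are not fully independent.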
A similar path of development happened with the overtaking of the
supercomputer by the parallel processor, where the increasingly
baroque designs of single-CPU supercomputers were facing the same
uphill climb that building single reliable disks did, and where the
notion of networking cheaper, slower CPUs proved a way out of that
bottleneck.
THE WEB HITS THE WALL.
I believe that with the Web we are now seeing the beginning of
one of those uphill curves - there is no way that chip speed and
storage density can keep up with the exploding user base, and this
problem will not abate in the foreseeable future. Computers,
individual computers, are now too small, slow and weak to handle the
demand of a popular web site, and the current solution to the
demands of user traffic - buy a bigger computer - is simply
postponing the day when that solution also fails.
What I can see in outline in current web site development
is what might be called a 'RAIS' strategy - redundant arrays of
inexpensive servers. Just as RAIDs accept the inadequacy of any
individual disk, a RAIS would accept that servers crash when
overloaded, and that when you are facing 10% more traffic than you
can handle, having to buy a much bigger and more expensive server is
a lousy solution. RAIS architecture comes much closer to the
necessary level of granularity for dealing with network traffic
increases.
If you were to host a Web site on 10 Linux boxes instead of one
big commercial Unix server, you could react to a 10% increase in
traffic with 10% more server for 10% more money. Furthermore, one
server dying would only inconvenience the users who were mid-request
on that particular box, and they could restart their work on one of
the remaining servers immediately. Contrast this with the current
norm, a 100% failure for the full duration of a restart in cases
where a site is served by a single server.
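A minimal sketch of that granularity, in Python, assuming a
hypothetical ServerPool class and made-up box names. It hands out
requests in strict rotation to whichever servers are currently up,
which is a simplification of what a real load-balancing setup would
do.

```python
from itertools import count


class ServerPool:
    """Round-robin over whichever servers are currently up (illustrative)."""

    def __init__(self, names):
        self.up = list(names)
        self._ticket = count()  # running request counter

    def add(self, name):
        """Bring one more inexpensive box into rotation."""
        self.up.append(name)

    def mark_down(self, name):
        """Take a crashed box out of rotation; the others carry on."""
        self.up.remove(name)

    def pick(self):
        """Return the server that should handle the next request."""
        if not self.up:
            raise RuntimeError("no servers available")
        return self.up[next(self._ticket) % len(self.up)]


pool = ServerPool(f"box{i}" for i in range(1, 11))  # ten Linux boxes
pool.add("box11")       # 10% more traffic: add 10% more server
pool.mark_down("box3")  # one box dies; only its in-flight users notice
served_by = {pool.pick() for _ in range(100)}
print(sorted(served_by))
```

The point of the design is that capacity and failure both come in
units of one small box rather than one whole site.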
The initial RAISs are here in sites like C|NET and ESPN, where
round-robin DNS configurations spread the load across multiple
boxes. However, these solutions are just the beginning - their
version of redundancy is often simply to mirror copies of the Web
server. A true RAIS architecture will spread not only copies of
the site but also functionality. Images, a huge part of network
traffic, are 'read only' - a server or group of servers optimized to
handle only images could read them from WORM drives and serve the
most popular ones from RAM. Incoming CGI data, on the other hand,
can potentially be 'write only', simply recorded on removable media
which can be imported into a database at a later date, on another
computer, and so on.
This kind of development will ultimately dissolve the notion of
discrete net servers, and will lead to server networks, where an
individual network address does not map to a physical computer but
rather to a notional source of data. Requests to and from this IP
address will actually be handled not by individual computers,
whether singly or grouped into clusters of mirroring machines, but
by a single-address network, a kind of ecosystem of networked
processors, disks and other devices, each optimized for handling
certain aspects of the request - database lookups, image serving,
redirects, etc. Think of the part of the site that handles database
requests as an organ, specialized to its particular task, rather
than as a separate organism pressed into that particular service.
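To make the 'organ' idea concrete, here is a minimal Python sketch of
a single front door that hands each request to a pool specialized for
that kind of work. Everything in it is an assumption made for
illustration: the pool names, the URL prefixes and the list of image
extensions are hypothetical, not a description of any existing site.

```python
import posixpath

# Hypothetical pools, each optimized for one kind of work.
ROUTES = {
    "image": ["img1", "img2"],        # read-only, served from RAM or WORM
    "form": ["log1"],                 # write-only recording of CGI input
    "database": ["db1"],              # database lookups
    "page": ["web1", "web2", "web3"], # everything else
}

IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}


def classify(path: str) -> str:
    """Decide which specialized pool a request path belongs to."""
    if path.startswith("/cgi-bin/"):
        return "form"
    if path.startswith("/search"):
        return "database"
    extension = posixpath.splitext(path)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "image"
    return "page"


def route(path: str, request_number: int) -> str:
    """Pick a machine from the matching pool, rotating by request number."""
    pool = ROUTES[classify(path)]
    return pool[request_number % len(pool)]


for n, path in enumerate(["/index.html", "/logo.GIF", "/cgi-bin/vote", "/search?q=x"]):
    print(path, "->", route(path, n))
```

From the outside there is still one address; the division of labor
behind it is invisible to the user.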
THE CHILD IS FATHER TO THE MAN.
It has long been observed that packet switching started out, in
the early days of the ARPANet, by piggy-backing on the
circuit-switched network. It is now about to overtake that network
in total traffic, which will happen this year, and will almost
certainly subsume it completely within a decade. I believe a similar
process is happening
to computers themselves: the Internet is the first place where we
can see that cumulative user need outstrips the power of individual
computers, even taking Moore's Law into account, but it will not be
the last. In the early days, computers were turned into networks,
with the cumulative power of the net rising with the number of
computers added to it.
In a situation similar to the packet/circuit dichotomy, I believe
that we are witnessing another such tipping point, where networks
are brought into individual computers, where all computing
resources, whether cycles, RAM, storage, whatever, are mediated by a
network instead of being bundled into discrete boxes. This may have
been the decade where the network was the computer, but in the next
decade the computer will be the network, and so will everything
else.