Davey Winder
Sunday, 27 July 2008 05:06
The current index would be even bigger than that
astonishing figure of 1 trillion if Google did not actively filter out
multiple URLs with exactly the same page content. "Even after removing
those exact duplicates, we saw a trillion unique URLs," Alpert and Hajaj
say, adding "the number of individual web pages out there is growing by
several billion pages per day."
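To illustrate what filtering out exact duplicates can mean in practice, here is a minimal sketch in Python. It is not Google's method, which is not described here. The function name group_by_content, the example URLs and the choice of SHA-256 hashing are all assumptions made purely for illustration.

```python
import hashlib


def group_by_content(pages):
    """Group URLs whose page content is exactly identical.

    `pages` maps a URL to its content (a str). Returns a dict mapping
    a content digest to the list of URLs serving that content.
    """
    groups = {}
    for url, content in pages.items():
        # Identical content always produces the same digest.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(url)
    return groups


# Hypothetical data: two different URLs serve the same content.
pages = {
    "http://example.com/a": "hello web",
    "http://example.com/a?ref=1": "hello web",
    "http://example.com/b": "another page",
}
groups = group_by_content(pages)
unique_count = len(groups)  # number of distinct page contents
```

Counting the groups rather than the URLs gives the number of distinct pages, which is why the number of URLs seen can be far larger than the number of unique pages.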
The truth is that nobody knows exactly how big
the web is or how many absolutely unique pages it contains. It can only
ever be a best-guess estimate because even Google has to admit it simply
does not have the resources or time to look at them all.
"Strictly speaking," Google says, "the number of pages out there is
infinite." By way of example, it offers the case of web calendars, which
often incorporate a link to the next day's activities. If Google
followed these links, it argues, it would be stuck in an endless loop.
"We're not doing that, obviously, since there would be little benefit
to you."
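The calendar problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how Google's crawler works: the next_day_links function stands in for a hypothetical calendar site where every page links to the following day, and a depth cap is just one simple way a crawler might avoid following such links forever.

```python
from collections import deque


def next_day_links(day):
    """Hypothetical calendar site: every page links to the following day."""
    return [day + 1]


def crawl(start, get_links, max_depth):
    """Breadth-first crawl that stops following links at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links any further from this page
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


# Without max_depth this crawl would never finish.
visited = crawl(0, next_day_links, max_depth=30)
```

With the cap in place the crawl ends after a fixed number of steps, at the cost of never seeing the pages beyond it.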
In fact, Google does not index every one of those trillion pages
either, because many of them are reported to be very similar to each
other, or contain auto-generated content that is not of much interest
to the general web-searching public.
Google does claim to have the most comprehensive index of any search
engine, however, and we have no inclination to argue with them there.
But imagine just how much better it could be if it were to index the
so-called Deep Web.
Back in the year 2000, when the Google index hit a billion pages,
remember, a University of Michigan study was claiming that the Deep Web
contained something in the region of 550 billion individual documents.
Do the math on that to take account of Google's new 1 trillion page index figure, and that's what we call really big...