The Top Ten Largest Databases In The World
By Lewis Keller
2/27/2012
Introduction
When I was presented with the opportunity to research the largest databases in the world,
I planned to discuss the top five in detail. However, I came across a list of the top
10 largest databases in the world, so I decided to expand my discussion to cover the whole list.
One thing that I’m not surprised about is that the top two are owned by our government (the
Library of Congress and the Central Intelligence Agency, respectively). However, what I am
surprised about is that Google made it only to #7 on the list. Considering the vast
amount of knowledge that it makes available to the public, I thought that it would be somewhere
within the top five. Overall, though, the sizes of these databases are pretty astounding, as
several of them are hundreds of terabytes in size.
#1: The Library of Congress
The Library of Congress has 130 million documents altogether. It holds so much text
that, if all of it were digitized, it would come to 20 terabytes in total! It has 5 million
digital documents, and over 10,000 items are added to the database every day. However,
many of these items are restricted from the general public.
I decided to test their online system by doing a search for “Vietnam”. I immediately ran
into their 10,000-item limit, which shows me how immense their online system is. The newest
document I came across in my search was an article from 1991, and they had several documents
from the 1960s and the 1970s. The only thing that I don’t like about it is that the system gave
me only five minutes to do my search before it would kick me out.
#2: The Central Intelligence Agency
One interesting thing about the CIA’s database is that its size is unknown, due to the
number of classified files that it contains. However, there are portions of it available to the
public, such as The World Factbook and the contents of the Freedom of Information Act
Electronic Reading Room. Another thing about the database is that it contains statistics on more
than 250 countries and entities.
The Electronic Reading Room makes some (potentially sensitive) government documents
available to the public, which can help someone find a copy of a previously passed act of law to
use for research. So, with my high level of curiosity, I decided to test it, too. I did a search on
Africa, and was able to come up with 98 items, which were available in both GIF and PDF
formats.
#3: Amazon.com
With the wealth of items that Amazon has for sale online, one would expect it to have
a large database. That expectation is right, because Amazon’s database contains 42
terabytes of data. This database gathers and keeps massive amounts of intimate information
about its millions of shoppers, including their religion, sexual orientation, ethnicity and income.
This database combines information disclosed voluntarily by customers with facts gleaned from
public databases. This gives Amazon more detailed information about its customers than any
other retailer.
#4: YouTube
In 2006, back when YouTube was just starting to gain its foothold in our society, its
database was projected to hold 45 terabytes of data. I seriously can’t imagine how many
terabytes of data are on there now, six years later. The database is open to people who want to
access it, which I find kind of astonishing, because of the possibility of users’ personal data
being exposed to the public. However, in order to gain access to the database, you must request
special developer and client keys. Because videos vary so much in size and length,
estimating the size of YouTube’s database is a difficult task. YouTube’s data API is
geared towards developers who have experience with server-side programming
languages.
#5: ChoicePoint
Consisting of 250 terabytes of personal data, ChoicePoint's database of 17 billion public
records is used for background checks, insurance applications and tenant screening. The database
contains information on approximately 250 million people. One thing that I don’t like about
ChoicePoint is that it sells data to the highest bidders, which include the U.S. government.
However, much of its business is regulated by the Fair Credit Reporting Act.
#6: Sprint
Sprint has 53 million subscribers worldwide, and its database is very expansive. Large
telecommunication companies like Sprint are notorious for having immense databases to keep
track of all of the calls taking place on their networks. The database has recorded 2.85 trillion
data insertions (the largest number in the world), and it processes 365 million call detail
records per day. However, phone information has previously been leaked out of the database.
#7: Google
Google’s database contains virtual profiles of a countless number of users, and it contains
all of the words that are used in search terms. Google searches account for more than 50% of all
internet searches. Like the CIA’s database, the size of Google’s database is unknown (Google
keeps that information locked away).
For a search through Google’s database to work, a crawler visits a page, copies the
content and follows the links from that page to the pages linked to it, repeating this process over
and over until it has crawled billions of pages on the web.
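The crawling process described above can be sketched in a few lines of Python. This is only a toy illustration and not Google’s actual crawler: the tiny in-memory “web” (`TOY_WEB`), the page addresses and the `crawl` function are all made up for this example, and no real web pages are fetched.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to (content, outgoing links).
TOY_WEB = {
    "a.example/home": ("Welcome", ["a.example/about", "b.example/news"]),
    "a.example/about": ("About us", ["a.example/home"]),
    "b.example/news": ("Headlines", ["b.example/archive", "a.example/home"]),
    "b.example/archive": ("Old stories", []),
}


def crawl(start, web, max_pages=1000):
    """Visit a page, copy its content, follow its links, and repeat."""
    index = {}              # page address -> copied content
    queue = deque([start])  # pages waiting to be visited
    seen = {start}          # pages already queued, so none is visited twice
    while queue and len(index) < max_pages:
        page = queue.popleft()
        if page not in web:
            continue        # dead link: nothing to copy
        content, links = web[page]
        index[page] = content           # copy the content
        for link in links:              # follow the links from this page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


pages = crawl("a.example/home", TOY_WEB)
print(len(pages), "pages crawled")
```

A real crawler would download each page over the network and would have to cope with billions of pages, but the loop of visiting, copying and following links is the same idea.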
#8: AT&T
AT&T’s database contains 323 terabytes of data, and has 1.9 trillion phone call records.
AT&T is so careful with its records that it has maintained calling data from decades ago,
when the technology to store hundreds of terabytes of data did not yet exist. As a former
AT&T customer, I have to say that this is very impressive, because one never
knows when such a call record might wind up putting somebody in jail over a crime they
committed 20 years ago.
#9: NERSC
The National Energy Research Scientific Computing Center (NERSC) database comprises 2.8
petabytes of data, and is operated by more than 2,000 computer scientists. Some of the
information that it includes pertains to simulations of the early universe, atomic energy
research, and more. What distinguishes it from others is its successful creation of an
environment that makes its resources usable for research.
#10: The World Data Centre for Climate
This database is, by far, the largest database in the world! It contains 330 terabytes of
web and climate simulation data, and 6 petabytes of additional data on magnetic tape. The database
is so large that it has to be hosted on a machine that cost 35 million euros ($46,942,000).
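To put the sizes quoted in this paper side by side, here is a rough Python sketch that converts them all to terabytes and ranks them. It assumes decimal units (1 petabyte = 1,000 terabytes), which is my own assumption, and it leaves out the CIA, Sprint and Google because no size in bytes is given for them. The labels in the table are just names chosen for this example.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 petabyte = 1,000 terabytes

# Sizes in terabytes, taken from the figures quoted in this paper.
sizes_tb = {
    "Library of Congress (text)": 20,
    "Amazon": 42,
    "YouTube (2006 projection)": 45,
    "ChoicePoint": 250,
    "AT&T": 323,
    "NERSC": 2.8 * TB_PER_PB,
    "World Data Centre for Climate": 330 + 6 * TB_PER_PB,  # disk plus tape
}

# Sort from largest to smallest and print a simple table.
ranked = sorted(sizes_tb.items(), key=lambda item: item[1], reverse=True)
for name, tb in ranked:
    print(f"{name:<32}{tb:>10,.0f} TB")
```

On these numbers, the World Data Centre for Climate (6,330 terabytes with its tape archive included) is more than twice the size of NERSC, the next largest database with a stated size.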
Conclusion
In conclusion, with the immense amount of data that they contain, each of these databases
helps the general public find something that they want and/or need in some fashion. More
importantly, though, they set a precedent for future databases. They do it through their size, their
accuracy, and the data that they contain. I honestly think that databases will continue to grow in
all three categories, thus providing more and more information to those who request it.
Bibliography
Credit.com. "Credit.com." 12 Questions for ChoicePoint. Web. 25 Feb. 2012.
<http://www.credit.com/credit_information/credit_law/Questions-for-Choicepoint.jsp>.
Dennyson, Robert. "Top 10 Largest Databases in the World." Beyondrelational.com. 01 July
2011. Web. 25 Feb. 2012.
<http://beyondrelational.com/modules/1/justlearned/388/tips/9212/top-10-largest-
databases-in-the-world.aspx>.
"Freedom of Information Act." CIA FOIA. CIA. Web. 25 Feb. 2012.
<http://www.foia.cia.gov/search.asp>.
Google. "Technology Overview - Company." Technology Overview - Company. Web. 26 Feb.
2012. <http://www.google.com/intl/en/about/company/tech.html>.
Harris, Craig. "Amazon Database Would Put Shoppers' Intimate Details on the Line."
Seattlepi.com. Seattlepi, 10 Aug. 2006. Web. 25 Feb. 2012.
<http://www.seattlepi.com/business/article/Amazon-database-would-put-shoppers-
intimate-1211419.php>.
Lee, Kevin. "What Is a Database on YouTube?" EHow. Demand Media, 04 Jan. 2012. Web. 25
Feb. 2012. <http://www.ehow.com/info_12217150_database-youtube.html>.
"LG Optimus Slider Aka Gelato Shows up in Sprint Database with September 11 Release Date."
Phone Arena. 13 June 2011. Web. 26 Feb. 2012. <http://www.phonearena.com/news/LG-
Optimus-Slider-aka-Gelato-shows-up-in-Sprint-database-with-September-11-release-
date_id19516>.
"Library of Congress Online Catalogs." Library of Congress Online Catalogs. Web. 25 Feb.
2012. <http://catalog.loc.gov/>.
"Model & Data: World Data Center for Climate (WDCC)." Model & Data: Welcome to the
Model & Data Homepage. 19 Feb. 2008. Web. 26 Feb. 2012.
<http://www.mad.zmaw.de/wdc-for-climate/>.
NERSC. "About NERSC." NERSC: National Energy Research Scientific Computing Center.
Web. 26 Feb. 2012. <http://www.nersc.gov/about/>.
"Top 10 Largest Databases in the World." Focus. Focus, Inc., 2012. Web. 25 Feb. 2012.
<http://www.focus.com/fyi/10-largest-databases-in-the-world/>.