What Is Big Data? Craig C. Douglas University of Wyoming

Upload: sofia-lattner

Post on 29-Mar-2015


Page 1: What Is Big Data? Craig C. Douglas University of Wyoming

What Is Big Data?

Craig C. Douglas
University of Wyoming


What Is Big Data?... It Depends

Unit            Approximately   10^n   Related to
Kilobyte (KB)   1,000 bytes     3      Circa 1952 computer memory
Megabyte (MB)   1,000 KB        6      Circa 1976 supercomputer memory
Gigabyte (GB)   1,000 MB        9      2013 typical 16 GB memory stick
Terabyte (TB)   1,000 GB        12     2012 largest SSD in a laptop
Petabyte (PB)   1,000 TB        15     250,000 DVDs or the entire digital library of all known books written in all known languages
Exabyte (EB)    1,000 PB        18     175 EB copied to disk in 2010 (est.)
Zettabyte (ZB)  1,000 EB        21     2 ZB copied to disk in 2011 (est.)

32 KB: Apollo 11 computer memory (1969)
32 GB: Smart phone memory (2014)


What Is Big Data?... It Depends

• What if time counts?
  – Given a time period t, how much data can be read and written?
  – This changes over time as technology changes.
• What if the quantity of data counts?
  – How long does it take to read and write data?
  – This changes over time as technology changes.
• The definition of Big Data is fluid, not static.
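The two questions above are the same relation read in two directions: data moved equals throughput times time. The sketch below works the arithmetic in Python. The throughput of 500 MB/s is an assumed, purely illustrative figure (it is not taken from the slides), and the function names are hypothetical.

```python
# Decimal units (1 KB = 1,000 bytes), matching the units slide.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

# ASSUMPTION: 500 MB/s sustained throughput, chosen only for illustration.
THROUGHPUT = 500 * UNITS["MB"]


def seconds_to_read(size_bytes, bytes_per_second):
    """Quantity counts: how long does it take to stream size_bytes?"""
    return size_bytes / bytes_per_second


def bytes_in(period_seconds, bytes_per_second):
    """Time counts: how much data can be moved in a period t?"""
    return period_seconds * bytes_per_second


if __name__ == "__main__":
    for name in ("GB", "TB", "PB"):
        t = seconds_to_read(UNITS[name], THROUGHPUT)
        print(f"1 {name}: {t:,.0f} s ({t / 3600:.2f} h)")
```

Changing `THROUGHPUT` to match newer hardware changes every answer, which is the sense in which the definition is fluid.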


Some Sources of Big Data

• Interactions with dynamic databases
• Internet data
• City or regional transportation flow control
• Environment and disaster management
• Oil/gas fields or pipelines, seismic imaging
• Credit cards and online businesses
• Government or industry regulation/statistics
• Dynamic data-driven apps


Why is Big Data a Hot Topic?

• Open positions in data analytics by 2020 (USA)
  – up to 200,000 open positions
  – might only be 140,000 open positions
• Bureau of Labor Statistics projects that 70% of all newly created jobs across all STEM fields during the 2010s,
  – across engineering, the physical sciences, the life sciences, and the social sciences,
  – will be in computer science


Unprecedented Opportunities

• Significant contributions to the development of these transformative technologies have been made from diverse fields including:
  – mathematics
  – natural sciences
  – engineering
  – social sciences
  – arts and entertainment industries
  – business world


Unprecedented Opportunities

• Algorithm and software development have belonged to computer science over the past 50 years:
  – Computer science researchers have designed and implemented the algorithms and data structures, languages, models, tools, and abstractions that have enabled these transformational technology developments


Quick summary

• Simulation-oriented computational science is transformational science, but is only a niche in the grand scheme of things.

• Big data computing capabilities must be broadly available in any institution that strives to compete in the coming decade.

• If not, an institution will simply cease to be competitive, similar to not joining the ARPAnet or CSnet in the 1970s and 1980s.


Similarities in Sentences in Big Files


Big File Format

• One line per sentence with no punctuation
• Each word is separated by one blank
• All lower case
• Multiple languages and gibberish
• Watch for an extra blank at end of some lines
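A minimal sketch of reading this format in Python, assuming the file is UTF-8 text (the slides do not state an encoding). The function names `parse_line` and `read_sentences` are hypothetical.

```python
def parse_line(line):
    """Split one line of the big file into its words.

    str.split() with no argument splits on runs of whitespace and drops
    leading and trailing blanks, so the extra blank at the end of some
    lines (and the newline) never produces an empty word.
    """
    return line.split()


def read_sentences(path):
    """Yield each non-empty sentence as a tuple of words, one per line."""
    # ASSUMPTION: UTF-8 input; undecodable bytes are replaced, not fatal,
    # because the file mixes several languages and gibberish.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            words = parse_line(line)
            if words:
                yield tuple(words)
```

Comparing parsed word tuples rather than raw lines means that two sentences differing only by the trailing blank are treated as exact duplicates.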


Goals

• In the big file of sentences:
  – Eliminate similar sentences
  – Find similar sentences of some distance or less
• Either goal is hard work if the file has enough sentences
• Both goals are of about the same hardness
• Methods in Chapter 3 of Ullman et al.'s Data Mining book are useful


Goal 1

• Eliminate all duplicate lines (distance 0)
• Eliminate all sentences of distance 1
  – Two sentences S1 and S2 are within distance n if S1 can be transformed into S2 by adding, removing, or substituting at most n words.
  – What happens if you eliminate sentence Si because of sentence Si-j, but you later find a sentence Sk that has distance 0 or 1 from Si?
    • Need to define how you handle this case.
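The distance defined above is edit distance with words, not characters, as the unit. A minimal Python sketch follows, using the standard two-row dynamic program. The elimination policy in `eliminate_similar` is only one possible answer to the question of how to handle the Si/Sk case and is an assumption, not something the slides prescribe. Both function names are hypothetical.

```python
def word_edit_distance(s1, s2):
    """Fewest word additions, removals, or substitutions turning s1 into s2."""
    a, b = s1.split(), s2.split()
    prev = list(range(len(b) + 1))      # distance from empty prefix of a
    for i, wa in enumerate(a, start=1):
        curr = [i]                      # i removals reach the empty prefix of b
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # remove wa
                            curr[j - 1] + 1,      # add wb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def eliminate_similar(sentences, n=1):
    """Keep a sentence only if it is farther than n from every kept one.

    ASSUMED POLICY: the first sentence seen wins, and later sentences are
    compared only against kept sentences, never against eliminated ones.
    This is quadratic, so it is suitable for small test files only.
    """
    kept = []
    for s in sentences:
        if all(word_edit_distance(s, k) > n for k in kept):
            kept.append(s)
    return kept
```

With this policy, the input `["a b c", "a b d", "a e d"]` keeps `"a e d"` even though it is at distance 1 from the eliminated `"a b d"`, which is exactly the kind of case that needs a stated rule.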


Goal 2

• List all sentences that have duplicates.
• List all sentences that have distance 1 sentences
• List first one followed by all distance 0 or 1 sentences related to it
  – Can do as separate lists or just one
  – Should be sorted
• Redo for distance n
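For the distance 0 part of this goal, exact duplicates can be grouped in a single pass with a dictionary. The sketch below is one way to produce the sorted listing. The name `duplicate_groups` and the output shape (sentence plus 1-based line numbers) are assumptions made for illustration.

```python
from collections import defaultdict


def duplicate_groups(lines):
    """Return sorted (sentence, line_numbers) pairs for repeated sentences."""
    positions = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Re-joining the split words removes any extra trailing blank,
        # so such lines still count as duplicates.
        positions[" ".join(line.split())].append(line_no)
    return sorted((s, nums) for s, nums in positions.items() if len(nums) > 1)
```

Distance 1 and distance n listings need a distance test in addition to this grouping. The dictionary pass only removes the exact-match work up front.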


Preprocessing

• Read all of the file and build a dictionary with each word given a natural number as an index:
  – Given sentence one here as the first one
    • 1 2 3 4 5 6 7 3
  – Next sentence after sentence one
    • 8 2 9 2 3
  – And so on
    • 10 11 12
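The dictionary step can be sketched in a few lines of Python. Indices are handed out in order of first appearance, starting at 1. The function name `encode_sentences` is hypothetical.

```python
def encode_sentences(lines):
    """Replace every word by its natural-number index.

    Returns the word-to-index dictionary and the encoded sentences.
    """
    index = {}
    encoded = []
    for line in lines:
        codes = []
        for word in line.split():
            # setdefault stores the next unused index the first time a
            # word is seen and returns the stored index afterwards.
            codes.append(index.setdefault(word, len(index) + 1))
        encoded.append(codes)
    return index, encoded


if __name__ == "__main__":
    index, encoded = encode_sentences([
        "given sentence one here as the first one",
        "next sentence after sentence one",
        "and so on",
    ])
    for codes in encoded:
        print(" ".join(map(str, codes)))
    # The second sentence encodes to 8 2 9 2 3 because "sentence" and
    # "one" reuse the indices assigned in the first sentence.
```

Comparing small integers is cheaper than comparing strings, and the encoded sentences take less memory, which matters once the file has millions of lines.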


Implementation Suggestions

• Use hash tables of considerable size
  – Hash table size should be a prime number
• Build and debug your code with small files
  – Start with < 10 sentences
  – Next try 100, 1000, and 10,000 sentences
  – Then try 17,788,002 sentences
• Consider using Hadoop (requires knowledge of Java, however) or MR-MPI (C/C++)
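The prime-size advice matters mainly for hand-built tables, since Python's built-in `dict` manages its own table. As a sketch of the idea, the code below picks the smallest prime at or above a target size and maps an encoded sentence to a bucket. The polynomial hash with multiplier 31 and all function names are illustrative assumptions, not the method from the slides.

```python
def is_prime(k):
    """Trial division, adequate for a one-off table-size computation."""
    if k < 2:
        return False
    if k % 2 == 0:
        return k == 2
    d = 3
    while d * d <= k:
        if k % d == 0:
            return False
        d += 2
    return True


def next_prime(k):
    """Smallest prime greater than or equal to k."""
    while not is_prime(k):
        k += 1
    return k


def bucket(codes, table_size):
    """Bucket of a sentence given as a list of word indices.

    ASSUMPTION: a simple polynomial hash, used here only as an example.
    """
    h = 0
    for c in codes:
        h = (h * 31 + c) % table_size
    return h
```

For example, `next_prime(2 * 17_788_002)` gives a table with roughly two slots per sentence of the full file. The factor of two is an arbitrary choice for the example.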


Tricky Part

• Build a code to do Goal 1 or 2. Notes:
  – Shingling and minhash do not work well for edit distance
  – Two approaches:
    • Try Jaccard similarity or distance methodology on sentences considered as sets of words
    • Modify index-based and length-based methods
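For the first approach, Jaccard similarity of two sentences treated as sets of words is the size of the intersection divided by the size of the union. A minimal sketch follows. Returning similarity 1.0 for two empty sentences is a convention assumed here.

```python
def jaccard_similarity(s1, s2):
    """|A intersect B| / |A union B| for the word sets of two sentences."""
    a, b = set(s1.split()), set(s2.split())
    if not a and not b:
        return 1.0  # ASSUMED convention: two empty sentences are identical
    return len(a & b) / len(a | b)


def jaccard_distance(s1, s2):
    """Jaccard distance is one minus the similarity."""
    return 1.0 - jaccard_similarity(s1, s2)
```

Sets ignore word order and repeated words, so `jaccard_similarity("a b", "b a")` is 1.0 even though the two sentences are two substitutions apart. A set-based score is therefore best used to shortlist candidate pairs, which a word-level edit distance check then confirms.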


Generalizing

• Substitute n for 1
  – Not much extra work to do so
  – Instead of looking at sentences of word length difference 1, look at ones of difference up to n
  – Makes a much more useful program
• Take arbitrary sentences
  – Convert to one per line, each word separated by one blank
  – Take lower and upper case into account and convert to all lower case as preprocessing
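The word-length observation gives a cheap filter. Adding or removing a word changes the word count by one and substituting a word leaves it unchanged, so two sentences within distance n differ in length by at most n. The sketch below groups sentences by word count and yields only the pairs that pass this filter. The name `candidate_pairs` is hypothetical, and each yielded pair still needs a real distance check.

```python
from collections import defaultdict


def candidate_pairs(sentences, n):
    """Yield index pairs (i, j), i < j, whose word counts differ by <= n."""
    by_length = defaultdict(list)
    for i, s in enumerate(sentences):
        by_length[len(s.split())].append(i)
    for length, group in by_length.items():
        # Look only at equal or longer sentences so each pair appears once.
        for other in range(length, length + n + 1):
            # .get() avoids inserting new keys while iterating the dict.
            for j in by_length.get(other, ()):
                for i in group:
                    if other > length or i < j:
                        yield (min(i, j), max(i, j))
```

Setting `n=0` restricts the comparison to sentences of identical length, and raising n widens the window by one length class on each side.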


Some Interesting Problems

• An Open Source, secure Hadoop replacement suitable for hospitals and medical records.
  – Must be HIPAA compliant.
  – Must scale well for very large databases.
  – Must have individual access capabilities.
  – Must not have complexity O(disk access) on a DFS.
    • Should use OpenMP and MPI.
    • Should use cache-aware hashing methods.
  – Will be useful well beyond medical records.


Some Interesting Problems

• Dynamic Data-Driven Application Systems (DDDAS) and Big Data
  – A natural fit, yet there is no agreed upon software for DDDAS or DDDAS-BD or DBDDAS. DDDAS has been applied to many, many fields.
  – DDDAS researchers agree something should be produced, but such software is not considered an application and is too applied to be considered networking research.
  – Need to find a niche or a program officer in a funding agency willing to think outside of the box.
  – Many Big Data issues have long been common to DDDAS.


Some Interesting Problems

• Sensors and telemetry
  – SensorML was supposed to provide a standard way of describing sensor data and be able to get the data and deliver it to applications. It went commercial ($$$...$$$) after the original PI retired.
  – A true Open Source, internationally recognized standard would benefit one area of Big Data and DDDAS.


Some Interesting Problems

• Reservoirs (oil, gas, water)
  – Dynamic reservoir meshing
    • Vertical wells with micro sensors provide updates to fracked reservoirs.
    • Speed up the meshing enough to include it within reservoir simulator time (e.g., go from a year to a day).
    • Dynamically improve predictions.
  – Corporate oil/gas fields or pipelines (even small ones) produce excessive amounts of data
    • Open Source data mining tools for specific problems


Some Interesting Problems

• Audio and photographic data mining
  – World’s largest databases based on VoIP and phone monitoring by many governments (e.g., P.R. China, France, Germany, Kingdom of Saudi Arabia, United Kingdom, USA, …).
  – Keeps disk drive makers in business and lowers hard disk prices very significantly.
    • Another problem: Find all file duplicates in a file system efficiently. Similar to the sentence problem earlier.
  – Has commercial (e.g., Bing, satellite transmission) and research ramifications that are not nefarious.
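The file-duplicate problem mentioned above can be sketched with the same two-stage idea as the sentence problem: a cheap filter first, then an exact test. The version below groups files by size and hashes only the files that share a size. Using SHA-256 as the equality test and the function names shown are assumptions for this example. A byte-by-byte comparison could replace the hash where collisions must be ruled out entirely.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_files(root):
    """Return sorted groups of paths under root with identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) > 1:  # a file with a unique size has no duplicate
            for path in paths:
                by_digest[(size, file_digest(path))].append(path)
    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)
```

The size filter plays the role that sentence length plays in the sentence problem: most files are ruled out without reading their contents.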