
Background Material

Craig C. Douglas, University of Wyoming

craigcdouglas@yahoo.com

Schedule

• Undergraduates
  o June 30, 2:00-6:00, Background material
  o July 2, 2:00-6:00, Data finding
  o July 4, 2:00-6:00, Data finding and machine learning
  o July 6, 2:00-6:00, Machine learning

• Graduates
  o July 1, 2:30-5:30, Introduction and data finding
  o July 3, 8:30-11:30, Data finding and machine learning
  o July 3, 2:30-5:30, Machine learning


Introduction


Useful References

• http://www.mgnet.org/~douglas/Classes/bigdata/2019su-index.html

• Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets, 2nd ed. (version 2.1), Stanford University, 2014. The most up-to-date version is online at http://www.mmds.org. I will lecture from the 3rd edition draft as well.

• Andriy Burkov, The Hundred-Page Machine Learning Book, http://themlbook.com/wiki/doku.php, 2019.


Useful References

• Wooyoung Kim, Parallel Clustering Algorithms: Survey, http://grid.cs.gsu.edu/~wkim/index_files/SurveyParallelClustering.html, 2009.

• Deep Learning exercises using TensorFlow, https://www.coursera.org/learn/intro-to-deep-learning/home/welcome
  o https://github.com/hse-aml/intro-to-dl


Useful Software

• TensorFlow
  o Version 1.13 is stable. Version 2.0.0-beta is not.
  o Anaconda or Miniconda environments
  o Additional Python packages: jupyter, matplotlib, pandas

• Tableau
• MapReduce, Spark, and workflow systems
• Many problems run 1000X faster on a GPU


Some Sources of Big Data

• Interactions with dynamic databases
• Internet data
• City or regional transportation flow control
• Environment and disaster management
• Oil/gas fields or pipelines, seismic imaging
• Government or industry regulation/statistics
• Closed circuit camera identification


Oil/Gas Pipelines

Picture courtesy of the Merriam-Webster Dictionary

Pipeline Network Properties

• Pipe diameters range from 2 inches to 5 feet.
• Rarely straight and level.
• Contain
  – Possibly different grades of oil or gas simultaneously.
  – Pigs as separators.
  – Sensors (inside and outside).
• Not restricted to oil/gas pipelines (water, etc.).

1970s Modeling

• Problem modeled mathematically based on time-dependent, nonlinear coupled partial differential equations (two models).
  – Sensors on all pipeline components (recall the cartoon).
  – Distributed GRID computing with scattered phone booths:
    • 2 minicomputers, 4 array processors, a heat pump on top, and a U.S. nickel soldered in place to allow “free” calls for telemetry.
    • Sensors provided data (temperature, pressure, and velocity) dynamically based on need and anomalies, controlled by the environment and the running model.
• No central computing, just central and distributed control sites.
• 2,000 pieces of telemetry/minute in the complete KSA network (1978).


Current Modeling

• 3D math models of pipelines with topography.
• Central computing and fiber optic TCP/IP with Gigabit Ethernet backup near pipelines.
• Many more sensors, plus ones to measure pipe (shape) changes, internal pollutants, and external gas leakages.
• When the 1978 system was replaced in KSA in 1998, 100,000 times the telemetry/minute. In 2014, a tsunami of uncountable data.


Monitoring Site Evolution

• In the 1970s, a primitive center ran “what if” scenarios, in parallel with regular monitoring, to keep pipelines from breaking.

• Now, large scale visualization is used to monitor pipelines in a multiscale framework. Individual high resolution monitors (1080p and 4K+) used for “what if” scenarios.

• Always trying to find anomalies in the data streams to avoid pipeline problems.


Computer Science Techniques


Hash Tables

• A hash table is a data structure with N buckets.
  – N is usually a prime number and may be quite large.
  – Each bucket contains data.
  – Accessed using a hash function Key = h(x).
    • h(x) must be inexpensive to evaluate.
    • Key is an index 0, 1, …, N-1 into the hash table.
    • Data x can be found only in bucket h(x).


Storing a Hash Table

• If the data is very simple (numbers or short strings), then a spreadsheet may be optimal.
• If the data is arbitrary, then dynamically allocated memory techniques are common.
  – Common to use linked lists inside of each bucket.
  – Can be error prone.
  – Must remember to deallocate all of the hash table when done, which can also be error prone.
  – Must decide if duplicates are allowed in a bucket.
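A minimal Python sketch of this storage scheme, assuming integer data and h(x) = x mod N; Python lists stand in for the per-bucket linked lists, so the allocation and deallocation concerns above are handled by the garbage collector. The names (N, h, table) are illustrative, not from the slides.

```python
# Sketch of hash-table storage: N buckets, each bucket a Python list.
N = 7  # number of buckets, usually a prime

def h(x):
    """Hash function: Key = h(x) is an index 0..N-1."""
    return x % N

# One (initially empty) list per bucket.
table = [[] for _ in range(N)]

# Data x can be found only in bucket h(x).
for x in (102, 30405, 203):
    table[h(x)].append(x)
```

With these three values, 102 and 30405 both land in bucket 4 and 203 lands in bucket 0.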


Common Data Structure

[Diagram: an array of buckets indexed 0, 1, 2, …, N-2, N-1; each bucket points to a null-terminated list holding that bucket’s data.]

Variations:
• doubly linked lists
• nested tables
• spreadsheet

Hash Table Functionalities

• Search
• Add
  – Uses Search
• Delete
  – Uses Search
• Modify (optional)
  – Uses Search
• Change order of data in a bucket (optional)
  – Uses Search and possibly Delete and Add


Functionality

• Search(x)
  – Compute Key = h(x)
  – For each data item stored in bucket Key, compare x to the data.
    • If a match, then return something that allows the data to be accessed.
    • If there is no match, return a Failure notice.


Functionality

• Add(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • If no duplicates are allowed, return something that allows the data to be accessed (and that it is already in the hash table).
  – Otherwise,
    • Probably make a copy of x and add it to bucket h(x).
      – Usually added as the first or last element in bucket h(x).
      – Usually have to modify the linked list for bucket h(x).


Functionality

• Delete(x)
  – F = Search(x)
  – If F ≠ Failure, then
    • Remove the data from bucket h(x). This usually means deleting the copy of x and relinking inside the linked list. There may be other bookkeeping, too.
    • Return Success.
  – Otherwise,
    • Return Failure.
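The three operations above can be sketched in Python. This is a minimal illustration of the scheme described on these slides (bucket lists standing in for linked lists, integer keys, h(x) = x mod N), not a definitive implementation; the class and method names are mine.

```python
class HashTable:
    """Chained hash table: N buckets, each bucket a list."""

    def __init__(self, n=7, allow_duplicates=False):
        self.n = n
        self.allow_duplicates = allow_duplicates
        self.buckets = [[] for _ in range(n)]

    def h(self, x):
        # Key = h(x) must be inexpensive to evaluate.
        return x % self.n

    def search(self, x):
        """Return (bucket, position) if found, else None (Failure)."""
        key = self.h(x)
        for i, item in enumerate(self.buckets[key]):
            if item == x:
                return (key, i)
        return None  # Failure

    def add(self, x):
        """Add x to bucket h(x), honoring the duplicates policy."""
        found = self.search(x)
        if found is not None and not self.allow_duplicates:
            return found  # already in the hash table
        self.buckets[self.h(x)].append(x)  # added as the last element
        return (self.h(x), len(self.buckets[self.h(x)]) - 1)

    def delete(self, x):
        """Remove x from bucket h(x); True on Success, False on Failure."""
        found = self.search(x)
        if found is None:
            return False
        key, i = found
        del self.buckets[key][i]  # "relinking" is implicit with a list
        return True
```

Note that Add and Delete both begin with Search, exactly as in the slides.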


Simple Examples

• Dataset D consists of strings of exactly length 3 over the letters a, b, c, …, x, y, z.

• We encode each letter by 00, 01, 02, ..., 23, 24, 25. So, abz is 000125 = 125.

• Consider two hash functions:
  – h1(x) = x mod 7
  – h2(x) = leading encoded letter in x

• We get two very different hash tables.


Example Dataset D

• D = { abc, def, acd, zaa, bbb, bzq, zxw, faq, cap, eld, ssa, bab }, or encoded

• D = { 102, 30405, 203, 250000, 10101, 12516, 252322, 50016, 20015, 41103, 181800, 10001 }
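The encoding and both hash functions can be reproduced in a few lines of Python, which is handy for checking the tables that follow; the helper names encode, h1, and h2 are mine.

```python
def encode(s):
    """Encode a 3-letter string: a -> 00, b -> 01, ..., z -> 25, concatenated."""
    return int("".join("%02d" % (ord(c) - ord("a")) for c in s))

def h1(x):
    return x % 7  # 7 buckets

def h2(x):
    # Leading encoded letter: the first two of the six digits
    # (encoded values have at most 6 digits, so this is x // 10**4).
    return x // 10000

D = ["abc", "def", "acd", "zaa", "bbb", "bzq",
     "zxw", "faq", "cap", "eld", "ssa", "bab"]

for s in D:
    x = encode(s)
    print(s, x, h1(x), h2(x))
```

For example, zaa encodes to 250000, faq to 50016, and bab to 10001.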


h1(x) for D

• The number of buckets is 7 (a prime).
• This is not necessarily a well-balanced hash table since too many members of D go into bucket 0.
• We can store the hash table using linked lists.

x h1(x) x h1(x) x h1(x)

102 4 30405 4 203 0

250000 2 10101 0 12516 0

252322 2 50016 0 20015 1

41103 6 181800 3 10001 5

Hash Table for h1(x)

24

0123456

Buckets Data for each bucket

203 10101 12516 10101 0

20015 0

250000 252322 0

181800 0

0102 30405

10001 0

41103 0

h2(x) for D

• The number of buckets is 26 (not a prime).
• This is a very different distribution of data than for h1(x) and more balanced for our particular D.
• We can store it as a table or spreadsheet.

x       h2(x)     x        h2(x)     x       h2(x)
102     0         30405    3         203     0
250000  25        10101    1         12516   1
252322  25        50016    5         20015   2
41103   4         181800   18        10001   1

Hash Table for h2(x)

key     values
0       102, 203
1       10101, 12516, 10001
2       20015
3       30405
4       41103
5       50016
6-17    (empty)
18      181800
19-24   (empty)
25      250000, 252322

Fracking Data Example

• Open database maintained by the Pennsylvania State government based on the fractured oil and gas wells in the Marcellus Basin.

• About 8,000 wells have been drilled, and information about each is maintained in this database.

• Each state in the United States has at least one public database about fracking wells.

• 15.3 million Americans live within 1 mile (1.6 km) of a well drilled since 2000.

• Spreadsheets in the comma-separated values (.csv) format or PDF files are common.


Fracking Data File Information

• Each file contains information for a period of time during 2000-2014
  o Locations of wells
  o Owner of property
  o Approximate latitude and longitude of each well
  o Drilling company
  o Production information
    § Potential production
    § Actual production (units: barrels for oil, 1000 cubic feet for gas)
    § Active/Inactive
  o Much more information, with some cells blank


Interesting Questions

• What are the production curves?
  o Are they uniform in regions or do they vary a lot?
• How long is there a good payout? (0, 12, 39-40, …, 120 months?)
• Are there some drillers whose wells are more likely to not be in production after some period of time?
• Where are clusters of wells?
• How do you visualize the data?
• How do you put the data into the right format in order to ask the right questions and get answers quickly?


Data Files

• Approximately 574 MB of files.
• First things to do:
  o Determine how to use the data (Excel, MongoDB, Hadoop, Matlab, R, etc.).
  o Use the data to answer some simple, but interesting questions.
  o Visualize the results (Excel, Matlab, R, Tableau, etc.).
• Thereafter,
  o Determine how to answer general, complex questions.
  o Use a general database approach that uses all of your computer’s cores and GPUs.
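The slides do not show the actual column names in the Pennsylvania files, so the sketch below invents a miniature CSV (well_id, operator, gas_mcf, and status are hypothetical columns) just to illustrate the first step: loading .csv data and answering a simple question with the Python standard library.

```python
import csv
from collections import defaultdict

# Hypothetical miniature of a fracking-well CSV; the real state
# databases have different (and many more) columns.
rows = [
    {"well_id": "W1", "operator": "AcmeDrill", "gas_mcf": "1200", "status": "Active"},
    {"well_id": "W2", "operator": "AcmeDrill", "gas_mcf": "0",    "status": "Inactive"},
    {"well_id": "W3", "operator": "BetaGas",   "gas_mcf": "800",  "status": "Active"},
]

with open("wells_sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# A simple but interesting question: total reported gas production
# (in thousands of cubic feet) per operator.
production = defaultdict(int)
with open("wells_sample.csv", newline="") as f:
    for row in csv.DictReader(f):
        production[row["operator"]] += int(row["gas_mcf"])

print(dict(production))
```

The same grouping step scales to the full 574 MB with pandas, a database, or Hadoop, as listed above.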

