
LARGE DATA SETS FOR FREE

Sharanya Thandra

• Datasets definition
• Iris flower dataset
• Weka and the Weka tool
• Handling different datasets with Weka
• Techniques for managing large data sets: compression, indexing, summarization
• Large data sets with Python
• Results and conclusions (Python)
• References

TABLE OF CONTENTS

A dataset (or data set) is a collection of data. It corresponds to a single database table or a single statistical matrix, where each column represents a variable and each row corresponds to a given member of the dataset in question. The singular form of data is datum.

What are datasets?

Characteristics that define a data set's structure and properties:
• The number and types of its attributes.
• Statistical measures such as standard deviation and kurtosis.
• The values, which may be numbers such as real numbers or integers, or categories (nominal data).

Datasets may also be generated by algorithms.

Dataset properties

• Iris flower data set
• Categorical data analysis
• Robust statistics
• Time series
• Extreme values
• Bayesian data analysis
• The BUPA liver data
• Anscombe's quartet

Classic datasets

A multivariate data set, introduced as an example of discriminant analysis. It is a typical test case for classification techniques in machine learning, and a good example for explaining the difference between supervised and unsupervised techniques in data mining.

Iris Flower Dataset

Example image: the so-called "metro map" visualization of the Iris data set.

Weka is a popular suite of machine learning software written in Java. It contains a collection of tools and algorithms for data analysis and predictive modeling. The original version was a non-Java TCL/TK front end with preprocessing utilities written in C; the more recent version is fully Java-based. All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes.

Weka (Weh-kah)

The original version was used mostly in agricultural domains. The latest version is used in many different application areas, particularly research and education. It supports data mining tasks such as:
• Preprocessing
• Clustering
• Classification
• Regression
• Visualization

Uses of Weka

Weka's main interface is the Explorer. The Explorer is organized into the following panels:
• Preprocess panel: imports data from a database or from a CSV file, and preprocesses the data using filtering algorithms. Filters are used to transform the data and can delete instances and attributes according to specific criteria.

Panels of Weka

• Classify panel: lets the user apply classification and regression algorithms (classifiers) to the dataset, estimate the accuracy of the resulting predictive model, and visualize erroneous predictions and ROC curves.

• Associate panel: provides access to association rule learners that attempt to identify important interrelationships between attributes in the data.

Panels of Weka

• Cluster panel: gives access to the clustering techniques in Weka, for example the k-means algorithm. It also provides an implementation of the expectation-maximization algorithm for learning a mixture of normal distributions.

• Select attributes panel: algorithms for identifying the most predictive attributes in a dataset.

• Visualize panel: provides a scatter plot matrix; individual plots can be enlarged and analyzed using different selection operators.

Panels of Weka

Weka is written in Java, runs on any platform, and is available to users under the GNU General Public License.

Advantages of using Weka:
• Freely available to all users.
• Highly portable, since it is fully implemented in Java and therefore runs on any platform.
• Provides a comprehensive collection of data preprocessing and modeling techniques.
• Easy to use thanks to its graphical user interface.

Weka Tool

http://arxiv.org/ftp/arxiv/papers/1310/1310.4647.pdf

The following link is where the tool can be downloaded: http://www.cs.waikato.ac.nz/~ml/weka/

Some links for information on Weka

Data characteristics: if the data contains many zeros, we can make use of a sparse data format, which saves a lot of memory. Every algorithm in Weka can take advantage of this memory saving to speed up computation.

An ARFF (Attribute-Relation File Format) file, described here in its developer version, is an ASCII text file that describes a list of instances sharing a set of attributes.

It has two distinct sections: header information followed by data information.

Handling large datasets using Weka

The ARFF header section contains the relation declaration and the attribute declarations.

The @relation declaration format: @relation <relation-name>

The @attribute declaration format: @attribute <attribute-name> <datatype>

Datatypes include numeric and nominal attributes. Nominal attributes are declared as: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

Datasets with Weka

String attributes: @ATTRIBUTE LCC string
Date attributes: @attribute <name> date [<date-format>]
Relational attributes: @attribute <name> relational <further attribute definitions> @end <name>
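Putting these declarations together, here is a small ARFF file modeled on Weka's classic weather example (a minimal sketch added for illustration, not part of the original slides):

% weather.arff - a tiny example ARFF file
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes

The header section ends at @data; every line after it is one instance, with its attribute values listed in the declared order.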

Handling large data sets with Weka

As we deal with huge amounts of data, the memory required for every datum becomes a very important factor.

Two features, compression and indexing, provide ways of decreasing the amount of room needed for datasets. Summarization is a third technique.

Techniques for managing large datasets

It is highly important to understand the data before applying any of these techniques. Ask yourself questions such as: How dense is the data? On which variables are users or applications likely to process or classify the data? How evenly distributed are the values of those variables? How is the data set being used? How often is it refreshed? Beyond knowing your data, always benchmark your results before putting any of these techniques to use.

Techniques for handling large data sets

Data set compression is a technique for "squeezing out" excess blanks and abbreviating repeated strings of values, to decrease the data set's size and therefore lower the storage space it requires.

Uncompressed, the data might look like this: ADAMS MIKE BARNET MARYBETH. The same example after compression: @ADAMS#@MIKE#@BARNET#@MARYBETH#

Techniques for handling large data sets

The way to compress a data set is to use the COMPRESS=YES data set option whenever a new data set is created, for example:

data sasuser.new(compress=yes);
  set sasuser.master;
  ….other statements….
run;

The data set's descriptor portion records that the data set is compressed.

Compression

The data set's descriptor portion records the compression of the data set:

Observations:          466560
Variables:             12
Indexes:               0
Observation Length:    149
Deleted Observations:  0
Compressed:            Yes
Reuse Space:           No
Sorted:                No

Compression

A data set index is a data structure that specifies the location of observations based on the values of one or more key variables.

Consider the following example:

December   1(1,2,3,…)    2(…)
February   1(22,23,…)    4(…)

In the above example, the numbers 1, 2 and 1, 4 indicate the data set pages; the numbers in parentheses are the relative observation numbers within each page matching each distinct key value.

Data set Indexing

The data set's descriptor portion records the indexes associated with the data set:

Observations:          466560
Variables:             12
Indexes:               1
Observation Length:    149
Deleted Observations:  0
Compressed:            NO
Reuse Space:           NO
Sorted:                NO

Data set Indexing

Indexing guidelines, index only when:
• There is a fairly large number of distinct values for the variables to be indexed.
• The data is not randomly scattered throughout the dataset.
• The data set is frequently queried and subset on the indexed values.
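The same idea carries over to the SQLite databases used later in these slides. Here is a minimal sketch (not from the original deck) of creating and using an index with the sqlite3 Python module, assuming the my_db table built in the later examples:

import sqlite3

# connect to the existing database from the later example
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# Create an index on a column that has many distinct values
# and is frequently used to subset the data.
c.execute("CREATE INDEX IF NOT EXISTS idx_my_db_id ON my_db (id)")

# Queries that subset on the indexed column can now use the index.
c.execute("SELECT * FROM my_db WHERE id = ?", ('ID_2352533',))
print c.fetchone()

conn.commit()
conn.close()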

Data set Indexing

Questions to ask before summarization:
• To what level of detail must the data be accessible?
• How often is the data refreshed?
• Is it possible to anticipate how best to summarize the data?

The most important element of efficiency is thoughtful and careful consideration of how the above techniques can best be used, together with thorough testing of the techniques before launching them into production.
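As an illustration (not part of the original slides), summarization often just means rolling detailed rows up to a coarser level with SQL aggregation. A sketch using the sqlite3 module and the stocks table created in the later example:

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Summarize the detailed transactions to one row per symbol:
# total quantity and average price instead of every single trade.
for row in c.execute("""SELECT symbol, SUM(qty), AVG(price)
                        FROM stocks
                        GROUP BY symbol"""):
    print row

conn.close()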

Summarization

Million Song Dataset: the Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary music tracks.

Its purpose is to:
• Encourage research on algorithms that scale to commercial sizes.
• Provide a reference dataset for evaluating research.
• Act as a shortcut to creating a large dataset with APIs.
• Help new researchers get started in the MIR field.

Example of large data sets for free

Large data sets using Python

Consider an example: the task is to screen a huge set of large data files in text format, with billions of entries each.

We want a unified database structure available that combines all the columns, representing different features, that are listed in the separate text files.

The goal is a database that is extendable, and the workflow requires that entries with combined features can be pulled efficiently for further computation.

This can be achieved with the sqlite3 Python module, which works with SQLite database structures.

With Python

SQLite is a C library that provides a lightweight disk-based database and does not require a separate server process.

It allows accessing the database using a nonstandard variant of the SQL query language.

The sqlite3 module provides an SQL interface compliant with the DB-API 2.0 specification.

The sqlite3 Module

To use the module, first create a Connection object that represents the database; here the data will be stored in the example.db file:

import sqlite3
conn = sqlite3.connect('example.db')

Once we have a connection, we can create a Cursor object and call its execute() method to perform SQL commands.

Getting started…

c = conn.cursor()

# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Save (commit) the changes
conn.commit()

# We can also close the connection if we are done with it.
# Just be sure any changes have been committed or they will be lost.
conn.close()

With Python

The data we saved is persistent and can be used later:

import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()

Usually our SQL operations will need to use values from Python variables. We should not assemble the query using Python's string operations; doing so is insecure, because it makes the program vulnerable to an SQL injection attack. http://xkcd.com/327/

With Python

# Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)

# Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)
print c.fetchone()

# Larger example that inserts many records at a time
purchases = [('2006-03-28', 'BUY', 'IBM', 1000, 45.00),
             ('2006-04-05', 'BUY', 'MSFT', 1000, 72.00),
             ('2006-04-06', 'SELL', 'IBM', 500, 53.00),
            ]
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', purchases)
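To pull rows back out, the cursor can be treated as an iterator, following standard sqlite3 usage (this step is not transcribed in the original slides):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Iterate over the result set directly; the cursor acts as an iterator.
for row in c.execute('SELECT * FROM stocks ORDER BY price'):
    print row

conn.close()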

With Python

import sqlite3

# create new db and make connection
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# create table
c.execute('''CREATE TABLE my_db
             (id TEXT, my_var1 TEXT, my_var2 INT)''')

# insert one row of data
c.execute("INSERT INTO my_db VALUES ('ID_2352532','YES', 4)")

# insert multiple lines of data
multi_lines = [('ID_2352533','YES', 1),
               ('ID_2352534','NO', 0),
               ('ID_2352535','YES', 3),
               ('ID_2352536','YES', 9),
               ('ID_2352537','YES', 10)]
c.executemany('INSERT INTO my_db VALUES (?,?,?)', multi_lines)

# save (commit) the changes
conn.commit()

# close connection
conn.close()

With Python

import sqlite3

# make connection to existing db
conn = sqlite3.connect('my_db.db')
c = conn.cursor()

# update field
t = ('NO', 'ID_2352533', )
c.execute("UPDATE my_db SET my_var1=? WHERE id=?", t)
print "Total number of rows changed:", conn.total_changes

# delete rows
t = ('NO', )
c.execute("DELETE FROM my_db WHERE my_var1=?", t)
print "Total number of rows deleted: ", conn.total_changes

# add column
c.execute("ALTER TABLE my_db ADD COLUMN 'my_var3' TEXT")

# save changes
conn.commit()

With Python

How fast is SQLite? To make some simple speed comparisons, here is an example that measures the CPU time needed to:

a) Read in the text file line by line with plain Python.

b) Read in the text file to create an SQLite database.

c) Query the whole database.

Benchmarking

import time

start_time = time.clock()

lines = 0
with open("feature1.txt", "rb") as fileobj:
    for line in fileobj:
        lines += 1

elapsed_time = time.clock() - start_time
print "Time elapsed: {} seconds".format(elapsed_time)
print "Read {} lines".format(lines)

read_lines.py
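The companion scripts for steps b) and c) are not transcribed in the slides; a rough sketch of what they might look like, assuming a hypothetical feature1.txt whose lines hold an id and a numeric value separated by whitespace, and a database file feature1.db:

import sqlite3
import time

# b) Read the text file into an SQLite database.
start_time = time.clock()
conn = sqlite3.connect('feature1.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS feature1 (id TEXT, value REAL)')
with open("feature1.txt", "rb") as fileobj:
    rows = (line.split() for line in fileobj)
    c.executemany('INSERT INTO feature1 VALUES (?,?)', rows)
conn.commit()
print "Time to build the database: {} seconds".format(time.clock() - start_time)

# c) Query the whole database.
start_time = time.clock()
count = 0
for row in c.execute('SELECT * FROM feature1'):
    count += 1
print "Time to query {} rows: {} seconds".format(count, time.clock() - start_time)

conn.close()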

Results and conclusions:

http://wiki.pentaho.com/display/DATAMINING/Handling+Large+Data+Sets+with+Weka
http://en.wikipedia.org/wiki/Comma-separated_values
http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
http://labrosa.ee.columbia.edu/millionsong/
http://faculty.washington.edu/kenrice/sisg-adv/sisg-09.pdf
http://sebastianraschka.com/Articles/sqlite3_database.html
http://www.mssqltips.com/sqlservertip/1200/handling-large-sql-server-tables-with-data-partitioning/

Links for the references

Questions??