Page 1 HPCC Systems - http://hpccsystems.com Risk Solutions Page 1
HPCC Systems in Education
See, Save, Skip: Sentiment Analysis using HPCC One Click Thor on AWS
Edin Muharemagic, Ph.D.
Architect and Data Scientist HPCC Systems
Overview
HPCC Systems Update
The Best Open Source Data Intensive Super Computing Platform
Machine Learning Library has been released!
HPCC in Education
One-Click Thor on AWS
Sentiment Analysis using HPCC
LexisNexis Risk Solutions and HPCC Systems
LexisNexis:
30-year history with rich tradition in legal and academic markets
LexisNexis Risk Solutions:
New division with 10-year history
Products and services assess risk, verify identity, detect fraud, and help customers answer questions like “who are you?”, “how much risk is associated with you?”, “what type of network do you have?”
Customers: banks, insurance carriers, health care organizations, law enforcement, Federal Government
Built “Big Data” solutions for 10 years:
Data Refinery (Thor)
Data Delivery (Roxie)
ECL – High Level Parallel Programming Language
HPCC Systems: Open Source Data Intensive Super Computing Platform
One Platform End-to-End: Simple
Consistent and elegant HW&SW architecture across the complete platform
Data-Driven World
Science: Databases from astronomy, genomics, natural languages, seismic modeling, …
Humanities: Scanned books, historic documents, …
Commerce: Corporate sales, stock market transactions, census, airline traffic, …
Entertainment: Internet images, Hollywood movies, MP3 files, …
Medicine: MRI & CT scans, patient records, …
Science Paradigms
eScience: Jim Gray http://research.microsoft.com/~Gray
A thousand years ago: science was empirical, describing natural phenomena
Last few hundred years: theoretical branch using models, generalizations
Last few decades: a computational branch simulating complex phenomena
Today: data exploration (eScience) unifies theory, experiment, and simulation
Data captured by instruments or generated by simulator
Processed by software
Information/knowledge stored in computer
Scientist analyzes database/files using data management and statistics
$$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$$
Data-Intensive Applications
Rely on large, ever-changing data sets
Collecting and maintaining data represents a major effort
Have complex computational requirements
From simple queries to large-scale analyses
Require parallel processing
Program at an abstract level
HPCC, a DISC (Data Intensive Super Computing) platform, is a perfect fit for the data-intensive application domain
Parallel Processing Classification
Compute Intensive (HPC)
Compute-bound applications
Performance measured in xFLOPS (x=tera, peta…)
Involves parallelizing algorithms (i.e. decompose application into separate tasks)
Functional (Control) Parallelism
Data Intensive (HPCC)
I/O bound applications
Performance measured in xORPS (x=B as in billion)
Involves subdividing data into segments, using the same application to process segments in parallel, and reassembling results at the end of processing
Data Parallelism
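The data-parallel pattern described above (subdivide the data, run the same application on every segment, reassemble the results) can be sketched in a few lines. This is only an illustration in Python, not HPCC or ECL code: the function names are hypothetical, word counting stands in for "the application", and local threads stand in for the cluster nodes a real platform would distribute segments across.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(records, n_segments):
    """Subdivide the data into roughly equal contiguous segments."""
    size = max(1, -(-len(records) // n_segments))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(segment):
    """The same application logic, applied to one segment."""
    return sum(len(line.split()) for line in segment)


def parallel_word_count(records, n_segments=4):
    """Process all segments in parallel, then reassemble the results."""
    segments = split_into_segments(records, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partial_counts = list(pool.map(count_words, segments))
    # Reassembly step: combine the per-segment partial results.
    return sum(partial_counts)
```

Calling `parallel_word_count(lines)` on a list of text lines returns the total word count; the caller never specifies how work is scheduled, which is the point of the data-parallel model.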
Programming Models
Compute Intensive (HPC)
Programs described at very low level
Specify detailed control of processing & communications
Rely on small number of software packages
Written by specialists
Limits classes of problems & solution methods
Data Intensive (HPCC)
Application programs written in terms of high-level operations on data
Runtime system controls scheduling, load balancing, …
Figure: HPC stack layers: Hardware, Machine-Dependent Programming Model, Software Packages, Application Programs
Figure: HPCC stack layers: Hardware, Machine-Independent Programming Model, Runtime System, Application Programs
Machine Learning Library
ML Documentation
Slides credited to Andrew Ng
Supervised Learning
“Right answers” given
Regression: Predict continuous valued output (price)
Figure: Housing price prediction. Horizontal axis: Size in feet² (0 to 2500). Vertical axis: Price ($) in 1000’s (0 to 400).
Supervised Learning
Figure: plot with axes x1 and x2
Unsupervised Learning
Figure: plot with axes x1 and x2
Organize computing clusters
Social network analysis
Market segmentation
Astronomical data analysis
Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)
Open Data Intensive Computing course at FAU
Expose students and faculty to newest technology
Help faculty & PhD researchers concentrate on addressing real problems (e.g. ML experiments do not need to take 6 months to produce results)
Get smart people working together
University is an open forum for free exchange of ideas
Build HPCC following and community
Harness that Open Source community power to keep improving HPCC and stay relevant
Open Data Intensive Computing course at FAU
How to make it interesting, interactive, entertaining?
A number of Universities already have similar courses, based on Hadoop
Browsing those offerings turned up an interesting approach:
http://www.youtube.com/watch?v=kO8x8eoU3L4
Open Data Intensive Computing course at FAU
Q: What is the best DISC? A: HPCC!
Decided against it
Instead, created a hands-on, interactive course, covering:
Thor Architecture (Cluster components and their purpose)
Thor Configuration (Let’s build the Cluster)
ECL Programming (Let’s get the cluster busy)
Roxie Architecture (Let’s deliver)
ML with HPCC
Had 15 students (4 undergraduate and 11 graduate)
Building an HPCC
FAU Cloud: VMware vSphere Hypervisor
College of Engineering IT: Serge and Mahesh allocated 32 nodes
Students used those nodes to build HPCC clusters
Configuration process is well documented:
http://hpccsystems.com/community/docs/installing-running-hpcc-platform
Initial Setup – Single Node
Configuring Multi node System
Starting and Stopping
One-Click Thor on AWS
https://aws.hpccsystems.com
Mining the Web for Feelings
Computers are good at crunching numbers! Can they do feelings?
Emerging field: Sentiment Analysis! Translate human emotions into hard data
Cultural factors and language nuances make it difficult to deduce pro or con sentiment (e.g. sinful & chocolate cake)
Becoming a standard feature of search engines – fine-tune results based on sentiment (e.g. best hotel in Boca)
Business: “online opinion represents virtual currency that makes or breaks a product in the marketplace”
Casual Web surfer: Tweetfeel, Twendz and Twitrratr
TV watcher: “See, Save, Skip – Aspect-Based Sentiment Analysis using HPCC”
SEE SAVE SKIP: ASPECT-BASED SENTIMENT ANALYSIS
Charlene Gilbert
Florida Atlantic University
Intro. to Sentiment Analysis
a.k.a. Sentiment Classification or Opinion Mining
Given text, determine polarity:
Positive
Negative
Neutral
See Save Skip: Television
“So many channels and nothing to watch!”
Not only TV but DVR, Netflix, Hulu, etc.
Decide which shows to See (live), Save (for later), or Skip
140 characters
Hashtags, @mentions, Search
200 million tweets/day
Embraced by television shows
Twitter Tickers
GetGlue
Application Programming Interface (API)
Search API: Keyword, Location, Date, Language
Streaming API: Real Time, 400 Keywords
Keyword Based Sentiment
Lists of Affective Words
Count words in tweet
Classify sentiment with most words
Other ways…
Naïve Bayes
Table 2: Sample Emoticon/Abbreviation List
Positive Negative
>:] :’(
:-) :(
:) T_T
:o) :c
8) :<
:D <.<
XD WTF
FTW FML
LOL FTL
Stop Words
Common words to filter out
Pre-Existing List
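As a rough illustration of stop-word filtering, the sketch below drops common words from a token list before any scoring happens. It is written in Python rather than ECL, and the stop-word set shown is a small hypothetical sample; the deck only says that a pre-existing list was used and does not name it.

```python
# Hypothetical sample; the presentation used a pre-existing list it does not name.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive match)."""
    return [t for t in tokens if t.lower() not in stop_words]
```

For example, `remove_stop_words("The Sing Off is great".split())` keeps only `Sing`, `Off`, and `great`.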
Sentiment Classification
A tweet is split into tokens
Tokens are joined with 3 affective word lists
Score(Positive Words) := 1
Score(Negative Words) := -1
Score(Neither) := 0
Sum of token scores
Sum > 0 := 1 (Positive)
Sum < 0 := -1 (Negative)
Sum = 0 := 0 (Neutral)
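The scoring rule above can be sketched as follows. This is a Python approximation, not the ECL shown in the demo: the word sets are tiny hypothetical samples (the deck joins tokens against three affective word lists that it does not reproduce, so two sets are used here for brevity), and tokenization is a plain whitespace split.

```python
# Hypothetical sample word lists; the real affective word lists are not given in the deck.
POSITIVE = {"love", "great", "lol", "ftw", ":)", ":d"}
NEGATIVE = {"hate", "boring", "fml", "ftl", ":("}


def token_score(token):
    """Score one token: 1 if positive, -1 if negative, 0 if neither."""
    t = token.lower()
    if t in POSITIVE:
        return 1
    if t in NEGATIVE:
        return -1
    return 0


def classify_tweet(text):
    """Sum the token scores and reduce the sum to 1, -1, or 0."""
    total = sum(token_score(tok) for tok in text.split())
    if total > 0:
        return 1   # Positive
    if total < 0:
        return -1  # Negative
    return 0       # Neutral
```

A tweet with one positive and one negative word sums to zero and is treated as neutral, which is why neutral tweets are set aside before the See/Save/Skip percentage is computed.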
ECL DEMO
See Save Skip Classification
Get all non-neutral tweets
Get percentage of positive tweets
See: 80% to 100% positive
Save: 60% to 80% positive
Skip: below 60% positive
Thresholds are somewhat arbitrary
Might classify on a bell curve
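A minimal Python sketch of the See/Save/Skip step, assuming each tweet has already been labeled 1, -1, or 0. The function name is hypothetical, and the handling of the exact boundaries is an assumption: the slide does not say which class 80% or 60% falls into, so here 80% counts as See and 60% as Save.

```python
def see_save_skip(labels):
    """Classify a show from per-tweet labels (1 positive, -1 negative, 0 neutral)."""
    non_neutral = [x for x in labels if x != 0]
    if not non_neutral:
        return None  # no opinionated tweets, so no classification
    positive = sum(1 for x in non_neutral if x > 0)
    pct_positive = 100.0 * positive / len(non_neutral)
    if pct_positive >= 80:   # assumption: exactly 80% counts as See
        return "See"
    if pct_positive >= 60:   # assumption: exactly 60% counts as Save
        return "Save"
    return "Skip"
```

With 89 positive and 11 negative tweets the function returns `"See"`, and with 70 positive and 30 negative it returns `"Save"`; neutral tweets do not affect the percentage.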
Some Results
Sentiment: The Sing Off
Positive 89%, Negative 11%
See Save Skip Classification: See!
Sentiment: Two Broke Girls
Positive 70%, Negative 30%
See Save Skip Classification: Save