
Sports Data Sources and Data Extraction

Gavin Zhang

MIS580

University of Arizona

02-06-2008

Outline

• Sports Data Sources
  – Baseball
  – Basketball
  – Football
  – Olympics
  – Greyhound
• Data Extraction
  – Case Study: AZGreyhound System

Baseball Data Source

• http://www.baseball1.com/

Download the database

Data Download

• This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2007.
  – The data are provided in Microsoft Access, CSV, and other formats.
  – The newest version is Version 5.5.
• The database can be downloaded at: http://baseball1.com/content/view/57/82/

Database

• A detailed description of the database is available at: http://baseball1.com/content/view/57/82/
• The database has 21 tables; the main tables include:
  – Master table: player names, DOB, and biographical info
  – Batting table: batting statistics
  – Pitching table: pitching statistics
  – Fielding table: fielding statistics
• A detailed description of each data field in each table is available.

AwardPlayers.csv

playerID   awardID       yearID  lgID
bondto01   Triple Crown  1877    NL
hinespa01  Triple Crown  1878    NL
heckegu01  Triple Crown  1884    AA
radboch01  Triple Crown  1884    NL
keefeti01  Triple Crown  1888    NL
clarkjo01  Triple Crown  1889    NL
duffyhu01  Triple Crown  1894    NL
rusieam01  Triple Crown  1894    NL
lajoina01  Triple Crown  1901    AL
youngcy01  Triple Crown  1901    AL
wadderu01  Triple Crown  1905    AL
mathech01  Triple Crown  1905    NL
mathech01  Triple Crown  1908    NL
cobbty01   Triple Crown  1909    AL
cobbty01   MVP           1911    AL
schulfr01  MVP           1911    NL
speaktr01  MVP           1912    AL
doylela01  MVP           1912    NL
johnswa01  MVP           1913    AL
…          …             …       …
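A minimal sketch of how rows like the AwardPlayers.csv sample might be loaded in Java. It assumes the file is plain comma-separated text with the four columns shown (playerID, awardID, yearID, lgID) and no quoted fields; the class and method names are hypothetical and not part of the database distribution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for AwardPlayers.csv; names are illustrative only.
class AwardPlayersReader {

    /** One row of the file, using the column names from the sample. */
    static final class Award {
        final String playerID;
        final String awardID;
        final int yearID;
        final String lgID;

        Award(String playerID, String awardID, int yearID, String lgID) {
            this.playerID = playerID;
            this.awardID = awardID;
            this.yearID = yearID;
            this.lgID = lgID;
        }
    }

    /** Parses one data line. Assumes plain commas and no quoted fields. */
    static Award parseLine(String line) {
        String[] f = line.split(",", -1);
        if (f.length < 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new Award(f[0].trim(), f[1].trim(),
                Integer.parseInt(f[2].trim()), f[3].trim());
    }

    /** Parses all lines, skipping the header row and blank lines. */
    static List<Award> parse(List<String> lines) {
        List<Award> awards = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("playerID")) {
                continue;
            }
            awards.add(parseLine(line));
        }
        return awards;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "playerID,awardID,yearID,lgID",
                "bondto01,Triple Crown,1877,NL",
                "cobbty01,MVP,1911,AL");
        for (Award a : parse(sample)) {
            System.out.println(a.playerID + " won " + a.awardID + " in " + a.yearID);
        }
    }
}
```

If the real file quotes fields that contain commas, a CSV library would be a safer choice than split().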

Basketball Data Source

• http://databaseBasketball.com/

Download all of the player and team statistics

Data Download

• The website contains the NBA data from 1947 to 2007 and ABA data from 1968 to 1976 on players, teams, leagues, all-star games, awards, and coaches.

• Download at:

http://databasebasketball.com/stats_download.htm

Database

• This download contains nine delimited text files (.txt format), each of which represents a table in the database.
• If you open the files in Excel, you may need to select Data -> Text to Columns, then use the bar ("|") character as the delimiter.

Teams.txt

team|location|name|leag
ANA|Anaheim|Amigos|A
AND|Anderson|Duffey Packers|N
ATL|Atlanta|Hawks|N
BA1|Baltimore|Bullets|N
BAL|Baltimore|Bullets|N
BOS|Boston|Celtics|N
BUF|Buffalo|Braves|N
CAP|Capital|Bullets|N
CAR|Carolina|Cougars|A
CH1|Chicago|Stags|N
CH2|Chicago|Zephyrs|N
CHA|Charlotte|Hornets|N
CHI|Chicago|Bulls|N
… … … …
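The bar-delimited layout shown in Teams.txt can also be read programmatically. The sketch below is one possible way to split a line in Java; the class name is hypothetical. The one detail worth noting is that "|" is a regular-expression metacharacter, so it has to be quoted before it is used as a split pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the bar-delimited .txt files; not part of the download.
class TeamsParser {

    // "|" means alternation in a regex, so quote it to match a literal bar.
    private static final Pattern BAR = Pattern.compile(Pattern.quote("|"));

    /** Splits one line into fields; the -1 limit keeps empty trailing fields. */
    static String[] splitLine(String line) {
        return BAR.split(line, -1);
    }

    public static void main(String[] args) {
        String[] f = splitLine("AND|Anderson|Duffey Packers|N");
        System.out.println(f[0] + " = " + f[1] + " " + f[2] + " (league " + f[3] + ")");
    }
}
```

Calling line.split("|") without quoting would split between every character instead of at the bars.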

Football Data Source

• http://www.pro-football-reference.com/

Data Download

• A copy of the data set (in CSV format) can be downloaded from: http://ai.arizona.edu/hchen/chencourse/SportsData/Pro-football-refernce_CSV.zip
• This version contains the game data from 1995 to 2006. The data set contains 64,327 players and the games they played in.
• Tables include:
  – Master: information about players
  – Seasons: the players' statistics by season
  – Games: the players' statistics by game
• A detailed description of each data field in each table is available.

Database

Master.csv

ID        last name     first name  position  birth year  debut year
AbduKa00  Abdul-Jabbar  Karim       rb        1974        1996
AbduRa00  Abdullah      Rabih       rb        1975        1999
AberWa00  Abercrombie   Walter      rb        1959        1982
AbraDa00  Abramowicz    Danny       wr        1945        1967
AdamBo00  Adams         Bob         te        1946        1969
AdamCh00  Adams         Charlie     wr        1979        2003
AdamCu00  Adams         Curtis      rb        1962        1985
AdamGe00  Adams         George      rb        1962        1985
AdamGr00  Adams         Grant       wr        2000        2005
AdamJo00  Adams         John        rb        1937        1959
AdamMi00  Adams         Michael     wr        1974        1997
AdamMi01  Adamle        Mike        rb        1949        1971
AdamTo00  Adams         Tony        qb        1950        1975
AdamTo01  Adams         Tom         wr        1940        1962
AdamTo02  Adamle        Tony        rb        1924        1950
AdamWi00  Adams         Willie      wr        1956        1979
AddaJo00  Addai         Joseph      rb        1983        2006
AdkiJa00  Adkisson      James       te        1980        2005
AdkiMa00  Adkins        Margene     wr        1947        1970
AdkiSa00  Adkins        Sam         qb        1955        1977
…         …             …           …         …           …

Some Other Football Data Sources

• http://www.databasefootball.com/
  – The website contains National Football League (NFL) data from 1922 to 2005 and American Football League (AFL) data from 1960 to 1969 on players, teams, leagues, awards, and coaches.
  – The data set cannot be downloaded directly. The data need to be extracted from the HTML Web pages using parsing programs.
• http://www.jt-sw.com/football/
  – The website contains NFL player and coach statistics from 1920 to the present and AFL statistics from 1960 to 1969.
  – The data set cannot be downloaded directly. The data need to be extracted from the HTML Web pages using parsing programs.
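As a rough illustration of what such a parsing program might do, the sketch below pulls the cell text out of one HTML table row with regular expressions. The row markup in main() is made up for the example and is not taken from either site; real pages may nest tables or use character entities, in which case a dedicated HTML parser would likely be more robust. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract the text of each <td> cell in a table row.
class HtmlRowParser {

    // Captures the content of each <td ...>...</td>; DOTALL lets a cell span lines.
    private static final Pattern CELL = Pattern.compile(
            "<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any tag left inside a cell, such as <a href="..."> or <b>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the plain text of every cell in the given row, in order. */
    static List<String> cells(String rowHtml) {
        List<String> out = new ArrayList<>();
        Matcher m = CELL.matcher(rowHtml);
        while (m.find()) {
            out.add(TAG.matcher(m.group(1)).replaceAll("").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented row for illustration only.
        String row = "<tr><td><a href=\"p.htm\">Sam Adkins</a></td>"
                + "<td>QB</td><td align=\"right\">1977</td></tr>";
        System.out.println(cells(row));
    }
}
```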

Olympics Data Source

• http://www.databaseolympics.com/

Data Format

• DatabaseOlympics.com lists every Summer and Winter Olympics medal winner.
  – Summer Olympics from 1896 to 2004
  – Winter Olympics from 1924 to 2002
• The site lists every medal winner for every country, with links to each Olympics, sport, and athlete.


Greyhound

• http://66.236.122.233:8080/tracklink/

Data Format

• Data includes daily race programs (videos) and odds charts (.txt file format) for all US Greyhound tracks.

• Some tracks had both Afternoon and Evening programs.

Data Format

Chart.txt

1st    Grade: B    Distance: 550    Condition: Fast

DOG               WT    P  O  1/8  Str  Fin    Time   Odds   Comment
PTL Jane          63.5  6  3  1    1    1 ns   32.00  11.60  Held At Wire Inside
Silver Speck      68.5  1  1  2    2    2 ns   32.01  2.80   Cutff 1st, Stayd Cls
Jain't It Doug    75    7  7  6    6    3 1.5  32.10  7.50   Closed For Show Outs
Flyer Whitesocks  75.5  8  8  7    3    4 1.5  32.11  2.30   In The Hunt
Flying Detroit    69    5  5  4    4    5 2    32.15  9.00   Not Far Behind Mdtrk
VP Twix Twizala   59.5  3  4  3    5    6 4.5  32.31  4.20   Losing Position Ins
Sergio            73    4  6  5    7    7 5    32.34  13.30  Blocked 1st Turn
Heartattack Jack  71.5  2  2  8    8    8 5.5  32.39  7.10   Bumped 1st Turn
… … … …
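A chart line like those in the Chart.txt sample is harder to split than a delimited file, because the dog name can contain spaces. The sketch below shows one way to handle that in Java with a regular expression that anchors on the first numeric field. The field names (weight, post, finish, margin, and so on) are assumptions read off the column headers; the slides do not define them. The class is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for one result line of Chart.txt.
class ChartLineParser {

    /** Parsed fields; meanings are assumed from the column headers. */
    static final class Result {
        String dog;
        double weight;   // WT
        int post;        // P
        int finish;      // first token under Fin
        String margin;   // second token under Fin, e.g. "ns" or "1.5"
        double time;     // Time
        double odds;     // Odds
        String comment;  // Comment
    }

    // Name (lazy, may contain spaces), weight, five position columns,
    // margin, time, odds, then the free-text comment.
    private static final Pattern LINE = Pattern.compile(
            "^(.+?)\\s+(\\d+(?:\\.\\d+)?)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)"
            + "\\s+(\\d+)\\s+(\\S+)\\s+(\\d+\\.\\d+)\\s+(\\d+\\.\\d+)\\s+(.*)$");

    /** Returns the parsed line, or null if the line is not a result row. */
    static Result parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null; // header rows and separators end up here
        }
        Result r = new Result();
        r.dog = m.group(1);
        r.weight = Double.parseDouble(m.group(2));
        r.post = Integer.parseInt(m.group(3));
        r.finish = Integer.parseInt(m.group(7));
        r.margin = m.group(8);
        r.time = Double.parseDouble(m.group(9));
        r.odds = Double.parseDouble(m.group(10));
        r.comment = m.group(11).trim();
        return r;
    }

    public static void main(String[] args) {
        Result r = parse("Jain't It Doug 75 7 7 6 6 3 1.5 32.10 7.50 Closed For Show Outs");
        System.out.println(r.dog + " finished " + r.finish + " in " + r.time);
    }
}
```

A dog name that itself contains a standalone number could confuse this pattern, so real data should be checked for such cases.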


Case Study: AZGreyhound System

By Rob Schumaker

AZGreyhound System Design

[Diagram: AZGreyhound System. Labels shown: Race Data, Odds Data, Greyhound Data; DB; AZGreyhound (Model Building, Training / Testing, Prediction); Betting Engine; Traditional (Win, Place, Show); Straight Bets (Exacta, Trifecta, Superfecta); Box Bets (Quiniela, Trifecta, Superfecta); Metrics (Accuracy, Payout, Efficiency).]

Greyhound Data Extraction

• Greyhound data was gathered from www.trackinfo.com. The Web site links to:
  – GreyMatter: http://66.236.122.233:8080/tracklink/
  – TrackInfo: http://www.trackinfo.com/index2.html
• The race and odds data was parsed into a SQL Server database; the data was then sent to the AZGreyhound system for prediction.

Example code

This method picks up the overall race information and puts it in the database.

    public void RacePrograms() throws Exception {
        ... ...
        // Data parsing URL: the program listing for a track is URL1 + TrackAbbrev + URL2
        String URL1 = "http://www.trackinfo.com/trakdocs/hound/";
        String URL2 = "/Rpages";
        ... ...
        OpenConnection2();
        try {
            ... ...
            TrackAbbrev = rSet.getString("TrackAbbrev");
            String URL = URL1 + TrackAbbrev + URL2;
            Feed = web.Scraper(URL, 1);
            ... ...
            // Parsing out each data field: each listed file is marked by the html.gif icon
            NumItems = web.NumItems(Feed, "~icons/html.gif");
            for (int y = 1; y <= NumItems; y++) {
                // Advance to the next icon, then read the file name from its link
                Feed = Feed.substring(Feed.indexOf("~icons/html.gif"));
                FileName = web.ExtractText(Feed, "<A HREF=\"", "\">");
                // Advance to the link, then read the file date from the next table cell
                Feed = Feed.substring(Feed.indexOf("<A HREF="));
                FileDate = web.ExtractText(Feed, "NOWRAP>", "</TD>");
                FileContents = web.Scraper(URL + "/" + FileName, 1);
                // Replace single quotes with hyphens before the insert
                FileContents = FileContents.replaceAll("'", "-");
                // Insert into DB
                db.Insert2DBProgram(FileName, FileDate, FileContents);
            }
            } // closes a block opened in the code not shown above
            CloseConnection2();
        }
        catch (SQLException e) {
            System.out.println(e);
        }
    }
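The slide code relies on a helper, web.ExtractText, whose implementation is not shown. The sketch below is a hypothetical stand-in that returns the text between two markers, which is how the helper appears to be used. It is not the AZGreyhound code, and the sample listing row in main() is invented for illustration.

```java
// Hypothetical stand-in for the marker-based extraction used in the slide code.
class TextBetween {

    /**
     * Returns the text between the first occurrence of start and the next
     * occurrence of end after it, or null if either marker is missing.
     */
    static String extract(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = text.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // Invented directory-listing row; the real page layout may differ.
        String row = "<IMG SRC=\"~icons/html.gif\"> <A HREF=\"program1.htm\">program1.htm</A>"
                + "</TD><TD NOWRAP>06-Feb-2008</TD>";
        System.out.println(extract(row, "<A HREF=\"", "\">"));
        System.out.println(extract(row, "NOWRAP>", "</TD>"));
    }
}
```

Returning null for a missing marker lets the caller skip a malformed entry, whereas calling substring() on the result of a failed indexOf() would throw an exception.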


You can use the sports data sources introduced in this set of slides for your data mining project.

You are strongly encouraged to identify other interesting public sports data sets for your project.

Thanks!