fsharp and data 101
TRANSCRIPT
-
F# and Data 101: Utilizing FsLab to Quickly Understand Your Data
Jonathan Dexter, Technology Manager of .NET, The Nerdery
-
Agenda
What is this talk about
Get the data!
Transform!
Science!
???
Profit!
-
If you came here to hear about
"Monads"
"Functors"
Tail-call optimization
Immutable design
Pattern matching
-
What we WILL talk about: FsLab
-
With Paket
paket init
paket add nuget fslab
-
With NuGet
nuget install fslab -OutputDirectory packages
-
The process (not limited to FsLab)
acquire data
|> transform
|> science
|> visualize
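The four-stage pipeline above can be sketched in plain F# with no FsLab dependency. The functions `acquire`, `transform`, `science`, and `visualize` below are hypothetical stand-ins invented for this illustration, and the data is made up; the point is only that each stage's output type is the next stage's input type.

```fsharp
// Hypothetical stand-ins for the four stages; none of these names come from FsLab.
let acquire () : (int * float) list =
    // A real script would call a type provider or read a file here.
    [ (2001, 10.0); (2002, 12.0); (2003, 15.0) ]

// Transform: keep only observations from 2002 onward.
let transform (rows: (int * float) list) =
    rows |> List.filter (fun (year, _) -> year >= 2002)

// Science: reduce the cleaned data to a single statistic (the mean).
let science (rows: (int * float) list) =
    rows |> List.averageBy snd

// Visualize: a formatted string stands in for a chart.
let visualize (mean: float) =
    sprintf "mean = %.1f" mean

let report =
    acquire ()
    |> transform
    |> science
    |> visualize

printfn "%s" report
```

Because every stage is an ordinary function, any one of them can be swapped for a library call later without touching the others.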
-
FsLab: Scratching the Surface
5 Libraries
-
Sorting the libraries above into categories:
Library            Acquire   Transform   Science    Display
F# Data            Yep!      -           -          -
Deedle             -         Yep         -          -
Math.NET Numerics  -         Supports    Supports   -
R Type Provider    Partial   Yep         Yep        Partial
XPlot              -         -           -          Yep
-
Step one: Acquire
-
Classic scenario: CSV
Using CSV type provider
type csv = FSharp.Data.CsvProvider<...>
let complaints = csv.Load(complaintsCsv)
Using a data frame
let data = Deedle.Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/data.csv")
-
Type Provider Scenario: World Bank provider
The World Bank provider is bundled with F# Data
let dataContext = FSharp.Data.WorldBankData.GetDataContext()
let highTechExports =
    dataContext
        .Countries.``United States``
        .Indicators.``High-technology exports (current US$)``
-
Type Provider Scenario: JSON provider
type JsonContext = FSharp.Data.JsonProvider<...>
-
Type Provider Scenario: JSON Provider (cont.)
let missedVotes = JsonContext.Load(sprintf "%s?apikey=%s" missedVotesUrl apikey)
let congressmen = missedVotes.Results |> Seq.collect (fun r -> r.Members)
// Members whose percentage cannot be read sort as 0; reverse to put the largest first.
let topMissingCongressman =
    congressmen
    |> Seq.sortBy (fun m -> try m.MissedVotesPct with | _ -> 0.0m)
    |> Seq.rev
    |> Seq.head
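The sort, reverse, and take-head steps above pick the member with the largest missed-vote percentage. A minimal sketch of the same idea in one pass with `Seq.maxBy` follows. The `Member` record is a hypothetical stand-in for the provider-generated type, the rows are made up, and a missing percentage is modeled as `None` rather than as an exception.

```fsharp
// Hypothetical stand-in for the provider-generated member type.
type Member = { Name: string; MissedVotesPct: decimal option }

// Made-up sample rows; None models a record whose percentage is missing.
let members =
    [ { Name = "A"; MissedVotesPct = Some 3.2m }
      { Name = "B"; MissedVotesPct = None }
      { Name = "C"; MissedVotesPct = Some 57.63m } ]

// Treat a missing percentage as 0, mirroring the try/with fallback,
// then take the member with the largest value in a single pass.
let topMissing =
    members
    |> Seq.maxBy (fun m -> defaultArg m.MissedVotesPct 0.0m)

printfn "%s" topMissing.Name
```

When several members tie for the maximum, `Seq.maxBy` and the sort-based version may return different members, so the two are not interchangeable if ties matter.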
-
Type Provider Scenario: JSON Provider (cont.)
{
  "id": "M000309",
  "name": "Carolyn McCarthy",
  "party": "D",
  "state": "NY",
  "district": "4",
  "total_votes": "1192",
  "missed_votes": "687",
  "missed_votes_pct": "57.63",
  "rank": "1",
  "notes": "Will retire at the end of 113th Congress."
}
-
Batteries not included: SQL Provider
* SQL Data Connection: https://msdn.microsoft.com/en-us/library/hh362320.aspx
* SQL Entity Connection: https://msdn.microsoft.com/en-us/library/hh362320.aspx
SQL Client: https://github.com/fsprojects/FSharp.Data.SqlClient
SQL Provider: https://github.com/fsprojects/SQLProvider
-
SQL Data Connection
Type provider for an entire database, MS SQL focused.
type dbSchema = SqlDataConnection<...>
let db = dbSchema.GetDataContext()
-
SQL Entity Connection
Type provider for an entire database, through the ADO.NET Entity model.
type dbSchema = SqlEntityConnection<...>
let db = dbSchema.GetDataContext()
-
SQL Client
Type provider for commands, sprocs, and queries
use cmd = new SqlCommandProvider<...>()
let results = cmd.Execute(region = "USA")
-
SQL Provider
Type provider for the database as a whole
MS SQL, Postgres, SQLite, MySQL, Oracle, MS Access
type sql = SqlDataProvider<...>
let ctx = sql.GetDataContext()
-
Step two: Transform
-
Deedle: Convert to data frame
open Deedle
open FSharp.Data
open FSharp.Data.Runtime.BaseTypes
Expander code omitted, but can be found here: https://github.com/fslaborg/FsLab/issues/14
let dataFrame =
    [ for l in congressmen -> series [ "It" => l ] ]
    |> Frame.ofRowsOrdinal
    |> Frame.expandAllCols 10
-
Deedle: Normal syntax
let highTechExportData =
    WorldBankData.GetDataContext()
        .Countries.``United States``
        .Indicators.``High-technology exports (current US$)``
// Index rows by year (the first tuple item) and give the value column a readable name.
let highTechFrame =
    highTechExportData
    |> Frame.ofRecords
    |> Frame.indexRowsInt "Item1"
    |> Frame.mapColKeys (fun _ -> "HighTechExports")
-
Deedle: Quick manipulations
Simple statistics
let stats =
    [ "Min" => Stats.min highTechFrame
      "Max" => Stats.max highTechFrame
      "Average" => Stats.mean highTechFrame
      "StandardDeviation" => Stats.stdDev highTechFrame ]
// (year, value) pairs as floats, ready for regression.
let observations =
    highTechFrame?``HighTechExports``
    |> Series.observations
    |> Seq.map (fun (k, v) -> float k, float v)
let regression =
    observations
    |> MathNet.Numerics.LinearRegression.SimpleRegression.Fit
-
Results
Stats
[ ("Min", series [ HighTechExports => 76767867475 ])
  ("Max", series [ HighTechExports => 220884471208 ])
  ("Average", series [ HighTechExports => 152642394565.462 ])
  ("StandardDeviation", series [ HighTechExports => 39746534154.5001 ]) ]
Regression fit (intercept, slope)
(-5.799301011e+12, 2973741397.0)
-
R Squared
R squared value
let rsquared =
    GoodnessOfFit.RSquared(regressedValues |> Seq.map snd, observations |> Seq.map snd)
0.3274641292
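To make the fit and the R squared value less of a black box, here is a sketch of the underlying arithmetic in plain F#. It is not Math.NET's implementation, and the observations are made up (points close to the line y = 1 + 2x) rather than the World Bank data. For a least-squares line with an intercept, R squared is 1 minus the residual sum of squares over the total sum of squares.

```fsharp
// Made-up (x, y) observations lying close to the line y = 1 + 2x.
let observations = [ (1.0, 3.1); (2.0, 4.9); (3.0, 7.2); (4.0, 8.8) ]

let xs = observations |> List.map fst
let ys = observations |> List.map snd
let meanX = List.average xs
let meanY = List.average ys

// Ordinary least squares: slope = Sxy / Sxx, intercept = meanY - slope * meanX.
let sxy = observations |> List.sumBy (fun (x, y) -> (x - meanX) * (y - meanY))
let sxx = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
let slope = sxy / sxx
let intercept = meanY - slope * meanX

// Fitted values, then R squared = 1 - SS_residual / SS_total.
let fitted = xs |> List.map (fun x -> intercept + slope * x)
let ssRes = List.map2 (fun y f -> (y - f) ** 2.0) ys fitted |> List.sum
let ssTot = ys |> List.sumBy (fun y -> (y - meanY) ** 2.0)
let rSquared = 1.0 - ssRes / ssTot

printfn "intercept = %f, slope = %f, R^2 = %f" intercept slope rSquared
```

A value near 1 means the line explains most of the variation in y; the 0.33 reported above says a straight line over years explains only about a third of the variation in the export series.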
-
Deedle: Combining data and additional feature creation
let exportFrame =
    WorldBankData.GetDataContext()
        .Countries.``United States``
        .Indicators.``Exports of goods and services (current US$)``
    |> Frame.ofRecords
    |> Frame.indexRowsInt "Item1"
    |> Frame.mapColKeys (fun _ -> "TotalExports")
exportFrame?``HighTechExports``
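The idea of combining two year-indexed series and deriving a new feature from them can be illustrated without Deedle. The sketch below uses plain F# maps and made-up numbers; it is an assumption about the general technique (align on the key, compute a ratio, keep gaps as missing), not a reconstruction of the slide's code.

```fsharp
// Made-up yearly values; real data would come from the World Bank provider.
let totalExports = Map.ofList [ (2000, 100.0); (2001, 120.0); (2002, 150.0) ]
let highTechExports = Map.ofList [ (2001, 18.0); (2002, 30.0) ]

// Align on the year key; a year missing from the high-tech series yields None,
// similar in spirit to a missing value in a data frame column.
let percentageHighTech =
    totalExports
    |> Map.map (fun year total ->
        highTechExports
        |> Map.tryFind year
        |> Option.map (fun highTech -> highTech / total))

percentageHighTech
|> Map.iter (fun year pct -> printfn "%d: %A" year pct)
```

A data frame library handles this alignment automatically when a column is added, which is why years with no high-tech figure show up as missing values rather than errors.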
-
Deedle: Straight to R
Arrrr
-
Deedle: Straight to R
open RProvider.``base``
open RProvider
let rFrame = R.as_data_frame(exportFrame)
let rFrameSummary = R.summary(rFrame)
[ "TotalExports Min.:2.700e+10"
  "TotalExports 1st Qu.:1.110e+11"
  "TotalExports Median:3.639e+11"
  "TotalExports Mean:6.684e+11"
  "TotalExports 3rd Qu.:1.015e+12"
  "TotalExports Max.:2.342e+12"
  "HighTechExports Min.:7.677e+10"
  "HighTechExports 1st Qu.:1.282e+11"
  "HighTechExports Median:1.521e+11"
  "HighTechExports Mean:1.526e+11"
  "HighTechExports 3rd Qu.:1.763e+11"
  "HighTechExports Max.:2.209e+11"
  "HighTechExports NA's:29"
  "PercentageofHighTechExports Min.:0.06563"
  "PercentageofHighTechExports 1st Qu.:0.12270"
  "PercentageofHighTechExports Median:0.15502"
  "PercentageofHighTechExports Mean:0.13875"
  "PercentageofHighTechExports 3rd Qu.:0.16385"
  "PercentageofHighTechExports Max.:0.18290"
  "PercentageofHighTechExports NA's:29" ]
-
Step three: Science
-
"Normal" data analysis languages
-
F# is slowly catching up
Machine Learning Algorithms (suite)
-
F# is slowly catching up
Natural Language Processing
-
F# is slowly catching up
Cloud computing
-
Step four: Visualize
-
Visualizing our previous information
With Google Charts
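let pieChart =
    congressmen
    // Skip members whose missed-vote count cannot be read.
    |> Seq.filter (fun c -> try c.MissedVotes >= 0 with _ -> false)
    |> Seq.groupBy (fun c -> c.Party)
    // One (party, total missed votes) pair per slice.
    |> Seq.map (fun g -> fst g, (snd g) |> Seq.sumBy (fun c -> c.MissedVotes))
    |> XPlot.GoogleCharts.Chart.Pie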
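The data shaping that feeds the pie chart (group by party, then sum missed votes per group) can be tried on its own, without the JSON provider or XPlot. The `Member` record and its rows below are hypothetical and made up for illustration.

```fsharp
// Hypothetical stand-in for the provider-generated member type.
type Member = { Party: string; MissedVotes: int }

// Made-up rows for illustration only.
let members =
    [ { Party = "D"; MissedVotes = 10 }
      { Party = "R"; MissedVotes = 7 }
      { Party = "D"; MissedVotes = 5 }
      { Party = "I"; MissedVotes = 2 } ]

// One (label, value) pair per party: the shape a pie chart takes as input.
let missedByParty =
    members
    |> Seq.groupBy (fun m -> m.Party)
    |> Seq.map (fun (party, group) -> party, group |> Seq.sumBy (fun m -> m.MissedVotes))
    |> Seq.toList

printfn "%A" missedByParty
```

Destructuring the group as `(party, group)` reads a little more clearly than `fst g` and `snd g`, and checking the pairs in a list first makes it easier to spot bad input before it reaches the chart.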
-
Visualizing our previous information
-
Visualizing our previous information
let scatter =
    exportFrame?``TotalExports``
    |> Series.observations
    |> XPlot.GoogleCharts.Chart.Scatter
-
Visualizing our previous information
-
Summary
Resources
Presentation code: https://github.com/mandest/FSharpAndDataTalk
Presentation share: https://github.com/mandest/FSharpAndDataTalk
More F# Resources:
* F# Guides on fsharp.org: http://fsharp.org/
* Functional Programming Slack: http://fpchat.com/
* F# Weekly: https://sergeytihon.wordpress.com/category/f-weekly/