F# and Data 101: Utilizing FsLab to Quickly Understand your Data. Jonathan Dexter, Technology Manager of .NET, The Nerdery

Post on 13-Feb-2017

TRANSCRIPT

  • F# and Data 101: Utilizing FsLab to Quickly Understand your Data

    Jonathan Dexter, Technology Manager of .NET, The Nerdery

  • Agenda

    What is this talk about
    Get the data!
    Transform!
    Science!
    ???
    Profit!

  • If you came here to hear about

    "Monads"
    "Functors"
    Tail-call optimization
    Immutable design
    Pattern matching

  • What we WILL talk about : FsLab

  • With Paket

    paket init
    paket add nuget fslab

  • With NuGet

    nuget install fslab -OutputDirectory packages

  • The process(Not limited to FsLab)

    acquire data
    |> transform
    |> science
    |> visualize
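
    The four-stage pipeline can be sketched in plain F# without any FsLab package. Everything here is an assumption made for illustration: the stage functions, the comma-separated input, and the choice of a mean as the "science" step are hypothetical stand-ins, not FsLab APIs.

```fsharp
open System
open System.Globalization

// Hypothetical stand-ins for the four stages; none of these are FsLab functions.
let acquire (source: string) : string list =
    // Treat each comma-separated token as one raw record.
    source.Split(',') |> List.ofArray

let transform (rows: string list) : float list =
    // Keep only the tokens that parse as numbers.
    rows
    |> List.choose (fun s ->
        match Double.TryParse(s.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture) with
        | true, v -> Some v
        | _ -> None)

let science (values: float list) : float =
    // The "analysis" in this sketch is just the mean.
    List.average values

let visualize (mean: float) : string =
    sprintf "mean = %.2f" mean

// Made-up input with one bad record.
let data = "1, 2, oops, 3"
let report = acquire data |> transform |> science |> visualize
```

    Each stage only needs to accept what the previous one returns, which is why the same shape works whichever library supplies the individual steps.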

  • FsLab: Scratching the Surface

    5 Libraries

  • Sorting the libraries above into categories:

    Library             Acquire   Transform   Science    Display
    F# Data             Yep!      -           -          -
    Deedle              -         Yep         -          -
    Math.NET Numerics   -         Supports    Supports   -
    R Type Provider     Partial   Yep         Yep        Partial
    XPlot               -         -           -          Yep

  • Step one: Acquire

  • Classic scenario: CSV

    Using CSV type provider

    type csv = FSharp.Data.CsvProvider<...>
    let complaints = csv.Load(complaintsCsv)

    Using a data frame

    let data = Deedle.Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/data.csv")

  • Type Provider Scenario: World bank provider

    World bank provider is bundled with F# Data

    let dataContext = FSharp.Data.WorldBankData.GetDataContext()

    let highTechExports =
        dataContext
            .Countries.``United States``
            .Indicators.``High-technology exports (current US$)``

  • Type Provider Scenario: JSON provider

    type JsonContext = FSharp.Data.JsonProvider<...>

  • Type Provider Scenario: JSON Provider (cont.)

    let missedVotes =
        JsonContext.Load(sprintf "%s?apikey=%s" missedVotesUrl apikey)

    let congressmen =
        missedVotes.Results
        |> Seq.collect (fun r -> r.Members)

    // Treat members whose percentage cannot be read as 0, then take the largest.
    let topMissingCongressman =
        congressmen
        |> Seq.sortBy (fun m -> try m.MissedVotesPct with | _ -> 0.0m)
        |> Seq.rev
        |> Seq.head

  • Type Provider Scenario: JSON Provider (cont.)

    {
      "id": "M000309",
      "name": "Carolyn McCarthy",
      "party": "D",
      "state": "NY",
      "district": "4",
      "total_votes": "1192",
      "missed_votes": "687",
      "missed_votes_pct": "57.63",
      "rank": "1",
      "notes": "Will retire at the end of 113th Congress."
    }
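
    The "sort, reverse, take the head" selection used for the top entry can be sketched with plain records. The `Row` type and its values are made up for illustration, and `None` stands in for a percentage that could not be read; `Seq.maxBy` is shown as a one-pass way to pick the same element.

```fsharp
// Hypothetical record standing in for the provided JSON type.
type Row = { Name: string; MissedVotesPct: decimal option }

// Made-up rows; "B" has no readable percentage.
let rows : Row list =
    [ { Name = "A"; MissedVotesPct = Some 12.5m }
      { Name = "B"; MissedVotesPct = None }
      { Name = "C"; MissedVotesPct = Some 57.63m } ]

// Missing percentages count as 0, mirroring the fallback in the slide code.
let pct (r: Row) = defaultArg r.MissedVotesPct 0.0m

// Same shape as the slide: ascending sort, reverse, first element.
let viaSort = rows |> Seq.sortBy pct |> Seq.rev |> Seq.head

// Single pass, no intermediate sorted sequence.
let viaMax = rows |> Seq.maxBy pct
```

    Both expressions pick row "C" here. When several rows tie for the maximum, the two approaches may return different rows, so the choice matters only if ties need a defined order.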

  • Batteries not included: SQL Provider

    *SQL Data Connection: https://msdn.microsoft.com/en-us/library/hh362320.aspx

    *SQL Entity Connection: https://msdn.microsoft.com/en-us/library/hh362320.aspx

    SQL Client: https://github.com/fsprojects/FSharp.Data.SqlClient

    SQL Provider: https://github.com/fsprojects/SQLProvider

  • SQL Data Connection

    Type provider for an entire database, MS SQL focused.

    type dbSchema = SqlDataConnection<...>
    let db = dbSchema.GetDataContext()

  • SQL Entity Connection

    Type provider for an entire database, through the ADO.NET Entity model.

    type dbSchema = SqlEntityConnection<...>
    let db = dbSchema.GetDataContext()

  • SQL Client

    Type provider for commands, sprocs, and queries

    use cmd = new SqlCommandProvider<...>()

    let results = cmd.Execute(region = "USA")

  • SQL Provider

    Type provider for DB as a whole

    MS SQL, Postgres, SQLite, MySQL, Oracle, MS Access

    type sql = SqlDataProvider<...>

    let ctx = sql.GetDataContext()

  • Step two: Transform

  • Deedle: Convert to data frame

    open Deedle
    open FSharp.Data
    open FSharp.Data.Runtime.BaseTypes

    Expander code omitted, but can be found here: https://github.com/fslaborg/FsLab/issues/14

    let dataFrame =
        [ for l in congressmen -> series [ "It" => l ] ]
        |> Frame.ofRowsOrdinal
        |> Frame.expandAllCols 10

  • Deedle: Normal syntax

    let highTechExportData =
        WorldBankData.GetDataContext()
            .Countries.``United States``
            .Indicators.``High-technology exports (current US$)``

    let highTechFrame =
        highTechExportData
        |> Frame.ofRecords
        |> Frame.indexRowsInt "Item1"
        |> Frame.mapColKeys (fun _ -> "HighTechExports")

  • Deedle: Quick manipulations

    Simple statistics

    let stats =
        [ "Min" => Stats.min highTechFrame
          "Max" => Stats.max highTechFrame
          "Average" => Stats.mean highTechFrame
          "StandardDeviation" => Stats.stdDev highTechFrame ]

    let observations =
        highTechFrame?``HighTechExports``
        |> Series.observations
        |> Seq.map (fun (k, v) -> float k, float v)

    let regression =
        observations
        |> MathNet.Numerics.LinearRegression.SimpleRegression.Fit

  • Results

    Stats

    [ ("Min", series [ HighTechExports => 76767867475 ])
      ("Max", series [ HighTechExports => 220884471208 ])
      ("Average", series [ HighTechExports => 152642394565.462 ])
      ("StandardDeviation", series [ HighTechExports => 39746534154.5001 ]) ]

    Regression fit (intercept, slope)

    (-5.799301011e+12, 2973741397.0)
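
    For readers who want to see what an (intercept, slope) pair represents, here is a hand-written ordinary least squares fit for y = intercept + slope * x. It is a sketch under stated assumptions: `fitLine` is a hypothetical helper, the points are made up, and nothing here is taken from the Math.NET implementation.

```fsharp
// Ordinary least squares for a straight line, written out by hand.
// fitLine is a hypothetical helper, not a Math.NET function.
let fitLine (points: (float * float) list) : float * float =
    let xMean = points |> List.averageBy fst
    let yMean = points |> List.averageBy snd
    // Covariance-like and variance-like sums (unnormalized).
    let sxy = points |> List.sumBy (fun (x, y) -> (x - xMean) * (y - yMean))
    let sxx = points |> List.sumBy (fun (x, _) -> (x - xMean) ** 2.0)
    let slope = sxy / sxx
    let intercept = yMean - slope * xMean
    intercept, slope

// Made-up points lying exactly on y = 1 + 2x.
let intercept, slope = fitLine [ (0.0, 1.0); (1.0, 3.0); (2.0, 5.0) ]
```

    When x is a calendar year, the intercept is the line extrapolated back to year 0, so its magnitude can be far larger than any observed value.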

  • R Squared

    R squared value

    let rsquared =
        GoodnessOfFit.RSquared(regressedValues |> Seq.map snd, observations |> Seq.map snd)

    0.3274641292
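
    As a reminder of what an R squared value measures, the textbook definition is 1 - SS_residual / SS_total. The sketch below implements that definition directly; `rSquared` is a hypothetical helper with made-up inputs, and it is not claimed to match how the library computes its value internally.

```fsharp
// Coefficient of determination: 1 - SS_residual / SS_total.
// rSquared is a hypothetical helper; both lists must have the same length.
let rSquared (modelled: float list) (observed: float list) : float =
    let mean = List.average observed
    let ssRes = List.map2 (fun m o -> (o - m) ** 2.0) modelled observed |> List.sum
    let ssTot = observed |> List.sumBy (fun o -> (o - mean) ** 2.0)
    1.0 - ssRes / ssTot

// A model that reproduces every observation explains all of the variance.
let perfect = rSquared [ 1.0; 3.0; 5.0 ] [ 1.0; 3.0; 5.0 ]

// A model that always predicts the mean explains none of it.
let flat = rSquared [ 3.0; 3.0; 3.0 ] [ 1.0; 3.0; 5.0 ]
```

    Read this way, a value near 0.33 means the straight line accounts for roughly a third of the variation in the observations.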

  • Deedle: Combining data and additional feature creation

    let exportFrame =
        WorldBankData.GetDataContext()
            .Countries.``United States``
            .Indicators.``Exports of goods and services (current US$)``
        |> Frame.ofRecords
        |> Frame.indexRowsInt "Item1"
        |> Frame.mapColKeys (fun _ -> "TotalExports")

    exportFrame?``HighTechExports``
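
    The idea of combining two year-keyed series and deriving a new feature from them can be sketched with plain maps instead of Deedle. This is an illustration only: the numbers are made up, the names are hypothetical, and the `option` type stands in for the missing values a data frame would report when one series has no entry for a year.

```fsharp
// Year-keyed series as plain maps; all numbers are made up.
let totalExports = Map [ (1988, 400.0); (1989, 500.0); (1990, 600.0) ]
let highTechExports = Map [ (1989, 100.0); (1990, 150.0) ]

// Align on the total-exports years: a year with no high-tech value yields None.
let highTechShare =
    totalExports
    |> Map.map (fun year total ->
        highTechExports
        |> Map.tryFind year
        |> Option.map (fun highTech -> highTech / total))

// Count the years where the derived feature is missing.
let missingYears =
    highTechShare |> Map.filter (fun _ share -> share.IsNone) |> Map.count
```

    Keeping the longer series as the key set means shorter series contribute gaps rather than dropping rows, which is one common way missing entries end up in a combined table.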

  • Deedle: Straight to R

    Arrrr

  • Deedle: Straight to R

    open RProvider.``base``
    open RProvider
    let rFrame = R.as_data_frame(exportFrame)
    let rFrameSummary = (R.summary rFrame)

    [ "TotalExports Min.: 2.700e+10"
      "TotalExports 1st Qu.: 1.110e+11"
      "TotalExports Median: 3.639e+11"
      "TotalExports Mean: 6.684e+11"
      "TotalExports 3rd Qu.: 1.015e+12"
      "TotalExports Max.: 2.342e+12"
      "HighTechExports Min.: 7.677e+10"
      "HighTechExports 1st Qu.: 1.282e+11"
      "HighTechExports Median: 1.521e+11"
      "HighTechExports Mean: 1.526e+11"
      "HighTechExports 3rd Qu.: 1.763e+11"
      "HighTechExports Max.: 2.209e+11"
      "HighTechExports NA's: 29"
      "PercentageofHighTechExports Min.: 0.06563"
      "PercentageofHighTechExports 1st Qu.: 0.12270"
      "PercentageofHighTechExports Median: 0.15502"
      "PercentageofHighTechExports Mean: 0.13875"
      "PercentageofHighTechExports 3rd Qu.: 0.16385"
      "PercentageofHighTechExports Max.: 0.18290"
      "PercentageofHighTechExports NA's: 29" ]

  • Step three: Science

  • "Normal" data analysis languages

  • F# is slowly catching up

    Machine Learning Algorithms (suite)

  • F# is slowly catching up

    Natural Language Processing

  • F# is slowly catching up

    Cloud computing

  • Step four: Visualize

  • Visualizing our previous information

    With Google Charts

    let pieChart =
        congressmen
        |> Seq.filter (fun c -> try c.MissedVotes >= 0 with _ -> false)
        |> Seq.groupBy (fun c -> c.Party)
        |> Seq.map (fun g -> fst g, (snd g) |> Seq.sumBy (fun c -> c.MissedVotes))
        |> XPlot.GoogleCharts.Chart.Pie
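
    The group-then-sum step that feeds the pie chart can be tried without the JSON provider or XPlot. The `Member` record and its rows below are hypothetical stand-ins for the provided type; only the `Seq.groupBy` and `Seq.sumBy` shape mirrors the slide.

```fsharp
// Hypothetical record standing in for the provided JSON type.
type Member = { Name: string; Party: string; MissedVotes: int }

// Made-up rows for illustration only.
let members : Member list =
    [ { Name = "A"; Party = "D"; MissedVotes = 10 }
      { Name = "B"; Party = "R"; MissedVotes = 7 }
      { Name = "C"; Party = "D"; MissedVotes = 5 } ]

// One (label, value) pair per party, which is the shape a pie chart expects.
let missedByParty =
    members
    |> Seq.groupBy (fun m -> m.Party)
    |> Seq.map (fun (party, group) -> party, group |> Seq.sumBy (fun m -> m.MissedVotes))
    |> List.ofSeq
```

    Destructuring the group as `(party, group)` reads a little more directly than `fst g` and `snd g`, though both produce the same pairs.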

  • Visualizing our previous information

  • Visualizing our previous information

    let scatter =
        exportFrame?``TotalExports``
        |> Series.observations
        |> XPlot.GoogleCharts.Chart.Scatter

  • Visualizing our previous information

  • Summary

    Resources

    Presentation code:

    Presentation share: https://github.com/mandest/FSharpAndDataTalk

    More F# Resources:

    F# Guides on fsharp.org: http://fsharp.org/

    Functional Programming Slack: http://fpchat.com/

    F# Weekly: https://sergeytihon.wordpress.com/category/f-weekly/