command line data tools

Command-line Data Tools Peter Wang @pwang

Upload: peter-wang

Post on 28-Jan-2018

Why?

• Sometimes you just want to sling data

• Text is still king: the lowest common denominator

• Machines are pretty honking big now

This Presentation

• List of some good collections of cmd-line tools

• Call out and describe a few in particular

• The PyDataTool of my desire

Sources

• From author of “Data Science at the Command Line”: http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html (larger list at http://datascienceatthecommandline.com/)

• HN discussion: https://news.ycombinator.com/item?id=6412190

• https://github.com/bitly/data_hacks

Tools

• JSON:

• jq: https://stedolan.github.io/jq/

• RecordStream: https://github.com/benbernard/RecordStream

• csvkit: https://csvkit.readthedocs.io/en/1.0.2/

• dt: https://github.com/clarkgrubb/data-tools

• XMLStarlet: http://xmlstar.sourceforge.net/overview.php
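To give a feel for the kind of job these tools do, here is a minimal Python sketch that pulls one field out of newline-delimited JSON, in the spirit of a jq field filter such as `.name`. The function name `pluck` and the sample records are hypothetical, made up for illustration; this is not how jq itself is implemented.

```python
import json
from typing import Any, Iterable, Iterator


def pluck(lines: Iterable[str], key: str) -> Iterator[Any]:
    """Yield record[key] for each JSON object in a stream of lines.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        record = json.loads(line)
        # A missing key yields None, much as jq prints null for a missing field.
        yield record.get(key)


if __name__ == "__main__":
    # Made-up sample records, one JSON object per line.
    sample = [
        '{"name": "jq", "format": "json"}',
        '{"name": "csvkit", "format": "csv"}',
    ]
    for value in pluck(sample, "name"):
        print(value)
```

Because `pluck` is a generator over any iterable of lines, it could also be fed `sys.stdin` and sit in a shell pipeline, handling one record at a time.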

Honorable Mentions

• Pythonic awk: https://github.com/alecthomas/pawk

• Google Crush Tools: https://github.com/google/crush-tools

• Xonsh: http://xon.sh/tutorial.html

The PyDataTool of My Desire

• Support for csv, json, sql, xls, hdf5; image formats; network formats (pcap etc.)

• Capability of:

• csvkit, jq, dt, “cols” tool

• unix tools: sed, sort, shuf, split, tr, tee, uniq, wc, head, tail, bc

• netpbm, imagemagick for images

• Work in streaming mode (netcat, wget, curl)

• First-class support for dask, spark

• Basic plotting via gnuplot, mpl, bokeh

• Built-in SQLite for in-memory queries
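One item on this wish list, SQL queries over text data backed by in-memory SQLite, can be sketched with the Python standard library alone. The sketch below assumes CSV input with a header row; the function name `query_csv`, the default table name `data`, and the sample rows are all hypothetical. It only illustrates the idea and is not the tool described above.

```python
import csv
import sqlite3
from typing import Iterable, List, Tuple


def query_csv(lines: Iterable[str], sql: str, table: str = "data") -> List[Tuple]:
    """Load CSV lines into an in-memory SQLite table and run one query.

    Hypothetical helper for illustration. The first row is taken as the
    header, and every value is stored as text.
    """
    reader = csv.reader(lines)
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    try:
        # Quote column names so headers with spaces still work.
        cols = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
        conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
        placeholders = ", ".join("?" for _ in header)
        # executemany consumes the reader row by row.
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Made-up sample data.
    sample = ["tool,fmt", "jq,json", "csvkit,csv", "csvkit,csv"]
    query = "SELECT tool, COUNT(*) FROM data GROUP BY tool ORDER BY tool"
    for row in query_csv(sample, query):
        print(row)
```

Passing `sys.stdin` as `lines` would let the same function read from a pipe. Rows whose width differs from the header raise an error here; a real tool would need a policy for ragged input and for type inference.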

Continuum Is Hiring!

• Creators of Anaconda, conda, bokeh, blaze, dask, holoviews, numba, phosphorJS

• Maintainers/contributors to Jupyter, JupyterLab, Spyder, pandas, conda-forge, …

• 150+ ppl, 80 in Austin

• Venture backed

• Enterprise product, OSS community innovation, consulting, training

Continuum Is Hiring

• Enterprise Product Team:

• Dev Manager (reports to CTO, runs product engineering)

• QA Lead Engineer - creates test plans, coordinates with product mgmt, dev, and testing team

• Senior Python Developer - enterprise product development; backend, web tech; full stack preferred

• DevOps and Operations - enterprise product, anaconda.org, Anaconda build system

• Email [email protected]