Semantic Data Science for the US Census Bureau
Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community. http://semanticommunity.info/ http://datacommunitydc.org/blog/2013/08/cloud-soa-semantics-and-data-science-conference/
1
Semantic Data Science for the US Census Bureau
Dr. Brand Niemann, Director and Senior Data Scientist
Semantic Community http://semanticommunity.info/
http://datacommunitydc.org/blog/2013/08/cloud-soa-semantics-and-data-science-conference/ https://silverspotfire.tibco.com/us/library#/users/bniemann/Public http://semanticommunity.info/Census_Semantic_Knowledge_Base
November 14, 2013
2
Google Search Display: Census Bureau
3
Google Search Result: Census Bureau
• Home Page
– First source for current population data and the latest Economic Indicators
• State and County QuickFacts
– USA QuickFacts
• American FactFinder
– Your source for population, housing ...
• 2010 Census
– Redistricting Data - What is the Census?
• Population Estimates
– The Census Bureau's Population Estimates Program
• Easy Stats
– Easy Stats gives you quick and easy access
• Data Access Tools
– The Census Bureau data tools provide on-line access
4
Data Access Tools
• Interactive Internet Data Tools:
– Data Visualization Gallery - A weekly exploration of Census data used to promote visualization and make data accessible to a broader audience.
– DataFerrett is a tool and data librarian that searches and retrieves data across federal, state, and local surveys, executes customized variable recoding, and creates complex tabulations and business graphics. Available data include the Current Population Survey, Survey of Income and Program Participation, American Community Survey, American Housing Survey, Small Area Income Poverty Estimates, Population Estimates, Economic Census Areawide Statistics, National Center for Health Statistics data, Centers for Disease Control data, and more.
– DataFerrett’s newest tool, the Community Economic Development HotReport provides community and business leaders speedy access to information on counties and the Employment & Training Administration’s Workforce Innovation in Regional Economic Development (WIRED) areas across the U.S.
6
Census Data Visualization Gallery As Data For the Digital Government Strategy
http://semanticommunity.info/Census_Data_Visualization
My Note: Structured and unstructured information is all turned into a knowledge base of data for relational and graph database processing.
My Note: The entire platform can be searched. The entire knowledge base page can be searched.
7
Census Data Visualization Gallery: Spotfire
Spotfire Web Player
My Note: This is a federation of diverse data sources to find, facet filter, visualize, and discover new facts.
9
Data Ferrett Description
• DataFerrett is a data analysis and extraction tool to customize federal, state, and local data to suit your requirements. Using DataFerrett, you can develop an unlimited array of customized spreadsheets that are as versatile and complex as your usage demands, then turn those spreadsheets into graphs and maps without any additional software.
• My Comment: This is what I use Spotfire for on Open Government Data for the Digital Government Strategy.
10
Community Economic Development HotReport Description
• This site, the Community Economic Development HotReport, provides access for users seeking economic indicators for individual counties.
• For areas that experience economic disruptions due to natural disasters, plant closings, base closings, and other economic changes, such as abrupt increases in employment, this HotReport shows pertinent economic indicators in unified on-line reports from many data sources.
11
Community Economic Development HotReport Web Site
Click on graph to view table.
Community Economic Development HotReport
12
White House Big Data Event: Data to Knowledge to Action
Making the Most of Big Data
“Just wanted to say how helpful it is that you take notes and share so broadly at these types of events. Thanks for your ongoing contributions to all the communities of which you are a part.”
13
Semantic Data Science Team Attends White House Big Data Event
• Our work is an example of the bold new collaboration theme: “Harnessing the Potential of Data Scientists and Big Data for Scientific Discovery” that shows “Data Innovation Across Sectors” and includes the following breakout session topics:
– Education and Workforce Development (George Mason University and Johns Hopkins University - see below)
– Research and Development (NIH and YarcData)
– Innovation (DC Data Science Community and Semantic Community)
• My Note: Census is one of 9 agencies involved in this NITRD effort.
14
NITRD Supplement to the FY14 President’s Budget
• We have worked to support the NITRD Current and Planned Coordination Activities as follows:
– Working with two of the six agencies (NSF and NIH) and trying to work with the other four (DoD, DARPA, DOE, and USGS);
– Following the work in the NSF-NIH Solicitation, Core Techniques and Technologies for Advancing Big Data Science & Engineering, for datasets and results that can be reused;
– Helping ensure a trained workforce to capitalize on big data resources by working with GMU Data Science as part of our team and preparing a graduate course on data science using the applications and data sets mentioned above and below;
– Providing examples of applications that use multiagency big datasets and core technology that is needed to turn heterogeneous data into more homogeneous, interoperable data;
– Providing big data infrastructure development for domain science with Spotfire and the YarcData Graph Appliance; and
– Attending the second National Big Data R&D Initiative event.
• My Note: We would like to work with Census on any or all of these!
Current and Planned Coordination Activities
15
Demos
• Spotfire 6:
– Web Link
• Semantic Medline with YarcData Graph Appliance Pilot:
– Wiki
– YarcData Videos
– Schizo-7 minutes
– Cancer-21 minutes
16
Contact Information
• Brand Niemann, Semantic Community
– [email protected]
– 703-268-9314
– http://semanticommunity.info
• N. Fredrik Salvesen, SBK LLC Alliance Partner YarcData
– [email protected]
– 443 994-5193
– http://yarcdata.com/
17
Some Next Steps
• So after about 10 years of development and the recent work of our Semantic Data Science Team, we think we have the best US Federal Government semantic knowledge base (NIH Semantic Medline) running on one of the best graph computers (YarcData) for the OSTP/NITRD Federal Big Data Senior Steering WG.
• Our goal is to produce the “Killer Semantic Web Application for the US Federal Government” and we still have a ways to go.
• Now we need to help other agencies do the same by applying semantic data science to their data and metadata to develop their semantic knowledge base for piloting on the best graph computers.
• The following is a pilot example to begin to develop a semantic knowledge base for US Census showing the steps for preparing legacy US Census data sources and for collecting new US Census data sources so they are stored directly in a semantic knowledge base.
– A historical note: This is like when I led the E-forms For E-government Pilot for OMB and the Federal CIO Council. I selected the US Census Economic Census E-forms solution by Rick Fenestra to be the best practice for getting about 15 E-forms solutions being used by the US Federal Government to adopt a common e-Grant XML Schema so all 15 could become semantically interoperable and agencies would not have to “rip and replace” solutions. This approach could make agency semantic knowledge bases interoperable so they can be federated and we would have a “killer semantic web application” on top of “individual killer semantic web applications”!
18
Data Access Tools
http://www.census.gov/main/www/access.html
• Quick Facts
• American FactFinder
• Easy Stats
• My Congressional District
• Population Finder
• American Community Survey
• 2010 Census
• Economic Census
• Interactive Maps
• Data Visualizations
• Training & Workshops
• Data Tools
• Catalogs
• Publications
19
Census Semantic Knowledge Base
• US Census data is available in the following ways:
– Data Access Tools: Making It Easier to Use the Data Than Just Direct File Access Below (Start Here)
– Research Data Centers: Access to Confidential Data (Defer This Until Later Stage)
– Software to Download: More Tools to Use (This is More About Data Than Software)
– Direct File Access: Public (Include This) and Private (Not Applicable Here)
– Access Tools at Other Sites: Is There a Better Place to Build This Semantic Knowledge Base? (That University of Minnesota Web Site Looks Pretty Good!)
My Note: This defines how to start and the scope of the semantic knowledge base.
20
Semantic Knowledge Base
• Initially we need at least a taxonomy and a vocabulary.
• Eventually, we would like an ontology and thesaurus.
• We need to build a data and metadata ecosystem with relational and graph data sets.
• The pilot will build a knowledge base in MindTouch, spreadsheets in Excel, a dashboard in Spotfire, and a business process for data collection in Be Informed.
• The pilot will be scaled up to create an RDF triple store for the YARCData Graph Appliance.
• In essence, I am going to build a “SemanticData.gov” type application for the US Census Data.
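My Note (illustrative sketch): The step from a taxonomy and a vocabulary to RDF triples can be outlined in a few lines of Python. This is only a sketch under stated assumptions, not the pilot's actual pipeline: the base URI `http://example.org/census/` is hypothetical, the choice of SKOS properties (`prefLabel`, `broader`, `definition`) is an assumption about how one might model the taxonomy, and the sample entries are taken from the tool names and DataFerrett description on earlier slides.

```python
# Hypothetical base URI; the slides do not specify one.
BASE = "http://example.org/census/"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def slug(label):
    """Turn a label into a URI-safe identifier (non-alphanumerics become '_')."""
    return "".join(c if c.isalnum() else "_" for c in label).strip("_")


def literal(text):
    """Escape a string as a plain N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def taxonomy_to_ntriples(taxonomy, glossary):
    """Emit one N-Triples line per label, parent link, and definition."""
    lines = []
    for parent, children in taxonomy.items():
        p = f"<{BASE}{slug(parent)}>"
        lines.append(f"{p} <{SKOS}prefLabel> {literal(parent)} .")
        for child in children:
            c = f"<{BASE}{slug(child)}>"
            lines.append(f"{c} <{SKOS}prefLabel> {literal(child)} .")
            lines.append(f"{c} <{SKOS}broader> {p} .")
    for term, definition in glossary.items():
        t = f"<{BASE}{slug(term)}>"
        lines.append(f"{t} <{SKOS}definition> {literal(definition)} .")
    return lines


# Sample taxonomy (tool names from the slides) and a one-entry vocabulary.
TAXONOMY = {"Data Access Tools": ["QuickFacts", "Easy Stats", "DataFerrett"]}
GLOSSARY = {"DataFerrett": "A data analysis and extraction tool."}
NTRIPLES = taxonomy_to_ntriples(TAXONOMY, GLOSSARY)

if __name__ == "__main__":
    print("\n".join(NTRIPLES))
```

The output is a list of N-Triples lines that a triple store could load; an ontology or thesaurus would add richer relations than the single parent link used here.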
21
Data Access Tools
• Data Visualization Gallery: Recall Slide 6 Knowledge Base and Slide 7 Spotfire
• 2010 Census Interactive Population Map
• The American FactFinder
• QuickFacts
• Easy Stats
• County Business & Demographics Map
• Economic Database Search and Trend Charts
• Glossary: See Slide 26 Excel and Slide 29 Spotfire Knowledge Bases
• Censtats
• Online Mapping Tools
• US Gazetteer
• Business Dynamics Statistics
• DataFerrett: Recall Slides 8-9
• Community Economic Development HotReport: Recall Slides 10-11
• QWI Online
• OnTheMap
• Industry Focus
• Census 2000 EEO Data Tool
My Note: This is another taxonomy!
22
Data Access Tools: Knowledge Base Spreadsheet
http://semanticommunity.info/@api/deki/files/27077/USCensusSemanticKnowledgeBase.xlsx
My Note: This is a taxonomy in Semantic Web Linked Open Data Format.
23
Direct File Access: Public
http://www2.census.gov/census_2000/datasets/
My Note: This is a taxonomy of how Census organizes its data files that needs to be a searchable index in a spreadsheet.
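My Note (illustrative sketch): One way to turn a directory listing into a searchable spreadsheet index is sketched below in Python. The paths are hypothetical placeholders, not copied from the Census download site, and the column names (`product`, `state`, `file`) assume a two-level folder layout that may not match the real one.

```python
import csv
import io

# Hypothetical relative paths, for illustration only.
PATHS = [
    "Product_A/State_X/file_001.zip",
    "Product_A/State_Y/file_002.zip",
    "Product_B/State_X/file_003.zip",
]

FIELDS = ["product", "state", "file", "path"]


def build_index(paths):
    """Split each path into one spreadsheet row per file."""
    rows = []
    for path in paths:
        parts = path.split("/")
        rows.append(
            {"product": parts[0], "state": parts[1], "file": parts[-1], "path": path}
        )
    return rows


def to_csv(rows):
    """Serialize the index as CSV text that Excel or Spotfire can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def search(rows, term):
    """Return rows where any column contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in rows if any(term in v.lower() for v in r.values())]


INDEX = build_index(PATHS)

if __name__ == "__main__":
    print(to_csv(INDEX))
    print(search(INDEX, "state_x"))
```

Each folder level becomes a column, so the folder taxonomy can be filtered or faceted like any other spreadsheet.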
24
Direct File Access Public: Knowledge Base Spreadsheet
http://semanticommunity.info/@api/deki/files/27077/USCensusSemanticKnowledgeBase.xlsx
My Note: This is in both relational and graph (subject, predicate, and object) database formats.
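My Note (illustrative sketch): The relationship between the two formats can be shown in a short Python sketch. Each relational row becomes a set of (subject, predicate, object) triples, with the row key as the subject and each column name as a predicate. The table name `tool`, the `id` values, and the column names are hypothetical; only the tool names come from the slides.

```python
def row_to_triples(table, key, row):
    """Convert one relational row into (subject, predicate, object) triples."""
    subject = f"{table}/{row[key]}"
    return [(subject, column, value) for column, value in row.items() if column != key]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Relational view: one row per tool (ids are hypothetical).
ROWS = [
    {"id": "tool-1", "name": "DataFerrett", "category": "Data Access Tools"},
    {"id": "tool-2", "name": "QuickFacts", "category": "Data Access Tools"},
]

# Graph view: the same facts as triples.
TRIPLES = [t for row in ROWS for t in row_to_triples("tool", "id", row)]

if __name__ == "__main__":
    for triple in TRIPLES:
        print(triple)
    print(match(TRIPLES, p="category", o="Data Access Tools"))
```

The same spreadsheet can therefore feed a relational tool such as Spotfire and, once converted, a triple store.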
25
Census Taxonomy and Vocabulary: MindTouch Matrix
http://semanticommunity.info/Census_Semantic_Knowledge_Base#Story
My Note: The entire page & platform can be searched.
26
Census Semantic Knowledge Base: Excel Glossary
http://semanticommunity.info/@api/deki/files/27084/CensusSemanticKnowledgeBase.xlsx
My Note: All of these spreadsheets can be searched.
My Note: The Semantic Community approach is consistent with the EU ISA Recommended URI Design and Management Principles.
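My Note (illustrative sketch): A pattern often cited from the ISA guidance on persistent URIs is `http://{domain}/{type}/{concept}/{reference}`. The Python sketch below mints URIs in that shape. The slides do not say which specific rules the Semantic Community pages follow, so the function name `mint_uri`, the domain `example.org`, and the sample concept are all assumptions made for illustration.

```python
import re


def clean(part):
    """Lowercase a label and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")


def mint_uri(domain, type_, concept, reference):
    """Build a URI of the form http://{domain}/{type}/{concept}/{reference}."""
    return f"http://{domain}/{type_}/{clean(concept)}/{clean(reference)}"


# Hypothetical domain and concept; "Easy Stats" is a tool name from the slides.
EXAMPLE_URI = mint_uri("example.org", "id", "Census Tool", "Easy Stats")

if __name__ == "__main__":
    print(EXAMPLE_URI)
```

The minted URI contains no file extension, query string, or version number, which keeps it stable if the underlying spreadsheet or platform changes.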
27
Census Semantic Knowledge Base: Spotfire Glossary
https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?CensusSemanticKnowledgeBase-Spotfire.dxp
28
Census Semantic Knowledge Base: Spotfire Taxonomy
https://silverspotfire.tibco.com/us/library#/users/bniemann/Public?CensusSemanticKnowledgeBase-Spotfire.dxp
29
Conclusions and Recommendations
• A taxonomy (Interactive Internet Data Tools) and vocabulary (Glossary) from Census were used to pilot a semantic knowledge base.
• Agile development of the semantic knowledge base was possible when the data dictionary and data are readily available in a spreadsheet or at the download site so one can focus on doing the data science and analytics.
• The Census "Building Deep Links into American FactFinder" can be Semantic Web Linked Open Data.
– See 2012 Statistical Abstract as a Semantic Knowledge Base in the Next Slide.
• The Semantic Community Platform can produce a Census data science ecosystem and products behind an interface that provides semantic interoperability.
• Next is piloting Be Informed for Census survey data collection and then YARCData on the triple stores that are created.
30
Statistical Abstract 2012: Spotfire Knowledge Base
http://semanticommunity.info/FedStats.net#Spotfire_Dashboard