Innovaccer Service Capabilities with Case Studies
DESCRIPTION
Innovaccer is a California-based research acceleration firm assisting hundreds of researchers from Harvard, Stanford, Wharton, MIT, and other institutions. This deck describes our capabilities in assisting research endeavors. To get in touch with us, please write to [email protected]. To know more, please visit our website: www.innovaccer.com

TRANSCRIPT
DATA EXPERTS
We accelerate research and transform data to help you create actionable insights
WE MINE
WE ANALYZE
WE VISUALIZE
Data Scientists · Software Engineers · Domain Analysts
Data Mining – Mining longitudinal and linked datasets from the web and other archives
Web Data Mining – Indian Patent Data
The task at hand was to write code to harvest Indian patent data from the Indian Patent Office website on an ongoing basis.
Code was prepared to extract information on all patents filed between date A and date B, as defined by the user.
Further, a data quality tool was written to clean the harvested pool of data.
A scheduling script was prepared to run the harvesting code followed by the data quality code every week (as the Indian Patent Office publishes patents every week), as sketched below.
The master scheduler runs every week and updates all fields of new patents in the master database.
Web Data Mining – Earnings Conference Calls
The task at hand was to write code to harvest the latest earnings conference calls of all NASDAQ-listed firms and parse each call at the individual-speaker level.
Earnings conference calls were harvested from NASDAQ, Thomson StreetEvents, and other sources on a daily basis.
Automation code was written to parse those conference calls at the individual-speaker level, with metadata such as speaker name, designation, company, and speech text (see the sketch below).
A scheduling script was prepared to run the harvesting code followed by the parsing code every day (from NASDAQ).
The master scheduler runs every day and updates all fields of new conference calls in the master database.
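A minimal sketch of speaker-level parsing. It assumes a simple "Name - Title" speaker-header convention; real transcripts vary by source, and the production parser handled many more formats.

```python
# Split a raw earnings-call transcript into per-speaker records with
# metadata (speaker, designation, company, speech text).
import re

SPEAKER_HEADER = re.compile(
    r"^(?P<name>[A-Z][\w.'-]+(?: [A-Z][\w.'-]+)+) - (?P<designation>[^\n]+)$",
    re.MULTILINE,
)


def parse_call(transcript: str, company: str) -> list[dict]:
    """Yield one record per speaker turn found in the transcript."""
    segments = []
    matches = list(SPEAKER_HEADER.finditer(transcript))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        segments.append({
            "company": company,
            "speaker": m.group("name"),
            "designation": m.group("designation").strip(),
            "text": transcript[start:end].strip(),
        })
    return segments


sample = """John Smith - Chief Executive Officer
Thank you all for joining us this quarter.
Jane Doe - Analyst, Example Securities
Could you comment on margins?"""
for seg in parse_call(sample, company="ACME Corp"):
    print(seg["speaker"], "|", seg["designation"])
```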
Archive Data Mining – CEO Compensation
The task at hand was to create a longitudinal dataset containing details of CEO compensation.
The DEF 14A proxy filings on SEC EDGAR contain, for each company, details of how the CEO was compensated and how much.
Automation code was written to parse those proxy filings and pull out total compensation as well as its break-up, including base salary, bonus, EPS, etc.
A team of financial domain analysts manually studied proxy filings and created dummy variables for the various kinds of compensation strategies used in practice to compensate CEOs and key executives.
Finally, the dataset was cleaned, standardized, and sent to the quality team for vetting.
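A minimal sketch of pulling a compensation table out of a filing, assuming pandas. The table heading matched here is an assumption; actual DEF 14A filings vary widely, and the production parser handled many edge cases.

```python
# Locate the Summary Compensation Table among all HTML tables in a filing.
import io

import pandas as pd


def extract_compensation_table(filing_html: str) -> pd.DataFrame:
    """Return the first table whose text mentions 'Summary Compensation'."""
    tables = pd.read_html(io.StringIO(filing_html))  # parses every <table>
    for df in tables:
        flat = " ".join(str(c) for c in df.columns) + df.to_string()
        if "Summary Compensation" in flat:
            return df
    raise ValueError("No Summary Compensation Table found")


# Usage sketch: fetch a filing from SEC EDGAR first (HTTP code omitted),
# then pull salary/bonus/total columns into the longitudinal dataset.
```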
Statistical Modeling – Curating statistical models or running statistical analyses on real or simulated data
Statistical Modeling – Optimization: Logistics Optimization
The task at hand was to prepare, implement, and deploy statistical models to optimize school transportation and logistics costs for real cities, based on real data.
A professor was appointed by an NPO and a government to prepare statistical models aimed at city planning for optimizing school costs.
Our statisticians used clustering techniques, optimization techniques, and multiprocessing to minimize the total transportation and operational cost of schools for a city (a simplified sketch of the clustering step follows).
Various constraints and difficulties were handled, such as bus capacity, total riding time, mixed loading, land and sea routes, different types of buses and classroom sizes, and many more.
A flexible graphical user interface was delivered to the client to view output on maps and spreadsheets, manually change routes, upload input data, etc.
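A minimal sketch of the clustering step under simplified assumptions: group student home locations into candidate bus stops with KMeans, sized so the average stop load fits one bus. The coordinates and capacity are toy values; the real models also handled routing, mixed loading, riding-time limits, and land/sea routes.

```python
# Cluster student locations into candidate bus stops (scikit-learn KMeans).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
students = rng.uniform(0, 10, size=(500, 2))  # (x, y) home coordinates, km
BUS_CAPACITY = 40

# Enough clusters that, on average, each stop's load fits on one bus.
# (KMeans does not enforce the cap; the real optimizer did.)
n_stops = int(np.ceil(len(students) / BUS_CAPACITY))
km = KMeans(n_clusters=n_stops, n_init=10, random_state=0).fit(students)

for stop, centre in enumerate(km.cluster_centers_):
    load = int(np.sum(km.labels_ == stop))
    print(f"stop {stop}: {load} students near ({centre[0]:.1f}, {centre[1]:.1f})")
```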
Statistical Modeling – Hidden Markov Model
Using hidden Markov models to predict hidden layers of consumer behavior.
Working from the hypothesis that consumers pass through sequential hidden phases of behavior towards a product ecosystem over time, our task was to validate that such hidden layers exist and to understand how they transition over time.
Using data on the adoption and rejection of various kinds of products in a product ecosystem by each consumer, drawn from a panel dataset of more than 10,000 products and 100,000 consumers, we prepared a hidden Markov model with a maximum of 8 states and 20 time periods.
Computation time for this analysis was of exponential order, and each iteration of the optimization had to store gigabytes of data. A Hadoop layer on R was therefore used to reduce the time and memory burden of a single machine, using HDFS across five machines.
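A minimal sketch of fitting such a model, assuming the Python hmmlearn library as a stand-in (the original work used a custom R implementation on Hadoop). The random binary sequences are toy data standing in for the real adoption/rejection panel.

```python
# Fit an 8-state HMM to per-consumer adoption sequences (0 = reject, 1 = adopt).
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
n_consumers, n_periods = 100, 20  # toy scale; the real panel was far larger

# One observation sequence per consumer, stacked the way hmmlearn expects.
sequences = rng.integers(0, 2, size=(n_consumers, n_periods))
X = sequences.reshape(-1, 1)
lengths = [n_periods] * n_consumers

model = hmm.CategoricalHMM(n_components=8, n_iter=100, random_state=0)
model.fit(X, lengths)

# Estimated transition matrix between the 8 hidden behavioral states.
print(np.round(model.transmat_, 2))
```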
Data Analysis – Predictive Analysis: Predicting Hospital Readmission Rates
Predicting and benchmarking readmission rates using readmission data together with performance and cause factors.
With the implementation of the Affordable Care Act ("Obamacare"), hospitals now have a pressing need to significantly reduce their readmission rates or end up paying heavy penalties. A comprehensive solution requires collating all readmission data available from various sources and then identifying the causal factors affecting readmission.
An AIC + BIC analysis was carried out for each disease to remove variables with insignificant dependence on readmission rates, leaving us with the key variables for each disease (a sketch of this selection step follows). We then ran a multiple linear regression with the readmission rate as the response variable and the remaining variables as explanatory variables; the results matched intuitive expectations.
We further developed a BI tool built on this comprehensive model to successfully predict readmission rates and their dependence on various KPIs.
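A minimal sketch of AIC/BIC-guided backward elimination followed by multiple linear regression, assuming statsmodels. The column names and synthetic data are hypothetical stand-ins for the real performance and cause factors.

```python
# Backward elimination on AIC + BIC, then a final OLS fit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_length_of_stay": rng.normal(5, 1, 200),
    "nurse_ratio": rng.normal(1.5, 0.3, 200),
    "noise_factor": rng.normal(0, 1, 200),  # irrelevant, should get dropped
})
df["readmission_rate"] = (0.02 * df["avg_length_of_stay"]
                          - 0.03 * df["nurse_ratio"]
                          + rng.normal(0, 0.01, 200))

features = list(df.columns.drop("readmission_rate"))
y = df["readmission_rate"]

# Drop one variable at a time while doing so lowers AIC + BIC.
while len(features) > 1:
    current = sm.OLS(y, sm.add_constant(df[features])).fit()
    scores = {}
    for f in features:
        reduced = sm.OLS(y, sm.add_constant(df[[c for c in features if c != f]])).fit()
        scores[f] = reduced.aic + reduced.bic
    best = min(scores, key=scores.get)
    if scores[best] < current.aic + current.bic:
        features.remove(best)  # removing `best` improves the criterion
    else:
        break

final = sm.OLS(y, sm.add_constant(df[features])).fit()
print(features)          # key variables retained for this "disease"
print(final.params)      # regression coefficients
```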
Big Data Analysis – Implementing distributed computing methods and cutting-edge statistical models to analyze very large datasets
Data Analysis – Contextual Analysis: Sentiment Analysis
Navigating through millions of tweets to understand the sentiment of people.
Millions of tweets for various people were extracted using data harvesting techniques.
Using Naïve Bayes analysis and a training set of 100,000 tweets, a supervised classifier was built to decide the sentiment of any tweet (see the sketch below). This analysis involved first removing stop words and substituting prolonged chat spellings of words with their standard English counterparts.
The sentiments of people were then analyzed across various control groups.
The project was completed in a record two-month timeframe, which would not have been possible without big data technologies.
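A minimal sketch of the Naïve Bayes classifier, assuming scikit-learn; the two-tweet training set stands in for the real 100,000-tweet labelled corpus.

```python
# Naive Bayes sentiment classifier with stop-word removal and collapsing
# of prolonged chat spellings (e.g. "sooooo" -> "so").
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def normalize(tweet: str) -> str:
    """Collapse any letter repeated three or more times to a single letter."""
    return re.sub(r"(\w)\1{2,}", r"\1", tweet.lower())


train_tweets = ["I love this, sooooo good!", "worst service ever, hate it"]
train_labels = ["positive", "negative"]

clf = make_pipeline(
    CountVectorizer(stop_words="english"),  # drops stop words while tokenizing
    MultinomialNB(),
)
clf.fit([normalize(t) for t in train_tweets], train_labels)

print(clf.predict([normalize("I looove it")]))  # -> ['positive']
```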
Data Analysis – Big Data Analysis: Analyzing High-Frequency Trading Data
Querying high-frequency data from the limit order book for all NYSE-traded securities.
The project involved analyzing high-frequency datasets obtained from NYSE OpenBook and OpenBook Ultra.
One of the biggest challenges in analyzing this data was its scale: the total size was 30 TB and growing, and performing even simple analyses involved database queries that took several hours at a time.
We distributed the computation using algorithms based on MapReduce technology and implemented them in Python, which enabled near real-time querying (a toy map/reduce sketch follows).
This setup was later used for analyzing price variation and its impact on market liquidity. The distributed computing approach saved a great deal of analysis time.
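A minimal map/reduce-style sketch in pure Python, assuming the order-book data has been partitioned into per-day CSV files (the filenames and column layout are hypothetical). The real system distributed the work across a cluster rather than local processes.

```python
# Map: count messages per ticker in each partition. Reduce: merge counts.
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def map_partition(path: str) -> Counter:
    """Map step: tally order-book messages per ticker in one partition file."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            ticker = line.split(",")[0]  # assumed CSV layout: ticker first
            counts[ticker] += 1
    return counts


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge partial counts from two partitions."""
    a.update(b)
    return a


if __name__ == "__main__":
    partitions = ["openbook_2013-01-02.csv", "openbook_2013-01-03.csv"]
    with Pool() as pool:
        partials = pool.map(map_partition, partitions)  # parallel map phase
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(5))
```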
Data Analysis – Prescriptive Analysis: Personality Analysis
The personality of key executives of various companies was examined using text transcripts of their public speeches.
Having access to a parsed database of speeches by key executives of NASDAQ-traded firms, we performed personality analysis of over 30,000 people (including chief executives and analysts) and 10 million words, using psycho-linguistic databases of words and the SVM technique (a toy version is sketched below).
Each person was given a score from 1 to 8 on each of the Big Five personality traits.
These traits were correlated with the following quarter's conference calls, and patterns were observed linking the personality of chief executives to financial (especially stock) performance in the following quarter.
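A minimal sketch of one trait model under simplified assumptions: represent each speaker by psycho-linguistic word-category frequencies, then train an SVM (scikit-learn) on speakers with known trait labels. The word lists, speeches, and labels are toy stand-ins for the real psycho-linguistic databases.

```python
# Word-category features + a linear SVM for one personality trait.
import numpy as np
from sklearn.svm import SVC

CATEGORIES = {
    "positive_emotion": {"great", "confident", "growth"},
    "tentative": {"maybe", "perhaps", "might"},
}


def category_features(text: str) -> np.ndarray:
    """Fraction of a speech's words that fall into each word category."""
    words = text.lower().split()
    return np.array([
        sum(w in wordset for w in words) / max(len(words), 1)
        for wordset in CATEGORIES.values()
    ])


speeches = [
    "we are confident of great growth this year",
    "maybe results might improve, perhaps later",
]
labels = [1, 0]  # hypothetical high/low trait labels for the training speakers

X = np.vstack([category_features(s) for s in speeches])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict([category_features("growth was great and we are confident")]))
```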
Data Standardization – Converting raw data into cleaned and standardized formats
Data Standardization – Merging Patent and Finance Data
Merging company names across financial and patent datasets.
The task was to merge a US financial dataset having 15,000 unique standardized company names with a US patent dataset having 800,000+ unstandardized unique company names.
After thoroughly fine-tuning our hybrid model for company name matching (the normalization stage is sketched below), we reduced those 800,000+ unstandardized unique company names to 200,000+ standardized company names and mapped all 15,000 company names in the finance dataset to the patent dataset.
The results were spectacular; notably, we found 105 different misspellings of IBM, which we grouped under one standardized name.
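A minimal sketch of the normalization stage such a matcher typically starts with: strip punctuation and trailing legal suffixes so trivial variants collapse to one key. The suffix list is illustrative; the real hybrid model also used fuzzy matching for genuine misspellings.

```python
# Collapse trivial company-name variants to a single standardized key.
import re

LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "company"}


def normalize_company(name: str) -> str:
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()  # drop trailing legal suffixes
    return " ".join(tokens)


variants = [
    "International Business Machines Corp.",
    "INTERNATIONAL BUSINESS MACHINES CORPORATION",
    "International Business Machines, Inc",
]
print({normalize_company(v) for v in variants})  # one standardized key
```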
Data Standardization – Standardizing Names of Individuals
Standardizing the names of plaintiffs and defendants across a litigation dataset containing approximately 4.7 million unique names.
Used benchmark algorithms based on Levenshtein distance and N-grams to perform name standardization, implemented in Python (a sketch of the two measures follows).
Prepared and optimized hybrid parameter models on top of the above algorithms.
Implemented the above on a five-node cluster using Hadoop Distributed File System technology.
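A minimal sketch of the two similarity measures named above: an edit-distance ratio (difflib's SequenceMatcher as a stand-in for normalized Levenshtein) and character-n-gram Jaccard overlap, combined into a hybrid score. The 0.5 weights and 0.6 cutoff are illustrative parameters, not the tuned production values.

```python
# Hybrid name similarity: edit-distance ratio + trigram Jaccard overlap.
from difflib import SequenceMatcher


def ngrams(s: str, n: int = 3) -> set[str]:
    s = f" {s.lower()} "  # pad so leading/trailing characters count
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def hybrid_similarity(a: str, b: str) -> float:
    lev = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    grams_a, grams_b = ngrams(a), ngrams(b)
    jac = len(grams_a & grams_b) / len(grams_a | grams_b)
    return 0.5 * lev + 0.5 * jac


pairs = [("Jonathan Smith", "Jonathon Smyth"), ("Jonathan Smith", "Mary Jones")]
for a, b in pairs:
    score = hybrid_similarity(a, b)
    print(a, "~", b, "->", round(score, 2),
          "match" if score > 0.6 else "distinct")
```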
Technology Implementation – Implementing customized mobile and software tools to make research faster, more scalable, and more economical
Technology Implementation – Developing Mobile Applications
Developing a mobile and cloud-based survey application for researchers to conduct, analyze, and report surveys.
Capturing data in surveys, digitizing the results, and performing analyses had been a long, drawn-out, error-prone manual process.
We worked with researchers at the university to develop an Android application for collecting data on the smartphones of on-the-ground surveyors.
The result: no skip-logic errors, no manual digitization, audio and video responses, an analysis dashboard, reliable GPS locations, all within a click of a button.
Technology Implementation – Developing Desktop Applications
Building a platform to assist in collecting, tracking, and analyzing large datasets.
Developed a login-based, offline-capable application on the .NET Framework which collects survey data, syncs local data with the cloud, and allows users to analyze data, all on a continual basis, using a reliable and scalable cloud database, Google Cloud SQL.
Considering physical constraints like electricity and internet shortages, our team developed a technology that can be installed at any remote terminal where data collection activities are carried out.
It was ensured that all local copies of the data are kept on the local machines until an internet connection becomes available, at which point the data syncs to a cloud database that the client's core team members can access from anywhere, at any time (the pattern is sketched below).
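A minimal sketch of this offline-first sync pattern, in Python for illustration (the actual client was built on the .NET Framework). Records are queued in a local database and flushed to the cloud only when a connection is available; the endpoint URL is a hypothetical assumption.

```python
# Offline-first queue: always write locally, sync to the cloud when online.
import json
import sqlite3
import urllib.request

SYNC_URL = "https://example.com/api/surveys"  # hypothetical cloud endpoint

db = sqlite3.connect("local_queue.db")
db.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, payload TEXT)")


def save_locally(record: dict) -> None:
    """Always write to the local queue first; works with no connectivity."""
    db.execute("INSERT INTO queue (payload) VALUES (?)", (json.dumps(record),))
    db.commit()


def sync_when_online() -> None:
    """Push queued records to the cloud, deleting each one only on success."""
    for row_id, payload in db.execute("SELECT id, payload FROM queue").fetchall():
        req = urllib.request.Request(
            SYNC_URL, data=payload.encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=10)
        except OSError:
            return  # still offline (or server unreachable); retry later
        db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
        db.commit()


save_locally({"surveyor": "A-17", "answers": {"q1": "yes"}})
sync_when_online()
```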
Technology Implementation – Developing an Email Delivery and Analytics Engine
Programmed a mail engine to send and monitor distributed email campaigns.
One of the major challenges faced by the researcher was the lack of a comprehensive solution for disseminating information to a defined target segment. Conventional programs were cluttered, repeatedly hit by delivery failures, and could not provide the key insights.
The application was hosted on App Engine with a scalable infrastructure to enable real-time monitoring and delivery, while the database was centrally managed on Google Cloud SQL.
The application was programmed using distributed loading techniques to send more than 1 million emails per day (a simplified sketch follows).
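A minimal sketch of the distributed-loading idea: split the recipient list into batches and fan them out across worker processes, each sending over one SMTP connection. The relay host, addresses, and batch size are illustrative assumptions; the production engine ran on App Engine's task infrastructure.

```python
# Fan batched campaign sends out across worker processes via SMTP.
import smtplib
from email.message import EmailMessage
from multiprocessing import Pool

BATCH_SIZE = 500
SMTP_HOST = "smtp.example.com"  # hypothetical relay


def send_batch(recipients: list[str]) -> int:
    """Send one campaign message to a batch over a single SMTP connection."""
    sent = 0
    with smtplib.SMTP(SMTP_HOST) as smtp:
        for addr in recipients:
            msg = EmailMessage()
            msg["From"] = "research@example.com"
            msg["To"] = addr
            msg["Subject"] = "Survey invitation"
            msg.set_content("Hello! Please take our survey.")
            smtp.send_message(msg)
            sent += 1
    return sent


if __name__ == "__main__":
    recipients = [f"user{i}@example.com" for i in range(2000)]
    batches = [recipients[i:i + BATCH_SIZE]
               for i in range(0, len(recipients), BATCH_SIZE)]
    with Pool(processes=4) as pool:
        print(sum(pool.map(send_batch, batches)))  # total emails sent
```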
Data Visualization – Creating customized visualizations for your datasets to tell a compelling story
Data Visualization – Industry Turbulence
Data Visualization – Countries in Motion
Data Visualization – Web-Interactive, Customizable World Maps
Clients – Researchers from
Contact
We like challenges. Keep throwing questions at us and hear from our engagement managers on how we can help.
Write to us at
Or have a look at our website
www.innovaccer.com
Or call us at
+91 120 431 1139