Innovaccer Service Capabilities with Case Studies
DESCRIPTION
Innovaccer is a California-based research acceleration firm assisting hundreds of researchers from Harvard, Stanford, Wharton, MIT, and other institutions. This deck describes our capabilities in assisting research endeavors. To get in touch with us, please write to [email protected]. To know more, please visit our website: www.innovaccer.com

TRANSCRIPT
DATA EXPERTS
We accelerate research and transform data to help you create actionable insights
WE MINE
WE ANALYZE
WE VISUALIZE
Data Scientists · Software Engineers · Domain Analysts
Data Mining – Mining longitudinal and linked datasets from the web and other archives
Web Data Mining – Indian Patent Data
The task at hand was to write code to harvest Indian patent data from the Indian Patent Office website on an ongoing basis.
Code was prepared to extract information on all patents filed between date A and date B, as defined by the user.
Further, a data quality tool was written to clean the harvested pool of data.
A scheduling script was prepared to run the harvesting code followed by the data quality code every week (as the Indian Patent Office publishes patents every week), as sketched below.
The master scheduler runs every week and updates all fields of new patents in the master database.
Web Data Mining – Earnings Conference Calls
The task at hand was to write code to harvest the latest earnings conference calls of all NASDAQ-listed firms and parse each call at the individual-speaker level.
Earnings conference calls were harvested from NASDAQ, Thomson StreetEvents, and other sources on a daily basis.
Automation code was written to parse those conference calls at the individual-speaker level, with metadata such as speaker name, designation, company, and speech text (see the sketch below).
A scheduling script was prepared to run the harvesting code followed by the parsing code every day (from NASDAQ).
The master scheduler runs every day and updates all fields of new conference calls in the master database.
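A minimal sketch of speaker-level parsing. It assumes a simple "Name - Title" speaker-header convention; real transcripts vary by source, and the production parser handled many more formats.

```python
# Split a raw earnings-call transcript into per-speaker records with
# metadata (speaker, designation, company, speech text).
import re

SPEAKER_HEADER = re.compile(
    r"^(?P<name>[A-Z][\w.'-]+(?: [A-Z][\w.'-]+)+) - (?P<designation>[^\n]+)$",
    re.MULTILINE,
)


def parse_call(transcript: str, company: str) -> list[dict]:
    """Yield one record per speaker turn found in the transcript."""
    segments = []
    matches = list(SPEAKER_HEADER.finditer(transcript))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(transcript)
        segments.append({
            "company": company,
            "speaker": m.group("name"),
            "designation": m.group("designation").strip(),
            "text": transcript[start:end].strip(),
        })
    return segments


sample = """John Smith - Chief Executive Officer
Thank you all for joining us this quarter.
Jane Doe - Analyst, Example Securities
Could you comment on margins?"""
for seg in parse_call(sample, company="ACME Corp"):
    print(seg["speaker"], "|", seg["designation"])
```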
Archive Data Mining – CEO Compensation
The task at hand was to create a longitudinal dataset containing details of CEO compensation.
The DEF 14A proxy filings on SEC EDGAR contain, for each company, details of how the CEO was compensated and how much.
Automation code was written to parse those proxy filings and pull out total compensation as well as its break-up, including base salary, bonus, EPS, etc.
A team of financial domain analysts manually studied proxy filings and created dummy variables for the various kinds of compensation strategies used in practice to compensate CEOs and key executives.
Finally, the dataset was cleaned, standardized, and sent to the quality team for vetting.
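A minimal sketch of pulling a compensation table out of a filing, assuming pandas. The table heading matched here is an assumption; actual DEF 14A filings vary widely, and the production parser handled many edge cases.

```python
# Locate the Summary Compensation Table among all HTML tables in a filing.
import io

import pandas as pd


def extract_compensation_table(filing_html: str) -> pd.DataFrame:
    """Return the first table whose text mentions 'Summary Compensation'."""
    tables = pd.read_html(io.StringIO(filing_html))  # parses every <table>
    for df in tables:
        flat = " ".join(str(c) for c in df.columns) + df.to_string()
        if "Summary Compensation" in flat:
            return df
    raise ValueError("No Summary Compensation Table found")


# Usage sketch: fetch a filing from SEC EDGAR first (HTTP code omitted),
# then pull salary/bonus/total columns into the longitudinal dataset.
```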
Statistical Modeling – Curating statistical models or running statistical analyses on real or simulated data
Statistical Modeling – Optimization: Logistics Optimization
The task at hand was to prepare, implement, and deploy statistical models to optimize school transportation and logistics costs for real cities, based on real data.
A professor was appointed by an NPO and a government to prepare statistical models aimed at city planning for optimizing school costs.
Our statisticians used clustering techniques, optimization techniques, and multiprocessing to minimize the total transportation and operational cost of schools for a city (a simplified sketch of the clustering step follows).
Various constraints and difficulties were handled, such as bus capacity, total riding time, mixed loading, land and sea routes, different types of buses and classroom sizes, and many more.
A flexible graphical user interface was delivered to the client to view output on maps and spreadsheets, manually change routes, upload input data, etc.
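A minimal sketch of the clustering step under simplified assumptions: group student home locations into candidate bus stops with KMeans, sized so the average stop load fits one bus. The coordinates and capacity are toy values; the real models also handled routing, mixed loading, riding-time limits, and land/sea routes.

```python
# Cluster student locations into candidate bus stops (scikit-learn KMeans).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
students = rng.uniform(0, 10, size=(500, 2))  # (x, y) home coordinates, km
BUS_CAPACITY = 40

# Enough clusters that, on average, each stop's load fits on one bus.
# (KMeans does not enforce the cap; the real optimizer did.)
n_stops = int(np.ceil(len(students) / BUS_CAPACITY))
km = KMeans(n_clusters=n_stops, n_init=10, random_state=0).fit(students)

for stop, centre in enumerate(km.cluster_centers_):
    load = int(np.sum(km.labels_ == stop))
    print(f"stop {stop}: {load} students near ({centre[0]:.1f}, {centre[1]:.1f})")
```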
Statistical Modeling – Hidden Markov Model
Using hidden Markov models to predict hidden layers of consumer behavior.
Working from the hypothesis that consumers pass through sequential hidden phases of behavior towards a product ecosystem over time, our task was to validate that such hidden layers exist and to understand how they transition over time.
Using data on the adoption and rejection of various kinds of products in a product ecosystem by each consumer, drawn from a panel dataset of more than 10,000 products and 100,000 consumers, we prepared a hidden Markov model with a maximum of 8 states and 20 time periods.
Computation time for this analysis was of exponential order, and each iteration of the optimization had to store gigabytes of data. A Hadoop layer on R was therefore used to reduce the time and memory burden of a single machine, using HDFS across five machines.
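A minimal sketch of fitting such a model, assuming the Python hmmlearn library as a stand-in (the original work used a custom R implementation on Hadoop). The random binary sequences are toy data standing in for the real adoption/rejection panel.

```python
# Fit an 8-state HMM to per-consumer adoption sequences (0 = reject, 1 = adopt).
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
n_consumers, n_periods = 100, 20  # toy scale; the real panel was far larger

# One observation sequence per consumer, stacked the way hmmlearn expects.
sequences = rng.integers(0, 2, size=(n_consumers, n_periods))
X = sequences.reshape(-1, 1)
lengths = [n_periods] * n_consumers

model = hmm.CategoricalHMM(n_components=8, n_iter=100, random_state=0)
model.fit(X, lengths)

# Estimated transition matrix between the 8 hidden behavioral states.
print(np.round(model.transmat_, 2))
```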
Data Analysis – Predictive Analysis: Predicting Hospital Readmission Rates
Predicting and benchmarking readmission rates using readmission data together with performance and cause factors.
With the implementation of the Affordable Care Act ("Obamacare"), hospitals now have a pressing need to significantly reduce their readmission rates or end up paying heavy penalties. A comprehensive solution requires collating all readmission data available from various sources and then identifying the causal factors affecting readmission.
An AIC + BIC analysis was carried out for each disease to remove variables with insignificant dependence on readmission rates, leaving us with the key variables for each disease (a sketch of this selection step follows). We then ran a multiple linear regression with the readmission rate as the response variable and the remaining variables as explanatory variables; the results matched intuitive expectations.
We further developed a BI tool built on this comprehensive model to successfully predict readmission rates and their dependence on various KPIs.
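A minimal sketch of AIC/BIC-guided backward elimination followed by multiple linear regression, assuming statsmodels. The column names and synthetic data are hypothetical stand-ins for the real performance and cause factors.

```python
# Backward elimination on AIC + BIC, then a final OLS fit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_length_of_stay": rng.normal(5, 1, 200),
    "nurse_ratio": rng.normal(1.5, 0.3, 200),
    "noise_factor": rng.normal(0, 1, 200),  # irrelevant, should get dropped
})
df["readmission_rate"] = (0.02 * df["avg_length_of_stay"]
                          - 0.03 * df["nurse_ratio"]
                          + rng.normal(0, 0.01, 200))

features = list(df.columns.drop("readmission_rate"))
y = df["readmission_rate"]

# Drop one variable at a time while doing so lowers AIC + BIC.
while len(features) > 1:
    current = sm.OLS(y, sm.add_constant(df[features])).fit()
    scores = {}
    for f in features:
        reduced = sm.OLS(y, sm.add_constant(df[[c for c in features if c != f]])).fit()
        scores[f] = reduced.aic + reduced.bic
    best = min(scores, key=scores.get)
    if scores[best] < current.aic + current.bic:
        features.remove(best)  # removing `best` improves the criterion
    else:
        break

final = sm.OLS(y, sm.add_constant(df[features])).fit()
print(features)          # key variables retained for this "disease"
print(final.params)      # regression coefficients
```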
Big Data Analysis – Implementing distributed computing methods and cutting-edge statistical models to analyze very large datasets
Data Analysis – Contextual Analysis: Sentiment Analysis
Navigating through millions of tweets to understand the sentiment of people.
Millions of tweets for various people were extracted using data harvesting techniques.
Using Naïve Bayes analysis and a training set of 100,000 tweets, a supervised classifier was built to decide the sentiment of any tweet (see the sketch below). This analysis involved first removing stop words and substituting prolonged chat spellings of words with their standard English counterparts.
The sentiments of people were then analyzed across various control groups.
The project was completed in a record two-month timeframe, which would not have been possible without big data technologies.
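A minimal sketch of the Naïve Bayes classifier, assuming scikit-learn; the two-tweet training set stands in for the real 100,000-tweet labelled corpus.

```python
# Naive Bayes sentiment classifier with stop-word removal and collapsing
# of prolonged chat spellings (e.g. "sooooo" -> "so").
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def normalize(tweet: str) -> str:
    """Collapse any letter repeated three or more times to a single letter."""
    return re.sub(r"(\w)\1{2,}", r"\1", tweet.lower())


train_tweets = ["I love this, sooooo good!", "worst service ever, hate it"]
train_labels = ["positive", "negative"]

clf = make_pipeline(
    CountVectorizer(stop_words="english"),  # drops stop words while tokenizing
    MultinomialNB(),
)
clf.fit([normalize(t) for t in train_tweets], train_labels)

print(clf.predict([normalize("I looove it")]))  # -> ['positive']
```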
Data Analysis – Big Data Analysis: Analyzing High-Frequency Trading Data
Querying high-frequency data from the limit order book for all NYSE-traded securities.
The project involved analyzing high-frequency datasets obtained from NYSE OpenBook and OpenBook Ultra.
One of the biggest challenges in analyzing this data was its scale: the total size was 30 TB and growing, and performing even simple analyses involved database queries that took several hours at a time.
We distributed the computation using algorithms based on MapReduce technology and implemented them in Python, which enabled near real-time querying (a toy map/reduce sketch follows).
This setup was later used for analyzing price variation and its impact on market liquidity. The distributed computing approach saved a great deal of analysis time.
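A minimal map/reduce-style sketch in pure Python, assuming the order-book data has been partitioned into per-day CSV files (the filenames and column layout are hypothetical). The real system distributed the work across a cluster rather than local processes.

```python
# Map: count messages per ticker in each partition. Reduce: merge counts.
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def map_partition(path: str) -> Counter:
    """Map step: tally order-book messages per ticker in one partition file."""
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            ticker = line.split(",")[0]  # assumed CSV layout: ticker first
            counts[ticker] += 1
    return counts


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge partial counts from two partitions."""
    a.update(b)
    return a


if __name__ == "__main__":
    partitions = ["openbook_2013-01-02.csv", "openbook_2013-01-03.csv"]
    with Pool() as pool:
        partials = pool.map(map_partition, partitions)  # parallel map phase
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(5))
```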
Data Analysis – Prescriptive Analysis: Personality Analysis
The personality of key executives of various companies was examined using text transcripts of their public speeches.
Having access to a parsed database of speeches by key executives of NASDAQ-traded firms, we performed personality analysis of over 30,000 people (including chief executives and analysts) and 10 million words, using psycho-linguistic databases of words and the SVM technique (a toy version is sketched below).
Each person was given a score from 1 to 8 on each of the Big Five personality traits.
These traits were correlated with the following quarter's conference calls, and patterns were observed linking the personality of chief executives to financial (especially stock) performance in the following quarter.
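A minimal sketch of one trait model under simplified assumptions: represent each speaker by psycho-linguistic word-category frequencies, then train an SVM (scikit-learn) on speakers with known trait labels. The word lists, speeches, and labels are toy stand-ins for the real psycho-linguistic databases.

```python
# Word-category features + a linear SVM for one personality trait.
import numpy as np
from sklearn.svm import SVC

CATEGORIES = {
    "positive_emotion": {"great", "confident", "growth"},
    "tentative": {"maybe", "perhaps", "might"},
}


def category_features(text: str) -> np.ndarray:
    """Fraction of a speech's words that fall into each word category."""
    words = text.lower().split()
    return np.array([
        sum(w in wordset for w in words) / max(len(words), 1)
        for wordset in CATEGORIES.values()
    ])


speeches = [
    "we are confident of great growth this year",
    "maybe results might improve, perhaps later",
]
labels = [1, 0]  # hypothetical high/low trait labels for the training speakers

X = np.vstack([category_features(s) for s in speeches])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict([category_features("growth was great and we are confident")]))
```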
Data Standardization – Converting raw data into cleaned and standardized formats
Data Standardization – Merging Patent and Finance Data
Merging company names across financial and patent datasets.
The task was to merge a US financial dataset having 15,000 unique standardized company names with a US patent dataset having 800,000+ unstandardized unique company names.
After thoroughly fine-tuning our hybrid model for company name matching (the normalization stage is sketched below), we reduced those 800,000+ unstandardized unique company names to 200,000+ standardized company names and mapped all 15,000 company names in the finance dataset to the patent dataset.
The results were spectacular; notably, we found 105 different misspellings of IBM, which we grouped under one standardized name.
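A minimal sketch of the normalization stage such a matcher typically starts with: strip punctuation and trailing legal suffixes so trivial variants collapse to one key. The suffix list is illustrative; the real hybrid model also used fuzzy matching for genuine misspellings.

```python
# Collapse trivial company-name variants to a single standardized key.
import re

LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "company"}


def normalize_company(name: str) -> str:
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()  # drop trailing legal suffixes
    return " ".join(tokens)


variants = [
    "International Business Machines Corp.",
    "INTERNATIONAL BUSINESS MACHINES CORPORATION",
    "International Business Machines, Inc",
]
print({normalize_company(v) for v in variants})  # one standardized key
```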
Data Standardization – Standardizing Names of Individuals
Standardizing the names of plaintiffs and defendants across a litigation dataset containing approximately 4.7 million unique names.
Used benchmark algorithms based on Levenshtein distance and N-grams to perform name standardization, implemented in Python (a sketch of the two measures follows).
Prepared and optimized hybrid parameter models on top of the above algorithms.
Implemented the above on a five-node cluster using Hadoop Distributed File System technology.
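A minimal sketch of the two similarity measures named above: an edit-distance ratio (difflib's SequenceMatcher as a stand-in for normalized Levenshtein) and character-n-gram Jaccard overlap, combined into a hybrid score. The 0.5 weights and 0.6 cutoff are illustrative parameters, not the tuned production values.

```python
# Hybrid name similarity: edit-distance ratio + trigram Jaccard overlap.
from difflib import SequenceMatcher


def ngrams(s: str, n: int = 3) -> set[str]:
    s = f" {s.lower()} "  # pad so leading/trailing characters count
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def hybrid_similarity(a: str, b: str) -> float:
    lev = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    grams_a, grams_b = ngrams(a), ngrams(b)
    jac = len(grams_a & grams_b) / len(grams_a | grams_b)
    return 0.5 * lev + 0.5 * jac


pairs = [("Jonathan Smith", "Jonathon Smyth"), ("Jonathan Smith", "Mary Jones")]
for a, b in pairs:
    score = hybrid_similarity(a, b)
    print(a, "~", b, "->", round(score, 2),
          "match" if score > 0.6 else "distinct")
```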
Technology Implementation – Implementing customized mobile and software tools to make research faster, more scalable, and more economical
Technology Implementation – Developing Mobile Applications
Developing a mobile and cloud-based survey application for researchers to conduct, analyze, and report surveys.
Capturing data in surveys, digitizing the results, and performing analyses had been a long, drawn-out, error-prone manual process.
We worked with researchers at the university to develop an Android application for collecting data on the smartphones of on-the-ground surveyors.
The result: no skip-logic errors, no manual digitization, audio and video responses, an analysis dashboard, reliable GPS locations, all within a click of a button.
Technology Implementation – Developing Desktop Applications
Building a platform to assist in collecting, tracking, and analyzing large datasets.
Developed a login-based, offline-capable application on the .NET Framework which collects survey data, syncs local data with the cloud, and allows users to analyze data, all on a continual basis, using a reliable and scalable cloud database, Google Cloud SQL.
Considering physical constraints like electricity and internet shortages, our team developed a technology that can be installed at any remote terminal where data collection activities are carried out.
It was ensured that all local copies of the data are kept on the local machines until an internet connection becomes available, at which point the data syncs to a cloud database that the client's core team members can access from anywhere, at any time (the pattern is sketched below).
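A minimal sketch of this offline-first sync pattern, in Python for illustration (the actual client was built on the .NET Framework). Records are queued in a local database and flushed to the cloud only when a connection is available; the endpoint URL is a hypothetical assumption.

```python
# Offline-first queue: always write locally, sync to the cloud when online.
import json
import sqlite3
import urllib.request

SYNC_URL = "https://example.com/api/surveys"  # hypothetical cloud endpoint

db = sqlite3.connect("local_queue.db")
db.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, payload TEXT)")


def save_locally(record: dict) -> None:
    """Always write to the local queue first; works with no connectivity."""
    db.execute("INSERT INTO queue (payload) VALUES (?)", (json.dumps(record),))
    db.commit()


def sync_when_online() -> None:
    """Push queued records to the cloud, deleting each one only on success."""
    for row_id, payload in db.execute("SELECT id, payload FROM queue").fetchall():
        req = urllib.request.Request(
            SYNC_URL, data=payload.encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=10)
        except OSError:
            return  # still offline (or server unreachable); retry later
        db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
        db.commit()


save_locally({"surveyor": "A-17", "answers": {"q1": "yes"}})
sync_when_online()
```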
Technology Implementation – Developing an Email Delivery and Analytics Engine
Programmed a mail engine to send and monitor distributed email campaigns.
One of the major challenges faced by the researcher was the lack of a comprehensive solution for disseminating information to a defined target segment. Conventional programs were cluttered, repeatedly hit by delivery failures, and could not provide the key insights.
The application was hosted on App Engine with a scalable infrastructure to enable real-time monitoring and delivery, while the database was centrally managed on Google Cloud SQL.
The application was programmed using distributed loading techniques to send more than 1 million emails per day (a simplified sketch follows).
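A minimal sketch of the distributed-loading idea: split the recipient list into batches and fan them out across worker processes, each sending over one SMTP connection. The relay host, addresses, and batch size are illustrative assumptions; the production engine ran on App Engine's task infrastructure.

```python
# Fan batched campaign sends out across worker processes via SMTP.
import smtplib
from email.message import EmailMessage
from multiprocessing import Pool

BATCH_SIZE = 500
SMTP_HOST = "smtp.example.com"  # hypothetical relay


def send_batch(recipients: list[str]) -> int:
    """Send one campaign message to a batch over a single SMTP connection."""
    sent = 0
    with smtplib.SMTP(SMTP_HOST) as smtp:
        for addr in recipients:
            msg = EmailMessage()
            msg["From"] = "research@example.com"
            msg["To"] = addr
            msg["Subject"] = "Survey invitation"
            msg.set_content("Hello! Please take our survey.")
            smtp.send_message(msg)
            sent += 1
    return sent


if __name__ == "__main__":
    recipients = [f"user{i}@example.com" for i in range(2000)]
    batches = [recipients[i:i + BATCH_SIZE]
               for i in range(0, len(recipients), BATCH_SIZE)]
    with Pool(processes=4) as pool:
        print(sum(pool.map(send_batch, batches)))  # total emails sent
```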
Data Visualization – Creating customized visualizations for your datasets to tell a compelling story
Data Visualization – Industry Turbulence
Data Visualization – Countries in Motion
Data Visualization – Web-Interactive, Customizable World Maps
Clients – Researchers from
Contact
We like challenges. Keep throwing questions at us and hear from our engagement managers on how we can help.
Write to us at
Or have a look at our website
www.innovaccer.com
Or call us at
+91 120 431 1139