Business analytics
From Wikipedia, the free encyclopedia
Not to be confused with Business analysis.


Business analytics (BA) refers to the skills, technologies, and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.[1] Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods.

Business analytics makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling,[2] and fact-based management to drive decision making. It is therefore closely related to management science. Analytics may be used as input for human decisions or may drive fully automated decisions. Business intelligence, by contrast, consists of querying, reporting, online analytical processing (OLAP), and "alerts."

In other words, querying, reporting, OLAP, and alert tools can answer questions such as what happened, how many, how often, where the problem is, and what actions are needed. Business analytics can answer questions like why is this happening, what if these trends continue, what will happen next (that is, predict), what is the best that can happen (that is, optimize).[3]


Examples of application

Banks, such as Capital One, use data analysis (or analytics, as it is also called in the business setting) to differentiate among customers based on credit risk, usage and other characteristics, and then to match customer characteristics with appropriate product offerings. Harrah's, the gaming firm, uses analytics in its customer loyalty programs. E & J Gallo Winery quantitatively analyzes and predicts the appeal of its wines. Between 2002 and 2005, Deere & Company saved more than $1 billion by employing a new analytical tool to better optimize inventory.[3]

Types of analytics

Decisive analytics: supports human decisions with visual analytics that the user models to reflect reasoning.

Descriptive analytics: gains insight from historical data with reporting, scorecards, clustering, etc.

Predictive analytics: predictive modeling using statistical and machine learning techniques (a minimal sketch follows this list).

Prescriptive analytics: recommends decisions using optimization, simulation, etc.
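As an illustration of the predictive category above, the following is a minimal, hypothetical sketch in Python: it fits a logistic regression model on made-up customer features to estimate the probability of churn. The feature names, the synthetic data and the choice of scikit-learn are assumptions for illustration, not part of the original text.

```python
# Hypothetical predictive-analytics sketch: fit a churn model on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # e.g. tenure, monthly spend, support calls (invented)
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # churned?

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("holdout accuracy:", model.score(X_test, y_test))
print("churn probability for a new customer:", model.predict_proba([[0.2, -1.0, 1.5]])[0, 1])
```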

Basic domains within analytics

Behavioral analytics
Cohort analysis
Collections analytics
Contextual data modeling - supports the human reasoning that occurs after viewing "executive dashboards" or any other visual analytics
Financial services analytics
Fraud analytics
Marketing analytics
Pricing analytics
Retail sales analytics
Risk & credit analytics
Supply chain analytics
Talent analytics
Telecommunications
Transportation analytics

History

Analytics have been used in business since the management exercises put into place by Frederick Winslow Taylor in the late 19th century. Henry Ford measured the time of each component in his newly established assembly line. But analytics began to command more attention in the late 1960s when computers were used in decision support systems. Since then, analytics have changed and developed with the advent of enterprise resource planning (ERP) systems, data warehouses, and a large number of other software tools and processes.[3]

In more recent years, business analytics has expanded with the spread of computing, taking the field to a new level and greatly widening its possibilities. Given how far analytics has come and what the field looks like today, many people would not guess that it started in the early 1900s with Ford himself.


Challenges

Business analytics depends on sufficient volumes of high-quality data. The difficulty in ensuring data quality lies in integrating and reconciling data across different systems, and then deciding what subsets of data to make available.[3]

Previously, analytics was considered a type of after-the-fact method of forecasting consumer behavior by examining the number of units sold in the last quarter or the last year. This type of data warehousing required a lot more storage space than speed. Now business analytics is becoming a tool that can influence the outcome of customer interactions.[4] When a specific customer type is considering a purchase, an analytics-enabled enterprise can modify the sales pitch to appeal to that consumer. This means the data store must respond extremely fast to provide the necessary data in real time.

Competing on analytics

Thomas Davenport, professor of information technology and management at Babson College, argues that businesses can optimize a distinct business capability via analytics and thus better compete. He identifies these characteristics of organizations that are apt to compete on analytics:[3]

One or more senior executives who strongly advocate fact-based decision making and, specifically, analytics

Widespread use of not only descriptive statistics, but also predictive modeling and complex optimization techniques

Substantial use of analytics across multiple business functions or processes

Movement toward an enterprise-level approach to managing analytical tools, data, and organizational skills and capabilities

Balanced scorecard
From Wikipedia, the free encyclopedia


The balanced scorecard (BSC) is a strategy performance management tool - a semi-standard structured report, supported by design methods and automation tools, that can be used by managers to keep track of the execution of activities by the staff within their control and to monitor the consequences arising from these actions.[1]

The critical characteristics that define a Balanced Scorecard are:[2]

its focus on the strategic agenda of the organization concerned

the selection of a small number of data items to monitor

a mix of financial and non-financial data items.


Use

The Balanced Scorecard is an example of a closed-loop controller or cybernetic control applied to the management of the implementation of a strategy.[3] Closed-loop or cybernetic control is where actual performance is measured, the measured value is compared to an expected value, and, based on the difference between the two, corrective interventions are made as required. Such control requires three things to be effective: a choice of data to measure, the setting of an expected value for the data, and the ability to make a corrective intervention.[3]

Within the strategy management context, all three of these characteristic closed-loop control elements need to be derived from the organisation's strategy and also need to reflect the ability of the observer to both monitor performance and subsequently intervene - both of which may be constrained.[4]
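As an illustration only, the following sketch applies the closed-loop idea to a scorecard review in Python: each measure carries an expected (target) value, actual performance is compared against it, and a corrective intervention is flagged when the shortfall exceeds a tolerance. The measure names, targets and tolerances are invented.

```python
# Minimal sketch of closed-loop control behind a scorecard review (illustrative only).
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    target: float       # expected value
    actual: float        # measured performance
    tolerance: float     # acceptable relative shortfall before intervening

def review(measures):
    """Return the measures whose shortfall against target calls for intervention."""
    flagged = []
    for m in measures:
        shortfall = (m.target - m.actual) / m.target
        if shortfall > m.tolerance:
            flagged.append((m.name, round(shortfall, 3)))
    return flagged

scorecard = [
    Measure("On-time delivery (%)", target=95.0, actual=88.0, tolerance=0.05),
    Measure("Sales growth (%)",     target=10.0, actual=9.8,  tolerance=0.05),
]
print(review(scorecard))   # -> [('On-time delivery (%)', 0.074)]
```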


Two of the ideas that underpin modern Balanced Scorecard designs concern facilitating the creation of such a control - through making it easier to select which data to observe, and ensuring that the choice of data is consistent with the ability of the observer to intervene.[5]

History

Organizations have used systems consisting of a mix of financial and non-financial measures to track progress for quite some time.[6] One such system was created by Art Schneiderman in 1987 at Analog Devices, a mid-sized semiconductor company: the Analog Devices Balanced Scorecard.[7] Schneiderman's design was similar to what is now recognised as a "First Generation" Balanced Scorecard design.[5]

In 1990 Art Schneiderman participated in an unrelated research study led by Dr. Robert S. Kaplan in conjunction with US management consultancy Nolan-Norton, and during this study described his work on performance measurement.[7] Subsequently, Kaplan and David P. Norton included anonymous details of this balanced scorecard design in a 1992 article.[8] Kaplan and Norton's article wasn't the only paper on the topic published in early 1992,[9] but the 1992 Kaplan and Norton paper was a popular success, and was quickly followed by a second in 1993.[10] In 1996, the two authors published a book, The Balanced Scorecard.[11] These articles and the first book spread knowledge of the concept of the balanced scorecard widely, and have led to Kaplan and Norton being seen as the creators of the concept.

While the "balanced scorecard" terminology was coined by Art Schneiderman, the roots of performance management as an activity run deep in management literature and practice. Management historians such as Alfred Chandler suggest the origins of performance management can be seen in the emergence of the complex organisation - most notably during the 19th Century in the USA.[12] More recent influences may include the pioneering work of General Electric on performance measurement reporting in the 1950s and the work of French process engineers (who created the tableau de bord – literally, a "dashboard" of performance measures) in the early part of the 20th century.[6] The tool also draws strongly on the ideas of the 'resource based view of the firm'[13] proposed by Edith Penrose. However it should be noted that none of these influences is explicitly linked to original descriptions of balanced scorecard by Schneiderman, Maisel, or Kaplan & Norton.

Kaplan and Norton's first book[11] remains their most popular. The book reflects the earliest incarnations of balanced scorecards - effectively restating the concept as described in the second Harvard Business Review article.[10] Their second book, The Strategy Focused Organization,[14] echoed work by others (particularly a book published the year before by Olve et al. in Scandinavia[15]) on the value of visually documenting the links between measures by proposing the "Strategic Linkage Model" or strategy map.

As the title of Kaplan and Norton's second book[14] highlights, even by 2000 the focus of attention among thought-leaders was moving from the design of Balanced Scorecards themselves towards the use of the Balanced Scorecard as a focal point within a more comprehensive strategic management system. Subsequent writing on the Balanced Scorecard by Kaplan & Norton has focused on its uses rather than its design (e.g. "The Execution Premium" in 2008[16]); however, many others have continued to refine the device itself (e.g. Abernethy et al.[17]).

Characteristics

The defining characteristic of the balanced scorecard and its derivatives is the presentation of a mixture of financial and non-financial measures, each compared to a 'target' value, within a single concise report. The report is not meant to be a replacement for traditional financial or operational reports but a succinct summary that captures the information most relevant to those reading it. It is the method by which this 'most relevant' information is determined (i.e., the design processes used to select the content) that most differentiates the various versions of the tool in circulation. The balanced scorecard also indirectly provides useful insight into an organisation's strategy - by requiring general strategic statements (e.g. mission, vision) to be precipitated into more specific, tangible forms.[18]

The first versions of balanced scorecard asserted that relevance should derive from the corporate strategy, and proposed design methods that focused on choosing measures and targets associated with the main activities required to implement the strategy. As the initial audience for this were the readers of the Harvard Business Review, the proposal was translated into a form that made sense to a typical reader of that journal - managers of US commercial businesses. Accordingly, initial designs were encouraged to measure three categories of non-financial measure in addition to financial outputs - those of "customer," "internal business processes" and "learning and growth." These categories were not so relevant to non-profits or units within complex organizations (which might have high degrees of internal specialization), and much of the early literature on balanced scorecard focused on suggestions of alternative 'perspectives' that might have more relevance to these groups.

Modern balanced scorecards have evolved since the initial ideas proposed in the late 1980s and early 1990s, and the modern performance management tools including Balanced Scorecard are significantly improved - being more flexible (to suit a wider range of organisational types) and more effective (as design methods have evolved to make them easier to design, and use).[19]

Design

Design of a balanced scorecard is about the identification of a small number of financial and non-financial measures and attaching targets to them, so that when they are reviewed it is possible to determine whether current performance 'meets expectations'. By alerting managers to areas where performance deviates from expectations, the scorecard encourages them to focus their attention on these areas, and hopefully as a result trigger improved performance within the part of the organization they lead.[3]

The original thinking behind a balanced scorecard was for it to be focused on information relating to the implementation of a strategy, and over time there has been a blurring of the boundaries between conventional strategic planning and control activities and those required to design a Balanced Scorecard. This is illustrated well by the four steps required to design a balanced scorecard included in Kaplan & Norton's writing on the subject in the late 1990s:

1. Translating the vision into operational goals;
2. Communicating the vision and linking it to individual performance;
3. Business planning; index setting;
4. Feedback and learning, and adjusting the strategy accordingly.

These steps go far beyond the simple task of identifying a small number of financial and non-financial measures, and illustrate the requirement for whatever design process is used to fit within broader thinking about how the resulting Balanced Scorecard will integrate with the wider business management process.

Although it helps focus managers' attention on strategic issues and the management of the implementation of strategy, it is important to remember that the Balanced Scorecard itself has no role in the formation of strategy.[5] In fact, balanced scorecards can co-exist with strategic planning systems and other tools.[6]

First Generation Balanced Scorecard

The first generation of Balanced Scorecard designs used a "4 perspective" approach to identify what measures to use to track the implementation of strategy. The original four "perspectives" proposed[8] were:

Financial: encourages the identification of a few relevant high-level financial measures. In particular, designers were encouraged to choose measures that helped inform the answer to the question "How do we look to shareholders?" Examples: cash flow, sales growth, operating income, return on equity.[20]

Customer: encourages the identification of measures that answer the question "How do customers see us?" Examples: percent of sales from new products, on time delivery, share of important customers’ purchases, ranking by important customers.

Internal business processes: encourages the identification of measures that answer the question "What must we excel at?" Examples: cycle time, unit cost, yield, new product introductions.

Learning and growth: encourages the identification of measures that answer the question "How can we continue to improve, create value and innovate?". Examples: time to develop new generation of products, life cycle to product maturity, time to market versus competition.
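As a purely illustrative aid (not part of the original description), the sketch below lays out the four perspectives as a simple data structure, each heading prompting a handful of measures with targets; every measure name and figure is invented.

```python
# Hypothetical layout of a first-generation, four-perspective scorecard.
scorecard = {
    "Financial":                   [("Cash flow ($m)", 12.0), ("Sales growth (%)", 10.0)],
    "Customer":                    [("On-time delivery (%)", 95.0)],
    "Internal business processes": [("Cycle time (days)", 3.0), ("Yield (%)", 98.0)],
    "Learning and growth":         [("Time to market vs. competition (weeks)", -2.0)],
}

for perspective, measures in scorecard.items():
    print(perspective)
    for name, target in measures:
        print(f"  {name}: target {target}")
```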

The idea was that managers used these perspective headings to prompt the selection of a small number of measures that informed on that aspect of the organisation's strategic performance.[8] The perspective headings show that Kaplan and Norton were thinking about the needs of non-divisional commercial organisations in their initial design. These headings are not very helpful to other kinds of organisations (e.g. multi-divisional or multi-national commercial organisations, governmental organisations, non-profits, non-governmental organisations, government agencies, etc.), and much of what has been written on the balanced scorecard since has, in one way or another, focused on the identification of alternative headings more suited to a broader range of organisations, and also suggested using either additional or fewer perspectives (e.g. Butler et al. (1997),[21] Ahn (2001),[22] Elefalke (2001),[23] Brignall (2002),[24] Irwin (2002),[25] Radnor et al. (2003)[26]).

These suggestions were notably triggered by a recognition that different but equivalent headings would yield alternative sets of measures, and this represents the major design challenge faced with this type of balanced scorecard design: justifying the choice of measures made. "Of all the measures you could have chosen, why did you choose these?" These issues contribute to dissatisfaction with early Balanced Scorecard designs, since if users are not confident that the measures within the Balanced Scorecard are well chosen, they will have less confidence in the information it provides.[27]

Although less common, these early-style balanced scorecards are still designed and used today.[1]

In short, first generation balanced scorecards are hard to design in a way that builds confidence that they are well designed. Because of this, many are abandoned soon after completion.[6]

Second Generation Balanced Scorecard

In the mid-1990s, an improved design method emerged.[15] In the new method, measures are selected based on a set of "strategic objectives" plotted on a "strategic linkage model" or "strategy map". With this modified approach, the strategic objectives are distributed across the four measurement perspectives, so as to "connect the dots" to form a visual presentation of strategy and measures.[28]

In this modified version of balanced scorecard design, managers select a few strategic objectives within each of the perspectives, and then define the cause-effect chain among these objectives by drawing links between them to create a "strategic linkage model". A balanced scorecard of strategic performance measures is then derived directly by selecting one or two measures for each strategic objective.[5] This type of approach provides greater contextual justification for the measures chosen, and is generally easier for managers to work through. This style of balanced scorecard has been commonly used since 1996 or so: it is significantly different in approach from the methods originally proposed, and so can be thought of as representing the "2nd generation" of design approach adopted for the balanced scorecard since its introduction.
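A minimal sketch of this idea, with invented objectives, links and measures (none of them from the source), might represent a strategic linkage model as a small directed graph of objectives grouped by perspective:

```python
# Hypothetical "strategic linkage model": objectives per perspective, cause-effect links,
# and one measure attached to each objective. All names are invented for illustration.
objectives = {
    "Grow revenue":     {"perspective": "Financial",           "measures": ["Sales growth (%)"]},
    "Increase loyalty": {"perspective": "Customer",            "measures": ["Repeat purchase rate"]},
    "Reduce defects":   {"perspective": "Internal processes",  "measures": ["Defect rate (ppm)"]},
    "Train staff":      {"perspective": "Learning and growth", "measures": ["Training hours/employee"]},
}

# Directed cause-and-effect links, drawn "upwards" towards the financial outcome.
links = [
    ("Train staff", "Reduce defects"),
    ("Reduce defects", "Increase loyalty"),
    ("Increase loyalty", "Grow revenue"),
]

for cause, effect in links:
    print(f"{cause} ({objectives[cause]['perspective']}) -> "
          f"{effect} ({objectives[effect]['perspective']})")
```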

Third Generation Balanced Scorecard

Main article: Third-generation balanced scorecard

In the late 1990s, the design approach evolved yet again. One problem with the "second generation" design approach described above was that the plotting of causal links amongst twenty or so medium-term strategic goals was still a relatively abstract activity. In practice it ignored the fact that opportunities to intervene and to influence strategic goals are, and need to be, anchored in current and real management activity. Secondly, the need to "roll forward" and test the impact of these goals necessitated the creation of an additional design instrument: the Vision or Destination Statement. This device was a statement of what "strategic success", or the "strategic end-state", looked like. It was quickly realized that if a Destination Statement was created at the beginning of the design process, then it was easier to select strategic activity and outcome objectives to respond to it. Measures and targets could then be selected to track the achievement of these objectives. Design methods that incorporate a Destination Statement or equivalent (e.g. the results-based management method proposed by the UN in 2002) represent a tangibly different design approach from those that went before, and have been proposed as representing a "third generation" design method for balanced scorecards.[5]

Design methods for balanced scorecards continue to evolve and adapt to reflect the deficiencies in the currently used methods, and the particular needs of communities of interest (e.g. NGOs and government departments have found the third generation methods embedded in results-based management more useful than first or second generation design methods).[29]

This generation refined the second generation of balanced scorecards to give more relevance and functionality to strategic objectives. The major difference is the incorporation of Destination Statements. Other key components are strategic objectives, strategic linkage model and perspectives, measures and initiatives.[5]

Popularity

In 1997, Kurtzman[30] found that 64 percent of the companies questioned were measuring performance from a number of perspectives in a similar way to the balanced scorecard. Balanced scorecards have been implemented by government agencies, military units, business units and corporations as a whole, non-profit organizations, and schools.

The Balanced Scorecard has been widely adopted, and was found to be the most popular performance management framework in a recent survey.[31]

Many examples of balanced scorecards can be found via web searches. However, adapting one organization's balanced scorecard to another is generally not advised by theorists, who believe that much of the benefit of the balanced scorecard comes from the design process itself.[6] Indeed, it could be argued that many failures in the early days of the balanced scorecard could be attributed to this problem, in that early balanced scorecards were often designed remotely by consultants.[32][33] Managers did not trust, and so failed to engage with and use, these measure suites created by people lacking knowledge of the organization and management responsibility.[19]

Variants

Since the balanced scorecard was popularized in the early 1990s, a large number of alternatives to the original 'four box' balanced scorecard promoted by Kaplan and Norton in their various articles and books have emerged. Most have very limited application, and are typically proposed either by academics as vehicles for promoting other agendas (such as green issues) - e.g. Brignall (2002)[24] - or by consultants as an attempt at differentiation to promote sales of books and/or consultancy (e.g. Bourne (2002);[34] Niven (2002)[35]).

Many of the structural variations proposed are broadly similar, and a research paper published in 2004[5] attempted to identify a pattern in these variations - noting three distinct types of variation.


The variations appeared to be part of an evolution of the Balanced Scorecard concept, and so the paper refers to these distinct types as "generations". Broadly, the original 'measures in boxes' type design (as proposed by Kaplan & Norton) constitutes the 1st generation balanced scorecard design; balanced scorecard designs that include a 'strategy map' or 'strategic linkage model' (e.g. the Performance Prism,[36] later Kaplan & Norton designs,[16] and the Performance Driver model of Olve, Roy & Wetter (English translation 1999,[15] 1st published in Swedish 1997)) constitute the 2nd generation of Balanced Scorecard design; and designs that augment the strategy map / strategic linkage model with a separate document describing the long-term outcomes sought from the strategy (the "destination statement" idea) comprise the 3rd generation balanced scorecard design.

Variants that feature adaptations of the structure of Balanced Scorecard to suit better a particular viewpoint or agenda are numerous. Examples of the focus of such adaptations include green issues,[24] decision support,[37] public sector management,[38] and health care management.[39] The performance management elements of the UN's Results Based Management system have strong design and structural similarities to those used in the 3rd Generation Balanced Scorecard design approach.[29]

Balanced Scorecard is also often linked to quality management tools and activities.[40] Although there are clear areas of cross-over and association, the two sets of tools are complementary rather than duplicative.[41]

A common use of balanced scorecard is to support the payments of incentives to individuals, even though it was not designed for this purpose and is not particularly suited to it.[2][42]

Criticism

The balanced scorecard has attracted criticism from a variety of sources. Most has come from the academic community, which dislikes the empirical nature of the framework: Kaplan and Norton notoriously failed to include any citation of prior articles in their initial papers on the topic. Some of this criticism focuses on technical flaws in the methods and design of the original Balanced Scorecard proposed by Kaplan and Norton.[19][32][43] Other academics have simply focused on the lack of citation support.[44]

A second kind of criticism is that the balanced scorecard does not provide a bottom-line score or a unified view with clear recommendations: it is simply a list of metrics (e.g. Jensen 2001[45]). These critics usually include in their criticism suggestions about how the 'unanswered' questions they postulate could be answered, but typically the unanswered questions relate to things outside the scope of the balanced scorecard itself, such as developing strategies (e.g. Brignall[24]).

A third kind of criticism is that the model fails to fully reflect the needs of stakeholders - putting bias on financial stakeholders over others. Early forms of the Balanced Scorecard proposed by Kaplan & Norton focused on the needs of commercial organisations in the USA - where this focus on investment return was appropriate.[10] This focus was maintained through subsequent revisions.[46] Even now, over 20 years after they were first proposed, the four most common perspectives in Balanced Scorecard designs mirror the four proposed in the original Kaplan & Norton paper.[1] However, as noted earlier in this article, there have been many studies that suggest other perspectives might better reflect the priorities of organisations - particularly but not exclusively relating to the needs of organisations in the public and non-governmental sectors.[47] More modern design approaches such as the 3rd Generation Balanced Scorecard and the UN's Results Based Management methods explicitly consider the interests of wider stakeholder groups, and perhaps address this issue in its entirety.[29]

There are few empirical studies linking the use of balanced scorecards to better decision making or improved financial performance of companies, but some work has been done in these areas. However, broad surveys of usage have difficulties in this respect, due to the wide variations in definition of 'what a balanced scorecard is' noted above (making it hard to work out in a survey whether you are comparing like with like). Single-organization case studies suffer from the 'lack of a control' issue common to any study of organizational change - you don't know what the organization would have achieved if the change had not been made, so it is difficult to attribute changes observed over time to a single intervention (such as introducing a balanced scorecard). However, such studies as have been done have typically found the balanced scorecard to be useful.[6][19]

Software tools

It is important to recognize that the balanced scorecard by definition is not a complex thing - typically no more than about 20 measures spread across a mix of financial and non-financial topics, and easily reported manually (on paper, or using simple office software).[46]

The processes of collecting, reporting, and distributing balanced scorecard information can be labor-intensive and prone to procedural problems (for example, getting all relevant people to return the information required by the required date). The simplest mechanism to use is to delegate these activities to an individual, and many Balanced Scorecards are reported via ad-hoc methods based around email, phone calls and office software.

In more complex organizations, where there are multiple balanced scorecards to report and/or a need for co-ordination of results between balanced scorecards (for example, if one level of reports relies on information collected and reported at a lower level), the use of individual reporters is problematic. Where these conditions apply, organizations use balanced scorecard reporting software to automate the production and distribution of these reports.

Recent surveys[1][48] have consistently found that roughly one third of organizations used office software to report their balanced scorecard, one third used software developed specifically for their own use, and one third used one of the many commercial packages available.

Dashboard
From Wikipedia, the free encyclopedia
This article is about the control panel placed in the front of a car. For other uses, see Dashboard (disambiguation).


The dashboard of a Bentley Continental GTC car

A Suzuki Hayabusa motorcycle dash

Dashboard instruments displaying various car and engine conditions

Carriage dashboard

A dashboard (also called dash, instrument panel, or fascia) is a control panel placed in front of the driver of an automobile, housing instrumentation and controls for operation of the vehicle.

The word originally applied to a barrier of wood or leather fixed at the front of a horse-drawn carriage or sleigh to protect the driver from mud or other debris "dashed up" (thrown up) by the horses' hooves.[1]


Dashboard items

Items located on the dashboard at first included the steering wheel and the instrument cluster. The instrument cluster typically contains gauges such as a speedometer, tachometer, odometer and fuel gauge, and indicators such as gearshift position, seat belt warning light, parking-brake-engagement warning light[2] and an engine-malfunction light. There may also be indicators for low fuel, low oil pressure, low tire pressure and faults in the airbag (SRS) system. Heating and ventilation controls and vents, lighting controls, audio equipment and automotive navigation systems are also mounted on the dashboard.

The top of a dashboard may contain vents for the heating and air conditioning system and speakers for an audio system. A glove compartment is commonly located on the passenger's side. There may also be an ashtray and a cigarette lighter which can provide a power outlet for other low-voltage appliances.[3]

Padding and safety

In 1937, Chrysler, Dodge, DeSoto, and Plymouth cars came with a safety dashboard that was flat, raised above knee height, and had all the controls mounted flush.[4]

Padded dashboards were advocated in the 1930s by car safety pioneer Claire L. Straith.[5] In 1947, the Tucker became the first car with a padded dashboard.[6]

One of the safety enhancements of the 1970s was the widespread adoption of padded dashboards. The padding is commonly polyurethane foam, while the surface is commonly either polyvinyl chloride (PVC) or leather in the case of luxury models.

In the early and mid-1990s, airbags became a standard feature of steering wheels and dashboards.

Fashion in instrumentation


Stylised dashboard from a 1980s Lancia Beta

In the 1940s through the 1960s, American car manufacturers and their imitators designed unusually-shaped instruments on a dashboard laden with chrome and transparent plastic, which could be less readable, but was often thought to be more stylish. Sunlight could cause a bright glare on the chrome, particularly for a convertible.

With the coming of the LED in consumer electronics, some manufacturers used instruments with digital readouts to make their cars appear more up to date, but this has faded from practice. Some cars use a head-up display to project the speed of the car onto the windscreen in imitation of fighter aircraft, but in a far less complex display.

In recent years, spurred on by the growing aftermarket use of dash kits, many automakers have taken the initiative to add more stylistic elements to their dashboards. One prominent example of this is the Chevrolet Sonic, which offers both exterior (e.g., a custom graphics package) and interior cosmetic upgrades.[7] In addition to OEM dashboard trim and upgrades, a number of companies offer domed polyurethane or vinyl applique dash trim accent kits or "dash kits."[8] Some of the major manufacturers of these kits are Sherwood Innovations, B&I Trims and Rvinyl.com.

Manufacturers such as BMW, Honda, Toyota and Mercedes-Benz have included fuel-economy gauges in some instrument clusters, showing fuel mileage in real time. The ammeter was the gauge of choice for monitoring the state of the charging system until the 1970s. Later it was replaced by the voltmeter. Today most family vehicles have warning lights instead of voltmeters or oil pressure gauges in their dashboard instrument clusters, though sports cars often have proper gauges for performance purposes and driver appeasement.

Data quality
From Wikipedia, the free encyclopedia

Data are of high quality if "they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any particular external purpose; e.g., a person's age and birth date may conflict within different parts of the same database. These views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept of data quality as it relates to business data processing, although of course other fields have their own data quality issues as well.


Definitions

This list is taken from the online book "Data Quality: High-impact Strategies".[1] See also the glossary of data quality terms.[2]

Degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.

The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.[3]

The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.[4]

The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.[5]

Complete, standards based, consistent, accurate and time stamped.[6]

History

Before the rise of the inexpensive server, massive mainframe computers were used to maintain name and address data so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars in comparison to manual correction of customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately.


Initially sold as a service, data quality moved inside the walls of corporations, as low-cost and powerful server technology became available.

Companies with an emphasis on marketing often focus their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found in the enterprise. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) avoiding false stock-out; 3) improving the understanding of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking and shipping parts across a large organization.

While name and address data has a clear standard as defined by local postal authorities, other types of data have few recognized standards. There is a movement in the industry today to standardize certain non-address data. The non-profit group GS1 is among the groups spearheading this movement.

For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying data integrity, etc.

Overview

There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework, dubbed "Zero Defect Data" (Hansen, 1991), adapts the principles of statistical process control to data quality. Another framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand and Wang, 1996).

A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness and relevance. Nearly 200 such terms have been identified, and there is little agreement on their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognise this as a similar problem to "ilities".

MIT has a Total Data Quality Management program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ). This program grew out of the work done by Hansen on the "Zero Defect Data" framework (Hansen, 1991).


In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing and business intelligence to customer relationship management and supply chain management. One industry study estimated the total cost to the US economy of data quality problems at over US$600 billion per annum (Eckerson, 2002). Incorrect data – which includes invalid and outdated information – can originate from different data sources – through data entry, or data migration and conversion projects.[7]

In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed.[8]

One reason contact data becomes stale very quickly in the average database is that more than 45 million Americans change their address every year.[9]

In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the corporation is to be responsible for data quality. In some organizations, this data governance function has been established as part of a larger regulatory compliance function - a recognition of the importance of data/information quality to organizations.

Problems with data quality don't only arise from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.

Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.[10]

The market is going some way towards providing data quality assurance. A number of vendors make tools for analysing and repairing poor-quality data in situ, service providers can clean the data on a contract basis, and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of functions for improving data, which may include some or all of the following:

1. Data profiling - initially assessing the data to understand its quality challenges
2. Data standardization - a business rules engine that ensures that data conforms to quality rules
3. Geocoding - for name and address data; corrects data to US and worldwide postal standards
4. Matching or linking - a way to compare data so that similar, but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual. It might be able to manage 'householding', or finding links between spouses at the same address, for example. Finally, it often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record. (A small illustrative sketch follows this list.)
5. Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
6. Batch and real time - once the data is initially cleansed (batch), companies often want to build the processes into enterprise applications to keep it clean.
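To make item 4 concrete, here is a minimal, hypothetical sketch of fuzzy duplicate matching: nickname expansion plus string similarity on name and address. The nickname table, weights, threshold and sample records are invented for illustration and do not describe any particular product.

```python
# Illustrative "fuzzy" duplicate matching on name and address.
from difflib import SequenceMatcher

NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}  # toy table

def normalize(name: str) -> str:
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

records = [
    {"name": "Bob Smith",    "address": "12 Oak St"},
    {"name": "Robert Smith", "address": "12 Oak Street"},
    {"name": "Alice Jones",  "address": "98 Elm Ave"},
]

# Pairwise comparison; real tools use blocking/indexing to avoid O(n^2) scans.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = (0.7 * similarity(records[i]["name"], records[j]["name"])
                 + 0.3 * similarity(records[i]["address"], records[j]["address"]))
        if score > 0.85:
            print("probable duplicate:", records[i]["name"], "~", records[j]["name"])
```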

There are several well-known authors and self-styled experts, with Larry English perhaps the most popular guru. In addition, the International Association for Information and Data Quality (IAIDQ) was established in 2004 to provide a focal point for professionals and researchers in this field.

ISO 8000 is the international standard for data quality.

Data Quality Assurance

Data quality assurance is the process of profiling the data to discover inconsistencies and other anomalies in the data, as well as performing data cleansing activities (e.g. removing outliers, missing data interpolation) to improve the data quality.

These activities can be undertaken as part of data warehousing or as part of the database administration of an existing piece of applications software.

Data quality control


Data quality control is the process of controlling the usage of data, with known quality measurements, for an application or a process. This process is usually done after a data quality assurance (QA) process, which consists of the discovery of data inconsistencies and their correction.

The data QA process provides the following information to data quality control (QC):

Severity of inconsistency
Incompleteness
Accuracy
Precision
Missing / Unknown

The data QC process uses the information from the QA process and then decides whether to use the data for analysis or in an application or business process. For example, if a data QC process finds that the data contains too many errors or inconsistencies, then it prevents that data from being used for its intended process. The usage of incorrect data might crucially impact output: for example, providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a data QC process protects the usage of data and supports safe information usage.
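As a purely illustrative aid, the following sketch shows a QC "gate" of the kind described above: it consumes hypothetical QA findings and decides whether the data may be used by its intended process. The metric names and thresholds are assumptions, not from the source.

```python
# Hypothetical QC gate driven by QA findings.
QA_REPORT = {
    "inconsistency_rate": 0.02,   # share of records failing consistency checks
    "missing_rate": 0.11,         # share of mandatory fields left empty
    "accuracy_rate": 0.97,        # share of values matching a trusted source
}

THRESHOLDS = {
    "inconsistency_rate": ("max", 0.05),
    "missing_rate":       ("max", 0.10),
    "accuracy_rate":      ("min", 0.95),
}

def qc_gate(report, thresholds):
    """Return (approved, reasons) for using the data in its intended process."""
    reasons = []
    for metric, (kind, limit) in thresholds.items():
        value = report[metric]
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            reasons.append(f"{metric}={value} violates {kind} limit {limit}")
    return (not reasons, reasons)

approved, reasons = qc_gate(QA_REPORT, THRESHOLDS)
print("approved for use" if approved else "blocked:", reasons)
```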

Optimum use of data quality


Data quality (DQ) is a niche area required for the integrity of data management, covering gaps left by data issues. It is one of the key functions that aid data governance by monitoring data to find exceptions undiscovered by current data management operations. Data quality checks may be defined at the attribute level to give full control over their remediation steps.

DQ checks and business rules may easily overlap if an organization is not attentive to its DQ scope. Business teams should understand the DQ scope thoroughly in order to avoid overlap. Data quality checks are redundant if business logic covers the same functionality and fulfills the same purpose as DQ. The DQ scope of an organization should be defined in its DQ strategy and well implemented. Some data quality checks may be translated into business rules after repeated instances of exceptions in the past.

Below are a few areas of data flows that may need perennial DQ checks:

Completeness and precision DQ checks on all data may be performed at the point of entry for each mandatory attribute from each source system. Some attribute values are created well after the initial creation of the transaction; in such cases, administering these checks becomes tricky and should be done immediately after the defined event of that attribute's source occurs and the transaction's other core attribute conditions are met.

All data having attributes referring to Reference Data in the organization may be validated against the set of well-defined valid values of Reference Data to discover new or discrepant values through the validity DQ check. Results may be used to update Reference Data administered under Master Data Management (MDM).
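A minimal sketch of such a validity check, assuming a hypothetical reference set of country codes and invented records (neither comes from the text):

```python
# Illustrative validity DQ check against a reference-data value set.
REFERENCE_COUNTRY_CODES = {"US", "GB", "DE", "IN"}  # assumed reference data

records = [
    {"id": 1, "country_code": "US"},
    {"id": 2, "country_code": "UK"},   # discrepant: the ISO code is "GB"
    {"id": 3, "country_code": "BR"},   # new value, not yet in reference data
]

exceptions = [r for r in records if r["country_code"] not in REFERENCE_COUNTRY_CODES]
for r in exceptions:
    print(f"validity exception: record {r['id']} has country_code {r['country_code']!r}")
```

Exceptions like these can either be corrected at the source or fed back to update the reference data administered under MDM.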

All data sourced from a third party to an organization's internal teams may undergo an accuracy (DQ) check against the third-party data. These DQ check results are valuable when administered on data that has made multiple hops after its point of entry but before it becomes authorized or stored for enterprise intelligence.

All data columns that refer to master data may be validated with a consistency check. A DQ check administered on the data at the point of entry discovers new data for the MDM process, but a DQ check administered after the point of entry discovers a failure (not an exception) of consistency.

As data is transformed, multiple timestamps and their positions are captured and may be compared against each other, and against their allowed leeway, to validate the data's value, decay, and operational significance against a defined SLA (service level agreement). This timeliness DQ check can be used to decrease the rate at which data value decays and to optimize the policies governing the data movement timeline.

In an organization, complex logic is usually segregated into simpler logic across multiple processes. Reasonableness DQ checks on such complex logic, yielding a logical result within a specific range of values or static interrelationships (aggregated business rules), may be used to discover complicated but crucial business processes, outliers in the data and its drift from BAU (business as usual) expectations, and may surface possible exceptions that eventually result in data issues. This check may be a simple generic aggregation rule applied to a large chunk of data, or it can be complicated logic on a group of attributes of a transaction pertaining to the core business of the organization. This DQ check requires a high degree of business knowledge and acumen. Discovery of reasonableness issues may aid policy and strategy changes by business, data governance or both.

Conformity checks and integrity checks need not be covered in all business needs; they are strictly at the discretion of the database architecture.

There are many places in the data movement where DQ checks may not be required. For instance, a DQ check for completeness and precision on not-null columns is redundant for data sourced from a database. Similarly, data should be validated for accuracy with respect to time when it is stitched together across disparate sources; however, that is a business rule and should not be in the DQ scope.

Criticism of existing tools and processes

The main reasons cited are:

Project costs: costs are typically in the hundreds of thousands of dollars

Time: lack of enough time to deal with large-scale data-cleansing software

Security: concerns over sharing information, giving an application access across systems, and effects on legacy systems

Master data management
From Wikipedia, the free encyclopedia


In business, master data management (MDM) comprises the processes, governance, policies, standards and tools that consistently define and manage the critical data of an organization to provide a single point of reference.[1]

The data that is mastered may include:

reference data - the business objects for transactions, and the dimensions for analysis

analytical data - supports decision making[2][3]

In computing, an MDM tool can be used to support master data management by removing duplicates, standardizing data (mass maintaining), and incorporating rules to eliminate incorrect data from entering the system, in order to create an authoritative source of master data. Master data are the products, accounts and parties for which business transactions are completed. The root-cause problem stems from business unit and product line segmentation, in which the same customer will be serviced by different product lines, with redundant data being entered about the customer (aka the party in the role of customer) and account in order to process the transaction. The redundancy of party and account data is compounded in the front-to-back-office life cycle, where the authoritative single source for the party, account and product data is needed but is often once again redundantly entered or augmented.

MDM has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.

The term recalls the concept of a master file from an earlier computing era.

Definition: Master data management (MDM) is a comprehensive method of enabling an enterprise to link all of its critical data to one file, called a master file, that provides a common point of reference. When properly done, MDM streamlines data sharing among personnel and departments. In addition, MDM can facilitate computing in multiple system architectures, platforms and applications.


Issues

At a basic level, MDM seeks to ensure that an organization does not use multiple (potentially inconsistent) versions of the same master data in different parts of its operations, which can occur in large organizations. A common example of poor MDM is the scenario of a bank at which a customer has taken out a mortgage and the bank begins to send mortgage solicitations to that customer, ignoring the fact that the person already has a mortgage account relationship with the bank. This happens because the customer information used by the marketing section within the bank lacks integration with the customer information used by the customer services section of the bank. Thus the two groups remain unaware that an existing customer is also considered a sales lead. The process of record linkage is used to associate different records that correspond to the same entity, in this case the same person.
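As a purely illustrative aid, not drawn from the article, the following sketch shows one naive form of record linkage for the scenario above: records from two systems are joined on a normalized name-and-postcode key so the marketing lead is recognized as an existing mortgage customer. The system names, fields, records and match key are all assumptions; real MDM tools use far more robust matching.

```python
# Illustrative record linkage across two hypothetical systems.
def link_key(record):
    """Build a crude match key from a normalized name and postcode."""
    return (record["name"].strip().lower(), record["postcode"].replace(" ", "").upper())

marketing_leads = [
    {"name": "Jane Doe ", "postcode": "ab1 2cd", "source": "marketing"},
]
mortgage_customers = [
    {"name": "jane doe", "postcode": "AB1 2CD", "account": "MTG-1042"},
]

customer_index = {link_key(c): c for c in mortgage_customers}
for lead in marketing_leads:
    match = customer_index.get(link_key(lead))
    if match:
        print(f"lead '{lead['name'].strip()}' is an existing customer "
              f"(account {match['account']}); suppress the mortgage solicitation")
```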

Other problems include (for example) issues with the quality of data, consistent classification and identification of data, and data-reconciliation issues. Master data management of disparate data systems requires data transformations, as the data extracted from the disparate source data system is transformed and loaded into the master data management hub. To synchronize the disparate source master data, the managed master data extracted from the master data management hub is in turn transformed and loaded back into the disparate source data system as the master data is updated. As with other Extract, Transform, Load-based data movement, these processes are expensive and inefficient to develop and to maintain, which greatly reduces the return on investment for the master data management product.

One of the most common reasons some large corporations experience massive issues with MDM is growth through mergers or acquisitions. Two organizations which merge will typically create an entity with duplicate master data (since each likely had at least one master database of its own prior to the merger). Ideally, database administrators resolve this problem through deduplication of the master data as part of the merger. In practice, however, reconciling several master data systems can present difficulties because of the dependencies that existing applications have on the master databases. As a result, more often than not the two systems do not fully merge, but remain separate, with a special reconciliation process defined that ensures consistency between the data stored in the two systems. Over time, however, as further mergers and acquisitions occur, the problem multiplies, more and more master databases appear, and data-reconciliation processes become extremely complex, and consequently unmanageable and unreliable. Because of this trend, one can find organizations with 10, 15, or even as many as 100 separate, poorly integrated master databases, which can cause serious operational problems in the areas of customer satisfaction, operational efficiency, decision-support, and regulatory compliance.

Solutions

Processes commonly seen in MDM include source identification, data collection, data transformation, normalization, rule administration, error detection and correction, data consolidation, data storage, data distribution, data classification, taxonomy services, item master creation, schema mapping, product codification, data enrichment and data governance.

The selection of entities considered for MDM depends somewhat on the nature of an organization. In the common case of commercial enterprises, MDM may apply to such entities as customer (customer data integration), product (product information management), employee, and vendor. MDM processes identify the sources from which to collect descriptions of these entities. In the course of transformation and normalization, administrators adapt descriptions to conform to standard formats and data domains, making it possible to remove duplicate instances of any entity. Such processes generally result in an organizational MDM repository, from which all requests for a certain entity instance produce the same description, irrespective of the originating sources and the requesting destination.

The tools include data networks, file systems, a data warehouse, data marts, an operational data store, data mining, data analysis, data visualization, data federation and data virtualization. One of the newest tools, virtual master data management, utilizes data virtualization and a persistent metadata server to implement a multi-level automated MDM hierarchy.

Transmission of Master Data


There are several ways in which Master Data may be collated and distributed to other systems.[4] These include the following (a brief sketch of two of these patterns appears after the list):

Data consolidation : The process of capturing master data from multiple sources and integrating into a single hub (operational data store) for replication to other destination systems.

Data federation : The process of providing a single virtual view of master data from one or more sources to one or more destination systems.

Data propagation : The process of copying master data from one system to another, typically through point-to-point interfaces in legacy systems.
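The following sketch contrasts the consolidation and federation patterns using plain Python dictionaries as stand-ins for the source systems and the hub; the system names, identifiers, and survivorship rule are hypothetical.

```python
# Sketch of two transmission patterns, with dictionaries standing in for
# source systems and the MDM hub. All names, IDs and rules are hypothetical.

crm_system = {"C-1": {"name": "Acme Ltd", "country": "UK"}}
erp_system = {"1001": {"name": "ACME LIMITED", "country": "United Kingdom"}}

def consolidate():
    """Data consolidation: physically merge source records into one golden hub
    record, which would then be replicated to destination systems."""
    crm = crm_system["C-1"]
    golden = {
        "name": crm["name"],                      # survivorship rule: prefer the CRM spelling
        "country": "GB",                          # conform to an assumed ISO country code
        "source_ids": {"crm": "C-1", "erp": "1001"},  # lineage back to the sources
    }
    return {"M-42": golden}

def federated_view():
    """Data federation: no physical hub copy; assemble a single virtual view
    by reading the sources at request time."""
    return {"crm": crm_system["C-1"], "erp": erp_system["1001"]}

print(consolidate())
print(federated_view())
```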

Data profiling

From Wikipedia, the free encyclopedia


Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

1. Find out whether existing data can easily be used for other purposes
2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
3. Give metrics on data quality, including whether the data conforms to particular standards or patterns
4. Assess the risk involved in integrating data for new applications, including the challenges of joins
5. Assess whether metadata accurately describes the actual values in the source database
6. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
7. Have an enterprise view of all data, for uses such as master data management, where key data is needed, or data governance, for improving data quality.

Contents 1 Data Profiling in Relation to Data Warehouse/Business Intelligence Development o 1.1 Introduction o 1.2 How to do Data Profiling o 1.3 When to Conduct Data Profiling o 1.4 Benefits of Data Profiling 2 See also 3 References

Data Profiling in Relation to Data Warehouse/Business Intelligence Development

Introduction

Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data.[1] Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.[2] Thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not.[3] The result of the analysis is used both strategically, to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design, and to level sponsors’ expectations.[1]

How to do Data Profiling

Data profiling utilizes different kinds of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variation as well as other aggregates such as count and sum. Additional metadata information obtained during data profiling could be data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.[2][4][5] The metadata can then be used to discover problems such as illegal values, misspelling, missing values, varying value representation, and duplicates. Different analyses are performed for different structural levels. For example, single columns could be profiled individually to get an understanding of frequency distribution of different values, type, and use of each column. Embedded value dependencies can be exposed in cross-column analysis. Finally, overlapping value sets possibly representing foreign key relationships between entities can be explored in an inter-table analysis.[2] Normally, purpose-built tools are used for data profiling to ease the process.[1][2][4][5][6][7] The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.[3]
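As a concrete illustration of the column-level statistics listed above, the following is a minimal profiling sketch over an in-memory table; the sample rows are invented, and real profiling tools would run against databases or files.

```python
# Minimal single-column profiling over an in-memory table (list of dicts).
from collections import Counter
from statistics import mean

rows = [
    {"customer_id": 1, "age": 34,   "email": "a@example.com"},
    {"customer_id": 2, "age": None, "email": "b@example.com"},
    {"customer_id": 3, "age": 29,   "email": "not-an-email"},
    {"customer_id": 3, "age": 29,   "email": "not-an-email"},   # duplicate row
]

def profile_column(rows, column):
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    stats = {
        "count": len(values),
        "nulls": len(values) - len(non_null),          # occurrence of null values
        "distinct": len(set(non_null)),                # uniqueness
        "most_common": Counter(non_null).most_common(1),
    }
    if numeric and len(numeric) == len(non_null):      # column looks numeric
        stats.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return stats

for col in ("customer_id", "age", "email"):
    print(col, profile_column(rows, col))
```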

When to Conduct Data Profiling

According to Kimball,[1] data profiling is performed several times and with varying intensity throughout the data warehouse developing process. A light profiling assessment should be undertaken as soon as candidate source systems have been identified, right after the acquisition of the business requirements for the DW/BI. The purpose is to clarify at an early stage whether the right data is available at the right detail level and whether anomalies can be handled subsequently. If this is not the case the project might have to be canceled.[1] More detailed profiling is done prior to the dimensional modeling process in order to see what it will require to convert data into the dimensional model, and it extends into the ETL system design process to establish what data to extract and which filters to apply.[1] Data profiling is also conducted during the data warehouse development process after data has been loaded into staging, the data marts, etc. Profiling at these points in time helps assure that data cleaning and transformations have been done correctly according to requirements.

Benefits of Data Profiling

The benefits of data profiling are improved data quality, a shorter implementation cycle for major projects, and improved understanding of the data for its users.[7] Discovering business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.[3] Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.[7] Although data profiling is effective, remember to find a suitable balance and avoid slipping into "analysis paralysis".

Data modeling

From Wikipedia, the free encyclopedia



The data modeling process. The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects. This is then used as the start point for interface or database design.[1]

Data modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.

Contents 1 Overview 2 Data modeling topics

o 2.1 Data models o 2.2 Conceptual, logical and physical schemas o 2.3 Data modeling process o 2.4 Modeling methodologies o 2.5 Entity relationship diagrams o 2.6 Generic data modeling o 2.7 Semantic data modeling

3 See also 4 References 5 Further reading 6 External links

Overview


Data modeling is a process used to define and analyze data requirements needed to support the business processes within the scope of corresponding information systems in organizations. Therefore, the process of data modeling involves professional data modelers working closely with business stakeholders, as well as potential users of the information system.

According to Hoberman, data modeling is the process of learning about the data, and the data model is the end result of the data modeling process.[2]

There are three different types of data models produced while progressing from requirements to the actual database to be used for the information system.[3] The data requirements are initially recorded as a conceptual data model which is essentially a set of technology independent specifications about the data and is used to discuss initial requirements with the business stakeholders. The conceptual model is then translated into a logical data model, which documents structures of the data that can be implemented in databases. Implementation of one conceptual data model may require multiple logical data models. The last step in data modeling is transforming the logical data model to a physical data model that organizes the data into tables, and accounts for access, performance and storage details. Data modeling defines not just data elements, but also their structures and the relationships between them.[4]

Data modeling techniques and methodologies are used to model data in a standard, consistent, predictable manner in order to manage it as a resource. The use of data modeling standards is strongly recommended for all projects requiring a standard means of defining and analyzing data within an organization, e.g., using data modeling:

to assist business analysts, programmers, testers, manual writers, IT package selectors, engineers, managers, related organizations and clients to understand and use an agreed semi-formal model of the concepts of the organization and how they relate to one another

to manage data as a resource

for the integration of information systems

for designing databases/data warehouses (aka data repositories)

Data modeling may be performed during various types of projects and in multiple phases of projects. Data models are progressive; there is no such thing as the final data model for a business or application. Instead a data model should be considered a living document that will change in response to a changing business. The data models should ideally be stored in a repository so that they can be retrieved, expanded, and edited over time. Whitten et al. (2004) determined two types of data modeling:[5]

Strategic data modeling: This is part of the creation of an information systems strategy, which defines an overall vision and architecture for information systems. Information engineering is a methodology that embraces this approach.

Data modeling during systems analysis: In systems analysis logical data models are created as part of the development of new databases.

Data modeling is also used as a technique for detailing business requirements for specific databases. It is sometimes called database modeling because a data model is eventually implemented in a database.[5]


Data modeling topics

Data models

Main article: Data model

How data models deliver benefit.[1]

Data models provide a structure for data used within information systems by providing specific definition and format. If a data model is used consistently across systems then compatibility of data can be achieved. If the same data structures are used to store and access data then different applications can share data seamlessly. The results of this are indicated in the diagram. However, systems and interfaces often cost more than they should, to build, operate, and maintain. They may also constrain the business rather than support it. This may occur when the quality of the data models implemented in systems and interfaces is poor.[1]

Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces. So, business rules need to be implemented in a flexible way that does not result in complicated dependencies, rather the data model should be flexible enough so that changes in the business can be implemented within the data model in a relatively quick and efficient way.

Entity types are often not identified, or are identified incorrectly. This can lead to replication of data, data structure and functionality, together with the attendant costs of that duplication in development and maintenance. Therefore, data definitions should be made as explicit and easy to understand as possible to minimize misinterpretation and duplication.

Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25% and 70% of the cost of current systems. Required interfaces should be considered inherently while designing a data model, as a data model on its own would not be usable without interfaces within different systems.

Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardised. To obtain optimal value from an implemented data model, it is very important to define standards that will ensure that data models will both meet business needs and be consistent.[1]


Conceptual, logical and physical schemas

The ANSI/SPARC three level architecture. This shows that a data model can be an external model (or view), a conceptual model, or a physical model. This is not the only way to look at data models, but it is a useful way, particularly when comparing models.[1]

In 1975 ANSI described three kinds of data-model instance:[6]

Conceptual schema : describes the semantics of a domain (the scope of the model). For example, it may be a model of the interest area of an organization or of an industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationships: assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial "language" with a scope that is limited by the scope of the model. Simply described, a conceptual schema is the first step in organizing the data requirements.

Logical schema : describes the structure of some domain of information. This consists of descriptions of (for example) tables, columns, object-oriented classes, and XML tags. The logical schema and conceptual schema are sometimes implemented as one and the same. [3]

Physical schema : describes the physical means used to store data. This is concerned with partitions, CPUs, tablespaces, and the like.

According to ANSI, this approach allows the three perspectives to be relatively independent of each other. Storage technology can change without affecting either the logical or the conceptual schema. The table/column structure can change without (necessarily) affecting the conceptual schema. In each case, of course, the structures must remain consistent across all schemas of the same data model.
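A rough sketch of the three levels for a tiny ordering domain follows; the entities, columns, and DDL are invented for illustration, and SQLite merely stands in for whatever DBMS a physical schema would actually target.

```python
# Sketch: one tiny domain expressed at the three ANSI levels.
import sqlite3

# Conceptual schema (technology independent): entity classes and a relationship.
conceptual = {
    "entities": ["Customer", "Order"],
    "relationships": [("Customer", "places", "Order")],
}

# Logical schema: tables, columns and keys, still independent of storage details.
logical = {
    "customer": ["customer_id (PK)", "name"],
    "order":    ["order_id (PK)", "customer_id (FK)", "order_date", "total"],
}

# Physical schema: concrete DDL for a particular engine (here, in-memory SQLite).
ddl = """
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT,
    total       REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
print(conceptual, logical, sep="\n")
```

Storage details such as indexes or tablespaces could change in the physical DDL without touching the conceptual or logical descriptions, which is the independence the ANSI approach is after.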


Data modeling process

Data modeling in the context of Business Process Integration.[7]

In the context of business process integration (see figure), data modeling complements business process modeling, and ultimately results in database generation.[7]

The process of designing a database involves producing the previously described three types of schemas - conceptual, logical, and physical. The database design documented in these schemas is converted through a Data Definition Language, which can then be used to generate a database. A fully attributed data model contains detailed attributes (descriptions) for every entity within it. The term "database design" can describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views. In an object database the entities and relationships map directly to object classes and named relationships. However, the term "database design" could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the Database Management System or DBMS.

In the process, system interfaces account for 25% to 70% of the development and support costs of current systems. The primary reason for this cost is that these systems do not share a common data model. If data models are developed on a system-by-system basis, then not only is the same analysis repeated in overlapping areas, but further analysis must be performed to create the interfaces between them. Most systems within an organization contain the same basic data, redeveloped for a specific purpose. Therefore, an efficiently designed basic data model can minimize rework with minimal modifications for the purposes of different systems within the organization.[1]


Modeling methodologies

Data models represent information areas of interest. While there are many ways to create data models, according to Len Silverston (1997)[8] only two modeling methodologies stand out, top-down and bottom-up:

Bottom-up models or View Integration models are often the result of a reengineering effort. They usually start with existing data structures: forms, fields on application screens, or reports. These models are usually physical, application-specific, and incomplete from an enterprise perspective. They may not promote data sharing, especially if they are built without reference to other parts of the organization.[8]

Top-down logical data models, on the other hand, are created in an abstract way by getting information from people who know the subject area. A system may not implement all the entities in a logical model, but the model serves as a reference point or template. [8]

Sometimes models are created in a mixture of the two methods: by considering the data needs and structure of an application and by consistently referencing a subject-area model. Unfortunately, in many environments the distinction between a logical data model and a physical data model is blurred. In addition, some CASE tools don’t make a distinction between logical and physical data models.[8]

Entity relationship diagrams

Main article: Entity-relationship model


Example of an IDEF1X entity relationship diagram used to model IDEF1X itself. The name of the view is mm. The domain hierarchy and constraints are also given. The constraints are expressed as sentences in the formal theory of the meta model.[9]

There are several notations for data modeling. The actual model is frequently called "Entity relationship model", because it depicts data in terms of the entities and relationships described in the data.[5] An entity-relationship model (ERM) is an abstract conceptual representation of structured data. Entity-relationship modeling is a relational schema database modeling method, used in software engineering to produce a type of conceptual data model (or semantic data model) of a system, often a relational database, and its requirements in a top-down fashion.

These models are being used in the first stage of information system design during the requirements analysis to describe information needs or the type of information that is to be stored in a database. The data modeling technique can be used to describe any ontology (i.e. an overview and classifications of used terms and their relationships) for a certain universe of discourse i.e. area of interest.


Several techniques have been developed for the design of data models. While these methodologies guide data modelers in their work, two different people using the same methodology will often come up with very different results. Most notable are:

Bachman diagrams
Barker's notation
Chen's notation
Data Vault Modeling
Extended Backus–Naur form
IDEF1X
Object-relational mapping
Object-Role Modeling
Relational Model
Relational Model/Tasmania

Generic data modeling

Main article: Generic data model

Example of a Generic data model.[10]

Generic data models are generalizations of conventional data models. They define standardized general relation types, together with the kinds of things that may be related by such a relation type. The definition of generic data model is similar to the definition of a natural language. For example, a generic data model may define relation types such as a 'classification relation', being a binary relation between an individual thing and a kind of thing (a class) and a 'part-whole relation', being a binary relation between two things, one with the role of part, the other with the role of whole, regardless the kind of things that are related.

Given an extensible list of classes, this allows the classification of any individual thing and the specification of part-whole relations for any individual object. By standardization of an extensible list of relation types, a generic data model enables the expression of an unlimited number of kinds of facts and will approach the capabilities of natural languages. Conventional data models, on the other hand, have a fixed and limited domain scope, because the instantiation (usage) of such a model only allows expressions of kinds of facts that are predefined in the model.
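The classification and part-whole relation types described above can be sketched as standardized triples; the relation-type names and instances below are invented for illustration.

```python
# Sketch of a generic data model: facts are triples whose relation types
# (classification, part-whole) are themselves standardized data.
relation_types = {"is classified as", "is part of"}

facts = [
    ("pump P-101", "is classified as", "centrifugal pump"),   # individual -> class
    ("impeller I-7", "is part of", "pump P-101"),             # part -> whole
]

def related(subject, relation):
    """Return everything linked to `subject` by the given relation type."""
    assert relation in relation_types
    return [obj for s, r, obj in facts if s == subject and r == relation]

print(related("pump P-101", "is classified as"))   # ['centrifugal pump']
print(related("impeller I-7", "is part of"))       # ['pump P-101']
```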

Semantic data modeling

Main article: Semantic data model

The logical data structure of a DBMS, whether hierarchical, network, or relational, cannot totally satisfy the requirements for a conceptual definition of data because it is limited in scope and biased toward the implementation strategy employed by the DBMS.

Semantic data models.[9]

Therefore, the need to define data from a conceptual view has led to the development of semantic data modeling techniques, that is, techniques to define the meaning of data within the context of its interrelationships with other data. As illustrated in the figure, the real world, in terms of resources, ideas, events, etc., is symbolically defined within physical data stores. A semantic data model is an abstraction which defines how the stored symbols relate to the real world. Thus, the model must be a true representation of the real world.[9]

A semantic data model can be used to serve many purposes, such as:[9]

planning of data resources
building of shareable databases
evaluation of vendor software
integration of existing databases

The overall goal of semantic data models is to capture more meaning of data by integrating relational concepts with more powerful abstraction concepts known from the Artificial Intelligence field. The idea is to provide high-level modeling primitives as an integral part of a data model in order to facilitate the representation of real-world situations.[11]

Types of data models

Database model

Main article: Database model


A database model is a specification describing how a database is structured and used. Several such models have been suggested. Common models include:

Flat model

Hierarchical model

Network model

Relational model

Flat model : This may not strictly qualify as a data model. The flat (or table) model consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to be similar values, and all members of a row are assumed to be related to one another.

Hierarchical model : In this model data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting, and a sort field to keep the records in a particular order in each same-level list.


Network model : This model organizes data using two fundamental constructs, called records and sets. Records contain fields, and sets define one-to-many relationships between records: one owner, many members.

Relational model : A database model based on first-order predicate logic. Its core idea is to describe a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values.

Concept-oriented model

Star schema

Object-relational model : Similar to a relational database model, but objects, classes and inheritance are directly supported in database schemas and in the query language.

Star schema is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.

Techniques

Bachman diagram

Illustration of set type using a Bachman diagram


A Bachman diagram is a certain type of data structure diagram,[2] and is used to design the data with a network or relational "logical" model, separating the data model from the way the data is stored in the system. The model is named after database pioneer Charles Bachman, and mostly used in computer software design.

In a relational model, a relation is the cohesion of attributes that are fully and non-transitively functionally dependent on every key in that relation. The coupling between the relations is based on accordant attributes. For every relation, a rectangle has to be drawn and every coupling is illustrated by a line that connects the relations. On the edge of each line, arrows indicate the cardinality: 1-to-n, 1-to-1 and n-to-n. The latter has to be avoided and must be replaced by two 1-to-n couplings.


Barker's notation

From Wikipedia, the free encyclopedia

Barker's notation refers to the ERD notation developed by Richard Barker, Ian Palmer, Harry Ellis et al. while working at the British consulting firm CACI around 1981. The notation was adopted by Barker when he joined Oracle and is effectively defined in his book Entity Relationship Modelling as part of the CASE Method series of books. This notation was, and still is, used by the Oracle CASE modelling tools. It is a variation of the "crow's foot" style of data modelling that was favoured by many over the original Chen style of ERD modelling because of its readability and efficient use of drawing space.

The notation has features that represent the properties of relationships, including cardinality and optionality (the crow's foot and dashing of lines), exclusion (the exclusion arc), recursion (looping structures) and use of abstraction (nested boxes).

Object-relational mapping

From Wikipedia, the free encyclopedia


Not to be confused with Object-Role Modeling.

Object-relational mapping (ORM, O/RM, and O/R mapping) in computer science is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to create their own ORM tools.


In object-oriented programming, data management tasks act on object-oriented (OO) objects that are almost always non-scalar values. For example, consider an address book entry that represents a single person along with zero or more phone numbers and zero or more addresses. This could be modeled in an object-oriented implementation by a "Person object" with attributes/fields to hold each data item that the entry comprises: the person's name, a list of phone numbers, and a list of addresses. The list of phone numbers would itself contain "PhoneNumber objects" and so on. The address book entry is treated as a single object by the programming language (it can be referenced by a single variable containing a pointer to the object, for instance). Various methods can be associated with the object, such as a method to return the preferred phone number, the home address, and so on.

However, many popular database products such as structured query language database management systems (SQL DBMS) can only store and manipulate scalar values such as integers and strings organized within tables. The programmer must either convert the object values into groups of simpler values for storage in the database (and convert them back upon retrieval), or only use simple scalar values within the program. Object-relational mapping is used to implement the first approach.[1]

The heart of the problem is translating the logical representation of the objects into an atomized form that is capable of being stored in the database, while preserving the properties of the objects and their relationships so that they can be reloaded as objects when needed. If this storage and retrieval functionality is implemented, the objects are said to be persistent.[1]
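As a rough illustration of this mapping problem (hand-rolled, not the API of any real ORM library), the sketch below flattens a Person object and its list of phone numbers into two SQLite tables and reassembles it on load.

```python
# Hand-rolled illustration of object-relational mapping: a non-scalar Person
# object with a list of phone numbers is flattened into two tables and
# reassembled on load. Table and field names are invented for this sketch.
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Person:
    person_id: int
    name: str
    phone_numbers: list = field(default_factory=list)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE phone  (person_id INTEGER REFERENCES person(person_id), number TEXT);
""")

def save(p: Person):
    conn.execute("INSERT INTO person VALUES (?, ?)", (p.person_id, p.name))
    conn.executemany("INSERT INTO phone VALUES (?, ?)",
                     [(p.person_id, n) for n in p.phone_numbers])

def load(person_id: int) -> Person:
    pid, name = conn.execute(
        "SELECT person_id, name FROM person WHERE person_id = ?", (person_id,)).fetchone()
    numbers = [row[0] for row in conn.execute(
        "SELECT number FROM phone WHERE person_id = ?", (person_id,))]
    return Person(pid, name, numbers)

save(Person(1, "Ada Lovelace", ["+44 20 7946 0001", "+44 20 7946 0002"]))
print(load(1))   # the persistent object comes back with its phone numbers
```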

Object-role modeling

From Wikipedia, the free encyclopedia

Not to be confused with Object-relational mapping.

example of an ORM2 diagram

Object-role modeling (ORM) is used to model the semantics of a universe of discourse. ORM is often used for data modeling and software engineering.

An object-role model uses graphical symbols that are based on first order predicate logic and set theory to enable the modeler to create an unambiguous definition of an arbitrary universe of discourse.


The term "object-role model" was coined in the 1970s and ORM based tools have been used for more than 30 years – principally for data modeling. More recently ORM has been used to model business rules, XML-Schemas, data warehouses, requirements engineering and web forms.[1]

Concepts

Overview of object-role model notation, Stephen M. Richard (1999).[2]

Facts

Object-role models are based on elementary facts, and expressed in diagrams that can be verbalised into natural language. A fact is a proposition such as "John Smith was hired on 5 January 1995" or "Mary Jones was hired on 3 March 2010".

With ORM, propositions such as these are abstracted into "fact types", for example "Person was hired on Date", and the individual propositions are regarded as sample data. The difference between a "fact" and an "elementary fact" is that an elementary fact cannot be simplified without loss of meaning. This "fact-based" approach facilitates modeling, transforming, and querying information from any domain.[3]

Attribute-free

ORM is attribute-free : unlike models in the entity relationship (ER) and Unified Modeling Language (UML) methods, ORM treats all elementary facts as relationships and so treats decisions for grouping facts into structures (e.g. attribute-based entity types, classes, relation schemes, XML schemas) as implementation concerns irrelevant to semantics. By avoiding attributes in the base model, ORM improves semantic stability and enables verbalization into natural language.


Fact-based modeling

Fact-based modeling includes procedures for mapping facts to attribute-based structures, such as those of ER or UML.[3]

Fact-based textual representations are based on formal subsets of native languages. ORM proponents argue that ORM models are easier to understand by people without a technical education. For example, proponents argue that object-role models are easier to understand than declarative languages such as Object Constraint Language (OCL) and other graphical languages such as UML class models.[3] Fact-based graphical notations are more expressive than those of ER and UML. An object-role model can be automatically mapped to relational and deductive databases (such as datalog).[4]

ORM 2 graphical notation

ORM2 is the latest generation of object-role modeling. The main objectives for the ORM 2 graphical notation are:[5]

More compact display of ORM models without compromising clarity

Improved internationalization (e.g. avoid English language symbols)

Simplified drawing rules to facilitate creation of a graphical editor

Extended use of views for selectively displaying/suppressing detail

Support for new features (e.g. role path delineation, closure aspects, modalities)

Design procedure

Example of the application of Object Role Modeling in a "Schema for Geologic Surface", Stephen M. Richard (1999).[2]

System development typically involves several stages such as: feasibility study; requirements analysis; conceptual design of data and operations; logical design; external design; prototyping; internal design and implementation; testing and validation; and maintenance. The seven steps of the conceptual schema design procedure are:[6]

1. Transform familiar information examples into elementary facts, and apply quality checks
2. Draw the fact types, and apply a population check
3. Check for entity types that should be combined, and note any arithmetic derivations
4. Add uniqueness constraints, and check arity of fact types
5. Add mandatory role constraints, and check for logical derivations
6. Add value, set comparison and subtyping constraints
7. Add other constraints and perform final checks

ORM's conceptual schema design procedure (CSDP) focuses on the analysis and design of data.

Data warehouse

From Wikipedia, the free encyclopedia


Data Warehouse Overview

In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a system used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.

The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc., shown in the figure to the right). The data may pass through an operational data store for additional operations before it is used in the DW for reporting.

Contents 1 Types of systems 2 Software tools 3 Benefits 4 Generic data warehouse environment 5 History 6 Information storage

o 6.1 Facts o 6.2 Dimensional vs. normalized approach for storage of data

7 Top-down versus bottom-up design methodologies o 7.1 Bottom-up design o 7.2 Top-down design o 7.3 Hybrid design

8 Data warehouses versus operational systems 9 Evolution in organization use 10 See also 11 References 12 Further reading 13 External links

Types of systems

Data mart

A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.[1]

Online analytical processing (OLAP)

OLAP is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day.

Online Transaction Processing (OLTP)

OLTP is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF).[2]
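A small sketch contrasting the two workloads on a single invented table follows; in practice OLTP and OLAP would run against separate systems, and SQLite is used only for illustration.

```python
# Contrast of OLTP-style and OLAP-style access on one invented table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL, sold_on TEXT)")

# OLTP-style work: many short writes, each touching a single row.
conn.executemany("INSERT INTO sales (region, amount, sold_on) VALUES (?, ?, ?)", [
    ("north", 120.0, "2014-01-05"),
    ("north",  80.0, "2014-01-06"),
    ("south", 200.0, "2014-02-01"),
])
conn.execute("UPDATE sales SET amount = 90.0 WHERE sale_id = 2")

# OLAP-style work: one complex read that scans and aggregates history.
for region, total, n in conn.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region"):
    print(region, total, n)
```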

Predictive analysis

Predictive analysis is about finding and quantifying hidden patterns in the data using complex mathematical models that can be used to predict future outcomes. Predictive analysis is different from OLAP in that OLAP focuses on historical data analysis and is reactive in nature, while predictive analysis focuses on the future. These systems are also used for CRM (Customer Relationship Management).[3]

Software tools

The typical extract-transform-load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.[4]

This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, cataloged and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support.[5] However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.
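The staging, integration, and access layers described above can be sketched with plain Python structures; the source records and the conforming rules are invented for illustration.

```python
# Toy walk through the three ETL layers; records and conforming rules are invented.

# Staging layer: raw extracts, kept as delivered by each source system.
staging = {
    "web_orders":   [{"cust": "Acme Ltd", "country": "UK", "amount": 100}],
    "store_orders": [{"customer": "acme ltd", "country": "United Kingdom", "amount": 250}],
}

# Integration layer (ODS): transform the raw rows into one conformed shape.
def conform(row):
    name = row.get("cust") or row.get("customer")
    country = {"UK": "GB", "United Kingdom": "GB"}.get(row["country"], row["country"])
    return {"customer": name.title(), "country": country, "amount": row["amount"]}

ods = [conform(r) for rows in staging.values() for r in rows]

# Access layer: arrange the integrated data into a dimension and a fact.
dim_customer = {c: i for i, c in enumerate(sorted({r["customer"] for r in ods}), start=1)}
fact_sales = [{"customer_key": dim_customer[r["customer"]], "amount": r["amount"]} for r in ods]

print(dim_customer)   # conformed dimension
print(fact_sales)     # facts keyed to the dimension
```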

Benefits

A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:

Congregate data from multiple sources into a single database so a single query engine can be used to present data.

Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long-running analysis queries in transaction processing databases.

Maintain data history, even if the source transaction systems do not.

Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.

Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.

Present the organization's information consistently.

Provide a single common data model for all data of interest regardless of the data's source.

Restructure the data so that it makes sense to the business users.

Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.

Add value to operational business applications, notably customer relationship management (CRM) systems.

Make decision-support queries easier to write.

Generic data warehouse environment

The environment for data warehouses and marts includes the following:

Source systems that provide data to the warehouse or mart;

Data integration technology and processes that are needed to prepare the data for use;

Different architectures for storing data in an organization's data warehouse or data marts;

Different tools and applications for the variety of users;

Metadata, data quality, and governance processes that must be in place to ensure that the warehouse or mart meets its purposes.

In regards to source systems listed above, Rainer states, “A common source for the data in data warehouses is the company’s operational databases, which can be relational databases”.[6]

Regarding data integration, Rainer states, “It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse”.[6]

Rainer discusses storing data in an organization's data warehouse or data marts.[6]

Metadata are data about data. “IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures“.[6]

Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers.[6] A “data warehouse” is a repository of historical data that are organized by subject to support decision makers in the organization.[6] Once data are stored in a data mart or warehouse, they can be accessed.


History

The concept of data warehousing dates back to the late 1980s[7] when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users.

Key developments in early years of data warehousing were:

1960s — General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[8]

1970s — ACNielsen and IRI provide dimensional data marts for retail sales.[8]

1970s — Bill Inmon begins to define and discuss the term: Data Warehouse.[citation needed]

1975 — Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform designed for building Information Centers (a forerunner of contemporary Enterprise Data Warehousing platforms).

1983 — Teradata introduces a database management system specifically designed for decision support.

1983 — Sperry Corporation's Martyn Richard Jones[9] defines the Sperry Information Center approach, which, while not being a true DW in the Inmon sense, did contain many of the characteristics of DW structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales. A subset of this work found its way into the much later papers of Devlin and Murphy.

1984 — Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.

1988 — Barry Devlin and Paul Murphy publish the article An architecture for a business and information system where they introduce the term "business data warehouse".[10]

1990 — Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.

1991 — Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.

1992 — Bill Inmon publishes the book Building the Data Warehouse.[11]

1995 — The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.

1996 — Ralph Kimball publishes the book The Data Warehouse Toolkit.[12]


2000 — Daniel Linstedt releases the Data Vault, enabling real-time auditable data warehouses.

In 2012 Bill Inmon developed and made public technology known as "textual disambiguation". Textual disambiguation applies context to raw text and reformats the raw text and context into a standard database format. Once raw text is passed through textual disambiguation, it can easily and efficiently be accessed and analyzed by standard business intelligence technology. Textual disambiguation is accomplished through the execution of textual ETL. Textual disambiguation is useful wherever raw text is found, such as in documents, Hadoop, email, and so forth.

Information storage

Facts

A fact is a value or measurement, which represents a fact about the managed entity or system.

Facts as reported by the reporting entity are said to be at raw level.

For example, if a BTS (base transceiver station) receives 1,000 requests for traffic channel allocation, allocates 820, and rejects the remaining ones, then it would report three facts or measurements to a management system:

tch_req_total = 1000 tch_req_success = 820 tch_req_fail = 180

Facts at raw level are further aggregated to higher levels in various dimensions to extract more service or business-relevant information out of it. These are called aggregates or summaries or aggregated facts.

For example, if there are three BTSs in a city, then the facts above can be aggregated from the BTS level to the city level in the network dimension.
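A minimal sketch of this aggregation step follows, assuming three hypothetical BTSs reporting the raw counters named above.

```python
# Aggregate raw BTS-level facts up to the city level. Per-BTS numbers are invented.
from collections import defaultdict

raw_facts = [
    {"bts": "BTS-1", "city": "Pune", "tch_req_total": 1000, "tch_req_success": 820},
    {"bts": "BTS-2", "city": "Pune", "tch_req_total": 1500, "tch_req_success": 1400},
    {"bts": "BTS-3", "city": "Pune", "tch_req_total": 700,  "tch_req_success": 650},
]

city_totals = defaultdict(lambda: {"tch_req_total": 0, "tch_req_success": 0})
for fact in raw_facts:
    agg = city_totals[fact["city"]]
    agg["tch_req_total"] += fact["tch_req_total"]
    agg["tch_req_success"] += fact["tch_req_success"]

for city, agg in city_totals.items():
    agg["tch_req_fail"] = agg["tch_req_total"] - agg["tch_req_success"]
    print(city, agg)   # aggregated facts at the city level of the network dimension
```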

Dimensional vs. normalized approach for storage of data

There are three or more leading approaches to storing data in a data warehouse — the most important approaches are the dimensional approach and the normalized approach.

The dimensional approach refers to Ralph Kimball's approach in which it is stated that the data warehouse should be modeled using a dimensional model/star schema. The normalized approach, also called the 3NF model (Third Normal Form), refers to Bill Inmon's approach in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.

In a dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
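The fact/dimension split can be sketched as a tiny star schema; the tables, sample rows, and query below are invented, with SQLite standing in for the warehouse database.

```python
# Tiny star schema: one fact table referencing two dimensions. All data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, order_date TEXT, quarter TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date(date_key),
                          product_key INTEGER REFERENCES dim_product(product_key),
                          quantity INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, "2014-01-15", "Q1"), (2, "2014-04-02", "Q2")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 100.0), (2, 1, 5, 50.0), (2, 2, 3, 90.0)])

# Typical dimensional query: slice the numeric facts by dimension attributes.
query = """
SELECT d.quarter, p.product_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.quarter, p.product_name
"""
for row in conn.execute(query):
    print(row)
```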

A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly.[citation needed] Dimensional structures are easy to understand for business users, because the structure is divided into measurements/facts and context/dimensions. Facts are related to the organization’s business processes and operational system whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008).

The main disadvantages of the dimensional approach are the following:

1. In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated.

2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical tables when the database is implemented (Kimball, Ralph 2008)[citation needed]. The main advantage of this approach is that it is straightforward to add information into the database. Some disadvantages of this approach are that, because of the number of tables involved, it can be difficult for users to join data from different sources into meaningful information and to access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

Both normalized and dimensional models can be represented in entity-relationship diagrams as both contain joined relational tables. The difference between the two models is the degree of normalization (also known as Normal Forms). These approaches are not mutually exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).

In Information-Driven Business,[13] Robert Hillard proposes an approach to comparing the two approaches based on the information needs of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even when the same fields are used in both models) but this extra information comes at the cost of usability. The technique measures information quantity in terms of information entropy and usability in terms of the Small Worlds data transformation measure.[14]

Top-down versus bottom-up design methodologies

Bottom-up design

Ralph Kimball [15] created an approach to data warehouse design known as bottom-up.[16] In the bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes.

These data marts can eventually be integrated to create a comprehensive data warehouse. The data warehouse bus architecture is primarily an implementation of "the bus", a collection of conformed dimensions and conformed facts, which are dimensions that are shared (in a specific way) between facts in two or more data marts.

Top-down design

Bill Inmon has defined a data warehouse as a centralized repository for the entire enterprise.[17] The top-down approach is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision, the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities. Gartner released a research note confirming Inmon's definition in 2005[18] with additional clarity. They also added one attribute.

Hybrid design

Data warehouse (DW) solutions often resemble the hub and spokes architecture. Legacy systems feeding the DW/BI solution often include customer relationship management (CRM) and enterprise resource planning solutions (ERP), generating large amounts of data. To consolidate these various data models, and facilitate the extract transform load (ETL) process, DW solutions often make use of an operational data store (ODS). The information from the ODS is then parsed into the actual DW. To reduce data redundancy, larger systems will often store the data in a normalized way. Data marts for specific reports can then be built on top of the DW solution.

The DW database in a hybrid solution is kept on third normal form to eliminate data redundancy. A normal relational database, however, is not efficient for business intelligence reports where dimensional modelling is prevalent. Small data marts can shop for data from the consolidated warehouse and use the filtered, specific data for the fact tables and dimensions required. The DW effectively provides a single source of information from which the data marts can read, creating a highly flexible solution from a BI point of view. The hybrid architecture allows a DW to be replaced with a master data management solution where operational, not static, information could reside.

The Data Vault Modeling components follow hub and spokes architecture. This modeling style is a hybrid design, consisting of the best practices from both third normal form and star schema. The Data Vault model is not a true third normal form, and breaks some of the rules that 3NF dictates be followed. It is, however, a top-down architecture with a bottom-up design. The Data Vault model is geared to be strictly a data warehouse. It is not geared to be end-user accessible; when built, it still requires the use of a data mart or star-schema-based release area for business purposes.

Data warehouses versus operational systems

Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of database normalization in order to ensure data integrity. Codd defined five increasingly stringent rules of normalization. Fully normalized database designs (that is, those satisfying all five Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.

Data warehouses are optimized for analytic access patterns. Analytic access patterns generally involve selecting specific fields and rarely if ever 'select *' as is more common in operational databases. Because of these differences in access patterns, operational databases (loosely, OLTP) benefit from the use of a row-oriented DBMS whereas analytics databases (loosely, OLAP) benefit from the use of a column-oriented DBMS. Unlike operational systems which maintain a snapshot of the business, data warehouses generally maintain an infinite history which is implemented through ETL processes that periodically migrate data from the operational systems over to the data warehouse.
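A toy illustration of why these access patterns favour different layouts: the same invented data held row-wise and column-wise, with the analytic query touching only one column in the columnar form.

```python
# Toy contrast of row-oriented vs column-oriented layouts for the same data.
rows = [  # row-oriented: each record kept together (suits OLTP point lookups)
    {"order_id": 1, "customer": "Acme", "amount": 100.0, "region": "north"},
    {"order_id": 2, "customer": "Bolt", "amount": 250.0, "region": "south"},
    {"order_id": 3, "customer": "Cord", "amount": 75.0,  "region": "north"},
]

columns = {  # column-oriented: each field stored contiguously (suits OLAP scans)
    key: [r[key] for r in rows] for key in rows[0]
}

# OLTP-style access: fetch one whole record by key.
print(next(r for r in rows if r["order_id"] == 2))

# OLAP-style access: aggregate a single column without reading the others.
print(sum(columns["amount"]))
```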

Evolution in organization use

These terms refer to the level of sophistication of a data warehouse:

Offline operational data warehouse

Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data


Offline data warehouse

Data warehouses at this stage are updated from data in the operational systems on a regular basis and the data warehouse data are stored in a data structure designed to facilitate reporting.

On time data warehouse

Online integrated data warehousing represents the real-time stage of data warehousing: data in the warehouse is updated for every transaction performed on the source data.

Integrated data warehouse

These data warehouses assemble data from different areas of business, so users can look up the information they need across other systems.

Data mart
From Wikipedia, the free encyclopedia

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.[1] This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.

Organizations build data warehouses and data marts because the information in their operational databases is not organized in a way that makes it easy to find what they need. Moreover, complicated queries might take a long time to answer because the database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated; data warehouses or marts, by contrast, are read-only. Data warehouses are designed to access large groups of related records.

Data marts improve end-user response time by allowing users to have access to the specific type of data they need to view most often by providing the data in a way that supports the collective view of a group of users.

A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization. Each data mart is dedicated to a specific business function or region. This subset of data may span across many or all of an enterprise’s functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit (different data marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing, sales, etc.).

The related term spreadmart is a derogatory label describing the situation that occurs when one or more business analysts develop a system of linked spreadsheets to perform a business analysis, then grow it to a size and degree of complexity that makes it nearly impossible to maintain.

Contents 1 Data Mart vs Data Warehouse 2 Design schemas 3 Reasons for creating a data mart 4 Dependent data mart 5 See also 6 References 7 Bibliography 8 External links

Data Mart vs Data Warehouse

Data Warehouse:

Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Does not necessarily use a dimensional model, but feeds dimensional models

Data Mart:

Often holds only one subject area, for example Finance or Sales
May hold more summarized data (although many hold full detail)
Concentrates on integrating information from a given subject area or set of source systems
Is built focused on a dimensional model using a star schema

Design schemas

Star schema - a fairly popular design choice; enables a relational database to emulate the analytical functionality of a multidimensional database
Snowflake schema (both styles are sketched below)
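A minimal sketch of the two schema styles, using SQLite from Python with invented table names: the star form keeps a flat, denormalized product dimension, while the snowflake form normalizes the category attribute out into its own table.

import sqlite3

con = sqlite3.connect(":memory:")
# Star: fact table plus a flat (denormalized) dimension.
con.executescript("""
CREATE TABLE fact_sales_star (product_key INTEGER, qty INTEGER, amount REAL);
CREATE TABLE dim_product_star (product_key INTEGER PRIMARY KEY,
                               product_name TEXT, category_name TEXT);
""")
# Snowflake: the category attribute is pushed out into its own table.
con.executescript("""
CREATE TABLE fact_sales_snow (product_key INTEGER, qty INTEGER, amount REAL);
CREATE TABLE dim_product_snow (product_key INTEGER PRIMARY KEY,
                               product_name TEXT, category_key INTEGER);
CREATE TABLE dim_category_snow (category_key INTEGER PRIMARY KEY,
                                category_name TEXT);
""")
# List the created tables; a report on the star form needs one join per
# dimension, while the snowflake form trades redundancy for extra joins.
print([r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")])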


Reasons for creating a data mart

Easy access to frequently needed data
Creates a collective view by a group of users
Improves end-user response time
Ease of creation
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse
Contains only business-essential data and is less cluttered

Dependent data mart

According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons (a minimal view-based sketch follows the list):

A need to refresh data for a special data model or schema: e.g., to restructure for OLAP
Performance: to offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse
Security: to separate an authorized data subset selectively
Expediency: to bypass the data governance and authorizations required to incorporate a new application on the enterprise data warehouse
Proving ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse
Politics: a coping strategy for IT (information technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse
Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse
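As a minimal sketch of the "logical subset (view)" case, the following Python/SQLite fragment (hypothetical table and department names) exposes a Finance-only view over a warehouse table; the mart is just a restricted window onto the warehouse.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE warehouse_transactions (dept TEXT, account TEXT, amount REAL);
INSERT INTO warehouse_transactions VALUES
  ('Finance', 'payroll', 1000.0),
  ('Sales',   'travel',   200.0);
-- The dependent "data mart" is a restricted view over the enterprise warehouse.
CREATE VIEW finance_mart AS
  SELECT account, amount FROM warehouse_transactions WHERE dept = 'Finance';
""")
print(con.execute("SELECT * FROM finance_mart").fetchall())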

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and inability to leverage enterprise sources of data.

The alternative school of data warehousing is that of Ralph Kimball. In his view, a data warehouse is nothing more than the union of all the data marts. This view helps to reduce costs and provides fast development, but can create an inconsistent data warehouse, especially in large organizations. Therefore, Kimball's approach is more suitable for small-to-medium corporations.[2]


Data integration
From Wikipedia, the free encyclopedia

Data integration involves combining data residing in different sources and providing users with a unified view of these data.[1] This process becomes significant in a variety of situations, both commercial (for example, when two similar companies need to merge their databases) and scientific (for example, combining research results from different bioinformatics repositories). Data integration appears with increasing frequency as the volume of data, and the need to share existing data, explodes.[2] It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. In management circles, people frequently refer to data integration as "Enterprise Information Integration" (EII).

Contents 1 History 2 Example 3 Theory of data integration 3.1 Definitions 3.2 Query processing 4 Data Integration in the Life Sciences 5 See also 6 References 7 Further reading


History

Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from the source databases, transforms it and then loads it into the data warehouse.

Figure 2: Simple schematic for a data-integration solution. A system designer constructs a mediated schema against which users can run queries. The virtual database interfaces with the source databases via wrapper code if required.

Issues with combining heterogeneous data sources under a single query interface have existed for some time. The rapid adoption of databases after the 1960s naturally led to the need to share or to merge existing repositories. This merging can take place at several levels in the database architecture.

One popular solution is implemented based on data warehousing (see figure 1). The warehouse system extracts, transforms, and loads data from heterogeneous sources into a single view schema so that the data become compatible with each other. This approach offers a tightly coupled architecture: because the data are already physically reconciled in a single queryable repository, it usually takes little time to resolve queries. However, problems arise with data freshness; that is, information in the warehouse is not always up to date. Updating an original data source may leave the warehouse stale, so the ETL process needs to be re-executed for synchronization. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services like travel or classified advertisement web applications.


As of 2009 the trend in data integration has favored loosening the coupling between data[citation needed] and providing a unified query interface to access real-time data over a mediated schema (see figure 2), which allows information to be retrieved directly from the original databases. This approach relies on mappings between the mediated schema and the schemas of the original sources, and transforms a query on the mediated schema into specialized queries that match the schemas of the original databases. Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to entities in the original sources (the "Global As View" (GAV) approach), or as a mapping from entities in the original sources to the mediated schema (the "Local As View" (LAV) approach). The latter approach requires more sophisticated inferences to resolve a query on the mediated schema, but makes it easier to add new data sources to a (stable) mediated schema.

As of 2010 some of the work in data integration research concerns the semantic integration problem. This problem addresses not the structuring of the architecture of the integration, but how to resolve semantic conflicts between heterogeneous data sources. For example if two companies merge their databases, certain concepts and definitions in their respective schemas like "earnings" inevitably have different meanings. In one database it may mean profits in dollars (a floating-point number), while in the other it might represent the number of sales (an integer). A common strategy for the resolution of such problems involves the use of ontologies which explicitly define schema terms and thus help to resolve semantic conflicts. This approach represents ontology-based data integration. On the other hand, the problem of combining research results from different bioinformatics repositories requires bench-marking of the similarities, computed from different data sources, on a single criterion such as positive predictive value. This enables the data sources to be directly comparable and can be integrated even when the natures of experiments are distinct.[3]

As of 2011 it was determined that current data modeling methods were imparting data isolation into every data architecture in the form of islands of disparate data and information silos each of which represents a disparate system. This data isolation is an unintended artifact of the data modeling methodology that results in the development of disparate data models.[4] Disparate data models, when instantiated as databases, form disparate databases. Enhanced data model methodologies have been developed to eliminate the data isolation artifact and to promote the development of integrated data models.[5] [6] One enhanced data modeling method recasts data models by augmenting them with structural metadata in the form of standardized data entities. As a result of recasting multiple data models, the set of recast data models will now share one or more commonality relationships that relate the structural metadata now common to these data models. Commonality relationships are a peer-to-peer type of entity relationships that relate the standardized data entities of multiple data models. Multiple data models that contain the same standard data entity may participate in the same commonality relationship. When integrated data models are instantiated as databases and are properly populated from a common set of master data, then these databases are integrated.

Example

Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels, demographics, etc.). Traditionally, the information must be stored in a single database with a single schema. But any single enterprise would find information of this breadth somewhat difficult and expensive to collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases, weather websites, and census data.

A data-integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in "virtual data integration". This means application-developers construct a virtual schema — the mediated schema — to best model the kinds of answers their users want. Next, they design "wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution (see figure 2). When an application-user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user's query.
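A toy sketch of this wrapper idea, with invented sources and attributes: each wrapper translates its source's local answer into a common city_info shape, and the mediator simply fans the query out and unions the wrapped results.

def crime_wrapper(city):
    crime_db = {"Springfield": 42}           # stand-in for a crime database
    return [("city_info", city, "crime_rate", crime_db.get(city))]

def weather_wrapper(city):
    weather_site = {"Springfield": "rainy"}  # stand-in for a weather website
    return [("city_info", city, "weather", weather_site.get(city))]

def query_mediated_schema(city):
    # The mediator rewrites one query over the mediated schema into one query
    # per source, then combines the wrapped results into a single answer.
    results = []
    for wrapper in (crime_wrapper, weather_wrapper):
        results.extend(wrapper(city))
    return results

print(query_mediated_schema("Springfield"))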

This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for them. It contrasts with ETL systems or with a single database solution, which require manual integration of an entire new dataset into the system. Virtual ETL solutions leverage the virtual mediated schema to implement data harmonization, whereby the data are copied from the designated "master" source to the defined targets, field by field. Advanced data virtualization is also built on the concept of object-oriented modeling in order to construct a virtual mediated schema or virtual metadata repository, using a hub-and-spoke architecture.

Each data source is disparate and as such is not designed to support reliable joins between data sources. Therefore, data virtualization as well as data federation depends upon accidental data commonality to support combining data and information from disparate data sets. Because of this lack of data value commonality across data sources, the return set may be inaccurate, incomplete, and impossible to validate.

One solution is to recast disparate databases to integrate these databases without the need for ETL. The recast databases support commonality constraints where referential integrity may be enforced between databases. The recast databases provide designed data access paths with data value commonality across databases.

Theory of data integration

The theory of data integration[1] forms a subset of database theory and formalizes the underlying concepts of the problem in first-order logic. Applying the theories gives indications as to the feasibility and difficulty of data integration. While its definitions may appear abstract, they have sufficient generality to accommodate all manner of integration systems.[citation needed]

Definitions

Data integration systems are formally defined as a triple ⟨G, S, M⟩, where G is the global (or mediated) schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the global schemas. Both G and S are expressed in languages over alphabets composed of symbols for each of their respective relations. The mapping M consists of assertions between queries over G and queries over S. When users pose queries over the data integration system, they pose queries over G, and the mapping then asserts connections between the elements in the global schema and the source schemas.

A database over a schema is defined as a set of sets, one for each relation (in a relational database). The database corresponding to the source schema S would comprise the set of sets of tuples for each of the heterogeneous data sources and is called the source database. Note that this single source database may actually represent a collection of disconnected databases.

The database corresponding to the virtual mediated schema G is called the global database. The global database must satisfy the mapping M with respect to the source database. The legality of this mapping depends on the nature of the correspondence between G and S. Two popular ways to model this correspondence exist: Global as View (GAV) and Local as View (LAV).

Figure 3: Illustration of tuple space of the GAV and LAV mappings.[7] In GAV, the system is constrained to the set of tuples mapped by the mediators while the set of tuples expressible over the sources may be much larger and richer. In LAV, the system is constrained to the set of tuples in the sources while the set of tuples expressible over the global schema can be much larger. Therefore LAV systems must often deal with incomplete answers.

GAV systems model the global database as a set of views over S. In this case M associates to each element of G a query over S. Query processing becomes a straightforward operation because of the well-defined associations between G and S. The burden of complexity falls on implementing mediator code instructing the data integration system exactly how to retrieve elements from the source databases. If any new sources join the system, considerable effort may be necessary to update the mediator, thus the GAV approach appears preferable when the sources seem unlikely to change.

In a GAV approach to the example data integration system above, the system designer would first develop mediators for each of the city information sources and then design the global schema around these mediators. For example, consider if one of the sources served a weather website. The designer would likely then add a corresponding element for weather to the global schema. Then the bulk of effort concentrates on writing the proper mediator code that will transform predicates on weather into a query over the weather website. This effort can become complex if some other source also relates to weather, because the designer may need to write code to properly combine the results from the two sources.
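A small sketch of a GAV-style mapping (invented relations): the global relation weather(city, temp) is defined directly as a union over the two sources, so answering a query over the global schema is just a matter of expanding that definition.

# Two hypothetical sources, each shaped as weather_site_x(city, temp).
source_a = [("Springfield", 21), ("Shelbyville", 19)]
source_b = [("Ogdenville", 25)]

def global_weather():
    # GAV mapping: weather(city, temp) is defined as the union of the two
    # source relations, i.e. each global element is a query over the sources.
    return source_a + source_b

def query_weather(city):
    # A query over the global schema expands into queries over the sources.
    return [t for (c, t) in global_weather() if c == city]

print(query_weather("Ogdenville"))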

On the other hand, in LAV, the source database is modeled as a set of views over G. In this case M associates to each element of S a query over G. Here the exact associations between G and S are no longer well-defined. As is illustrated in the next section, the burden of determining how to retrieve elements from the sources is placed on the query processor. The benefit of LAV modeling is that new sources can be added with far less work than in a GAV system, thus the LAV approach should be favored in cases where the mediated schema is more stable and unlikely to change.[1]

In an LAV approach to the example data integration system above, the system designer designs the global schema first and then simply inputs the schemas of the respective city information sources. Consider again if one of the sources serves a weather website. The designer would add corresponding elements for weather to the global schema only if none existed already. Then programmers write an adapter or wrapper for the website and add a schema description of the website's results to the source schemas. The complexity of adding the new source moves from the designer to the query processor.

Query processing

The theory of query processing in data integration systems is commonly expressed using conjunctive queries and Datalog, a purely declarative logic programming language.[8] One can loosely think of a conjunctive query as a logical function applied to the relations of a database, such as "f(A, B) where A < B". If a tuple or set of tuples is substituted into the rule and satisfies it (makes it true), then we consider that tuple as part of the set of answers to the query. While formal languages like Datalog express these queries concisely and without ambiguity, common SQL queries count as conjunctive queries as well.

In terms of data integration, "query containment" represents an important property of conjunctive

queries. A query contains another query (denoted ) if the results of

applying are a subset of the results of applying for any database. The two queries are said to be equivalent if the resulting sets are equal for any database. This is important

Page 62: Business Analytics

because in both GAV and LAV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries. Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by our user's query. This corresponds to the problem of answering queries using views (AQUV).[9]
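The containment definition can be illustrated on a single toy database (invented relations R and S); note that checking one database only illustrates the definition, since real containment must hold for every database.

# Q1(x) :- R(x, y), S(y)    and    Q2(x) :- R(x, y)
# Every answer of Q1 is an answer of Q2, i.e. Q2 contains Q1.
R = {(1, "a"), (2, "b"), (3, "c")}
S = {"a", "b"}
q1 = {x for (x, y) in R if y in S}   # answers of Q1 on this database
q2 = {x for (x, y) in R}             # answers of Q2 on this database
print(q1, q2, q1 <= q2)              # q1 is a subset of q2 here, as expected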

In GAV systems, a system designer writes mediator code to define the query-rewriting. Each element in the user's query corresponds to a substitution rule just as each element in the global schema corresponds to a query over the source. Query processing simply expands the subgoals of the user's query according to the rule specified in the mediator and thus the resulting query is likely to be equivalent. While the designer does the majority of the work beforehand, some GAV systems such as Tsimmis involve simplifying the mediator description process.

In LAV systems, queries undergo a more radical process of rewriting because no mediator exists to align the user's query with a simple expansion strategy. The integration system must execute a search over the space of possible queries in order to find the best rewrite. The resulting rewrite may not be an equivalent query but maximally contained, and the resulting tuples may be incomplete. As of 2009 the MiniCon algorithm[9] is the leading query rewriting algorithm for LAV data integration systems.

In general, the complexity of query rewriting is NP-complete.[9] If the space of rewrites is relatively small this does not pose a problem — even for integration systems with hundreds of sources.

Data Integration in the Life Sciences

Large-scale questions in science, such as global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets for meta-analysis. This type of data integration is especially challenging for ecological and environmental data because metadata standards are not agreed upon and there are many different data types produced in these fields. National Science Foundation initiatives such as Datanet are intended to make data integration easier for scientists by providing cyberinfrastructure and setting standards. The five funded Datanet initiatives are DataONE,[10] led by William Michener at the University of New Mexico; The Data Conservancy,[11] led by Sayeed Choudhury of Johns Hopkins University; SEAD: Sustainable Environment through Actionable Data,[12] led by Margaret Hedstrom of the University of Michigan; the DataNet Federation Consortium,[13] led by Reagan Moore of the University of North Carolina; and Terra Populus,[14] led by Steven Ruggles of the University of Minnesota. The Research Data Alliance[15] has more recently explored creating global data integration frameworks.

Online transaction processing
From Wikipedia, the free encyclopedia (Redirected from OLTP)


Online transaction processing, or OLTP, is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions.[1] OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automated teller machine (ATM) for a bank is an example of a commercial transaction processing application. Online transaction processing applications are high-throughput and insert- or update-intensive in database management. These applications are used concurrently by hundreds of users. The key goals of OLTP applications are availability, speed, concurrency and recoverability.[2] Reduced paper trails and faster, more accurate forecasts of revenues and expenses are both examples of how OLTP makes things simpler for businesses. However, like many modern online information technology solutions, some systems require offline maintenance, which further affects the cost-benefit analysis of an online transaction processing system.

Contents 1 What Is an OLTP System? 2 Online Transaction Processing Systems Design 3 Contrasted to 4 See also 5 References 6 External links

What Is an OLTP System?

An OLTP system is a popular data processing system in today's enterprises. Some examples of OLTP systems include order entry, retail sales, and financial transaction systems.[3] Online transaction processing systems increasingly require support for transactions that span a network and may include more than one company. For this reason, modern online transaction processing software uses client/server processing and brokering software that allows transactions to run on different computer platforms in a network.

In large applications, efficient OLTP may depend on sophisticated transaction management software (such as CICS) and/or database optimization tactics to facilitate the processing of large numbers of concurrent updates to an OLTP-oriented database.

For even more demanding decentralized database systems, OLTP brokering programs can distribute transaction processing among multiple computers on a network. OLTP is often integrated into service-oriented architecture (SOA) and Web services.


Online transaction processing (OLTP) involves gathering input information, processing that information, and updating existing information to reflect the gathered and processed information. Today, most organizations use a database management system to support OLTP, and OLTP is typically carried out in a client-server system.

Online transaction processing is concerned with concurrency and atomicity. Concurrency controls guarantee that two users accessing the same data in the database system cannot change that data simultaneously; one user must wait until the other has finished processing before changing that piece of data. Atomicity controls guarantee that all the steps in a transaction are completed successfully as a group; that is, if any step in the transaction fails, all the other steps must fail also.[4]
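A minimal sketch of atomicity using Python's built-in sqlite3 module (hypothetical accounts table): the two updates of a transfer either both commit or both roll back.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
con.commit()

def transfer(amount, src, dst):
    try:
        with con:  # one transaction: commit on success, roll back on any error
            con.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            if con.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # forces rollback of both steps
            con.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
    except ValueError:
        pass  # the failed transfer leaves both balances unchanged

transfer(500.0, 1, 2)   # fails and rolls back atomically
transfer(25.0, 1, 2)    # succeeds and commits
print(con.execute("SELECT * FROM accounts").fetchall())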

Online Transaction Processing Systems Design

To build an OLTP system, a designer must ensure that a large number of concurrent users does not interfere with the system's performance. To increase the performance of an OLTP system, a designer should avoid excessive use of indexes and clusters.

The following elements are crucial for the performance of OLTP systems:[5]

Rollback segments

Rollback segments are the portions of database that record the actions of transactions in the event that a transaction is rolled back. Rollback segments provide read consistency, roll back transactions, and recover the database.[6]

Clusters

A cluster is a schema that contains one or more tables that have one or more columns in common. Clustering tables in database improves the performance of join operation.[7]

Discrete transactions

All changes to the data are deferred until the transaction commits during a discrete transaction. It can improve the performance of short, non-distributed transaction.[8]

Block (data storage) size

The data block size should be a multiple of the operating system's block size within the maximum limit to avoid unnecessary I/O.[9]

Buffer cache size

To avoid unnecessary resource consumption, tune SQL statements to use the database buffer cache.[10]


Dynamic allocation of space to tables and rollback segments

Transaction processing monitors and the multi-threaded server

A transaction processing monitor is used for coordination of services. It is like an operating system and does the coordination at a high level of granularity and can span multiple computing devices.[11]

Partition (database)

Partitioning increases performance for sites that have regular transactions while still maintaining availability and security.[12]

Database tuning

With database tuning, an OLTP system can maximize its performance as efficiently and rapidly as possible.

Batch processing
From Wikipedia, the free encyclopedia


Batch processing is the execution of a series of programs ("jobs") on a computer without manual intervention.

Jobs are set up so they can be run to completion without human interaction. All input parameters are predefined through scripts, command-line arguments, control files, or job control language. This is in contrast to "online" or interactive programs, which prompt the user for such input. A program takes a set of data files as input, processes the data, and produces a set of output data files. This operating environment is termed "batch processing" because the input data are collected into batches, or sets of records, and each batch is processed as a unit. The output is another batch that can be reused for computation.
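A minimal sketch of a batch job (invented file names and layout): all parameters are fixed up front, a batch of input records is processed without any interaction, and the result is written out as another batch.

import csv
import sys

INPUT_FILE = "daily_orders.csv"     # assumed layout: order_id,amount
OUTPUT_FILE = "daily_totals.csv"

def run_batch(input_path, output_path):
    # Read the whole input batch, aggregate it, and write one output batch.
    total = 0.0
    count = 0
    with open(input_path, newline="") as src:
        for row in csv.DictReader(src):
            total += float(row["amount"])
            count += 1
    with open(output_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["records", "total_amount"])
        writer.writerow([count, total])

if __name__ == "__main__":
    # Typically scheduled (e.g. nightly) rather than run interactively:
    #   python batch_totals.py daily_orders.csv daily_totals.csv
    in_path = sys.argv[1] if len(sys.argv) > 1 else INPUT_FILE
    out_path = sys.argv[2] if len(sys.argv) > 2 else OUTPUT_FILE
    run_batch(in_path, out_path)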

Data mining
From Wikipedia, the free encyclopedia
Not to be confused with analytics, information extraction, or data analysis.


Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD),[1] an interdisciplinary subfield of computer science,[2][3][4] is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.[2] The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.[2] Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.[2]

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself.[5] It is also a buzzword[6] and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The popular book "Data mining: Practical machine learning tools and techniques with Java"[7] (which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons.[8] Often the more general terms "(large scale) data analysis" or "analytics" – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by


a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Contents 1 Etymology 2 Background 2.1 Research and evolution 3 Process 3.1 Pre-processing 3.2 Data mining 3.3 Results validation 4 Standards 5 Notable uses 5.1 Games 5.2 Business 5.3 Science and engineering 5.4 Human rights 5.5 Medical data mining 5.6 Spatial data mining 5.7 Temporal data mining 5.8 Sensor data mining 5.9 Visual data mining 5.10 Music data mining 5.11 Surveillance 5.12 Pattern mining 5.13 Subject-based data mining 5.14 Knowledge grid 6 Privacy concerns and ethics 6.1 Situation in the United States 6.2 Situation in Europe 7 Software 7.1 Free open-source data mining software and applications 7.2 Commercial data-mining software and applications 7.3 Marketplace surveys 8 See also 9 References 10 Further reading 11 External links


Etymology

In the 1960s, statisticians used terms like "data fishing" or "data dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" appeared around 1990 in the database community. For a short time in the 1980s the phrase "database mining"™ was used, but since it was trademarked by HNC, a San Diego-based company (now merged into FICO), to pitch their Database Mining Workstation,[9] researchers consequently turned to "data mining". Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases" for the first workshop on the topic (KDD-1989), and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities.[10] Currently, data mining and knowledge discovery are used interchangeably. Since about 2007 "predictive analytics", and since 2011 "data science", have also been used to describe this field.

Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns[11] in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever larger data sets.

Research and evolution

The premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD).[12][13] Since 1989 this ACM SIG has hosted an annual international conference and published its proceedings,[14] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[15]

Computer science conferences on data mining include:

CIKM Conference – ACM Conference on Information and Knowledge Management
DMIN Conference – International Conference on Data Mining
DMKD Conference – Research Issues on Data Mining and Knowledge Discovery
ECDM Conference – European Conference on Data Mining
ECML-PKDD Conference – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
EDM Conference – International Conference on Educational Data Mining
ICDM Conference – IEEE International Conference on Data Mining
KDD Conference – ACM SIGKDD Conference on Knowledge Discovery and Data Mining
MLDM Conference – Machine Learning and Data Mining in Pattern Recognition
PAKDD Conference – The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining
PAW Conference – Predictive Analytics World
SDM Conference – SIAM International Conference on Data Mining (SIAM)
SSTD Symposium – Symposium on Spatial and Temporal Databases
WSDM Conference – ACM Conference on Web Search and Data Mining

Data mining topics are also present at many data management/database conferences, such as the ICDE Conference, the SIGMOD Conference, and the International Conference on Very Large Data Bases.

Process

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data mining
(5) Interpretation/Evaluation.[1]

Many variations on this theme exist, however, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment

or a simplified process such as (1) pre-processing, (2) data mining, and (3) results validation.

Polls conducted in 2002, 2004, and 2007 show that the CRISP-DM methodology is the leading methodology used by data miners.[16][17][18] The only other data mining standard named in these polls was SEMMA. However, 3-4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models,[19][20] and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.[21]


Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
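A small sketch of this cleaning step on invented records: rows with missing values or implausible (noisy) values are dropped before mining.

raw_records = [
    {"customer": "Acme",   "age": 34,   "spend": 120.0},
    {"customer": "Globex", "age": None, "spend": 75.5},    # missing value
    {"customer": "Hooli",  "age": 29,   "spend": -9999.0}, # implausible / noisy
]

def clean(records, spend_range=(0.0, 10_000.0)):
    cleaned = []
    for r in records:
        if any(v is None for v in r.values()):
            continue                        # drop rows with missing data
        if not spend_range[0] <= r["spend"] <= spend_range[1]:
            continue                        # drop out-of-range (noisy) rows
        cleaned.append(r)
    return cleaned

target_set = clean(raw_records)
print(target_set)   # only the first record survives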

Data mining

Data mining involves six common classes of tasks (a brief sketch of two of these tasks follows the list):[1]

Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records, that might be interesting or data errors that require further investigation.

Association rule learning (Dependency modeling) – Searches for relationships between variables. For example a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – attempts to find a function which models the data with the least error.

Summarization – providing a more compact representation of the data set, including visualization and report generation.
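The following standard-library sketch (toy numbers, not from the text) illustrates two of these task classes: clustering one-dimensional purchase amounts with a simple 2-means loop, and classifying a new e-mail with a 1-nearest-neighbour rule. Real systems would use dedicated libraries.

from statistics import mean

# Clustering: 2-means on one-dimensional purchase amounts.
amounts = [5, 6, 7, 40, 42, 45]
centroids = [min(amounts), max(amounts)]
for _ in range(10):                                   # a few refinement passes
    groups = [[], []]
    for a in amounts:
        # Assign each value to the nearer centroid (False -> 0, True -> 1).
        groups[abs(a - centroids[0]) > abs(a - centroids[1])].append(a)
    centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
print("clusters:", groups)

# Classification: 1-nearest-neighbour spam/legitimate labelling,
# where each training example is (number of links in the e-mail, label).
training = [(2, "legitimate"), (3, "legitimate"), (9, "spam"), (11, "spam")]
def classify(links):
    return min(training, key=lambda ex: abs(ex[0] - links))[1]
print(classify(10))   # -> 'spam'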

Results validation

Data mining can unintentionally be misused and can then produce results which appear to be significant but which do not actually predict future behavior, cannot be reproduced on a new sample of data, and are of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train/test split - when applicable at all - may not be sufficient to prevent this from happening.[citation needed]


The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish "spam" from "legitimate" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. A number of statistical methods may be used to evaluate the algorithm, such as ROC curves.
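A toy sketch of this train/test discipline (invented e-mails): a trivial "pattern" is learned from the training set and accuracy is measured only on the held-out test e-mails.

training = [("win money now", "spam"), ("meeting at noon", "legitimate"),
            ("cheap money offer", "spam"), ("project status update", "legitimate")]
test = [("money back guarantee", "spam"), ("status of the meeting", "legitimate")]

# "Learned" pattern: words seen in spam but never in legitimate training mail.
spam_words = {w for text, label in training if label == "spam" for w in text.split()}
ham_words = {w for text, label in training if label == "legitimate" for w in text.split()}
spam_markers = spam_words - ham_words

def predict(text):
    return "spam" if any(w in spam_markers for w in text.split()) else "legitimate"

# Evaluate only on e-mails the "algorithm" has never seen.
correct = sum(predict(text) == label for text, label in test)
print("test accuracy:", correct / len(test))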

If the learned patterns do not meet the desired standards, it is then necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

Standards

There have been some efforts to define standards for the data mining process, for example the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development of successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has since stalled. JDM 2.0 was withdrawn without reaching a final draft.

For exchanging the extracted models – in particular for use in predictive analytics – the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.[22]

Notable uses


See also Category:Applied data mining.

Games


Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to fully acquire the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases – combined with an intensive study of tablebase answers to well-designed problems, and with knowledge of prior art (i.e., pre-tablebase knowledge) – is used to yield insightful patterns. Berlekamp (in dots-and-boxes, etc.) and John Nunn (in chess endgames) are notable examples of researchers doing this work, though they were not – and are not – involved in tablebase generation.

Business

In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information. Examples of what businesses use data mining for include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers with more accuracy.[23]

In today’s world raw data is being collected by companies at an exploding rate. For example, Walmart processes over 20 million point-of-sale transactions every day. This information is stored in a centralized database, but would be useless without some type of data mining software to analyze it. If Walmart analyzed their point-of-sale data with data mining techniques they would be able to determine sales trends, develop marketing campaigns, and more accurately predict customer loyalty.[24]

Every time a credit card or a store loyalty card is used, or a warranty card is filled in, data is collected about the user's behavior. Many people find the amount of information stored about them by companies such as Google, Facebook, and Amazon disturbing and are concerned about privacy. Although there is the potential for our personal data to be used in harmful or unwanted ways, it is also being used to make our lives better. For example, Ford and Audi hope to one day collect information about customer driving patterns so they can recommend safer routes and warn drivers about dangerous road conditions.[25]

Data mining in customer relationship management applications can contribute significantly to the bottom line.[citation needed] Rather than randomly contacting a prospect or customer through a call center or sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict to which channel and to which offer an individual is most likely to respond (across all potential offers). Additionally, sophisticated applications could be used to automate mailing. Once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can either automatically send an e-mail or a regular mail. Finally, in cases where many people will take an action without an offer, "uplift modeling" can be used to determine which people have the greatest increase in response if given an offer. Uplift modeling thereby enables marketers to focus mailings and offers on persuadable people, and not to send offers to people who will buy the product without an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but also they recognize that the number of predictive models can quickly become very large. For example, rather than using one model to predict how many customers will churn, a business may choose to build a separate model for each region and customer type. In situations where a large number of models need to be maintained, some businesses turn to more automated data mining methodologies.

Data mining can be helpful to human resources (HR) departments in identifying the characteristics of their most successful employees. Information obtained – such as universities attended by highly successful employees – can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.[26]

Market basket analysis relates to data-mining use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favor silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of them is easier. The example deals with association rules within transaction-based data. Not all data are transaction-based, and logical or inexact rules may also be present within a database.
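A small sketch of the underlying counting (invented transactions): the support and confidence of a co-occurrence such as silk shirt and tie are the raw ingredients of an association rule.

from itertools import combinations
from collections import Counter

transactions = [
    {"silk shirt", "tie"},
    {"silk shirt", "tie", "belt"},
    {"cotton shirt", "belt"},
    {"silk shirt", "belt"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of all baskets containing the pair.
support = {pair: n / len(transactions) for pair, n in pair_counts.items()}
# Confidence of the rule "silk shirt -> tie": among baskets with a silk shirt,
# the fraction that also contain a tie.
confidence = pair_counts[("silk shirt", "tie")] / sum("silk shirt" in b for b in transactions)
print(support[("silk shirt", "tie")], confidence)   # support 0.5, confidence ~0.67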

Market basket analysis has been used to identify the purchase patterns of the Alpha Consumer. Analyzing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.[citation needed]

Data mining is a highly effective tool in the catalog marketing industry.[citation needed] Catalogers have a rich database of customer transaction history for millions of customers dating back a number of years. Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.

Data mining for business applications can be integrated into a complex modeling and decision making process.[27] Reactive business intelligence (RBI) advocates a "holistic" approach that integrates data mining, modeling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning.[28]


In the area of decision making, the RBI approach has been used to mine knowledge that is progressively acquired from the decision maker and then to self-tune the decision method accordingly.[29] The relation between the quality of a data mining system and the amount of investment that the decision maker is willing to make was formalized by providing an economic perspective on the value of "extracted knowledge" in terms of its payoff to the organization.[27] This decision-theoretic classification framework[27] was applied to a real-world semiconductor wafer manufacturing line, where decision rules for effectively monitoring and controlling the fabrication line were developed.[30]

An example of data mining related to an integrated-circuit (IC) production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing."[31] In this paper, the application of data mining and decision analysis to the problem of die-level functional testing is described. The experiments described demonstrate the ability to apply a system of mining historical die-test data to create a probabilistic model of patterns of die failure. These patterns are then utilized to decide, in real time, which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products. Other examples[32][33] of the application of data mining methodologies in semiconductor manufacturing environments suggest that data mining methodologies may be particularly useful when data is scarce and the various physical and chemical parameters that affect the process exhibit highly complex interactions. Another implication is that on-line monitoring of the semiconductor manufacturing process using data mining may be highly effective.

Science and engineering

In recent years, data mining has been used widely in the areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering.

In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between the inter-individual variations in human DNA sequence and the variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer, which is of great importance to improving methods of diagnosing, preventing, and treating these diseases. One data mining method that is used to perform this task is known as multifactor dimensionality reduction.[34]

In the area of electrical power engineering, data mining methods have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on, for example, the status of the insulation (or other important safety-related parameters). Data clustering techniques such as the self-organizing map (SOM) have been applied to vibration monitoring and analysis of transformer on-load tap changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Obviously, different tap positions will generate different signals. However, there was considerable variability amongst normal condition signals for exactly the same tap position. SOM has been applied to detect abnormal conditions and to hypothesize about the nature of the abnormalities.[35]

Data mining methods have also been applied to dissolved gas analysis (DGA) in power transformers. DGA, as a diagnostic technique for power transformers, has been available for many years. Methods such as SOM have been applied to analyze the generated data and to determine trends which are not obvious to the standard DGA ratio methods (such as the Duval Triangle).[35]
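
The sketch below only illustrates the general SOM-based anomaly detection idea, not the method of the cited studies: a small map is trained on feature vectors from "normal" signals (synthetic stand-ins here), and new signals are flagged when their distance to the best-matching unit is unusually large. It assumes the third-party minisom package.

```python
# Illustrative SOM anomaly-detection sketch on synthetic "vibration features".
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
normal_features = rng.normal(size=(500, 8))   # stand-in for features of normal signals

som = MiniSom(6, 6, input_len=8, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(normal_features)
som.train_random(normal_features, num_iteration=5000)

def anomaly_score(x: np.ndarray) -> float:
    # Distance from the sample to the weight vector of its best-matching unit.
    w = som.get_weights()[som.winner(x)]
    return float(np.linalg.norm(x - w))

# Flag anything whose score exceeds the 99th percentile of normal-signal scores.
threshold = np.quantile([anomaly_score(x) for x in normal_features], 0.99)
new_signal = rng.normal(size=8) * 3
print("abnormal" if anomaly_score(new_signal) > threshold else "normal")
```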

In educational research, data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning,[36] and to understand the factors influencing university student retention.[37] A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized, and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

Other examples of applying data mining include the mining of biomedical data facilitated by domain ontologies,[38] mining clinical trial data,[39] and traffic analysis using SOM.[40]

In adverse drug reaction surveillance, the Uppsala Monitoring Centre has, since 1998, used data mining methods to routinely screen for reporting patterns indicative of emerging drug safety issues in the WHO global database of 4.6 million suspected adverse drug reaction incidents.[41] Recently, similar methodology has been developed to mine large collections of electronic health records for temporal patterns associating drug prescriptions to medical diagnoses.[42]

Data mining has also been applied to software artifacts within the realm of software engineering, an area known as Mining Software Repositories.

Human rights

Data mining of government records – particularly records of the justice system (i.e., courts, prisons) – enables the discovery of systemic human rights violations in connection to generation and publication of invalid or fraudulent legal records by various government agencies.[43][44]

Medical data mining

In 2011, the case of Sorrell v. IMS Health, Inc., decided by the Supreme Court of the United States, ruled that pharmacies may share information with outside companies. The practice was held to be authorized under the First Amendment of the Constitution, which protects freedom of speech.[45] The passage of the Health Information Technology for Economic and Clinical Health Act (HITECH Act) helped to initiate the adoption of the electronic health record (EHR) and supporting technology in the United States.[46] The HITECH Act was signed into law on February 17, 2009 as part of the American Recovery and Reinvestment Act (ARRA) and helped to open the door to medical data mining.[47] Prior to the signing of this law, an estimated 20% of United States-based physicians were utilizing electronic patient records.[46] Søren Brunak notes that "the patient record becomes as information-rich as possible" and thereby "maximizes the data mining opportunities."[46] Hence, electronic patient records further expand the possibilities for medical data mining, opening the door to a vast source of medical data analysis.

Spatial data mining

Spatial data mining is the application of data mining methods to spatial data. The end objective of spatial data mining is to find patterns in data with respect to geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis. Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasizes the importance of developing data-driven inductive approaches to geographical analysis and modeling.
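
As a small, hedged illustration of finding patterns with respect to geography, the sketch below clusters invented latitude/longitude points with DBSCAN using the haversine metric, so that neighborhood is measured on the Earth's surface; scikit-learn is assumed.

```python
# Cluster geographic points (e.g. customer or incident locations) on the sphere.
import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([
    [51.507, -0.128],   # London area
    [51.509, -0.120],
    [51.512, -0.131],
    [48.857,  2.352],   # Paris area
    [48.860,  2.349],
])

earth_radius_km = 6371.0
eps_km = 5.0  # points within ~5 km of each other may join a cluster

db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords_deg))   # haversine expects radians
print(labels)  # e.g. [0 0 0 1 1]; -1 would mark spatial outliers
```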

Data mining offers great potential benefits for GIS-based applied decision-making. Recently, the task of integrating these two technologies has become of critical importance, especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data begin to realize the huge potential of the information contained therein. Among those organizations are:

offices requiring analysis or dissemination of geo-referenced statistical data
public health services searching for explanations of disease clustering
environmental agencies assessing the impact of changing land-use patterns on climate change
geo-marketing companies doing customer segmentation based on spatial location.

Challenges in spatial data mining: Geospatial data repositories tend to be very large. Moreover, existing GIS datasets are often splintered into feature and attribute components that are conventionally archived in hybrid data management systems. Algorithmic requirements differ substantially for relational (attribute) data management and for topological (feature) data management.[48] Related to this is the range and diversity of geographic data formats, which present unique challenges. The digital geographic data revolution is creating new types of data formats beyond the traditional "vector" and "raster" formats. Geographic data repositories increasingly include ill-structured data, such as imagery and geo-referenced multi-media.[49]

There are several critical research challenges in geographic knowledge discovery and data mining. Miller and Han[50] offer the following list of emerging research topics in the field:

Developing and supporting geographic data warehouses (GDWs): Spatial properties are often reduced to simple aspatial attributes in mainstream data warehouses. Creating an integrated GDW requires solving issues of spatial and temporal data interoperability – including differences in semantics, referencing systems, geometry, accuracy, and position.


Better spatio-temporal representations in geographic knowledge discovery: Current geographic knowledge discovery (GKD) methods generally use very simple representations of geographic objects and spatial relationships. Geographic data mining methods should recognize more complex geographic objects (i.e., lines and polygons) and relationships (i.e., non-Euclidean distances, direction, connectivity, and interaction through attributed geographic space such as terrain). Furthermore, the time dimension needs to be more fully integrated into these geographic representations and relationships.

Geographic knowledge discovery using diverse data types: GKD methods should be developed that can handle diverse data types beyond the traditional raster and vector models, including imagery and geo-referenced multimedia, as well as dynamic data types (video streams, animation).

Temporal data mining

Data may contain attributes generated and recorded at different times. In this case finding meaningful relationships in the data may require considering the temporal order of the attributes. A temporal relationship may indicate a causal relationship, or simply an association.[citation needed]
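
A toy illustration of why temporal order matters: in the synthetic data below, a promotion's effect on sales only becomes visible when the promotion attribute is lagged by one week (the one-week lag and all values are invented for the example).

```python
# Synthetic example: the relationship appears only at a lag of one week.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
weeks = pd.date_range("2014-01-05", periods=52, freq="W")
promo = rng.integers(0, 2, size=52)                          # 1 = promotion ran that week
sales = 100 + 20 * np.roll(promo, 1) + rng.normal(0, 5, 52)  # effect shows up a week later

df = pd.DataFrame({"promo": promo, "sales": sales}, index=weeks)
print("same-week correlation:", round(df["promo"].corr(df["sales"]), 2))
print("lag-1 correlation:    ", round(df["promo"].shift(1).corr(df["sales"]), 2))
```

Whether such a lagged association reflects causation or mere coincidence still has to be judged separately, as the paragraph above notes.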

Sensor data mining

Wireless sensor networks can be used for facilitating the collection of data for spatial data mining for a variety of applications such as air pollution monitoring.[51] A characteristic of such networks is that nearby sensor nodes monitoring an environmental feature typically register similar values. This kind of data redundancy due to the spatial correlation between sensor observations inspires techniques for in-network data aggregation and mining. By measuring the spatial correlation between data sampled by different sensors, a wide class of specialized algorithms can be developed to make spatial data mining more efficient.[52]
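
The following sketch illustrates the underlying idea with synthetic readings: two nearby sensors track the same environmental signal and are therefore highly correlated, so one of them could be summarized or suppressed during in-network aggregation.

```python
# Measure spatial correlation between sensor streams and flag redundant pairs.
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(20.0, 2.0, size=200)           # shared environmental signal
readings = np.vstack([
    base + rng.normal(0, 0.1, 200),              # sensor A (near B)
    base + rng.normal(0, 0.1, 200),              # sensor B (near A)
    rng.normal(25.0, 2.0, size=200),             # sensor C (far away)
])

corr = np.corrcoef(readings)
redundant_pairs = [(i, j) for i in range(3) for j in range(i + 1, 3) if corr[i, j] > 0.95]
print(redundant_pairs)   # expected: [(0, 1)] - A and B can be aggregated
```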

Visual data mining

In the process of turning from analog into digital, large data sets have been generated, collected, and stored; discovering the statistical patterns, trends, and information hidden in these data makes it possible to build predictive models. Studies suggest visual data mining is faster and much more intuitive than traditional data mining.[53][54][55] See also Computer vision.

Music data mining

Data mining techniques, and in particular co-occurrence analysis, have been used to discover relevant similarities among music corpora (radio lists, CD databases) for purposes including classifying music into genres in a more objective manner.[56]
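
A toy example of co-occurrence analysis: count how often two artists appear in the same (invented) playlist and treat the counts as a crude similarity signal.

```python
# Count pairwise artist co-occurrences across playlists.
from collections import Counter
from itertools import combinations

playlists = [
    {"Miles Davis", "John Coltrane", "Bill Evans"},
    {"Miles Davis", "John Coltrane"},
    {"Metallica", "Slayer"},
    {"Miles Davis", "Bill Evans"},
]

cooccur = Counter()
for playlist in playlists:
    for a, b in combinations(sorted(playlist), 2):
        cooccur[(a, b)] += 1

print(cooccur.most_common(3))   # pairs that appear together most often
```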

Surveillance

Data mining has been used by the U.S. government. Programs include the Total Information Awareness (TIA) program, Secure Flight (formerly known as Computer-Assisted Passenger Prescreening System (CAPPS II)), Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE),[57] and the Multi-state Anti-Terrorism Information Exchange (MATRIX).[58] These programs have been discontinued due to controversy over whether they violate the 4th Amendment to the United States Constitution, although many programs that were formed under them continue to be funded by different organizations or under different names.[59]

In the context of combating terrorism, two particularly plausible methods of data mining are "pattern mining" and "subject-based data mining".

Pattern mining

"Pattern mining" is a data mining method that involves finding existing patterns in data. In this context patterns often means association rules. The original motivation for searching association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers that bought beer also bought potato chips.

In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity — these patterns might be regarded as small signals in a large ocean of noise."[60][61][62] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns seen both in the temporal and non-temporal domains are imported to classical knowledge discovery search methods.

Subject-based data mining

"Subject-based data mining" is a data mining method involving the search for associations between individuals in data. In the context of combating terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[61]

Knowledge grid

Knowledge discovery "On the Grid" generally refers to conducting knowledge discovery in an open environment using grid computing concepts, allowing users to integrate data from various online data sources, as well make use of remote resources, for executing their data mining tasks. The earliest example was the Discovery Net,[63][64] developed at Imperial College London, which won the "Most Innovative Data-Intensive Application Award" at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed knowledge discovery application for a bioinformatics application. Other examples include work conducted by researchers at the University of Calabria, who developed a Knowledge Grid architecture for distributed knowledge discovery, based on grid computing.[65][66]

Privacy concerns and ethics

While the term "data mining" itself has no ethical implications, it is often associated with the mining of information in relation to people's behavior (ethical and otherwise).[67]

The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics.[68] In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.[69][70]

Data mining requires data preparation, which can uncover information or patterns which may compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent).[71] This is not data mining per se, but a result of the preparation of data before – and for the purposes of – the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.[72][73][74]

It is recommended that an individual is made aware of the following before data are collected:[71]

the purpose of the data collection and any (known) data mining projects;
how the data will be used;
who will be able to mine the data and use the data and their derivatives;
the status of security surrounding access to the data;
how collected data can be updated.

Data may also be modified so as to become anonymous, so that individuals may not readily be identified.[71] However, even "de-identified"/"anonymized" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.[75]
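
One simple, illustrative way to probe whether a supposedly de-identified data set still allows individuals to be singled out is a k-anonymity check: count how many records share each combination of quasi-identifiers. The columns and records below are hypothetical.

```python
# Simple k-anonymity check on a supposedly de-identified data set: count how many
# records share each combination of quasi-identifiers. Groups smaller than k can
# still single out individuals. Column names and data are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "zip":    ["12065", "12065", "12065", "90210"],
    "age":    [34, 34, 35, 52],
    "gender": ["F", "F", "M", "M"],
})

k = 2
group_sizes = records.groupby(["zip", "age", "gender"]).size()
risky = group_sizes[group_sizes < k]
print(risky)   # combinations shared by fewer than k records are re-identification risks
```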

Situation in the United States

In the United States, privacy concerns have been addressed to some[weasel words] extent by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires individuals to give their "informed consent" regarding the information they provide and its intended present and future uses. According to an article in Biotech Business Week, "[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena," says the AAHC. "More importantly, the rule's goal of protection through informed consent is undermined by the complexity of consent forms that are required of patients and participants, which approach a level of incomprehensibility to average individuals."[76] This underscores the necessity for data anonymity in data aggregation and mining practices.


U.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. Use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.

Situation in Europe

Europe has rather strong privacy laws, and efforts are underway to further strengthen the rights of consumers. However, the U.S.-E.U. Safe Harbor Principles currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosures, there has been increased discussion of revoking this agreement, particularly because the data would be fully exposed to the National Security Agency, and attempts to reach an agreement have failed.[citation needed]

Software

See also Category:Data mining and machine learning software.

Free open-source data mining software and applications

Carrot2: Text and search results clustering framework.
Chemicalize.org: A chemical structure miner and web search engine.
ELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.
GATE: A natural language processing and language engineering tool.
KNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.
ML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.
MLPACK library: A collection of ready-to-use machine learning algorithms written in the C++ language.
Massive Online Analysis (MOA): A real-time big data stream mining tool with concept drift handling, written in the Java programming language.
NLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.
OpenNN: Open neural networks library.
Orange: A component-based data mining and machine learning software suite written in the Python language.
R: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.
RapidMiner: An environment for machine learning and data mining experiments.
SCaViS: Java cross-platform data analysis framework developed at Argonne National Laboratory.
SenticNet API: A semantic and affective resource for opinion mining and sentiment analysis.
Tanagra: A visualisation-oriented data mining software, also for teaching.
Torch: An open source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.
SPMF: A data mining framework and application written in Java with implementations of a variety of algorithms.
UIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.
Weka: A suite of machine learning software applications written in the Java programming language.

Commercial data-mining software and applications

Angoss KnowledgeSTUDIO: data mining tool provided by Angoss.
Clarabridge: enterprise class text analytics solution.
HP Vertica Analytics Platform: data mining software provided by HP.
IBM SPSS Modeler: data mining software provided by IBM.
KXEN Modeler: data mining tool provided by KXEN.
LIONsolver: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.
Microsoft Analysis Services: data mining software provided by Microsoft.
NetOwl: suite of multilingual text and entity analytics products that enable data mining.
Neural Designer: data mining software provided by Intelnics.
Oracle Data Mining: data mining software by Oracle.
QIWare: data mining software by Forte Wares.
SAS Enterprise Miner: data mining software provided by the SAS Institute.
STATISTICA Data Miner: data mining software provided by StatSoft.

Marketplace surveys

Several researchers and organizations have conducted reviews of data mining tools and surveys of data miners. These identify some of the strengths and weaknesses of the software packages. They also provide an overview of the behaviors, preferences and views of data miners. Some of these reports include:

2011 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery[77]

Rexer Analytics Data Miner Surveys (2007–2013)[78]

Forrester Research 2010 Predictive Analytics and Data Mining Solutions report[79]

Gartner 2008 "Magic Quadrant" report[80]

Robert A. Nisbet's 2006 Three Part Series of articles "Data Mining Tools: Which One is Best For CRM?"[81]

Haughton et al.'s 2003 Review of Data Mining Software Packages in The American Statistician [82]

Goebel & Gruenwald 1999 "A Survey of Data Mining and Knowledge Discovery Software Tools" in SIGKDD Explorations[83]


See also