Modeler Server Performance, Optimization, and Sizing

Upload: ashu-bobhate

Post on 03-Apr-2018


  • 7/28/2019 Modeler Server Performance, Optimization, And Sizing

    1/16

    Technical report

    PASW

    Modeler Server Performance, Optimization, and Sizing

    SPSS is a registered trademark and the other SPSS Inc. products named are trademarks of SPSS Inc. All other names are trademarks of their respective owners. © 2009 SPSS Inc. All rights reserved. CSWP-0209

    Table of contents

    Introduction

    High performance out-of-the-box

    Scaling the data mining process with SPSS Predictive Enterprise Services

    Performance optimization

    Advanced performance optimization

    Scoping and sizing PASW Modeler Server

    Conclusion

    About SPSS Inc.


    PASW Modeler Server Performance, Optimization, and Sizing

    Introduction

    Data mining offers organizations many benefits, including a more detailed view of their customers, along with a clearer view

    of current conditions and deeper insight into future events. By choosing a high-performance data mining tool, organizations

    can mine their data more efficiently and gain a significant return on investment (ROI). PASW Modeler*, the leading data mining

    workbench from SPSS Inc., enables organizations to easily and quickly mine many types of data, including large datasets.

    The result: more business value than other solutions can offer.

    PASW Modeler uses a scalable, three-tiered architecture to improve modeling productivity and deployment when working with

    large datasets. The PASW Modeler Client tier passes data mining processes to the PASW Modeler Server. Then PASW Modeler

    Server** analyzes these tasks to determine which ones should be executed within the database. After the database processes

    those tasks, it passes only the relevant aggregate or summary data to PASW Modeler Server. Since data pre-processing

    (typically 80-90 percent of the data mining effort) occurs in the database tier, users will accelerate modeling, maximize

    resources, and minimize network traffic.

    Data mining is an exploratory and interactive process requiring immediate feedback, so high-performance tools like PASW

    Modeler Server are essential. PASW Modeler Server provides increased productivity and faster access to results. When

    analytical results are deployed into operational systems, the impact of performance is even more significant because of high

    data volumes and real-time constraints.

    Data mining is a core process involved in predictive analytics, which combines advanced analytic techniques and decision

    optimization to inform and direct decision making. The value of predictive analytics is that it gives your organization the ability

    to act on the results, and PASW Modeler Server's high performance is crucial to timely action. This technical brief serves as

    a guide for understanding and maximizing PASW Modeler Server's already high performance. It focuses on PASW Modeler

    Server's out-of-the-box performance, scalability, and performance optimization, as well as its scoping and sizing requirements.

    * PASW Modeler, formerly called Clementine, is part of SPSS Inc.'s Predictive Analytics Software portfolio.

    ** PASW Modeler Server, formerly called Clementine Server, is part of SPSS Inc.'s Predictive Analytics Software portfolio.


    High performance out-of-the-box

    PASW Modeler Server has been designed and developed to provide high performance and scalability for all data mining tasks.

    SQL generation and parallel processing, for example, are performed automatically. As a result, PASW Modeler users don't need

    to make any changes to the way they work to get consistently high performance.

    In our benchmark tests of PASW Modeler Server performance (note 1), we measured the ability of PASW Modeler to carry out the

    common tasks of model building, model scoring, and data preparation.

    Model building: 16 million records in under five minutes

    PASW Modeler Server was able to build a logistic regression model from approximately 16 million records (note 2) in less than five minutes (see Figure 1).

    Figure 1: This stream was used in tests of model building performance.

    This dataset is larger than those typically used for model building. Against a more modest-sized dataset of 500,000 records, all of the model types were built in less than two minutes (see Figure 2).

    PASW Modeler Server transforms a time-consuming process into an iterative one and vastly reduces the time required to build models and to find the best model.

    Figure 2: The elapsed time taken to build a model using different algorithms (note 3).

    Note 1: Test environment: 2 x Intel Xeon 3.6GHz (hyperthreaded), 8GB RAM, 36GB RAID 1 system disk, 440GB RAID 0 data disk, Microsoft Windows Server 2003 Enterprise x64 SP1, Microsoft SQL Server 2000 SP4, and Clementine 10.0.

    Note 2: 21 fields used, mixture of data types.

    Note 3: Neural network build time is affected by randomization in the selection of records to prevent overtraining.


    Model scoring: 32 million records in close to eight minutes

    In a test scoring records against a classification model (see Figures 3 and 4), PASW Modeler Server accessed data from a table of 32 million records (note 4), scored the data against a decision tree model, and wrote the scores to a new database table in less than eight minutes.

    This scoring was achieved at a sustained rate of close to 65,000 records per second, equivalent to 225 million records per hour.

    Figure 3: This stream was used in tests of model scoring performance.

    Figure 4: The elapsed time taken to score a C&RT decision tree model.

    Note 4: 21 fields used, mixture of data types.

    Data preparation: 16 million customer records processed against 42 million products in eight minutes

    Data mining is about more than model building and scoring. A large part of the data mining process involves preparing the data. As seen in Figure 5, our tests of data preparation involved the performance of multiple, common data preparation steps, including joining customer data to a product dataset of nearly three times its size.

    Figure 5: This stream was used in tests of data preparation performance.


    PASW Modeler Server ran the stream against 16 million customer records in approximately eight minutes for an overall rate of over 33,000 customers per second (see Figure 6).

    Figure 6: The elapsed time taken to perform data preparation steps.

    Scaling the data mining process with SPSS Predictive Enterprise Services

    Raw data processing speed is not the only factor affecting performance. Frequently, the volume of models, rather than the volume of data, is the bottleneck hampering data mining productivity. In many organizations, the number of data miners, analysts, and others involved in the process can also have a very significant impact on performance.

    Generating real performance from data mining activities often depends more on an organization's ability to manage its analytical assets and complex, multi-part analytical processes than on raw data processing performance alone. For example, powerful servers are often underutilized when organizations are unable to put the right models in the right place and effectively schedule their execution.

    However, with SPSS Predictive Enterprise Services, organizations receive a complete, enterprise solution to the problems of analytical asset and process management. SPSS Predictive Enterprise Services uses an advanced, service-oriented architecture to improve the management of predictive models and related analytical processes within your organization's business operations. It extends PASW Modeler's rapid model development and deployment capabilities to create more manageable predictive analytics solutions.

    By using PASW Modeler Server with SPSS Predictive Enterprise Services, one financial services organization optimized its operational analytics, reducing the time taken to execute a key analytical process by a factor of 80. This resulted in major, quantifiable savings.

    By providing an integrated way to centralize and organize predictive models, and also automate predictive analytics processes, SPSS Predictive Enterprise Services helps organizations improve analytical asset and process management.

    Analytical asset management

    The resources that are involved in a predictive analytics process may include:

    - PASW Modeler streams, models, and outputs

    - Documentation

    - External scripts for data preparation or report generation

    - Resources from other predictive analytics tools, such as PASW Statistics syntax and outputs, and SAS code


    These are analytical assets: the tangible results of the efforts of data mining teams. SPSS Predictive Enterprise Services provides a centralized repository that offers:

    - Security and access control

    - Version control and labeling

    - Audit and tracking capabilities

    - Advanced data mining-aware organization and search facilities

    - Direct integration with PASW Modeler and also with PASW Statistics tools

    Managing analytical assets provides a foundation for data mining processes, enabling these processes to scale to the enterprise level.

    Analytical process management

    Developing robust processes for data mining activities such as model building, scoring, and validation is integral to delivering high performance on an enterprise scale. These processes often involve the combination of multiple tools and technologies.

    SPSS Predictive Enterprise Services provides a visual workflow user interface, Predictive Enterprise Manager, which allows a full, end-to-end process to be defined using assets stored in the repository and a mix of technologies (see Figure 7).

    Figure 7: Predictive Enterprise Manager allows users to create and schedule multi-part, multi-tool, analytical processes via a visual workflow interface.

    Analytical processes are fully integrated with the repository, automatically extracting the required objects and versions, and storing the results. A scheduling service allows these processes to be executed at regular intervals, and a notification service provides e-mail tracking.


    Performance optimization

    Most of PASW Modeler Server's high performance is achieved through performance optimizations that are switched on by

    default. Many PASW Modeler operations can be further improved by fine-tuning performance parameters.

    Maximize performance with in-database mining

    One of the key benefits of PASW Modeler Server is that it allows organizations to fully utilize their investments in high-

    performance database systems. Many organizations have invested heavily in a database infrastructure and business

    intelligence systems, but these systems are often under-utilized by the analytical tools that use them.

    PASW Modeler Server improves performance when mining large datasets by maximizing in-database mining. For example, you

    can delegate as many operations as possible to your IBM DB2 Data Warehouse database or Oracle Database 10g, taking

    advantage of database optimization and reducing data movement.

    With PASW Modeler Server, processing is executed in the database via SQL queries. Any operation that

    cannot be represented using SQL queries is performed

    by the server itself. Only relevant results are passed

    back to the client; perhaps more importantly, data

    transfer between the database and PASW Modeler

    Server is minimized.
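    To make the data-transfer argument concrete, the following minimal sketch contrasts fetching every row and aggregating in the application with pushing the aggregation into the database. It uses Python's built-in sqlite3 module as a stand-in for a production database. The transactions table, its columns, and the function names are hypothetical; they are not part of PASW Modeler Server or of the SQL it generates.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical transactions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
    rows = [(i % 100, float(i % 7)) for i in range(10_000)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def totals_client_side(conn):
    """Transfer every row out of the database, then aggregate in the application."""
    fetched = conn.execute(
        "SELECT customer_id, amount FROM transactions"
    ).fetchall()
    totals = {}
    for customer_id, amount in fetched:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals, len(fetched)


def totals_in_database(conn):
    """Aggregate in SQL so that only summary rows are transferred."""
    fetched = conn.execute(
        "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
    ).fetchall()
    return dict(fetched), len(fetched)


conn = build_demo_db()
client_totals, client_rows = totals_client_side(conn)
db_totals, db_rows = totals_in_database(conn)
print("rows transferred (client-side aggregation):", client_rows)
print("rows transferred (in-database aggregation):", db_rows)
```

    Both functions produce the same totals, but the in-database version transfers one row per customer instead of one row per transaction.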

    Another advantage of PASW Modeler Server's in-database mining is that it minimizes (and can even eliminate) data transfer

    costs. In a test measuring the impact of in-database mining (see Figure 8), the same PASW Modeler stream was executed

    with full SQL generation, no SQL generation, and a scoring-only SQL generation (which executed the scoring in-database but

    performed transfer of data to and from the database).


    While SQL generation of the scoring was approximately

    10 percent quicker than scoring in the application,

    the biggest factor in performance is data transfer, which

    accounts for more than 85 percent of the elapsed time

    for scoring.

    The only way to manage the data transfer bottleneck

    is to ensure that less data is transferred. PASW Modeler

    Server's SQL generation reduces data transfer to a minimum and leverages your investment in high-performance databases.

    Figure 8: Scoring stream executed with full SQL generation, SQL generation of scoring only, and no SQL generation

    Data transfer costs are the most significant factor affecting

    performance. For example, over 85 percent of the time

    allotted to score a model can be attributed to data transfer

    between the database and the scoring application.


    SQL feedback, previewing, and viewing

    There will be times when analysts will want more control over the optimization of PASW Modeler streams. PASW Modeler Server supports this by providing immediate feedback: upon execution, every PASW Modeler node that can be fully translated to SQL is highlighted (see Figure 9).

    Figure 9: SQL generation and highlighting in a PASW Modeler stream

    In Figure 9, the PASW Modeler stream is executed using SQL generation. Many nodes are purple, rather than the usual white, during execution. Purple nodes mean that the operations represented by those nodes have been translated into SQL and executed in-database. This feedback helps an analyst ensure that as much of the stream as possible is executed in the database. Additional options allow the user to examine the SQL that is generated.

    Stream optimization relies on intelligent SQL generation and stream execution

    SQL generation is a powerful capability, but it depends upon analysts to understand how PASW Modeler operations can be executed on a database. And analysts are focused on solving business problems, rather than optimizing their PASW Modeler streams for performance.

    For this reason, PASW Modeler Server features advanced optimization that intelligently re-orders operations in the PASW Modeler stream to maximize performance without altering results. Data miners can organize streams in a way that makes sense to them, and PASW Modeler Server will reorganize those same operations in a way that makes sense to the database.


    Figure 11: Setting a cache on a node that is likely to be re-executed will store the data in a temporary table on the database, when possible. Executing streams from that cached node will allow further in-database operations.

    Figure 10: Stream optimization

    In Figure 10, the derive node contains an operation that cannot be carried out in the database. PASW Modeler optimizes the process so that the select operation is performed before the derive operation, thereby reducing data transfer and improving performance.
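    The effect of moving a select ahead of a derive can be illustrated with a small sketch. This is not PASW Modeler's optimizer; the record layout, the field names, and the expensive_derive function are hypothetical. The reordering is only valid here because the select condition does not depend on the derived field.

```python
def expensive_derive(record):
    """Hypothetical stand-in for a derive step that cannot be expressed in SQL."""
    return {**record, "score": record["spend"] ** 0.5}


def run_as_written(records, counter):
    """Derive a new field for every record, then select the rows of interest."""
    derived = []
    for record in records:
        counter["derive_calls"] += 1
        derived.append(expensive_derive(record))
    return [r for r in derived if r["region"] == "north"]


def run_reordered(records, counter):
    """Select first (a step a database could perform), then derive on the survivors."""
    selected = [r for r in records if r["region"] == "north"]
    result = []
    for record in selected:
        counter["derive_calls"] += 1
        result.append(expensive_derive(record))
    return result


regions = ["north", "south", "east", "west"]
records = [
    {"id": i, "region": regions[i % 4], "spend": float(i)} for i in range(1000)
]

written_counter = {"derive_calls": 0}
reordered_counter = {"derive_calls": 0}
written_result = run_as_written(records, written_counter)
reordered_result = run_reordered(records, reordered_counter)
print(written_counter["derive_calls"], reordered_counter["derive_calls"])
```

    The two runs return identical records, but the reordered version applies the derive step to a quarter of the rows, so less data has to leave the database.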

    In-database caching

    One common user optimization is to set up a cache on

    a node. The next time data is passed through that node,

    the cache is filled with that data. From then on, the data

    is read from the cache rather than from the data source.

    This can be a useful way to ensure that expensive data

    processing is only executed once.

    Normally, the cache is stored as a temporary file on the

    file system, but PASW Modeler Server also supports

    the caching of this data into a temporary table in the

    database. When combined with SQL optimization,

    this may result in significant gains in performance.

    As illustrated in Figure 11, the output from a stream

    that merges multiple tables to create a data mining

    view may be cached and reused as needed.

    Plus, by automatically generating SQL for all downstream nodes, performance can be improved further. In Figure 11,

    the select operation is highlighted, indicating that the operation is being executed in the database from the filled

    database cache.
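    As a rough illustration of a database cache, the sketch below stores the result of a merge in a temporary table and then runs further operations against that table. It uses Python's built-in sqlite3 module; the table names, columns, and data are hypothetical, and the SQL is written by hand rather than generated by PASW Modeler Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'retail'), (2, 'retail'), (3, 'business');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 20.0), (3, 100.0);
    """
)

# Fill the cache once: the merged "data mining view" becomes a temporary table.
conn.execute(
    """
    CREATE TEMP TABLE mining_view_cache AS
    SELECT c.customer_id, c.segment, SUM(o.amount) AS total_spend
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
    """
)

# Downstream steps read from the cache, so the merge is not executed again
# and the select still runs inside the database.
retail = conn.execute(
    "SELECT customer_id, total_spend FROM mining_view_cache "
    "WHERE segment = 'retail' ORDER BY customer_id"
).fetchall()
by_segment = conn.execute(
    "SELECT segment, COUNT(*) FROM mining_view_cache "
    "GROUP BY segment ORDER BY segment"
).fetchall()
print(retail)
print(by_segment)
```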

    In-database model building

    PASW Modeler Server supports integration with data mining algorithms that are available from other database vendors.

    Organizations can use PASW Modeler to manage the entire data mining process while modeling with the database-native

    algorithms provided by these vendors. Using in-database modeling ensures that data transfer is minimized, even during

    the model building phase. It also helps organizations leverage their existing investments in IBM DB2 Intelligent Miner,

    Microsoft SQL Server 2005, and Oracle Data Mining.


    Advanced performance optimization

    In addition to in-database mining, PASW Modeler Server provides a number of capabilities that allow the user to optimize the

    performance of their streams.

    Database bulk-loading

    Data movement is often a bottleneck in performance, especially when writing data to a database. PASW Modeler Server

    provides a number of features to optimize this process for large data volumes.

    Figure 12: Database export advanced options allow bulk loading to database via ODBC or through an external loader.

    Figure 13: Create indexes on database tables to improve database performance.

    By default, writing data to a database is performed on

    a row-by-row basis. While this prevents errors and

    provides data security, it slows performance. Allowing

    the PASW Modeler Server to commit multiple rows at

    a time is a good way to ensure more reasonable

    performance, and this option is available by default.
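    The difference between row-by-row and batched commits can be sketched with Python's built-in sqlite3 module. The scores table, the function names, and the batch size of 1000 are assumptions made for illustration and do not reflect PASW Modeler Server's export settings.

```python
import sqlite3


def new_connection():
    """Open an in-memory database with a hypothetical table of exported scores."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (customer_id INTEGER, score REAL)")
    conn.commit()
    return conn


def write_row_by_row(conn, rows):
    """Commit after every row: one transaction per record."""
    commits = 0
    for row in rows:
        conn.execute("INSERT INTO scores VALUES (?, ?)", row)
        conn.commit()
        commits += 1
    return commits


def write_in_batches(conn, rows, batch_size=1000):
    """Commit once per batch of rows, which greatly reduces the number of transactions."""
    commits = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO scores VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


rows = [(i, i / 2500) for i in range(2500)]
row_conn = new_connection()
batch_conn = new_connection()
row_commits = write_row_by_row(row_conn, rows)
batch_commits = write_in_batches(batch_conn, rows, batch_size=1000)
print(row_commits, batch_commits)
```

    Both connections end up holding the same rows. The trade-off is that a failure part-way through a batch loses the whole uncommitted batch rather than a single row.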

    In addition to the batch committal of records, PASW

    Modeler Server supports two types of bulk loading,

    as shown in Figure 12.

    The first is provided through ODBC bulk loading facilities.

    The second type uses an external bulk loading tool to

    allow a database-native solution. External bulk loading

    scripts are provided for Microsoft SQL Server, Oracle Data

    Mining, IBM DB2 Intelligent Miner, Netezza Performance

    Server, Teradata Warehouse, and IBM Redbrick

    Warehouse databases. These scripts can be customized,

    and custom scripts may be written for other databases.

    Database indexing

    Indexing database tables maintains the performance of

    in-database options. Correct indexing significantly impacts

    many subsequent database operations.

    As shown in Figure 13, PASW Modeler Server enables

    users to create indexes on tables exported from PASW Modeler. Simple indexes can be created easily, and PASW

    Modeler also allows you to customize the SQL statement

    used to create the index (for instance, to create a BITMAP,

    UNIQUE, or FILLFACTOR index).
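    A minimal sketch of indexing an exported table follows, using Python's built-in sqlite3 module. The exported_scores table and the index name are hypothetical. SQLite supports UNIQUE indexes, whereas options such as BITMAP or FILLFACTOR are specific to other database products and are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exported_scores (customer_id INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO exported_scores VALUES (?, ?)",
    [(i, i / 1000) for i in range(1000)],
)

# A customized index statement: UNIQUE also enforces one row per customer.
conn.execute(
    "CREATE UNIQUE INDEX idx_scores_customer ON exported_scores (customer_id)"
)
conn.commit()

# Ask the database how it would run a later lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT score FROM exported_scores WHERE customer_id = 42"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

    The query plan names the index, which indicates that subsequent lookups on customer_id no longer need to scan the whole table.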


    Optimized joins and sorts

    By default, PASW Modeler has to make assumptions

    about the state of data in the system. For example,

    PASW Modeler cannot assume that any data has already

    been sorted, so many operations ensure that a sort

    is performed when required, even if such a sort is

    redundant. PASW Modeler allows the user to optimize

    a sort or join operation by specifying any existing sorts

    on the data. This eliminates redundancy and improves

    performance, as shown in Figure 14.

    Users can also optimize the performance of PASW

    Modeler Server through special case algorithms for joins.

    PASW Modeler's default join algorithm is designed to perform optimally when joining datasets of similar size.

    In some very common operations, such as when using a

    join to connect an ID in one table to a label or description

    from another (e.g., joining a product code in a table of

    transactions to a product name in a look-up table), the

    default join is inefficient.

    PASW Modeler offers an alternate join algorithm for these

    situations that significantly boosts performance speed,

    as can be seen in Figure 15.
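    The report does not say which algorithm PASW Modeler uses for this special case. One common technique for joining a large table to a small look-up table is a hash lookup, sketched below next to a sort-merge join for comparison. The function names and sample data are hypothetical, and product codes are assumed to be unique in the look-up table.

```python
def sort_merge_join(transactions, products):
    """General-purpose approach: sort both inputs on the key, then merge them."""
    tx_sorted = sorted(transactions, key=lambda t: t[1])
    pr_sorted = sorted(products)
    joined, j = [], 0
    for tx_id, code in tx_sorted:
        # Advance through the products until the current code is reached or passed.
        while j < len(pr_sorted) and pr_sorted[j][0] < code:
            j += 1
        if j < len(pr_sorted) and pr_sorted[j][0] == code:
            joined.append((tx_id, code, pr_sorted[j][1]))
    return joined


def lookup_join(transactions, products):
    """Special case: hold the small table in a dict and stream the large table once."""
    name_by_code = dict(products)
    return [
        (tx_id, code, name_by_code[code])
        for tx_id, code in transactions
        if code in name_by_code
    ]


products = [("P1", "Widget"), ("P2", "Gadget")]
transactions = [(1, "P2"), (2, "P1"), (3, "P2"), (4, "P9")]

merged = sort_merge_join(transactions, products)
looked_up = lookup_join(transactions, products)
print(looked_up)
```

    The lookup version avoids sorting the large table entirely, which is where the saving comes from when one side of the join is small enough to fit in memory.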

    Figure 14: Impact of pre-sorting optimization on sort performance

    Figure 15: Impact of specialized join when joining a large table to a small table (250,000 records)

    High performance through parallel data processing

    Multithreading is a method by which an application's process can perform more than one task at the same time. Threads share the same memory space, and must synchronize at certain points within their execution to access shared resources safely. Operating systems provide low-level mechanisms to support this synchronization. If an application uses more than one thread to execute, it is said to be multithreaded.

    Symmetric multiprocessing (SMP) machines are widely used and available for all platforms supported by PASW Modeler

    Server. They comprise multiple CPUs sharing access to the same memory, disk, network, and other I/O resources. When a multithreaded application runs on an SMP box, threads may be distributed across the CPUs and execute truly in parallel.

    Application processes and individual threads can usually migrate dynamically between CPUs to balance processor load.

    This is generally handled transparently by the operating system.

    PASW Modeler Server employs parallel processing to improve performance in both data processing and modeling operations.


    Parallel data processing

    PASW Modeler Server uses a parallel data-sorting algorithm to improve the performance of a number of data processing

    operations. Sorting is used by many PASW Modeler operations, including binning, model evaluation, merge and, of course,

    the sort operation itself. All of these operations benefit from the parallelization of the sort operation.

    The parallelized sort algorithm uses a technique called

    record parallelism. This technique distributes records

    across a number of separate sorting processes. Each process

    sorts its own subset of records and then the results are joined.

    Figure 16 shows the effect of running a parallelized sort on

    multiprocessor hardware. At high data volumes, sort times

    can be reduced by more than 30 percent.
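    The structure of a record-parallel sort can be sketched as follows: split the records into subsets, sort each subset in its own worker, then merge the sorted subsets. This is an illustration of the general technique, not PASW Modeler Server's implementation. The parallel_sort function and its workers parameter are hypothetical, and a thread pool is used only to keep the sketch self-contained; in CPython a process pool would be needed for real CPU parallelism.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(records, workers=4):
    """Sort records by distributing them across workers and merging the results."""
    if not records:
        return []
    chunk_size = -(-len(records) // workers)  # ceiling division
    parts = [
        records[start:start + chunk_size]
        for start in range(0, len(records), chunk_size)
    ]
    # Each worker sorts its own subset of the records independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, parts))
    # The sorted subsets are then joined with a k-way merge.
    return list(heapq.merge(*sorted_parts))


rng = random.Random(0)
data = [rng.randint(0, 1_000_000) for _ in range(10_000)]
result = parallel_sort(data, workers=4)
print(result[:5])
```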

    Figure 16: Impact of multiple CPUs on data sorting performance

    Parallel predictive model building

    Parallel processing techniques are also used by PASW Modeler's C5.0 decision tree algorithm and can improve performance in building decision trees and rule sets. The benefits depend largely on dataset size (both the number of records and the number of fields), but they can provide a useful boost to what can be a time-consuming process.

    Scoping and sizing PASW Modeler Server

    Many factors must be considered when scoping hardware requirements for a PASW Modeler Server installation. The breadth

    of PASW Modeler operations and differences in data volumes make it difficult to estimate performance for any specific

    hardware configuration.

    Impact of CPUs on performance

    Obviously, the core speed of any individual CPU will impact data mining performance. Almost all data mining operations,

    especially modeling, are heavily processor dependent, so an increase in CPU speed will produce a proportional increase

    in performance for many PASW Modeler processes.

    The main benefits of multiple CPUs (or multicore CPUs) occur when running multiple streams. This means that the number of

    users will often be the deciding factor in determining the optimum number of CPUs. Multiple CPUs will also benefit parallelized

    operations, but the main benefits will be from supporting multiple users.


    Table 1: Recommended number of CPUs per number of users

    Number of users    Number of CPUs
    1-2                1
    3-4                2
    5-10               4
    11-20              8
    21+                16

    For a production server running scheduled data mining via SPSS Predictive Enterprise Services, the number of CPUs should be determined by the number of separate processes to be performed simultaneously. Maximum performance can be achieved, for instance, by splitting a model scoring process across multiple CPUs or building multiple models simultaneously.

    Impact of physical memory on performance

    Most PASW Modeler operations can be performed on large volumes of data with minimal memory usage. Only certain operations, such as sorting, joining, and modeling, require data to be temporarily stored in memory. If not enough memory is available, these operations will store part of the data as virtual memory on disk. This can affect performance, since disk access is significantly slower than memory access.

    As with CPU usage, the number of users impacts the required memory for normal operation. Memory requirements depend on data volume. Typical minimum requirements can be found in Table 2.

    Table 2: Minimum RAM for number of users in normal use

    Number of users    Minimum RAM
    1-2                1GB
    3-4                2GB
    5-10               4GB
    11-20              8GB
    21+                16GB

    Large volume model building

    Model building is one of the more memory-intensive operations in the data mining process. This is because the model-building algorithms require access to the entire modeling dataset, often making multiple passes at the data.

    For this reason, model building is usually performed on subsets or samples of data. It is normally more productive to build different models on a small subset of the data and then choose the best model, rather than to build a single model on a larger dataset. This type of model building can usually be performed within minimal memory requirements.


    Using more data rarely improves the predictive accuracy of a model. However, if model building on larger volumes is required,

    additional memory can help performance.

    Table 3 provides guidance on the memory required to avoid disk-caching on model building operations, based on the memory usage of the neural network, K-means, and Kohonen modeling algorithms.

    Table 3: Estimated RAM required (GB) to avoid disk-caching during model building (note 5)

                       Columns
    Rows (millions)    10     20     50     100    500    1000
    0.1                0.5    0.5    0.5    0.5    2      4
    0.5                0.5    0.5    0.5    1      4      8
    1                  0.5    0.5    1      2      8      16
    2                  0.5    0.5    2      4      16     32
    4                  0.5    1      4      8      32     -
    8                  1      2      8      16     -      -
    16                 2      4      16     32     -      -
    32                 4      8      32     -      -      -
    64                 8      16     -      -      -      -

    Note 5: Estimates based on neural network, Kohonen, and K-means algorithm memory requirements. Maximum physical memory may also be limited by the operating system.

    Memory configuration

    PASW Modeler Server will, by default, limit the amount of physical memory used by any single process to ensure that other simultaneous processes aren't affected. A maximum of 25 percent of available memory will be allocated for model building, and approximately 10 percent will be available for sorting operations. This figure is lower, as there may be multiple sorts in a single stream. The PASW Modeler Server administrator can modify these settings.

    Impact of disk space on performance

    Before addressing disk space requirements, it is important to understand the volume of data that is likely to be used for the actual data mining. Most organizations store many terabytes of data, especially transactional data, but this amount will rarely be used. Normally the data is aggregated, selected, or sampled before it is used for analysis. While large data volumes are typically used in model scoring, the model scoring processes usually rely on operations that don't use a lot of system resources.

    When trying to maximize performance, disk usage for data processing steps can be relatively high. The user often caches data to minimize execution times, and some operations will spill to disk when physical memory is unavailable. In addition, some operations may produce a dataset larger than the raw input data, further increasing disk requirements.


    To understand disk usage, a series of tests was performed based upon the PASW Modeler Application Template for customer

    relationship management (CRM). This template consists of streams that demonstrate data mining techniques used for CRM.

    The source dataset was 72MB in size, representing a sample of 140,000 customers and 360,000 transactions, plus other

    associated data.

    The data was stored in text files and all operations were carried out by PASW Modeler Server; no SQL generation was required (note 6).

    As shown in Figure 17, the tests measured the maximum amount of disk space needed to execute over 100 separate execution streams. The vast majority of streams required little disk usage, but others used over four times the disk space of the source data.

    Figure 17: Percentage of original disk space required for data mining stream operations.

    Given that these data preparation steps are typically executed infrequently (it's a best practice to store the results of such processing as intermediate files or tables), a conservative rule of thumb is to reserve three to five times the disk space required to store the original data.

    Note 6: SQL generation typically reduces the disk space requirements for PASW Modeler Server since many of the data preparation steps can be carried out on the database.

    Note 7: Estimates based on 1 million rows/10 columns requiring 100MB disk (high estimate) and a working multiplier of 5 times (high estimate for single user).

    Table 4: Estimated disk space required (GB) for data mining (1-5 users) (note 7)

                       Columns
    Rows (millions)    10     20     50     100    500    1000
    1                  0.5    1      2.5    5      25     50
    2                  1      2      5      10     50     100
    4                  2      4      10     20     100    200
    8                  4      8      20     40     200    400
    16                 8      16     40     80     400    800
    32                 16     32     80     160    800    1600
    64                 32     64     160    320    1600   3200

    This rule holds for small numbers of users because users will rarely perform high disk-usage operations simultaneously. In

    addition, organizations can minimize overall disk usage by scheduling expensive data preparation steps during times of low

    system usage.
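    The estimates in Table 4 follow directly from the two assumptions given in its footnote: 100MB of disk per 1 million rows of 10 columns, and a working multiplier of 5. The sketch below reproduces that arithmetic. The function name and its parameters are hypothetical, and 1GB is taken as 1000MB, which is what the table's values imply.

```python
def estimate_disk_gb(rows_millions, columns, multiplier=5.0):
    """Estimate working disk space in GB from the Table 4 footnote assumptions."""
    # 100MB per 1 million rows of 10 columns, scaled linearly in both dimensions.
    raw_mb = 100.0 * rows_millions * columns / 10.0
    # The working multiplier covers caches, spills, and intermediate datasets.
    return raw_mb * multiplier / 1000.0


for rows_millions, columns in [(1, 10), (8, 50), (64, 1000)]:
    print(rows_millions, columns, estimate_disk_gb(rows_millions, columns))
```

    Lowering the multiplier argument to 3 gives the lower end of the three-to-five-times rule of thumb for the same data volume.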


    Conclusion

    The ever-growing amount of data created by organizations presents opportunities and challenges for data mining.

    The PASW Modeler data mining solution makes it easy to use business knowledge to quickly develop, update, and deploy

    predictive models.

    Furthermore, PASW Modeler Server's combination of high performance, scalability, performance optimization options, and

    flexible hardware requirements enables it to handle large and complex data mining projects. With PASW Modeler Server,

    your organization can:

    - Utilize your investment in high-performance databases for all data mining tasks, ensuring high performance and minimizing data transfer costs

    - Maximize your use of multiple CPUs (or multicore CPUs) in your operating environment by using parallel processing during a number of data preparation and model-building operations

    - Use in-database caching, database write-back with indexing, and optimized merging to join tables outside of the database

    Scaling the entire data mining process with PASW Modeler Server makes it possible for your organization to analyze large

    volumes of data efficiently, shortening the time needed to turn data into better business decisions that boost your ROI.

    About SPSS Inc.

    SPSS Inc. (NASDAQ: SPSS) is a leading global provider of predictive analytics software and solutions. The company's

    predictive analytics technology improves business processes by giving organizations consistent control over decisions made

    every day. By incorporating predictive analytics into their daily operations, organizations become Predictive Enterprises, able

    to direct and automate decisions to meet business goals and achieve measurable competitive advantage.

    More than 250,000 public sector, academic, and commercial customers rely on SPSS Inc. technology to help increase

    revenue, reduce costs, and detect and prevent fraud. Founded in 1968, SPSS Inc. is headquartered in Chicago, Illinois. For

    additional information, please visit www.spss.com.

    To learn more, please visit www.spss.com. For SPSS Inc. office locations and telephone numbers, go to www.spss.com/worldwide.
