infosphere cdc flat file for datastage configuration and ... · pdf fileinfosphere cdc flat...
TRANSCRIPT
© 2010 IBM Corporation
InfoSphere CDC Flat file for DataStage
Configuration and Best Practices
Information Management Software
2
Understanding the Flat File Workflow
Landing Location
Information Management Software
3
Landing Location
1. Source Database
• Configure CDC on the source database where the CDC service for the database reads the transaction log to captu re changes
Information Management Software
4
Landing Location
2. Defining the Replication Definition
• CDC for DataStage transfers the change data accordi ng to the replication definition
• To configure:
• Define the table structure that will be sent to DataStage
• Define the DataStage connection method for Flat Files
• Define single or multiple format to determine how DataStage will be processing the incoming records
Information Management Software
5
Map Table for Flat File Output (1)
• Map table as usual, select WebSphere DataStage as t he target
• Select Flat File for method
• Specify the directory to which the flat files will be written and picked up by the DataStage job (directory resides on the D S server)
• Initial status of table will be Active (picking up changes from the moment it was mapped)
Information Management Software
6
Map Table for Flat File Output (2)
Information Management Software
7
Defining the DataStage Record Format (1)
• Standard columns containing information about the c hange:
• DM_TIMESTAMP - The timestamp obtained from the log of when the operation occurred (contains the value from the &TIMSTAMP journal control field)
• DM_TXID - Transaction identifier (contains the value from the &CCID journal control field)
• DM_OPERATION_TYPE contains a single character indicating the type of operation:• "I" for an insert. • "D" for a delete. • For Single Record Format there is one type that represents the update image
• "U" represents an update. • For Multiple Record Format there are two separate types that represent before and
after image• "B" for the row containing the before image of an update. • "A" for the row containing the after image of an update.
• DM_USER - The user that performed the operation (contains the value from the &USER journal control field)
Information Management Software
8
Defining the DataStage Record Format (2)
• Single record• In this format an update operation is sent as a single row• The before and after image is contained in the same record • E.g. Updating 3 records
"2010-11-23 21:43:24","0","U","EPANG","1","elaine ","1","update “"2010-11-23 21:43:24","0","U","EPANG","2","elaine ","2","update “"2010-11-23 21:43:24","0","U","EPANG","3","abc ","3","update "
• Multiple record format• An update operation is sent as two rows, the first row being the before image
and the second row containing the after image.• E.g. Updating 3 records
"2010-11-23 21:46:15","0","B","EPANG","1","update “
"2010-11-23 21:46:15","0","A","EPANG","1","hello “
"2010-11-23 21:46:15","0","B","EPANG","2","update “
"2010-11-23 21:46:15","0","A","EPANG","2","hello “
"2010-11-23 21:46:15","0","B","EPANG","3","update “
"2010-11-23 21:46:15","0","A","EPANG","3","hello "
Information Management Software
9
Naming Convention of Flat Files
• CDC uses the following convention to name the flat files that are produced during replication.
• [Table].x[Date].[Time][# Records]
• x = D for completed flat files, @ for currently open flat file• [Date] = Julian date (year, day number within year)• [Time] = hh24mmss when flat file was created (in GMT)• [# Records] = Optionally the number of records can be added
• [Table].STOPPED
• When subscription is stopped, this file is generated
The timestamp format can be configured using the system parameter ds_output_timestamp_format . E.g. ds_output_timestamp_format=“yyyy-
MM-dd HH:mm:ss.SSS” (to include milliseconds)
Information Management Software
10
Landing Location
3. Flat Files Become Available for DataStage
• CDC for DataStage server hardens the files and depo sits them in the flat file location.
• While actively mirroring to a file it is not access ible to DataStage. The process of hardening involves renam ing the file, replacing the ‘@’ with a ‘D’ thus making it a vailable to Datastage.
• To configure:
• Define the Batch Size Threshold settings to determine how often CDC hardens the flat files that are made available to DataStage
Information Management Software
11
Set Subscription DataStage Properties
• Right-click on subscription to set properties
• The file will be hardened always at the end of a tr ansaction boundary and when either of the following thresholds are passed:• Timing in seconds of flat file closure• Maximum number of rows per flat file
• Flat file is closed and next one is created/opened when either value is reached• Closed flat files can be picked up by DataStage for processing as they will contain only
completed transactions
Information Management Software
12
Landing Location
4. Flat Files Read by DataStage Job
• InfoSphere DataStage sequential file reader retriev es the flat files as part of an InfoSphere DataStage job and transforms them
• The job has three parameters defined in the Managem ent Console where the *.dsx file is created:
• SPFolderPath – the full path name for the folder that DataStage searches for the source flat files created by CDC
• SPFileNamePattern – the file name pattern used to identify the source flat files
• SPEndFileNamePattern – the file name pattern DataStage creates when subscriptions stop mirroring.
Information Management Software
13
Landing Location
5. Flat Files are Deposited to New Location
• InfoSphere DataStage sequential file reader deposit s the transformed flat files in the new flat file locatio n
• To configure:
• DataStage definition file (*.dsx ) from Management Console or in DataStage Designer
• Import definition file into DataStage and customize any additional steps/stages where necessary
Information Management Software
14
Connecting CDC for DataStage with DataStage
• Datastage uses job definitions to describe the sequence of steps, or stages required to transform data
• DataStage jobs are normally designed and edited in InfoSphere DataStage Designer
• When using CDC for DataStage you have the option of generating a job definition within CDC without crea ting it in DataStage Designer
Information Management Software
15
Generating an InfoSphere DataStage Definition File
• DataStage definition import file (.dsx) can be gene rated automatically
• Right-click on subscription and select Generate Inf oSphere DataStage Job Definition
• Place .dsx file at a location where it can be selec ted from DataStage (or copy it to the DS server)
Information Management Software
16
Import .dsx file into DataStage (1)
• DataStage flat file processing job will be generate d automatically
• DS job is already tailored to picking up the flat files from the specified directory
Information Management Software
17
Import .dsx file into DataStage (2)
Information Management Software
18
Best Practices for Flat Files
Information Management Software
19
Flat Files are Best Suited for…
• Best suited for under a few hundred tables
• Extra memory will need to be allocated with larger numbers of tables
• Very high data volume which requires parallel loadi ng
• Replacement for existing ETL delta extracts
• Data warehouses which benefit from bulk load of cha nged data
• Installation on 64 bit systems
Information Management Software
20
Considerations and Limitations
• The Flat File integration option is not suitable wh en character columns contain binary data. The UTF-8 files may c ontain code points that resolve to special characters, such as quotes, line feed or carriage returns, that cannot be processed
• Tables are individually replicated, which can break transactional table dependencies
• Additional processing is required in DataStage to maintain referential integrity between dependent tables
• Disk staging space
• Managing many files
Information Management Software
21
Initial Synchronization
• DataStage extracts data from source database using standard ETL functions
• An alternative is to use CDC to perform initial Ref resh and then transition to mirroring mode. This method inv olves first creating flat files for the refresh then loading us ing DataStage.
Information Management Software
22
Recommended Flat file Storage Option
• Direct attached disk storage is a typical option us ed for the storage of CDC flat files.
• Shared Storage Area Network (SAN) is another recomm ended option to stage files.
• This allows running CDC DS on a server separated from the DataStage grid, ensuring CDC has dedicated CPU/Disk capacity.
• The DataStage grid nodes can then read the files on the shared SAN, allowing for high performance and recoverability.
• Network File System (NFS) is not recommended for hi gh volume environments.
• CDC is not resilient to file system errors that may occur, and may suffer from network latency for writing many small changes to the flat files.
Information Management Software
23
Clean-up of Flat files Generated by CDC
• By default, the .dsx file generated by CDC will def ine that flat files are removed once CDC has deposited the files into the DataStage job.
• If additional sequencing of the files is required ( i.e. multiple tables containing foreign key relationships) this l ogic requires customization.
• A DataStage expert can modify the .dsx file generat ed by CDC to remove the cleanup logic and make adjustments as appropriate.
Information Management Software
24
Distinguishing Transaction/Record Ordering
• The timestamp field provides second to microsecond accuracy. It cannot be used alone to uniquely order records if m ultiple records are changed at the same time
• You can use the system parameter ds_output_timestamp_formatto format timestamp in milliseconds in the flat fil es. Note: some databases like Oracle can not produce millisecond a ccuracy. Changing this parameter can not improve upon the ac curacy that the database supports
• For sequencing within a single table:
• Use a combination of the timestamp, flat file number and line number to uniquely identify changes in commit order
• If you need to sequence across all tables in a subs cription, you will additionally be required to use a derived column on the source to generate a sequence number
Information Management Software
25
Recovery
• CDC maintains the source database log position in a ‘bookmark’ which is used for restarting replication and/or recovery from failure
• Flat files – CDC writes the bookmark to internal CDC metadata when hardening a flat file which has finished writing
• If the network is lost or a system failure occurs the flat file option provides recoverability and resiliency; CDC will start from the last flat file that was not yet hardened
• Both options operate independently from DataStage w hich periodically picks up the changes and processes the data
• CDC only manages recovery up to the CDC staging mechanism