Sunday, March 6, 2011

Performance Tuning

The goal of performance tuning is to optimize session performance by eliminating performance bottlenecks. To tune the performance of a session, first you identify a performance bottleneck, eliminate it, and then identify the next performance bottleneck until you are satisfied with the session performance. You can use the test load option to run sessions when you tune session performance.

Once you determine the location of a performance bottleneck, you can eliminate the bottleneck by following these guidelines:
♦ Eliminate source and target database bottlenecks. Have the database administrator optimize database performance by optimizing the query, increasing the database network packet size, or configuring index and key constraints.
♦ Eliminate mapping bottlenecks. Fine-tune the pipeline logic and the transformation settings and options in mappings to eliminate mapping bottlenecks.
♦ Eliminate session bottlenecks. You can optimize the session strategy and use performance details to help tune the session configuration.
♦ Eliminate system bottlenecks. Have the system administrator analyze information from system monitoring tools and improve CPU and network performance.
If you tune all the bottlenecks above, you can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.
Because determining the best way to improve performance can be complex, change only one variable at a time, and time the session both before and after the change. If session performance does not improve, you might want to return to your original configurations.

----------------------------------------------------------------------

Identifying the Performance Bottleneck
The first step in performance tuning is to identify the performance bottleneck. Performance bottlenecks can occur in the source and target databases, the mapping, the session, and the system. Generally, you should look for performance bottlenecks in the following order:
1. Target
2. Source
3. Mapping
4. Session
5. System

You can identify performance bottlenecks by running test sessions, viewing performance details, and using system monitoring tools.

1) Identifying Target Bottlenecks:
The most common performance bottleneck occurs when the PowerCenter Server writes to a target database. You can identify target bottlenecks by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck.
If your session already writes to a flat file target, you probably do not have a target bottleneck. You can optimize session performance by writing to a flat file target local to the PowerCenter Server.
Causes for a target bottleneck may include small check point intervals, small database network packet size, or problems during heavy loading operations.

2) Identifying Source Bottlenecks
Performance bottlenecks can occur when the PowerCenter Server reads from a source database. If your session reads from a flat file source, you probably do not have a source bottleneck. You can improve session performance by setting the number of bytes the PowerCenter Server reads per line if you read from a flat file source.
If the session reads from a relational source, you can use a Filter transformation, a read test mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation
You can use a filter transformation in the mapping to measure the time it takes to read source data. Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.
Using a Read Test Session
You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing the transformations in the mapping. Use the following steps to create a read test mapping:
1.Make a copy of the original mapping.
2.In the copied mapping, keep only the sources, source qualifiers, and any custom joins or queries.
3.Remove all transformations.
4.Connect the source qualifiers to a file target.
Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.
Using a Database Query
You can identify source bottlenecks by executing the read query directly against the source database.
Copy the read query directly from the session log and execute it against the source database with a query tool such as isql. On Windows, you can direct the query output to a file. On UNIX systems, you can direct the output to /dev/null.
Measure the query execution time and the time it takes for the query to return the first row. If there is a long delay between the two time measurements, you can use an optimizer hint to eliminate the source bottleneck.
Causes for a source bottleneck may include an inefficient query or small database network packet sizes.
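A minimal sketch of this check, assuming Oracle as the source database and a hypothetical orders source table (the real query should be copied from the session log):

-- Hypothetical read query; measure total execution time and time to the first row.
SELECT order_id, customer_id, order_date, amount
FROM   orders
WHERE  order_date >= DATE '2011-01-01';

-- If there is a long delay before the first row arrives, an optimizer hint such as
-- FIRST_ROWS asks the database for a plan that returns rows sooner.
SELECT /*+ FIRST_ROWS(100) */ order_id, customer_id, order_date, amount
FROM   orders
WHERE  order_date >= DATE '2011-01-01';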

3) Identifying Mapping Bottlenecks
If you determine that you do not have a source or target bottleneck, you might have a mapping bottleneck. You can identify mapping bottlenecks by using a Filter transformation in the mapping.
If you determine that you do not have a source bottleneck, you can add a Filter transformation in the mapping before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck.

You can also identify mapping bottlenecks by using performance details. High Errorrows and Rowsinlookupcache counters indicate a mapping bottleneck.

High Rowsinlookupcache Counters
Multiple lookups can slow down the session. You might improve session performance by locating the largest lookup tables and tuning those lookup expressions.
 
4)Identifying a Session Bottleneck
If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. You can identify a session bottleneck by using the performance details. The PowerCenter Server creates performance details when you enable Collect Performance Data in the Performance settings on the Properties tab of the session properties.
Performance details display information about each Source Qualifier, target definition, and individual transformation. All transformations have some basic counters that indicate the number of input rows, output rows, and error rows.
Any value other than zero in the readfromdisk and writetodisk counters for Aggregator, Joiner, or Rank transformations indicates a session bottleneck.
Small cache size, low buffer memory, and small commit intervals can cause session bottlenecks.
If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes. The PowerCenter Server uses the index cache to store group information and the data cache to store transformed data, which is typically larger. Therefore, although both the index cache and data cache sizes affect performance, you will most likely need to increase the data cache size more than the index cache size.

Informatica PowerCenter 8.0 New Features and Enhancements

Informatica Distributed Services Architecture:

One of the major differences between the two versions is that PowerCenter 7 had a client-server architecture whereas PowerCenter 8 has a service-oriented architecture.

♦ Domains and nodes: PowerCenter 8.0 allows you to scale services and share resources across multiple machines. PowerCenter introduces a PowerCenter domain, the primary unit of administration for the PowerCenter environment, and a PowerCenter node, the logical representation of a machine in a domain.
♦ Services: PowerCenter provides application and core services. Application services represent PowerCenter server-based functionality such as the Repository Service, Integration Service, and SAP BW Service. Core services support the domain and application services. Core services include the Log Service, Licensing Service, and Domain Service.
♦ High availability: You can use the high availability option to eliminate single points of failure in the PowerCenter environment and reduce service interruptions in the event of failure. High availability provides resilience, failover, and recovery for services.
♦ Administration: PowerCenter provides centralized administration through the Administration Console. You perform all administration tasks for the domain and services, including administration of the Repository Service, in the Administration Console.

Command Line Programs:
♦ infacmd. infacmd is a new command line program that allows you to administer PowerCenter domains and services. You can perform tasks such as enabling and disabling services and purging log events.
♦ infasetup. infasetup is a new command line program that allows you to configure domains and nodes. Use infasetup to define domains and nodes, join domains, and configure node passwords, domain licenses, and domain service ports.
♦ pmcmd. pmcmd includes new commands to support Integration Service functionality and new syntax to connect to a domain.
♦ pmrep. pmrep now includes the former pmrepagent commands and new syntax to connect to a domain.

Datatypes:
♦ Flat file definitions: You can configure flat file definitions to use the Integer or Double datatype.

Performance Improvements:
♦ Pushdown optimization: The Integration Service can execute SQL against the source or target database instead of processing the transformation logic within the Integration Service.
♦ Partitioned flat file targets: The Integration Service can create an individual target file for each partition, a merge file for all partitioned flat file target instances, and a file list that references the individual target files for all target partitions.
♦ Append to flat file targets: The Integration Service can append data to existing flat file target and reject files.
♦ Flat file source and target commands: You can specify a command for any flat file source or target instance in a session. You can use a command to generate source data and a command to process target data.
♦ Data compression: Repository data is compressed during communication among the Repository Service, Integration Service, and PowerCenter Client.

Sunday, February 27, 2011

Change Data Capture

In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. CDC is also described as an approach to data integration based on the identification, capture, and delivery of the changes made to enterprise data sources.

CDC solutions occur most often in data-warehouse environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.
  
Methodology
System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system layers from application logic down to physical storage.
In a simplified CDC context, one computer system has data believed to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that does not change the design patterns logically. Not uncommonly, multiple CDC solutions can exist in a single system.

Timestamps on rows 

Tables whose changes must be captured may have a column that represents the time of last change. Names such as LAST_UPDATE, etc. are common. Any row in any table that has a timestamp in that column that is more recent than the last time data was captured is considered to have changed.
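A minimal sketch of this pattern, assuming a hypothetical customers table with a LAST_UPDATE column and a bind variable holding the time of the previous capture:

-- Capture every row changed since the previous extraction.
SELECT *
FROM   customers
WHERE  last_update > :last_capture_time;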

Version Numbers on rows

Database designers give tables whose changes must be captured a column that contains a version number. Names such as VERSION_NUMBER, etc. are common. When data in a row changes, its version number is updated to the current version. A supporting construct such as a reference table with the current version in it is needed. When a change capture occurs, all data with the latest version number is considered to have changed. When the change capture is complete, the reference table is updated with a new version number.
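An illustrative sketch, assuming a hypothetical orders table with a VERSION_NUMBER column and a reference table named capture_control holding the current version:

-- Rows stamped with the current version form the changed set for this capture.
SELECT o.*
FROM   orders o
JOIN   capture_control c ON o.version_number = c.current_version;

-- After the capture completes, advance the reference table to a new version.
UPDATE capture_control
SET    current_version = current_version + 1;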

Status indicators on rows

This technique can either supplant or complement timestamps and versioning. It can serve as an alternative if, for example, a status column is set up on a table row indicating that the row has changed (e.g. a boolean column that, when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or an earlier date, still shouldn't be updated on the target (for example, the data may require human validation).

Time/Version/Status on rows

This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system, however, the combination of time, version, and status provides a particularly powerful mechanism and programmers should utilize them as a trio where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for production."

Triggers on tables

May include a publish/subscribe pattern to communicate the changed data to multiple targets. In this approach, triggers log events that happen to the transactional table into another queue table that can later be "played back". For example, imagine an Accounts table: when transactions are run against this table, triggers fire and store a history of the event, or even the deltas, in a separate queue table. The queue table might have a schema with the following fields: Id, TableName, RowId, TimeStamp, Operation. The data inserted for our Account sample might be: 1, Accounts, 76, 11/02/2008 12:15am, Update. More complicated designs might log the actual data that changed. This queue table could then be "played back" to replicate the data from the source system to a target.
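A minimal sketch of such a trigger, in Oracle syntax, assuming a hypothetical accounts table keyed by account_id and a change_queue table shaped like the one described above:

-- Queue table recording which row changed, when, and how.
CREATE TABLE change_queue (
  table_name  VARCHAR2(30),
  row_id      NUMBER,
  change_time DATE,
  operation   VARCHAR2(10)
);

-- Row-level trigger that logs every insert, update, or delete on ACCOUNTS.
CREATE OR REPLACE TRIGGER trg_accounts_cdc
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW
BEGIN
  INSERT INTO change_queue (table_name, row_id, change_time, operation)
  VALUES ('ACCOUNTS',
          COALESCE(:NEW.account_id, :OLD.account_id),
          SYSDATE,
          CASE WHEN INSERTING THEN 'INSERT'
               WHEN UPDATING  THEN 'UPDATE'
               ELSE 'DELETE'
          END);
END;
/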

Log scanners on databases

Most database management systems manage a transaction log that records changes made to the database contents and to metadata. By scanning and interpreting the contents of the database transaction log one can capture the changes made to the database in a non-intrusive manner.
Using transaction logs for change data capture offers a challenge in that the structure, contents and use of a transaction log is specific to a database management system. Unlike data access, no standard exists for transaction logs. Most database management systems do not document the internal format of their transaction logs, although some provide programmatic interfaces to their transaction logs (for example: Oracle, DB2, SQL/MP and SQL Server 2008).


Friday, February 25, 2011

Interview QA

Informatica/SQL/UNIX/Data warehouse questions:

1. what is the difference between source qualifier and joiner?

Source Qualifier – joins homogeneous database tables, i.e. tables from the same source database (e.g. Oracle or SQL Server); can join any number of tables; supports SQL override.
Joiner – joins heterogeneous sources, including flat files, and can also join homogeneous tables; can join only two inputs per Joiner transformation; does not support SQL override.

In a Joiner, the input with fewer rows should be the master, because the master rows are cached.
The different join types are master outer, detail outer, full outer, and normal join.

2. what is the difference between lookup and joiner?

Lookup – passive transformation; the lookup condition can use multiple relational operators (=, <, >, etc.); returns either the first or the last matching row; supports SQL override.
Joiner – active transformation; the join condition can use only the '=' operator; returns all matching rows; does not support SQL override.


3. what is the difference between filter and router?

Filter – evaluates a single condition; rows that do not satisfy it are dropped, so there is only one output path.
Router – evaluates multiple conditions and routes rows into different groups; it has one input group plus user-defined output groups and a default group, and you can create as many groups as you want.

4. what does normalizer do?
Ans: a) For COBOL file sources, a Normalizer is generated as the source qualifier.
b) In the pipeline, it is used to break one record into multiple records.

5. what does lookup do?
Ans : a) Gets the related value.
b) Updating slowly changing dimensions.

6. what is the difference between connected and unconnected lookup?

Connected lookup – receives input values directly from the pipeline; can use a static or a dynamic cache; can return multiple ports; is used once, where it is placed in the mapping; supports user-defined default values when no match is found.
Unconnected lookup – receives input values indirectly, called from an Expression or Update Strategy transformation; can use only a static cache; returns only one port (the return port); can be called multiple times in a mapping; does not support user-defined default values when no match is found.

7. what is the difference between static and dynamic cache?

Static cache – the cache does not change while the session runs.
Dynamic cache – rows can be inserted into or updated in the cache before they are passed on to the target table. A dynamic lookup has a NewLookupRow port with the values 0 (no change), 1 (insert), and 2 (update). Associated ports are used to compare incoming values against the cache and detect changes.

8. what is the difference between mapping parameters and variables?

Mapping parameters – remain constant throughout the session run. Example: a currency conversion rate or a schema name, such as $$curr_rate or $$schema_name. No special functions are used with them.
Mapping variables – can change during the session run. Example: delta pull versus full pull using $$Lastpulldate and $$Enddate. The functions SETVARIABLE, SETMAXVARIABLE, and SETMINVARIABLE are used with mapping variables.

Session parameters:
Session parameters are used at the session level and remain the same throughout the session. They are represented as $SessionParameterName.

9. what is performance tuning? How do you identify the bottlenecks?
Ans: Performance tuning is the process of improving session performance by removing bottlenecks. One solution is pipeline partitioning.

Steps to identify bottlenecks:
a) Look at the session log and try to find out where the problem is.
b) Target bottleneck – change the relational target to a flat file target and compare the run time.
c) Source bottleneck – run the SQL query directly against the database and measure the time; if it takes a long time, do SQL tuning.
d) Mapping bottlenecks:
   Expression transformation – use DECODE instead of multiple nested IIF statements, and use || instead of multiple CONCAT functions.
   Aggregator transformation – use sorted input.
   Joiner transformation – make sure the smaller table is the master.
   Lookup transformation – add a WHERE clause in the lookup SQL override to restrict the number of rows loaded into the lookup cache (see the sketch below).
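A sketch of that kind of lookup override, assuming a hypothetical customer_dim lookup table in which only current rows should ever be matched:

-- Restricting the lookup query keeps unneeded rows out of the lookup cache.
SELECT customer_id,
       customer_key,
       customer_name
FROM   customer_dim
WHERE  current_flag = 'Y'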

10. what is pipeline partitioning?
Ans: Creating partitions means creating multiple reader, writer, and transformation threads that execute at the same time.
Different types of partitions are:
a) Pass through – simple partitioning, usually on the Source Qualifier.
b) Round robin – balances the load across the partitions, usually on Filter transformations.
c) Key range – we define key ranges for each of the partitions, usually on source and target definitions.
d) Hash key – places data from each group together, usually on Aggregators – tell them that you haven't worked a lot on it.

We can create up to 63 partitions for a particular session; tell them you have created around 11.

11. how do you read xml files?
Ans :
XSD – schema definition file.
XML – actual data file.
It is recommended to import the XSD file.

12. what is difference between star and snowflake schema?

Star schema – denormalized; dimension tables are not broken out into further lookup tables; data is easier to access.
Snowflake schema – normalized; dimension tables are broken out into further lookup tables; access is harder because more joins are involved.

13. what is SCD type 1 and type 2? How do you do the mappings?

SCD Type 1 – no history; if anything changes, the existing record is updated (overwritten).
SCD Type 2 – full history; if anything changes, a new record is inserted. The history is maintained by one of the following: a version number, a flag, or effective dates.
SCD Type 3 – partial history; if anything changes, a new column holds it, so only the first and the latest values are stored.

14. What is the complex mapping that you have worked on?
Ans : SCD Type – 2

15. How do you remove duplicates from a flat file?
Ans: By using an Aggregator and grouping by all the columns from the flat file. To remove duplicates on a UNIX box, use the sort -u command.

16. How do you find out duplicates from the table?
Ans: Example:
SELECT empno, ename, COUNT(*)
FROM emp
GROUP BY empno, ename
HAVING COUNT(*) > 1;

17. What do you do if you get an error in the session?
Ans: Check the error in the session log and try to find out where exactly the error is in the mapping.

18. what is the difference between rank and aggregator?

Rank – active transformation; groups rows and can return the top or bottom n rows within each group; a RANKINDEX port is created automatically.
Aggregator – also active; groups rows and returns only the last row of each group unless aggregate functions such as MAX, MIN, SUM, COUNT, or AVG are applied.


19. How did you do the full pull and delta load?
Ans: Full pull – read the entire table as the source.
Delta pull – read only the changed portion of the table, for example the last 5 days' worth of data.

21. How did you schedule the ETL jobs?
Ans : Used schedulers like
Tivoli workflow scheduler – TWS
Autosys
Crontab.
 

22. How did you do SQL tuning? Do you know anything about explain plan?
Ans : Some of the steps in SQL Tuning :
a)    Try to join tables including all their natural keys.
b)    Try to remove subqueries
c)    Try to reduce aggregations.

23. What is IDQ and IDE?
Ans : IDQ – Informatica Data Quality
IDE – Informatica Data explorer.       If asked tell them that you haven’t worked a lot on it.

24. How did you get the files from the mainframe?
Ans : Mainframe team used to FTP the files onto our UNIX box and sometimes we used to pull those from Mainframe using Informatica Power Exchange.
 

25. which tables will you load first? Fact or Dimension table?
Ans : Dimension tables will be loaded first and then fact tables.

26. what were the sizes of the fact and dimension table that you have worked on?
Ans : Dimension tables – around few thousand records
Fact tables – around 15-20 million records.

27. what are different types of joins and how do you write queries for that?
Ans: Inner join – returns rows only when the join condition matches in both tables.
Left outer join – returns everything from the left table and only the matching rows from the right table.
Right outer join – returns everything from the right table and only the matching rows from the left table.
Full outer join – returns everything from both tables.
Self join – joins a table with itself. (See the example queries below.)
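Example queries, assuming the classic emp and dept sample tables joined on deptno:

-- Inner join: only employees whose department exists in dept.
SELECT e.ename, d.dname
FROM   emp e
INNER JOIN dept d ON e.deptno = d.deptno;

-- Left outer join: every employee, with department details where they exist.
SELECT e.ename, d.dname
FROM   emp e
LEFT OUTER JOIN dept d ON e.deptno = d.deptno;

-- Full outer join: all employees and all departments, matched where possible.
SELECT e.ename, d.dname
FROM   emp e
FULL OUTER JOIN dept d ON e.deptno = d.deptno;

-- Self join: each employee paired with his or her manager from the same table.
SELECT e.ename AS employee, m.ename AS manager
FROM   emp e
JOIN   emp m ON e.mgr = m.empno;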

28. How do you denormalize data?
Ans : look at the interview diagram.

29. What is the difference between active and passive transformations?
Ans: If the number of rows going into the transformation is not equal to the number of rows coming out of it, the transformation is active. If the two counts are always equal, it is passive.

30. What does update strategy do?
Ans : Update strategy is used to flag the rows for the following operations :
DD_INSERT
DD_UPDATE
DD_REJECT
DD_DELETE

31. what is the difference between union and union all, delete and truncate?
Ans: Union all – merges two result sets and keeps all the rows, including duplicates.
Union – merges two result sets and returns only distinct rows.
Delete – can remove selected rows and can be rolled back.
Truncate – removes all the rows and cannot be rolled back. (See the examples below.)
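Illustrative statements, assuming hypothetical tables emp_2010 and emp_2011 with the same structure:

-- UNION ALL keeps every row, including duplicates; UNION returns distinct rows only.
SELECT empno, ename FROM emp_2010
UNION ALL
SELECT empno, ename FROM emp_2011;

SELECT empno, ename FROM emp_2010
UNION
SELECT empno, ename FROM emp_2011;

-- DELETE can target specific rows and can be rolled back before commit.
DELETE FROM emp_2010 WHERE deptno = 10;
ROLLBACK;

-- TRUNCATE removes all rows as a DDL operation and cannot be rolled back.
TRUNCATE TABLE emp_2010;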

32. what does the following commands in UNIX do?
a) ls –ltr – displays all the files with their times and permissions.
b) vi – editor to view the files and make changes in it.
c) cp, mv, rm, mkdir, cat – cp is for copy, mv is for renaming the file, rm is for removing the file, mkdir is for creating directory, cat is for viewing the file and not making changes.
d) grep – used for searching patterns in a file or multiple files.
e) sort -u – used for removing duplicates.
f) pwd – shows present working directory

33. what is the scheduling tool that you have been working on?
Ans : TWS ( Tivoli workflow scheduler )
34. what do you mean by CDC?
Ans : CDC – Change data capture. Picks up only changed records from the source.

35. what did you use PowerExchange for and which version?
Ans : Extracting files from Mainframe.

36. What is the difference between MINUS and INTERSECT?
Ans :
MINUS – If A MINUS B then picks all the records from A which do not exist in B.
INTERSECT – If A INTERSECT B then picks up common records between A and B.
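Illustrative queries, assuming two hypothetical tables table_a and table_b with the same columns:

-- Rows present in table_a but not in table_b.
SELECT col1, col2 FROM table_a
MINUS
SELECT col1, col2 FROM table_b;

-- Rows common to both tables.
SELECT col1, col2 FROM table_a
INTERSECT
SELECT col1, col2 FROM table_b;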

37. what are the new features of version 8? what is Pushdown optimization?
Ans : New transformations are added :
Java transformation
SQL transformation
HTTP transformation

Pushdown optimization – pushes part of the transformation logic from Informatica down to the database to improve performance.

38. what are the transformations that use cache and which are active and which are passive?
Ans: Transformations that use a cache are: Sorter, Aggregator, Lookup, Joiner, and Rank.
Passive transformations : Expression, lookup, stored procedure
Active transformations : others are all active.

39. How do you handle performance issues with aggregator and sorter?
Ans: Use a Sorter transformation before the Aggregator and enable the Aggregator's Sorted Input option; passing sorted input to the Aggregator can improve performance.

40. What is versioning?
Ans: It is used to keep a history of the changes made to mappings and workflows.
Check in: you check in when you are done with your changes so that everyone can see them.
Check out: you check out from the main stream when you want to make a change to a mapping or workflow.
Version history: it shows all the changes made and who made them.

41. what is target load order plan?
Ans: If a mapping has two source qualifiers going to different targets, we can define the order in which the targets are loaded.
Ex: A -> B
    B -> C
Mappings (tab) -> Target Load Order Plan

42. what are mapplets and worklets?
Ans: Mapplets are reusable pieces of mapping logic; you define some complex logic in a mapplet and can reuse it across different mappings in the repository. A mapplet cannot contain a target definition but can contain a source definition. Worklets are the equivalent at the workflow level: reusable sets of tasks that can be used in multiple workflows.

43. what are different types of tasks that you have worked on?
Ans:
a) Email task – sends an email to a group after successful execution of the task.
b) Command task – used to run UNIX commands.
c) Session task – associated with each mapping.
d) Decision task – used like a router.

44. what are pmcmd commands and have you written perl scripts or shell scripts on your own? what did you do in that?
Ans: pmcmd commands are command line commands used to interact with the Informatica server, for example to start and stop workflows. Tell them there were predefined scripts for this that you used to run.

45. What is database partitioning? what are different types?
Ans: Database partitioning is dividing a table into smaller physical parts.
Different types :
a)    Range partitioning
b)    List partitioning
c)    Hash partitioning
d)    Composite partitioning
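A minimal sketch of range partitioning in Oracle syntax, assuming a hypothetical sales fact table partitioned by sale date:

CREATE TABLE sales (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2010 VALUES LESS THAN (DATE '2011-01-01'),
  PARTITION p_2011 VALUES LESS THAN (DATE '2012-01-01'),
  PARTITION p_max  VALUES LESS THAN (MAXVALUE)
);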

46. what are cursors, triggers, stored procedures and functions, views? and what is the difference between them?
Ans: Cursors – a cursor is a mechanism by which you can assign a name to a SELECT statement and manipulate the rows returned by that SQL statement.
Triggers – code that executes automatically when a certain event occurs (for example, an insert, update, or delete on a table).

Stored procedures – reusable objects, may or may not have return value.
Functions – reusable objects, always have return value.
Views – a virtual table; it is not a physical table and returns data according to its SELECT statement.
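Short illustrative examples in Oracle syntax of a view versus a stored function, using hypothetical customers and order_lines tables:

-- A view stores no data of its own; each query re-runs the underlying SELECT.
CREATE OR REPLACE VIEW v_active_customers AS
SELECT cust_id, cust_name
FROM   customers
WHERE  status = 'ACTIVE';

-- A function always returns a value and can be reused in SQL or PL/SQL.
CREATE OR REPLACE FUNCTION f_order_total(p_order_id IN NUMBER)
  RETURN NUMBER
IS
  v_total NUMBER;
BEGIN
  SELECT SUM(amount) INTO v_total
  FROM   order_lines
  WHERE  order_id = p_order_id;
  RETURN NVL(v_total, 0);
END;
/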

47. what is the difference between exists and IN clause?
Ans: IN – compares a value against a list of values and helps reduce the need for multiple OR conditions.
EXISTS – the condition is considered met if the subquery returns at least one row; it is often used with correlated subqueries. (See the examples below.)
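Example queries using the emp and dept sample tables:

-- IN: compare against a list of values instead of chaining OR conditions.
SELECT ename
FROM   emp
WHERE  deptno IN (10, 20, 30);

-- EXISTS: the condition is met as soon as the correlated subquery returns a row.
SELECT d.dname
FROM   dept d
WHERE  EXISTS (SELECT 1 FROM emp e WHERE e.deptno = d.deptno);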

48. what does the debugger do?
Ans : Debugger is used to stop at some breakpoints and look at the data flow. Helps in identifying where exactly the issue is.

49. what is difference between version 7 and version 8?
Ans : In Version 8, new transformations like Java, SQL, HTTP are added and push down optimization feature is introduced.

50. what are different types of stored procedure transformations?
Ans : Normal – connected / unconnected
Source pre load ( Unconnected ) – check existence of tables
Source post load ( Unconnected ) – drop temporary tables
Target pre load ( Unconnected ) – drop indexes on target table
Target post load ( Unconnected ) – recreate indexes on target table.

51. what are presession and post session commands?
Ans: Pre-session SQL commands – defined in the session; the SQL you write there executes before the session starts.
Post-session SQL commands – defined in the session; the SQL you write there executes after the session completes.

52. what do you do when you have problem loading targets?
Ans : drop indexes and recreate them.

53. what is sequence generator transformation?
Ans: It generates sequence IDs (system-generated values). The values are taken from the NEXTVAL port; the CURRVAL port returns NEXTVAL plus the increment value.

54. How did you schedule your informatica jobs?
Ans : Either by writing all workflows in perl/shell scripts and schedule those scripts or schedule the workflow jobs directly.

55. How many batch jobs did you run in a day?
Ans : around 200-300 batch jobs.

56. How much time did each job take to run?
Ans: It depends; some took about an hour and some took 2-3 minutes, depending on the job and its complexity.