
Thursday, 23 April 2015

Informatica ETL process implementation

Parameters and variables in Informatica PowerCenter

Purpose

A mapping can use parameters and variables to store information during execution. Each parameter and variable is defined with a specific data type, and their main purpose is to provide increased development flexibility.
Parameters differ from variables in that:
  • The value of a parameter is fixed during the run of the mapping
  • The value of a variable can change during run-time

Both parameters and variables can be accessed from any component in the mapping that supports them.

    To create a parameter or variable, go to Mapping -> Parameters and Variables from within the Mapping Designer in the Designer client.
    The format is $$VariableName or $$ParameterName
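
    For example, a mapping parameter can be referenced in the Source Qualifier's Source Filter property to limit the rows extracted. A minimal sketch (the parameter name $$StartDate and the date format mask are assumptions, not taken from this article):

      LOAD_DATE >= TO_DATE('$$StartDate', 'YYYY-MM-DD')

    Because parameters are expanded as plain text before the query runs, string and date parameters usually need to be wrapped in quotes inside the expression.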

    Changing values of Variables

    To change the value of a variable, one of the following functions can be used within an expression: SETMAXVARIABLE($$Variable, value), SETMINVARIABLE($$Variable, value), SETVARIABLE($$Variable, value), SETCOUNTVARIABLE($$Variable), where:
  • SETVARIABLE sets the variable to a value that you specify (executes only if a row is marked as insert or update). At the end of a successful session, the Integration Service saves either the MAX or MIN of (start value, final value) to the repository, depending on the aggregation type of the variable. Unless overridden, it uses the saved value as the start value of the variable for the next session run.
  • SETCOUNTVARIABLE - increments a counter variable: +1 if the row type is Insert, -1 if the row type is Delete, and 0 for Update and Reject rows.
  • SETMAXVARIABLE - compares the current value to the value passed into the function, returns the higher value and sets the variable to it.
  • SETMINVARIABLE - compares the current value to the value passed into the function, returns the lower value and sets the variable to it.
    At the end of a successful session, the values of variables are saved to the repository. The SetVariable function writes the final value of a variable to the repository based on the Aggregation Type selected when the variable was defined.
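
    For instance, a common incremental-load pattern is to track the highest timestamp processed so far in an output port of an Expression transformation. A hedged sketch (the variable $$MaxLoadDate and the port LAST_UPDATED are assumptions):

      SETMAXVARIABLE($$MaxLoadDate, LAST_UPDATED)

    After a successful run, the saved value can be used as the start value of the variable in the next run, e.g. in a Source Filter, so that only new or changed rows are picked up.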

    Parameter files

    A parameter file is an ASCII file used to set the values of mapping parameters and variables. Values can be set at the workflow, worklet or session level. The physical location of a parameter file is set in the Workflow Manager in Workflows -> Edit. It can also be specified using the pmcmd command when starting a session or workflow task.
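
    A sketch of the pmcmd approach (the service, domain, user, folder and file names below are made up for illustration):

      pmcmd startworkflow -sv IntegrationSvc_01 -d Domain_01 -u Administrator -p password -f DWH_PROJECT -paramfile /opt/infa/param/wkf_daily_loading.par wkf_daily_loading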

    Parameter file structure

    Parameters can be grouped into the following sections:
    • [Global]
    • [Service: service name]
    • [folder name.WF:workflow name]
    • [folder name.WF:workflow name.WT:worklet name]
    • [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
    • [folder name.WF:workflow name.ST:session name]
    • [folder name.session name]
    • [session name]

    Examples / useful tips

    • The value is initialized by the specification that defines it; however, it can be set to a different value in a parameter file specified for the session task
    • Initialization priority of parameters: parameter file, declared initial value, default value
    • Initialization priority of variables: parameter file, repository value, declared initial value, default value
    • Parameters and variables can only be used inside the object in which they were created
    • Parameters and variables can be used in pre- and post-SQL (see the sketch after the sample parameter file below)
    • Sample parameter file:
      [Service:IntegrationSvc_01]
      $$SuccessEmail=dwhadmin@etl-tools.info
      $$FailureEmail=helpdesk@etl-tools.info
      [DWH_PROJECT.WF:wkf_daily_loading]
      $$platform=hpux
      $$DBC_ORA=oracle_dwh
      [DWH_PROJECT.WF:wkf_daily_loading.ST:s_src_sa_sapbw]
      $$DBC_SAP=sapbw.etl-tools.info
      $$DBC_ORA=oracle_sap_staging
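
      Building on the pre/post-SQL tip above, a mapping parameter can appear in a session's pre-SQL statement, for example to clear a staging table for the day being loaded. A minimal sketch (the table STG_CUSTOMERS and the parameter $$LoadDate are assumptions; TO_DATE here is executed by the target database):

        DELETE FROM STG_CUSTOMERS WHERE LOAD_DT = TO_DATE('$$LoadDate', 'YYYY-MM-DD')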

      Workflow tasks

      Purpose

      Tasks are designed in the Workflow Manager and represent an executable set of actions, functions or commands which define the runtime behavior of the whole ETL process.
      Some of the most often used tasks:
    • Session task - needed to run a mapping
    • Assignment task - establishes the value of a Workflow Variable, which can be used at a later point in the workflow, e.g. as testing criteria to determine if (or when) other workflow tasks/pipelines should be run.
    • Decision task - enables the workflow designer to set criteria by which the workflow will or will not proceed to the next set of tasks, depending on whether the criteria evaluate to true or false. It tests for a condition during the workflow and sets a flag based on it (see the sketch after this list).
    • Email task - sends an email from within a workflow. Email addresses, a subject line and the email message text can be defined. When called from within a Session task, the message text can contain variable session-related metadata.
    • Event wait task - pauses processing of the pipeline until a specified event occurs. Events can be pre-defined (file watch) or user-defined (created by an Event Raise task elsewhere in the workflow)
    • Command task - specifies one or more UNIX commands or shell scripts, or DOS commands or batch files for Windows servers, to run during a workflow. Command task status (success or failure) is held in the task-specific variable: $command_task_name.STATUS
      UNIX and DOS commands can also be run pre- or post- session in a Session task.
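
      For illustration, a Decision task condition and a downstream link condition might look as follows (the task names are assumptions; $taskname.Status and $decisionname.Condition are the predefined task variables):

        Decision task condition:  $s_load_customers.Status = SUCCEEDED AND $s_load_orders.Status = SUCCEEDED
        Link condition:           $dec_loads_ok.Condition = TRUE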

      Examples / useful tips

    • Session, Email and Command tasks can be reusable.
    • Reusable tasks are created within the Task Developer

      Workflow

      Purpose

      A workflow, designed in the Workflow Manager, is a collection of tasks that describe runtime ETL processes. In IBM InfoSphere DataStage terms, Workflows correspond to Job Sequences; they are Flows in Ab Initio and Jobs in Pentaho Data Integration.

      Examples / useful tips

    • Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created with any text editor, such as Notepad or vi.
    • When developing a sequential workflow, it is a good idea to use the Workflow Wizard to create sessions in sequence. Dependencies between the sessions can be created.
    • Session parameters must be defined in a parameter file. Since session parameters do not have default values, when the Integration Service cannot locate the value of a session parameter in the parameter file, the session initialization fails (see the sample entries after this list).
    • On under-utilized hardware systems, it may be possible to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the Integration Service node. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.
    • Incremental aggregation is useful for applying captured changes in the source to aggregate calculations in a session.
    • From the Workflow Manager Tools menu, select Options and enable 'Show full names of tasks'. This will show the entire name of all tasks in the workflow.
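
      A sketch of session parameter entries in a parameter file (the folder, workflow, session and parameter names are made up; $DBConnection... and $InputFile... are the built-in session parameter prefixes):

        [DWH_PROJECT.WF:wkf_daily_loading.ST:s_load_customers]
        $DBConnection_SRC=oracle_dwh
        $InputFile_Customers=/data/incoming/customers.dat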

      Design and development best practices

      Mapping design tips

      • Standards - sticking to consistent standards is beneficial in the long run. This includes naming conventions, descriptions, environment settings, parameter files, documentation, etc.
      • Reusability - in order to be able to react quickly to potential changes, use Informatica components such as mapplets, worklets and reusable transformations wherever possible.
      • Scalability - when designing and developing mappings, it is good practice to keep volumes in mind. This concerns caching, queries, partitioning, and initial vs. incremental loads.
      • Simplicity - it is recommended to create multiple simple mappings instead of a few complex ones. Use a staging area and try to keep the processing logic as clear and simple as possible.
      • Modularity - use modular design techniques (common error handling, reprocessing).

      Mapping development best practices

      • Source Qualifier - use shortcuts, extract only the necessary data, limit the columns and rows read from the source. Try to use the default query options (User Defined Join, Filter) instead of an SQL Query override, which may strain database resources and prevent the use of partitioning and pushdown optimization.
      • Expressions - use local variables to limit the amount of redundant calculations, avoid datatype conversions, reduce the invoking of external scripts (coding outside of Informatica), provide comments, and use operators (||, +, /) instead of functions where possible. Keep in mind that numeric operations are generally faster than string operations.
      • Filter - use the Filter transformation as close to the source as possible. If multiple filters need to be applied, it is usually more efficient to replace them with a Router.
      • Aggregator - use sorted input, place it as early (as close to the source) as possible, and filter the data before aggregating.
      • Joiner - try to join the data in the Source Qualifier if possible, and avoid outer joins. It is good practice to use the source with fewer rows as the Master source.
      • Lookup - a relational lookup should only return ports that meet the condition. Call an Unconnected Lookup from an expression (IIF), as in the sketch after this list. Replace large lookup tables with joins when possible. Review the database objects and add indexes to database columns when possible. Use the Cache Calculator in the session to eliminate paging in the lookup cache.
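
      A minimal sketch of calling an unconnected Lookup from an Expression port (the lookup name LKP_GET_CUSTOMER_NAME and the port names are assumptions):

        IIF(ISNULL(CUSTOMER_NAME), :LKP.LKP_GET_CUSTOMER_NAME(CUSTOMER_ID), CUSTOMER_NAME)

      The lookup is invoked only for rows where CUSTOMER_NAME is missing, which is what makes the unconnected variant cheaper than a connected lookup in this situation.
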
    Informatica Tutorials

    Informatica basics:
  • Informatica components and architecture - Informatica PowerCenter services, client applications and modules. 
  • Informatica ETL programs - information on basic Informatica components such as sources, targets, mappings, sessions, workflows 
  • Mapping development tips - useful advice, best practices and design guidelines.
  • Informatica PowerCenter weaknesses - things that make an Informatica developer's life harder.

    Informatica Mappings development

    Below is a description of a few typical and widely used development components:
  • Source qualifier - conversion of source datatypes to Informatica datatypes
  • Expression - transforming data with functions
  • Lookups - joining data with lookups
  • Sorter & Aggregator - sorting and aggregating data
  • Update Strategy expressions
  • Transaction control - how to create transactions and control commits and rollbacks
  • Java transformation - invoking Java methods, variables, third-party APIs and built-in packages with the Java transformation

    Informatica ETL process implementation

  • Parameters and variables in Informatica
  • Tasks - workflow tasks
  • Workflows - representation of the ETL process
  • Design best practices - Informatica mapping design tips, tricks and best practices 

    Informatica extensions and optional features

  • PowerCenter Enterprise Grid - cost-effective scalability to ensure enhanced data integration and a reduction of the time needed to respond to business changes
  • Unstructured data extension for Informatica - with the Unstructured Data Option, data of any format can be easily read and integrated
  • PowerExchange for TIBCO - part of the PowerExchange set of extensions, a virtual bridge between Informatica solutions and TIBCO itself
  • PowerExchange for SAP NetWeaver - integration between Informatica and SAP
  • Data masking components - an out-of-the-box data masking option in Informatica
  • Real Time edition - the PowerCenter platform supplied with numerous real-time operational data integration capabilities.

    Informatica

    Informatica, founded in 1993, is a leader in providing enterprise data integration solutions. It provides applications for the following data integration areas:
  • Data Migration - ERP and legacy systems consolidation, new application implementation
  • Data Synchronization - business-to-business data transfer
  • Data Warehousing
  • Data Hubs, Data Marts
  • Master Data Management (MDM)
  • Business Intelligence Reporting
  • Business Activity Monitoring (BAM)

    Informatica PowerCenter components

    Services

    The following services form the Informatica PowerCenter processing engine:
  • Integration Service - conducts and implements the ETL logic
  • Repository Service - manages connectivity to the metadata repositories, where mapping and workflow definitions are saved
  • Repository Service Process - retrieves, inserts and updates repository metadata. It is a multi-threaded process.
  • Repository - a database which contains the ETL metadata

    Client Applications

    Informatica client applications are desktop tools used to create transformations, manage metadata, execute ETL processes and monitor them.
  • Designer - a developer tool used for the development of ETL mappings.
  • Workflow Manager - create and start workflows (sequences of mappings)
  • Workflow Monitor - monitors workflows
  • Repository Manager - tool used to manage source/targets connections, folders, objects, users, etc.
  • Administrator console - a web-based tool used to perform domain and repository service tasks (configure services, nodes, perform backups, etc.)

    Informatica ETL programs

    Informatica ETL programs are designed in one of the client applications (Designer or Workflow Manager) and are responsible for the whole ETL process execution.
    In Informatica there are a few important object definitions:
  • Mapping - developed in Designer, it logically defines the ETL process. Mappings read data from the sources, apply transformation logic to the data and write transformed data to targets.
  • Transformations are elements of a mapping which generate, modify or pass the data. Transformations are linked together with links through ports (input and output) within a mapping. 
    There are two groups of transformations: Passive - where the number of rows entering and exiting the transformation is the same (e.g. Expression); Active - where the number of output rows may differ from the number of input rows (e.g. Aggregator).
    Examples of transformations: Source qualifier, expression, filter, sorter, aggregator, joiner, lookup, update strategy, router, transaction control, union.
  • Task - designed in the Workflow Manager, it is an executable set of actions, functions or commands. A sequence of tasks defines the runtime behavior of an ETL process.
    Example: a Session task runs a mapping, a Command task executes a shell script, an Email task sends an email.
  • Workflow - designed in the Workflow Manager, it is a collection of tasks that describe runtime ETL processes. Tasks can be linked sequentially or concurrently and can also depend on the completion status of previous tasks. In the Workflow Manager canvas, each linked task icon represents one task.

    Informatica mapping development

    This article provides a checklist of topics to consider during an Informatica ETL development project and covers a variety of tips, guidelines and things to consider before proceeding with the development.

    General ETL development tips

  • Before designing a mapping, it is important to have a clear picture of the end-to-end processes that the data will flow through.
  • It is a good practice to create a high-level view of the mapping first and document a picture of the process with the mapping, using a textual description to explain exactly what the mapping is supposed to accomplish and the methods or steps it will follow to accomplish its goal.
  • Next, document the details at the field level, listing each of the target fields and the source fields that are used to create the target field, along with the transformations used to create this field (for example: a sum calculation, a concatenation of two fields, a comparison of two fields, etc.). At this point the designer may have to investigate some business rules with the business users.
  • Create an inventory of Mappings and Reusable objects (mapplets, worklets). This will be a 'work in progress' list and will have to be continually updated (this is particularly valuable for the lead developer).
  • The administrator or lead developer should gather all of the potential Sources, Targets and Reusable objects and place these in a shared folder accessible to all who may need access to them.
  • As for Reusable objects, they need to be properly documented to make it easier for other developers to determine if they can be re-used.
  • From a developer's perspective, the specifications for a mapping should include the required Sources and Targets, additional information regarding derived ports, and finally how the ports relate from source to target.
  • Document any other information about the mapping that is likely to be helpful in developing the mapping. This may, for example, include source and target database connection information (database schema owners, passwords and connect strings), lookups and how to match data in the lookup tables, data cleansing needed at a field level, potential data issues at a field level, any known issues with particular fields, pre or post mapping processing requirements, and any information about specific error handling for the mapping.
  • The completed mapping design should then be reviewed with one or more team members for completeness and adherence to the business requirements. In addition, the design document should be updated if the business rules change or if more information is gathered during the build process.

    Informatica-specific mapping development guidelines

  • One of the first things to do is to bring in all required source and target objects into the mapping.
  • Only connect fields that are needed or will be used.
  • Only connect from the Source Qualifier those fields needed subsequently.
  • Filter early and often. Only manipulate data that needs to be moved and transformed. Reduce the number of non-essential records that are passed through the mapping.
  • Decide if a Source Qualifier join will net the result needed versus creating a Lookup to retrieve desired results.
  • Reduce the number of transformations. An excessive number of transformations increases overhead.
  • Consider increasing the shared memory when using a large number of transformations.
  • Make use of variables, local or global, to reduce the number of times functions have to be called (see the sketch after this list).
  • Watch the data types. The Informatica engine converts compatible data types automatically, but an excessive number of conversions is inefficient.
  • Make use of variables, reusable transformations and mapplets for reusable code. These will leverage the work done by others.
  • Use active transformations early in the process to reduce the number of records as early in the mapping as possible.
  • When joining sources, select appropriate driving and master table.
  • Utilize single pass reads. Design mappings to utilize one Source Qualifier to populate multiple targets.
  • Remove or reduce field-level stored procedures. These will be executed for each record and slow performance.
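
    As a sketch of the local-variable tip above: in an Expression transformation, a variable port can hold an intermediate result that several output ports reuse, so the calculation runs once per row (the port names and the tax rate are assumptions):

      v_NET_AMOUNT   (variable port):  QUANTITY * UNIT_PRICE
      OUT_NET        (output port):    v_NET_AMOUNT
      OUT_GROSS      (output port):    v_NET_AMOUNT * 1.23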

    Informatica Powercenter weaknesses

    Things that make an Informatica developer's life harder and sometimes make them laugh.

    Searching the internet, one can find a lot of articles on why Informatica is the best ETL tool, but there is very little information on its weaknesses. Informatica PowerCenter is still a leading ETL tool on the market; however, it is far from perfect. This article covers things that are never revealed in a sales PowerPoint presentation.
    This is a compilation of weaknesses of Informatica from the developer's perspective. Hopefully, after reading it, someone will avoid getting into trouble.

    Recently on a project I'm taking part in, we migrated and switched from Informatica 8.6.1 to 9.1.0. I was hoping most of the issues described below would be resolved but apparently it's not the case.

    Sorting values

    Dates and numbers are sorted as string values, which frankly looks like amateur work. A few examples below:
    Informatica Repository Manager - incorrectly sorted numbers
    Is 2 really a bigger number than 14? :)

    And here is how sessions are sorted in ascending order by date:
    Informatica Repository Manager - dates sorted incorrectly

    Lack of sorting in Workflow Monitor

    For instance, you cannot order sessions by Name, Start Time or Completion Time. In fact, reading the execution logs is a pain, especially when a workflow has many sessions. By default the sessions are sorted by execution order (start time); however, when Workflow Monitor is opened while a workflow is still running, the sessions appear in a random order.

    Workflow Monitor

    • In Workflow Monitor, why is there no option to filter out folders that are not of one's interest? In big PowerCenter installations there are dozens of available folders. Typically, a developer has access to maybe three, at most five, but the folder tree on the left-hand side shows all of them. It takes time for Workflow Monitor to load and it consumes network traffic.
    • It is also a pity that Informatica does not provide a web-based workflow execution monitoring tool.

    Repository Manager

    Moving objects from one folder to another.

    Let's say we want to copy workflows, sessions, mappings and sources + targets from one folder to another.
    If we do it in Repository Manager using the copy & paste method, or by dragging and dropping the main workflow, all the objects will be copied. But... the problem is that the source filter set at the session level gets reverted to the value from the mapping! So if there is a source filter which limits the data in some way (a WHERE SQL filter statement), it gets wiped out. So be careful when copying sessions.
    This does not happen when doing 'export to XML' and 'import from XML' in the Repository Manager, though.
    Wouldn't it be handy to be able to choose whether the values should be reverted?

    Importing XML export files

    Importing XML repository export files - in the folder selection window, where you map folders from the XML file to the available folders, you can't type in the folder name. You need to click ... and it seems that Informatica then reads all the available folders and the user's permissions. That would be fine, except that for me it sometimes takes around 15 minutes to load this window. Way too long.

    Mappings and workflows development

    Lack of functionality compared to other tools or things that might be improved:
    • Lack of the possibility to do loops within Informatica workflows.
    • Parameter file handling. The concept of using a parameter file is fine; however, it would be nice to be able to set parameters dynamically during runtime. Editing the parameter file for each run is rather painful. This would be very helpful, especially in the development/testing phase.
    • It is not possible to run a single session within a worklet. Sometimes, again especially for testing/development, it is handy to run sessions one by one. This works in workflows, so why is it not possible within worklets?
    • When a workflow is running, it is not possible to run another session (the error message says: Could not start execution of this workflow because the current run on this Informatica Server has not completed yet.). Enabling the concurrent execution option on the workflow level sometimes helps, though not always. Why is it not possible to put such a session in a queue?
    • Lack of a refresh option in Designer and Workflow Manager. If, for instance, a source definition changes, a developer needs to disconnect and connect again. In Repository Manager the refresh option is available, though.
    • In Workflow Manager (it can also happen in Designer), when a mapping is modified, after coming back to the session it can't be validated, saved or refreshed due to the following error: REP_12355 The object ... has been modified since the time it was read.
      The only option then is to disconnect from the repository and connect again, or to restart the tool.
      It would be so much easier to be able to do a refresh or, if this causes a conflict, to be able to choose what to do. Unfortunately the only option available is a restart. Not nice.

      Source Qualifier transformation

      Purpose

      Source Qualifier (SQ) Transformation selects records from various sources - those can be flat files, relational tables, Informatica PowerExchange services. It provides an efficient way to filter input columns (or fields) and to perform joins. 

      Source Qualifier converts source datatypes to native (most compatible) PowerCenter datatypes. It is a mandatory component for relational table sources and flat files. Source Qualifier is an Active transformation (which means that the number of output rows may differ from the number of input rows). It generates a SQL query for relational sources and can perform joins between relational tables on the same database.

      Converting datatypes

      Data can be converted from one type to another by:
    • Passing data between ports with different datatypes. Datatypes must be compatible or explicitly converted.
    • Passing data from an expression to a port
    • Using transformation functions or arithmetic operators
      More detailed information on converting data types can be found in the Informatica Client Help, in the 'Port-to-Port Data Conversion' topic.
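
      For the third option, a sketch of explicit conversions in an Expression transformation (the port names and format masks are assumptions; TO_DATE, TO_DECIMAL and TO_CHAR are standard PowerCenter functions):

        TO_DATE(ORDER_DATE_STR, 'YYYYMMDD')
        TO_DECIMAL(AMOUNT_STR, 2)
        TO_CHAR(LOAD_DATE, 'YYYY-MM-DD')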

      Example

      A sample use of Source Qualifier for a flat file source (top) and an ODBC database source (bottom). Note the datatype change for some columns.

      Useful tips

    • For relational data sources, provide only the columns (ports) which are used in the mapping (based on the output ports).
    • Use the 'SQL Query' option in the Properties tab to override the default SQL, for example to add a join or a filter (a sketch below).
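
      A hedged sketch of such an override (the table, column and parameter names are assumptions); note that the selected columns must match the number and order of the connected Source Qualifier ports:

        SELECT c.CUSTOMER_ID, c.CUSTOMER_NAME, o.ORDER_ID, o.AMOUNT
        FROM CUSTOMERS c
        JOIN ORDERS o ON o.CUSTOMER_ID = c.CUSTOMER_ID
        WHERE o.LOAD_DATE >= TO_DATE('$$StartDate', 'YYYY-MM-DD')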