Just, DataStage

Tuesday, November 8, 2011

BUY v/s BUILD:

When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:

Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.

Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.

Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.

Important Data WareHousing Terms

Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster.

Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.

Conformed Dimension: A dimension that has exactly the same meaning and content when being referred from different fact tables.

Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.

Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon).

Data Warehousing: The process of designing, building, and maintaining a data warehouse system.

Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension.

Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the "fact", or measures.

Dimensional Table: Dimension tables store records related to this particular dimension. No facts are stored in a dimensional table.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute.

ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.

Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions.

Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension.

Metadata: Data about data. For example, the number of tables in the database is a type of metadata.

Metric: A measured value. For example, ‘total sales’ is a metric.

MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional cubes.

OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of slicing and dicing the data.

ROLAP: Relational OLAP. ROLAP systems store data in the relational database.

Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table.

Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.

Reading & Writing Files - Stage Properties

Reading & Writing Files - Stages

Stage Name

Input Link

Output Link

Reject Link

Executes in

Data Set

Stage

Can be configured to execute in parallel or sequential mode

Sequential File Stage

The stage executes in parallel mode if reading multiple files but executes sequentially if it is only reading one file.

The Stage executes in parallel if writing to multiple files, but executes sequentially if writing to a single file.

File Set Stage

It only executes in parallel mode.

Look up File Set Stage

(must be reference link)

The stage can be configured to execute in parallel or sequential mode when used with an input link.

External Source Stage

(Takes input from 1 or more source programs)

It can be configured to execute in parallel or sequential mode.

External Target Stage

(Allows you to write data to one or more source programs)

It can be configured to execute in parallel or sequential mode.

Complex Flat File Stage

(used as source)

(Reads from one or more complex flat files)

CFF source stages run in parallel mode when they are used to read multiple files, but you can configure the stage to run sequentially if it is reading only one file with a single reader.

Complex Flat File Stage

(used as target)

(You can write data to one or more

Can write to one or more complex flat files. But cannot write to MVS data sets or to files that contain multiple record types.

Tuesday, September 6, 2011

OLAP Basic Concept

OLAP stores data in multi-dimension; it helps the user to view the data in different point of view. OLAP database is not necessarily required to be as large as data warehouse since we use only a part of data in the data warehouse to work on the analysis. OLAP can be defined as a decision support system that allows user to quickly analyze information that has been summarized into multi-dimensional views and hierarchies.

There are three main features of OLAP system :

ü Multidimensional Viewing – OLAP supports multidimensional model which consists of facts and dimensions also called as Star Schema.

ü Calculation Intensive Capabilities – Due to data is stored in facts and dimensions tables, it enables users to analyze data without much calculations.

ü Time Series analysis – Enables users to analyze data across time.

*This is just a basic idea, read more details if required.

ETL Basic Concepts

ETL stands for Extract Transform & Load; this is the foundation of Data Integration Systems. The concept which we are talking here is very simple and self explanatory. Here you go…

E – Extract:

Extracting data from different sources, where each source contains different types of data and in different formats.

T – Transform:

Transform the data into a required unified format.

L – Load:

Load the data into the required system. Normally it will be a Data Warehouse or any other applications.

What is done during processing the data in the ETL?

Below is the processing we normally do in ETL.

1. Aggregating Data

2. Cleansing Data

3. Deriving Data

4. Filtering Data

5. Integrating Data

6. Validating Data

Data Warehouse Basic Definition

Data Warehouse is a just like a database which stores large amount of data usually the data stored in the data warehouse are old data not current data. Organizations use these data stored over years in the data warehouse to analyze the trend and to forecast the future. This analysis is very helpful to find some unknown patterns in the Business which might turn the profits higher.

Data warehouse mainly has 3 functional layers: Staging Layer, Integration Layer and Access Layer

In Staging Layer, the raw data is stored.

In Integration Layer, some data are integrated together as per the requirement and are stored in this layer.

In Access Layer, data are ready to be accessed in the format which will be helpful for generating the reports.

What is DataStage?

The very first question which arises in people’s mind is what is it? Here is the definition by wikipedia.com, “IBM InfoSphere DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition and the Enterprise Edition.”

IBM is the company which has developed this product; this product comes under a product line called InfoSphere. InfoSphere contains any many products (you can check out the products using the link http://www-01.ibm.com/software/data/infosphere/ )

ETL stands for Extract, Transform & Load, if you don’t know ETL concept I feel its better you get some brief knowledge about it and then continue reading.

ETL concept is used with the Data Warehouse, because we are going to handle billions or trillions or even more number of records in Data Warehouse and it will be dealt differently not like the way we deal with few thousands of records.

I suggest you to read about Data Warehouse, OLAP, OLTP and difference between Data Warehouse and Data Base before continuing reading on DataStage.