The Business Intelligence Guide
   BI Strategy | BI Program | BI Projects | BI Data | BI Infrastructure | BI User Tools | BI Vendors | Articles | BI Blog
HOME
 
Business Intelligence
BI Definition
BI Evolution
Updates In BI
 
BI Strategy
Drivers of BI
BI Lifecycle
Setting BI Strategy
BI Strategy Doc
BI Scorecard
BI Guiding Principles
 
BI Programs
BI Governance
BI Program
BI Roadmap
BI Roles
Barriers To BI
 
BI Tools
About BI Tools
OLAP
Scorecards
Dashboards
BI Tools and BPM
Text Mining
 
BI Solutions
BI Software
BI Solution Comparison
BI Vendor Updates
CRM & BI
 
Data
About Data
Data Definition
Data Management
Data Governance
MDM
Metadata
Data Cleansing
Data Integration
 
Databases
About Databases
Data Warehouses
Data Marts
Microsoft SQL
Oracle OODBMS
Contextual Databases
Development Platforms
 
DW Solutions

DW Appliances

Netezza PS
Datallegro
Teradata ADW
 
Industry Solutions
Airline
Health
Retail
Telecommunications
 
Case Studies
BI Case Study Index
Govt Planning Office
Manufacturing Co
Port Logistics
Postal Logistics
Telco Customer Churn
 
RESOURCES
ARTICLES
NEWS
Sitemap

 

Data Integration


Most organizations accumulate a staggering volume of data. This data is commonly spread across multiple databases and operational systems. This distributed data architecture makes it difficult to maintain data consistency and retrieve it across the business.

By integrating your data into a unified database or data view, Business Intelligence [BI] tools can help manage the data and make it readily available to BI applications, such as scorecards, dashboards and analytic tools.

Data Integration Solutions

A typcial data integration solution will enable data developers to combine data from multiple operational systems across a company, integrate similar databases from different sources, an pull syndicated information or RSS feeds into a database.

 

ETL

Extract, Transform and Load [ETL] is a toolset that automatically performs the forementioned three actions between data source and data destination [data warehouse]:

  • Extracting data from outside sources
  • Transforming it to fit business needs
  • Loading it into the data warehouse.

Data Extraction

Data warehousing projects consolidate data from different source systems. Often, each separate system uses a different way of organizing the data [data format].

Common data source formats include:

  • Relational databases and flat files
  • Non-relational database structures - IMS, VSAM or ISAM.

Extraction converts the data into a format for transformation processing.

Transformation

During transformation, a series of rules or functions are applied to the extracted data to derive the desired data to be loaded. Some data sources will require very little manipulation of data; others may require one or more of the following transformations :

  • Selecting only certain columns to load (or selecting null columns not to load)
  • Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female), this is called data cleansing
  • Encoding free-form values (e.g., mapping "Male" and "1" and "Mr" into M)
  • Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
  • Joining together data from multiple sources (e.g., lookup, merge, etc.)
  • Summarizing multiple rows of data (e.g., total sales for each store, and for each region)
  • Generating surrogate key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)

Load

Loading data into the data warehouse [DW] depends on the requirements of the organization. The loading process ranges widely to include paramters such as:

Data Versioning - whether new data overwrites existing information to provide a cumulative, updated data set.

Load Schedule - whether data is added incrementally real time, or batch loaded hourly or overnight.

The loading sequence is a strategic decision based on the time available and the business needs. More complex systems maintain a history and audit trail of all changes to the data. Significant operational problems can occur with improperly designed ETL systems.

 

ETL Issues

ETL challenges include:

Expectations - the range of data values or data quality in an operational system may not be visible to the designers at the time validation and transformation rules are specified. This is avoided by completing data profiling of a source during data analysis.

Scalability - The scalability of an ETL system also needs to be established during analysis. This includes:

  • The volumes of data to be processed within Service Level Agreements, (SLAs).
  • The time available to extract from source systems may change, resulting in less time.
  • Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues or real-time change data capture (CDC) for continuous transformation and update.

Consistency - ensuring the data being uploaded is relatively consistent. Different source databases have different update cycles [minutes, days, weeks]. This means the ETL system may be required to hold back certain data until all sources are available to be synchronized. Data reconciliation may also be required between data in a source system and that of a downstream system, such as the general ledger, establishing synchronization and reconciliation points as necessary.

 

Parallel Processing

Parellel Processing is a more recent development in ETL software, improving the performance of ETL processes when dealing with large volumes of data.

There are 3 main types of ETL parallelism:

Data - splitting a single sequential file into smaller data files to provide parallel access.

Pipeline - allowing simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.

Component - the simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a deduplication on another file would be an example of component parallelism.

All three types of parallel processing are usually combined in a single job.

 

Data Integration Products

Current Enterprise Application Integration tools include ETL tools as well as data profiling, data quality and metadata capabilities.

Microsoft SQL Server 2005 Integration Services

 

Back To Top

For The World's Leading Guide To BI Strategy, Program & Technology


Data Index | Data Defintion | Meta Data | Data Management | MDM | Data Governance | Data Cleansing | Normalization | Data Integration | Data Growth | Data Solutions

 


NOW AVAILABLE!

The Logical Organization
A Strategic Guide To Corporate Performance Using Business Intelligence

THE ULTIMATE BI REFERENCE
FOR MANAGERS & CONSULTANTS

The Logical Organization Book Cover



Feature Articles

Using BI To Drive Corporate Performance

Pervasive BI - The Next Step in BI Excellence

The Executive Guide to BI Tools and Solutions

The Executive Guide To Understanding Corporate Data

Using Business Intelligence To Power Boost Corporate Performance