Data Integration
Most organizations accumulate a staggering volume of data. This
data is commonly spread across multiple databases and operational
systems. This distributed data architecture makes it difficult to
maintain data consistency and retrieve it across the business.
By integrating your data into a unified database or data view,
Business Intelligence [BI] tools
can help manage the data and make it readily available to BI applications,
such as scorecards, dashboards
and analytic tools.
Data Integration Solutions
A typcial data integration solution will enable data developers
to combine data from multiple operational systems across a company,
integrate similar databases from different sources, an pull syndicated
information or RSS feeds into a database.
ETL
Extract, Transform and Load [ETL] is a toolset that automatically
performs the forementioned three actions between data source and
data destination [data warehouse]:
- Extracting data from outside sources
- Transforming it to fit business needs
- Loading it into the data warehouse.
Data Extraction
Data warehousing projects consolidate data from different source
systems. Often, each separate system uses a different way of organizing
the data [data format].
Common data source formats include:
- Relational databases and flat files
- Non-relational database structures - IMS, VSAM or ISAM.
Extraction converts the data into a format for transformation processing.
Transformation
During transformation, a series of rules or functions are applied
to the extracted data to derive the desired data to be loaded. Some
data sources will require very little manipulation of data; others
may require one or more of the following transformations :
- Selecting only certain columns to load (or selecting null columns
not to load)
- Translating coded values (e.g., if the source system stores
1 for male and 2 for female, but the warehouse stores M for male
and F for female), this is called data cleansing
- Encoding free-form values (e.g., mapping "Male" and
"1" and "Mr" into M)
- Deriving a new calculated value (e.g., sale_amount = qty *
unit_price)
- Joining together data from multiple sources (e.g., lookup,
merge, etc.)
- Summarizing multiple rows of data (e.g., total sales for each
store, and for each region)
- Generating surrogate key values
- Transposing or pivoting (turning multiple columns into multiple
rows or vice versa)
- Splitting a column into multiple columns (e.g., putting a comma-separated
list specified as a string in one column as individual values
in different columns)
Load
Loading data into the data warehouse [DW] depends on the requirements
of the organization. The loading process ranges widely to include
paramters such as:
Data Versioning - whether new data overwrites
existing information to provide a cumulative, updated data set.
Load Schedule - whether data is added incrementally
real time, or batch loaded hourly or overnight.
The loading sequence is a strategic decision based on the time
available and the business needs. More complex systems maintain
a history and audit trail of all changes to the data. Significant
operational problems can occur with improperly designed ETL systems.
ETL Issues
ETL challenges include:
Expectations - the range of data values or data
quality in an operational system may not be visible to the designers
at the time validation and transformation rules are specified. This
is avoided by completing data profiling of a source during data
analysis.
Scalability - The scalability of an ETL system
also needs to be established during analysis. This includes:
- The volumes of data to be processed within Service Level Agreements,
(SLAs).
- The time available to extract from source systems may change,
resulting in less time.
- Increasing volumes of data may require designs that can scale
from daily batch to intra-day micro-batch to integration with
message queues or real-time change data capture (CDC) for continuous
transformation and update.
Consistency - ensuring the data being uploaded
is relatively consistent. Different source databases have different
update cycles [minutes, days, weeks]. This means the ETL system
may be required to hold back certain data until all sources are
available to be synchronized. Data reconciliation may also be required
between data in a source system and that of a downstream system,
such as the general ledger, establishing synchronization and reconciliation
points as necessary.
Parallel Processing
Parellel Processing is a more recent development in ETL software,
improving the performance of ETL processes when dealing with large
volumes of data.
There are 3 main types of ETL parallelism:
Data - splitting a single sequential file into
smaller data files to provide parallel access.
Pipeline - allowing simultaneous running of several
components on the same data stream. An example would be looking
up a value on record 1 at the same time as adding together two fields
on record 2.
Component - the simultaneous running of multiple
processes on different data streams in the same job. Sorting one
input file while performing a deduplication on another file would
be an example of component parallelism.
All three types of parallel processing are usually combined in
a single job.
Data Integration Products
Current Enterprise Application Integration tools include ETL tools
as well as data profiling, data quality and metadata capabilities.
Microsoft SQL Server 2005 Integration Services
Back To Top
For
The World's Leading Guide To BI Strategy, Program & Technology
Data Index | Data Defintion
| Meta Data | Data
Management | MDM | Data
Governance | Data Cleansing | Normalization
| Data Integration | Data
Growth | Data Solutions
|