Tuesday, May 6, 2014

ETL - Data Quality Overview & Definition

IBM : What is Data Quality?
Data quality is an essential characteristic that determines the reliability of data for making decisions. High-quality data is:

  • Complete: All relevant data, such as accounts, addresses, and relationships for a given customer, is linked.
  • Accurate: Common data problems like misspellings, typos, and random abbreviations have been cleaned up.
  • Available: Required data is accessible on demand; users do not need to search manually for the information.
  • Timely: Up-to-date information is readily available to support decisions.
Business leaders recognize the value of big data and are eager to analyze it to obtain actionable insights and improve business outcomes. Unfortunately, the proliferation of data sources and exponential growth in data volumes can make it difficult to maintain high-quality data. To fully realize the benefits of big data, organizations need to lay a strong foundation for managing data quality, with best-of-breed data quality tools and practices that can scale and be leveraged across the enterprise.

#Business value of data quality
Data quality-related problems cost companies millions of dollars annually through lost revenue opportunities, failure to meet regulatory compliance requirements, or failure to address customer issues in a timely manner. Poor data quality is often cited as a reason for the failure of critical information-intensive projects. By implementing a data quality program, organizations can:

  • Deliver high-quality data for a range of enterprise initiatives including business intelligence, applications consolidation and retirement, and master data management
  • Reduce time and cost to implement CRM, data warehouse/BI, data governance, and other strategic IT initiatives and maximize the return on investments
  • Construct consolidated customer and household views, enabling more effective cross-selling, up-selling, and customer retention
  • Help improve customer service and identify a company's most profitable customers
  • Provide business intelligence on individuals and organizations for research, fraud detection, and planning
  • Reduce the time required for data cleansing, saving on average 5 million hours for an average company with 6.2 million records (Aberdeen Group research)

Wikipedia
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any external purpose; for example, a person's age and birth date may conflict within different parts of a database. These views can often be in disagreement, even about the same set of data used for the same purpose. This article discusses the concept as it relates to business data processing, although other kinds of data have quality issues as well.

#Definitions
This list is taken from the online book "Data Quality: High-impact Strategies".

  • Degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
  • The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.
  • The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.
  • The processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.
  • Complete, standards based, consistent, accurate and time stamped.
#Overview
Problems with data quality don't only arise from incorrect data. Inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.

Enterprises, scientists, and researchers are starting to participate within data curation communities to improve the quality of their common data.[10]

The market is going some way to providing data quality assurance. A number of vendors make tools for analysing and repairing poor-quality data in situ, service providers can clean the data on a contract basis, and consultants can advise on fixing processes or systems to avoid data quality problems in the first place. Most data quality tools offer a series of functions for improving data, which may include some or all of the following:

  • Data profiling - initially assessing the data to understand its quality challenges
  • Data standardization - a business rules engine that ensures that data conforms to quality rules
  • Geocoding - for name and address data; corrects data to US and worldwide postal standards
  • Matching or Linking - a way to compare data so that similar but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data; it often recognizes that 'Bob' and 'Robert' may be the same individual (a minimal matching sketch follows this list). It might be able to manage 'householding', or finding links between husband and wife at the same address, for example. Finally, it often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record.
  • Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
  • Batch and real time - Once the data is initially cleansed (batch), companies often want to build data quality processes into enterprise applications to keep it clean (real time).
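To make the matching step concrete, here is a minimal Python sketch of fuzzy duplicate detection. It is an illustration under stated assumptions, not any vendor's algorithm: the NICKNAMES table, the record fields, and the 0.8 similarity threshold are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real matching engines ship far larger knowledge bases.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def canonical_first_name(name: str) -> str:
    """Lowercase the name and expand known nicknames to a canonical form."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def is_probable_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Treat two records as the same person when first names agree after nickname
    expansion and the remaining fields are close enough by fuzzy string similarity."""
    if canonical_first_name(rec_a["first"]) != canonical_first_name(rec_b["first"]):
        return False
    rest_a = f"{rec_a['last']} {rec_a['address']}".lower()
    rest_b = f"{rec_b['last']} {rec_b['address']}".lower()
    return SequenceMatcher(None, rest_a, rest_b).ratio() >= threshold

a = {"first": "Bob", "last": "Smith", "address": "12 W Main St"}
b = {"first": "Robert", "last": "Smith", "address": "12 West Main Street"}
print(is_probable_duplicate(a, b))  # True: 'Bob' expands to 'Robert', the rest is similar enough
```

A production matching engine would typically compare many fields with weighted, per-field similarity measures and survivorship rules for building the 'best of breed' record; this sketch only shows the nickname-expansion and fuzzy-comparison idea.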
#Data Quality Control
Data quality control is the process of controlling the usage of data whose quality is known, for an application or a process. This process usually takes place after a Data Quality Assurance (QA) process, which consists of discovering data inconsistencies and correcting them.

The Data QA process provides the following information to the Data Quality Control (QC) process:

  • Severity of inconsistency
  • Incompleteness
  • Accuracy
  • Precision
  • Missing / Unknown
The Data QC process uses the information from the QA process to decide whether to use the data for analysis, or in an application or business process. For example, if a Data QC process finds that the data contains too many errors or inconsistencies, it rejects the data, as sketched below. Using incorrect data can severely impact output; for example, providing invalid measurements from several sensors to the automatic pilot feature on an aircraft could cause it to crash. Thus, establishing a data QC process protects downstream applications from poor-quality data and establishes safe information usage.
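A minimal sketch of such a QC gate, assuming the QA process hands over simple record counts: the QAReport fields and the 2% error-rate threshold are hypothetical values chosen for illustration, not a recommendation.

```python
from dataclasses import dataclass

@dataclass
class QAReport:
    """Hypothetical summary handed from the QA process to the QC gate."""
    total_records: int
    inconsistent_records: int
    missing_values: int

def qc_gate(report: QAReport, max_error_rate: float = 0.02) -> bool:
    """Accept the data set only if the combined error rate stays below the threshold."""
    errors = report.inconsistent_records + report.missing_values
    error_rate = errors / max(report.total_records, 1)
    return error_rate <= max_error_rate

report = QAReport(total_records=10_000, inconsistent_records=150, missing_values=120)
if qc_gate(report):
    print("Data accepted for downstream processing")
else:
    print("Data rejected: route back to cleansing")  # 2.7% error rate exceeds the 2% gate
```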

#Data Quality Assurance
Data quality assurance is the process of profiling the data to discover inconsistencies and other anomalies, and performing data cleansing activities (e.g. removing outliers, missing-data interpolation) to improve data quality; see the sketch below.

These activities can be undertaken as part of data warehousing or as part of the database administration of an existing piece of application software.
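As a rough illustration of the cleansing activities mentioned above (outlier removal and missing-data interpolation), the following sketch uses pandas. The IQR-based outlier fences and the sample sensor readings are assumptions made for the example, not a prescribed method.

```python
import numpy as np
import pandas as pd

def cleanse(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mask values outside the IQR fences as outliers, then interpolate the gaps."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outliers = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series.mask(outliers).interpolate(limit_direction="both")

# Hypothetical sensor readings with one spike (980.0) and one missing value.
readings = pd.Series([21.0, 21.4, np.nan, 22.1, 980.0, 22.3])
print(cleanse(readings))  # the spike and the gap are replaced by interpolated values
```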

#Criticism of existing tools and processes
The main reasons cited for criticism of existing data quality tools and processes are:

  • Project costs: typically in the hundreds of thousands of dollars
  • Time: lack of enough time to deal with large-scale data-cleansing software
  • Security: concerns over sharing information, giving an application access across systems, and effects on legacy systems

Gartner

#Data Quality Tools
The market for data quality tools has become highly visible in recent years as more organizations understand the impact of poor-quality data and seek solutions for improvement. Traditionally aligned with cleansing of customer data (names and addresses) in support of CRM-related activities, the tools have expanded well beyond such capabilities, and forward-thinking organizations are recognizing the relevance of these tools in other data domains. Product data (often driven by MDM initiatives) and financial data (driven by compliance pressures) are two such areas in which demand for the tools is quickly building.

Data quality tools are used to address various aspects of the data quality problem:

  • Parsing and standardization — Decomposition of text fields into component parts and formatting of values into consistent layouts based on industry standards, local standards (for example, postal authority standards for address data), user-defined business rules, and knowledge bases of values and patterns (a minimal parsing sketch follows this list)
  • Generalized “cleansing” — Modification of data values to meet domain restrictions, integrity constraints or other business rules that define sufficient data quality for the organization
  • Matching — Identification, linking or merging related entries within or across sets of data
  • Profiling — Analysis of data to capture statistics (metadata) that provide insight into the quality of the data and aid in the identification of data quality issues
  • Monitoring — Deployment of controls to ensure ongoing conformance of data to business rules that define data quality for the organization
  • Enrichment — Enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes or geographic descriptors)
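As a small illustration of parsing and standardization, the sketch below decomposes a simplified US-style street line into component parts and expands a few abbreviations. The lookup tables and parsing rules are toy assumptions; real tools rely on postal-authority knowledge bases and far richer pattern libraries.

```python
import re

# Toy lookup tables standing in for postal-authority knowledge bases.
STREET_SUFFIXES = {"st": "Street", "ave": "Avenue", "rd": "Road", "blvd": "Boulevard"}
DIRECTIONS = {"n": "North", "s": "South", "e": "East", "w": "West"}

def parse_street_address(raw: str) -> dict:
    """Decompose a simple US-style street line into parts and standardize the values."""
    tokens = raw.strip().split()
    parts = {"number": None, "direction": None, "street": None, "suffix": None}
    if tokens and re.fullmatch(r"\d+", tokens[0]):
        parts["number"] = tokens.pop(0)
    if tokens and tokens[0].rstrip(".").lower() in DIRECTIONS:
        parts["direction"] = DIRECTIONS[tokens.pop(0).rstrip(".").lower()]
    if tokens and tokens[-1].rstrip(".").lower() in STREET_SUFFIXES:
        parts["suffix"] = STREET_SUFFIXES[tokens.pop().rstrip(".").lower()]
    parts["street"] = " ".join(t.capitalize() for t in tokens) or None
    return parts

print(parse_street_address("12 w  main St."))
# {'number': '12', 'direction': 'West', 'street': 'Main', 'suffix': 'Street'}
```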

The tools provided by vendors in this market are generally consumed by technology users for internal deployment in their IT infrastructure, although hosted data quality solutions are continuing to emerge and grow in popularity. The tools are increasingly implemented in support of general data quality improvement initiatives, as well as within critical applications, such as ERP, CRM and BI. As data quality becomes increasingly pervasive, many data integration tools now include data quality management functionality.
