Data quality plays an important role in the effectiveness of today's businesses, as the cost of poor data quality can be catastrophic for a business regardless of its size. This thesis focuses on methods for measuring, assessing and improving data quality in the QAS Technical Services Snap-shot database. The Snap-shot database is used by the Technical Services department to find information on Service Requests (requests for service from the customers of QAS).
The topic of data quality was researched in depth to devise a number of subjective and objective assessment techniques to help measure and assess the current data quality level. The research was used to define a set of 7 core data quality dimensions to test on the Snap-shot database, and a number of interpretable data quality formulas were used to measure each dimension. The causes behind the problems found during measurement were then identified. Finally, a comprehensive set of improvement recommendations was devised using an array of tools and techniques, including SQL querying and data mining. The effectiveness of the improvements was measured using the devised data quality formulas; it was found that the overall data quality of the database improved by an average of 4.65% across the data quality dimensions used.
Table of Contents
List of Tables
Table 1.1: Risk Assessment
Table 1.2: Risk Control
Table 5.1: Data Mining Comparison
Table 6.1: Overall Data Quality
List of Figures
Figure 2.1: Academics' View of Data Quality
Figure 2.2: Cited Dimensions
Figure 4.1: CRISP-DM
QAS is an Address Management Software company that specialises in Quick Address Capture and Batch Address Cleansing software for medium to large enterprises. QAS has over 10,000 customers globally. Within QAS there are five main departments: Sales, Professional Services, Finance, Operations and Technical Services. The Technical Services department is responsible for supporting all technical queries regarding the products, services and data QAS offers.
The Technical Services department deals with all of the technical queries its customers have; these queries are referred to as Service Requests. The most common types of Service Request are: updating expired data (QAS products use data with a nine-month life span; if the data is not updated before it expires, the products stop working), product installation assistance, and product functionality assistance. Every Service Request is logged into the Siebel CRM system by the Technical Services department; the data submitted to the CRM system is stored in a SQL database.
The QAS Technical Services database is very complex, containing many tables and a vast amount of data. To make it easier for the managers of the Technical Services department to find useful information, the Snap-shot database was created. The Snap-shot database is a de-normalised database that contains all of the key fields and information the Technical Services department might need. This presents several benefits to the Technical Services department: SQL queries are easier and quicker to write, quicker to run, and users do not require knowledge of the underlying database structure. These benefits are largely due to no table joins being needed and all the information being in one place.
The Technical Services team regularly uses the database for SQL querying in order to find information about the Service Requests, such as:
- Which problems take the longest to resolve
- What the current 10 longest open Service Requests are
- Which clients have had the largest number of Service Requests in the last year
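A query of this kind can be sketched as follows. This is a minimal example using Python's built-in sqlite3 module as a stand-in for the real MS SQL Server Snap-shot database; the table name (ServiceRequest) and columns (SR_Number, Status, DaysOpen) are illustrative assumptions, not the real schema.

```python
import sqlite3

# In-memory stand-in for the Snap-shot database. The table name
# (ServiceRequest) and columns (SR_Number, Status, DaysOpen) are
# illustrative assumptions, not the real schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ServiceRequest (
    SR_Number INTEGER PRIMARY KEY,
    Status    TEXT,
    DaysOpen  INTEGER)""")
con.executemany("INSERT INTO ServiceRequest VALUES (?, ?, ?)", [
    (1, "Open", 42), (2, "Closed", 3), (3, "Open", 17), (4, "Open", 90)])

# Because the Snap-shot table is de-normalised, no joins are needed:
# the longest-open Service Requests come from a single SELECT.
rows = con.execute("""
    SELECT SR_Number, DaysOpen
    FROM ServiceRequest
    WHERE Status = 'Open'
    ORDER BY DaysOpen DESC
    LIMIT 10""").fetchall()
print(rows)  # → [(4, 90), (1, 42), (3, 17)]
```

Because the table is de-normalised, each of the questions listed above reduces to a single-table SELECT of this shape.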
This information can then be used to support important business decisions such as:
- Which products and processes require extra training
- Which processes and products need to be improved
- Which processes, if automated, would save the most time
The problem this project addresses is that the Technical Support Managers have found many of the decisions based on information gained from the Snap-shot database to be largely ineffective, and overall confidence in the data contained in the Snap-shot database is very poor.
An example of this is the failed Electronic Updates project. The Technical Support Managers queried the Snap-shot database to find the most common Service Requests and investigated whether the process could be improved. They found that 11.5% of the total Service Requests dealt with by the 1st Line Support Team arose because the customer's data had expired. According to the activity time of these Service Requests, 35.5 hours a week were spent dealing with these types of issue. A proposed solution was to heavily promote a product called Electronic Updates (EU), which automatically updates data before it expires. The Technical Support Managers calculated that by automating all data updates the 1st Line team could be reduced from five employees to four. After a year, the Electronic Updates project had migrated 85% of customers to the EU platform. Despite the data originally indicating that 35.5 hours a week would be saved by automating this process, over time it proved to make no difference: the small reduction in data update queries was simply replaced with EU setup and EU configuration problems. Whilst this failure was down to a number of different factors, it brought into question the quality of the data in the Snap-shot database.
In addition, QAS has experienced a high level of growth over the last five years, but that growth has now started to plateau. This has caused the company to look at internal processes and systems in order to improve their effectiveness. Due to the failure of various Technical Support decisions such as the EU project, the data quality of the Technical Services database has been identified as an area that needs improvement. The problem with the Snap-shot database is that there are currently no data quality measures or procedures in place and the quality of the data is simply not known; however, it is thought to contain many rushed entries, missing information and inconsistencies.
The proposed solution is to investigate the issues and produce a data quality improvements report. The report will identify and measure the data quality issues, and provide recommendations that, if implemented, will help to improve and maintain the data quality of the Snap-shot database. The QAS Technical Services Manager has asked specifically for a data quality report/audit; therefore, there is no need to investigate any additional solution to the problem at this time.
The main objectives of this project are:
- To create a data quality improvements report that will:
- Assess the data quality
- Recommend steps to improve the data quality
- Recommend steps to maintain the data quality
- Give the users of the database confidence in the data quality by raising awareness of the data quality issues.
- Develop an understanding of data quality including:
- How to assess data quality
- How to improve data quality
- How to maintain data quality
- The costs of data quality
- Gain and develop project management skills
- Improve SQL Skills
For the initial project plan see Appendix D.
Before starting this project, a Risk Management analysis was performed to identify, analyse and control the risks of the project (Blokdijk 2007). This project has many components that must succeed for the project to be an overall success; effective risk analysis is therefore an essential planning tool. Risks cannot always be eradicated completely, so a risk management plan helps to prepare for them if they do occur as well as working to reduce their probability. Table 1.1 shows each of the identified risks, the probability of the risk occurring, the impact if it does occur, and the resulting ranking. Table 1.2 contains information on how the probability and impact of each risk can be reduced.
The Risk Management Process is split into two main parts: Risk Assessment and Risk Control. To assess the risks, Boehm's (1991) three stages of Risk Assessment were used:
- Risk Identification
- Risk Analysis
- Risk Prioritisation
Once the risks were identified the three stages of Risk Control were followed:
- Risk Management Planning
- Risk Resolution
- Risk Monitoring
In the Risk Assessment step the risks were identified and analysed in terms of the probability of them occurring and their potential impact. The risks have been sorted by probability × impact, with the highest-scoring risk being the most important and highest priority.
| No. | Risk Identification | Probability (0 Low, 1 High) | Impact (0 Low, 1 High) | Probability × Impact |
|-----|---------------------|-----------------------------|------------------------|----------------------|
| 1 | Going over time scale | 0.5 | 1.0 | 0.50 |
| 2 | Project and Research is Corrupt or Lost | 0.6 | 0.8 | 0.48 |
| 3 | Data Protection Act/Privacy/Sensitive Data | 0.5 | 0.9 | 0.45 |
| 4 | Access to the required resources | 0.6 | 0.7 | 0.42 |
| 5 | Research Topics are insufficient | 0.5 | 0.5 | 0.25 |
| 6 | Poor quality/Report not high quality | 0.3 | 0.8 | 0.24 |
| 7 | Not fit for purpose | 0.2 | 0.9 | 0.18 |
Table 1.1: Risk Assessment
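The prioritisation in Table 1.1 is simple arithmetic: each risk's score is its probability multiplied by its impact, and the risks are ranked by that score. A minimal sketch using the values from the table:

```python
# Risk prioritisation from Table 1.1: each risk is scored as
# probability * impact and ranked with the highest score first.
risks = [
    ("Going over time scale",                      0.5, 1.0),
    ("Project and Research is Corrupt or Lost",    0.6, 0.8),
    ("Data Protection Act/Privacy/Sensitive Data", 0.5, 0.9),
    ("Access to the required resources",           0.6, 0.7),
    ("Research Topics are insufficient",           0.5, 0.5),
    ("Poor quality/Report not high quality",       0.3, 0.8),
    ("Not fit for purpose",                        0.2, 0.9),
]
ranked = sorted(risks, key=lambda r: r[1] * r[2], reverse=True)
for name, prob, impact in ranked:
    print(f"{name}: {prob * impact:.2f}")
# Highest priority → "Going over time scale: 0.50"
```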
This chapter covers how the probability of occurrence and impact of each risk will be controlled and/or reduced.
| No. | Risk | Reducing the Probability | Reducing the Impact |
|-----|------|--------------------------|---------------------|
| 1 | Going over time scale | Careful planning; regular project reviews | Contingency time built into the project plan |
| 2 | Project and Research is Corrupt or Lost | Accept it may occur | Back up on a daily basis (multiple sources) |
| 3 | Data Protection Act/Privacy/Sensitive Data | Investigation and research into how to abide by these rules | The impact cannot be reduced |
| 4 | Access to the required resources | Plan resources needed | Look at alternative software, hardware etc. |
| 5 | Research Topics are insufficient | Look at similar projects | Contingency time built into the project plan |
| 6 | Poor quality/Report not high quality | Section and task reviews; looking at similar projects; in-depth research | Contingency time built into the project plan |
| 7 | Not fit for purpose | Involve customers/users in the requirements gathering process | Contingency time built into the project plan |
Table 1.2: Risk Control
Overall, the probability of most of the risks can be reduced; however, in some cases, such as the project and research being corrupted or lost, it is accepted that the risk may occur and the focus is on reducing the impact if it does. If work is saved regularly and backed up on a daily basis this should not be a problem. The main mitigation for the identified risks is careful planning and built-in contingency time, so it is essential these are observed at all stages of the project.
The purpose of this chapter is to explore relevant literature in order to expand upon existing knowledge and create an in-depth understanding of the data quality field. Firstly, this chapter defines what data quality is, drawing focus on how it has a relative and contextual meaning. The chapter then discusses the data quality dimensions and how they can be categorised, and continues by exploring how to measure and improve data quality. Finally, the chapter draws to a close by visiting the impacts of poor data quality.
Before the topic of data quality can be visited, an understanding of what data quality is must first be established. It is worth first considering what data is and how it differs from information and knowledge. A popular study by Müller and Freytag (2003, p.4) argues that "data are symbolic representations of information, i.e., facts or entities from the world, depicted by symbolic values." On the other hand, Tuomi (1999, p.1) states that "the generally accepted view sees data as simple facts that become information as data is combined into meaningful structures which subsequently become knowledge as meaningful information is put into a context and when it can be used to make predictions." Tuomi's (1999) statement indicates that data is a prerequisite of information, and information is a prerequisite of knowledge. This shows that data quality is very important, as without high-quality data the information and knowledge which follow cannot be trusted.
Data quality is not a simple term; 'quality' can have various connotations dependent upon the context. The Oxford Dictionary (2014) definition of quality is "the standard of something as measured against other things of a similar kind". Applied to data, 'data quality' would thus mean the standard of data when measured against other similar data sets; hence data quality is a relative term, and different data sets should be measured with different metrics depending on their purpose. Tayi and Ballou (1998, p.54) agree, arguing that data cannot be of high quality unless it suits the needs of the organisation in which it is used, and defining data quality as "fitness for use". Wang and Strong (1996) expand on this concept by defining data quality as data that is fit for use by data consumers, suggesting that consumer and user perception plays an important role in whether or not data is of high quality.
In contrast, Orr (1998, p.3) provides a definition for data quality from a different perspective. He states that “data quality is the measure of the agreement between the data views presented by an information system and that same data in the real world”. This relates data quality directly back to the information system. Whilst this is a useful view on what data quality is, it is very narrow as it is heavily focused on the accuracy of the data.
In this chapter the dimensions of data quality are explored. As Wang and Strong (1996) suggest, there are many different data quality dimensions beyond traditional dimensions such as accuracy. Strong, Lee and Wang (1997) suggest that the dimensions can be broken down into four DQ categories: 'Intrinsic DQ', 'Accessibility DQ', 'Contextual DQ' and 'Representation DQ'. They continue by suggesting that each of these categories should be investigated in equal proportion when measuring DQ, and that there is much more to data quality than just the intrinsic view. There is great significance in visiting each of these four categories in detail, as each helps to address different data quality pitfalls (Strong, Lee and Wang 1997).
The four categories discussed in Strong, Lee and Wang's (1997) study are the widely adopted approach to categorising data quality dimensions. In contrast, views on the dimensions themselves and the categories they fall within vary much more. Despite this, as Figure 2.1 shows, there are five core dimensions that academics agree or partly agree on, which are:
- Accuracy
- Completeness
- Accessibility
- Timeliness
- Consistency
Figure 2.1, taken from the AIMQ study by Lee, Strong, Kahn and Wang (2002), illustrates how the views of academics differ for each of the four data quality categories.
Figure 2.1: Academics' View of Data Quality (Lee, Strong, Kahn and Wang 2002)
Figure 2.1 lists common data quality dimensions across the academic studies. It is widely agreed that 'Accuracy', 'Correctness' or 'Reliability', as some academics refer to it, falls within the 'Intrinsic' category; that 'Timeliness', 'Importance' or 'Relevancy' falls within the 'Contextual' category; that 'Understandability', 'Interpretability' or 'Consistency' falls within the 'Representational' category; and that 'Accessibility' or 'Convenience of access' falls within the 'Accessibility' category (Wang and Strong 2002; Zmud 1978; Jarke and Vassiliou 1997; DeLone and McLean 1992; Goodhue 1995).
There are many other dimensions to consider; however, they can generally be categorised or merged into the five core dimensions listed. This combined approach using the five core dimensions will be used in this project, as it provides the best aspects of each of the studies reviewed. The approach is supported by a study by Wand and Wang (1996), who explored a variety of studies and papers and looked at the most cited data quality dimensions.
Figure 2.2: Cited Dimensions (Wand and Wang 1996)
Figure 2.2 shows that 'Accuracy', 'Reliability', 'Timeliness' and 'Completeness' are cited the most often within the studies reviewed by Wand and Wang (1996). Although 'Accessibility' is not listed in the table, 'Interpretability', 'Usableness' and 'Quantativeness' can all be categorised under the term 'Accessibility'. In addition, as Vale (2008) discusses in a conference paper on accessibility and clarity, 'Accessibility' tends to be one of the most neglected yet important data quality dimensions.
Choosing the correct data quality dimensions to focus on is a very important part of data quality assessment. The needs and requirements of the consumer and the organisation differ from case to case, with each dimension not always needing to be examined in the same level of detail. Data quality dimensions should be treated as requirements (Bobrowski, Marré and Yankelevich 2002) and be assessed based on the needs of the consumer. In contrast to these views, Pipino, Lee and Wang (2002) found that all the core data quality dimensions should be investigated and measured using both subjective and objective measures in order to provide a good overview of overall data quality. These are two theories that can be used when deciding which dimensions to focus on in the data quality improvements report for QAS. Both have core benefits, and a combination of the two will be used in the artefact in order to produce a well-rounded analysis of the data quality.
When looking at what data quality is and how it can be measured and improved, it is also important to look at why data quality should be improved, and to discuss the impact of poor data quality on organisations. In today's organisations almost every activity involves data. "Data provides the foundation for operational, tactical and strategic decisions" (Tee, Bowen and Doyle 2007, p.335), and the cost of getting information wrong is potentially huge.
The biggest cost of poor data quality to an organisation is financial. The example below, from Bobrowski, Marré and Yankelevich (1998, p.4), on the data quality failings of a hospital illustrates this: "Studying last year information the managers of a hospital discovered that most of the patients suffered from haemorrhoids. The resources of the next year were assigned on this basis. The number of beds, 5 nurses, and other resources needed were determined using this information. However, it came out that "haemorrhoids" was the default choice at the check-in application, and clerks selected it because it was difficult to look for the correct choice. The bad data quality due to a poor interface design had terrible consequences on the hospital finances." This shows that even though something may seem to be a small error individually, collectively the information and knowledge gained from a database can have serious knock-on effects.
A popular study by Redman (1998) on the impacts of poor data quality discusses how it can result in customer dissatisfaction, increased operational costs, and a reduced ability to make and execute strategy. This expands on the findings of 'Data Quality in Context' (Strong, Lee and Wang 1997), as it explores the impacts at each level of an organisation. The study claims that poor data quality can hurt employee morale, lead directly to customer dissatisfaction, and lower employee satisfaction, which in turn can reduce the overall productivity of the organisation. However, in practice it is very hard to measure the true costs of poor data quality: whilst most academics agree that data quality errors are very costly, they are also very hard to quantify (Wang, Storey and Firth 1995). In addition, poor data quality compromises decision making, and "decisions are no better than the data on which they are based on" (Redman 1998, p.81). This statement is widely used by academics when discussing the costs of poor data quality. Bobrowski, Marré and Yankelevich (1998, p.1) also highlight that "when quality is not achieved, information is not used or leads to incorrect decision or even loss".
Improving the overall data quality of a database is much more than just cleansing the data in its current form. Maintenance and future prevention methods are at least equally important. Without prevention methods and an investigation of the deep underlying reasons why data quality issues have occurred, the poor data quality is likely to return over time. There are several levels at which data quality can be improved, including the manual data entry level, the data/database level and the Information System (IS) level.
A paper by Murphy (2009, p.1882) argues that in order to improve and maintain overall data quality the manual entry level is key, stating that data which "is collected in a consistent and timely manner, that is both complete and accurate reduces the amount of manipulation and cleansing required later in the process." Expanding on this, Murphy (2009) believes that improving attitudes to data quality, improving group support for data quality, and encouraging feedback on data quality are of key importance. Lee and Strong (2003) similarly state that there is a strong correlation between 'Why Knowledge' and data quality performance; this means that educating employees on the collection and data input methods will simultaneously help to improve overall data quality.
This research indicates that there needs to be a deeper investigation into why the data quality issues have occurred, and that the report should not only identify issues but also produce recommendations and a structure that can be implemented to help maintain and further improve the data quality. In contrast, Bobrowski, Marré and Yankelevich (1998) suggest that the focus should be on improving data quality at the information system level, and that an IS should be built and updated with data quality in mind. In the data quality improvements report for QAS these approaches will be combined, with each taken into consideration in order to improve and maintain the data quality.
Orr (1998, p.7) argues that "data that is not used cannot be correct for very long", suggesting that data which is not used cannot be maintained, and that data quality can be improved by increasing the use of the data and removing data that is not useful. Whilst Bobrowski, Marré and Yankelevich (1998, p.5) agree with this, they believe that in many cases this theory has been used to "illustrate that problems in the quality of information are not caused by poor design." This shows that there are many different ways of looking at data quality problems, and emphasises the importance of always taking a multi-viewed approach.
Requirements gathering is essential to overall project success (Kendrick 2009). In this chapter the project's requirements and constraints are discussed.
Many projects have set requirements, with clients knowing exactly what objectives and aims they want from a project. In this case, the Technical Services department knows that it wants the data quality of the Snap-shot database to be improved in order to have more trust in the data; however, it has no clear idea of what is wrong with the data quality, or even what the current level of data quality is. This makes the requirements process very interesting, as a large portion of this project involves finding out the data quality requirements (measuring and assessing the data quality) of the Snap-shot database. The data quality level for each of the core dimensions discussed in the literature review is not known; therefore, the requirement for each dimension is for it to be as high as possible rather than a specific measurable value.
- The report should identify the main data quality requirements of the Snap-shot database
- The data quality improvement report must assess the data quality level
- The data quality assessment should be interpretable
- The data in the Snap-shot database should be as:
- Accessible as possible (available or easily and quickly retrievable as possible)
- Accurate as possible (correct and reliable as possible)
- Complete as possible (as little missing data as possible)
- Consistent as possible (presented in the same format as much as possible)
- Timely as possible (up-to-date as possible)
- Unique as possible (one of a kind as possible)
- Useful as possible (needed in the dataset as possible)
- Valid as possible (factually sound as possible)
- The data quality report should provide future recommendations that can be implemented by the QAS support department
The above requirements and constraints were devised from the review of literature (Section 2), questionnaire (Appendix C) and interviews with the clients (Appendix E and F).
The Technical Services department is a very busy team, so it is essential that the constraints of the project are documented and followed throughout.
- The project must have no downtime
- The recommendations should be easy to implement
- The project must have no monetary cost to QAS
- Only existing or free software and hardware must be used for this project
- The project must be completed by 16th May 2014
This chapter contains the selection and justification of the processes, methods, and tools used within this project. Firstly, this chapter visits the database querying aspect of this project, covering the querying tools and processes used to identify trends and analyse the data. Emphasis is then drawn on how the data quality dimensions will be chosen and how to measure the data quality. Finally, the methodology will draw on how the improvements will be applied and evaluated.
It is very important that the structure of the data quality improvements report is correct. To achieve this, the findings from the literature review, the requirements of the client, and other data quality reports were investigated in order to build a comprehensive and effective structure for the report.
The review of literature revealed several important steps in data quality analysis and improvement. It indicated that it is very important to determine the data quality requirements of the database, as these are not fixed. Therefore, before the measurement and improvement of the data quality, there needs to be a stage in which the data quality dimensions are analysed. Once the dimensions have been chosen and analysed, the literature indicates that an in-depth analysis and improvement can take place.
The review of other data quality reports showed that it is very important to understand the data one is using. A popular technique used for this is to create a data dictionary. It is essential that data is understood, as Uhrowczik (1973, p.333) states “a Data Dictionary can provide centralized control over data resources and data management”.
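As a minimal sketch of what a data dictionary entry might contain; the field names, types and rules below are illustrative assumptions, not the real Snap-shot schema:

```python
# A minimal data dictionary sketch: each field is documented with its
# type and the business/integrity rule it must satisfy. The field names,
# types and rules are illustrative assumptions, not the real schema.
data_dictionary = {
    "SR_Number": {"type": "INTEGER", "rule": "unique, not null"},
    "Status":    {"type": "TEXT",    "rule": "one of: Open, Closed, Pending"},
    "DaysOpen":  {"type": "INTEGER", "rule": ">= 0"},
}
for field, entry in data_dictionary.items():
    print(f"{field}: {entry['type']} ({entry['rule']})")
```

Recording the rules alongside the fields in this way gives the central control over data resources that Uhrowczik describes.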
The Technical Services department requires the data quality report to identify and measure the data quality issues, and to provide future recommendations.
By combining all three research methods the following basic structure for the data quality improvements report will be followed:
- Understanding the Database
- Data Quality Dimension Analysis
- Data Quality Measurement
- Data Quality Improvement
- Summary of Findings
- Future Recommendations
Using SQL to query the data is the most effective way of extracting information from the dataset. The Snap-shot database contains 104,222 records; without the correct tools, techniques and software, the trends and problems within the database are difficult to see.
There are not many alternatives to SQL for database querying, although one alternative explored was the Java Persistence Query Language (JPQL). Before starting this project the suitability of JPQL was investigated as an alternative or adjacent technique; however, as the author already had an in-depth understanding of SQL, it was decided that there was no benefit in using another technique. In addition, the users of the Snap-shot database all use SQL as their sole querying tool.
In contrast, there is a wide range of Database Management Software that supports SQL querying. The most popular are MySQL, Access and Microsoft SQL Server. Each would be suitable for the type of SQL querying and database manipulation required for this project; however, because the author has previously used MS SQL Server 2008, the project will use this software. The author is familiar with the interface, processes and tools it offers. In addition, it has a reputation for being a very user-friendly SQL querying tool, with a usable interface and simple querying and importing functions. This will help reduce training time and potentially improve overall productivity.
The literature review was instrumental in gaining knowledge and a wider understanding of what data quality is and how it can be broken down. As found in the literature review, there are a number of core data quality dimensions. The research from Section 2.1 was used to determine the most important and common data quality dimensions; this provided a base for deciding which dimensions should be investigated for the Snap-shot database.
As was established in Section 2.2, data quality revolves around the consumers, and data cannot have quality unless it is fit for purpose (Wang and Strong 1996). There is no defined set of dimensions that should be used for every data quality improvement study, as each data set and project is unique and has unique requirements, problems and solutions (Pipino, Lee and Wang 2002).
The literature review indicated that there are four data quality categories that need to be investigated (Strong, Lee and Wang 1997):
- Intrinsic DQ
- Accessibility DQ
- Contextual DQ
- Representation DQ
Within these categories there are five core data quality dimensions that academics believe to be the most important:
- Accuracy
- Completeness
- Accessibility
- Timeliness
- Consistency
In addition, although they are not as commonly mentioned in academic literature, it was found that uniqueness, validity and usefulness are also very important. The reasoning for this is that the general rule for any table or database (with a unique identifier) is that records should be unique. Usefulness was included because it is a broad dimension that covers a large proportion of the more commonly cited dimensions, such as relevancy and interpretability. Validity was included because it covers aspects of the representation data quality category that consistency and uniqueness do not.
The 8 dimensions listed above cover all four data quality categories effectively:
- Intrinsic DQ: accuracy, completeness
- Accessibility DQ: accessibility, completeness
- Contextual DQ: timeliness, usefulness
- Representation DQ: consistency, uniqueness and validity
The above dimensions will form the basis of the initial investigation. By using a combination of subjective and objective measures, the data quality requirements will be determined. It is likely that not all of these data quality dimensions will need to be looked into in great detail, and it is possible that additional data quality dimensions will be added to the investigation.
The review of literature indicated the importance of using both subjective and objective measures for testing data quality. Therefore, a combined approach will be taken.
The Subjective Measures that will be used are:
- Interviewing staff
- Physically looking at the data
The Objective Measures that will be used are:
- Using formulas to provide a percentage score for each dimension
- Producing a Data Dictionary
By using the above methods, both subjective perceptions of the Support Team involved with the data, and the objective measurements based on the data set in question will be explored (Pipino, Lee and Wang 2002). This in turn ensures a well-rounded approach to capturing the overall data quality of the Snap-shot database.
The objective assessment includes both task-independent and task-dependent assessment methods. "Task-independent metrics reflect states of the data without the contextual knowledge of the application" whilst "task-dependent metrics, which include the organization's business rules, company and government regulations, and constraints provided by the database administrator, are developed in specific application contexts." (Pipino, Lee and Wang 2002, p. 211)
The task-dependent measure that will be used is the application of formulas to provide a percentage score for each dimension. The task-independent measure that will be used is the creation of a data dictionary containing organization rules and integrity constraints.
How the data quality dimensions are measured for this project is a very important factor. If the data quality is measured ineffectively or incorrectly this could have a large impact on which areas are investigated in the most detail.
Where possible, each data quality dimension will be measured using formulas. The formulas will be used to provide an overall percentage (%) score for each dimension. The benefit of this approach is that it allows scores to be compared across dimensions and over time, and allows the results of the data quality improvements to be assessed.
The following data quality formulas will be used:
- Accuracy = (count of accurate objects / (count of accurate objects + count of inaccurate objects)) × 100%
- Completeness = (count of not-null records / (count of null records + count of not-null records)) × 100%
- Consistency = (count of consistent objects / (count of inconsistent objects + count of consistent objects)) × 100%
- Timeliness = (date and time the entry appears in the database - date and time submitted)
- Uniqueness = (count of unique instances / (count of duplicate instances + count of unique instances)) × 100%
- Usefulness = (count of useful fields / (count of useful fields + count of not-useful fields)) × 100%
- Validity = (count of valid objects / (count of invalid objects + count of valid objects)) × 100%
The above formulas are based on the objective assessment method. However, the element of ‘useful fields’ in the usefulness formula is determined using subjective assessment.
The accuracy, consistency and validity elements are all based on the integrity constraints and rules defined in the data dictionary. For example, if a field is defined as only allowing numerical values, every entry that is not numerical will be counted as an inaccuracy. The accessibility dimension cannot be measured using a formula; to measure its data quality only the subjective measures (interviews, questionnaire and physical investigation) will be used.
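As an illustration of how these simple-ratio scores could be computed, the sketch below (Python, used here purely for illustration; the project itself applied the formulas through SQL) assumes the good/bad counts for a field have already been obtained:

```python
def simple_ratio(good: int, bad: int) -> float:
    """Generic simple-ratio data quality score as a percentage:
    good / (good + bad) * 100."""
    total = good + bad
    if total == 0:
        return 100.0  # nothing to judge; treat as fully conformant
    return good / total * 100

# Hypothetical counts: 950 accurate vs 50 inaccurate entries,
# 900 not-null vs 100 null records.
accuracy = simple_ratio(950, 50)        # accuracy dimension -> 95.0
completeness = simple_ratio(900, 100)   # completeness dimension -> 90.0
print(accuracy, completeness)
```

The same function serves accuracy, completeness, consistency, uniqueness, usefulness and validity; only the definition of "good" and "bad" counts changes per dimension.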
Once the data quality has been measured effectively the actual improvement of the data will need to take place. This is in the form of manually updating, predicting, and deleting of records. It is expected that for some records the missing and incorrect values cannot be predicted or removed from the database; therefore, an important part of this process is to provide future improvements and recommendations. As discussed in the literature review, data quality is much more than just improving the data quality of the static database and IS changes and future maintenance tasks are key.
For missing values and known errors there is a strong possibility that data mining will be of benefit; therefore, it is important to cover how this process will be undertaken. For the data mining in this project the popular CRISP-DM methodology will be used.
Figure 4.1 CRISP-DM (SPSS 2000)
Business Understanding
The first step of CRISP-DM is the business understanding. This focuses on the objectives and requirements and what the business problem is. The output of this step will be an explanation of the problem that has been identified and why data mining is a possible solution to this.
Data Understanding
The second phase requires the data collection and understanding the data. The data dictionary that would have been produced prior to the data mining will be used for this process. This phase will convert the business problem into a set of data mining problems. The target label that needs to be predicted will be identified, along with the input variables. This stage will require in-depth understanding of QAS and the Technical Services database.
Data Preparation
Once the data mining problem has been defined the data preparation phase can begin. This stage includes all of the activities needed to construct the dataset that will be used for modelling. The dataset is very large (104222 records), therefore this process is likely to include taking a sample of the data, dealing with missing values and attribute selection. The output of this step is the rationale for the inclusion and exclusion of attributes, a description of the pre-processing actions that were necessary, and a summary of the tasks completed in this stage.
Modelling
In this step the actual modelling will take place; 10-fold cross-validation will be used to ensure each model is evaluated correctly and protected against overfitting. The output of this step will be a description of the models trained and an explanation of how each model will be assessed.
Evaluation
This is where the baseline that each of the models will be measured against is defined. The baseline is normally the RMSE score of the mean-class regression algorithm (this will be confirmed in the artefact). The accuracy, build time and interpretability of each model will be taken into account to recommend the most suitable model. The models will be compared against the baseline to evaluate whether it is beneficial to implement the most suitable predictive model. Two formulas will be used to assess the accuracy:
Mean Absolute Error: MAE = (1/n) Σ |yi - ŷi|
Root Mean Square Error: RMSE = √((1/n) Σ (yi - ŷi)²)
where n is the number of non-missing data points, yi is the actual activityTime and ŷi is the estimated/predicted activityTime (Spider Financial 2014).
The MAE will be used as the primary evaluation method as it provides a more natural measure of average error (Willmott and Matsuura 2005). The output of this step will be an evaluation and comparison of the different models against the baseline, with review of the data mining success.
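The two error measures translate directly into code; the sketch below uses illustrative values, not the project data, and mirrors the formulas above:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: the average of |y_i - yhat_i|."""
    n = len(actual)
    return sum(abs(y - yhat) for y, yhat in zip(actual, predicted)) / n

def rmse(actual, predicted):
    """Root Mean Square Error: sqrt of the mean squared residual."""
    n = len(actual)
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / n)

y = [30, 60, 45]      # hypothetical actual activityTime values (minutes)
yhat = [25, 70, 45]   # hypothetical predicted activityTime values
print(mae(y, yhat))   # (5 + 10 + 0) / 3 = 5.0
print(rmse(y, yhat))
```

RMSE penalises large residuals more heavily than MAE, which is why the two can rank models differently.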
Deployment
The deployment of the chosen method will then occur; this could be the implementation of the chosen model, the implementation of the baseline (defined during the evaluation stage), or no deployment at all (if it is more effective to do nothing). The output of this step will be the deployment of the chosen method or model.
Data Mining Software
The benefit of data mining software packages is that many are open source with several alternatives available; they therefore fit the project requirement of being free to use. The packages considered for this project were Weka and Knime. The author has used both before but found Weka the easier to use. Weka is a very popular machine learning suite that provides an array of regression models, pre-processing capabilities and other useful features. In addition, Weka is powerful enough to handle data with a large number of dimensions and instances, and it has a user-friendly interface with built-in attribute selectors that make removing unwanted attributes simple. The data mining modelling will therefore be completed using Weka.
In some cases records may be too incomplete, inconsistent, invalid and/or inaccurate to retain and will have to be removed; SQL queries will be used to identify and remove these records.
The way data quality improvements will be evaluated is very much linked to how they have been measured and the percentage scores for each dimension.
Once the data quality improvements have been implemented each dimension will be measured and tested again using the same formulas as discussed in Section 4.4.
The scores for each dimension will be then be compared, both at a field level and an overall level.
This chapter discusses the creation of the artefact (for the full artefact see Appendix K).
Before commencing work on the artefact, two weeks were spent on initial research; in this period the following topics were covered:
- What is data quality?
- Common data quality issues
- Impacts of data quality
- How to measure data quality
- Solutions to data quality problems
The initial two-week period of research was sufficient to gain a broad understanding of the data quality field. The review of relevant literature, similar projects and data quality reports provided a good understanding of what was required for the artefact. In addition, it brought up a number of aspects that were not considered in the project plan, such as the importance of understanding the data and the purpose of producing a data dictionary to document the database's characteristics.
The first part of the artefact was to understand the database; this includes what it is used for, who uses it, and the database’s current characteristics. The techniques used to understand the database were: SQL querying, staff interviews, and physical investigation of the database. The result of this stage was a completed data dictionary containing important information on the characteristics of the Snap-shot database.
Before the project was started it was known that the data quality needed to be improved, however the true data quality requirements of the Snap-shot database were not known. The review of literature provided an overview on the key data quality dimensions that need to be tested and also stressed the importance of involving the data users in this process.
The purpose of this phase was to find out the data quality requirements of the project. To do this, as discussed in Section 4.3, both subjective and objective assessments were applied. This included interviewing staff (see Appendix B and Appendix C), a questionnaire (see Appendix D), and physical investigation and tests on the database.
The data quality dimension analysis stage used the assessment methods mentioned in Section 4.3 to determine that the following dimensions required further in-depth analysis:
- Accuracy: “The extent to which data is correct and reliable.”
- Completeness: "The extent to which data is not missing and is of sufficient breadth and depth for the task at hand."
- Consistency: “The extent to which data is presented in the same format.”
- Uniqueness: The extent to which the data is one of a kind
- Usefulness: The extent to which the data has use and is needed in the dataset
- Validity: The extent to which the data is factually sound
It was also found that the following dimensions did not require any further analysis and investigation:
- Accessibility: “The extent to which data is available, or easily and quickly retrievable.”
- Timeliness: “The extent to which the data is sufficiently up-to-date for the task at hand.”
(Pipino, Lee and Wang 2002, p.212)
This was because the questionnaire and physical investigation showed that accessibility and timeliness were not a problem in the Snap-shot database.
Once the data quality dimension requirements for this project were defined, the next step was to measure each of them. This was done using the data quality dimension formulas (Section 4.4) devised from the review of literature, implemented as SQL queries based on the integrity constraints and rules defined in the data dictionary (Appendix K, Section 2.1).
SQL was an essential tool for capturing instances with data quality problems, for example finding Nulls, incorrect formats and anomalies (for SQL queries see Artefact Appendices H-L).
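The kind of query used to capture these problem instances can be sketched as follows, here driven from Python's built-in sqlite3 purely for illustration; the table and column layout is hypothetical, not the actual Snap-shot schema:

```python
import sqlite3

# In-memory stand-in for the Snap-shot database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service_request (id INTEGER, activityTime INTEGER)")
conn.executemany(
    "INSERT INTO service_request VALUES (?, ?)",
    [(1, 30), (2, None), (3, 0), (4, 45)],
)

# Count instances with a missing or zero activityTime (known poor quality).
bad = conn.execute(
    "SELECT COUNT(*) FROM service_request "
    "WHERE activityTime IS NULL OR activityTime = 0"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM service_request").fetchone()[0]
print(bad, total)  # 2 of 4 records flagged
```

The same WHERE-clause pattern, with the predicate swapped for a format check or an anomaly condition, covers the other problem types mentioned above.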
The dataset was tested for each of the different data quality dimensions at both a field (column) level and an overall level. The number of data quality issues in each field was recorded and, using the data quality formulas, the overall percentage for each data quality dimension was derived.
This provided a benchmark to measure the data quality once the improvement tasks had been completed.
Both data mining and SQL played an important role in improving the data quality. When measuring and investigating the data quality of the database, one serious problem was raised. The activityTime field is arguably the most important field in the database: it records the time spent on a Service Request. Users of the Snap-shot database typically use this field to find the average effort a particular issue takes to resolve, and this information can then be used to make important business decisions, such as where extra training is needed or which processes need escalating. The activityTime field had 24952 instances known to be of poor data quality (Null or equal to 0), which meant that 23.94% of the instances had an incorrect activityTime. As this was a large percentage of records there was no benefit in simply removing the 'bad' records; accurately predicting these values became the only option. To do this, data mining was used (for the full data mining analysis see Section 5.8). Besides data mining, SQL was used to update known errors and to remove records and fields that were not useful to the users.
The data quality dimensions were each evaluated individually; the evaluation of the data quality improvement was made very simple as the same process used to measure the data quality was followed.
As has been found in the review of literature, improving the static data quality is not enough and that data quality improvement has to be an on-going operation (Murphy 2009). This chapter covered the recommendations that QAS should implement to both their CRM system and frequency of data quality actions.
The data that needed to be predicted was the activityTime field in the Snap-shot database, for all instances where it was equal to zero or Null. Before the Weka modelling was started, a baseline was identified that every model could be tested against. This was achieved using a model (ZeroR) that computes the MAE obtained when the mean of the target label (activityTime) is predicted for every instance (not including records where activityTime was 0).
The baseline model found that the MAE was 35.56, therefore the models and data mining would only be suitable to implement if the model results were more accurate than that score.
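The behaviour of the ZeroR baseline can be sketched as follows; it simply predicts the mean of the target for every instance, so its MAE is the score any real model must beat (illustrative values, not the project data):

```python
def zeror_baseline_mae(values):
    """MAE obtained by always predicting the mean of the target label,
    as Weka's ZeroR does for a numeric class."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

activity_times = [10, 20, 30, 40]  # hypothetical activityTime values (minutes)
print(zeror_baseline_mae(activity_times))  # mean = 25, MAE = (15+5+5+15)/4 = 10.0
```

A trained model is only worth deploying if its cross-validated MAE falls below this constant-prediction score.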
A random subset of the data was selected, as it had previously been identified that Weka would not handle all 800,000+ records of the Snap-shot database. This was achieved by writing an SQL query that randomly selected a stated percentage of records from the database.
Once the subset was identified, some basic models were tested, recording the regression algorithms which performed better than the baseline model. When training the models, 10-fold cross-validation was used as this helps to protect against overfitting.
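The 10-fold split underlying the cross-validation can be sketched as follows (a generic illustration of the procedure, not Weka's internal implementation):

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds of near-equal size.
    Each fold is held out once for testing while the remaining
    folds train the model; averaging the k scores guards against
    overfitting to any single train/test split."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(100, 10)
print(len(folds), [len(f) for f in folds])  # 10 folds of 10 indices each
```

In practice the indices would be shuffled before splitting so each fold is a random sample.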
There is a wide variety of regression algorithms available in Weka; those used were ZeroR, IBk nearest neighbour, LWL (locally weighted learning), Decision Table, Decision Tree and KStar.
Each of the algorithms was tested using the data subset; not all of them are exclusively regression algorithms, however all of them can handle numerical class values (such as activityTime).
From the initial investigation it was found that all of these models performed better than the ZeroR baseline. A trial-and-error approach was then used with different ensemble methods, including bagging, boosting, random feature subspaces and voting (for the full model comparison see Appendix H; for the data mining models see the CD).
In addition, enhancing methods such as attribute selection and pre-processing techniques were considered. It was found that automatic attribute selection was not suitable for the data set as the prediction class was numerical, so the attributes that were not needed were instead deleted manually.
The two best performing models were the Decision Table and Vote.
| Regression Algorithm Type | Description | MAE | RMSE | Time to Build Model (Seconds) |
| --- | --- | --- | --- | --- |
| Random Subspace (Decision Table) | Decision Table regression model | 29.81 | 55.38 | 2.56 |
| Vote (IBk Nearest Neighbour K=21 and Decision Table) | Average of probabilities from IBk and Decision Table | 29.45 | 57.81 | 0.01 |
| ZeroR (Baseline) | Default variables | 35.56 | 63.99 | 0.01 |
Table 5.1: Data Mining Comparison
The Vote model with IBk nearest neighbour (K=21) and Decision Table works by combining the two models and taking the average of the predictions given by the Decision Table and IBk models. This can help boost accuracy as the error from irregularities is minimised. Whilst the Decision Table is not exclusively a regression model, it can handle numerical target labels and perform as a regression model with ease.
The Random Subspace model with Decision Table works by training multiple models on different attribute subsets and combining their results by averaging the predictions. This is a good approach as the attribute subsets are randomly selected from all of the training data.
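The averaging behaviour of the Vote ensemble can be sketched as follows; the two base predictors here are hypothetical stand-ins for the trained IBk and Decision Table models, not the models themselves:

```python
def vote_predict(models, instance):
    """Average the numeric predictions of several base models, as a
    Vote-style meta-learner does for a numeric target."""
    preds = [m(instance) for m in models]
    return sum(preds) / len(preds)

# Hypothetical stand-ins for the trained IBk and Decision Table models,
# each returning a predicted activityTime in minutes.
ibk = lambda x: 28.0
decision_table = lambda x: 31.0
print(vote_predict([ibk, decision_table], None))  # (28.0 + 31.0) / 2 = 29.5
```

Averaging two imperfect but differently-biased predictors tends to cancel some of each model's individual error, which is the intuition behind the Vote ensemble's improved MAE.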
These models each have their own pros and cons. The Vote model had the best Mean Absolute Error, the second-best Root Mean Squared Error, and was also very quick to build. This would suit QAS well: if the model ever needed to be used again, it would not take the users long to make predictions. However, IBk nearest neighbour is a lazy learner which can be susceptible to outliers and is known to have problems with high-dimensional datasets. On the other hand, the model is very interpretable, as it works by looking at the nearest 21 neighbours and working out the most probable activityTime value.
The Random Subspace model with Decision Table is also interpretable and easily explainable to users. One negative is that it took much longer to build than the Vote model; despite this, the build time is still acceptable and should not cause any problems. This model had the best Root Mean Square Error, indicating that it produces the fewest large errors, and its very good Mean Absolute Error suggests it will produce acceptable predictions.
Despite both models being suitable, I recommend the Vote model, as it takes into account the predictions of two models and is therefore more stable than the Random Subspace model. The main deciding benefit of the Vote model is that it is slightly more accurate, and accuracy is key in this case.
For the chosen Vote model, the error distribution of the results shows that there are 65 instances with large errors (over 100); if these are removed, the MAE is reduced from 29.45 minutes to 21.9 minutes. The results with large errors were manually inspected to see whether there was any reason why they were particularly large, however this process did not reveal any further information and the instances with large errors showed no noticeable trends. For the error distribution table see Appendix I (for the error spreadsheet see the CD).
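The effect of excluding the large-error instances on the MAE can be sketched as follows (illustrative error values, not the project's actual error distribution):

```python
def mae_before_after_trim(abs_errors, threshold=100):
    """Return the MAE over all absolute errors, and the MAE after
    dropping errors above the threshold (as in the error
    distribution analysis)."""
    kept = [e for e in abs_errors if e <= threshold]
    overall = sum(abs_errors) / len(abs_errors)
    trimmed = sum(kept) / len(kept)
    return overall, trimmed

errs = [5, 20, 30, 150]  # hypothetical absolute errors in minutes
overall, trimmed = mae_before_after_trim(errs)
print(overall, trimmed)  # a single large outlier dominates the overall MAE
```

This is diagnostic only: as noted above, actually removing such instances in production would bias the model's reported accuracy.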
The following chapter evaluates the project's effectiveness in fulfilling its purpose, aims and objectives. It is split into three main parts. Firstly, it covers the evaluation of the objectives and whether these have been met. Secondly, it covers the method reflection, which assesses the effectiveness of the methods and techniques used. Finally, it covers the personal reflection, which discusses whether the personal aims and objectives of the author have been achieved.
The main objectives of this project were to create a data quality improvements report that would help to assess, improve and maintain the data quality, with the ultimate goal of improving the users' confidence in the data and making business decisions more informed and accurate.
The overall data quality has been improved, as Table 6.1 shows.
| | Original Database | New Database |
Table 6.1 Overall Data Quality
Each of the dimensions has been improved, with the overall data quality improving by 4.65%. The data quality has only been measured objectively; due to the time constraints of this project, no subjective measures such as staff interviews or a questionnaire could be applied. So whilst I can confidently say the data quality has been improved, it is difficult to prove that the QAS staff have more confidence in the data. Despite this, QAS have decided to implement the changes as part of a company-wide data quality improvement project.
In order to prove that the users have more confidence in the data, a questionnaire, interviews and A/B testing could be used. To provide reliable results, a six-month period of use would be needed before these tests were applied. This would give time both for the users to identify the strengths and weaknesses and for the Technical Services department to implement the recommendations. With this in mind I intend to follow up this process in six months and test how confidence has changed.
Alternatively, I could have asked the Technical Services department to repeat the initial data quality questionnaire immediately after the implementation to see how their perception changed. However, I felt this would not provide a fair representation of reality, and that it would be better to measure only objectively.
The tools and software chosen for this project worked very well. Using SQL was a great success; its simple nature and extensive support materials, combined with my previous experience and knowledge of the language, helped greatly.
Using Microsoft SQL Server 2008 for the database manipulation and SQL querying was also a very suitable choice. The ease of database importing, extensive troubleshooting and in-application support helped ensure any errors were picked up and dealt with quickly, efficiently and effectively.
The investigation into the data quality was a very important task. The first part of this was researching which data quality dimensions to test. Overall, I believe that the eight chosen dimensions (accuracy, accessibility, completeness, consistency, timeliness, usefulness, uniqueness and validity) provided a very good overview of the four data quality categories (accessibility, intrinsic, representational and contextual). I feel it was very important to pre-define the data quality dimensions to investigate, as the number of different dimensions that academics propose is vast and it would have been impossible to test all of them within the project constraints. Despite this, looking back I would have measured a wider range of data quality dimensions. If time had permitted, I would have carried out a more in-depth investigation into accessibility and timeliness, and looked at additional dimensions such as reliability, relevancy and currency. These were not selected originally because I thought each of them would be covered by the eight dimensions selected and would not add much to the overall investigation. In hindsight, whilst I am happy with the dimensions chosen, it would have provided a better overview of data quality to investigate these extra dimensions and test whether they were needed, rather than excluding them up front.
A wide variety of methods were used to measure the data quality. As a result of the findings of the literature review, this included both subjective and objective methods to ensure all angles of data quality were covered. The data quality formulas were the biggest success, as they not only provided a means to test the data quality but also helped with the evaluation and future data quality assessment. The only criticism of the data quality measurement is that the subjective methods could also have been used in the measurement process for each of the data quality dimensions.
The improvement of the data quality was completed using SQL querying and data mining. The data mining process was not as successful as I anticipated. The modelling and data mining tasks were completed to a high standard, although unfortunately the activityTime proved difficult to predict from the data. Before investigating the viability of using data mining algorithms to predict the activityTime, I expected much better results than the 32.45 RMSE score achieved. However, this was completed to the best possible ability given the constraints; it was just unfortunate that the dataset fields were not as strongly correlated as I would have hoped.
Data Mining
The final model chosen has a mean absolute error of 29.45 minutes, which means that if the real-life activity time is 50 minutes, the prediction could on average be as low as 20.55 or as high as 79.45. The error distribution showed that the average error can be reduced to 21.9 minutes if the largest errors are removed; however, the analysis showed nothing noticeably different about these outliers, therefore removing them would reduce the overall effectiveness of the model and is advised against.
In addition, the model outperforms the baseline by 6.11 minutes of MAE, meaning it will on average predict the activityTime 6.11 minutes closer than if the baseline mean were used. Overall, I am pleased with the results from the data mining, although before testing the different models I did expect accuracy results in the 15-20 minute range.
I feel the evaluation of the data quality improvements was done to a good standard; the formulas allowed each dimension to be rated as a percentage and the overall data quality improvement to be worked out. If I had had more time for this project, as previously mentioned, I would have looked at A/B testing and some additional subjective testing methods.
Before the project was started, an initial project plan was created. This plan set out the main activities of the project, including the initial research stage, artefact creation and dissertation write-up. It was expected that the plan would change to a certain extent, as many of the requirements and features of the project would be revealed as the research was carried out. Appendix E shows the Gantt chart of the original project plan.
In addition to this, an Artefact plan (Appendix F) was created. This plan slightly changed as the project went on (Appendix G shows how this plan changed in comparison to the initial baseline).
The difficulties with this project were that the data quality requirements and assessment measures were not realised until after the initial research and investigation into the database. This was expected as the users of the database did not know what the problems were; therefore, a large part of the project was actually to identify and quantify the data quality issues.
Initially the project artefact was due to be completed by the 18/04/2014, whilst the plan was followed as closely as possible from start to finish the project slightly overran and was completed on the 24/04/2014.
The task durations which differed most were the physical investigation, questionnaire, implementation and conclusion stages. The physical investigation stage took two days longer than planned; there were no problems in this stage and it overran simply because the size of the task was underestimated.
The questionnaire stage took 2 days longer than estimated; this was because there were some problems getting the questionnaire results back from the Technical Services staff due to their busy schedule.
The implementation stage took four days longer than scheduled; this was because the size of the data mining aspect of this stage was not factored in. Whilst it was known that data mining might be used, the extent of it was not known. The number of models to test and the time taken to prepare the data were the main factors in this task overrunning.
The initial plan was highly based on estimates. In reality some of the time scales changed, although this did not affect the overall project finish time due to the contingency time added as the result of the risk analysis.
The risk management stage of this project helped to identify, analyse and control the risks of the project. The project did not encounter any major problems, and the few problems that did occur had little or no effect. This indicates that the risk management chapter was completed to a high standard and was a success.
The evaluation of the project objectives proves that the measures taken such as including users in the requirements process, careful planning, contingency time and in-depth research have ensured that the project has been a success, fit for purpose and completed to a high standard.
Overall, the project can be deemed a success. The aims and objectives of the project were to create a data quality improvements report that would help improve the data quality of the Snap-shot database both now and for the future. Whilst the future data quality of the Snap-shot database cannot be tested, I have full confidence that if the recommendations of the report are implemented then the data quality will continually improve. The metrics used to measure the data quality show that the Snap-shot database is now on average 4.65% more accurate, complete, consistent, useful, unique and valid. The result of this is that the users should have more confidence in their data.
In addition, the author achieved the academic aims of this project which were to develop an understanding of data quality, to gain and develop project management skills and improve SQL skills. The review of literature in particular was very rewarding as it helped to broaden the author’s knowledge in the data quality field.
Above everything else this project has taught the author the importance of the consumers having awareness, knowledge and understanding of data quality. This is because so many of the data quality issues can be related back to incorrect data input and user mistakes. What was most surprising is that in general this is mainly down to naivety and the fact that the users simply don’t know the benefits and reasons for data quality.
This report will help provide QAS with a database of higher data quality; above and beyond this, and most importantly, it has created a platform that QAS can use to further improve data quality in the future. Before this project started, the Technical Services team knew that data quality problems existed, however they had no idea what data quality was, how it could be measured or ultimately how to improve it. The data quality improvements report has provided QAS with knowledge, data quality metrics and a structure to repeat the process in the future. In addition, it has provided recommendations above and beyond the data quality improvement of the static database. The intention is for the artefact to be a repeatable process and for the Technical Services department to continue to improve, regulate and implement the data quality aspects discussed in this project.
Word Count: 9988
Blokdijk, G, 2007. Risk Management 100 Success Secrets : Identifying and Project Managing Risk Management Research, Design, Training and Operations in the Enterprise. Brisbane: Emereo. Available from: http://eds.b.ebscohost.com [Accessed 14 April 2014].
Bobrowski, M., Marré, M. and Yankelevich, D, 1998. A Software Engineering View of Data Quality.
Bobrowski, M., Marré, M. and Yankelevich, D, 2002. Measuring Data Quality.
Boehm, B.W., 1991. Software Risk Management: principles and practices. Defense Advanced Research Projects Agency, 8 (1), 32-41.
Delone, W.H. and McLean, E.R., 1992. Information systems success: the quest for the dependent variable. Information Systems Research, 3(1), 60-95.
Goodhue, D.L., 1995. Understanding user evaluations of information systems. Management Science, 41(12), 1827-1844.
Kendrick, Tom, 2009. Identifying and Managing Project Risk : Essential Tools for Failure-proofing Your Project. 2nd ed. New York: AMACON. Available from: http://eds.b.ebscohost.com [Accessed 12 April 2014].
Lee, Y. W. and Strong, D. M., 2003. Knowing-why about data processes and data quality. Journal of Management Information Systems, 20(3), 13-39.
Lee, Y. W., Strong, D. M., Kahn, B. and Wang, R., 2002. AIMQ: a methodology for information quality assessment. Information & Management, 40, 133-146.
Mielke, P. W., 1995. Artificial Skill and Validation in Meteorological Forecasting. AOML. Available from: http://www.aoml.noaa.gov/hrd/Landsea/artificial/ [Accessed 14 April 2014].
Müller, H., and Freytag, J.C., 2003. Problems, Methods, and Challenges in Comprehensive Data Cleansing.
Murphy, G.D., 2009. Improving the quality of manually acquired data: Applying the theory of planned behaviour to data quality. Reliability Engineering and System Safety, 94, 1881-1886.
Orr, K., 1998. Data quality and systems theory. Communications of the ACM, 41 (2). 66–71.
Oxford Dictionaries, 2014. quality. Oxford Dictionaries. Available from: http://www.oxforddictionaries.com/definition/english/quality [Accessed 17 April 2014].
Pipino, L. L., Lee, Y. W. and Wang, R. Y., 2002. Data Quality Assessment. Communications of the ACM, 45 (4), 211-218.
Redman, T., 1998. The Impacts of Poor Data Quality on the Typical Enterprise. Communications of the ACM, 41 (2), 79-82.
Spider Financial, 2014. MAE. Spider Financial. Available from: http://www.spiderfinancial.com/support/documentation/numxl/reference-manual/descriptive-stats/mae [Accessed 13 May 2014].
SPSS, 2000. CRISP-DM 1.0. London: IBM. Available from: ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf [Accessed 26 April 2014].
Strong, D. M., Lee, Y. W. and Wang, R. Y., 1997. Data Quality in Context. Communications of the ACM, 40 (5), 103-110.
Tayi, G. K. and Ballou, D. P., 1998. Examining Data Quality. Communications of the ACM, 41 (2), 54-57.
Tee, S. W., Bowen, P. L., Doyle, P. and Rohde, F. H., 2007. Factors influencing organizations to improve data quality in their information systems. Account and Finance, 47, 335-355.
Tuomi, I., 1999. Data is more than knowledge. Journal of Management Information Systems, 16(3), 107-121.
Uhrowczik, P., 1973. Data dictionary/directories. IBM System Journal, 12.4, 332-350. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.468&rep=rep1&type=pdf [Accessed 22 March 2014].
Vale, S, 2008. Accessibility and clarity: The most neglected dimensions of quality?. Data Quality for International Organizations, 7-8 July 2008 Rome. Rome: UNECE, 1-5.
Wand, Y. and Wang, R., 1996. Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, 39(11), 87-95.
Wang, R. Y. and Strong, D. M., 1996. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12 (4), 5-33.
Wang, R. Y., Storey, V., and Firth, C. P., 1995. A Framework for Analysis of Data Quality Research. IEEE Transaction on Knowledge and Data Engineering, 7 (4), 623-640.
Willmott, C. J. and Matsuura, K., 2005. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30, 79-82.
Zmud, R., 1978. Concepts, theories and techniques: an empirical investigation of the dimensionality of the concept of information. Decision Sciences, 9(2), 187-195.
Project Title: An Investigation into the Effects of Poor Data Quality in the QAS Technical Services Database
I have a Snap-shot database table which is a Siebel extract of the Technical Services customer relationship management system. The database is a direct feed from the Technical Services CRM view and contains all of the information the members of this department can see and edit.
Each Technical Services team member logs each Service Request they complete for a customer under the associated account, including information such as activity time, the contact they dealt with, a description of the Service Request and the product it concerns.
The Snap-shot database has been put together to contain all of the key columns that the Technical Services department will need; this makes SQL querying simpler as no table joins are needed. Generally, the managers of the 1st and 2nd line Support Teams use this database for SQL queries. Common queries include finding the 10 longest running open Service Requests and showing which clients have had the most issues in the last year. In addition, it is used to support important business decisions, such as where extra training is needed, which processes and products should be improved, and which processes, if automated, would save the most time.
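As an illustration, the "longest running open Service Requests" query described above might look like the following sketch. This uses Python's sqlite3 against a toy, in-memory stand-in for the Snap-shot table; the column names (srNumber, status, openDate) echo fields discussed in the interviews, but the real schema may differ.

```python
import sqlite3

# Toy in-memory stand-in for the Snap-shot table; schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (srNumber TEXT, status TEXT, openDate TEXT)")
conn.executemany(
    "INSERT INTO snapshot VALUES (?, ?, ?)",
    [("1-1ABCDE", "Open", "2012-03-01"),
     ("A1-11111111", "Closed", "2013-01-15"),
     ("A111-1111111", "Open", "2010-06-30")])

# "10 longest running open Service Requests": oldest open entries first.
longest_open = conn.execute("""
    SELECT srNumber, openDate
    FROM snapshot
    WHERE status = 'Open'
    ORDER BY openDate
    LIMIT 10""").fetchall()
print(longest_open)
```

The same SELECT, pointed at the real table, is the kind of ad hoc query the support managers run day to day.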
The problem with the database is that the quality of the data is very poor: there are many rushed entries, missing information and inconsistencies. This causes issues when the managers want to use the data to make key business decisions, as the reliability, precision and accuracy of any analysis can be questioned. Before this data can be used for effective data analysis, the data quality will have to be improved.
What are the project’s aims?
The project's aim is to improve the data quality of the customer support Service Request database and make recommendations on how QAS can maintain a good level of data quality in the future.
The purpose of this is to produce a report with recommendations on how to improve the data quality. In addition, I will look at using techniques such as SQL queries and Data Mining to help eliminate or reduce the data quality issues and their effects.
The key aspects I will be investigating are:
- Data types
- Null Count
- Min and Max Lengths of columns
I will also be researching what data quality is and how it can be broken down into different data quality dimensions. I will then look at how to measure and improve the data quality of the Snap-shot database using the findings of the research.
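The three checks listed above (data types aside) can be expressed as a single SQL profiling query. The sketch below uses Python's sqlite3 with a toy table; the territory column and its sample values come from the interview notes, but the table itself is a made-up stand-in, not the real Snap-shot schema.

```python
import sqlite3

# Toy table holding one column to profile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (territory TEXT)")
conn.executemany("INSERT INTO snapshot VALUES (?)",
                 [("NZL",), ("SYD",), (None,), ("M",)])

# Null count plus minimum and maximum value lengths, in one pass.
null_count, min_len, max_len = conn.execute("""
    SELECT COUNT(*) - COUNT(territory),  -- nulls in the column
           MIN(LENGTH(territory)),       -- shortest entry
           MAX(LENGTH(territory))        -- longest entry
    FROM snapshot""").fetchone()
print(null_count, min_len, max_len)
```

Running the equivalent query per column against the real table gives the profiling figures the investigation starts from.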
Using SQL queries, Data Mining, other data quality improvement methods and a data quality recommendation document should help ensure the Technical Services managers can extract useful information from the Snap-shot database with the effects of the poor data quality minimised. I discuss the pros and cons of each technique in the context of its impact on day-to-day operations, and therefore make a recommendation of which techniques the organisation should implement and why.
What is the artefact that you intend to produce?
I intend to create a report on the data quality of the customer support Service Request database, with advice and recommendations on the key issues and problems with it and how it can be improved. I will investigate, test and perform data quality improvements on the database.
In addition, I will use SQL queries, Data Mining tasks and other techniques to improve the data quality. In the data quality document I will discuss their effectiveness and practicality, and whether they should be implemented and why.
How am I going to evaluate my work?
I am going to discuss the data quality recommendations I have made for my artefact, what data quality actually is and how the quality of data can be raised. In addition, I will look into the practicality of the data quality improvements I have recommended, such as the SQL queries I have created and the data mining tasks performed, and critically evaluate their effectiveness, how suitable they are to use and what effects they would have in practice.
Why is my Project Honours Worthy? How does this Project relate to Business Information Technology?
This project will demonstrate many skills I have learnt throughout my Business Information Technology course. It will use aspects from the SQL and relational database units of years 1 and 2 and combine them with the Data Mining and business units of years 2 and 3.
It is business related as it is for QAS and will have an actual day-to-day effect on the Technical Services department. It demonstrates solving business problems by means of information technology.
To find out more information about the Snap-shot database and what it is used for
- Who uses the Snap-shot database?
All members of the Technical Service Departments
- Do any other departments have access to or use the database?
- What is the Snap-shot database used for?
Querying the Service Requests to find out important information about products and Service Requests
- Can I have some examples of queries that are typically used?
Finding the longest-running open Service Requests
Finding the average activity time for each product
To see the Service Requests a support analyst has open
Finding which problems take longest to resolve
- How is the database added to?
The database is added to by the Technical Support employees via the Siebel CRM system
- Is the database ever manually edited or added to?
- Do you base decisions on this data?
- What type of decisions have you made based on this data?
Where extra training is needed
Which analysts are struggling
Which problems should be escalated quicker
Which processes should be automated
- Could you provide me with definitions for each of the fields and what they are used for?
Presented with list of fields and their definitions
To find out more information about the attributes and characteristics of the Snap-shot database and formats, allowed values and data types of each field.
Before the interview, the interviewer looked over the database to get a feel for the type of data in each field and devised a set of questions and talking points from this. Throughout the interview there was access to the Snap-shot database and active discussion of each question.
- The srNumber field is the unique Service Request identifier:
- What are the allowed formats of this field?
The srNumber can be in four different formats. 1-1AAAAA, A1-11111111, A111-1111111 and AA-AAAA1
- The territory field has the following territory options: NZL, H, E, SYD, SGP, P, L, U, MEL and M:
- Are these territories all valid entries?
- Are there any additional territories?
- What are the allowed values for the status field?
Open, Dead or Closed
- Are all entries outside these values errors?
- The detailed status field contains many different formats:
- What is the correct format?
All entries should follow the 0 – 9 categories; the allowed values are:
0 – Open_Analyst Action
1 – Open_Customer Action
2 – Open_Escalations
3 – Open_Waiting Int Response
4 – Open_Waiting Ext Response
5 – Open_Updates
6 – Closed_Send New Release
7 – Closed_Unconfirmed
8 – Closed_Solved
9 – Dead
- Are all values outside these considered incorrect for this field?
- For the date, month and year fields what is the correct format?
Year should be between 2008 and 2013, but years above 2013 are allowed.
Month fields should have a value of 1-12
Date field should have the DD/MM/YYYY format.
- The database is only for the Technical Services department; should the Service Requests of teams outside this department be in this database?
No, the database should only contain Technical Services' Service Requests.
- What are the teams and departments allowed in this database?
- Customer Management
- Support Services – AUS
- Support Services – GBR
- Support Services – USA
- Specialist and Support
- Technical Support
- Technical Support AUS
- Specialist and Support
- Specialist Support
- Product and Data Solutions
All teams and departments outside this are errors.
- What is the purpose of having both the commitBreach and borderCommitBreach in the same database?
- Are both of these required?
They both contain the same data and serve the same purpose; both are not required.
- What are the accepted entries for the verified field?
- not verified
- The contact id field seems to contain the employee's name, not an ID. Is this correct?
No, the data in this field is the employee's name and the field should be called PrimaryOwnerName.
- There are two fields named responseDate:
- Are both of these fields required?
Yes but the names should be changed
- What should they be changed to?
The first responseDate field should be called startDate
The startDate field should be called contactid
- What is the correct format and allowed values for the accountTier field?
- Tier 2
- Tier 3
- Tier 4
- Tier 5
- What is the correct format for the accountStatus field?
Text, no numbers
- What is the correct format for the category field?
0-8 categories, example: 1 – Configuration
- What is the correct format for the component field?
- Are numbers and text allowed in the description field?
- What is the correct format for the renewalProductMapping field?
It should contain the shortened name of the product
- What is the correct format for the renewalProductFinMapping field?
It should contain the name of the product
- What is the correct format for the renewalProductFamily field?
It should contain the name of the family the product is in
- What is the correct format for the renewalProduct field?
It should contain the full name of the product
- What is the correct format for the versionNumber field?
The three accepted formats are V1.1, V1.11 or V1
- The operatingSystemVersion field looks like it has been spelt incorrectly:
- What is the correct spelling?
- What is the correct format for this field?
No set format with text and numbers allowed
- What is the correct format for the hardware field?
No set format with text and numbers allowed
- What is the correct format for the accountManager3 field?
Should just contain the name of the account manager
- Can the product quantity field ever be 0?
No, every renewal product needs to have a quantity of at least 1.
- What are the allowed values for the source field?
Can either be phone or email
- What is the allowed range of values for the activityTime field?
- Could a Service Request ever take 0 minutes to complete?
- Could a Service Request take longer than 3000 minutes to complete?
Rarely but it is possible
- What is the correct format for the cqNumber field?
- This field contains a very large number of Nulls. Why do you think this is? Is it still a requirement of this database?
No, the cqNumber field is an old requirement, hence the incompleteness.
- What is the correct format and allowed values for the type field?
- Service Request
- What is the correct format and allowed values for the priority field?
1. Which answer best represents your level of technical expertise?
I Don’t Know
2. What is your role at QAS?
1st Line Technical Support Analyst
2nd Line Technical Support Analyst
3rd Line Technical Support Analyst
1st Line Technical Support Manager
2nd Line Technical Support Manager
Technical Services Manager
3. How often do you access the Technical Service Database?
Almost Every Day
Less Than once a Month
4. Which of the following best describes the current Data Quality level of the QAS Technical Services Database?
5. Do you believe the Data Quality of the Technical Services Database needs to be improved?
I Don’t Know
6. How often do you question the accuracy of the data in the QAS Technical Services Database?
Every time I use it
I Don’t Know
7. Do you have confidence in the data you use from the QAS Technical Service Database?
I Don’t Know
8. What problems do you encounter due to poor Data Quality of the Technical Services Database?
More difficult to do your job
Less confidence in managerial decisions based on the data
9. Select each box that applies in relation to the current QAS Technical Services Database
The data accurately reflects reality
The data is available when needed
The data is displayed and formatted in a consistent manner
All necessary data is present
The data falls within acceptable ranges
The data is easily accessible, usable and understandable
| Type/Name | Description | MAE | RMSE | Time to Build Model |
| ZeroR (Baseline) | Baseline (MAE) | 35.56 | 63.99 | 0.01 |
| IBK (Neighbour 1 Search) | K-nearest 1 neighbours classifier | 40.04 | 77.27 | 0.01 |
| IBK (Neighbour 13 Search) | K-nearest 13 neighbours classifier | 31.25 | 59.87 | 0.01 |
| IBK (Neighbour 21 Search) | K-nearest 21 neighbours classifier | 31.01 | 60.01 | 0.01 |
| IBK (KStar 20) | K* instance based classifier (similarity test), global blend 20 | | | |
| IBK (KStar 40) | K* instance based classifier (similarity test), global blend 40 | | | |
| LWL (KNN -1) | Locally weighted learning regression modelling using Rep Tree and Linear NN Search | | | |
| LWL (KNN 0) | Locally weighted learning regression modelling using Rep Tree and Linear NN Search | | | |
| Decision Stump | Decision Stump regression algorithm based on MAE | 30.56 | 58.57 | 0.11 |
| M5P | Rule based regression algorithm | 37.48 | 66.56 | 43.04 |
| Rep Tree | Fast decision regression tree learner; uses variance and pruning to reduce errors | 35.56 | 63.99 | 0.18 |
| Decision Table CV1 | Simple decision table majority model (capable of numerical values), cross validation 1 | 29.79 | 59.93 | 0.77 |
| Decision Table CV2 | Simple decision table majority model (capable of numerical values), cross validation 2 | 30.35 | 61.07 | 0.56 |
| Vote (Decision Stump and Decision Table) | Decision Stump and Decision Table (average of probabilities from each model) | 29.53 | 56.19 | 0.6 |
| Vote (Decision Stump, Decision Table and LWL) | Decision Stump, Decision Table and LWL (average of probabilities from each model) | 29.67 | 56.23 | 0.56 |
| Vote (Decision Stump and LWL) | Decision Stump and LWL (average of probabilities from each model) | 30.35 | 58.36 | 0.01 |
| Vote (Decision Table and LWL) | Decision Table and LWL (average of probabilities from each model) | 29.47 | 56.23 | 0.55 |
| Vote (IBK Nearest Neighbour K 21 and Decision Table) | IBK Nearest Neighbour K 21 and Decision Table (average of probabilities from each model) | 29.45 | 57.81 | 0.01 |
| Vote (IBK Nearest Neighbour K 1 and Decision Table) | IBK Nearest Neighbour K 1 and Decision Table (average of probabilities from each model) | 32.24 | 62.49 | 0.0 |
| Vote (IBK Nearest Neighbour K 21 and Decision Stump) | IBK Nearest Neighbour K 21 and Decision Stump (average of probabilities from each model) | 29.46 | 57.79 | 0.01 |
| Stacking (LWL and IBK NNK 21) | Combines LWL and IBK NNK 21 models using IBK meta regression model | 30.26 | 56.44 | 21.12 |
| Stacking (LWL and IBK NNK 21) | Combines LWL and IBK NNK 21 models using LWL as meta regression model | 30.81 | 57.34 | 25.72 |
| Random Sub Space (Decision Stump) | Decision Stump regression model | 31.61 | 59.14 | 0.08 |
| Random Sub Space (Rep Tree) | Rep Tree regression model | 35.25 | 63.66 | 0.95 |
| Random Sub Space (Decision Table) | Decision Table regression model | 29.81 | 55.38 | 2.56 |
| MultiScheme (Decision Stump and LWL) | Uses cross validation to combine the regression models Decision Stump and LWL | 30.41 | 58.48 | 13.92 |
| MultiScheme (IBK NN K 21 and Decision Stump) | Uses cross validation to combine the regression models IBK NN K 21 and Decision Stump | 31.07 | 59.87 | 3.5 |
| Bagging (Decision Stump) | Reduces variance of the Decision Stump regression model | 32.04 | 60.15 | 0.06 |
| Bagging (Decision Table) | Reduces variance of the Decision Table regression model | 35.33 | 63.84 | 7.16 |
| Bagging (IBK NNK 21) | Reduces variance of the IBK NN K 21 regression model | 35.98 | 68.01 | 0.02 |
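For reference, the two error measures in the table are computed as follows: MAE is the mean absolute error and RMSE the root mean squared error of the model's predictions. The sketch below uses invented activity-time predictions, not values from the actual Weka runs.

```python
import math

# Made-up activity-time values (minutes): actual vs model predictions.
actual    = [30.0, 120.0, 45.0, 200.0]
predicted = [40.0, 100.0, 50.0, 150.0]

errors = [p - a for p, a in zip(predicted, actual)]
mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error
print(mae, rmse)
```

Because RMSE squares each error before averaging, it penalises the occasional large miss more heavily than MAE, which is why the two columns rank some models differently.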
On CD and additional document
Appendix L: CD Contents
Weka Data Mining Models – Contains all of the Weka model files
Artefact DATA QUALITY REPORT – Artefact document
Dissertation the Effects of Poor Data Quality within a Technical Services Database – Dissertation Document
ErrorDistributionSpreadSheet – Error distribution file
ReadMe – Contains information on what to do with each file in order to view them