This part presents a brief review of the literature on data quality and its importance, the main data quality dimensions and frameworks, and metadata quality.
Data is the central concept of data quality, so it is important to understand what the term means. According to Hicks, “Data is a representation of facts, concepts or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means” (Hicks, 1993). The concept of data can be explained from three points of view: objective, subjective and intersubjective, each of which emphasizes a different possible role of data. A brief explanation of these views follows:
1. Objective view of data
In this view, data is factual and results from recording measurable objects or events. It contains details of individual objects and events, which modern society produces on an enormous scale. This view tends to assume that all data processing will be automated. Several authors define data from the objective point of view:
“Data represent unstructured facts.” (Avison & Fitzgerald, 1995)
“By themselves, data are meaningless; they must be changed into a usable form and placed in a context to have value. Data becomes information when they are transformed to communicate meaning or knowledge, ideas or conclusions.” (Senn [1982: 62] quoted by Introna)
2. Subjective view of data
The subjective view is very different from the objective one. In this view, data does not necessarily represent a particular fact correctly, truly and accurately.
Several authors define data from the subjective point of view:
Maddison defines data as “facts given, from which others may be deduced, inferred. Info. Processing and computer science: signs or symbols, especially for transmission in communication systems and for processing in computer systems, usually but not always representing information, agreed facts or assumed knowledge; and represented using agreed characters, codes, syntax and structure”
(Maddison [1989: 168] quoted by Checkland and Holwell)
Other example definitions:
“Data: Facts, concepts or derivatives in a form that can be communicated and interpreted.” (Galland, 1982)
“Data are formalized representations of information, making it possible to process or communicate that information.” (Dahlbom & Mathiassen, 1995: 26)
3. Intersubjective view of data
In this view, establishing communication is the main purpose, and data can be provided and processed directly by a person or by a computer. Information can also be retrieved from this kind of data.
Data of this kind is stored in formalized, structured records, so it relies on agreed codes, structures, syntax, characters and ways of encoding and decoding. A series of bits is data only if we have a key to decode it; for instance, the structure of text data is based on the syntax and semantics of a language.
Because this kind of data is recordable and has a predesigned structure, it is suitable for further processing and interpretation, and more information can be derived from it, which makes it potentially useful.
“Data: The raw material of organizational life; it consists of disconnected numbers, words, symbols, and syllables relating to the events and processes of the business.” (Martin & Powell 1992)
Much like data, the concept of quality is defined in different ways by different disciplines. W. Edwards Deming was one of the first proponents of quality and is known for his work on the industrial recovery of Japan after World War II. He claimed that the productivity improvements that create a competitive position in an industry result from quality improvements (Deming, 1982). He emphasized that low quality wastes production capacity and effort and causes cost increases and rework. In his view, the most important part of the production line is the customer (Deming, 1982).
He strongly believed that “the cost to replace a defective item on the assembly line is fairly easy to estimate, but the cost of a defective unit that goes out to a customer defies measure” (Deming, 1982).
The Deming Prize was established by the Union of Japanese Scientists and Engineers in 1951 to recognize a certain level of quality achievement in organizations (Mahoney & Thor, 1994).
In 1988, Juran contributed to the study of quality and proposed that quality means fitness for use (Juran, 1988). He emphasized the role of customers in defining and measuring quality. For Juran, customers are all the people affected by our products and processes, and he distinguishes various kinds of customers, such as internal and external customers (Juran, 1988). He argues that three forces push organizations to pay attention to quality: declining sales, the costs of poor quality and threats to society. He states that every organization must do three things: quality planning, quality improvement and quality control (Juran, 1988).
A third author who worked on quality is Crosby. He continued his colleagues' idea about the importance of customers and said “the only absolutely essential management characteristic [of the twenty-first century] will be to acquire the ability to run an organization that deliberately gives its customers exactly what they have been led to expect and does it with pleasant efficiency” (Crosby, 1996).
He states that defining quality is difficult because everyone has their own definition of it and assumes that others share it (Crosby, 1996). Another notable effort was made by the US Congress, which in 1987 established the Malcolm Baldrige National Quality Award. This award evaluates each business on seven major criteria and focuses on customer satisfaction and preventative behavior rather than a passive approach to quality management (Mahoney & Thor, 1994).
Another effort in this area is the development of international standards such as the ISO 9000 series. These standards focus on the capabilities of organizations with regard to quality management.
3. Data quality
Data quality has become a very important subject of study in the information systems area and a vital issue for modern organizations. During the last decades, the number of data warehouses and databases has grown enormously, and we are faced with a huge amount of data. Decisions about the future activities of organizations depend on stored data, so the quality of that data is very important (WHO, 2003).
The term “data quality” or “information quality” is often defined as the “fitness for use” of information (Wang & Strong, 1996).
Good-quality data is complete, permanent, accurate and based on standards. Improving business information in this way helps to reduce costs, improve productivity and accelerate service.
Data has quality when it is economical, effective and supports making and evaluating decisions quickly. Data quality is a multidimensional concept that includes factors such as relevancy, accessibility, accuracy, timeliness, documentation, user expectation and capability.
However, the way we collect data and assemble datasets, and the amount of data we gather, influence characteristics such as cost, accuracy and user satisfaction, so good or bad data quality can affect our results more than a statistical analysis will reveal.
For instance, some authors claim that the terrorist attacks of September 11, 2001 reflected a failure of the US government to collect accurate and relevant data in federal databases, and that the attacks might have been prevented with correct data and correct analysis.
There are two kinds of decisions: those based on particular data and those about the data itself. Both kinds incur costs, so it is useful to know:
- how much it costs to achieve a given level of data quality
- what financial benefit an organization gains after improving data quality
- what the impacts and costs of poor data quality are
Although data quality is a relatively new subject, researchers have developed frameworks that set out data quality criteria, factors and dimensions, and ways to assess and measure some of these factors in organizations. More recently, international standards organizations have also developed frameworks and definitions for this subject.
Over the past decade, research in the data quality field on measuring and improving the quality of information has increased dramatically (2).
Data quality has significant implications. On January 28, 1986, seven astronauts were killed when the space shuttle Challenger exploded seconds after lift-off.
A similar event happened around seventeen years later, when the shuttle Columbia broke apart during re-entry and again seven astronauts were killed. In 1988, a U.S. Navy cruiser shot down an Iranian commercial airplane, killing the 290 people on board.
On 11 September 2001, nineteen hijackers passed airport security and killed around 3000 people using four commercial airplanes.
These horrible events cost many lives and in consequence led to international tensions and war between countries, but they all have something in common. According to the investigating commissions, they share at least two things: firstly, in all of them the results ran against the aims and objectives of the organization, and secondly, in all of them inadequate data and poor data quality were a major factor. (9/11 Commission, 2004; Columbia Accident Investigation Board, 2003)
For instance, the Rogers Commission, which investigated the Challenger accident, found that the decision to launch “was based on incomplete and sometimes misleading information” (Rogers Commission, 1986).
According to Fisher and Kingma's 2001 investigation of the Vincennes accident, “data quality was a major factor in the USS Vincennes decision-making process” (Fisher & Kingma, 2001). The commission on the 11 September attacks found that “relevant information from the National Security Agency and the CIA often failed to make its way to criminal investigators” (Fisher & Kingma, 2001).
The investigation of the Columbia accident found that “the information available about the foam impact during the mission was adequate” [n], yet also noted that “if Program managers had understood the threat that the . . . foam strike posed . . . a rescue would have been conceivable” (Fisher & Kingma, 2001).
We cannot claim that insufficient data quality was the main cause of all of these events, but “it still remains difficult to believe that proper decisions could be made with so many examples of poor data quality” (Fisher & Kingma, 2001).
We can learn much from these accidents, which indicate the importance of data quality, but they are not everyday events. In contrast, there are many typical examples of poor data quality around us. One example is a hospital nurse who misplaced a decimal point without noticing and caused a paediatric patient to overdose (Belkin, 2004); another is an eyewear company that loses at least one million dollars each year because of a fifteen percent lens-grinding rework rate (Wang et al., 1998); a third is a healthcare center that paid over four million dollars to patients who were no longer eligible for benefits (Katz-Haas & Lee, 2005).
Organizations that recognize the significance of data quality can use their data resources to resolve billing discrepancies (Redman, 1995), benefit from accurate and complete data (Campbell et al., 2004), or raise customer satisfaction (McKinney et al., 2002).
Although the estimated cost of business problems and losses caused by poor data quality varies between businesses, it is more than one billion dollars each year, and this cost includes human lives and permanent and continuous changes (Fisher & Kingma, 2001).
3.1. Importance of data quality in healthcare
Two reasons have made data quality a critical subject in healthcare in recent years: firstly, it supports the improvement of patient care standards and procedures, and secondly, it affects government investment in maintaining health services for people.
All parts of health care, from remote aid stations and clinics to hospitals and health centers, health departments and ministries, have to be concerned about poor data quality and its effects on the results and outputs of the sector.
In most countries, administrators are hampered by poor health and medical record documentation, conflicts in health codes, large backlogs of medical records and data waiting to be coded, and poor access to and use of sound data (WHO, 2003).
The significance of data quality in health care can be shown as follows:
- patient care: providing effective, adequate and appropriate care and decreasing health hazards
- up-to-date patients
- efficient administrative and clinical operations, such as effective communication with patients and their families
- strategic planning
- monitoring the health of society
- decreasing health hazards (from the wrong patient or treatment)
- supporting future investments in healthcare (Moghadasi, 2005).
One of the most substantial dimensions of data quality is usability, which makes a product simple and pleasant to use.
Kahn and his colleagues introduced dimensions such as accessibility, believability, ease of manipulation and reputation as factors of usability (Kahn et al., 2003).
The usability of the data and information provided by healthcare systems is a principal issue for practitioners. In addition, it is an essential factor in the acceptance of the technology by end users and patients (Rianne and Boven, 2013). Usability can be defined as the capability of a system to enable users to perform their tasks calmly, pleasantly, efficiently and effectively (Preece et al., 2002), while inadequate usability in the administration and installation of information systems has a direct impact on user satisfaction (Hassan Shah, 2009).
Therefore, usability is now recognized as a main factor of interactive healthcare systems, and its evaluation has become more important in recent years.
A system with high usability brings benefits such as helping and supporting users, reducing user errors, increasing the acceptance rate of the system and increasing efficiency (Maguire, 2001).
A description of usability is provided by Jakob Nielsen, a well-known usability researcher: “a quality attribute that assesses how easy user interfaces are to use” (Nielsen, 1993). He added that “the word ‘usability’ also refers to methods for improving ease-of-use during the design process”, and that usability has five essential components:
- Learnability: how easily and confidently users can perform their tasks the first time they encounter the design.
- Efficiency: how quickly users can perform tasks once they have learned the design.
- Memorability: how easily users can re-establish their skills when they return to the design after a period of not using it.
- Errors: how many errors users make, how severe these errors are and how much effort it takes to recover from them.
- Satisfaction: how satisfied and comfortable users are when using the design.
In 1991, ISO 9126 offered a definition of quality for the software engineering field, classified usability as a quality component and described it as follows: “Usability is the capability of the software product to be understood, learned, used and attractive to the user when used under specified conditions” (ISO, 1991).
Increasing the acceptance rate of a system by its users is an indirect result of system design. During the last decade, there has been a trend toward developing systems with a high level of usability. The main goal of designers is now to develop a usable and useful system (Hartson, 1998). Designers hold that people are the main purpose of a system, so it must be usable for them. Some studies have shown that if the functions of a system work correctly but it does not meet users' expectations, the system will not be used in future (Bevan, 1995).
Users of an information system can be important participants in the design process because they have valuable experience of working with systems. Human-centered interface design methods emphasize the contribution of users as a main part of the design process. In these methods, users help designers to consider users' needs during system design, to verify design choices and to perform usability tests during system development (Sadoghi et al., 2012).
Generally, data quality is a multidimensional concept that reflects different aspects and characteristics of data [y].
There is no general agreement on the factors and dimensions of data quality. Price and Shanks (2005) identified four research approaches to describing data quality: empirical, practitioner, theoretical and literature-based.
- Empirical approaches: these rely on customer feedback to identify quality criteria and factors and to rank them into categories, as in the research of Kahn et al. (2002) or Wang & Strong (1996).
- Practitioner-based approaches: these rely on industrial experience and ad hoc observation. Some researchers claim that this approach lacks rigor. English (1999) used this approach to develop a model.
- Theoretical approaches: these are derived from information economics and communication theory. Some researchers believe that this approach is deficient in relating its concepts to one another.
- Literature-based approaches: these derive data quality criteria from literature review and different kinds of analysis.
A fifth approach could perhaps be defined: a design-oriented approach that reflects the intended use of data and information [z]. Unlike the others, this approach helps system designers to understand the variety of system stakeholders and gives them practical guidance for identifying data deficiencies by mapping the state of an information system onto the state of the real world [aa].
Many data quality frameworks and categorizations of dimensions have been defined by different researchers. This section introduces some of them:
In 1974, Gallagher presented one of the first frameworks of data quality.
He considered usefulness, desirability, meaningfulness and relevance, among others, as factors of data quality in determining the value of information systems (Gallagher, 1974).
A few years later, Halloran and his colleagues (1978) identified accuracy, relevance, completeness, recoverability, access security and timeliness as data quality factors. They also specified metrics for each of these in terms of the overall system. They state that “an organization can keep error statistics relating to data accuracy”.
They described relevance as “how [the system’s] inputs, operations, and outputs fit in with the current needs of the people and the goals it supports” (Halloran, 1978).
Another framework was introduced in 1996 by Zeist and Hendricks. Their model extends an ISO model. They grouped data quality factors into the dimensions of functionality, reliability, efficiency, usability, maintainability and portability.
Although all of the previous frameworks capture some aspects of data quality, Wang and Strong developed in 1996 a framework that underlies much of today's research (Wang & Strong, 1996).
They identify the factors of data quality that data consumers demand. They also state that “although firms are improving data quality with practical approaches and tools, their improvement efforts tend to focus narrowly on accuracy”. Their research began by surveying data quality attributes and gathering 200 items; they then reduced the list through several analyses and finally introduced 15 dimensions in four categories:
- Intrinsic DQ denotes that data has quality in its own right.
- Contextual DQ emphasizes the requirement that data quality must be considered within the context of the current task.
- Representational DQ and accessibility DQ emphasize the significant role of systems.
This framework indicates that high-quality data must be intrinsically sound, contextually suitable for the task, clearly represented and accessible to the data consumer.
In 1997, Strong and colleagues applied this framework in three organizations to identify their data quality issues and suggest solutions. They found that data quality problems can shift between categories. For example, a problem of incompatible data representation can surface as an accessibility issue. They therefore describe “two different approaches to problem resolution: changing the systems or changing the production process” (Strong, Lee & Wang, 1997). They strongly emphasized that solving data quality problems requires a perspective that goes beyond the intrinsic quality dimensions.
Figure 5. Data quality as a multi-dimensional construct.
Note. Adapted from Wang, R. Y., and Strong, D. M. (1996). Beyond accuracy: What quality means to data consumers. Journal of Management Information Systems, 12(4), 5-
Two other researchers, Shanks and Corbitt, proposed in 1999 a semiotic-based framework for information quality. Their factors are: well-defined/formal syntax, comprehensive, unambiguous, meaningful, correct, timely, concise, easily accessed, reputable, understood and awareness of bias (Shanks & Corbitt, 1999).
TABLE IV. IQ DIMENSIONS BASED ON SEMIOTIC AND QUALITY ASPECTS
|Pragmatic||Relevance, completeness||Timeliness, actuality, efficiency||Information process, application|
|Semantic||Precise data definitions, easy to understand and objective data definitions||Interpretability, accuracy (free of error), consistent data values, complete data values, believability, reliability||Comparison with real world and experience|
|Syntax||Consistent and adequate syntax||Syntactical correctness, consistent representation, security, accessibility||Syntactical standards and agreements|
Source: Helfert, M. (2001).
Another considerable advance in this area was made by Kahn and his colleagues in 2002. Their framework is a conceptual model that treats data as a product. They state that it “can also be conceptualized as a service”. A service, unlike a product, “is perishable, for you cannot keep it; it is produced and consumed simultaneously”. They also adopted two categories of quality: “conformance to specifications” and “meeting or exceeding customer expectations” (Kahn et al., 2002).
They then developed a model based on the Wang and Strong model by combining these two categories with the product and service aspects of data quality, and called it the “product and service performance model for information quality (PSP/IQ)” (Kahn et al., 2002).
The PSP/IQ model is a two-by-two grid in which product quality and service quality form the rows, and conformance to specifications and meeting expectations form the columns. The various dimensions of information quality from the Wang and Strong model are mapped onto this two-by-two grid, and each of the quadrants has a short and descriptive name. On the product side, the product-conformance quadrant is referred to as “sound information” and the product-expectations quadrant represents “useful information”. On the service side, the service-conformance quadrant indicates “dependable information”, with “usable information” making up the service-expectations quadrant (Kahn et al., 2002).
|Conforms to Specifications||Meets or Exceeds Expectations|
• Free of error
• Concise representation
• Consistent representation
• Appropriate amount
• Ease of operation
The PSP/IQ model
Note. Adapted from Kahn, B. K., Strong, D. M., and Wang, R. Y. (2002). Information
quality benchmarks: Product and service performance. Communications of the ACM,
Categories of data quality dimensions:
As described in the previous section, Wang and Strong provide a data quality framework comprising fifteen dimensions organized into four categories. Because these categories are important, the following summarizes what Fisher and colleagues say about each of them:
- Intrinsic data quality
Fisher and his colleagues state that there is a strong relationship between the objectivity, accuracy, believability and reputation of data. They note that “The high correlation indicates that the data consumers consider these four dimensions to be intrinsic in nature” (Fisher et al., 2011: 42-45).
Data has intrinsic quality when its quality can be recognized from the data itself.
Wang and Strong (1996) point out that almost all companies focus on accuracy as the most important factor of data quality, although other factors also influence a company's performance. For instance, Fisher states that people live by their beliefs, so the believability of data can matter much more to them. There is a negative correlation between the degree of judgment used in constructing the data and how objective people perceive the data to be (Fisher et al., 2011: 44).
Another key factor is data reputation, which is built up over time by both the data and its sources (Wang & Strong, 1996). A good reputation might stop people from checking the accuracy of the data.
- Contextual data quality
This category covers completeness, relevancy, timeliness, value-added and amount of data (Fisher et al., 2011: 45).
Value-added was first introduced by Wang and Strong and refers to data that can improve a company's operations and give the organization a competitive edge.
Timeliness reflects the age of data. According to Fisher et al. (2011: 45), age can be important for some data; for example, old data in financial decisions leads to wrong decisions. Another factor in this category is the amount of information.
More data and information is not always better for making good decisions, and too much may even lead to wrong decisions.
- Representational data quality
This category is “based on the direct usability of data” (Fisher et al., 2011: 47). It shows the significance of data presentation and contains five factors: ease of understanding, interpretability, conciseness of representation, representational consistency and manipulability.
Representational consistency means presenting data in the same format as, and compatibly with, previous data. Wang and Strong describe the understandability of data as readability and clarity. Data that is well organized, aesthetically pleasing, well formatted and compactly represented is regarded as concisely represented (Wang & Strong, 1996).
Fisher and his colleagues indicate that there is a strong relationship between the difficulty of picking out the necessary part of a long statement and the difficulty of remembering what a short statement stands for once a long statement has been shortened. It is therefore better for data designers and analysts to work with real users to determine the best way of presenting data (Fisher et al., 2011: 47).
- Accessibility data quality
This category includes dimensions related to data security and data accessibility. Security and accessibility are inversely related, but we must still know, when data is accessible, how accessible it is and how secure it is (Fisher et al., 2011: 47-48).
Wang and Strong list the nature of the data and the “inability of competitors to access data due to its restrictiveness” as factors of this dimension (Wang & Strong, 1996).
| ||Intrinsic IQ||Contextual IQ||Representational IQ||Accessibility IQ|
|Wang and Strong||Accuracy, Believability, Reputation, Objectivity||Value-Added, Relevance, Completeness, Timeliness, Appropriate Amount||Understandability, Interpretability, Concise Representation, Consistent Representation||Accessibility, Ease of Operations, Security|
| || ||Quantity, Reliable/Timely||Arrangement, Readable, Reasonable|| |
|Jarke and Vassiliou||Believability, Accuracy, Credibility, Consistency, Completeness||Relevance, Usage, Timeliness, Source currency, Data warehouse currency||Version control, Semantics||availability, Transaction Availability, Privileges|
|DeLone and McLean||Accuracy, Precision, Reliability, Freedom from Bias||Importance, Relevance, Usefulness, Informativeness, Content, Sufficiency, Completeness, Currency, Timeliness||Understandability, Readability, Clarity, Format, Appearance, Conciseness, Uniqueness, Comparability||Usableness, Quantitativeness, Convenience of Access*|
|Goodhue||Accuracy, Reliability||Currency, Level of Detail||Compatibility, Meaning, Presentation, Lack of Confusion||Accessibility, Assistance, Ease of Use (of H/W, S/W), Locatability|
|Ballou and Pazer||Accuracy, Consistency||Completeness, Timeliness|| || |
|Wand and Wang||Correctness, Unambiguous||Completeness||Meaningfulness|| |
This dimension has two elements. The first is age, which shows how old the information is, or how long ago it was recorded. The second is volatility, which indicates how frequently the value of an entity attribute changes (Batini et al., 2009).
We measure this dimension to find out whether the age of the data is good enough for our current tasks (Wang & Strong, 1996).
The delay between a change in the real-world state and the resulting change in the state of an information system is known as timeliness (Batini et al., 2009).
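As a minimal sketch of the measures described above, the age of a record and the delay between a real-world change and its appearance in the system can be computed from timestamps. The function names, the example dates and the 30-day threshold are illustrative assumptions and are not taken from the cited definitions.

```python
from datetime import datetime, timedelta


def age(recorded_at: datetime, now: datetime) -> timedelta:
    """Age: how long ago the value was recorded."""
    return now - recorded_at


def update_delay(changed_in_world: datetime, updated_in_system: datetime) -> timedelta:
    """Delay between a real-world change and the matching system update."""
    return updated_in_system - changed_in_world


def fresh_enough(recorded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the record's age is acceptable; max_age is a task-specific assumption."""
    return age(recorded_at, now) <= max_age


# Hypothetical example values.
now = datetime(2024, 1, 31)
recorded = datetime(2024, 1, 11)
print(age(recorded, now).days)                          # 20
print(fresh_enough(recorded, now, timedelta(days=30)))  # True
```

The acceptable age depends on the task: highly volatile attributes would normally call for a much smaller threshold than stable ones.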
The extent to which data is compatible with and in the same format as prior data (Wang & Strong, 1996).
Data is accurate if it is reliable, correct and confirmed (Wang & Strong, 1996).
Data is accurate if the real-world value and the stored value are the same (Batini et al., 2009).
The capability of an information system to represent every state that occurs in the real world (Batini et al., 2009).
The extent to which data is good enough for the current task and has adequate depth and scope (Wang & Strong, 1996).
The ratio of the information that has actually arrived in the sources and/or the data warehouse (Batini et al., 2009).
The proportion between the number of non-null values in a source and the size of the general relation (Batini et al., 2009).
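The last ratio above can be sketched in a few lines: count the non-null values and divide by the total number of cells. This is only a sketch under stated assumptions: it treats a list of dictionaries as the relation, takes `None` to mean null, and uses hypothetical field names.

```python
def completeness(rows: list[dict], fields: list[str]) -> float:
    """Share of non-null cells among len(rows) * len(fields) cells."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0  # assumption: an empty relation is scored as 0.0
    non_null = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return non_null / total


# Hypothetical patient records; None marks a missing value.
patients = [
    {"id": 1, "name": "A", "birth_date": "1990-01-01"},
    {"id": 2, "name": None, "birth_date": None},
]
# Four of the six cells are non-null.
print(completeness(patients, ["id", "name", "birth_date"]))
```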
The extent to which information is available, or easily and quickly retrievable (Wang & Strong, 1996).
A measure of unwanted duplication existing within or across systems for a specific data set, field or record (McGilvray, 2008).
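One simple way to quantify unwanted duplication, offered here as an illustrative sketch rather than as McGilvray's own measure, is the share of records whose key fields repeat an earlier record. The key fields and the sample records are hypothetical.

```python
def duplication_rate(rows: list[dict], key_fields: list[str]) -> float:
    """Fraction of rows whose key repeats a key seen earlier in the list."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(rows)


# Hypothetical records: the first entry appears three times.
records = [
    {"name": "Ann Lee", "birth_date": "1980-05-02"},
    {"name": "Ann Lee", "birth_date": "1980-05-02"},
    {"name": "Bo Chan", "birth_date": "1975-11-30"},
    {"name": "Ann Lee", "birth_date": "1980-05-02"},
]
print(duplication_rate(records, ["name", "birth_date"]))  # 0.5
```

Exact matching on key fields is the simplest choice; real datasets often need fuzzy matching, which this sketch does not attempt.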
- Consistent Representation
The extent to which data is presented in the same format (Pipino et al., 2002).
The extent to which information is highly regarded in terms of its content or source (Wang & Strong, 1996).
It is the ability of a function to achieve standard levels of risk of harm to process, people, the environment or property (Heravizadeh, 2009).
- Appropriate amount of data
The extent to which the size of the data is suitable for the current task (Pipino et al., 2002).
The extent to which access to information is restricted appropriately to maintain its security (Wang & Strong, 1996).
The extent to which information is regarded as true and sound (Wang & Strong, 1996).
The extent to which data is clear, unambiguous and easily comprehended (Wang & Strong, 1996).
The extent to which information is unbiased, unprejudiced and neutral (Wang & Strong, 1996).
The extent to which information is appropriate and beneficial for the current task (Wang & Strong, 1996).
The capability of a function to enable users, in a specified context of use, to achieve identified goals with accuracy and completeness (Batini et al., 2009).
The extent to which data is expressed in appropriate languages, symbols and units, with clear definitions (Pipino et al., 2002).
- Ease of Manipulation
It shows the extent to which data is easy to manipulate and can be transformed into another format (Pipino et al., 2002).
- Free of error
It shows the extent to which data is correct and reliable (Pipino et al., 2002).
- Ease of Use and maintainability
A measure of the degree to which data can be accessed and used and the degree to which data can be updated, maintained and managed (McGilvray, 2008).
The extent to which information is clear and easily used (Knight & Burn, 2005).
It indicates the extent to which information is correct and reliable (Wang & Strong, 1996).
It is the capability of the function to maintain a specified level of performance when used under specified conditions (Heravizadeh, 2009).
- Amount of data
It shows the extent to which the quantity or size of the accessible data is suitable (Wang & Strong, 1996).
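Several of the definitions above are stated as ratios or measurable quantities, so they can be illustrated with a small calculation. The following is a minimal sketch, not a method taken from the cited frameworks: the sample records, the field names and the function names are assumptions made for illustration. It computes completeness as the share of non-null values in a relation, in the spirit of the Batini et al. (2009) definition, duplication as the share of records that repeat an earlier record, in the spirit of McGilvray (2008), and the age element of timeliness as the number of days since a value was recorded.

```python
from datetime import date

# Hypothetical sample relation; None marks a missing (null) value.
records = [
    {"id": 1, "name": "Alice", "city": "Oslo", "recorded": date(2020, 1, 10)},
    {"id": 2, "name": "Bob", "city": None, "recorded": date(2021, 6, 1)},
    {"id": 3, "name": None, "city": None, "recorded": date(2019, 3, 15)},
    {"id": 4, "name": "Bob", "city": None, "recorded": date(2021, 6, 1)},
]

def completeness(rows, fields):
    """Ratio of non-null values to the total number of cells (rows x fields)."""
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0  # assumption: an empty relation is treated as complete
    filled = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return filled / total

def duplication(rows, key_fields):
    """Share of rows whose key fields repeat those of an earlier row."""
    if not rows:
        return 0.0
    seen = set()
    repeats = 0
    for row in rows:
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats / len(rows)

def age_in_days(row, today):
    """Age element of timeliness: days since the value was recorded."""
    return (today - row["recorded"]).days

print(completeness(records, ["name", "city"]))      # 4 of 8 cells are filled
print(duplication(records, ["name", "city"]))       # 1 of 4 rows is a repeat
print(age_in_days(records[0], date(2020, 1, 20)))   # 10 days
```

Which fields count towards completeness, and which fields identify a duplicate, are choices that depend on the task at hand, which is consistent with the task-oriented wording of many of the definitions above.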
You have identified some of the main frameworks; you now need to discuss them in relation to each other, along with any limitations.
Why is metadata important for data quality?
Metadata has been defined as data about data, or information about information, and can usually be described as records that hold information about particular resources. The structure of these records helps us to manage, discover and retrieve the resources they describe (Al-Khalifa & Davis, 2006), and allows potential users to learn about a resource without examining it in full (Haase, 2004).
Metadata consists of elements that are associated with the related resource and whose role is to describe and define a particular type of resource.
There are different metadata standards, and organizations adopt different ones according to the context-specific needs of their communities (Kraan, 2003).
Based on their needs, programmers adapt the defined standards to specific applications. They introduce restrictions and vocabularies and modify parts of existing standards (Duval et al., 2002). In this way, customized metadata standards can work well with the application (Duval et al., 2006).
Metadata has been used for digital storage across a wide range of literature: in statistical data and reports (Yamada, 2004), marine and geographical data (NDN, 2004) and medical resources (Shon and Musen, 1999).
A particular issue with metadata concerns metadata providers. Working with metadata in digital projects is a complicated and intricate task because many types of customers and stakeholders, with different levels of experience, use it. The way metadata is presented in software and information systems is therefore an important concern for metadata experts. It has been suggested that domain experts should create high-quality metadata for their own domains (Weinheimer, 2000). For instance, educational experts create metadata that focuses on the educational properties of the resources.
It is vital to use metadata to measure and improve data quality, because many data resources do not carry enough information about how their data is used to allow their quality to be determined. Metadata can help in three ways:
- Metadata can support data quality by placing constraints on information, through different kinds of checks such as sanity checks and consistency checks. Because data is presented as a model of reality, enough metadata must be provided to help users understand the limitations and assumptions of the data. In brief, metadata informs the evaluator about the condition, range and intended purpose of the data.
- The results of an evaluation can be recorded in metadata so that users can benefit from them in the future. In particular, “certified data” must carry metadata showing that it has undergone verification and validation (V&V). The metadata must describe the verification and validation that has been carried out.
- Metadata can provide the context needed to control changes to data during processing. For instance, metadata can store the historical values of metrics used during processing. We can therefore use metadata to identify activities that ensure data quality.
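The roles listed above can be sketched in code. The example below is illustrative only and assumes a hypothetical metadata record: the field names `valid_range`, `unit` and `quality_history` are invented here and are not drawn from any metadata standard. The metadata supplies a range constraint for a sanity check, and the outcome of each check is written back into the metadata, so that later users can see which validation was performed and how the results changed over time.

```python
from datetime import date

# Hypothetical metadata describing one attribute of a data set.
metadata = {
    "attribute": "temperature",
    "unit": "celsius",
    "valid_range": (-90.0, 60.0),  # assumed plausible limits for the example
    "quality_history": [],         # past evaluation results are stored here
}

def sanity_check(values, meta):
    """Return the values that are missing or fall outside the declared range."""
    low, high = meta["valid_range"]
    return [v for v in values if v is None or not (low <= v <= high)]

def record_evaluation(values, meta, checked_on):
    """Run the check and register the result in the metadata itself."""
    failures = sanity_check(values, meta)
    result = {
        "checked_on": checked_on.isoformat(),
        "checked": len(values),
        "failed": len(failures),
        "passed": len(failures) == 0,
    }
    meta["quality_history"].append(result)
    return result

readings = [21.5, 19.0, 999.0, None, -5.2]
print(record_evaluation(readings, metadata, date(2024, 1, 1)))
```

Keeping the history as a list rather than a single latest value is a design choice made for this sketch; it reflects the point that metadata can hold historical metric values.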
We presented some well-known data quality models in the previous section. In this part, we focus on metadata quality frameworks. As with data quality, there are dimensions and factors that help us to assess metadata quality.
- Bruce and Hillmann framework
It was the first model to propose dimensions of metadata quality. It has seven parameters: provenance, completeness, accuracy, logical consistency and coherence, conformance to expectations, accessibility and timeliness. The authors describe and explain each dimension but do not present formal definitions or metrics. They state that ‘‘although the criteria provide opportunities to converse about quality, without ways to measure that quality, they remain frustratingly beyond reach’’, and therefore suggest defining a “template of expectation” against which metadata can be compared in order to measure the quality dimensions quantitatively. They note that three dimensions, namely completeness, conformance to expectations and accuracy, have the potential to be assessed (Bruce and Hillmann, 2004).
- Stvilia et al. framework
This framework is a general and efficient measurement model. In contrast to earlier models, which are based on context-specific assessment and depend on variables tied to local needs, this model focuses on the causes of quality change. It contains a typology of information quality problems, related activities and a categorized set of information quality dimensions. The authors identify four main sources of information quality problems: changes to the underlying entity or condition, mapping, changes to the information entity, and context changes. In this framework, an information quality problem occurs when the quality of an information entity does not meet the information quality requirements of an activity on one or more dimensions. Here, an information quality dimension is any aspect that determines the concept of information quality. They regard their model as “a knowledge resource and guide for developing IQ measurement models for many different settings’’ (Stvilia et al., 2007).
- Ochoa and Duval framework
This framework was presented after the Bruce and Hillmann framework and builds on it. It contains parameters that are useful for quality evaluators and provides an automatic measurement method for the Bruce and Hillmann framework. The authors also propose metrics for each parameter, with guidelines and calculation formulas. They state that these parameters are not comprehensive, but can be considered ‘‘a first step for the automatic evaluation of metadata quality’’ (Ochoa and Duval, 2009).
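To illustrate what an automatically computable metadata quality metric can look like, the sketch below calculates a simple completeness score for a metadata record as the fraction of expected fields that are filled in, together with a weighted variant in which some fields count more than others. It is a simplified illustration under stated assumptions and not a reproduction of the formulas in Ochoa and Duval (2009): the field names, the weights and the sample record are hypothetical. The dictionary of expected fields plays the role of a very small “template of expectation” in the sense suggested by Bruce and Hillmann (2004).

```python
# Hypothetical "template of expectation": fields a record should contain,
# with assumed weights expressing how important each field is.
EXPECTED_FIELDS = {"title": 3.0, "creator": 2.0, "date": 2.0, "description": 1.0}

def is_filled(value):
    """A field counts as filled if it is not None and not blank text."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True

def completeness(record, expected=EXPECTED_FIELDS):
    """Fraction of expected fields that are filled in the record."""
    filled = sum(1 for f in expected if is_filled(record.get(f)))
    return filled / len(expected)

def weighted_completeness(record, expected=EXPECTED_FIELDS):
    """Like completeness, but each filled field contributes its weight."""
    total = sum(expected.values())
    filled = sum(w for f, w in expected.items() if is_filled(record.get(f)))
    return filled / total

record = {"title": "Annual rainfall", "creator": "", "date": "2004"}
print(completeness(record))           # 2 of 4 expected fields are filled
print(weighted_completeness(record))  # (3 + 2) / 8
```

The two scores differ for the same record because the weighted variant rewards the presence of fields assumed to be more important; choosing those weights is itself a context-specific decision.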
Figure 1 – Mapping between the Bruce & Hillmann framework and the Gasser & Stvilia framework
(Note: We have divided Bruce & Hillmann measures into two sections for clarity.)
(Shreeves et al., 2005)
The architecture of a computer system is commonly illustrated in layers, so when users cannot obtain meaningful results, the best approach is to start by examining these layers. Two layered representations of a system are shown below (Kochtanek, 2002, pp. 89-92):
The left-hand diagram illustrates that the user communicates with web servers through a user interface. As it makes clear, there are four layers between the user and the data. Even if the user interacts appropriately with the user interface, a problem in the connection between the interface and the web server, the web server and the scripts, the scripts and the database structure, or the database structure and the database will prevent the expected results from being returned. Moreover, even when every layer works well, the results will not be appropriate if the data is not in the correct format.
The other diagram includes a “middleware layer” that sits between the operating system and the applications and network elements. This layer makes systems more secure, reliable and seamless (Dalziel, 2004).
This is a good start on your literature review. There is obviously a lot still to do, but you now need to link data and metadata quality to the presentation of data.