The present trend in health information technology is to build sophisticated models and tools for business and clinical intelligence; an area of particular focus is the quality of the data required to build these models. The sources of error, and the consequences of those errors, are discussed below in the healthcare context.
1. SOURCES, CONSEQUENCES AND REMEDIES
An assessment of data quality in healthcare has to
(1) address problems arising from errors or inaccuracies in the data itself and
(2) consider the source or sources of the data and how the purpose and business model of their collection impact the analytic processing and knowledge expected to extracted from them.
A consideration of errors and inaccuracies in the data would include data entry errors, errors arising from transformations in the extract-and-transform process for analytics, missing data, etc. An examination of the source(s) of the data can reveal limitations and concerns as to its appropriateness to the type of analysis being performed (e.g., financial data used to evaluate treatments), variations due to data merged from two different business models, variations due to entity and identity disambiguation (or a lack thereof), and variations due to changing and merging business models. Some common modes of quality corruption that occur in real healthcare data are described below.
1 RELEVANCE AND CONTEXT
Analysts often fall prey to the uncritical acceptance of data without quantifying its accuracy and consistency. Failure to recognize the interplay between the settings in which data are collected and the settings to which they can be unambiguously applied is often a root cause of misleading insights. Completeness and accuracy of key data fields are essential.
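As a first step toward quantifying rather than assuming quality, the completeness of key fields can be audited directly. The sketch below is a minimal, hypothetical example; the field names and sentinel values are illustrative placeholders, not drawn from any specific healthcare dataset.

```python
# Minimal sketch: audit completeness of key fields before analysis.
# Field names and sentinel values below are hypothetical examples.

MISSING_SENTINELS = {"", "N/A", "UNK", "99999"}

def field_completeness(records, key_fields):
    """Return the fraction of non-missing values for each key field."""
    counts = {f: 0 for f in key_fields}
    for rec in records:
        for f in key_fields:
            value = str(rec.get(f, "")).strip()
            if value and value not in MISSING_SENTINELS:
                counts[f] += 1
    n = len(records)
    return {f: counts[f] / n for f in key_fields}

records = [
    {"patient_id": "A1", "dob": "1970-01-01", "zip": "99999"},
    {"patient_id": "A2", "dob": "", "zip": "02139"},
]
print(field_completeness(records, ["patient_id", "dob", "zip"]))
# patient_id is fully populated; dob and zip are each half complete
```

A report like this makes it explicit which fields are trustworthy enough to analyze, instead of discovering gaps after the fact.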
2 ENTRY ERRORS (MANUAL/SOFTWARE)
The process of data creation in academic and commercial settings outside of healthcare is highly automated, such as the recording and ordering of keystrokes for website access, financial transactions, etc. In the healthcare world, however, most of the collection, integration, and organization of data is manual. The critical difference is data input by humans, who can, both intentionally and unintentionally, introduce systematic data errors. Incorrect entry of a name, address, or key ID field such as a social security number or insurance ID can lead to ambiguous records that are attributed to the wrong person, or to multiple records for a single person, with downstream automation then propagating the redundancies. While making entries into electronic health records, physicians often use templates or copy-paste commands to generate text that complies with the guidelines and regulations of insurers such as Medicare. This practice can obscure variations across patient records that could be valuable for clinical discoveries. Data generated by automated software with auto-fill options, speech-to-text converters, and optical-character-recognition devices that digitize health data (all common practices) can produce systematic and random errors that vary from person to person and tool to tool, and that are hard to quantify and avoid.
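One practical consequence of such entry errors is duplicate records for the same person under slightly different names or IDs. The sketch below flags likely duplicates with simple string similarity; the records are invented, and real master-patient-index systems use far more sophisticated matching than this.

```python
# Minimal sketch: flag possible duplicate patient records caused by
# entry errors in name/ID fields. All records below are hypothetical.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """True if two strings are nearly identical after normalization."""
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def flag_possible_duplicates(records):
    """Pair up records whose names match closely but whose IDs differ."""
    flagged = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            if a["ssn"] != b["ssn"] and similar(a["name"], b["name"]):
                flagged.append((a["ssn"], b["ssn"]))
    return flagged

records = [
    {"name": "John A. Smith", "ssn": "123-45-6789"},
    {"name": "Jon A. Smith",  "ssn": "123-45-6798"},  # likely typos
    {"name": "Mary Jones",    "ssn": "987-65-4321"},
]
print(flag_possible_duplicates(records))
```

Flagged pairs would then go to human review; automatically merging them risks the opposite error of conflating two real patients.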
3 DIVERSITY AND EVOLVING STANDARDS
Healthcare data often relies on several referential codebooks: for example, race and gender codes in beneficiary datasets; medical practice taxonomy and specialty codes in provider datasets; diagnosis and procedure codes (CPT/HCPCS, ICD, etc.) in claim datasets; and NDC codes in prescription drug events. While some of these codebooks may be standardized, codebooks can also be specific to a data system. Medical datasets may mix several standards, some 50 or more years old. Some legacy systems may not have kept up with evolving standards, and even modern systems may not be flexible enough to incorporate them.
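When integrating such datasets, system-specific codes must be normalized onto a shared codebook, and unmapped codes should be flagged rather than silently passed through. The sketch below illustrates this; the legacy codes and mapping are hypothetical placeholders, not real ICD or CPT values.

```python
# Minimal sketch: normalize system-specific codes to a shared codebook
# during integration. Codes and mapping are hypothetical placeholders.

LEGACY_TO_STANDARD = {
    "DX-OLD-001": "I10",    # hypothetical legacy hypertension code
    "DX-OLD-002": "E11.9",  # hypothetical legacy type-2 diabetes code
}

def normalize_code(code, mapping=LEGACY_TO_STANDARD):
    """Map a legacy code to the standard codebook.

    Returns (code, mapped_flag); unknown codes keep their original
    value but are marked for review instead of vanishing silently.
    """
    if code in mapping:
        return mapping[code], True
    return code, False

print(normalize_code("DX-OLD-001"))  # ('I10', True)
print(normalize_code("DX-OLD-999"))  # ('DX-OLD-999', False)
```

Keeping the mapped/unmapped flag in the staged data preserves an audit trail back to the original codebook.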
4 DATA STAGING ERRORS
Serious quality errors can occur in the pre-processing and staging of data for analysis. Data staging typically involves data migration, integration, machine-to-machine translation, and database-to-database conversion. Decisions made during the extract, transform, load (ETL) process, such as using metric versus British units, or allowing a key cost field to be left blank (which a legacy system might encode as '88888' or '99999'), can have downstream ramifications throughout the analytics workflow. Furthermore, there can be ambiguity in how physicians describe diagnoses and procedures, and in how billing agents encode those semantic variations on claim forms. When reimbursements are at stake, there can even be an incentive to deliberately distort the data. For instance, a physician may note a treatment for herpes, and the billing agent is likely to code it as the reimbursable herpes-2 rather than herpes-1, which may not be reimbursed.
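The sentinel-value hazard above can be handled explicitly in the ETL step: legacy blank-equivalents should become genuine missing values rather than surviving as very large costs. A minimal sketch, with hypothetical field values:

```python
# Minimal sketch: keep legacy sentinel values ('88888'/'99999') from
# surviving ETL as real money. Values below are hypothetical examples.

COST_SENTINELS = {"88888", "99999"}

def clean_cost(raw):
    """Convert a raw cost string to float, mapping legacy
    blank-equivalents and empty strings to None."""
    raw = raw.strip()
    if raw == "" or raw in COST_SENTINELS:
        return None
    return float(raw)

rows = ["120.50", "99999", "", "88888", "42.00"]
cleaned = [clean_cost(r) for r in rows]
print(cleaned)  # [120.5, None, None, None, 42.0]
```

Left uncleaned, those sentinels would quietly inflate any average-cost statistic computed downstream.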
5 ENTITY RESOLUTION
Healthcare data involves a complex web of entities, such as providers, patients, and payers and their policies. There is a critical need to know and track every entity within the system with a high degree of confidence. Often referred to as the identity disambiguation problem, this is one of the major, if not the toughest, data quality challenges in healthcare. Accurately associating the healthcare episodes of a patient who may visit multiple providers is essential to documenting and retrieving a complete history of health-related events.
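A toy version of linking one patient's episodes across providers can be sketched by grouping on a normalized blocking key. The records and key choice (name plus date of birth) below are hypothetical; production entity-resolution systems add probabilistic matching over many more attributes.

```python
# Minimal sketch: group a patient's episodes across providers when no
# shared identifier exists, by blocking on a normalized (name, dob)
# key. All records below are hypothetical.
from collections import defaultdict

def link_key(episode):
    """Normalized blocking key: lowercased alphanumeric name + DOB."""
    name = "".join(ch for ch in episode["name"].lower() if ch.isalnum())
    return (name, episode["dob"])

def link_episodes(episodes):
    """Group provider visits believed to belong to the same patient."""
    groups = defaultdict(list)
    for ep in episodes:
        groups[link_key(ep)].append(ep["provider"])
    return dict(groups)

episodes = [
    {"name": "Ann Lee", "dob": "1980-03-02", "provider": "Clinic A"},
    {"name": "ann lee", "dob": "1980-03-02", "provider": "Hospital B"},
    {"name": "Bob Kim", "dob": "1975-07-19", "provider": "Clinic A"},
]
linked = link_episodes(episodes)
print(linked[("annlee", "1980-03-02")])  # ['Clinic A', 'Hospital B']
```

A key this crude both over-merges (two patients sharing a name and birthday) and under-merges (a typo in either field), which is exactly why identity disambiguation is described above as so hard.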
The Characteristics of Quality Healthcare Data
Data quality in healthcare must consider a number of characteristics, including accuracy, consistency, and relevancy.