CASE STUDY:
A large Transformational Medical Technologies and Services Organization.
Business Problem:
The client offered a hosted Electronic Medical Record (EMR) solution and wanted to build a portal allowing healthcare and pharmaceutical companies to access the data for medical care quality measures and research needs.
Key Issues:
• How do we collect data from over 100 different physical sites into a central location for warehousing?
• How do we combine EMR data from numerous organizations with different data collection standards?
• What is the best method to properly cleanse the data such that personal identifying information is removed?
* Electronic Medical Records (EMR)
EMR stands for Electronic Medical Record, the digital equivalent of the paper records, or charts, at a clinician's office. EMRs typically contain general information about a patient, such as treatment and medical history, as it is collected by the individual medical practice.
* How do we collect data from different physical sites into a central location for warehousing?
Health care involves a diverse set of public and private data collection systems, including health surveys, administrative enrollment and billing records, and medical records, used by various entities, including hospitals, community health centers (CHCs), physicians, and health plans. Data on race, ethnicity, and language are collected, to some extent, by all of these entities, suggesting that each has the potential to contribute information on patients or enrollees.
Medical data is sensitive by nature and cannot be handled carelessly, so data collection must be carried out with the utmost care.
Developing an online portal will play a key role in gathering data from multiple locations quickly and accurately. A dedicated server then needs to be set up for the portal; it will act as the warehouse where all collected data is stored.
Each healthcare institution and pharmaceutical company will be given a login ID through which it can upload data; the uploaded data will automatically be transferred to and stored on the central warehouse server, which can be accessed from anywhere.
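As a rough sketch of how such a portal upload endpoint might look (the framework choice, the token-per-site login scheme, and the storage paths are all assumptions for illustration, not the client's actual design):

```python
# Minimal sketch of the portal's upload endpoint, assuming Flask and a
# token-per-site login scheme; all names and paths here are illustrative.
from pathlib import Path

from flask import Flask, abort, request
from werkzeug.utils import secure_filename

app = Flask(__name__)

# Hypothetical mapping of login IDs to site names; a real deployment
# would use a proper identity provider and credential store.
SITE_TOKENS = {"token-clinic-001": "clinic_001", "token-pharma-042": "pharma_042"}
WAREHOUSE_DIR = Path("warehouse/incoming")  # assumed storage location

@app.route("/upload", methods=["POST"])
def upload():
    site = SITE_TOKENS.get(request.headers.get("X-Auth-Token", ""))
    if site is None:
        abort(401)  # unknown login ID
    uploaded = request.files.get("file")
    if uploaded is None or uploaded.filename == "":
        abort(400)  # no file attached to the request
    # Store each site's uploads under its own directory in the warehouse.
    target_dir = WAREHOUSE_DIR / site
    target_dir.mkdir(parents=True, exist_ok=True)
    uploaded.save(target_dir / secure_filename(uploaded.filename))
    return {"status": "stored", "site": site}, 201

if __name__ == "__main__":
    app.run()
```

In practice each site would upload over HTTPS with its own credentials, so data from over 100 physical sites lands automatically in one central warehouse.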
* How do we combine EMR data from numerous organizations with different data collection standards?
The unstructured data of an EMR are present in clinical notes, surgical records, discharge records, radiology reports, and pathology reports. Clinical notes are free-text documents written by the doctors, nurses, and staff providing care to a patient, and they offer increased detail beyond what may be inferred from a patient's diagnosis codes. The information contained in clinical notes may concern a patient's medical history (diseases, interventions, and so on), family history of diseases, environmental exposures, and lifestyle data. Applying an automatic way of interpreting these clinical notes and records is therefore of the utmost importance.
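As a toy illustration of pulling structure out of free-text notes (the note text and the regex patterns below are invented for illustration; real systems use full clinical NLP pipelines rather than hand-written patterns):

```python
import re

# Invented example note; real clinical notes are far less regular.
note = "Pt is a 64 y/o male. Hx: type 2 diabetes, hypertension. Current meds: metformin 500mg."

# Simple regex patterns for a few fields; a sketch only, not a
# substitute for a proper clinical NLP pipeline.
age = re.search(r"(\d+)\s*y/o", note)
history = re.search(r"Hx:\s*([^.]+)\.", note)
meds = re.search(r"meds:\s*([^.]+)\.", note)

record = {
    "age": int(age.group(1)) if age else None,
    "history": [h.strip() for h in history.group(1).split(",")] if history else [],
    "medications": meds.group(1).strip() if meds else None,
}
print(record)  # {'age': 64, 'history': ['type 2 diabetes', 'hypertension'], ...}
```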
We propose to model the data preprocessing phase as an extract-transform-load (ETL) process. ETL derives from data warehousing and covers how data are loaded from the source system into the data warehouse. A typical data preprocessing phase is thus composed of three steps: (i) extract available data (extract), (ii) transform and clean data (transform), and (iii) store the output data in an output repository (load). These phases are preceded by the selection of the data source, typically a set of files or a database. A more detailed look at each of the three phases follows:
The extract process is responsible for extracting data from the source system or database and making it accessible for further processing. In health care, this process needs to deal with data privacy, and most extract processes have an anonymization step associated with them. At this point, the researcher decides which data make sense to use.
The transform process applies a set of rules to transform the data from the source to the target; semantics-based approaches have also been proposed for this phase. It can be complex, as different dimensionalities have to be taken into account, and it needs to ensure that all variables are in the same units so that they can later be joined and a cleaning process conducted. The transformation step may also require joining data from several sources, generating aggregations, creating surrogate keys, sorting, deriving newly calculated values, and applying advanced validation rules.
The load process merges all data into a target database (output data).
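A minimal end-to-end sketch of these three phases (the file name, column names, and unit rule are assumptions for illustration, and hashing the patient ID stands in for a fuller de-identification step):

```python
import csv
import hashlib
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV export (assumed layout)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: anonymize identifiers and harmonize units before loading."""
    out = []
    for row in rows:
        weight = float(row["weight"])
        if row["weight_unit"] == "lb":      # assumed unit convention
            weight *= 0.453592              # convert pounds to kilograms
        out.append({
            # One-way hash replaces the direct identifier (a stand-in
            # for a fuller anonymization process).
            "patient_key": hashlib.sha256(row["patient_id"].encode()).hexdigest(),
            "weight_kg": round(weight, 2),
            "diagnosis": row["diagnosis"].strip().lower(),
        })
    return out

def load(rows, db_path="warehouse.db"):
    """Load: merge the transformed rows into the target database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS emr (patient_key TEXT, weight_kg REAL, diagnosis TEXT)")
    con.executemany("INSERT INTO emr VALUES (:patient_key, :weight_kg, :diagnosis)", rows)
    con.commit()
    con.close()

load(transform(extract("site_export.csv")))
```

Because each source organization gets its own transform rules mapped onto one target schema, data collected under different standards can be merged in the load step.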
* Best method to properly cleanse the data
Screening involves systematically looking for suspect features in assessment questionnaires, databases, or analysis datasets. The diagnosis (identifying the nature of the defective data) and treatment (deleting, editing, or leaving the data as it is) phases of data cleaning require an in-depth understanding of all the types and sources of errors possible during the data collection and entry processes. Documenting changes entails leaving an audit trail of errors detected, alterations, additions, and error checking, and allows a return to the original value if required.
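A small sketch of this screen-diagnose-treat-document loop using pandas (the file, column name, and valid range are assumed for illustration):

```python
import pandas as pd

df = pd.read_csv("assessment.csv")  # assumed analysis dataset
audit_log = []                      # audit trail of every change made

# Screening: systematically flag suspect values (an assumed valid range).
suspect = df[(df["age"] < 0) | (df["age"] > 120)]

# Diagnosis and treatment: here the defective values are set to missing;
# the audit trail records each original value so it can be restored.
for idx, row in suspect.iterrows():
    audit_log.append({"row": idx, "column": "age",
                      "old": row["age"], "new": None,
                      "reason": "age outside 0-120"})
    df.loc[idx, "age"] = float("nan")

# Documenting changes: persist the audit trail alongside the clean data.
pd.DataFrame(audit_log).to_csv("change_log.csv", index=False)
df.to_csv("assessment_clean.csv", index=False)
```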
OpenRefine (formerly Google Refine) and LODRefine are powerful tools for working with messy data, cleaning it, or transforming it from one format into another. Videos and tutorials are available for learning the different functionalities offered by this software. The facets function is particularly useful, as it can very quickly give a feel for the range of variation contained within a dataset.
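OpenRefine's facets are a GUI feature, but an analogous quick overview can be had in code; a sketch with pandas (the file and column name are assumed):

```python
import pandas as pd

df = pd.read_csv("emr_export.csv")  # assumed messy export

# A text facet in OpenRefine lists each distinct value with its count;
# value_counts() gives the same feel for the variation in a column.
print(df["diagnosis"].value_counts(dropna=False))
# Near-duplicate spellings (e.g. "Diabetes" vs "diabetes ") show up as
# separate rows here, just as they would as separate facet entries.
```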
If the data are cleaned by more than one person, the final step is to merge all the spreadsheets together so that there is only one database. The comments or change logs made as the cleaning progresses should be compiled into one document, and problem data should be discussed in the documentation file. Update the cleaning procedures, change log, and data documentation file as the cleaning progresses. Provide feedback to enumerators, team leaders, or data entry operators if the data collection and entry process is still ongoing; if the same mistakes are repeatedly made by one team or enumerator, make sure to inform them. Data cleaning is a continuous process: some problems cannot be identified until the analysis has begun, errors are discovered as analysts manipulate the data, and several cleaning stages are generally required as inconsistencies are discovered. In rapid assessments, it is very common for errors to be detected even during the peer review process.
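A sketch of that final merge step, assuming each cleaner produced a spreadsheet and a change log with the same layout (the file naming pattern is illustrative):

```python
import glob

import pandas as pd

# Merge every cleaner's spreadsheet into a single database table.
cleaned = pd.concat(
    (pd.read_csv(path) for path in sorted(glob.glob("cleaned_*.csv"))),
    ignore_index=True,
)
cleaned.to_csv("final_database.csv", index=False)

# Compile the individual change logs into one documentation file.
logs = pd.concat(
    (pd.read_csv(path) for path in sorted(glob.glob("change_log_*.csv"))),
    ignore_index=True,
)
logs.to_csv("combined_change_log.csv", index=False)
```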