1.4.4 De-identification of patients' data



The process for removing protected health information (PHI) from the MIMIC II database is fully described in our publication [6], which can be freely accessed at http://www.biomedcentral.com/1472-6947/8/32. A labeled subset of the data, together with a public version of the code, can be found on PhysioNet at http://www.physionet.org/physiotools/deid/.

Figure 1.5 illustrates the de-identification process. Briefly, the salient points for the user of our database are:

All dates were shifted $10\pm3.12$ years into the future. The date shift for each patient was independently assigned by sampling from a uniform distribution.
All ICU dates for a given patient were shifted by the same amount to preserve inter-admission time gaps.
The day of the week and season of the year were preserved.
Patients who turned 90 during one of their admissions have been removed from the database. They may be included at a later date.
Patients older than 89 years at the date of first admission have had their dates of birth shifted so that they appear to be 200 years old at the time of their first admission. They will therefore show up as extreme outliers. Their inter-admission timings are still preserved.
Since date shifts were randomly assigned, longitudinal studies that involve changes in patient care practices over time cannot be supported by the fully de-identified data. Support for studies that require the year of admission will be considered on an individual basis by special request.
All HIPAA-defined types of PHI were removed, plus care-giver and hospital-specific identifiers.
The algorithm achieved an overall recall of 0.967 and precision of 0.749 on a "gold standard" test corpus, which outperforms a single human de-identifier and performs at least as well as a consensus of two human de-identifiers [6].
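
The date-shifting rules listed above can be illustrated with a short sketch. This is not the de-identification code described in [6]; it only shows one way a single per-patient offset could keep inter-admission gaps and the day of the week intact while roughly keeping the season. The function names draw_shift and shift_dates, and the 5 to 15 year bounds, are hypothetical choices made for illustration. The sketch assumes the offset is a whole number of weeks chosen to be as close as possible to a whole number of years.

```python
import random
from datetime import date, timedelta


def draw_shift(rng, min_years=5, max_years=15):
    """Draw one offset for a patient (bounds are illustrative assumptions).

    The offset is a whole number of weeks, so the day of the week is kept.
    It is also the whole number of weeks nearest to a whole number of years,
    so each shifted date lands within a few days of its calendar anniversary
    and the season is approximately kept.
    """
    years = rng.randint(min_years, max_years)  # uniform over the integer range
    weeks = round(years * 365.25 / 7)
    return timedelta(weeks=weeks)


def shift_dates(dates, shift):
    """Apply the same offset to every date belonging to one patient."""
    return [d + shift for d in dates]


rng = random.Random(0)  # fixed seed so the example is repeatable
shift = draw_shift(rng)

# Two made-up admission dates for one hypothetical patient.
admissions = [date(2003, 3, 14), date(2003, 6, 2)]
shifted = shift_dates(admissions, shift)

for original, moved in zip(admissions, shifted):
    print(original, original.strftime("%A"), "->", moved, moved.strftime("%A"))
```

Because every date for a patient receives the same offset, the gap between the two admissions is unchanged after shifting, while the true calendar year can no longer be recovered from the shifted dates.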

Examples of a de-identified nursing progress note and discharge summary can be found in figures 1.6 and 1.7, respectively. Note that a few of the de-identified sections of the nursing note are false positives, and a small fraction of the clinical information may have been lost. However, all dates and names (the only PHI in this document) were caught by our algorithm. Note also the high prevalence of abbreviations such as S/O (sign out), D/C'd (discontinued, or discharged), Neo (neosynephrine), NSR (normal sinus rhythm), F/E (fluid and electrolytes), GI (gastrointestinal), HEME (hematology), ID (infectious disease), A (assessment), P (plan), etc. The nursing note also has a low degree of structure: it is broken into only a few categories, namely S/O, F/E, NEURO, GI, HEME, ID, RESP, SKIN, ACCESS, SOCIAL, A, and P. The boldface type has been added to this figure to highlight these categories, but is not present in the notes themselves.


djscott 2010-08-24