1.4.4 De-identification of patients' data
The process for the removal of protected health information
(PHI) from the MIMIC II
database is fully described in Neamatullah
et al. (6). A labeled subset of the data,
together with a public version of the code, can be found on PhysioNet
at: http://www.physionet.org/physiotools/deid/.
Figure 1.5 illustrates the de-identification process.
Briefly, the salient points for the user of our database are:
- All dates were shifted into the future.
- All ICU dates for a given patient were shifted by the same
amount to preserve inter-admission time gaps.
- The day of the week and season of the year were preserved.
- Patients who turned 90 during one of their admissions have been
removed from the database. They may be included at a later date.
- Patients older than 89 years at the date of first admission have
had their dates of birth shifted so that they appear to be 200
years old at the time of their first admission. They will
therefore show up as extreme outliers. The inter-admission timings
are still preserved.
- Since date shifts were randomly assigned, longitudinal studies
that involve changes in patient care practices over time cannot be
supported by the fully de-identified data. Support for studies that
require the year of admission will be considered on an individual
basis by special request.
- All HIPAA-defined types of PHI were removed, plus care-giver and
hospital-specific identifiers.
- The algorithm achieved an overall recall of 0.967 and precision
of 0.749 on a 'gold standard' test corpus, which outperforms a
single human de-identifier and performs at least as well as a
consensus of two human de-identifiers (6).
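The date-shifting rules listed above can be illustrated with a minimal Python sketch. It is not the code used to build the database: the function names, the 100 to 200 year range and the "whole weeks closest to a whole number of years" rule are all assumptions chosen for this illustration, because they give a single per-patient offset that keeps inter-admission gaps and the day of the week exactly, and the season approximately.

```python
from datetime import date, timedelta
import random


def make_shift(rng, min_years=100, max_years=200):
    """Return one hypothetical per-patient shift.

    Assumption for this sketch: the shift is a whole number of weeks
    (so the day of the week is unchanged) that is as close as possible
    to a whole number of years (so the season is roughly unchanged).
    """
    years = rng.randint(min_years, max_years)
    weeks = round(years * 365.2425 / 7)
    return timedelta(weeks=weeks)


def shift_dates(dates, shift):
    """Apply the same shift to every date belonging to one patient."""
    return [d + shift for d in dates]


rng = random.Random(0)  # fixed seed so the example is repeatable
admissions = [date(2003, 3, 10), date(2003, 7, 21)]  # invented dates
shift = make_shift(rng)
shifted = shift_dates(admissions, shift)

gap_before = admissions[1] - admissions[0]
gap_after = shifted[1] - shifted[0]
same_weekdays = all(
    a.weekday() == b.weekday() for a, b in zip(admissions, shifted)
)
```

Because every date of a patient receives the same offset, `gap_before` and `gap_after` are equal, while the random offset per patient is what prevents studies of practice changes over calendar time.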
Figure 1.5: De-identification process
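When computing ages from the de-identified data, the patients whose dates of birth were shifted to make them appear 200 years old need to be handled separately. The following Python sketch shows one way a user might do this; the function names and the cut-off of 120 years are hypothetical choices for this illustration and are not part of the database.

```python
from datetime import date

# Hypothetical cut-off for this sketch: any age above it is treated as
# the shifted "appears 200 years old" marker rather than a real age.
PLAUSIBLE_MAX_AGE = 120


def age_in_years(dob, on):
    """Whole years elapsed between a date of birth and a later date."""
    years = on.year - dob.year
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1  # birthday not yet reached in that year
    return years


def classify_age(dob, first_admission):
    """Return (age, is_over_89).

    The age is None when the shifted date of birth makes the patient
    appear implausibly old, i.e. the true age was above 89 years.
    """
    age = age_in_years(dob, first_admission)
    if age > PLAUSIBLE_MAX_AGE:
        return None, True
    return age, False


# Invented, already-shifted dates used only to exercise the functions.
younger = classify_age(date(3035, 8, 15), date(3100, 5, 2))
older = classify_age(date(2900, 5, 2), date(3100, 5, 2))
```

Treating these patients as a separate group, rather than using the apparent age of 200 directly, keeps them from distorting summary statistics such as the mean age.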
Examples of a de-identified nursing progress note and discharge
summary can be found in figures 1.6 and
1.7 respectively. Note that a few of the
de-identified sections of the nursing note are false positives, and a
small fraction of the clinical information may have been
lost. However, all dates and names (the only PHI in this document)
were caught by our algorithm. Note also the high prevalence of
abbreviations such as S/O (sign out), D/C'd (discontinued, or
discharged), Neo (neosynephrine), NSR (normal sinus rhythm), F/E
(fluid and electrolytes), GI (gastrointestinal), HEME (hematology), ID
(infectious disease), A (assessment), P (plan), etc. The nursing note
also has a low degree of structure, being broken into only a few
categories: S/O, F/E, NEURO, GI, HEME, ID, RESP, SKIN, ACCESS,
SOCIAL, A, and P. The boldface type has been added to this
figure to highlight these categories; it is not present in the
notes themselves.
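Since the notes carry no markup, a user who wants the categories has to recover them from the text. The Python sketch below shows one possible approach. It assumes that each category heading starts a line and is followed by a colon, and that removed text appears as a single pair of square brackets; both are assumptions for this illustration, and the sample note is invented rather than taken from the database.

```python
import re

# Category headings named in the text above.
CATEGORIES = ["S/O", "F/E", "NEURO", "GI", "HEME", "ID", "RESP",
              "SKIN", "ACCESS", "SOCIAL", "A", "P"]

# Assumption: a heading sits at the start of a line, followed by a colon.
HEADING_RE = re.compile(
    r"^(" + "|".join(re.escape(c) for c in CATEGORIES) + r"):\s*",
    re.MULTILINE,
)


def split_note(text):
    """Map each heading found in the note to the text that follows it."""
    sections = {}
    matches = list(HEADING_RE.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections


def count_redactions(text):
    """Count bracketed spans, which mark removed text in this sketch."""
    return len(re.findall(r"\[[^\]]*\]", text))


# Invented sample note, used only to exercise the functions.
sample = (
    "S/O: Pt seen by [name] this am.\n"
    "NEURO: Alert and oriented.\n"
    "GI: Tolerating diet.\n"
    "A: Stable.\n"
    "P: D/C planned [date].\n"
)
sections = split_note(sample)
n_redacted = count_redactions(sample)
```

Real notes vary in how headings are written, so a pattern like this would need to be checked against a sample of notes before being relied upon.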
Figure 1.6: Example of a de-identified progress note. Sub-headings
have been capitalized and set in boldface type for easier
reading. Removed text is denoted by square brackets. True
positives are colored green; false positives are colored
red.
Figure 1.7: Example of a section of a de-identified discharge
summary. All de-identified elements are denoted by square
brackets. No false positives exist in this
example.
djscott
2011-09-07