**The new PhysioNet website is available at: https://physionet.org. We welcome your feedback.**

*George B. Moody*

Harvard-MIT Division of Health Sciences and Technology

*A common problem in signal analysis is to sort waveforms into groups by morphology (shape). Principal component analysis can provide compact descriptions of shape that are minimally affected by noise. Using the QRS complex of the ECG as an example, this tutorial presents practical methods for principal component analysis of waveforms, including software that can be used as is or customized as desired. These methods can be applied to other waveforms within the ECG as well as those contained in many other signals.*

For additional discussion of the methods
described here, see
GB Moody, RG Mark. QRS morphology representation and noise estimation using the Karhunen-Loève transform. Please cite the above publication when referencing this material, and also include the standard citation for PhysioNet: Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and Physionet: Components of a New Research Resource for Complex Physiologic Signals. |

## Introduction

The goal of this exercise is to produce a program which, given a WFDB record containing an ECG signal and an annotation file containing QRS times, produces as output a list of QRS feature vectors. Many approaches are possible, but this exercise focuses on principal component analysis. In order to complete the exercise, you will need to:

- install necessary software
- decide what the pattern vectors will be (define the pattern space)
- select a training set of data and determine the distribution of pattern vectors within the pattern space
- derive the covariance matrix of the distribution and its eigenvectors (the principal component basis functions)
- decide how many principal components should be used in each feature vector (define the feature space)
- implement a method for deriving feature vectors, and observe how the feature vectors are distributed throughout the feature space.

This directory contains sources for several programs that can be used to work through the exercise:

`defs.h`- definitions for data structures shared by the programs below
`makepv.c`- reads an annotated WFDB record and writes pattern vectors to the standard output
`makecv.c`- reads a file of pattern vectors and writes the covariance matrix to the standard output
`makeev.c`- reads a covariance matrix from the standard input and writes a sorted list of eigenvectors to the standard output
`makefv.c`- reads a list of pattern vectors and a list of eigenvectors, and writes a list of feature vectors to the standard output

There is also a `Makefile` that can be used to
automate the process of compiling these programs (to use it, you will
need to have some version of the `make` utility available under
all versions of Unix and Linux, and also included in the free
Cygwin
development environment for MS-Windows).

## Preparation

Download and install the WFDB Software Package if you have not already done so.

Make a working directory for this exercise, and save copies of
`defs.h`,
`makepv.c`,
`makecv.c`,
`makeev.c`,
`makefv.c`, and
`Makefile` in it.

## Define the pattern space

The *pattern space* is the space in which your *pattern vectors*
are embedded. Pattern vectors are the initial numerical descriptions of the
items (QRS waveforms, in this exercise) to be analyzed. Depending on the
application, the components of pattern vectors might be raw samples, or they
might be filtered, baseline-corrected, or otherwise preprocessed. In other
applications, a pattern vector might be a collection of any measurements of
the items to be analyzed.

Begin the exercise by choosing the number of elements in your pattern
vectors, and insert this number in `defs.h` (find and change the
definition of `PVDIM`). Don't worry about setting
the number of elements in the feature vectors yet. How do we choose a
value for `PVDIM`? Usually, we want this number to
match the number of samples within a typical waveform of the type we wish
to characterize. We'll be using waveforms from the
MIT-BIH Arrhythmia Database for this
exercise, and these are sampled at 360 samples per second per signal. A
typical QRS complex lasts 80 milliseconds, or about 30 samples at 360 samples
per second, so you might set PVDIM to 30. Of course, you don't need to use
every sample from the QRS complex in your pattern vectors -- you might have a
better idea.

Edit `makepv.c` so that it derives the pattern
vectors you wish to use, and compile it. (Since `makepv` reads
PhysioBank data, it must be compiled and linked with the WFDB library, included
with the WFDB Software Package. If you don't know how to do this, see the
`Makefile` for this tutorial, or see the *WFDB Programmer's Guide*.) Note that
`makepv` prints `tagged' pattern vectors, i.e., the annotation
times and types are copied from input to output. In most practical
applications, annotation types would not be available, so do not write
your pattern-vector derivation code to make use of that information.
You may wish to use it as a debugging aid.

## Get training set pattern vectors

Run `makepv` on your training set (see the `Makefile` for an example of how to do this). Your
training set should include at least several hundred beats, preferably much
more. Your training set might be the first several hundred beats from the
record you wish to analyze; choose one from the
MIT-BIH Arrhythmia Database for your
first experiment. You may use beats from several different records if you wish
to experiment with generating a `universal' set of basis functions. For input
annotation files, use either reference (`atr`) files, or generate
annotation files using `sqrs`
or your own QRS detector if you have one. Programs `rdann` and
`wrann` may be useful in
selecting sets of annotations to use for training. (The WFDB Software Package
includes `sqrs`, `rdann`, and `wrann`.)

## Derive the principal component basis functions

Compile `makecv.c` (once again, use or refer to
the `Makefile` for guidance) and run the
resulting program on the output of `makepv`. (If it's not obvious how
to do so, read this.) You may wish to examine
the structure of the covariance matrix before continuing. If the structure of
the matrix reveals strongly dependent components, you might consider
simplifying your pattern vector in the interest of reducing the computational
burden.

Compile `makeev.c` and run it on the output of
`makecv`. The output of `makeev` contains the calculated
eigenvectors (the principal component basis functions) and the corresponding
eigenvalues, which can be used to assess the relative importance of the
eigenvectors. Save this output in a file for use later on.

## Define the feature space

From the output of `makeev`, plot the eigenvalues vs. the eigenvector
numbers and try to determine a suitable number of feature vector components
from this plot. (If you are lucky, you may see a distinct `knee' in the plot,
separating the essential from the relatively insignificant eigenvectors.) If
your pattern vectors are composed of time-ordered samples, it may also be
helpful to plot the basis functions (i.e., the components of each eigenvector);
the first few should be recognizably similar to typical pattern vectors, and
the last few may look like noise. Edit `defs.h` again to insert the
number of feature vector components you have chosen (find and change the
definition of `FVDIM`).

## Derive the feature vectors

Compile `makefv.c` and run it on your test set.
When you run `makefv`, supply the name of the file containing basis
functions (the output of `makeev`) as a command-line argument to
`makefv` (see the `Makefile` for an example).

There are several useful ways to look at the output of `makefv` (the
feature vectors). A scatter plot using the first two principal components as
the axes is a good start. Make separate plots for normal and ectopic beats and
observe how they cluster. You should also study the size of the residual
errors: are they larger for ectopic beats? for noisy beats? What happens if
the fiducial point is not correctly placed (i.e., if the time given in the
annotation file is incorrect by one or several samples)?