%DATASETS Info on the dataset class construction for PRTools % % This is not a command, just an information file. % % Datasets in PRTools are in the MATLAB language defined as objects of the % class PRDATASET. Below, the words 'object' and 'class' are used in the pattern % recognition sense. % % A dataset is a set consisting of M objects, each described by K features. % In PRTools, such a dataset is represented by a M x K matrix: M rows, each % containing an object vector of K elements. Usually, a dataset is labeled. % An example of a definition is: % % DATA = [RAND(3,2) ; RAND(3,2)+0.5]; % LABS = ['A';'A';'A';'B';'B';'B']; % A = PRDATASET(DATA,LABS) % % which defines a [6 x 2] dataset with 2 classes. % % The [6 x 2] data matrix (6 objects given by 2 features) is accompanied by % labels, assigning each of the objects to one of the two classes A and B. % Class labels can be numbers or strings and should always be given as rows % in the label list. A lable may also have the value NaN or may be an empty % string, indicating an ulabeled object. If the label list is not given, % all objects are marked as unlabeled. % % Various other types of information can be stored in a dataset. The most % simple way to get an overview is by typing: % % STRUCT(A) % % which for the above example displays the following: % % DATA: [6x2 double] % LABLIST: [2x1 double] % NLAB: [6x1 double] % LABTYPE: 'crisp' % TARGETS: [] % FEATLAB: [2x1 double] % FEATDOM: {1x2 cell} % PRIOR: [] % COST: [] % OBJSIZE: 6 % FEATSIZE: 2 % IDENT: {6x1 cell} % VERSION: {1x2 cell} % NAME: [] % USER: [] % % These fields have the following meaning: % % DATA : an array containing the objects (the rows) represented by % features (the columns). In the software and help-files, the number % of objects is usually denoted by M and the number of features is % denoted by K. So, DATA has the size of [M,K]. This is also defined % as the size of the entire dataset. % LABLIST : The names of the classes, stored row-wise. These class names % should be integers, strings or cells of strings. Mixtures of % these are not supported. LABLIST has as many rows as there are % classes. This number is usually denoted by C. LABLIST is % constructed from the set of LABELS given in the DATASET command % by determining the unique names while ordering them alphabetically. % NLAB : an [M x 1] vector of integers between 1 and C, defining for each % of the M objects its class. They are indexing LABLIST. % LABTYPE : 'CRISP', 'SOFT' or 'TARGETS' are the three possible label types. % In case of 'CRISP' labels, a unique class, defined by NLAB, is % assigned to each object, pointing to the class names given in % LABLIST. % For 'SOFT' labels, each object has a corresponding vector of C % numbers between 0 and 1 indicating its membership (or confidence % or posterior probability) of each of the C classes. These numbers % are stored in the array TARGETS of the size M x C. They don't % necessarily sum to one for individual row vectors. % Labels of type 'TARGETS' are in fact no labels, but merely target % vectors of length C. The values are again stored in TARGETS and % are not restricted in value. % TARGETS : [M,C] array storing the values of the soft labels or targets. % FEATLAB : A label list (like LABLIST) of K rows storing the names of the % features. % FEATDOM : A cell array describing for each feature its domain. % PRIOR : Vector of length C storing the class prior probabilities. They % should sum to one. If PRIOR is empty ([]) it is assumed that the % class prior probabilities correspond to the class frequencies. % COST : Classification cost matrix. COST(I,J) are the costs % of classifying an object from class I as class J. Column C+1 % generates an alternative reject class and may be omitted, % yielding a size of [C,C]. An empty cost matrix, COST = [] % (default) is interpreted as COST = ONES(C) - EYE(C) (identical % costs of misclassification). % OBJSIZE : The number of objects, M. In case the objects are related to a % n-dimensional structure, OBJSIZE is a vector of length n, storing % the size of this structure. For instance, if the objects are pixels % in a [20 x 16] image, then OBJSIZE = [20,16] and M = 320. % FEATSIZE : The number of features, K. In case the features are related to % an n-dimensional structure, FEATSIZE is a vector of length n, % storing the size of this structure. For instance, if the features % are pixels in a [20 x 16] image, then FEATSIZE = [20,16] and % K = 320. % IDENT : A cell array of M elements storing indicators of the M objects. % They are initialized by integers 1:M. % VERSION : Some information related to the version of PRTools used for % defining the dataset. % NAME : A character string naming the dataset, possibly used to annotate % related graphics. % USER : Free field for the user, not used by PRTools. % % % The fields can be set by commands like SETDATA, SETFEATLAB, SETLABELS, % see below for a complete list. % Note that there is no field LABELS in the DATASET definition. Labels are % converted to NLAB and LABLIST. The command SETLABELS however exists and % takes care of the conversion. % % The data and information stored in a dataset can be retrieved as follows: % % - By DOUBLE(A) and by +A, the content of A.DATA is returned. % - [N,LABLIST] = CLASSSIZES(A); % It returns the numbers of objects per class and the class names stored % in LABLIST. % - By DISPLAY(A), it writes the size of the dataset, the number of classes % and the label type on the terminal screen. % - By SIZE(A), it returns the size of A.DATA: numbers of objects and features. % - By SCATTERD(A), it makes a scatter plot of a dataset. % - By SHOW(A), it may be used to display images that are stored as features % or as objects in a dataset. % - By commands like: GETDATA, GETFEATLAB, etcetera, see below. With some % exceptions they point to a single dataset field. E.g. GETSIZE(A) returns % [M,K,C]. A aet of commands does not return data, but instead they return % indices to objects that have specific identifiers, labels or class indices: % FINDIDENT, FINDLABELS, FINDNLAB. % % Many standard MATLAB operations and a number of general MATLAB commands have % been overloaded for variables of the DATASET type. % % SEE ALSO (PRTools Guide) % PRDATASET, DATA2IM, OBJ2FEAT, FEAT2OBJ, IM2FEAT, IM2OBJ, DATAIM % SETDATA, SETFEATLAB, SETFEATDOM, SETFEATSIZE, SETIDENT, SETLABELS, % SETLABLIST, SETLABTYPE, SETNAME, SETNLAB, SETOBJSIZE, SETPRIOR, SETCOST, % SETTARGETS, SETUSER, SETLABLISTNAMES, SETVERSION % GETDATA, GETFEATLAB, GETFEATDOM, GETFEATSIZE, GETIDENT, GETLABELS, % GETLABLIST, GETLABTYPE, GETNAME, GETNLAB, GETOBJSIZE, GETPRIOR, GETCOST, % GETSIZE, GETTARGETS, GETUSER, GETVERSION, GETCLASSI, GETLABLISTNAMES, % FINDIDENT, FINDLABELS, FINDNLAB % Copyright: R.P.W. Duin, r.p.w.duin@37steps.com % Faculty EWI, Delft University of Technology % P.O. Box 5031, 2600 GA Delft, The Netherlands