function result=rsimpls(x,y,varargin) %RSIMPLS is a 'Robust method for Partial Least Squares Regression based on the % SIMPLS algorithm'. It can be applied to both low and high-dimensional predictor variables x % and to one or multiple response variables y. It is resistant to outliers in the data. % The RSIMPLS algorithm is built on two main stages. First, a matrix of scores is derived % based on a robust covariance criterion (see robpca.m), % and secondly a robust regression is performed based on the results from ROBPCA. % % The RSIMPLS method is described in: % Hubert, M., and Vanden Branden, K. (2003), % "Robust Methods for Partial Least Squares Regression", % Journal of Chemometrics, 17, 537-549. % % To select the number of components in the regression model, a robust RMSECV (root mean squared % error of cross validation) curve is drawn, based on a fast algorithm for % cross-validation. This approach is described in: % % Engelen, S., Hubert, M. (2005), % "Fast model selection for robust calibration methods", % Analytica Chimica Acta, 544, 219-228. % % Required input arguments: % x : Data matrix of the explanatory variables % (n observations in rows, p variables in columns) % y : Data matrix of the response variables % (n observations in rows, q variables in columns) % % Optional input arguments: % k : Number of components to be used. % (default = min(rank([x,y]),kmax)). If k is not specified, % it can be selected using the option 'rmsecv'. % kmax : Maximal number of components to be used (default = 9). % If k is provided, kmax does not need to be specified, unless k is larger % than 9. % alpha : (1-alpha) measures the fraction of outliers the algorithm should % resist. Any value between 0.5 and 1 may be specified. (default = 0.75) % h : (n-h+1) measures the number of observations the algorithm should % resist. Any value between n/2 and n may be specified. (default = 0.75*n) % Alpha and h may not both be specified. % rmsecv : If equal to zero and k is not specified, a robust R-squared curve % is plotted and an optimal k value can be chosen. (default) % If equal to one and k is not specified, a robust component selection-curve is plotted and % an optimal k value can be chosen. This curve computes a combination of % the cross-validated root mean squared error and the robust residual % sum of squares for k=1 to kmax. Rmsecv and k may not both % be specified. % rmsep : If equal to one, the robust RMSEP-value (root mean squared error of % prediction) for the model with k components (default = 0). This value % is automatically given if rmsecv = 1. % plots : If equal to one, a menu is shown which allows to draw several plots, % such as a robust score outlier map and a regression % outlier map are drawn. (default) % If the input argument 'classic' is equal to one, the classical % diagnostic plots are drawn as well. % If 'plots' is equal to zero, all plots are suppressed. % See also makeplot.m % labsd : The 'labsd' observations with largest score distance are % labeled on the outlier map (default = 3) % labod : The 'labod' observations with largest orthogonal distance are % labeled on the outlier map (default = 3) % labresd : The 'labresd' observations with largest residual distance are % labeled on the outlier map (default = 3) % classic : If equal to one, the classical SIMPLS analysis will be performed as well % (see also csimpls.m). (default = 0) % % Options for advanced users: % kr : Total number of components used by the ROBPCA method. % We advise to use kr=k+q. (default) % kmaxr : Maximal number of components used by the ROBPCA method. % default = min(kmax+q,rank([x,y])) % plotsrobpca : If equal to one, a robust score outlier map from ROBPCA is drawn. % If the input argument 'classic' is equal to one, the classical % outlier map is drawn as well. % If 'plotsrobpca' is equal to zero, all plots are suppressed. (default) % st : Indicates the current stage of the algorithm for cross-validation (RMSECV) % default = 0 -> performs all the stages of the algorithm and computes all the % parameters for the full model. % st=1, robpca is still performed, but when % st=2, the algorithm proceeds based on the previous knownledge of the output from robpca. % out : Is an empty structure, but is constructed while performing the cross-validation. % (see rrmse.m) % % I/O: result=rsimpls(x,y,'k',k,'kmax',10,'alpha',0.75,'h',h,'rmsecv',0,'rmsep',0,... % 'plots',1,'labsd',3,'labod',3,'labresd',3,'classic',1,'kr',kr,... % 'kmaxr',kmaxr,'plotsrobpca',0,'st',0,'out',[]); % The user should only give the input arguments that have to change their default value. % The name of the input arguments needs to be followed by their value. % The order of the input arguments is of no importance. % % Examples: % rsimpls(x,y,'k',5,'plots',1); % rsimpls(x,y,'classic',1,'rmsecv',1); % % The output of RSIMPLS is a structure containing: % % result.slope : Robust slope estimate % result.int : Robust intercept estimate % result.fitted : Robust fitted values % result.res : Robust residuals % result.cov : Estimated variance-covariance matrix of the residuals % result.T : Robust scores % result.weights.r : Robust simpls weights % result.weights.p : Robust simpls weights % result.Tcenter : Robust center of the scores % result.Tcov : Robust covariance matrix of the scores % result.rsquared : Robust R-squared value for the optimal k % result.rcs : Robust Component Selection Criterion: % This is a matrix with kmax columns. The first row contains the approximate % R-squared values (for k=1,...,kmax) and the second row the square root of the weighted % residuals sum of squares. This is equal to the RCS-value with % gamma = 0 (see Engelen and Hubert (2005) for the definition). % If the input argument rmsecv = 1, the third row contains the RCS-value for % gamma = 0.5 and the fourth for gamma = 1. The last one is equal to the robust % cross-validated RMSE values. % Note that all the entries in this matrix depend on the choice of kmax. % result.rmsep : Robust RMSEP value % result.k : Number of components used in the regression % result.h : The quantile h used throughout the algorithm % result.sd : Robust score distances % result.od : Robust orthogonal distances % result.resd : Residual distances (when there are several response variables). % If univariate regression is performed, it contains the standardized residuals. % result.cutoff : Cutoff values for the score (result.cutoff.sd), orthogonal % (result.cutoff.od) and residual distances (result.cutoff.resd). % We use 0.975 quantiles of the chi-squared distribution. % result.flag : The observations whose score distance is larger than % 'result.cutoff.sd' receive a flag 'result.flag.sd' equal % to zero (good leverage points). Otherwise 'result.flag.sd' % is equal to one. % The components 'result.flag.od' and 'result.flag.resd' are % defined analogously, and determine the orthogonal outliers, % resp. the bad leverage points/vertical outliers. % The observations with 'result.flag.od' and 'result.flag.resd' % equal to zero, can be considered as calibration outliers and receive % 'result.flag.all' equal to zero. The regular observations and the good leverage % points have 'result.flag.all' equal to one. % result.class : 'RSIMPLS' % result.classic : If the inputargument 'classic' is equal to 1, this structure % contains results of the classical SIMPLS analysis. (see also csimpls.m) % results.robpca : The results of robpca on [X,Y]. % % This function is part of LIBRA: the Matlab Library for Robust Analysis, % available at: % http://wis.kuleuven.be/stat/robust.html % % Written by Karlien Vanden Branden % Version date: 07/04/2004 % Last update: 04/08/2006 % %initialization with defaults % if rem(nargin,2)~=0 error('Number of input arguments must be even!'); end [n,p1]=size(x); [n2,q1]=size(y); z=[x,y]; rx=rank(x); rz=rank(z); if n~=n2 error('The response variables and the predictor variables have a different number of observations.') end niter=100;counter=1;mcd=0;alfa=0.75; kmax=min([9,rx,floor(n/2),p1]); kmaxr=min([kmax+q1,rz]); h=floor(2*floor((n+kmaxr+1)/2)-n+2*(n-floor((n+kmaxr+1)/2))*alfa); labsd=3;labod=3;labresd=3; plotsrobpca=0;plots=1;k=0; kr=k+q1; st=0;rmsecv=0; out=[];classic=0;rmsep=0;rmsep_value=nan;rmsecv_value = nan;rsquared_value = nan;rss_value = nan; default=struct('alpha',alfa,'h',h,'labsd',labsd,'labod',labod,'labresd',labresd,'k',k,'kr',kr,... 'plotsrobpca',plotsrobpca,'plots',plots,'kmax',kmax,'st',st,... 'out',out,'rmsecv',rmsecv,'classic',classic,'rmsep',rmsep,... 'kmaxr',kmaxr,'rmsep_value',rmsep_value,'rmsecv_value',rmsecv_value,'rsquared_value',... rsquared_value,'rss_value',rss_value); list=fieldnames(default); options=default; IN=length(list); i=1; % if nargin>2 % %placing inputfields in array of strings % for j=1:nargin-3 if rem(j,2)~=0 chklist{i}=varargin{j}; i=i+1; end end dummy=sum(strcmp(chklist,'h')+2*strcmp(chklist,'alpha')); switch dummy case 0 %Take on default values options.alpha=alfa;%0.75; options.h=h; case 3 error('Both input arguments alpha and h are provided. Only one is required.') end % %Checking which default parameters have to be changed % and keep them in the structure 'options'. % while counter<=IN index=strmatch(list(counter,:),chklist,'exact');%contains the users input one by one if ~isempty(index) %in case of similarity for j=1:nargin-3 %searching the index of the accompanying field if rem(j,2)~=0 %fieldnames are placed on odd index if strcmp(chklist{index},varargin{j}) I=j; end end end options=setfield(options,chklist{index},varargin{I+1}); index=[]; end counter=counter+1; end options.h=floor(options.h); options.kmax=floor(options.kmax); options.k=floor(options.k); options.kmaxr=floor(options.kmaxr); options.kr=floor(options.kr); kmax=min([options.kmax,floor(n/2),rz,p1]); kmaxr=max([min([options.kmaxr,rz]),kmax+q1]); k=min(options.k,kmax); while k<0 k=input(['The number of components can not be negative.\n'... 'How many principal components would you like to retain?\n']); end if any(strcmp(chklist,'kr')) kr=floor(max([options.kr,k+q1])); else kr=k+q1; end if dummy==1 %checking inputvariable h if options.h-floor(options.h)~=0 mess=sprintf('Attention (rsimpls.m): h must be an integer. \n'); disp(mess) end if kr==0 if options.hn options.alpha=0.75; if kr==0 options.h=floor(2*floor((n+kmaxr+1)/2)-n+2*(n-floor((n+kmaxr+1)/2))*options.alpha); else options.h=floor(2*floor((n+kr+1)/2)-n+2*(n-floor((n+kr+1)/2))*options.alpha); end mess=sprintf(['Attention (rsimpls.m): h should be smaller than n. \n',... 'It is set to its default value ',num2str(options.h)]); disp(mess) end elseif dummy==2 if options.alpha < 0.5 options.alpha=0.5; mess=sprintf(['Attention (rsimpls.m): Alpha should be larger than 0.5. \n',... 'It is set to 0.5.']); disp(mess) end if options.alpha > 1 options.alpha=0.75; mess=sprintf(['Attention (rsimpls.m): Alpha should be smaller than 1.\n',... 'It is set to 0.75.']); disp(mess) end if kr==0 options.h=floor(2*floor((n+kmaxr+1)/2)-n+2*(n-floor((n+kmaxr+1)/2))*options.alpha); else options.h=floor(2*floor((n+kr+1)/2)-n+2*(n-floor((n+kr+1)/2))*options.alpha); end end h=options.h;alfa=options.alpha;labsd=max(0,min(floor(options.labsd),n)); dummyh = strcmp(chklist,'h'); dummykmax = strcmp(chklist,'kmax'); if all(dummyh == 0) && any(dummykmax) h = floor(2*floor((n+kmax+1)/2)-n+2*(n-floor((n+kmax+1)/2))*alfa); end labod=max(0,min(floor(options.labod),n));labresd=max(0,min(floor(options.labresd),n)); plotsrobpca=options.plotsrobpca;plots=options.plots; st=options.st; out=options.out; rmsecv=options.rmsecv; classic=options.classic; rmsep=options.rmsep; rmsep_value = options.rmsep_value; rmsecv_value = options.rmsecv_value; rmsecv = options.rmsecv; rsquared_value = options.rsquared_value; rss_value = options.rss_value; end if q1==1 && k>=(h-2) mess=sprintf(['Attention (rsimpls.m): The number of components, k = ',num2str(k),... '\n is larger than our recommended maximum value of k = ',num2str(h-2)-1,'.']); disp(mess) elseif q1>1 && k>=((h/q1)-(q1/2)-0.5) mess=sprintf(['Attention (rsimpls.m): The number of components, k = ',num2str(k),... '\n is larger than our recommended maximum value of k = ',num2str(floor((h/q1)-(q1/2)-1.5)),'.']); disp(mess) end % %MAIN PART % % selection of number of components if k == 0 if rmsecv == 0 [R2,final]=rsquared(x,y,kmax,'RSIMPLS',options.h); rss = final.rss; k = final.k; result=rsimpls(x,y,'k',k,'kr',k+q1,'h',h,'rmsecv',0,'rmsep',options.rmsep,'plots',0,'classic',classic,'rsquared_value',R2,'rss_value',rss); else out=rrmse(x,y,h,kmax,'RSIMPLS',1); R2 = out.R2; rss = out.rss; k=out.k; rmsecv_value = out.rmsecv; pred = rrmse(x,y,h,kmax,'RSIMPLS',0,k,out.weight,out.res); result=rsimpls(x,y,'k',k,'kr',k+q1,'h',h,'rmsecv',0,'rmsep',0,'plots',0,'classic',classic,'rmsep_value',pred.rmsep,'rmsecv_value',rmsecv_value,'rsquared_value',R2,'rss_value',rss); end else if rmsecv error(['Both RMSECV and k were given.', ... 'Please rerun your analysis with one of these inputs. (see help file)']) end %First stage: Obtain the scores T by first performing ROBPCA on z: if st<=1 out.robpca=robpca(z,'k',kr,'h',h,'plots',plotsrobpca,'kmax',kmaxr,'classic',classic,'mcd',mcd); out.h=h; out.centerz=out.robpca.M; out.sigmaxy=out.robpca.P(1:p1,:)*diag(out.robpca.L)*out.robpca.P(p1+1:p1+q1,:)'; out.sigmax=out.robpca.P(1:p1,:)*diag(out.robpca.L)*out.robpca.P(1:p1,:)'; out.xcentr=x-repmat(out.centerz(1:p1),n,1); out.ycentr=y-repmat(out.centerz(p1+1:p1+q1),n,1); out.weights2=out.robpca.flag.all; end if st i=k; else i=1; end while i<=k out.sigmayx=out.sigmaxy'; if q1>p1 [RR,LL]=eig(out.sigmaxy*out.sigmayx); [LL,I]=greatsort(diag(LL)); rr=RR(:,I(1)); qq=out.sigmayx*rr; qq=qq/norm(qq); else [QQ,LL]=eig(out.sigmayx*out.sigmaxy); [LL,I]=greatsort(diag(LL)); qq=QQ(:,I(1)); rr=out.sigmaxy*qq; rr=rr/norm(rr); end tt=out.xcentr*rr; uu=out.ycentr*qq; pp=out.sigmax*rr/(rr'*out.sigmax*rr); vv=pp; if i>1 vv=vv-out.v*(out.v'*pp); end if vv'*vv==0 error('The number of components is too large') end vv=vv./norm(vv); out.sigmaxy=out.sigmaxy-vv*(vv'*out.sigmaxy); out.v(:,i)=vv; out.q(:,i)=qq; out.t(:,i)=tt; out.u(:,i)=uu; out.p(:,i)=pp; out.r(:,i)=rr; i=i+1; end %Second Stage : Robust ROBPCA-regression robpcareg=robpcaregres(out.t,y,out.weights2); breg=robpcareg.coeffs(1:k,:); Yhat=out.t*breg+repmat(robpcareg.coeffs(k+1,:),n,1); b=out.r*breg; int=robpcareg.coeffs(k+1,:)-out.centerz(1:p1)*out.r*breg; if rmsep==1 rmse=rrmse(x,y,h,kmax,'RSIMPLS',0,k); out.rmsep=rmse.rmsep; end % testing several output parameters if ~isnan(rmsep_value) out.rmsep = rmsep_value; options.rmsep = 1; end if ~isnan(rsquared_value) out.rcs = rsquared_value; end if ~isnan(rss_value) out.rcs = [out.rcs;sqrt(rss_value)]; end if ~isnan(rmsecv_value) gammahalf = 0.5*sqrt(rss_value) + 0.5*rmsecv_value; out.rcs = [out.rcs;gammahalf;rmsecv_value]; options.rmsecv = 1; end if any(isnan(rsquared_value)) && any(isnan(rss_value)) && any(isnan(rmsecv_value)) out.rcs = 0; end %The output: out.T=out.t; out.weights.p=out.p; out.weights.r=out.r; out.kr=kr; out.h=h; out.alpha=alfa; out.slope=b; out.int=int; out.yhat=x*b+repmat(int,n,1); out.x=x; out.y=y; out.res=y-out.yhat; out.class='RSIMPLS'; out.k=k; out.cov=robpcareg.cov; if ~st %calculation of robust distances %Score distance out.Tcov=robpcareg.sigma(1:k,1:k); out.Tcenter=robpcareg.center(1:k); out.sd=sqrt(mahalanobis(out.t,out.Tcenter,'cov',out.Tcov))'; out.cutoff.sd=sqrt(chi2inv(0.975,k)); %robust residual distance if q1==1 out.resd=out.res/sqrt(out.cov); else out.resd=sqrt(mahalanobis(out.res,zeros(1,q1),'cov',out.cov))'; end %robust orthogonal distances xtilde=out.t*out.p'; Cdiff=out.xcentr-xtilde; for i=1:n out.od(i,1)=norm(Cdiff(i,:)); end r=rank(x); if k~=r [m,s]=unimcd(out.od.^(2/3),out.h); out.cutoff.od = sqrt(norminv(0.975,m,s)^3); else out.cutoff.od=0; end out.cutoff.resd=sqrt(chi2inv(0.975,q1)); %Computing flags out.flag.od=out.od<=out.cutoff.od; out.flag.resd=abs(out.resd)<=out.cutoff.resd; out.flag.all=(out.flag.od & out.flag.resd); %Multivariate Rsquared Yw=y(out.flag.all==1,:); cYw=mcenter(Yw); res=out.res(out.flag.all==1,:); out.rsquared=1-(det(res'*res)/det(cYw'*cYw)); %Assigning output if options.rmsep~=1 && options.rmsecv~=1 out.rmsep=0; end if classic resultclassic=csimpls(x,y,'k',k,'plots',0); else resultclassic=0; end result=struct('slope',{out.slope}, 'int',{out.int},'fitted',{out.yhat},'res',{out.res}, 'cov',{out.cov},... 'T',{out.T}, 'weights', {out.weights},'Tcenter',{out.Tcenter},'Tcov',{out.Tcov},'rsquared',{out.rsquared},'rcs',{out.rcs},'rmsep',{out.rmsep},... 'k',{out.k},'alpha',{out.alpha},'h',{out.h},'sd', {out.sd},'od',{out.od},... 'resd',{out.resd},'cutoff',{out.cutoff},'flag',{out.flag},'class',{out.class},'classic',{resultclassic},'robpca',{out.robpca}); if result.rcs==0 result=rmfield(result,'rcs'); end if result.rmsep==0 result=rmfield(result,'rmsep'); end else result=out; end end % Plots try if plots && options.classic makeplot(result,'classic',1,'labsd',labsd,'labod',labod,'labresd',labresd) elseif plots makeplot(result,'labsd',labsd,'labod',labod,'labresd',labresd) end catch %output must be given even if plots are interrupted end