function [thres,outlierindex] = findoutliers(x,xcenter,N,alpha,mingap) % [THRES,OUTLIERINDEX] = FINDOUTLIERS(X,XCENTER,N,ALPHA,MINGAP) % % Define threshold and identify "outliers" in vector x, i.e., points that % are much larger than most of the points in x, separated by a large gap % that is very unlikely if all of the points are generated by the same % distribution. % % Author: % Katherine T. Schwarz % Department of Physics, University of California, Berkeley % December 2008 % kts@cal.berkeley.edu % % Inputs: % X vector of data % XCENTER approximate center of distribution of regular data; we only % look at points above this value. The algorithm is based on the % assumption that beyond this starting point is the tail of the % distribution. % N number of points to average when estimating gap size % ALPHA significance level: find gap with P(gap) < alpha % MINGAP minimum allowable size of gap % % Output: % THRES maximum of "regular" data; everything above it is outlier % OUTLIERINDEX same size as X, true/false indicator of outliers % % Algorithm: % 1. Sort data. % 2. Start from XCENTER and work up toward higher ranked points. % At each point, estimate the expected value of spacing between this % point and the point below, if this is the highest valid point. % 3. Stop at the first value where the gap is much larger than expected. % "Much larger" means the probability of a gap at least this large is % less than parameter ALPHA. Typically ALPHA = 0.05. % % reference: % P. Burridge and A.M.R. Taylor (2006). "Additive Outlier Detection via % Extreme-Value Theory". Journal of Time Series Analysis 27(5):685-701 % % I. Weissman (1978). "Estimation of Parameters and Large Quantiles Based % on the k Largest Observations." Journal of the American Statistical % Association 73(364):812-815 xsort = sort(x); D = diff(xsort); % spacings between successively ranked points % If values of x are all drawn from the same distribution, and if it is in % the Gumbel domain of attraction, then the extreme order statistics of x % form a Poisson process, i.e. the spacings between them are independent % and exponentially distributed, with mean proportional to 1/i, where i is % the rank counting down from maximum (i.e., i=1 for the largest x, etc.) % % Dhat[n] is the estimated mean value of D[n], assuming it is the gap at % the top of the ladder, between the 1st and 2nd largest regular points. % This is the minimum-variance unbiased estimator (Weissman 1978). Dhat = filter([0 2:N],1,D)/(N-1); % Dhat[n] = average of i*D[n-i] for i=2:N % D is an exponentially distributed random variable; % P(gap > critgap*mean of D) = exp(-critgap) = alpha % If the data has finite resolution, there will often be long sequences % of identical values in xsort, resulting in long sequences of 0 in D, % and thus also nearly 0 in Dhat, so the next nonzero value of D % will appear to be a "large" gap. To prevent this, require gaps to % exceed a minimum gap size, as well as the critical size defined above. % Find the *first* gap satisfying these criteria, starting from N points % beyond xcenter, to allow enough points to average over istart = find(xsort > xcenter,1) + N - 1; gapindex = find(D(istart:end) > log(1/alpha)*Dhat(istart:end) & ... D(istart:end) > mingap,1); if isempty(gapindex) outlierindex = false(size(x)); thres = NaN; else thres = xsort(gapindex + istart - 1); % value of point just before gap outlierindex = (x > thres); end return