How to Identify and Describe the Impact of Influential Outliers

Before you analyze your data, examine it for outlying values.

Delete Observations with Implausible Values

There are 10,080 minutes in a week (7 days x 24 hours x 60 minutes), so a larger value of weekly activity is impossible. Delete study participants whose minutes of weekly activity exceed this number. In the PAQMSTR dataset we created a variable (TOTMINW) that combines minutes of household/yard, transportation, and leisure-time weekly activity to describe total minutes of moderate-to-vigorous physical activity per week.

 

Sample Code

proc freq data =paq;
tables TOTMINW*SEQN/ list ;
/* A second WHERE statement would replace the first, so combine the conditions */
where WTINT4CD > 0 and RIDAGEYR >= 16 and TOTMINW > 10080 ;
run ;

data paq;
set paq;
if TOTMINW > 10080 then delete ;
run ;

Check for Outliers among Plausible Data by Running a Univariate Analysis

Use PROC UNIVARIATE to get the default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. Use the VAR statement to identify the variable of interest (TOTMINW). Use the ID statement to list the sequence numbers (SEQN) associated with the extreme values in the output. The NORMAL option requests tests for normality, and the PLOT option requests plots of the distribution.

Sample Code

proc univariate data =paq normal plot ;
 var TOTMINW;
 where WTINT4CD > 0 and RIDAGEYR >= 16 ;
 id seqn;
 title 'Distribution of TOTMINW among study participants aged 16 and older' ;
run ;
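
As an optional variation on the program above, the following sketch limits the output to the table of extreme observations, which is the part most relevant to outlier screening. It assumes the same paq dataset; the NEXTROBS value of 10 is an illustrative choice, not a recommendation from this tutorial.

```sas
/* Sketch: show only the extreme-observations table for TOTMINW.        */
/* NEXTROBS=10 is an illustrative choice; the default lists 5 lowest    */
/* and 5 highest values.                                                */
ods select ExtremeObs ;

proc univariate data = paq nextrobs = 10 ;
 var TOTMINW;
 where WTINT4CD > 0 and RIDAGEYR >= 16 ;
 id SEQN;
 title 'Extreme values of TOTMINW among study participants aged 16 and older' ;
run ;

title ;  /* clear the title so it does not carry over to later steps */
```

Because the ID statement names SEQN, each extreme value in the table is labeled with the sequence number of the participant it belongs to.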

 

Output of Program

Download program output [PDF - 196 KB]

 

Plot Sample Weight against the Distribution of the Variable

Use PROC GPLOT to plot the sample weight (WTINT4CD) of each observation in the dataset against its total minutes of moderate-to-vigorous activity per week (TOTMINW). Set 7,560 minutes per week as the maximum reasonable volume of weekly activity. This limit corresponds to 18 waking hours per day, assuming that study participants sleep for a minimum of 6 hours each night. The HREF option draws a vertical reference line at this value.

Sample Code

symbol1 value = square height = .5 ;

proc gplot data = paq;
 plot WTINT4CD*TOTMINW/ href = 7560 frame ;
run ;
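
The two cutoffs used on this page come from simple arithmetic. As a quick check, the DATA step below writes both values to the SAS log. The variable names week_minutes and waking_minutes are illustrative only and are not part of the dataset.

```sas
/* Sketch: derive the two cutoffs and write them to the log.            */
/* week_minutes and waking_minutes are hypothetical names.              */
data _null_ ;
 week_minutes   = 7 * 24 * 60 ;  /* all minutes in a week: 10080        */
 waking_minutes = 7 * 18 * 60 ;  /* 18 waking hours per day: 7560       */
 put week_minutes= waking_minutes= ;
run ;
```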

 

Output of Program

Download program output [PDF - 52 KB]

Highlights

  • Seven observations have weekly minutes of total moderate-to-vigorous physical activity (TOTMINW) above 7,560.
  • Only one of these outliers has a moderately large sample weight. Therefore, removing these observations would not have a great effect on population estimates.
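
One way to review the points summarized above is to list the observations beyond the reference line together with their sample weights, largest weight first. This is a minimal sketch that assumes the same paq dataset; the output dataset name outliers is hypothetical.

```sas
/* Sketch: list observations above 7,560 minutes, ordered by sample     */
/* weight. "outliers" is a hypothetical dataset name.                   */
proc sort data = paq out = outliers ;
 where WTINT4CD > 0 and RIDAGEYR >= 16 and TOTMINW > 7560 ;
 by descending WTINT4CD ;
run ;

proc print data = outliers noobs ;
 var SEQN TOTMINW WTINT4CD ;
 title 'Observations with TOTMINW > 7560, ordered by sample weight' ;
run ;

title ;  /* clear the title so it does not carry over to later steps */
```

Sorting by descending weight puts the observations that contribute most to weighted estimates at the top of the listing.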

 

Identify Outliers and Compare Estimates with Outliers Deleted Against the Original Estimates with Outliers Included

Use the IF-THEN and DELETE statements in a DATA step to delete the identified outliers with TOTMINW > 7560. Use PROC MEANS to estimate the mean and standard error twice: once with the outliers included and once with them excluded.

 

Sample Code

proc freq data =paq;
 tables TOTMINW*SEQN/ list ;
 where WTINT4CD > 0 and RIDAGEYR >= 16 and TOTMINW > 7560 ;
run ;

data exclude_SP;
 set paq;
 if WTINT4CD > 0 and RIDAGEYR >= 16 and TOTMINW > 7560 then delete ;
run ;

proc format ;
 value GENDERF 1 ='Male'
               2 ='Female';
run ;

proc means data = paq mean stderr maxdec = 1 ;
 title 'No Exclusions';
 var TOTMINW;
 class RIAGENDR;
 weight WTINT4CD;
 format RIAGENDR GENDERF. ;
run ;

proc means data = exclude_SP mean stderr maxdec = 1 ;
 title 'Outlier Exclusion';
 var TOTMINW;
 class RIAGENDR;
 weight WTINT4CD;
 format RIAGENDR GENDERF. ;
run ;

 

Output of Program

Download program output [PDF - 38 KB]