How to Identify and Describe the Impact of Influential Outliers

Before you analyze your data, examine it for outlying values. Outliers, particularly those that carry large sample weights, can have a disproportionate effect on estimates.

Check for Outliers by Running a Univariate Analysis

Use PROC UNIVARIATE to get the default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. Use the VAR statement to name the variable of interest (ALLMEAN_CNT or ALLMEAN_MV), the WHERE statement to restrict the analysis to participants older than 20 and younger than 60 years, and the ID statement to list the sequence numbers (SEQN) associated with extreme values in the output.

Sample Code

* NORMAL requests tests for normality and PLOT requests distribution plots ;
proc univariate data=paxmstr normal plot;
 var allmean_cnt;
 where ridageyr > 20 and ridageyr < 60;
 id seqn;
 title 'Distribution of Mean total counts per day from all valid days';
run;

 

Output of Program

Download program output [PDF - 48 KB]

 

Sample Code

proc univariate data=paxmstr normal plot;
 var allmean_mv;
 where ridageyr > 20 and ridageyr < 60;
 id seqn;
 title 'Mean duration (minutes) of moderate and vigorous activity bouts (8 out of 10 minute bouts) per day from all valid days';
run;

 

Output of Program

Download program output [PDF - 34 KB]
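
If you want the extreme values in a data set rather than only in the printed output, something like the following sketch may help. It is not part of the original tutorial programs. It assumes that PROC UNIVARIATE's extreme-observations table is available under the ODS table name ExtremeObs, and the output data set name (extreme_cnt) and the NEXTROBS=10 setting are illustrative choices only.

```sas
/* Sketch: capture the extreme-observations table as a data set.        */
/* The data set name EXTREME_CNT and NEXTROBS=10 are illustrative only. */
ods output ExtremeObs=extreme_cnt;

proc univariate data=paxmstr nextrobs=10;
 var allmean_cnt;
 where ridageyr > 20 and ridageyr < 60;
 id seqn;
run;

/* List the captured lowest and highest values for review */
proc print data=extreme_cnt noobs;
run;
```

The same pattern could be repeated for ALLMEAN_MV by changing the VAR statement and the output data set name.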

 

Plot Sample Weight against the Distribution of the Variable

Use PROC GPLOT to plot the sample weight against the summary metric (ALLMEAN_CNT or ALLMEAN_MV) for each observation in the dataset; an extreme value that also carries a large sample weight has the greatest potential to influence weighted estimates. Remember that when combining data from two NHANES cycles, you must adjust the two-year sample weight by dividing it by 2. A new weight variable (WTMEC4CD) that takes this into account was created before this step. As an analyst, you should decide whether the observed distributions warrant further investigation. For this tutorial, we treat the ALLMEAN_CNT and ALLMEAN_MV variables as acceptable “as is” based on the observed distributions and do not remove outliers.
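
The weight adjustment described above might look like the following sketch. This is not the tutorial's own code: it assumes the two-year MEC exam weight is stored in a variable named WTMEC2YR and is already present in the PAXMSTR data set.

```sas
/* Sketch: derive a four-year MEC exam weight from the two-year weight. */
/* WTMEC2YR is assumed to be the two-year weight variable in PAXMSTR.   */
data paxmstr;
 set paxmstr;
 WTMEC4CD = WTMEC2YR / 2;   /* two cycles combined, so divide by 2 */
run;
```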

Sample Code

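* Draw each observation as a small dot ;
symbol1 value=dot height=.2;

* Send the plot to an RTF file ;
ods rtf file="c:/nhanes/output/weight_outliers_ALLMEAN_CNT.rtf";
title;

* Sample weight (vertical axis) against mean counts per day (horizontal axis) ;
proc gplot data=paxmstr;
 plot WTMEC4CD*allmean_cnt / frame;
 where ridageyr > 20 and ridageyr < 60;
run;
quit;

ods rtf close;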

* Repeat the plot for the moderate and vigorous activity duration variable ;
symbol1 value=dot height=.2;

ods rtf file="c:/nhanes/output/weight_outliers_ALLMEAN_MV.rtf";
title;

proc gplot data=paxmstr;
 plot WTMEC4CD*allmean_mv / frame;
 where ridageyr > 20 and ridageyr < 60;
run;
quit;

ods rtf close;

 

Output of Program

Download program output [PDF - 311 KB]