## Task 3: How to Identify Outliers and Evaluate Their Impact

In this task, you will check for outliers and their potential impact using the following steps:

### Step 1: Check for Outliers By Running a Univariate Analysis

Before you analyze your data, examine it for outlying values, because a few extreme observations can distort estimates such as the mean. The example below is taken from the sample “Outlier” program.

#### Sample Code

*--------------------------------------------------------------;
* Use the PROC UNIVARIATE procedure to get all default         ;
* descriptive statistics, such as mean, minimum and            ;
* maximum values, standard deviation, and skewness.  Use       ;
* the VAR statement to identify the variable of interest.      ;
* Use the ID statement to list the sequence numbers associated ;
* with extreme values in the output.                           ;
*--------------------------------------------------------------;

proc univariate data =diet normal plot ;
var dr1tcalc;
id seqn;
title 'check the distribution of total calcium intake in children aged 6-11' ;
title2 'uses NHANES 03-04 day 1 intake data' ;
run ;
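
To focus on the extreme values alone, the output can be limited to the extreme-observations table. The sketch below assumes the same `diet` dataset and uses the `NEXTROBS=` option to list the ten lowest and ten highest values rather than the default five; the choice of ten is illustrative and is not part of the sample “Outlier” program.

```sas
* Hedged sketch: list only the 10 lowest and 10 highest values ;
* of DR1TCALC with their SEQN identifiers.  NEXTROBS=10 is an  ;
* illustrative choice, not part of the original program.       ;

ods select ExtremeObs;
proc univariate data=diet nextrobs=10;
  var dr1tcalc;
  id seqn;
run;
```

The `ODS SELECT` statement applies only to the next procedure step, so later procedures print their full default output.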

### Step 2: Plot Sample Weight against the Distribution of the Variable

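In this example, you will plot the day 1 dietary sample weight (WTDRD1), the weight variable used in the sample code, against the distribution of the total day 1 calcium intake variable (DR1TCALC) to determine whether the extreme observations are outliers. An extreme value that also carries a large sample weight has more influence on weighted estimates. As a reminder, the example below is taken from the sample “Outlier” program.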

#### Sample Code

*-----------------------------------------------------------------;
* Use the PROC GPLOT procedure to plot the total day 1 calcium    ;
* intake (DR1TCALC) by the corresponding sample weight for each   ;
* observation in the dataset.  The SYMBOL statement and its      ;
* VALUE= and HEIGHT= options format the plotted points.          ;
*-----------------------------------------------------------------;

symbol1 value =dot height = .2 ;

proc gplot data =diet;
plot wtdrd1*dr1tcalc/ frame ;
run ;
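
PROC GPLOT requires SAS/GRAPH. Where that is not available, a similar scatter plot can be sketched with PROC SGPLOT. This is an alternative under that assumption, not the method used in the sample program; the marker settings and axis labels are illustrative choices.

```sas
* Hedged sketch: same scatter plot drawn with PROC SGPLOT.     ;
* Marker attributes and axis labels are illustrative choices.  ;

proc sgplot data=diet;
  scatter x=dr1tcalc y=wtdrd1 / markerattrs=(symbol=circlefilled size=3);
  xaxis label='Total day 1 calcium intake (DR1TCALC)';
  yaxis label='Sample weight (WTDRD1)';
run;
```

As in the GPLOT version, the sample weight is on the vertical axis and calcium intake is on the horizontal axis, so points far to the right are the extreme intakes.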

### Step 3: Identify Outliers and Compare Estimates with Outliers Deleted Against the Original Estimates with Outliers Included

In this step you will:

• delete the three outliers identified in the plot above using the SEQN numbers; and
• compare the mean of the new dataset without the outliers against the mean of the original dataset that includes the outliers to check the impact of the outlier observations.

#### Sample Code

*----------------------------------------------------------------;
* Use the IF, THEN, and DELETE statements in the DATA step to    ;
* delete the identified outliers using their sequence numbers.   ;
*                                                                ;
* Use the PROC MEANS procedure to estimate the weighted mean     ;
* and standard error of the dataset with and without the         ;
* outlier values.                                                ;
*----------------------------------------------------------------;

data exclu_3SPs;
set diet;
if seqn in ( 24099 , 22459 , 25817 ) then delete ;
run ;

proc format ;
value gender 1 = 'male'
2 = 'female' ;
run ;

proc means data =diet mean stderr maxdec = 1 ;
title3 'without exclusion' ;
var dr1tcalc;
class riagendr;
weight wtdrd1;
format riagendr gender. ;
run ;

proc means data =exclu_3SPs mean stderr maxdec=1 ;
title3 'after removing 3 outlier values' ;
var dr1tcalc;
class riagendr;
weight wtdrd1;
format riagendr gender. ;
run ;
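
To put a number on the impact, the two sets of weighted means can be saved and compared side by side. The sketch below assumes the `diet` and `exclu_3SPs` datasets and the `gender` format created above; the names `m_all`, `m_excl`, `compare`, `mean_all`, `mean_excl`, and `pct_change` are hypothetical and do not appear in the sample program.

```sas
* Hedged sketch: save the weighted means from both datasets    ;
* and compute the percent change by gender.  All output        ;
* dataset and variable names here are hypothetical.            ;

proc means data=diet noprint nway;
  class riagendr;
  var dr1tcalc;
  weight wtdrd1;
  output out=m_all mean=mean_all;
run;

proc means data=exclu_3SPs noprint nway;
  class riagendr;
  var dr1tcalc;
  weight wtdrd1;
  output out=m_excl mean=mean_excl;
run;

* Merge the two summaries by gender and compute the change.    ;
data compare;
  merge m_all(keep=riagendr mean_all) m_excl(keep=riagendr mean_excl);
  by riagendr;
  pct_change = 100 * (mean_excl - mean_all) / mean_all;
run;

proc print data=compare noobs;
  format riagendr gender. mean_all mean_excl 8.1 pct_change 8.2;
run;
```

A `pct_change` close to zero suggests that the three observations have little influence on the weighted mean. What counts as a meaningful change is a judgment for the analyst.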