## Task 3: How to Identify Outliers and Evaluate Their Impact in NHANES III Data

In this task, you will check for outliers and their potential impact using the following steps:

### Step 1: Check distributions by running a univariate analysis

Before you analyze your data, check the distribution and normality of each continuous variable and identify any outliers.

#### Program to Plot Distribution of Continuous Variable

| Statements | Explanation |
|---|---|
| `proc univariate data=demo3_nh3` | Use the proc univariate procedure to get all default descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness. |
| `normal plot;` | Use the normal and plot options on the proc univariate statement to request tests of normality and distribution plots, including a normal probability plot. |
| `where hsageu=2 and hsageir>=20 and dmpstat=2;` | Use the where statement to select the participants who were age 20 years and older and who had both the home interview and the MEC exam. |
| `id seqn;` | Use the id statement to list the sequence numbers associated with extreme values in the output. |
| `var tcp;` | Use the var statement to list the variable of interest. |

Highlighted items from the univariate analysis output:

• In this example, five outlier values with serum cholesterol values over 490 mg/dl are identified in the distribution.
• The id statement allows you to link the extreme values with identifier sequence numbers (SEQN). These sequence numbers are useful if you decide to delete these outlier cases.
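
To focus on the extreme observations alone, the univariate step can be narrowed to a single output table. The following is a minimal sketch rather than part of the original program. It assumes the `demo3_nh3` dataset used in this task, adds the `nextrobs=` option to request ten extreme observations at each end instead of the default five, and adds an `ods select` statement so that only the extreme-observations table is printed.

```sas
/* Sketch: print only the extreme observations of tcp, labeled by seqn. */
/* The ods select line and nextrobs=10 are additions for illustration.  */
ods select ExtremeObs;

proc univariate data=demo3_nh3 nextrobs=10;
  where hsageu=2 and hsageir>=20 and dmpstat=2;  /* adults 20+, interviewed and MEC-examined */
  id seqn;                                       /* show the sequence number beside each value */
  var tcp;                                       /* total serum cholesterol */
run;
```

The selection made by `ods select` applies only to the next procedure step, so later procedures print their full output again.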

### Step 2: Plot a graph of survey weight against the distribution of the variable

In this example, you will plot the 6-year MEC survey weight (wtpfex6) against the distribution of the cholesterol variable to determine whether the extreme observations are influential outliers.

#### Plot Exam Weight Against Cholesterol

| Statements | Explanation |
|---|---|
| `symbol1 value=dot height=.2;` | Use the symbol1 statement, with its value and height options, to set the shape and size of the plotting symbols. |
| `proc gplot data=demo3_nh3;`<br>`where hsageu=2 and hsageir>=20 and dmpstat=2;`<br>`plot wtpfex6*tcp / frame;`<br>`title 'NHANES III, adults age 20 years and older';`<br>`run;` | Use the proc gplot procedure to plot the sample weight (wtpfex6, vertical axis) against total serum cholesterol (tcp, horizontal axis) for each observation in the dataset. Use the where statement to select the participants who were age 20 years and older and who had both the home interview and the MEC exam. |

Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:

• Two outliers with serum cholesterol values higher than 600 mg/dl are identified from the plot.
• Neither of these two observations has an extremely large survey weight.
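
The reading of the plot can be checked numerically. The sketch below is an illustration, not part of the original program: it lists the observations above 600 mg/dl with their survey weights, then summarizes the weight distribution for comparison. It assumes that any missing-value codes for `tcp` have already been recoded to SAS missing values in `demo3_nh3`, so that the `tcp>600` condition selects only real measurements.

```sas
/* Sketch: list the observations above 600 mg/dl with their survey weights. */
/* Assumes fill values for tcp were already recoded to missing.             */
proc print data=demo3_nh3 noobs;
  where hsageu=2 and hsageir>=20 and dmpstat=2 and tcp>600;
  var seqn tcp wtpfex6;
run;

/* Distribution of the weights, for judging whether a given weight is extreme. */
proc means data=demo3_nh3 min p50 p95 p99 max maxdec=1;
  where hsageu=2 and hsageir>=20 and dmpstat=2;
  var wtpfex6;
run;
```

Comparing each listed weight with the upper percentiles shows whether an outlier also carries an unusually large weight.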

### Step 3: Identify outliers and compare estimates with outliers deleted against the original estimates with outliers included

In this step you will:

• delete the two outliers identified in the plot above using the SEQN numbers; and
• compare the mean of the new dataset without the outliers against the mean of the original dataset that includes the outliers to check the impact of the outlier observations.

#### Program to Create Dataset Without Outliers and Output Means of Both Datasets

| Statements | Explanation |
|---|---|
| `data temp4_nh3;`<br>`set demo3_nh3;` | Use the data and set statements to create a new dataset (temp4_nh3) from your analytic dataset (demo3_nh3). |
| `if seqn in (2736, 33629) then delete;` | Use the if-then statement to delete the outliers by SEQN. These are the two observations identified in the plot of survey weight against cholesterol; their SEQNs are listed under extreme observations in the proc univariate output. |
| `proc format;`<br>`value race`<br>`1 = 'Non-Hispanic White'`<br>`2 = 'Non-Hispanic Black'`<br>`3 = 'Mexican-American'`<br>`4 = 'Other Race';`<br>`run;` | Use the proc format procedure to give easily understood labels to your race/ethnicity variable values. |
| `proc means data=demo3_nh3 mean stderr maxdec=1;` | Use the proc means procedure to determine the mean and standard error for the dataset with the outliers. |
| `where hsageu=2 and hsageir>=20 and dmpstat=2;` | Use the where statement to select the participants who were age 20 years and older and who had both the home interview and the MEC exam. |
| `var tcp;` | Use the var statement to indicate the variable of interest. |
| `class dmarethn;` | Use the class statement to group the variable of interest by race/ethnicity categories. |
| `weight wtpfex6;` | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for all six years of data is used. |
| `format dmarethn race.;` | Use the format statement to label your race/ethnicity variable with the easily understood labels you defined in the proc format procedure. |
| `proc means data=temp4_nh3 mean stderr maxdec=1;` | Use the proc means procedure to determine the mean and standard error for the dataset without the outliers. |
| `where hsageu=2 and hsageir>=20 and dmpstat=2;` | Use the where statement to select the participants who were age 20 years and older and who had both the home interview and the MEC exam. |
| `var tcp;` | Use the var statement to indicate the variable of interest. |
| `class dmarethn;` | Use the class statement to group the variable of interest by race/ethnicity categories. |
| `weight wtpfex6;` | Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for all six years of data is used. |
| `format dmarethn race.;` | Use the format statement to label your race/ethnicity variable with the easily understood labels you defined in the proc format procedure. |

Highlighted items from comparison of the results with and without outliers:

• In this example, removing the outliers changes the mean cholesterol values of the race/ethnicity subgroups and the overall mean very little. Therefore, you will use the dataset that includes the outliers for your analysis.
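
The means without the outliers can also be obtained without creating a second dataset. The following is a minimal sketch of that alternative, not the method used in the program above: it excludes the two SEQNs in the where statement. It assumes the `race.` format from the proc format step has already been defined in the current session; the title text is illustrative.

```sas
/* Sketch: means with the two outliers excluded in the where statement, */
/* instead of reading from a separate dataset such as temp4_nh3.        */
/* Assumes the race. format has already been defined by proc format.    */
proc means data=demo3_nh3 mean stderr maxdec=1;
  where hsageu=2 and hsageir>=20 and dmpstat=2
        and seqn not in (2736, 33629);   /* drop the two outliers by SEQN */
  var tcp;
  class dmarethn;
  weight wtpfex6;
  format dmarethn race.;
  title 'Mean cholesterol, outliers excluded';
run;
```

Filtering in the where statement leaves `demo3_nh3` unchanged, which suits a one-off sensitivity check. A separate dataset is more convenient when several procedures need the same exclusions.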