Task 3c: How to Identify Outliers and Evaluate Their Impact Using Stata

In this task, you will check for outliers and their potential impact using the following steps:


Step 1: Check Distributions by Running a Univariate Analysis

Before you analyze your data, check the distribution and normality of each continuous variable and identify any outliers.


Program to Plot Distribution of Continuous Variable

Use the summarize command with the detail option to get descriptive statistics, such as the mean, minimum and maximum values, standard deviation, and skewness, for the participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the histogram command with the normal option to graph the continuous variable cholesterol and overlay the normal distribution curve. Use the graph box command to draw a box plot of the continuous variable cholesterol.


summarize lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail
histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, normal
graph save "C:\STATA\tutorial\descriptive\histogram_discriptive.gph", replace
graph box lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line)
graph save "C:\STATA\tutorial\descriptive\box_plot.gph", replace
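The statistics named above can also be read back from the results that summarize leaves behind, which may help when comparing the extremes with the bulk of the distribution. The following is a minimal sketch, not part of the original program; it assumes the same dataset is in memory and writes the weight type out as aweight, since summarize does not accept pweight.

```stata
* Sketch: display selected stored results from summarize, detail.
* Assumes lbxtc, wtmec4yr, ridageyr and ridstatr are in memory.
quietly summarize lbxtc [aweight=wtmec4yr] ///
    if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2, detail

display "Observations used:  " r(N)
display "Mean:               " %8.2f r(mean)
display "Std. deviation:     " %8.2f r(sd)
display "Skewness:           " %8.3f r(skewness)
display "Kurtosis:           " %8.3f r(kurtosis)
display "Min / Max:          " r(min) " / " r(max)
display "1st / 99th pctile:  " r(p1) " / " r(p99)
```

A large gap between the 99th percentile and the maximum (or between the 1st percentile and the minimum) suggests extreme values worth inspecting in the next step.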


Highlighted items from the univariate analysis output:


Step 2: Plot a Graph of Survey Weight Against the Distribution of the Variable

In this example, you will plot the 4-year MEC survey weight (wtmec4yr) against the distribution of the cholesterol variable to determine whether the extreme observations are outliers.


Plot Exam Weight Against Cholesterol

Use the graph twoway scatter command to plot the sample weight (wtmec4yr) against total serum cholesterol (lbxtc) for each observation in the dataset for participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the mlabel option to label each point with its sequence number (seqn) so that the observations with extreme values can be identified in the output.

graph twoway scatter wtmec4yr lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, mlabel(seqn) title("NHANES 1999-2002: adults age 20 years and older")
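Point labels on a dense scatter plot can be hard to read, so it may help to list the most extreme observations alongside the graph. The following is a minimal sketch, not part of the original program; the choice of 10 rows is an arbitrary assumption for illustration.

```stata
* Sketch: list the highest cholesterol values among examined adults
* together with their sequence numbers and MEC weights.
* The cutoff of 10 rows is an arbitrary choice for illustration.
preserve
keep if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2 & !missing(lbxtc)
gsort -lbxtc                       // sort from highest to lowest cholesterol
list seqn lbxtc wtmec4yr in 1/10
restore                            // return to the full dataset
```

An extreme value paired with a large sample weight has more influence on weighted estimates than one with a small weight, which is why both columns are listed.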


Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:


Step 3: Identify Outliers and Compare Estimates with Outliers Deleted Against the Original Estimates with Outliers Included

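In this step you will identify the outliers, create a dataset that excludes them, and compare the mean estimates from the datasets with and without the outliers.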

Program to Create Dataset Without Outliers and Output Means of Both Datasets

Use the label define and label values commands to assign descriptive labels to the values of the race/ethnicity variable (ridreth1).

Use the drop command to delete the outliers using the SEQNs identified in the plot of survey weight against the distribution of the variable. The SEQNs associated with these outliers are labeled on the scatter plot output from the "Plot Exam Weight Against Cholesterol" program.

Use the mean command to estimate the mean and standard error, by race/ethnicity, for both the dataset with the outliers and the dataset without them, for participants who were interviewed and examined in the MEC and who were age 20 years and older. Use the pweight option to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

label define race 1 "Mex American"
label define race 2 "Other Hispanic", add
label define race 3 "NH White", add
label define race 4 "NH Black", add
label define race 5 "Other Race - Including Multi-Racial", add
label values ridreth1 race
drop if seqn==10494 | seqn==13866 | seqn==17821
save C:\Nhanes\Data\exclu_3sps, replace
// Mean total cholesterol without extreme values
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)
// Reload the original dataset, which still includes the outliers
use C:\Nhanes\Data\demo_bp2b, clear
// Mean total cholesterol with extreme values included
mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)
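The two sets of estimates can also be placed side by side without dropping observations or saving a second dataset. The following is a minimal sketch of that alternative, not the original program; the names is_outlier, adult_mec, with_outliers, and without_outliers are hypothetical and chosen only for illustration.

```stata
* Sketch: compare estimates with and without the outliers in one table.
* is_outlier, adult_mec, with_outliers and without_outliers are
* hypothetical names introduced for this example.
use C:\Nhanes\Data\demo_bp2b, clear

* Flag the three SEQNs identified on the scatter plot instead of dropping them
generate byte is_outlier = inlist(seqn, 10494, 13866, 17821)

* Flag examined adults age 20 years and older
generate byte adult_mec = (ridageyr >= 20 & ridageyr < .) & ridstatr == 2

* Estimates with the outliers included
mean lbxtc if adult_mec [pweight=wtmec4yr], over(ridreth1)
estimates store with_outliers

* Estimates with the outliers excluded
mean lbxtc if adult_mec & !is_outlier [pweight=wtmec4yr], over(ridreth1)
estimates store without_outliers

* Means and standard errors from both runs in adjacent columns
estimates table with_outliers without_outliers, b(%9.2f) se(%9.2f)
```

Keeping the flag in the dataset leaves the original observations intact, so the exclusion can be revisited if the outliers turn out to be valid values.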


Highlighted items from comparison of the results with and without outliers:
