In this task, you will check for outliers and their potential impact using the following steps:

- Run a univariate analysis to obtain all default descriptive statistics.
- Plot survey weight against the distribution of the variable.
- Identify outliers and compare the outlier-deleted estimates with the original estimates that include the outliers.

Before you analyze your data, it is very important that you **check
the distribution and normality** of the data and **identify outliers** for
continuous variables.

Use the *summarize*
command with the *detail* option to get descriptive
statistics, such as the mean, minimum and maximum values,
standard deviation, and skewness, for the participants who
were interviewed and examined in the MEC and who were age 20
years and older. Use the *histogram* command with the
*normal* option to graph the continuous variable
cholesterol and overlay the normal distribution curve. Use the
*graph box* command to draw a box plot of the
continuous variable cholesterol.

summarize lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, detail

histogram lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, normal

graph save "C:\STATA\tutorial\descriptive\histogram_discriptive.gph", replace

graph box lbxtc [w=wtmec4yr] if (ridageyr >=20 & ridageyr <.) & ridstatr==2, medtype(line)

graph save "C:\STATA\tutorial\descriptive\box_plot.gph", replace

Highlighted items from the univariate analysis output:

- In this example, five outlier values with
serum cholesterol values over 475 mg/dl are identified in the
distribution.
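
To see which records sit behind those extreme values, one option is to list them directly. The sketch below assumes the tutorial dataset is still in memory and reuses the 475 mg/dl cutoff from the text. The `lbxtc < .` condition is needed because Stata treats missing values as larger than any number. The `///` line continuation only works inside a do-file.

```stata
* Sketch: list the records above the 475 mg/dl cutoff mentioned in the text.
* Assumes the tutorial dataset (with seqn, lbxtc, wtmec4yr) is already loaded.
list seqn lbxtc wtmec4yr ///
    if lbxtc > 475 & lbxtc < . ///
    & (ridageyr >= 20 & ridageyr < .) & ridstatr == 2
```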


In this example, you will plot the 4-year MEC survey weight (*wtmec4yr*)
against the distribution of the cholesterol variable to
**determine whether the extreme observations are outliers**.

Use the *graph
twoway scatter* command to plot the total serum
cholesterol (*lbxtc*) by the corresponding sample weight for
each observation in the dataset for participants who were
interviewed and examined in the MEC and who were age 20
years and older. Use the *mlabel* option to label the points with their
sequence numbers so that the extreme values can be identified in the output.

graph twoway scatter wtmec4yr lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2, mlabel(seqn) title("NHANES 1999-2002: adults age 20 years and older")

Highlighted items from plotting the survey weight against the distribution of the cholesterol variable:

- Three outliers with serum cholesterol values higher than 600 mg/dl are identified from the plot.
- None of these three observations has an extremely large survey weight.
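
One way to back up the second point with numbers, rather than by eye, is to set the weights of the flagged records beside the weight distribution of the whole analysis sample. This is only a sketch: the 600 mg/dl cutoff comes from the plot description above, and the choice of the 95th percentile as a comparison point is an assumption.

```stata
* Weights of the records with cholesterol above 600 mg/dl
list seqn lbxtc wtmec4yr if lbxtc > 600 & lbxtc < . & (ridageyr >= 20 & ridageyr < .) & ridstatr == 2

* Distribution of the MEC weight in the same analysis sample
summarize wtmec4yr if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2, detail
display "95th percentile of weight: " r(p95) "   maximum weight: " r(max)
```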

In this step you will:

- **delete the three outliers** identified in the plot above using the SEQN numbers; and
- **compare the mean** of the new dataset without the outliers against the mean of the original dataset that includes the outliers to check the impact of the outlier observations.

Use the *label*
commands to define value labels for the race/ethnicity
variable and attach them to it.

Use the *drop* command to delete the
outliers using the SEQNs previously identified in the plot of survey weight
against the distribution of the variable. The SEQNs associated with these outliers
are the point labels on that scatter plot of exam weight against
cholesterol.

Use the *mean* command to determine
the mean and standard error for both the dataset with
the outliers and the dataset without outliers by race/ethnicity for participants
who were interviewed and examined in the MEC and who were age 20 years and
older. Use the *pweight* option to account for the unequal probability of
sampling and non-response. In this example, the MEC weight for 4 years of
data is used.

label define race 1 "Mex American"

label define race 2 "Other Hispanic", add

label define race 3 "NH White", add

label define race 4 "NH Black", add

label define race 5 "Other Race - Including Multi-Racial", add

label values ridreth1 race

drop if seqn==10494 | seqn==13866 | seqn==17821

save C:\Nhanes\Data\exclu_3sps, replace

// mean total cholesterol without extreme values

mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)

// Mean of serum total cholesterol - including outliers

use C:\Nhanes\Data\demo_bp2b, clear

// mean total cholesterol with extreme values included

mean lbxtc if (ridageyr >=20 & ridageyr <.) & ridstatr==2 [pweight=wtmec4yr], over(ridreth1)

Highlighted items from comparison of the results with and without outliers:

- In this example, the outliers do not significantly affect mean cholesterol values of the race/ethnicity subgroups or the overall mean. Therefore, you will use the dataset with the outliers for your analysis.
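
An alternative to dropping records and reloading the original file is to flag the outliers and run both estimates on one dataset. The sketch below is not the tutorial's own procedure. The variable name `outlier` and the stored-estimate names `with_outliers` and `without_outliers` are made up for illustration, and the value labels are left out because they do not affect the estimates.

```stata
use C:\Nhanes\Data\demo_bp2b, clear

* Flag the three SEQNs identified in the scatter plot (1 = outlier, 0 = not)
generate byte outlier = inlist(seqn, 10494, 13866, 17821)

* Means including the outliers
mean lbxtc if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2 [pweight=wtmec4yr], over(ridreth1)
estimates store with_outliers

* Means excluding the outliers
mean lbxtc if (ridageyr >= 20 & ridageyr < .) & ridstatr == 2 & !outlier [pweight=wtmec4yr], over(ridreth1)
estimates store without_outliers

* Side-by-side means and standard errors
estimates table with_outliers without_outliers, se
```

Because nothing is deleted, the original dataset never needs to be saved under a new name or reloaded.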