## Task 1a: How to Check Frequency Distribution and Normality in SAS

The SAS procedure proc univariate generates descriptive and summary statistics that are useful in describing the characteristics of a distribution. These statistics can also help you decide whether parametric tests (which assume a normal distribution) or non-parametric tests are appropriate for your analysis. As noted in the Clean & Recode Data module, it is advisable to check for extreme weights and outliers before starting any analysis.

### Step 1: Use the univariate procedure to generate descriptive statistics in SAS

Use the SAS procedure, proc univariate, to generate descriptive statistics. The frequency distribution can be presented in table or graphic format. The freq option generates the frequency distribution in tabular form by listing the number of observations for each value of the variable. Due to the large sample size and the possibility of a long list of different values, it is not reasonable to request the freq option for variables that are not nominal or ordinal. The plot option generates the frequency distribution in graphic form (histogram, box, and normal probability plots), and the normal option generates statistics to test the normality of the distribution.
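As a minimal sketch of the freq option described above, the following requests a tabular frequency distribution for a nominal variable. The dataset name (analysis_data) and the gender variable (riagendr) are taken from the examples in this tutorial; choosing riagendr here is only an illustration.

```sas
/* Illustrative only: request a frequency table for a nominal variable.   */
/* analysis_data and riagendr come from the examples in this tutorial.    */
proc univariate data=analysis_data freq;
  var riagendr;   /* few distinct values, so the frequency table stays short */
run;
```

Requesting the same table for a continuous variable such as lbxtc would list every distinct value, which is why the freq option is best reserved for nominal or ordinal variables.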

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Univariate Procedure for Descriptive Statistics

| Statements | Explanation |
| --- | --- |
| `proc sort data=analysis_data;`<br>`by riagendr age;`<br>`run;` | Use the sort procedure to sort the data by the same variables used in the by statement of the univariate procedure. In this example, the data are sorted by gender (riagendr) and age (age). |
| `PROC UNIVARIATE PLOT NORMAL;` | Use the univariate procedure to generate descriptive statistics, which include the number of missing values, mean, standard errors, percentiles, and extreme values. Use the plot option to generate histogram, box, and normal probability plots, and the normal option to generate statistics to test normality. In this example, plots (plot) and normality test statistics (normal) are requested, and the results are generated separately for each combination of the variables on the by statement. |
| `where ridageyr >= 20;` | Use the where statement to select participants 20 years and older. |
| `by riagendr age;` | The by statement defines the groups (all combinations of the variables listed on the by statement) for which separate descriptive statistics are produced. This statement should match the by statement in the sort procedure preceding it. |
| `VAR lbxtc;` | Use the var statement to indicate the variable(s) for which descriptive measures are requested. In this example, the total cholesterol variable (lbxtc) is used. |
| `FREQ wtmec4yr;`<br>`run;` | Using the freq statement with the appropriate sample weight yields an estimate of the standard deviation whose denominator is the estimated population size. In this example, the 4-year examination weight (wtmec4yr) is used. |

WARNING

The freq statement, with the appropriate sample weight, yields an estimate of the standard deviation whose denominator is an estimate of the population size, i.e., the sum of the sample weights. Using the weight statement instead of the freq statement yields an estimate of the standard error whose denominator is the sample size.
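For reference, the statements explained in Step 1 can be assembled into one program, sketched below. The explicit data=analysis_data on the proc univariate statement is an addition for clarity (the tutorial statement relies on the most recently created dataset), and age is assumed to be an age-group variable already present in analysis_data.

```sas
/* Sort first: BY-group processing in PROC UNIVARIATE expects sorted input. */
proc sort data=analysis_data;
  by riagendr age;
run;

/* Descriptive statistics, plots, and normality tests for total cholesterol. */
/* data=analysis_data is written out explicitly here (an assumption).        */
proc univariate data=analysis_data plot normal;
  where ridageyr >= 20;   /* adults 20 years and older            */
  by riagendr age;        /* one set of output per gender and age group */
  var lbxtc;              /* total cholesterol                    */
  freq wtmec4yr;          /* 4-year examination weight used as a frequency */
run;
```

Note that SAS treats the freq variable as a count and truncates non-integer values, so results may differ slightly from an analysis that uses the exact weights.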

### Step 2: Check output of descriptive statistics

The univariate procedure generates extensive descriptive statistics, including moments, percentiles, extremes, missing values, basic statistical measures, and tests for location. Below is a snapshot from the extensive output of the SAS program which shows the result of using the plot and normal options.

• The output is arranged by gender and age group so you can see the results for each combination.
• The standard deviation is a measure of the deviation of the observations from the mean.
• Kurtosis is a measure of the peakedness of the distribution. For SAS, the kurtosis of a normally distributed random variable is 0. A kurtosis greater than 0, as in this example, indicates excess values close to the mean and at the tails of the distribution.
• Skewness is a measure of the departure of the distribution of a random variable from symmetry. The skewness of a normally distributed random variable is 0.
• The standard error of the mean is not correctly calculated, because proc univariate does not account for the complex survey design, and it will not be used in this example.
• The output also contains the five lowest and highest values, which are useful for review.
• The histogram for a normally distributed random variable is symmetric and bell-shaped. For variables based on data collected in a survey, such as NHANES 1999-2002, the distribution will deviate at least slightly from normality. Note the one outlier on the upper tail of the distribution.
• The variable of interest is plotted against a normally distributed random variable. The resulting plot is called a Q-Q plot. If the variable of interest is normally distributed, a straight line intersecting the y axis at a 45 degree angle would be obtained. For this example, note the outliers in the upper tail of this distribution.
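The histogram and Q-Q plot described above can also be requested through the dedicated histogram and qqplot statements of proc univariate, rather than through the plot option. This is an alternative to the tutorial's approach, not what the tutorial itself uses; the sketch assumes analysis_data has already been sorted by riagendr and age as in Step 1.

```sas
/* Sketch: graphics via HISTOGRAM and QQPLOT statements (not the plot option). */
/* Assumes analysis_data is already sorted by riagendr and age.                */
proc univariate data=analysis_data;
  where ridageyr >= 20;
  by riagendr age;
  var lbxtc;
  freq wtmec4yr;
  histogram lbxtc / normal;                  /* overlay a fitted normal curve     */
  qqplot lbxtc / normal(mu=est sigma=est);   /* reference line from estimated mean and SD */
run;
```

Points that fall away from the reference line in the upper tail correspond to the outliers noted in the bullets above.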

### Step 3: Request selective statistics and output results to SAS dataset

In some instances, you may not need all of the statistics generated by proc univariate. You can use proc univariate to select a few descriptive statistics and output the results to a SAS dataset to view.

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Univariate Procedure for Displaying Selected Statistics

| Statements | Explanation |
| --- | --- |
| `proc sort data=analysis_data;`<br>`by riagendr age;`<br>`run;` | Use the sort procedure to sort the data by the same variables that will be used in the by statement of the univariate procedure. In this example, the data are sorted by gender (riagendr) and age (age). |
| `PROC UNIVARIATE NOPRINT;` | Use the univariate procedure to generate descriptive statistics. Use the noprint option to suppress the detailed default descriptive statistics. |
| `where ridageyr >= 20;` | Use the where statement to select participants 20 years and older. |
| `by riagendr age;` | The by statement defines the groups (all combinations of the variables listed on the by statement) for which separate descriptive statistics are produced. This statement should match the by statement in the sort procedure preceding it. |
| `VAR lbxtc;` | Use the var statement to indicate the variable(s) for which descriptive measures are requested. In this example, the total cholesterol variable (lbxtc) is used. |
| `FREQ wtmec4yr;` | Using the freq statement with the appropriate sample weight yields an estimate of the standard deviation whose denominator is the estimated population size. In this example, the 4-year examination weight (wtmec4yr) is used. |
| `OUTPUT out=SASdataset mean=mean Q1=p_25 median=median Q3=p_75;`<br>`run;` | Use the output statement to write the results to a new SAS dataset, SASdataset, which will contain the statistics of interest. Each requested statistic is stored under the name given after its equal sign. In this example, the mean and the 25th, 50th, and 75th percentiles are requested. (For a complete list of statistics that can be requested, see the proc univariate entry in the SAS manual.) |
| `proc print DATA=SASdataset;`<br>`run;` | Use proc print to view the results in the new SAS dataset, SASdataset. |

WARNING

The freq statement, with the appropriate sample weight, yields an estimate of the standard deviation whose denominator is an estimate of the population size, i.e., the sum of the sample weights. Using the weight statement instead of the freq statement yields an estimate of the standard error whose denominator is the sample size.
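Put together, the Step 3 statements form the program sketched below. As before, the explicit data=analysis_data on proc univariate is an addition for clarity, and age is assumed to be an age-group variable already present in analysis_data.

```sas
/* Sort by the BY variables before calling PROC UNIVARIATE. */
proc sort data=analysis_data;
  by riagendr age;
run;

/* Suppress the default printed output and save selected statistics. */
proc univariate data=analysis_data noprint;
  where ridageyr >= 20;
  by riagendr age;
  var lbxtc;
  freq wtmec4yr;
  /* One row per BY group: mean, 25th percentile, median, 75th percentile. */
  output out=SASdataset mean=mean q1=p_25 median=median q3=p_75;
run;

/* View the saved statistics. */
proc print data=SASdataset;
run;
```

The output dataset holds one observation per combination of riagendr and age, along with the four requested statistics.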

### Step 4: Check output of selective statistics

The output is sent to a SAS dataset, which is printed to view. See results below. Note that the new SAS dataset contains only the statistics requested on the output statement.

• Because this example used the noprint option, there is only one page of output with the requested statistics -- mean, 25th percentile, median, and 75th percentile.
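If the printed listing is hard to read, a slightly tidier version of the proc print step might look like the sketch below. The column order and the 8.1 display format are illustrative choices, not part of the tutorial; the variable names are those created by the output statement in Step 3.

```sas
/* Illustrative formatting of the saved statistics (choices are assumptions). */
proc print data=SASdataset noobs;           /* noobs drops the observation-number column */
  var riagendr age mean p_25 median p_75;   /* BY variables first, then the statistics   */
  format mean p_25 median p_75 8.1;         /* show one decimal place                    */
run;
```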