Task 2b: How to Generate Population Counts in SAS Survey Procedures

In this example, you will use SAS Survey Procedures to combine age subgroups and generate population estimates for high blood pressure (HBP) by sex and race/ethnicity for persons 20 years and older.  The method outlined in this module uses a SAS data file with CPS population totals. The process for combining subgroups and calculating population estimates is then automated using the code outlined below. 

Alternatively, you can use the CPS population totals located on the respective survey cycle NHANES web page (referred to in Key Concepts), plus the results from a proc surveymeans procedure and manually calculate population estimates within a spreadsheet.  If you choose this option, you will need to define the age, race/ethnicity and gender subgroups of interest and calculate population totals within the spreadsheet on your own.

 

 

Step 1: Calculate Prevalence of Health Condition of Interest

The SAS Survey Procedure, proc surveymeans, is used to generate population estimates.  The general program for obtaining population estimates is outlined in the 3-step process below:

In the first step, you will calculate the prevalence of the health condition (here, HBP) within the sub-domains of interest.  Use the appropriate sample weights, especially when combining survey cycles.

 

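The health outcome must be coded as a dichotomous variable that takes the value 0 when the health condition is absent and 100 when it is present, so that its mean is the prevalence expressed as a percent. In this example, the HBP variable (hbp) is recoded to hbpx.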

 

/* assumes hbp is coded 1 = HBP present, 2 = HBP absent */

hbpx=. ;

if hbp= 1 then hbpx= 100 ;

else if hbp= 2 then hbpx= 0 ;
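The 0/100 coding is convenient because the mean of the recoded variable is the percent with the condition. The following is a minimal sketch on made-up data: the dataset name toy_hbp and its five values are hypothetical, and the coding of hbp as 1 = present, 2 = absent is an assumption.

```sas
/* Hypothetical toy data: hbp assumed coded 1 = HBP present, 2 = HBP absent */
data toy_hbp;
  input hbp @@;
  hbpx = .;
  if hbp = 1 then hbpx = 100;
  else if hbp = 2 then hbpx = 0;
  datalines;
1 2 2 1 2
;
run;

/* The unweighted mean of hbpx is the percent with HBP in the toy data */
proc means data=toy_hbp n mean;
  var hbpx;
run;
```

Two of the five toy records have the condition, so the mean of hbpx is 40, that is, 40 percent.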

 

A new variable (sel) will be created to reflect the study subpopulation of interest (age 20 years and older) used in the domain statement of the proc surveymeans procedure.

 

sel=. ;

If ridageyr ge 20 then sel=1;

Else sel=2;

 

Population estimates will not be age standardized, so the estimates reflect the true population sampled. The results will be output to a SAS data file using the ods output statement below.  

 

IMPORTANT NOTE

These programs use variable formats listed in the Tutorial Formats page. You may need to format the variables in your dataset the same way to reproduce results presented in the tutorial.

SAS Survey Procedure for Generating Prevalence Rates
Statements Explanation
proc surveymeans data=ANALYSIS_DATA nobs mean stderr clm;

Use the proc surveymeans procedure to obtain number of observations, mean, standard error and confidence intervals.

strata sdmvstra; 

Use the strata statement to define the strata variable (sdmvstra).

cluster sdmvpsu; 

Use the cluster statement to define the PSU variable (sdmvpsu).

class riagendr race;

Use the class statement to specify the discrete variables used to select the subpopulations of interest (i.e., gender [riagendr] and race [race]).

var hbpx; 

 

Use the var statement to specify which variable(s) will be analyzed. In this example, the HBP variable (hbpx) is used.

weight wtmec4yr;

Use the weight statement to account for the unequal probability of sampling and non-response. In this example, the MEC weight for 4 years of data (wtmec4yr) is used.

domain sel sel*riagendr sel*race sel*riagendr*race;

Use the domain statement to specify the subpopulations of interest.

ods OUTPUT domain(match_all)=unadj;

run ;

Use the ods statement to output SAS datasets of estimates for the subdomains listed on the domain statement.  This set of commands will output four datasets, one for each domain specified in the domain statement above (unadj for sel, unadj1 for sel*riagendr, unadj2 for sel*race, and unadj3 for sel*riagendr*race).
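The statements in this table can be tried out as one program on a small made-up dataset. The sketch below is only an illustration under stated assumptions: toy_analysis, its twelve records, and the output name unadj_toy are hypothetical, hbp is assumed to be coded 1 = present and 2 = absent, and race and the class statement are left out to keep the toy data small.

```sas
/* Hypothetical toy data: 2 strata x 2 PSUs; all values are made up */
data toy_analysis;
  input sdmvstra sdmvpsu wtmec4yr riagendr ridageyr hbp;
  /* 0/100 outcome; hbp assumed coded 1 = present, 2 = absent */
  hbpx = .;
  if hbp = 1 then hbpx = 100;
  else if hbp = 2 then hbpx = 0;
  /* subpopulation flag: 1 = age 20 and older, 2 = everyone else */
  if ridageyr ge 20 then sel = 1;
  else sel = 2;
  datalines;
1 1 1500 1 34 1
1 1 1200 2 51 2
1 1 1800 1 12 2
1 2 1400 2 67 1
1 2 1600 1 45 2
1 2 1100 2 29 2
2 1 1300 1 72 1
2 1 1700 2 38 2
2 1 1250 1 23 2
2 2 1450 2 58 1
2 2 1350 1 16 2
2 2 1550 2 41 2
;
run;

proc surveymeans data=toy_analysis nobs mean stderr clm;
  strata sdmvstra;              /* design strata */
  cluster sdmvpsu;              /* primary sampling units */
  var hbpx;                     /* 0/100 outcome */
  weight wtmec4yr;              /* sample weight */
  domain sel sel*riagendr;      /* overall and by sex */
  ods output domain(match_all)=unadj_toy;
run;
```

With match_all, SAS should write one dataset per domain request (unadj_toy and unadj_toy1 here), following the naming pattern described above. Keeping all records and selecting the subpopulation through the domain statement, rather than deleting the records with sel=2, leaves the full design available for variance estimation.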

 

Format Data from SAS Output Dataset
Statements Explanation
data bp_stats;
set unadj unadj1 unadj2 unadj3;

Use the data statement to create a new dataset (bp_stats) by stacking the SAS datasets created previously (unadj, unadj1, unadj2, and unadj3).

if sel= 1 ;

if race= . then race= 0 ;

if riagendr= . then riagendr= 0 ;

Use the subsetting if statement to keep only the subpopulation of interest (sel=1). Use if-then statements to recode missing values of race and riagendr to 0, which marks rows that are totals across all race or sex groups.

ll=round(lowerclmean,.01 );

ul=round(upperclmean,.01 );

Use these statements to round and rename the lower limit (lowerclmean to ll) and upper limit (upperclmean to ul) of the Wald 95% confidence intervals.

percent=round(mean,.01 );

sepercent=round(stderr,.01 );

run ;

Use these statements to round the mean and standard error estimates and rename them to percent and sepercent, respectively.
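The recode of missing values to 0 matters because the stacked datasets do not all carry the same columns. A minimal sketch of that behavior, using two hypothetical stand-in datasets (unadj_a, unadj_b) with made-up estimates rather than real procedure output:

```sas
/* Hypothetical stand-in for the overall domain output: no riagendr column */
data unadj_a;
  input sel mean;
  datalines;
1 28.9
;
run;

/* Hypothetical stand-in for the sel*riagendr domain output */
data unadj_b;
  input sel riagendr mean;
  datalines;
1 1 27.4
1 2 30.2
;
run;

data stacked;
  set unadj_a unadj_b;
  /* rows from unadj_a have no riagendr, so it is missing after stacking */
  if riagendr = . then riagendr = 0;   /* 0 = both sexes combined */
run;

proc print data=stacked noobs;
run;
```

After the recode, the overall row carries riagendr=0, which gives it a non-missing key to match against the population totals later on.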

 

 

Step 2: Combine CPS Population Totals

In Step 2, you will combine appropriate CPS population totals across survey cycles AND across years of age to reflect the subpopulation of interest (i.e., those 20 and older). 

In this module, CPS population totals are supplied as a SAS dataset with values for: age (CTUTAGE) ranging from 0 to 85+ years ; gender (CTUTGNDR); race/ethnicity (CTUTRACE), where 1= non-Hispanic white, 2=non-Hispanic black, 3=Mexican American and 4=other; race/ethnicity (CTUTRETH), where 1=Mexican American, 2=non-Hispanic other, 3=non-Hispanic white, 4=non-Hispanic black, 5=other Hispanic;  ethnicity (CTUTHISP) where 1=Hispanic and 2=non-Hispanic; survey cycle (CTUTSRVY); and the population total (CTUTPOPT).  Appropriate age, race/ethnicity, and gender groups were created in a previous step. 

The proc means procedure for simple random samples in SAS  will be used to calculate CPS population totals for the sub-domains of interest (i.e., sex and race) for the subpopulation of interest (age 20 and older).  In this case, no sample design factors or weights need to be used.   Subgroup totals are output to another SAS data set (saspt9902) for use in Step 3.    

 

SAS Procedure to Calculate CPS Population Totals
Statements Explanation
Proc means data =nh.cpstot9902; where ctutage >= 20 ;

Use the proc means procedure and the where statement to calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

output out =d1 n = n sum = sum ;
run ;

Use the output statement to create a dataset (d1) for the population totals (sum).

proc sort data =nh.cpstot9902; by ctutgndr;

run ;

Use the proc sort procedure to sort the dataset by sex.

proc means data =nh.cpstot9902; where ctutage >= 20 ;

Use the proc means procedure and the where statement to  calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

by ctutgndr;

Use the by statement to generate population totals by sex (ctutgndr).

output out =d2 n = n sum = sum ;

run ;

Use the output statement to create a dataset (d2) for the population totals (sum).

proc sort data =nh.cpstot9902; by ctutrace; run ;

Use the proc sort procedure to sort the dataset by race.

proc means data =nh.cpstot9902; where ctutage >= 20 ;

Use the proc means procedure and the where statement to  calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

by ctutrace;

Use the by statement to generate population totals by race.

output out =d3 n = n sum = sum ;

run ;

Use the output statement to create a dataset (d3) for the population totals (sum).

proc sort data =nh.cpstot9902; by ctutgndr ctutrace; run ;

Use the proc sort procedure to sort the dataset by sex and race.

proc means data =nh.cpstot9902;
where ctutage >= 20 ;

Use the proc means procedure and the where statement to calculate totals for persons 20 years of age and older.

var ctutpopt;

Use the var statement to select the variable of interest (ctutpopt).

by ctutgndr ctutrace;

Use the by statement to generate population totals by sex and race.

output out =d4 n = n sum = sum ;

run ;

Use the output statement to create a dataset (d4) for the population totals (sum).

data saspt9902;

set d1 d2 d3 d4;

if ctutrace= . then ctutrace= 0 ;

if ctutgndr= . then ctutgndr= 0 ;

run ;

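This data step consolidates the datasets created above into a single dataset (saspt9902) for use in the next step, and recodes missing values of race and sex to 0 so that the total rows match the coding used in bp_stats.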
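As an alternative to the four sort-and-summarize passes above, proc means can produce the same four sets of totals in one step with the class and types statements. This is not the tutorial's method, only a sketch of another route; the dataset toy_cps and its values are hypothetical, while the variable names follow the CPS totals dataset described earlier.

```sas
/* Hypothetical toy CPS totals; all values are made up */
data toy_cps;
  input ctutage ctutgndr ctutrace ctutpopt;
  datalines;
19 1 1 100
20 1 1 200
20 1 2 50
45 2 1 300
45 2 2 70
85 2 1 30
;
run;

proc means data=toy_cps noprint;
  where ctutage >= 20;
  class ctutgndr ctutrace;
  var ctutpopt;
  /* (): overall total; then by sex, by race, and by sex*race */
  types () ctutgndr ctutrace ctutgndr*ctutrace;
  output out=toy_tot n=n sum=sum;
run;

data toy_tot;
  set toy_tot;
  /* marginal rows have missing class values; recode them to 0 */
  if ctutgndr = . then ctutgndr = 0;
  if ctutrace = . then ctutrace = 0;
run;

proc print data=toy_tot noobs;
  var ctutgndr ctutrace n sum;
run;
```

In the toy data the record for age 19 is excluded by the where statement, so the overall total (ctutgndr=0, ctutrace=0) is 650. No proc sort is needed because class, unlike by, does not require sorted input.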

 

 

Step 3: Multiply Prevalence Estimates by CPS Population Totals

In this last step, you will multiply the prevalence estimates by the corresponding CPS population totals to estimate the total number of persons in the non-institutionalized U.S. population with HBP.

 

Note that the datasets produced in Step 1 and Step 2 will be sorted on the sub-domain variables and merged.  The new dataset will be used in the final SAS program.  The percent prevalence estimates and their lower and upper 95% confidence limits will each be multiplied by the corresponding population total for that subgroup.  Results will be rounded, formatted, and printed in SAS.

Calculate Population Estimates from SAS Output Dataset
Statements Explanation
proc sort data =bp_stats; by riagendr race ; run ;

proc sort data =saspt9902(rename=(ctutgndr=riagendr ctutrace=race)); by riagendr race ; run ;

 

Use the proc sort procedure to sort the two datasets by sex and race. In the second dataset, rename the CPS total race and gender (ctutrace and ctutgndr) variables to match the variable names used in the original dataset.

data comb;

merge bp_stats(in =a) saspt9902 ;

by riagendr race ;

if a ;

Use the data statement to create a new dataset (comb) by merging the SAS datasets created previously (bp_stats and saspt9902). The subsetting if statement keeps a merged record only when its combination of sex and race exists in bp_stats (in=a).

 

popmean=(percent/100 )*sum ;

popl=ll/100 *sum ;

popu=ul/100 *sum  ;

Use these statements to calculate the population counts by applying the population totals (sum) to the prevalence estimate (percent) and the 95% confidence interval limits.

poplr=round(popl,1000 );

popur=round(popu,1000 );

popmeanr=round(popmean,1000 );

totalr=round(sum,1000 ) ;

Use these statements to round and format the estimates to the nearest thousand.

proc print noobs split= '/' double;

var  riagendr race percent sepercent  ll ul n

totalr popmeanr poplr popur ;

format race racefmt.   riagendr sexfmt.   n 5.0 percent 5.2 sepercent 5.2

ll 4.2 ul 4.2   ;

label

percent='%/with/high/bp'

n='Num/bp/status'

sepercent='Std/error'

ll='Lower/95%/Wald/CI'

ul='Upper/95%/Wald/CI'

popmeanr='Pop/Est/US/with/high/bp'

totalr='Pop/total/US'

poplr='Pop Est/Lower/95%/WALD/CI'

popur='Pop Est/Upper/95%/WALD/CI' ;

title1 'Prevalence of persons with high Bp - US, 1999-2002' ;

title2 'Percent and population estimates of number with high Bp-Wald CI' ;

run ;

Use the proc print procedure to print the variables of interest.
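The merge and the multiplication can be checked by hand on a small example. The sketch below uses hypothetical datasets (toy_stats, toy_totals, toy_comb) with made-up prevalence estimates and population totals; only the variable names and the arithmetic follow the steps above.

```sas
/* Hypothetical prevalence estimates (percent scale) for two subgroups */
data toy_stats;
  input riagendr race percent ll ul;
  datalines;
0 0 25.00 22.00 28.00
1 0 24.00 20.00 28.00
;
run;

/* Hypothetical population totals for the same subgroups */
data toy_totals;
  input riagendr race sum;
  datalines;
0 0 200000000
1 0 96000000
;
run;

proc sort data=toy_stats;  by riagendr race; run;
proc sort data=toy_totals; by riagendr race; run;

data toy_comb;
  merge toy_stats(in=a) toy_totals;
  by riagendr race;
  if a;                                  /* keep subgroups present in toy_stats */
  popmeanr = round((percent/100)*sum, 1000);
  poplr    = round((ll/100)*sum, 1000);
  popur    = round((ul/100)*sum, 1000);
run;

proc print data=toy_comb noobs;
  format sum popmeanr poplr popur comma15.;
run;
```

For the overall row, 25 percent of 200,000,000 gives a population estimate of 50,000,000, with limits of 44,000,000 and 56,000,000 from the 22 and 28 percent confidence bounds.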

Highlights from the output include:

 
