## Task 4: How to Create New Variables

This task reviews how to recode or derive new variables so they are appropriate for your analytic needs and how to check recoded variables.

### Step 1: Recode or Derive New Variables

Creating new variables, by recoding or deriving, is an important step when preparing your analytical dataset.

The sample code below shows how to recode and derive new variables in multiple scenarios, using SAS data steps and procedures. The first and third snippets of code are taken from the “Food Sources” program; the second and fourth are from the “Supplement” program.

#### Sample Code

*---------------------------------------------------------------;
* Create a new variable called FOODGRP to group the data        ;
* into nine major food categories.  This classification is      ;
* based on the first digit of the actual food code (DR1IFDCD).  ;
* By dividing the 8-digit food codes by 10000000 and using the  ;
* INT function, the first digit of the food code is assigned    ;
* to the variable FOODGRP.  This is only one of the ways to     ;
* group foods using SAS code.                                   ;
*---------------------------------------------------------------;

data FDSRC;
set FDSRC;
FOODGRP=int(DR1IFDCD/10000000 );

#### Sample Code

*-----------------------------------------------------;
* Create a new variable in which age is categorized   ;
* into groups (AGEGP).                                ;
*-----------------------------------------------------;

data DEMOOST;
set DEMOOST;
by SEQN;

if ( 20 <= RIDAGEYR <= 39 ) then AGEGRP= 1 ;
else if ( 40 <= RIDAGEYR <= 59 ) then AGEGRP= 2 ;
else if (RIDAGEYR >= 60 ) then AGEGRP= 3 ;

#### Sample Code

*------------------------------------------------------------------;
* Create a new variable called WTD_CALC, which is the amount       ;
* of calcium in a particular food multiplied by the sample         ;
* weight of the individual who consumed that food.  This is        ;
* one of the variables used to determine food sources of calcium.  ;
*                                                                  ;
* The WTDRD1 variable must be divided by 2 since two waves of the  ;
* survey are being combined.                                       ;
*------------------------------------------------------------------;

data FDSRC;
set FDSRC;
WTD_CALC=DR1ICALC*WTDRD1/2 ;

Sometimes, you may need to create new variables through a complex process that involves both recoding and derivations and that uses both SAS data steps and SAS procedures.  This next example illustrates this complex process by showing how to derive the amount of calcium from all dietary supplements.

#### Sample Code

*------------------------------------------------------------------;
* Once the supplement files are merged, the code below can be      ;
* used to output only supplements with calcium as an ingredient.   ;
* The following list shows the name of the supplement as well      ;
* as the corresponding ingredient code (DSDINGID)                  ;
*                                                                  ;
* Calcium: DSDINGID = 10000070                                     ;
* Elemental Calcium:  DSDINGID = 10000507                          ;
* Oyster Shell Calcium: DSDINGID = 10001483                        ;
* Calcium carbonate (40% calcium): DSDINGID = 10000611             ;
* Calcium citrate (21% calcium): DSDINGID = 10001394               ;
* Calcium d-glucarate (12.5% calcium): DSDINGID = 10001144         ;
* Calcium gluconate (9% calcium): DSDINGID = 10000584              ;
* Calcium hypophosphite (23.6% calcium): DSDINGID = 10002193       ;
* Calcium pantothenate (8.4% calcium): DSDINGID = 10000437         ;
* D-calcium pantothenate (8.4% calcium): DSDINGID = 10000521       ;
* Calcium phosphate (30% calcium): DSDINGID = 10000795             ;
* Calcium pyruvate (18.7% calcium): DSDINGID = 10000920            ;
* Calcium HMB (14% calcium): DSDINGID = 10002190                   ;
*------------------------------------------------------------------;

data CALCIUM;
set ALLSUP;
if
DSDINGID in (10000070 ) or
DSDINGID in (10000507 ) or
DSDINGID in (10001483 ) or
DSDINGID in (10000611 ) or
DSDINGID in (10001394 ) or
DSDINGID in (10001144 ) or
DSDINGID in (10000584 ) or
DSDINGID in (10002193 ) or
DSDINGID in (10000437 ) or
DSDINGID in (10000521 ) or
DSDINGID in (10000795 ) or
DSDINGID in (10000920 ) or
DSDINGID in (10002190 )
then output ;

* Recode in a Data step
*-------------------------------------------------------------------;
* Convert the calcium compounds to their elemental amount - use     ;
* the percentage of calcium listed next to each of the compounds    ;
* above as the conversion factor.  Note that this information can   ;
* be found in the documentation for the supplement data.            ;
*-------------------------------------------------------------------;

data dosage;
set calcium;
if DSDINGID = 10000611 then PROP_SUPP_CALC = DSDQTY/ 100 * 40 ;
if DSDINGID = 10001394 then PROP_SUPP_CALC = DSDQTY/ 100 * 21 ;
if DSDINGID = 10001144 then PROP_SUPP_CALC = DSDQTY/ 100 * 12.5 ;
if DSDINGID = 10000584 then PROP_SUPP_CALC = DSDQTY/ 100 * 9 ;
if DSDINGID = 10002193 then PROP_SUPP_CALC = DSDQTY/ 100 * 23.6 ;
if DSDINGID = 10000437 then PROP_SUPP_CALC = DSDQTY/ 100 * 8.4 ;
if DSDINGID = 10000521 then PROP_SUPP_CALC = DSDQTY/ 100 * 8.4 ;
if DSDINGID = 10000795 then PROP_SUPP_CALC = DSDQTY/ 100 * 30 ;
if DSDINGID = 10000920 then PROP_SUPP_CALC = DSDQTY/ 100 * 18.7 ;
if DSDINGID = 10002190 then PROP_SUPP_CALC = DSDQTY/ 100 * 14 ;
run ;

*Derive in a Data step
*-----------------------------------------------------------;
* Use serving amount and intake information to create new   ;
* variables that will be used to calculate the amount of    ;
* calcium intake per day                                   ;
*                                                           ;
* For purposes of this analysis, data is imputed with the   ;
* assumption that when the reported amount of a supplement  ;
* taken is missing (DSD122Q), then the label serving size   ;
* (DSDSERVQ) is equal to the serving size that was consumed ;
* (SUPP_CALC).                                              ;
*                                                           ;
* Determine the total amount of calcium per supplement      ;
* (THIS_SUPP_CALC) by multiplying the serving size by the   ;
* quantity of the supplement taken (DSDQTY)                 ;
*                                                           ;
* Convert units of total calcium per supplement from grams  ;
* to milligrams. It is important to check the units to make ;
* sure they are consistent throughout the data.             ;
*                                                           ;
* Determine the average amount of calcium consumed per      ;
* supplement day (DAILYAVG) by dividing the total amount    ;
* of calcium per supplement by 30.                          ;
*-----------------------------------------------------------;

data CALCIUM1;
set DOSAGE;
if DSD122Q= 999999 then DSD122Q= . ;
if DSD103= 9999 then DSD103= . ;
SUPP_CALC = DSD122Q / DSDSERVQ;
if DSD122Q = . then SUPP_CALC = DSDSERVQ;
if DSDINGID in ( 10000070 10000507 10001483 )
then THIS_SUPP_CALC = SUPP_CALC*DSDQTY;
if DSDINGID in ( 10000611 10001394 10001144 10000584 10002193
10000437 10000521 10000795 10000920 10002190 )
then THIS_SUPP_CALC = SUPP_CALC*PROP_SUPP_CALC;
if (DSDUNIT= 5 ) then THIS_SUPP_CALC_MG = THIS_SUPP_CALC * 1000 ;
else if (DSDUNIT= 1 ) then THIS_SUPP_CALC_MG = THIS_SUPP_CALC;
DAILYAVG  = THIS_SUPP_CALC_MG * DSD103/30 ;
run ;

*Derive in a SAS Procedure
*-----------------------------------------------------------;
* Sort the data set , CALCIUM1, by sequence number (SEQN)   ;
*-----------------------------------------------------------;

proc sort data =CALCIUM1;
by SEQN;
run ;

*--------------------------------------------------------;
* Use the PROC MEANS procedure to sum the values of the  ;
* amount of calcium consumed per supplement per day      ;
* (DAILYAVG).  The data will be output in a file called  ;
* FINALCA.                                               ;
*--------------------------------------------------------;

proc means data =CALCIUM1 NOPRINT ;
by SEQN;
var DAILYAVG;
output out =FINALCA sum =DAILYAVG;

run ;

### Step 2: Check Recoded Variables

In this step, you will use the PROC MEANS procedure in SAS to confirm that the derived and recoded variables correctly correspond to the original variables. This example is taken from the “Food Sources” program.

#### Sample Code

*--------------------------------------------------------------;

* Use the PROC SORT procedure to sort the dataset by FOODGRP.  ;

* Use the PROC MEANS procedure to calculate the number of      ;

* observations, minimum values, and maximum values for the     ;

* original food code variable (DR1FDCD) by the new FOODGRP     ;
* variable.                                                    ;

*--------------------------------------------------------------;

proc sort data =FDSRC;
by FOODGRP;
proc means data =FDSRC N min max ;
by FOODGRP;
var DR1IFDCD;
title 'Check if FOODGRP is defined correctly' ;
run ;

#### Output of Program IMPORTANT NOTE

In each of the Preparing an Analytic Dataset modules, you will be working with temporary datasets, which are saved in the WORK folder of your SAS program.  The dataset exists only as long as your SAS session and is deleted when you exit the program.  If you would like to save these datasets so that you can return to them at a later time, you can learn how to do this in the Save a Dataset module at the end of this course. Close Window to return to module page.