# How Do I ....

1: How do I select predicted values for the observations used in the analysis from an output file of expected values?

2: How do I recode a 0-1 variable so that 0 recodes to 2 and 1 remains 1?

3: How can I incorporate an interaction between two continuous variables in a MODEL?

4: How can I test the equality of means between four sub-populations?

5: How can I run a backward or stepwise regression in SUDAAN?

6: How do I perform an ANOVA in SUDAAN?

7: How do I apply Hosmer-Lemeshow goodness-of-fit measures to my SUDAAN output in LOGISTIC?

8: How do I compute the Mean Square Error from my PROC REGRESS output?

9: Is there a way to use the odds ratios produced in LOGISTIC to compute the probability of the response variable for a given set of covariates?

10: I've tried to find a method in PROC CROSSTAB that can produce a 95% confidence limit for a binomial proportion (this is the proportion of observations in the first variable level that appears in the output). Is there a model in SUDAAN that can compute this CI?

11: How do I use time-dependent covariates in SURVIVAL via the counting process approach?

1: How do I select predicted values for the observations used in the analysis from an output file of expected values?

You can include any number of additional identification variables on the PREDICTED output data set by using the IDVAR statement in your procedure. These extra variables can be used to uniquely connect the predicted values to the original data set.
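As a rough illustration of the reconnecting step, the following Python sketch joins an exported file of predicted values back to the original records through an identification variable. The field names `ID` and `PREDICTED` and all data values are hypothetical, and the sketch assumes both files have already been read into memory as lists of records.

```python
# Hypothetical records: the original data and an exported predicted-value file.
# Both carry the identification variable "ID" (assumed to be kept via IDVAR).
original = [
    {"ID": 101, "Y": 3.2},
    {"ID": 102, "Y": 4.1},
    {"ID": 103, "Y": None},
]
predicted = [
    {"ID": 101, "PREDICTED": 3.0},
    {"ID": 102, "PREDICTED": 4.4},
]


def attach_predictions(original_rows, predicted_rows, key="ID"):
    """Return the original rows that have a matching predicted value."""
    by_key = {row[key]: row for row in predicted_rows}
    merged = []
    for row in original_rows:
        match = by_key.get(row[key])
        if match is not None:
            # Copy the original row and add the predicted value to it.
            merged.append({**row, "PREDICTED": match["PREDICTED"]})
    return merged


matched = attach_predictions(original, predicted)
for row in matched:
    print(row)
```

Any variable, or combination of variables, that uniquely identifies a record could play the role of `ID` here.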

2: How do I recode a 0-1 variable so that 0 recodes to 2 and 1 remains 1?

You can place the variable on a CLASS statement with the options SORT=INTERNAL and DIR=DESCENDING. Thus for example, to convert the 0-1 variable YESNO to a 1-2 variable with the 0's flipped to 2's, use the following:

CLASS YESNO / SORT=INTERNAL DIR=DESCENDING;

3: How can I incorporate an interaction between two continuous variables in a MODEL?

You can create a new variable whose value on each record is the product of the values of the two continuous variables (e.g., AB=A*B) and then use this variable in your MODEL statement. Currently, you must do this data manipulation outside of SUDAAN.
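Because the product has to be built outside of SUDAAN, any data-preparation tool will do. The Python sketch below is one possible way to add the column; the variable names `A`, `B`, and `AB` follow the example above, while the `ID` column and the data values are hypothetical.

```python
import csv
import io

# Hypothetical input file with two continuous variables, A and B.
raw = "ID,A,B\n1,2.0,3.5\n2,1.5,4.0\n"


def add_interaction(rows, first="A", second="B", name="AB"):
    """Return copies of the rows with a product column appended."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row[name] = float(row[first]) * float(row[second])
        out.append(new_row)
    return out


rows = add_interaction(list(csv.DictReader(io.StringIO(raw))))
for row in rows:
    print(row["ID"], row["AB"])
```

The augmented file would then be saved in whatever format your SUDAAN job reads, and AB named in the MODEL statement.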

4: How can I test the equality of means between four sub-populations?

Here is one possible way to accomplish this in SUDAAN. Create a new variable A with values 1, 2, 3, and 4 to designate the particular subpopulation to which the observation belongs. Suppose you wish to compare the means for the variable Y. The following SUDAAN program will do this:

PROC REGRESS DATA=<data set> DESIGN=<design>;
NEST <nest variables>;
WEIGHT <weight variable>;
SUBGROUP A;
LEVELS 4;
MODEL Y=A;

The test of hypothesis for the effect of A is the same as the test of equality of subgroup means.

5: How can I run a backward or stepwise regression in SUDAAN?

SUDAAN does not directly implement backward or stepwise regression. To run a backward regression using SUDAAN you must sequentially remove one variable from your model, and rerun your job. To run a forward regression, successively add one variable to your model, and rerun your job.

6: How do I perform an ANOVA in SUDAAN?

You can effectively perform an ANOVA by using the linear regression (REGRESS) model in SUDAAN. It works very much like the GLM procedure in SAS. You specify the categorical covariates (coded 1,2,3,...) on the SUBGROUP and LEVELS statements, the dependent variable on the left-hand side of the MODEL statement, and all independent variables (categorical and continuous) on the right-hand side of the MODEL statement.

SUBGROUP X;
LEVELS 4;
MODEL Y=X Z;

Here X is categorical with 4 levels (coded 1,2,3,4), Y is the dependent variable, and Z is a continuous covariate. X will be modeled using dummy variables (one for each level of X), and Z will be modeled with one regression coefficient (the slope of Z). By default, the last level of each of the categorical covariates is used as the reference cell for the covariate. You can change the reference cell of any categorical covariate by using the REFLEVEL statement.
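To make the reference-cell idea concrete, here is a small Python sketch of one design-matrix row for the model above. It is only an illustration of the coding, not a description of SUDAAN's internals: it omits the indicator for the reference level, which is equivalent to fixing that level's coefficient at zero, and it assumes an intercept column. The function names are hypothetical.

```python
def reference_cell_row(level, n_levels, reference=None):
    """Indicators for every level except the reference (default: last level)."""
    if reference is None:
        reference = n_levels
    return [1 if level == k else 0
            for k in range(1, n_levels + 1) if k != reference]


def design_row(x_level, z, n_levels=4):
    """Assumed layout: intercept, indicators for X (levels 1-3), then Z."""
    return [1] + reference_cell_row(x_level, n_levels) + [z]


print(design_row(2, 5.5))  # [1, 0, 1, 0, 5.5]
print(design_row(4, 1.0))  # [1, 0, 0, 0, 1.0]  (reference level: all zeros)
```

Under this coding, each coefficient for X measures the difference between that level and the reference level, holding Z fixed.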

7: How do I apply Hosmer-Lemeshow goodness-of-fit measures to my SUDAAN output in LOGISTIC?

Beginning with Release 9.0.0, in LOGISTIC, SUDAAN computes the following Hosmer-Lemeshow type statistics:

• A Wald F test with numerator degrees of freedom equal to the rank of the variance-covariance matrix (usually G-1) and denominator degrees of freedom equal to the (Number of PSUs - number of strata) for Taylor series and Delete-1 jackknife designs, and (Number of replicates) for BRR and Replicate weight jackknife designs.
• A simple Chi-square test, which is a weighted analog of the original Hosmer-Lemeshow test. The degrees of freedom are G-2.
• For Taylor series designs, the Satterthwaite adjusted F-test, degrees of freedom and p-value are also provided.

There are two new options, HLGROUPS=count and HLVAR=variable, on the MODEL statement. HLGROUPS allows you to specify the number of groups of residuals to form. By default, LOGISTIC forms 10 groups. Note, however, that LOGISTIC will not form more groups than are supported by the data. HLVAR permits you to specify a variable on the input data set that gives the group number to be associated with each residual. You may use at most one of these options.

LOGISTIC has two new output groups. HLGROUPS contains information on the residual groups formed. HLTEST contains all of the available test statistics. See your SUDAAN Language Manual, Chapter 10 for details.

8: How do I compute the Mean Square Error from my PROC REGRESS output?

The concept of "mean square error" is defined only for the case of simple random samples. There is no equivalent definition for complex survey data. To compute the variance of a predicted value for a given X, use:

PREDICTED Y = B'X;
VARIANCE(PREDICTED Y) = X' {V(B)} X;

SUDAAN prints out the variance-covariance matrix V(B) of the estimated coefficients B. You can use this to compute the variance of any predicted value.
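The two formulas above can be evaluated by hand from the printed output. The Python sketch below shows the arithmetic with hypothetical numbers: `b` stands for the estimated coefficients B, `v` for the variance-covariance matrix V(B), and `x` for the covariate values, all in the same order.

```python
import math


def predicted_value(b, x):
    """B'X: the predicted value for covariate vector x."""
    return sum(bi * xi for bi, xi in zip(b, x))


def predicted_variance(v, x):
    """X' V(B) X: the variance of the predicted value."""
    n = len(x)
    return sum(x[i] * v[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical values: an intercept and one slope.
b = [1.0, 0.5]
v = [[0.04, -0.01],
     [-0.01, 0.09]]
x = [1.0, 2.0]          # 1.0 for the intercept, 2.0 for the covariate

y_hat = predicted_value(b, x)
var_hat = predicted_variance(v, x)
se_hat = math.sqrt(var_hat)
print(y_hat, var_hat, se_hat)
```

With these numbers the predicted value is 2.0, its variance is approximately 0.36, and its standard error approximately 0.6.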

9: Is there a way to compute the probability of the response variable for a given set of covariates in LOGISTIC?

You can get the equivalent of "adjusted means" for logistic regression and other nonlinear models using the PREDMARG and CONDMARG statements. You can test hypotheses and form general linear contrasts among the marginals using the PRED_EFF and COND_EFF statements. See the LOGISTIC chapter of the SUDAAN User's Manual for more details.
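For a single covariate pattern, the probability implied by a fitted logistic model can also be worked out by hand with the inverse logit. The Python sketch below assumes you have the intercept and the odds ratios from your output; the numbers are hypothetical. It gives a point estimate only, with no standard error, so it does not replace PREDMARG or CONDMARG.

```python
import math


def predicted_probability(intercept, coefficients, covariates):
    """Inverse logit of intercept + sum(coefficient * covariate)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical odds ratios; a regression coefficient is the natural log
# of the corresponding odds ratio.
odds_ratios = [2.0, 0.5]
coefficients = [math.log(r) for r in odds_ratios]

# Hypothetical intercept of -1.0; first covariate 1, second covariate 0.
p = predicted_probability(-1.0, coefficients, [1.0, 0.0])
print(round(p, 4))
```

Note that the odds ratios alone are not enough: the intercept is also required to recover a probability.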

10: I've tried to find a method in PROC CROSSTAB that can produce a 95% confidence limit for a binomial proportion (this is the proportion of observations in the first variable level that appears in the output). Is there a model in SUDAAN that can compute this CI?

SUDAAN 9.0.0 and above provides confidence intervals for proportions in the CROSSTAB procedure. These are printed by default as part of the TABLECELL group. The confidence limits for row percent (ROWPER) are LOWROW and UPROW. The confidence limits for column percent (COLPER) are LOWCOL and UPCOL. The confidence limits for total percent (TOTPER) are LOWTOT and UPTOT.

11: How do I use time-dependent covariates in SURVIVAL via the counting process approach?

Some data preparation is needed before running SURVIVAL with time-dependent covariates. The approach assumed here follows the work of Andersen and Gill (1982, "Cox's Regression Model for Counting Processes: A Large Sample Study," Annals of Statistics, vol. 10, pp. 1100-1120), who developed the counting process style of input.

Suppose you are analyzing how long people survive, and some of the predictor (independent) variables may vary over time. For the sake of discussion, suppose you follow people from birth to death and record the weight of each person once a year, on December 31st.

The counting process style of input requires that each person have a new record every time the value of the independent variable changes. Here is a "pseudo" case. Note that time1 and time2 are the left and right endpoints of the time interval over which Weight (the time-dependent covariate) was constant.

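| PersonID | Weight | time1 | time2 | Survive? (indicates survival) |
|---|---|---|---|---|
| 1 | 7 | 0 | 1 | 1 |
| 1 | 15 | 1 | 2 | 1 |
| 1 | 25 | 2 | 3 | 1 |
| 1 | 35 | 3 | 4 | 1 |
| 1 | 40 | 4 | 5 | 1 |
| 1 | 42 | 5 | 6 | 1 |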

If you have more than one time-dependent covariate, constructing the multiple records gets trickier.

In general, with time-dependent covariates you must take the interval of time over which each person is followed and break it into periods during which ALL of the time-dependent covariates are CONSTANT.

In the above example, suppose that a person's height also changed mid-year in each year (it was constant during the first half of the year and constant during the second half, but not the same value in both halves). Then you would have to split each record given above into two records, one where the height is the value for the first half of the year and one where the height is the value for the second half of the year. This is shown below for the first record given above:

| PersonID | Weight | time1 | time2 | Survive? | Height (inches) |
|---|---|---|---|---|---|
| 1 | 7 | 0 | .5 | 1 | 18 |
| 1 | 7 | .5 | 1 | 1 | 20 |

This is how you would handle the case of two time-dependent covariates.
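The record splitting described above can be automated outside of SUDAAN. The Python sketch below is one possible approach, not a prescribed method: it cuts a person's follow-up interval at every time point where any covariate changes, so that all covariates are constant within each resulting record. The function name and input layout are hypothetical. The first-year values match the example above; the second-year heights are made up for illustration. The survival indicator is left out and would need to be added to each record according to your data.

```python
def split_intervals(start, end, covariates):
    """Split [start, end] into records on which every covariate is constant.

    covariates maps a name to a list of (time_from, value) pairs sorted by
    time_from; each value holds from time_from until the next change.
    """
    # Collect every change point that falls strictly inside the follow-up.
    cuts = {start, end}
    for changes in covariates.values():
        for time_from, _ in changes:
            if start < time_from < end:
                cuts.add(time_from)
    points = sorted(cuts)

    records = []
    for left, right in zip(points, points[1:]):
        record = {"time1": left, "time2": right}
        for name, changes in covariates.items():
            # Use the most recent value in effect at the left endpoint.
            current = None
            for time_from, value in changes:
                if time_from <= left:
                    current = value
            record[name] = current
        records.append(record)
    return records


records = split_intervals(0, 2, {
    "weight": [(0, 7), (1, 15)],
    "height": [(0, 18), (0.5, 20), (1, 24), (1.5, 26)],  # last two hypothetical
})
for record in records:
    print(record)
```

Here two years of follow-up produce four records, one per half-year, each carrying the weight and height in effect during that period.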

Once the data is set up outside of SUDAAN, you treat the data in SURVIVAL as if the covariates are NOT time-varying, like so (but still using the counting process style of input):

Model time1 time2 = weight height;

All you specify are the two variables that record the time points of the left and right endpoints of each record's "time interval".