SUDAAN Results
1: My SUDAAN estimates, standard errors, and/or tests of hypothesis are not the same as the ones I get in other packages. Why is this?
2: Why am I getting ******** in the output instead of results?
3: I am getting nonzero parameter estimates in LOGISTIC for the reference group. What is wrong?
4: I am trying to estimate quantiles for a variable with a large percentage of 0 values. Is there anything I can do?
5: Does it matter whether I use SUBPOPN or subset my data outside of SUDAAN before analyzing?
6: Can I use "-2 log likelihood" to evaluate the relative fit of two models?
7: I ran the same procedure using both WR and Delete-1 Jackknife designs. Which method do you recommend?
8: Which SEMETHOD should I use with R=EXCHANGEABLE?
9: I listed 0-1 indicator variables on the SUBGROUP statement and SUDAAN is deleting observations. What could be the problem?
10: How can I test for differences between LSMEANS from PROC REGRESS?
11: Is there something about the way percentiles are calculated that would make the percentile estimates appear incompatible with proportions from a dichotomous variable?
1: My SUDAAN estimates, standard errors, and/or tests of hypothesis are not the same as the ones I get in other packages. Why is this?

If you are analyzing data from a complex sample survey, you will likely get different results in SUDAAN vs. other packages. First, if you cannot use the survey sampling weights in the other package, the point estimates will differ; point estimates are generally biased when the survey sampling weights cannot be used. Some packages allow a WEIGHT statement, which will ensure that the point estimates agree between SUDAAN and most other packages. Even when weights are used, however, the variances, standard errors, tests of hypotheses, and p-values will still differ. This is because SUDAAN allows the user to specify the sampling design and thereby compute a robust variance estimate, yielding valid inferences. If another package does not allow for specification of the complex sampling design (stratification, clustering, etc.), then variance estimation, and hence test statistics and p-values, will be wrong. Usually this results in variances that are too small and false-positive tests of hypotheses.
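The effect of ignoring clustering can be sketched outside SUDAAN. The following Python snippet is purely illustrative (simulated data, not SUDAAN code): it compares a naive SRS-style standard error of a mean with a design-aware one that treats the PSU means as the independent units.

```python
import numpy as np

# Hypothetical simulation: observations within a PSU share a common
# cluster effect, so they are positively correlated within PSUs.
rng = np.random.default_rng(42)
n_clusters, m = 30, 50                                  # 30 PSUs, 50 obs each
cluster_effect = rng.normal(0.0, 1.0, size=n_clusters)  # shared within a PSU
y = cluster_effect[:, None] + rng.normal(0.0, 1.0, size=(n_clusters, m))

# Naive SE: pretend all 1500 observations are independent (SRS assumption).
se_naive = y.std(ddof=1) / np.sqrt(y.size)

# Design-aware SE: treat the PSU means as the independent units.
psu_means = y.mean(axis=1)
se_cluster = psu_means.std(ddof=1) / np.sqrt(n_clusters)

print(se_naive, se_cluster)
```

With positive intra-cluster correlation the design-aware standard error is several times larger than the naive one, which is why ignoring clustering tends to produce variances that are too small and false-positive tests.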

In some procedures, differences in estimates as well as in standard errors may be due to different tolerances for matrix inversion. Try changing the value of the TOL parameter on the PROC statement.

In the iterative regression procedures, different estimates may be due to a different number of iterations. Try changing the values of MAXITER, EPSILON, and/or P_EPSILON on the PROC statement.
2: Why am I getting ******** in the output instead of results?
The asterisks indicate that the default field width is not large enough for the result. Suppose, for instance, you find **** in the output from one of the descriptive procedures where you requested WSUM. You can add something similar to the following to your PRINT statement after the slash:

WSUMFMT=Fw.d

where w is the overall field width you desire and d is the number of decimal places. You should choose w large enough to accommodate the number of decimal places d, the decimal point, and enough digits to the left of the decimal to contain the largest value.
3: I am getting nonzero parameter estimates in LOGISTIC for the reference group. What is wrong?
The large number of records in your data set may be the cause of the problem. The large size reduces the precision of the sums of squares and cross products that are accumulated in order to estimate the parameters. In this case, the round-off errors may be larger than the default tolerance for matrix inversion (TOL=1E-6). We suggest that you supply a larger tolerance on the PROC statement (TOL=1E-5, for example) and rerun the job.
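The precision loss can be illustrated in plain NumPy (a hypothetical sketch, not SUDAAN's internal code): accumulating sums of squares around a large mean in limited precision suffers catastrophic cancellation, the same mechanism that degrades an accumulated cross-product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = 10_000.0 + rng.standard_normal(n)   # large mean, small spread

# One-pass "textbook" variance, sum(x^2)/n - mean^2, accumulated in the
# given precision -- prone to round-off and cancellation.
def one_pass_var(values, dtype):
    v = values.astype(dtype)
    sx = np.add.reduce(v)               # accumulated sum
    sxx = np.add.reduce(v * v)          # accumulated sum of squares
    return float(sxx / len(v) - (sx / len(v)) ** 2)

v32 = one_pass_var(x, np.float32)       # round-off swamps the true value
v64 = one_pass_var(x, np.float64)       # close to the stable answer
v_ref = float(np.var(x))                # two-pass, numerically stable

print(v32, v64, v_ref)
```

The float32 result is off by far more than the true variance itself; in the same way, accumulated round-off in an SSCP matrix can exceed the inversion tolerance and make the matrix look singular.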
4: I am trying to estimate quantiles for a variable with a large percentage of 0 values. I am getting missing values for the quantile and for the SEs and upper and lower confidence limits. Is there anything I can do?
SUDAAN is unable to estimate any quantile at or below the cumulative percentage of the data accounted for by the 0 values. This will happen for any variable whose smallest value is heavily tied.
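The situation can be illustrated with plain NumPy (a hypothetical sketch, not SUDAAN code): when a large share of the data is exactly 0, every quantile at or below that share lands on the tied minimum, so there is no interval below it over which to interpolate or to measure spread.

```python
import numpy as np

# Made-up data: 40% exact zeros, the rest positive durations.
rng = np.random.default_rng(1)
x = np.concatenate([np.zeros(400), rng.uniform(1.0, 10.0, 600)])

share_zero = float(np.mean(x == 0.0))   # 0.4

q10 = float(np.quantile(x, 0.10))       # lands on the tied minimum: 0.0
q25 = float(np.quantile(x, 0.25))       # also 0.0
q50 = float(np.quantile(x, 0.50))       # above the zero mass: positive

print(share_zero, q10, q25, q50)
```

Every requested quantile below the 40% mark collapses onto the same tied value, which is why SUDAAN cannot produce a standard error or confidence limits there.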
5: Does it matter whether I use SUBPOPN or subset my data outside of SUDAAN before analyzing?
It makes a difference any time part of the sampling design (e.g., an entire PSU) is lost after subsetting the data, which is what happens in most cases. SUDAAN needs the entire design present in order to estimate variances correctly. The difference shows up in the variance estimation and hypothesis testing.
Here is how the SUBPOPN statement works. Imagine a new variable named ELIGIBLE that equals 1 if the observation is to be included in the analysis through SUBPOPN, and equals 2 if it is not. If this variable is placed on the SUBGROUP statement with the corresponding LEVELS set to 2, and also crossed with every term on a TABLES statement, SUDAAN will produce results for both levels, ELIGIBLE=1 and ELIGIBLE=2. Using SUBPOPN ELIGIBLE=1 will produce results identical to those in the ELIGIBLE=1 cells when both levels are analyzed.
If you instead subset the population outside of SUDAAN and then analyze the subset, the results of the two analyses may differ. One case in which the results will be the same is when DESIGN=WR and the subset contains at least one observation (with positive weight) in each of the original PSUs.
In conclusion, the safe (therefore preferred) approach is to use SUBPOPN, and not subset the data prior to using SUDAAN.
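The effect described above can be sketched numerically. This is not SUDAAN syntax, and the PSU totals are invented for illustration: it applies the with-replacement variance formula for a total in one stratum, once with every PSU retained (a SUBPOPN-style analysis, where a PSU with no eligible cases contributes 0) and once with that PSU silently dropped by external subsetting.

```python
import numpy as np

def wr_total_variance(psu_totals):
    """WR variance of an estimated total from PSU-level weighted totals
    in a single stratum: (n / (n - 1)) * sum((z_i - zbar)^2)."""
    z = np.asarray(psu_totals, dtype=float)
    n = len(z)
    return float(n / (n - 1) * np.sum((z - z.mean()) ** 2))

# Weighted subpopulation totals by PSU; the third PSU happens to contain
# no eligible respondents for this subpopulation.
v_subpopn = wr_total_variance([120.0, 95.0, 0.0, 110.0])  # design kept intact
v_subset = wr_total_variance([120.0, 95.0, 110.0])        # PSU dropped

print(v_subpopn, v_subset)
```

Dropping the empty PSU shrinks both the number of PSUs and the between-PSU spread, so the subset analysis reports a very different (here much smaller) variance than the design-preserving one.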
6: Can I use "-2 log likelihood" to evaluate the relative fit of two models?
You can use "-2 log likelihood" to evaluate the relative fit of two models, but not to compute an absolute p-value for a test of hypothesis, since we do not know the distribution of the likelihood under complex sampling.
7: I ran the same procedure using both WR and Delete-1 Jackknife designs. Results were very similar, but the Jackknife design takes much longer to execute. Which method do you recommend?
Both methods are good large sample approximations. Here "large" refers to the number of PSUs (Primary Sampling Units), not the number of observations. Which to use is a matter of preference. There is no evidence that one method is superior to the other in general.
8: Which SEMETHOD should I use with R=EXCHANGEABLE?
You can use either SEMETHOD=ZEGER or SEMETHOD=BINDER to obtain the robust variance. BINDER is most often used in complex sample surveys. ZEGER is most often used in randomized experiments and non-survey applications. In many cases, ZEGER and BINDER are identical.
Use SEMETHOD=MODEL to obtain the model-based or "naive" variance estimate. This estimate assumes that the exchangeable intra-cluster correlation structure (R=EXCHANGEABLE) is correct. It is the most efficient variance estimate when the "working" correlation assumption (R=EXCHANGEABLE or R=INDEPENDENT) is correct. SEMETHOD=MODEL is most often used with randomized experiments and other non-survey applications.
9: My regression model contains independent variables that are coded as 0-1 indicator variables, and I listed these variables on the SUBGROUP statement. SUDAAN seems to be deleting a lot of observations from my analysis, and the regression coefficients don't look correct to me. What could be the problem?
Do not list independent variables that are coded 0-1 on the SUBGROUP statement. Values of 0 are treated as missing for variables listed on the SUBGROUP statement and will be excluded from your analysis. Independent variables coded 0-1 may be placed on the CLASS statement if you wish to treat them as categorical, or you can enter them into the model as is.
10: I have used LSMEANS in PROC REGRESS to estimate means by race and income level controlling for body weight, sex, race, and income level. I have a set of LSMEANS for race and income. How can I test for differences between those LSMEANS?
First, you can use the t-tests that SUDAAN prints to test H_0: Beta=0. Tests that the betas equal 0 are equivalent to tests for differences in LSMEANS. These t-tests automatically compare each level of the categorical covariates to the reference cells. You can also use the EFFECTS statement to compare other specific levels of the categorical covariates, which is likewise equivalent to comparing LSMEANS.
11: Is there something about the way percentiles are calculated that would make the percentile estimates appear incompatible with proportions from a dichotomous variable?
I calculated percentiles for a duration variable (number of minutes walked) using:
var walkdur;
tables _one_;
percentile 10 25 50 75 90;
Then, using a cutoff of 30 minutes, I created a dichotomous variable and used PROC CROSSTAB to calculate the proportion who walked for 30 or more minutes.
Percentile results were:
29.1 min.  25th
34.5 min.  50th
59.3 min.  75th
77.2 min.  90th
I calculated an estimate of 79% walking for 30 minutes or more using the dichotomous variable. However, from the percentiles, one would expect that less than 75% walked for 30 minutes or more, since the 25th percentile (29.1 minutes) falls below 30.
If you have a large percentage of ties, then when SUDAAN interpolates between successive values, you may see the type of behavior that you have indicated.
SUDAAN is interpolating between 30 and the value just below it in order to estimate the 25th percentile. The value closest to but less than 30 that occurs for the walkdur variable is 29. Values at or below 29 account for roughly 22% of the data. Since the value 30 accounts for roughly 28% of the data, SUDAAN assumes this percentage is distributed between 29 and 30. So 29 is at the 22nd percentile and 30 is at the 22+28 = 50th percentile. To find where between 29 and 30 the 25th percentile occurs, we need to determine how far above 29 it falls. To find the amount x to add to 29, solve this equation for x:

22 + 28x = 25

This gives x = 3/28, or roughly 0.107, so the estimated 25th percentile is 29 + 0.107, or roughly 29.1, which is the value reported above.
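The interpolation just described can be checked with a small illustrative computation (not SUDAAN output): 29 sits at the cumulative 22nd percentile and 30 at the 50th, so the 25th percentile is 29 plus the fraction of the way from 22 to 50 at which 25 falls.

```python
# Cumulative percentages at the two adjacent data values.
p_lo, p_hi = 22.0, 50.0     # cumulative % at x = 29 and x = 30
x_lo, x_hi = 29.0, 30.0

frac = (25.0 - p_lo) / (p_hi - p_lo)    # solve 22 + 28*x = 25 for x
q25 = x_lo + frac * (x_hi - x_lo)       # linear interpolation

print(round(q25, 1))                    # 29.1, matching the reported value
```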