***************************************************************************;
** PROGRAM: control_3MRCV.sas **;
** AUTHOR: Christopher R. Bilder **;
** Department of Statistics **;
** University of Nebraska-Lincoln **;
** chris@chrisbilder.com **;
** DATE: 3-23-06 **;
** PURPOSE: Control program for other programs that do the following: **;
** Fit a generalized loglinear model via the marginal modeling **;
** method and use the bootstrap and Rao-Scott approximations **;
** to find a p-value for testing the model of interest vs. the **;
** saturated model. This program is for only 3 MRCVs. **;
** NOTES: **;
** 1) Do not distribute copies of this program without the permission of**;
** the author. **;
** 2) Copyright 2006 Christopher R. Bilder **;
** 3) For more information on these procedures, see **;
** Bilder, C. R. and Loughin, T. M. (2006). Modeling Association **;
** between Two or More Categorical Variables that Allow for **;
** Multiple Category Choices. Submitted to Communications in Stat.**;
** 4) This is the control program for the analysis. The program **;
** reads in code from Gange_3MRCV.sas, data_gen_resample.sas, and **;
** boot_3MRCV.sas to do the analysis. **;
** 5) The data set needs to be in a multinomial format and each row **;
** represents a subject sampled. **;
** 7) The statements: **;
** options nomprint nosymbolgen nomlogic; **;
** options nosource nonotes; **;
** may be helpful to prevent the LOG window from having too much **;
** information displayed to it. Of course, this should only be used **;
** when one is confident the user is running the program correctly! **;
** 8) A form of conditional resampling is done here to make sure the **;
** marginal table has at least one positive response per item. This **;
** is done to keep the resampled marginal table the same size as what**;
** was observed. See the &totsim and &numbsim variables to control **;
** this. **;
** 9) This program can take some time to run. I STRONGLY recommend **;
** running a test case with a small number of resamples and a small **;
** number of iterations for the Gange algorithm. The program takes **;
** about 6 hours on a 2.4GHZ processor with 1GB memory computer. **;
** 10) With respect to specifying the model, the data set is transformed **;
** in the program to the following format: **;
** **;
** W Y Z wi yj zk COUNT **;
** 1 1 1 0 0 0 115 **;
** 1 1 1 0 0 1 8 **;
** 1 1 1 0 1 0 88 **;
** 1 1 1 0 1 1 28 **;
** 1 1 1 1 0 0 11 **;
** 1 1 1 1 0 1 2 **;
** 1 1 1 1 1 0 21 **;
** 1 1 1 1 1 1 6 **;
** 1 1 2 0 0 0 93 **;
** 1 1 2 0 0 1 30 **;
** ... **;
** **; **;
** The W, Y, and Z denote the item numbers. The wi, yj, and zk **;
** denote the 0 or 1 item responses. Thus, the first row count above**;
** denotes the W_1 = 0, Y_1 = 0, Z_1 = 0 sub-table cell count of 115.**;
** Models can be specified in a similar manner as shown in **;
** control_2MRCV.sas for the MODEL statement of PROC GENMOD. For **;
** example, W*Y*Z accounts for the gamma_ijk terms. **;
** PROC GENMOD Model statements are given in the Settings section **;
** below to model the data in this format. **;
** 11) These programs have only been tested in SAS version 8.02 **;
** 12) The code in boot_3MRCV.sas is not quite general enough to use with**;
** any data set. Code in the DATA STEP to form pred_sort2 and the **;
** DATA STEP to form sasdata_one3 would need to be changed to **;
** correspond to the correct number of W, Y, and Z items. **;
** 13) The RS MACRO calculates the V matrix of Appendix A. The program **;
** is currently set up to read the matrix in from v_mat_m.txt which **;
** contains the matrix from the Section 4.2 data set. This is done **;
** since it can take 20 minutes for the matrix to be calculated **;
** in PROC IML and it does not change for each model. When a **;
** different data set is used, one should changed some of the **;
** code in the RS MACRO (this part is commented - see macro for more **;
** information). **;
***************************************************************************;
dm 'log;clear;output;clear;';
options ps=70 ls=100 pageno=1;
goptions reset=all border ftext=swiss gunit=cm htext=0.4 htitle=0.5;
options source notes;
options mprint symbolgen mlogic;
*Suppress information sent to the RESULTS window (the program will be slower without it);
ODS noresults;
title1 'Chris Bilder''s marginal modeling of 3 MRCVs program';
************************************************************************************;
*Read in other programs containing the code;
* Do not forget to change the folder from where I stored the programs on my computer;
* to where you are storing the programs on your computer!;
*Part of Gange algorithm which finds a set of multinomial probabilities satisfying the
* null hypothesis model;
%include "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\Gange_3MRCV.sas";
*Uses the multinomial probabilities from the Gange program to take the resamples;
%include "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\data_gen_resample.sas";
*Performs the bootstrap and Rao-Scott (RS) methods, also finds the standardized Pearson residuals;
%include "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\boot_3MRCV_final.sas";
***************************************************************************************;
*Settings;
*Number of resamples to take. This should be greater than or equal to the number actually desired.;
* Resamples are not used if there are no positive responses for an item. This is done to keep;
* the marginal table the same size as what was observed.;
* I recommend using 6,000 with this data set.;
%let totsim=6;
*Number of resamples actually to use. For example, B = 5,000 are used for the 2 MRCV example in;
* the paper. I recommend trying a low number (like 5) first in order to estimate how long
* it will take for 5,000.;
%let numbsim=5;
*Number of row items for W;
%let nrow2=3;
*Number of column items for Y;
%let ncol2=4;
*Number of strata items for Z;
%let nstrat2=5;
*Convergence criterion for Gange program - max(abs(tau*_i - tau*_(i+1))) <= tol where;
* tau*_i is a vector of multinomial probabilities for the ith iteration;
%let tol = 0.00000001;
*Maximum Gange iterations for iterative proportional fitting;
* I recommend using at least 250, but test (time it) first with a smaller number.;
%let gange_iter = 5;
*Constant to add to 0 cells in the item response table;
%let constant_zero = 0.5;
*Level of significance used with bootstrap methods;
%let alpha = 0.05;
*Class statement to use in PROCs GENMOD and GLMMOD (2nd PROC is used to find the X matrix);
%let class = w y z wi yj zk;
*Complete independence model;
%let comp_ind_model = w*y*z wi(w*y*z) yj(w*y*z) zk(w*y*z);
*Ho model - everything after complete independence model;
%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(z) yj*zk(y) yj*zk(z*y); *Final model;
*%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(z) yj*zk(y);
*%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(z);
*%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(y);
*%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk;
*%let model = ; *For complete independence model;
*NOTE: The final model above with constant_zero = 0.5 will result in a few standardized Pearson ;
* residuals which are greater than 2.576 in absolute value. These residuals are all ;
* associated with the 0 cell count sub-tables. When this constant is lowered (say to 0.001),;
* there are no more standardized Pearson residuals greater than 2.576 in absolute value;
* The other models do have standardized Pearson residuals greater than 2.576 when adding;
* this constant and these residuals are not associated with the 0 cell count sub-tables.;
*Number of sub-tables - no changes need to be made here;
%let ncol=%eval(&nrow2 + &ncol2 + &nstrat2);
*Total number of multinomial probabilities - no changes need to be made here;
%let numbp=%eval(2**&ncol);
***************************************************************************************;
*Read in the data set and get it into the correct format;
*W is test waste, Y is waste storage method, and Z is sources of veterinary information;
data sasdata_one;
infile 'C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\3_MRCV_example_data.txt'
firstobs=2;
input w1-w3 y1-y4 z1-z5;
run;
*Add a dummy variable to the data set;
* The programs are actually partially set up for a simulation study. By using just one iteration ;
* and a few other modifications (already done here), the programs can be used for an example data set;
data sasdata_two;
set sasdata_one;
iter = 1;
run;
*This DATA STEP reads in the V matrix (cov(m)) given in Appendix A. The boot_3MRCV.sas contains code in its
* RS macro to calculate it, but this can take some time. Since the matrix does not
* change for each model, I calculated it once and then just read it in for whatever model under
* investigation. ;
PROC IMPORT OUT= WORK.COV_M_SET
DATAFILE= "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\v_mat_m.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
***************************************************************;
* Bootstrap;
*%SIM_LOOP(data set, seed number for resamples, sample size);
* - Performs the bootstrap methods;
* - This is located in the boot_3MRCV.sas program;
%SIM_LOOP(sasdata_two, 636265496, 279);
*%RS(sample size) - Finds the Rao-Scott (RS) adjusted Pearson statistic and corresponding p-value;
* - Also finds the standardized Pearson residuals;
* - This is located in the boot_3MRCV.sas program;
%RS(279);
***************************************************************;
*Sound to let user know the program is done;
data _null_;
call sound(300, 1000);
run;
quit;