Doing More with SAS
Today's topics
1) Data manipulation
2) Basic inferential statistics
3) Intermediate inferential statistics
4) SAS Graph procedures
5) Permanent SAS files
HSB Codebook
Variable | Label | Columns
ID | ID NUMBER | 1-3
SEX | | 9
RACE | | 17
SES | SOCIO-ECONOMIC STATUS | 25
SCTYP | SCHOOL TYPE | 33
HSP | HIGH SCHOOL PROGRAM | 41
LOCUS | LOCUS OF CONTROL | 49-53
CONCPT | SELF CONCEPT | 57-61
MOT | MOTIVATION | 65-68
CAR | CAREER CHOICE | 73-74
RDG | READING SCORE | 81-84
WRTG | WRITING SCORE | 89-92
MATH | MATH SCORE | 97-100
SCI | SCIENCE SCORE | 105-108
CIV | CIVICS SCORE | 113-116
Data Manipulation
Combinations of subsetting, merging, dropping variables, and outputting from procedures allow the user to create the desired data set, which can then be saved permanently to a new data file.
Subsets
Subsets of cases
Done within a data step with a "set" command
Use a combination of conditionals, Boolean operators, and either "output" or "delete"
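As a minimal sketch of subsetting cases, the two data steps below assume a data set named one containing the HSB variables, created as in the sample program later in this handout. The data set names pub_acad and not_voc are made up for illustration. The first step uses a conditional with "output", the second a conditional with "delete".

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

/* keep public-school (sctyp=1) students in the academic program (hsp=2) */
data pub_acad;                  /* hypothetical data set name */
  set one;
  if sctyp=1 and hsp=2 then output;
run;

/* drop students in the vocational program (hsp=3); everyone else stays */
data not_voc;                   /* hypothetical data set name */
  set one;
  if hsp=3 then delete;
run;
```

The two approaches treat missing values differently: a case with a missing hsp fails the condition in both steps, so it is left out of pub_acad but kept in not_voc.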
Subsets of variables
Done within a data step with a "set" command
"Keep" the variables you want
Merging
Concatenating data sets
Use the "set" command in a data step, naming the data sets to be merged
Matched merging
Sort the data sets by the variable or variables you want to match by
"Merge" the data sets in a data step by the desired variables
Beware of duplicates
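One possible way to guard against duplicates in a matched merge is sketched below. It assumes the data sets subset_5 and subset_6 from the sample program later in this handout; idcount and mset_both are hypothetical names. Proc freq counts how often each id occurs, and the "in=" data set option flags which input data set contributed to each merged case.

```sas
/* assumes subset_5 and subset_6 exist (see the sample program) */

/* count how many times each id occurs; any count above 1 is a duplicate */
proc freq data=subset_5 noprint;
  tables id / out=idcount;      /* idcount holds id, COUNT, and PERCENT */
run;
proc print data=idcount;
  where count > 1;
run;

/* sort both data sets by the matching variable, then merge */
proc sort data=subset_5; by id; run;
proc sort data=subset_6; by id; run;

data mset_both;                 /* hypothetical data set name */
  merge subset_5(in=in5) subset_6(in=in6);
  by id;
  if in5 and in6;               /* keep only ids found in both data sets */
run;
```

The same duplicate check would normally be repeated for subset_6 before merging.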
Summary data sets
Some procedures (e.g., proc means) allow you to create an output data set of summary statistics; you name the new data set and the desired statistics on an "output" subcommand of the procedure.
This can be done by subgroups by first sorting with Proc sort and then using the "by" subcommand
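The sketch below shows one way to get a summary data set by subgroups. It assumes the data set one from the sample program later in this handout; grpstats, n_math, mean_math, and sd_math are hypothetical names. Note that it swaps in a "class" statement in place of the Proc sort and "by" approach described above, so no sorting is needed.

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

proc means data=one noprint nway;   /* nway: keep only the hsp-by-sex cells */
  class hsp sex;
  var math;
  output out=grpstats               /* hypothetical data set name */
         n=n_math mean=mean_math std=sd_math;
run;

proc print data=grpstats;
run;
```

The output data set also contains the automatic variables _TYPE_ and _FREQ_; without the nway option it would include rows for the overall sample and for each variable separately.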
Creating permanent text files
Associate a filename with the output file
Use a data step with a "file" statement to associate a data set with the filename
Use a "put" statement to place the variables in the desired locations within a record
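The three steps above can be sketched as follows, assuming the data set one from the sample program later in this handout. The file reference out2 and the path ~/hsb_scores.txt are hypothetical. Using "data _null_" runs the data step without creating another SAS data set.

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

filename out2 "~/hsb_scores.txt";   /* hypothetical output path */

data _null_;                        /* no new SAS data set is created */
  set one;
  file out2;
  /* id in columns 1-3, sex in column 5, math in columns 7-11 */
  put @1 id 3. @5 sex 1. @7 math 5.1;
run;
```

The formats on the put statement (3., 1., 5.1) control the width and decimals of each field and take the place of any value labels attached to the variables.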
Basic Inferential Statistics
Proc corr: analyze relationships between variables
Pearson vs. Spearman correlations
Significance: 1- or 2-tailed p-values
Listwise or pairwise deletion
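A short sketch of these choices, assuming the data set one from the sample program later in this handout. By default Proc corr uses pairwise deletion; the "nomiss" option switches to listwise deletion.

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

/* Pearson correlations, pairwise deletion (the default) */
proc corr data=one pearson nosimple;
  var locus concpt mot;
run;

/* Spearman correlations, listwise deletion */
proc corr data=one spearman nosimple nomiss;
  var locus concpt mot;
run;
```

The p-values that Proc corr prints (Prob > |r|) are two-tailed.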
Proc freq: analyze differences in proportions
Statistics
Chi-squared
phi and Cramer's V
Cells
Observed
Expected
Row, column, total percents
Residuals: cellchi2
Proc means: analyze differences in means
Paired-samples t-test (dependent t-test)
Two variables
Confidence interval
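As a sketch of the paired-samples test, the steps below assume the data set one from the sample program later in this handout; diffs is a hypothetical data set name. The first approach tests whether the mean of a difference score is zero and adds a confidence interval with the "clm" keyword. The second is an alternative that uses the "paired" statement of Proc ttest.

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

/* approach 1: compute the difference score, then test its mean against 0 */
data diffs;                     /* hypothetical data set name */
  set one;
  rw_dif = rdg - wrtg;
run;

proc means data=diffs n mean stderr t prt clm;
  var rw_dif;
run;

/* approach 2: let proc ttest form the difference */
proc ttest data=one;
  paired rdg*wrtg;
run;
```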
Proc ttest: analyze differences in means
Independent samples t-test
2-groups
Define groups
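The "class" variable in Proc ttest must have exactly two levels. When the grouping variable has more, one option is to restrict the cases with a "where" statement, as in this sketch (it assumes the data set one from the sample program later in this handout and compares the lower and upper SES groups).

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

proc ttest data=one;
  where ses in (1, 3);   /* keep two of the three SES groups: 1=lower, 3=upper */
  class ses;
  var math;
run;
```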
Intermediate Inferential Statistics
Analysis of Variance (ANOVA): analyze differences among means
One-way factorial ANOVA: Proc GLM
Differences between group means (like an extension of independent t-test)
Compare means: One-way ANOVA
Post hoc tests show pairwise comparisons while controlling for the number of tests being conducted
Repeated measures ANOVA: Proc GLM
Differences among means for variables (like an extension of dependent t-test)
Often used for analyzing differences over time
Specify the factor name and number of levels
Regression: forming models for prediction from multiple predictor variables
Proc REG
Outcome variable is usually an interval variable
Predictor variables are usually interval variables, but can include dichotomous variables
Linear regression forms a linear composite of the predictor variables in such a manner that the correlation between the composite of predictors and the outcome variable is maximized
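The idea of a linear composite can be illustrated with the sketch below, which assumes the data set one from the sample program later in this handout; regout and pred_wrtg are hypothetical names. The predicted values are the composite of the predictors, and their correlation with the outcome equals the multiple R (the square root of the R-square that Proc reg reports).

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

proc reg data=one;
  model wrtg = locus concpt mot rdg;
  output out=regout p=pred_wrtg;   /* save the predicted values */
run;
quit;

/* the correlation of outcome and composite equals the multiple R */
proc corr data=regout nosimple;
  var wrtg pred_wrtg;
run;
```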
SAS/GRAPH procedures
The regular SAS plot and chart procedures (Plot, Chart) produce low-quality output that is actually text. SAS/GRAPH procedures (e.g., Gchart, Gplot, Gcontour) produce higher-quality, graphical output.
Depending on the type of graph produced, Goptions, global statements such as Axis, Pattern, and Symbol, and the procedure's own options can be used to control aspects of the output, including text, titles, labelling, frames, fill patterns, colors, axes, scaling, and number of bars.
Because of the number of available options for SAS/GRAPH procedures, creating nice graphs with SAS can be complicated and time-consuming, with a lot of trial and error involved.
Editing charts is done in a chart window, and there are limited editing options; one is better off setting up the graphs beforehand with Goptions.
From graph windows, graphs can be exported in a half-dozen formats, including gif, ps, and eps. With some extra SAS programming, these options can be expanded.
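As one example of the extra programming, the sketch below sends a chart straight to a GIF file without going through the graph window. It assumes the data set one from the sample program later in this handout; the file reference gout and the path ~/hsp_bar.gif are hypothetical.

```sas
/* assumes work.one exists with the HSB variables (see the sample program) */

filename gout "~/hsp_bar.gif";      /* hypothetical output file */

/* device= picks the graphics format; gsfname= points to the output file */
goptions reset=all device=gif gsfname=gout gsfmode=replace;

proc gchart data=one;
  vbar hsp / discrete type=freq;
run;
quit;
```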
A Sample Annotated SAS Program
/* this is a comment, and can be placed almost anywhere in */
/* your program (except the 1st 2 columns) */
/* comments here are in boldface, but would not be bold*/
/* in your program*/
filename in1 "~/public/minicourses/sas/hsb.dta";
filename out1 "~/public/minicourses/sas/hsb_2.dta";
options pagesize=55 linesize=72;
/* the first filename command references an internal file "in1"*/
/* to the external raw data file "hsb.dta" which lives in a*/
/* subdirectory "public" in my Unix home directory*/
/* the second filename references a new data set "hsb_2.dta"*/
/* which will house a permanent data set created below*/
proc format;
value sexfmt 1="MALE"
2="FEMALE";
value racefmt 1="HISPANIC"
2="ASIAN"
3="BLACK"
4="WHITE";
value sesfmt 1="LOWER"
2="MIDDLE"
3="UPPER";
value schfmt 1="PUBLIC"
2="PRIVATE";
value hspfmt 1="GEN"
2="ACAD"
3="VOC";
/* proc format is used to assign value labels to values*/
/* the formats are assigned to variables in the data step*/
data one; infile in1 missover;
input
id sex race ses sctyp hsp locus concpt mot car rdg wrtg math sci civ;
format sex sexfmt. race racefmt. ses sesfmt. sctyp schfmt. hsp hspfmt.;
/* the variables are read in free-field format*/
/* value labels are assigned to particular variables*/
/* with the format command*/
rw_dif=rdg-wrtg;
rm_dif=rdg-math;
/* two new variables (difference scores) are created*/
*subsetting cases;
/* this is another type of comment; the asterisk and semi-colon*/
/* are necessary*/
data subset_1; set one;
if sctyp=1 then output;
/* a new data set is created which contains only cases whose*/
/* school type is 1*/
proc freq data=subset_1;
tables hsp;
data subset_2; set one;
if sctyp=2 then delete;
/* another new data set is created which contains only cases whose*/
/* school type is not equal to 2*/
data subset_3; set one;
if (hsp=1 or hsp=2) and sctyp=1 then output;
data subset_4; set one;
if (hsp=1 or hsp=2) and sctyp=1 then delete;
/* creating complicated subsets*/
*subsetting variables;
data subset_5; set subset_2;
keep id sex race--hsp rdg--civ;
data subset_6; set subset_2;
keep id locus--car;
/* creating subsets where only certain variables are kept*/
*concatenating data sets;
data mset_7; set subset_3 subset_4;
/* combining data sets "vertically" (i.e., adding cases)*/
*match-merging data sets;
proc sort data=subset_5;
by id;
proc sort data=subset_6;
by id;
data mset_8;
merge subset_5 subset_6;
by id;
/* combining data sets "horizontally" (i.e., adding variables)*/
/* matching by id; data sets must be sorted first*/
*getting a summary data set;
proc means data=one n mean std;
var math;
output out=mnset_9 mean=mean1;
proc print data=mnset_9;
/* many procedures let you create an output data set*/
/* this creates a data set with a single case, which has*/
/* the mean math score for all original cases*/
*getting a summary data set by groups;
proc sort data=one;
by hsp sex;
proc means data=one n mean std noprint;
var math;
by hsp sex;
output out=mnset_10 mean=mean2;
proc print data=mnset_10;
/* this creates a data set with the mean math score*/
/* for subgroups. Cases must be sorted first*/
*sending a data set to a file;
data mnset_11; set mnset_10; file out1;
put hsp 1 sex 3 @5 mean2 5.2;
/* this creates a permanent data set sent to the file "out1"*/
/* defined at the beginning of the program*/
/* the put statement defines what variables are written to*/
/* what columns and in what format*/
*getting correlations;
proc corr data=one pearson nosimple;
var rdg--civ;
proc corr data=one spearman nosimple;
var rdg--civ;
/* inferential statistics: correlations*/
*doing chi-squared analyses;
proc freq data=one;
tables race*ses/ chisq expected nopercent cellchi2;
/* a chi-squared analysis*/
*doing t-tests;
proc means data=one mean t prt;
var rw_dif rm_dif;
/* paired (i.e., dependent) t-tests*/
proc ttest data=one;
class sex;
var rdg math;
/* independent t-tests*/
*doing one-way ANOVAs;
proc glm data=one;
class ses;
model math=ses;
means ses/bon lines;
proc glm data=one;
class race;
model locus=race;
means race/bon lines;
/* one-way analyses of variance*/
proc glm data=one;
model rdg wrtg math sci=/nouni;
repeated measure 4 contrast(1)/summary nom;
/* a repeated measures one-way analysis of variance*/
*doing multiple regression;
proc reg data=one;
model wrtg=sex locus concpt mot rdg/ stb;
/* linear regression*/
/* what follows are examples of graphs that can be*/
/* created using sasgraph procedures*/
/* options, patterns, symbols, colors, axes, etc.*/
/* are all defined and controlled*/
goptions reset=all hsize=6 in vsize=6 in htitle=5 pct border gwait=5;
pattern1 value=solid color=blue;
pattern2 value=solid color=red;
pattern3 value=solid color=green;
axis1 label=('HIGH SCHOOL PROGRAM');
axis2 label=('PROGRAM');
axis3 label=('SCIENCE SCORES');
symbol1 color=blue value='';
symbol2 color=red value='*';
proc gchart data=one;
vbar hsp/
type=freq discrete
caxis=black coutline=black frame freq clipref width=8
ref=0 to 300 by 100 minor=1
patternid=midpoint maxis=axis1 raxis=0 to 350 by 50;
proc gchart data=one;
vbar hsp/
type=mean sumvar=rdg discrete
caxis=black coutline=black frame clipref mean width=8
ref=30 to 50 by 10 minor=1
patternid=midpoint maxis=axis1 raxis=30 to 60 by 10;
proc gchart data=one;
vbar hsp/
type=pct discrete group=sex g100
caxis=black coutline=black frame clipref pct width=8
ref=0 to 50 by 10 minor=1
patternid=midpoint maxis=axis2 raxis=0 to 60 by 10;
proc gchart data=one;
vbar sci/
type=freq
caxis=black coutline=black frame clipref
ref=0 to 100 by 10 minor=1 SPACE=0.0
maxis=axis3 raxis=0 to 110 by 10;
proc gplot data=one;
plot wrtg*rdg=sex ;
run;
quit;
/* the quit command is necessary to kill*/
/* the sasgraph processes*/
Some Useful SAS Procedures
Base Procs
Proc | Description
Format | Defines value labels
Print | Prints variable values for each case
Sort | Sorts cases by one or more variables
Rank | Computes ranked values for a variable
Freq | Produces one-way or two-way frequency tables
Univariate | Computes descriptive statistics for a variable
Means | Computes descriptive statistics for a variable
Corr | Computes correlations among variables
Chart or Gchart | Produces bar charts, pie charts, etc.
Plot or Gplot | Produces scatterplots
Stat Procs
Proc | Description
Anova | Conducts an analysis of variance
Cancorr | Conducts a canonical correlation analysis
Cluster | Conducts a cluster analysis
Discrim | Conducts a discriminant analysis
Factor | Conducts a factor analysis
Freq | Conducts a chi-squared analysis
GLM | Used for complicated linear analyses (General Linear Model)
Princomp | Conducts a principal components analysis
Reg | Conducts a linear regression analysis
Score | Uses output of other stat procs to compute composite scores
Tree | Produces a dendrogram for a cluster analysis
Ttest | Conducts an independent or dependent (paired) t-test
Some useful, basic Unix commands
Remember, Unix is case sensitive. For illustration purposes only, file and directory names are written in italics below.
Command | Description
ls | lists files and subdirectories in the current directory
cd /name | navigates to the specified directory
cd name | navigates to the specified subdirectory
cd .. | navigates to the directory one level above the current directory
cd | navigates to your home directory
ps | lists the current processes you own and their process id numbers (pid)
kill -9 ##### | kills the process with the specified process id
top | lists the big jobs currently running; q quits
man command | displays the manual page for the specified command
mkdir name | creates a directory with the specified name
rmdir name | removes the specified directory (must be empty)
rm name | removes the specified file (permanently)
rm *.* | removes all files in the current directory whose names contain a period (be careful!)
cp name1 name2 | makes a copy of the first file and names it the second file
df -k | lists disk space usage for each mounted file system, in kilobytes
Other useful tips
Control-z will usually suspend a job or process that has gone bad or is not responding. Running ps to find its process id and then kill on that id eliminates the process.
As SAS is running, it creates a work directory for your job to hold temporary files, usually in a "scratch" directory. If the SAS session ends cleanly (as it usually does), the work directory is deleted. If, however, a SAS session dies a bad death, it can leave behind work files that can interfere with future SAS sessions. If that happens, you may be asked to clean up what was left behind.
If you intend to run big jobs (i.e., you're working with huge data files) you might want to check with Research Computing for advice before you start. Certain aspects of a SAS session can be customized that may help you run efficiently.
Please don't run more than one SAS session at a time; it may interfere with other users.
Permanent SAS Data Sets
1) Create a directory (i.e., folder or library) where you want to store the data files
Unix: mkdir whatever (e.g., mkdir rick_sas_perm)
N.B. A directory named sasuser.v91 is already built for you
2) In the Explorer Window, go to File/New/Library
3) Name the library (e.g., rick_sas) and enable it at startup
4) Browse to or write the path (e.g., /afs/northstar.dartmouth.edu/ufac/rbarton/rick_sas_perm)
Saving a permanent SAS data set
Method 1
1) Read your raw (i.e., text) data into SAS using the filename/infile/data commands
Example:
______________________________________________________________________________
filename in1 "~/public/minicourses/sas/hsb.dta";
data one; infile in1 missover;
input
id sex race ses sctyp hsp locus concpt mot car rdg wrtg math sci civ;
run;
______________________________________________________________________________
2) From the Tools menu, open up the Table editor
3) From the Viewtable File menu, open the data set by name from the Work library
4) From the Viewtable File menu, use Save As to save it under the name you wish in the library you wish
(e.g., Save as rick1 to the library rick_sas)
5) Close out the Viewtable window
Method 2
1) Assign a library name using a LIBNAME statement in a SAS program
2) In your data step that reads the raw data, use a two-level name (libref.filename) to name the data set
Example:
______________________________________________________________________________
filename in1 "~/public/minicourses/sas/hsb.dta";
libname rick_sas "~/rick_sas_perm";
data rick_sas.rick2; infile in1 missover;
input
id sex race ses sctyp hsp locus concpt mot car rdg wrtg math sci civ;
run;
_____________________________________________________________________________
Using SAS data sets
1) Identify the library using the LIBNAME statement
2) In data steps and procedures, use the two-level name (libref.filename) to refer to the data set
Example:
_____________________________________________________________________________
libname rick_sas "~/rick_sas_perm";
proc print data=rick_sas.rick2;
run;
______________________________________________________________________________
A note on SAS data sets
SAS data sets are unusual among statistical programs in that user-defined formats (such as value labels) are saved in a file separate from the data (a format catalog). This can cause difficulties when trying to move SAS data into other programs.
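One way to keep formats alongside a permanent data set is sketched below. It assumes the library rick_sas and the data set rick_sas.rick2 from the examples above. The "library=" option stores the format in a catalog in that library, and the "fmtsearch" system option tells SAS to look there in later sessions.

```sas
/* assumes the directory ~/rick_sas_perm and the data set rick_sas.rick2 exist */
libname rick_sas "~/rick_sas_perm";

/* store the format permanently in the catalog rick_sas.formats */
proc format library=rick_sas;
  value sexfmt 1="MALE"
               2="FEMALE";
run;

/* tell SAS to search that catalog when a format is requested */
options fmtsearch=(rick_sas);

proc freq data=rick_sas.rick2;
  tables sex;
  format sex sexfmt.;
run;
```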
SAS Help at Dartmouth College
Help: SAS has help facilities that are pretty good, with sample programs
Tutorial: There is a tutorial under the Help menu
Manuals: The library has a number of SAS manuals
Books: If you plan on working with SAS a lot, there are "how to" books out there
Other users: Sometimes they know more about SAS than I do
SAS web site: Has FAQs and support for the extremely technical questions
Research Computing web site ~rc
Richard Barton
Berry 179C
646-0255
richard.barton@dartmouth.edu