written by Aimee Chin
February 7, 2000
Stata is a statistical analysis software package. Stata will be needed to
complete the empirical exercises in the problem sets.
This handout provides an introduction to Stata. It consists of five parts:
A. The Three Components of Your Stata Session
File | What is it? | Where does it come from? | Hand in with problem set? |
*.dta | The "input" file. This is the Stata data file. | You need to download the data files for each problem set from the course website. | NO |
*.do | The "program" file that acts upon the "input" file. This is a text file containing a list of Stata commands. Save your program as a text file with a "do" extension. Then, to run a program called filename.do, type "do filename" at the Stata prompt. | You write it. I usually use emacs. I have an xterm window with my do-file open in emacs, and another xterm window with Stata. | YES -- but only the FINAL, working version. |
*.log | The "output" file. This file echoes whatever appears on screen in ASCII text format. | You ask Stata to echo a session by typing "log using filename" and Stata automatically names it filename.log. When you are done, you type "log close" and the log file will be ready for your editing, printing, etc. I usually ask Stata to open (and close) a log file in my do-file. | NOT IN ENTIRETY -- edit your log file and only hand in the parts that are directly relevant for answering the problem set questions. |
Although it is possible to use Stata interactively (i.e., you enter the
command at the Stata prompt, Stata performs it, you enter another command,
etc.), in this course you will be required to write Stata do-files. The
advantage of writing a do-file is that you do not have to type the same commands
again and again before you get the correct sequence of commands. Also, if you do
not complete your problem set in a single session, you can easily pick up where
you left off.
Note that Stata will stop running at a line that it cannot execute. When a
Stata do-file stops running in the middle, you will need to fix your do-file. In
your text editor, edit the part of the program that Stata stopped at. Run the
program again. When Stata executes the whole program, you will have a clean log
file.
B. The Most Basic Commands: getting into Stata, getting help, opening a
data set, getting out of Stata
1. Getting Started
To make Stata available, at the Athena prompt, type
add stata
and then start Stata by typing
stata
2. Tutorial and Help
Stata provides on-line help. For a menu of choices, type
help
and press Enter. You can obtain help on any command in Stata by typing help followed by the command's name. For example, to learn about the sort command, type
help sort
Stata includes an on-line tutorial which will help you learn about Stata. To run the tutorial type
tutorial intro
3. Opening a Data Set
You can call up a Stata data set for use by typing
use filename
Stata will automatically look for filename.dta in your current directory (if the file is located elsewhere, or if the file extension is different, then you will need to specify the path or extension). If there is something already in memory, you need to first type
clear
All the contents in the current Stata workspace will be erased, and Stata
will be ready to load up a new data set.
I will always provide you with datasets in Stata format already. However, in the future, you may encounter data of different formats (e.g., ASCII, Excel spreadsheets) or you may have to input data yourself. You can find more information about this by typing
help infile
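For a small data set that you type in yourself, one option is Stata's input command. The following is a minimal sketch; the variable names and values are made up for illustration and do not come from any course data set.

```stata
* Hypothetical variables and values, for illustration only
clear
input year str2 state wage
1990 "MA" 10.50
1991 "MA" 11.25
1991 "NY" 12.00
end

* Check what was entered
list
describe
```

The "str2" before state tells Stata that state is a string variable of length 2; year and wage are read as numbers.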
Suppose you have made changes to your data set and wish to save the changes. Type
save filename
and Stata will name it filename.dta. If filename.dta already exists, then Stata will not perform the save. You can either specify a new file name or, to overwrite the old data file, type
save filename, replace
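The use and save commands above can be sketched together as follows. This assumes a version of Stata (8 or later) in which "sysuse auto" loads the example data set shipped with Stata; auto_modified is a hypothetical file name chosen for this illustration.

```stata
* Assumes Stata 8 or later for "sysuse"; auto_modified is a hypothetical name
clear
sysuse auto

* Make a change, then save under a new name (overwriting any earlier copy)
generate price_k = price / 1000
save auto_modified, replace

* The ", clear" option on use erases memory and loads the file in one step
use auto_modified, clear
describe price_k
```

Saving under a new name leaves the original data file untouched, which is usually the safer choice.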
4. Exiting Stata
To exit Stata, type
exit
Stata will not let you exit if there is unsaved data. If you don't wish to save the modified data set, just type
exit, clear
C. The Structure of a Typical Stata do-file
Here is the structure of a typical Stata do-file:
***
cap log close
set more 1
clear
cd ~/your_directory/your_14.31_subdirectory
log using filename, replace
use filename
<insert stata commands here>
log close
***
The first line closes any log files that you might have accidentally left
open.
Line 2 tells Stata not to wait for keyboard input before displaying the next
screen of output -- you will not be able to read what Stata is doing as it
scrolls by, but you can read the output in the log file.
Line 3 tells Stata to erase everything in the current workspace memory. A do
file containing a command to open up a data file cannot be executed if there is
something in current memory that has not been saved.
Line 4 tells Stata the default location of files to be used and files to be
created.
Line 5 tells Stata to start a log file named filename.log to echo the
session. Appending ", replace" overwrites any existing log file of the same
name. Line 8 closes this log file.
Line 6 opens up a Stata data file named filename.dta.
Line 7 is the meaty part of the program, where you issue the commands for
Stata to perform with the data. You can learn about specific commands in the
next section of this handout.
D. An Example of a Stata do-file, including Stata Commands for Getting
Descriptive Statistics and Running Regressions
Here is a more detailed example of a Stata do-file, complete with some
commands you will likely be using. Note that the comments in parentheses are my
comments to you; they are NOT part of the Stata do-file.
***
/* STARTING THE SESSION */
cap log close
set more 1
clear
log using sample, replace
* Sample do file
(A star at the beginning of a line indicates a comment; Stata
ignores this line because there is nothing to execute)
/* Created by Aimee Chin on 2/7/00 */
(You can make comments in places other than the beginning of
the line by using "/* comment */" )
use data
/* COMMANDS FOR DESCRIBING THE DATA */
describe
(Gives a description of your data file, including variable
names, any variable labels, and the way the data are sorted.)
summarize var1 var2
(Gives the summary statistics, including mean and s.d., of the
variables specified. If you just type "summarize", the summary statistics for
ALL variables will be given. Append ", detail" and more details about the
specified variable will be given.)
tabulate var1
(Tabulates the values of a categorical variable. You can do a
cross-tabulation by typing "tabulate var1 var2".)
correlate var1 var2
(Computes the correlation between var1 and var2. You can
specify more variables, and the whole correlation matrix will be
displayed.)
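correlate var1 var2, covariance
(Computes the covariance matrix instead of the correlation matrix.
Covariance is an option of the correlate command, not a separate command.)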
list var1 var2
(Lists var1 and var2 of each observation. If you just type
"list", all variables for each observation will be listed -- basically the raw
data.)
/* COMMANDS FOR QUALIFYING THE DATA */
/* Sometimes, you are only interested in a part of your sample
meeting certain qualifications, e.g., observations after 1990 or individuals
living in Massachusetts. Here are ways you can run the commands for a
subsample.*/
list in 1/5
(Lists the first five observations. The "in" qualifier requires
you to know the observation numbers of the observations you are interested in.
Observation numbers can change when you sort and re-sort the data, so be careful
when you rely on "in".)
summarize if year > 1990
(The "if" qualifier is very powerful. I would advise you to
type "help if" to learn more.)
sort year
(Sorts all the observations by year.)
by year: summarize
(The sort, by combination is also very useful. If you only had two years, you could alternatively type "summarize if year==1990" and "summarize if year==1991", but the sort, by combination saves work when there are more years. You can sort by more than one variable and use by for more than one variable.)
/* You may put as many conditions as you want using and
("&") and or ("|"). Consider the following examples. */
keep if year > 1990 & state == "MA"
(Keeps only the post-1990 observations in which the individual
lives in Massachusetts; removes the rest of the observations.)
keep if year > 1990 & (state == "MA" | state == "NY")
(Keeps only the post-1990 observations in which the individual
lives in Massachusetts or in New York. This is different from the following
example.)
keep if year > 1990 & state == "MA" | state == "NY"
(Keeps the observations in which: (1) post-1990 and live in MA
or (2) live in NY. This is different from the previous example.)
/* COMMANDS FOR MANIPULATING THE DATA */
generate newvar = something
(Generates a new variable called newvar which is whatever you
specify. It can be a function of existing variables.)
replace var1 = something if blah == 1
(Replaces existing value of var1 with something if blah is
equal to 1. Something may be an expression. Replace can also be used without the
"if" qualifier. Notice there's a distinction between "=" and "==" throughout
Stata.)
replace varname = 1 in 100/120
(Sets the variable varname to 1 for observations 100 to
120.)
drop var1
(Removes var1 from your dataset.)
label variable newvar "description of newvar"
(Gives newvar a label so when you use the describe command, you
will be reminded of what newvar is.)
/* COMMANDS FOR GRAPHING THE DATA */
graph y x, saving(filename, replace)
(Stata prints the graph to a file called filename.gph. There
are many options for making your graph look pretty, including axis labels and
such, that you can learn about by typing "help graph".)
gphpen filename.gph
(Stata now writes the graph saved in filename.gph to a
postscript file named filename.ps that you can print by typing "lpr
-Pprintername filename.ps" at the Athena prompt.)
/* COMMANDS FOR REGRESSIONS */
regress y var1 var2
(This computes the ordinary least squares estimates. y is the dependent variable, all others are independent variables. Stata automatically includes a constant in the regression unless you type ",noconstant" after the command.)
predict yhat
(Creates a variable yhat that contains the predicted values of
y based on the regression just run.)
predict ehat, resid
(Creates a variable ehat that contains the residual (equals y
minus yhat) based on the regression just run.)
/* WRAPPING UP THE SESSION */
save data, replace
(You may or may not want to save changes to data.dta. If you've
generated a lot of new variables and expect to use these variables again,
perhaps you should save it. But if you only plan to use data.dta in the context
of this do file, then I wouldn't bother saving it.)
log close
(You can now find your completed masterpiece, sample.log, in the
current directory or the directory you specified. You can view and edit it using
a text editor.)
set more 0
(This turns back on pausing after a screen's worth of
information is displayed.)
***
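The do-file above uses placeholder names (data, var1, y, and so on), so it will not run as written. The following is a minimal runnable sketch of the same kinds of commands. It assumes a version of Stata (8 or later) in which "sysuse auto" loads the example data set shipped with Stata. The variables price, mpg, weight, foreign and make belong to that data set, and price_k, yhat and ehat are names made up for this illustration.

```stata
* Hypothetical worked example; assumes Stata 8 or later for "sysuse"
set more off
clear
sysuse auto

* Describing the data
describe
summarize price mpg, detail
tabulate foreign
correlate price mpg weight
correlate price mpg weight, covariance

* Qualifying the data
list make price in 1/5
summarize price if mpg > 25
sort foreign
by foreign: summarize price mpg

* "&" binds more tightly than "|", so these two counts agree
count if price > 10000 & foreign == 1 | mpg > 30
local n_noparen = r(N)
count if (price > 10000 & foreign == 1) | mpg > 30
local n_paren = r(N)
display "without parentheses: `n_noparen'   with parentheses: `n_paren'"

* Manipulating the data
generate price_k = price / 1000
label variable price_k "Price in thousands of dollars"

* Regression, fitted values and residuals (stored in double precision)
regress price mpg weight
predict double yhat
predict double ehat, residuals
summarize ehat
```

Because the regression includes a constant, the mean of ehat reported by the last command should be zero up to rounding error, which is a quick check that the predict commands did what you expected.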
E. A Few Tips