Introduction to Data Analysis: The Rules of Evidence
Introduction: What Is Data Analysis?
What is the wealth of the United States? Who's
got it? And how is it changing? What are the consequences of an experimental
drug? Does it work, or does it not, or does its effect depend on conditions?
What is the direction of the stock market? Is there a pattern? What is the
historical trend of world climate? Is there evidence of global warming?
This is a diverse lot of questions with a common element: The answers
depend, in part, on data. Human beings ask lots of questions and sometimes,
particularly in the sciences, facts help. Data analysis is a body of
methods that help to describe facts, detect patterns, develop explanations,
and test hypotheses. It is used in all of the sciences. It is used in business,
in administration, and in policy.
The numerical results provided by a data analysis are usually simple: It
finds the number that describes a typical value and it finds differences
among numbers. Data analysis finds averages, like the average income or
the average temperature, and it finds differences like the difference in
income from group to group or the differences in average temperature from
year to year. Fundamentally, the numerical answers provided by data analysis
are that simple.
But data analysis is not about numbers -- it uses them. Data analysis
is about the world, asking, always asking, "How does it work?"
And that's where data analysis gets tricky.
For example:
Between 1790 and 1990 the population of the United States increased by 245 million people, from 4 million to 249 million people.
These are the facts. But if I were to interpret these numbers and report that the population had grown at an average rate of 1.2 million people per year, 245 million people divided by 200 years, the report would be wrong. The facts would be correct and the arithmetic would be correct -- 245 million people divided by 200 years is approximately 1.2 million people per year. But the interpretation "grew at an average rate of 1.2 million people per year" would be wrong, dead wrong. The U.S. population did not grow that way, not even approximately.
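One way to see why the per-year figure misleads is to set it beside a constant percentage rate. The sketch below is illustrative only: it uses the two population figures quoted above and assumes steady compounding, a simplification that the text itself does not claim. The function and variable names are hypothetical.

```python
# Population figures quoted in the text, in millions of people.
POP_1790 = 4.0
POP_1990 = 249.0
YEARS = 1990 - 1790


def linear_increase_per_year(start, end, years):
    """Average absolute increase per year (the '1.2 million' figure)."""
    return (end - start) / years


def compound_rate_per_year(start, end, years):
    """Constant annual growth rate that would carry start to end.

    Assumption: steady compounding, used here only for illustration.
    """
    return (end / start) ** (1.0 / years) - 1.0


per_year = linear_increase_per_year(POP_1790, POP_1990, YEARS)
rate = compound_rate_per_year(POP_1790, POP_1990, YEARS)

# The same absolute increase means very different things at the two ends
# of the period: a large share of 4 million, a tiny share of 249 million.
share_of_1790 = per_year / POP_1790
share_of_1990 = per_year / POP_1990

print(f"linear increase: {per_year:.3f} million per year")
print(f"compound rate:   {rate:.2%} per year")
print(f"linear increase as a share of the 1790 population: {share_of_1790:.1%}")
print(f"linear increase as a share of the 1990 population: {share_of_1990:.2%}")
```

An increase of 1.2 million people would have been close to a third of the 1790 population but well under one percent of the 1990 population, which is one reason a single "people per year" figure describes neither end of the period.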
For example:
The average number of students per class at my university is 16.
That is a fact. It is also a fact that the average number of classmates a student will find in his or her classes is 37. That too is a fact. The numerical results are correct in both cases: both 16 and 37 are correct, even though one number is more than twice the magnitude of the other -- no tricks. But the two numbers answer two subtly different questions about how the world (my university) works, and those subtly different questions lead to large differences in the result.
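The two questions can be made concrete with a small sketch. The class sizes below are invented for illustration and are not the university's enrollment figures; the function names are hypothetical. The first function averages over classes, so each class counts once. The second averages over students, so a class of n students is counted n times.

```python
# Hypothetical class sizes, invented for illustration only.
class_sizes = [5, 10, 15, 50]


def mean_class_size(sizes):
    """Average over classes: each class counts once."""
    return sum(sizes) / len(sizes)


def mean_size_seen_by_students(sizes):
    """Average over students: a class of n is counted n times,
    once for each student sitting in it."""
    return sum(n * n for n in sizes) / sum(sizes)


print(mean_class_size(class_sizes))             # 20.0
print(mean_size_seen_by_students(class_sizes))  # 35.625
```

Most students sit in the large class, so the average taken over students is pulled toward the large class even though most classes are small.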
The tools of the trade for data analysis begin with just two ideas. Writers
begin their trade with their ABCs. Musicians begin with their scales.
Data analysts begin with lines and tables. The first of these two ideas,
the straight line, is the kind of thing I can construct on a graph using
a pencil and a ruler, the same idea I can represent algebraically by the
equation "y = mx + b". So, for example, the line constructed on
the graph in Figure 1 expresses a hypothetical relation between education,
left to right, and income, bottom to top. It says that a person with no
education has an income of $10,000 and that the rest of us have an additional
$3,000 for each year of education that is completed (a relation that may
or may not be true).

Figure 1
Hypothetical Linear Relation Between Income and Education
The hypothetical line shows an intercept, b, equal to $10,000
and a slope, which is the rise in dollars divided by the run in years, that
is equal to $3,000 per year.
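The hypothetical line can be written out directly from "y = mx + b". This is a minimal sketch using the intercept and slope stated in the text; the function names are assumptions made for the example, and the relation itself remains hypothetical.

```python
INTERCEPT = 10_000  # b: income in dollars at zero years of education
SLOPE = 3_000       # m: additional dollars per completed year of education


def predicted_income(years_of_education):
    """Income on the hypothetical line y = m*x + b."""
    return SLOPE * years_of_education + INTERCEPT


def slope_between(x1, y1, x2, y2):
    """Rise (dollars) divided by run (years) between two points."""
    return (y2 - y1) / (x2 - x1)


for years in (0, 12, 16):
    print(years, predicted_income(years))

# Any two points on the line give back the same slope.
print(slope_between(12, predicted_income(12), 16, predicted_income(16)))
```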
This first idea, the straight line, is the best tool that data analysts
have for figuring out how things work. The second idea is the table or,
more precisely, the "additive model". The first idea, the line,
is reserved for data we can plot on a graph, while this second idea, the
additive model, is used for data we organize in tables. For example, the
table in Figure 2 represents daily mean temperatures for two cities and
two dates: The two rows of the table show mean temperature for the two cities,
the two columns show mean temperatures for the two dates.
The additive model analyzes each datum, each of the quantities in the table,
into four components -- one component applying to the whole table, a second
component specific to the row, a third component specific to the column,
and a fourth component called a "residual" -- a leftover that
picks up everything else. In this example the additive model analyzes the
temperature in Phoenix in July into
1: 64.5° to establish an average for the whole table, both cities and both dates,
2: plus 7.5° above average for Phoenix, in the first row,
3: plus 21° above average for July, in the second column,
4: minus 1° as a residual to account for the difference between the data and the sum of the first three numbers.
Adding it up,
Observed equals All Effect plus Phoenix Effect plus July Effect plus Residual.
That is,
92° = 64.5° + 7.5° + 21° + (-1°)

Figure 2
Normal Daily Mean Temperatures in Degrees Fahrenheit
From the Statistical Abstract of the United States, 1987,
Table 346, from the original by the U.S. National Oceanic and Atmospheric
Administration, Climatography of the United States, No. 81, Sept., 1982.
Also note John Tukey, Exploratory Data Analysis, Addison Wesley, 1970,
p. 333.
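The decomposition can be sketched for a whole two-by-two table. Only the Phoenix/July value of 92° is quoted in the text; the other three cells below, and the labels "other city" and "other date", are assumptions chosen for this example rather than values copied from Figure 2. The sketch also assumes that the effects are computed from means, which is one common way to fit an additive model and not necessarily the method used for the figure.

```python
# Illustrative table: rows are cities, columns are dates.
# Only the Phoenix/July cell (92.0) comes from the text; the rest is assumed.
rows = ["Phoenix", "other city"]
cols = ["other date", "July"]
table = [
    [52.0, 92.0],
    [35.0, 79.0],
]


def additive_fit(table):
    """Split each cell into overall + row effect + column effect + residual,
    using means (an assumption of this sketch)."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(r) for r in table) / (n_rows * n_cols)
    row_eff = [sum(r) / n_cols - grand for r in table]
    col_eff = [
        sum(table[i][j] for i in range(n_rows)) / n_rows - grand
        for j in range(n_cols)
    ]
    resid = [
        [table[i][j] - grand - row_eff[i] - col_eff[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return grand, row_eff, col_eff, resid


grand, row_eff, col_eff, resid = additive_fit(table)
i, j = rows.index("Phoenix"), cols.index("July")
print("all effect:    ", grand)
print("Phoenix effect:", row_eff[i])
print("July effect:   ", col_eff[j])
print("residual:      ", resid[i][j])
print("sum:           ", grand + row_eff[i] + col_eff[j] + resid[i][j])
```

By construction the four components always add back to the observed value, so the interesting question is how large the residuals are relative to the effects.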
There you are, lines and tables: That is data analysis, or at least a
good beginning. So what is it that fills up books and fills up the careers
of data analysts and statisticians? Things begin to get "interesting",
that is to say, problematical, because even the best-behaved data show variance:
Measure a twenty gram weight on a scale, measure it 100 times, and you will
get a variety of answers -- same weight, same scale, but different answers.
Find out the incomes of people who have completed college and you will get
a variety of answers. Look at the temperatures in Phoenix in July, and you
will get a variety, day to day, season to season, and year to year. Variation
forces us to employ considerable care in the use of the linear model and
the additive model.
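The repeated-weighing example can be imitated with a small simulation. This is a sketch under stated assumptions: the size of the scale's error (0.05 grams) and the use of normally distributed errors are invented for illustration, and the names are hypothetical.

```python
import random
import statistics

TRUE_WEIGHT_G = 20.0  # the twenty gram weight described in the text
NOISE_SD_G = 0.05     # hypothetical scale error, assumed for illustration


def simulate_weighings(n, seed=0):
    """Return n simulated readings of the same weight on the same scale.

    Assumption: errors are independent and normally distributed.
    """
    rng = random.Random(seed)
    return [rng.gauss(TRUE_WEIGHT_G, NOISE_SD_G) for _ in range(n)]


readings = simulate_weighings(100)
print(f"lowest reading:  {min(readings):.3f}")
print(f"highest reading: {max(readings):.3f}")
print(f"mean:            {statistics.mean(readings):.3f}")
print(f"std deviation:   {statistics.stdev(readings):.3f}")
```

The readings differ from one another even though nothing about the weight or the scale has changed, which is the situation the linear and additive models have to cope with.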
And life gets worse -- or more interesting: The truth is that lots of things
just are not linear. Adding one more year of elementary school, increasing
a person's years of education from five to six, doesn't really have the
same impact on income as adding one more year of college, increasing a person's
years of education from fifteen to sixteen -- while completing a college
degree. So the number of dollars gained for each extra year of education
is not constant -- which means that, often, the linear model doesn't work
in its simplest form, not even when you allow for variation. And with tables
of numbers, the additive model doesn't always add up to something that is
useful.
So what do we do with a difficult problem? This may be the single most important
thing we teach in data analysis: Common sense would tell you that you
tackle a difficult problem with a difficult technique. Common sense would
also tell you that the best data analyst is the one with the largest collection
of difficult "high powered" techniques. But common sense is wrong
on both points: In data analysis the real "trick" is to simplify
the problem, and the best data analyst is the one who gets the job done,
and done well, with the simplest methods.
Data analysts do not build more complicated techniques for more complicated
problems -- not if we can help it. For example, what would we do with the
numbers graphed in Figure 3? Here the numbers double at each step, doubling
from 1, to 2, to 4, to 8, which is certainly not the pattern of a straight
line. In this example the trick is to simplify the problem by using logarithms
or the logarithmic graph paper shown in Figure 4 so that, now, we can get
the job done with simple methods. Now, on this new graph, the progression,
1, 2, 4, 8,... is a straight line.

Figure 3
Figure 4
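The effect of the logarithm on the doubling sequence can be checked numerically. In this sketch the choice of base 2 is an assumption made for convenience; any base gives equally spaced values, which is what a straight line on logarithmic paper means.

```python
import math

values = [1, 2, 4, 8]  # the doubling sequence from the text


def successive_differences(seq):
    """Differences between neighbours; constant for a straight line."""
    return [b - a for a, b in zip(seq, seq[1:])]


logs = [math.log2(v) for v in values]

print(successive_differences(values))  # [1, 2, 4]
print(logs)                            # [0.0, 1.0, 2.0, 3.0]
print(successive_differences(logs))    # [1.0, 1.0, 1.0]
```

The raw steps keep growing, while the steps between the logarithms are all the same size, so the simple linear model applies once the data are re-expressed.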
And what are the Rules of data analysis? Some of the rules are
clear and easy to state, but these are rather like the clear and easy rules
of writing: Very specific and not very helpful -- the equivalent of reminders
to dot your "i's" and cross your "t's". The real rules,
the important ones, exist but there is no list -- only broad strategies
with respect to which the tactics must be improvised. Nevertheless it
is possible to at least name some of these "rules." I'll try the
list from different angles. So:
That circle of three rules describes one of the constant practices of
analysis, cycling between the central tendencies and the exceptions as you
revise the ideas that are guiding your analysis. From another angle, a
second theme that organizes the rules of evidence
can be introduced by three key words: falsifiability, validity, and parsimony.
I will be specific about the more easily specified rules of data analysis. But make no mistake, it is these broad and not-well-specified principles that generate the specific rules we follow: Think about the data. Look for the central tendency. Look for the variation. Strive for falsifiability, validity, and parsimony. Perhaps the most powerful rule is the first one, "Think". The data are telling us something about the real world, but what? Think about the world behind the numbers and let good sense and reason guide the analysis.
Reading: Stephen D. Berkowitz, Introduction to Structural Analysis, Chapter 1, "What is Structural Analysis," Butterworths, Toronto, 1982; revised edition forthcoming, Westview, Denver, circa 1995. Stephen J. Gould, "The Median Isn't the Message," Discover, 19__. Charles S. Peirce, "The Fixation of Belief," reprinted in Bronstein, Krikorian, and Wiener, The Basic Problems of Philosophy, 1955, Prentice Hall, pp. 40-50. Original, Popular Science Monthly, 1877.