CHANCE News 6.07

(11 May 1997 to 7 June 1997)


Prepared by J. Laurie Snell and Bill Peterson, with help from Fuxing Hou, Ma.Katrina Munoz Dy, Kathryn Greer, and Joan Snell, as part of the Chance Course Project supported by the National Science Foundation.

Please send comments and suggestions for articles to jlsnell@dartmouth.edu.

Back issues of Chance News and other materials for teaching a Chance course are available from the Chance web site:



A chi-square is a mysterious thing, a bit like baking soda to an amateur cook.
Maura McDermott, a student in Gudmund Iversen's statistics class.



Jim Baumgartner suggested the following article.

The sports taboo: why blacks are like boys and whites are like girls.
The New Yorker, 19 May 1997, pp. 50-55
Malcolm Gladwell

Gladwell does not accept the taboo against talking about racial differences in athletic ability. He points out that there is no such taboo on talking about medical differences, such as the fact that blacks have a higher incidence of hypertension than whites, and that black males have twice the incidence of diabetes and prostate cancer of white males. You might say the taboo does not apply because medical differences can be attributed to genetic differences. But Gladwell argues that the superior athletic ability of blacks may also be explained by genetic differences.

Gladwell says that there are more men than women at the top on standardized tests because the variance of men's scores is higher. There are also more men than women at the bottom. But, on average, men are not smarter than women. (See Chance News 4.10.) Gladwell suggests that the fact that the very top athletes tend to be black could likewise be because the variance of blacks' athletic abilities is greater than that of whites', while, on average, the athletic abilities of blacks and whites are about the same.
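The variance argument is easy to illustrate with a small simulation. The numbers below are hypothetical; the only point is that the two groups have exactly the same mean ability, yet the higher-variance group dominates the extreme upper tail:

```python
import random

random.seed(1)

# Two hypothetical groups with the SAME mean ability but different spread.
group_a = [random.gauss(100, 15) for _ in range(100000)]  # higher variance
group_b = [random.gauss(100, 10) for _ in range(100000)]  # lower variance

# Who occupies the extreme top? Use the 99.9th percentile of pooled scores.
pooled = sorted(group_a + group_b)
cutoff = pooled[int(0.999 * len(pooled))]

top_a = sum(s > cutoff for s in group_a)
top_b = sum(s > cutoff for s in group_b)
print(f"Above cutoff: group A {top_a}, group B {top_b}")
# With equal means, the higher-variance group supplies most of the extreme tail
# (and, symmetrically, most of the extreme bottom as well).
```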

Gladwell has no trouble supporting his claim that top athletes tend to be black: for example, in track, if you look at the 20 fastest times recorded last year for the various distances run, you find that at least 15, and in some cases all 20, were achieved by Africans. African Americans will make up 80% of the players on the floor in the N.B.A. basketball playoffs. Obviously, it is more difficult to settle the question of who is at the bottom of athletic ability.

Gladwell suggests another parallel between gender and race. Psychologists say that if a boy does well on a math exam, he is likely to attribute this to his fine ability in math. If he does poorly, he may, for example, blame the teacher. If a girl does well she is apt to attribute this to hard work. If she does poorly it is because she is really not very good at math. A similar stereotype appears in sports. If a black is a superstar basketball player, it is because of his natural talent. But if a white is a superstar, it comes from exceptionally hard work. If a black has a bad night, the referee made some bad calls. But if a white has a bad night, he just did not prepare enough for the game.

Gladwell remarks that these stereotypes are self-fulfilling, discouraging women from trying to excel at mathematics and whites from trying to excel at sports.


(1) Sports Illustrated described Steve Kerr, who plays alongside Michael Jordan for the Chicago Bulls, as a "hard-working over-achiever," distinguished by his "work ethic and heady play" and by a shooting style "born of a million practice shots." Do you think that Kerr works harder at perfecting his game than Michael Jordan?

(2) How would you design an experiment to test the hypothesis that blacks have a larger variance in athletic ability but about the same average ability as whites?

(3) In a study reported in Science, 7/7/1995, Hedges and Nowell analyzed data from the National Assessment of Educational Progress program. They found that the variance of the scores for men was larger in all four areas: reading, mathematics, science, and writing. On average, men did better in math and science and women in reading and writing. Would you expect, under Gladwell's theory, differences in average athletic ability in different sports?

(4) Does it seem reasonable to assume that genetic differences might affect the variance but not the average in something like mathematical or athletic ability?

While we are on the subject of differences between men and women, here is another interesting study.

Swedish study finds sex bias in getting science jobs.
The New York Times, 22 May, 1997, A25
Lawrence K. Altman

Nepotism and sexism in peer-review.
Nature, Vol. 387, 22 May 1997, pp. 341-343
Christine Wenneras and Agnes Wold

It is often stated that women have to be twice as good as men to succeed. Well, at least in getting a post-doctoral fellowship under the peer-review system of the Swedish Medical Research Council (MRC) this is approximately correct. More precisely, women have to be 2.5 times as good.

There has long been anecdotal evidence of biases in the peer-review system but, evidently, no one before had thought of looking at the data. Researchers Christine Wenneras and Agnes Wold decided to do so when they had the rare opportunity, under Sweden's Freedom of the Press Act, to obtain the data behind the MRC's reviews of the 114 post-doctoral applications for the year 1995.

Each application is considered by one of 11 committees. Within each committee, applications are read by five reviewers, who score them from 0 to 4 on three criteria: scientific competence, relevance of the research proposal, and quality of the proposed methodology. The three scores are multiplied, giving a score between 0 and 64, and these products are averaged over the five reviewers. This average determines the applicant's rank within the committee. The committee then submits 1 to 3 applicants to the Swedish Medical Research Council (MRC) (made up of one member from each committee), which makes the final choice of awards. In 1995 there were 114 applicants for 20 awards: 62 of the applicants were men and 52 were women. Four awards were made to women and 16 to men.
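The scoring scheme is easy to make concrete. The reviewer scores below are invented for illustration; only the arithmetic (multiply the three criteria, average over the five reviewers) follows the scheme described above:

```python
# Each of 5 reviewers gives three 0-4 scores:
# (competence, relevance, methodology).
reviews = [
    (3, 4, 3),  # reviewer 1
    (4, 3, 4),  # reviewer 2
    (3, 3, 3),  # reviewer 3
    (4, 4, 3),  # reviewer 4
    (2, 4, 4),  # reviewer 5
]

# Multiply the three scores (each product lies between 0 and 64),
# then average the products over the five reviewers.
products = [c * r * m for c, r, m in reviews]
applicant_score = sum(products) / len(products)
print(products, applicant_score)
```

Note that multiplying rather than adding makes a single low score drag the product down sharply, which is one reason discussion question (1) below is worth pondering.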

Wenneras and Wold established a measure of scientific productivity for each applicant, independent of that used by the evaluation committees. Their measure took into account the number of publications and, for each publication, whether the applicant was the first author, the quality of the journal, and how often the article was cited. Using this measure, they found that only the most productive women (those scoring 100 points or more) were rated by the MRC reviewers as competent as men, and even then only as competent as the least productive men (those scoring fewer than 20 points).

To see why women were scored so badly by the MRC reviewers, Wenneras and Wold carried out a multiple regression to find factors that exerted a significant influence on the competence scores given by the reviewers. They concluded that a woman had to have about 67 more points on their measure of scientific productivity to earn the same score as a man. This could be achieved by publishing three more papers in one of the most prestigious general science journals, such as Science or Nature, or 20 more papers at the next level, a top journal in a specific field.

The researchers remark: "Considering that the mean total impact of this cohort of applicants was 40 points, a female applicant had to be 2.5 times more productive than the average male applicant to receive the same competence scores by the calculation (40+64)/40 = 2.6."

A bonus approximately equal to this male bonus came from knowing a member of the committee. A woman who did not know anyone on the committee faced a handicap that scientific productivity alone could hardly overcome.


(1) What do you think about multiplying the three scores between 0 and 4 (scientific competence, relevance of the research proposal, and quality of the proposed methodology) to get a single score for an applicant?

(2) The New York Times article states that "Dr. Jaleh Daie, the president of the Association for Women in Science in Washington, said that, despite imperfections in the peer review system in the United States, she would not favor opening the confidential scores compiled by anonymous reviewers." Why do you think she would object to the kind of review carried out by Wenneras and Wold?

(3) There were not enough women on the Swedish review committees to determine whether women rated women candidates significantly differently than men did. Do you think the women would have fared better with more women evaluators? In particular, do you think the solution might be as simple as assuring that women are well represented on the review committees?

(4) The United Nations has recently said that Sweden is the leading country in the world with respect to equal opportunities for men and women. Does this suggest that the United Nations is just wrong or, as the authors suggest, that similar problems might exist in other countries?

(5) Do you understand the calculation showing that women have to be 2.5 times more productive than the average man to get a fellowship?

The Bible Code.
Simon & Schuster, N.Y., 1997
Michael Drosnin

In Chance News 4.15, we discussed an article by Witztum, Rips and Rosenberg (WR and R) published in "Statistical Science" (1994, Vol. 9, No. 3, pp. 429-438). In this article, the authors claim to show that the Hebrew version of Genesis contains information about events that occurred long after the Bible was written that cannot be accounted for by chance.

Orthodox Jews believe that the Torah, consisting of the first five books of the Bible, represents the word of God unmediated by human beings. This article has been widely quoted as providing evidence for this belief.

For their study, WR and R chose two lists of names of prominent Rabbis, born thousands of years after the Bible was written. There were 34 Rabbis in their first list and 32 in the second. With each of these names they associated a date representing the date of birth and/or death. Dates, in Hebrew, are written using letters only, no numbers. Actually each Rabbi's name and date were represented by WR and R in more than one way, corresponding to the different ways they might occur in scholarly books.

WR and R represented the book of Genesis as one long string of Hebrew characters with no spaces. A word was said to be "found" in Genesis if its letters appeared equispaced within this string.
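The search itself is easy to sketch. The toy version below is ours, not WR and R's actual program, and works in English rather than Hebrew; it simply checks every starting position and every skip up to a cap:

```python
def find_els(text, word, max_skip=50):
    """Return (start, skip) pairs where word's letters appear equispaced in text."""
    text = "".join(ch for ch in text.lower() if ch.isalpha())  # one long string, no spaces
    word = word.lower()
    hits = []
    for skip in range(1, max_skip + 1):
        for start in range(len(text) - (len(word) - 1) * skip):
            if all(text[start + i * skip] == word[i] for i in range(len(word))):
                hits.append((start, skip))
    return hits

# Toy example: "cat" hidden with skip 2 in a made-up string.
print(find_els("xcxaxtx", "cat"))  # -> [(1, 2)]
```

With a long enough text and short enough words, hits of this kind are essentially guaranteed by chance alone, which is why the whole argument turns on the distance statistics rather than on mere "findings."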

The two lists of Rabbis were tested separately. For each test, a computer program searched for the Rabbi's names and dates in Genesis. The authors defined an overall measure, Omega, of the distance of the names of the Rabbis from their dates. To test for significance, they computed Omega for a large number of permutations of the names. They found, from this, that it was very unlikely to get a value of Omega as small as that obtained for the original order of the names.
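The permutation test can be sketched in miniature. The distance numbers below are invented stand-ins for whatever name-to-date distance measure is used; Omega is the total distance under a given pairing of names with dates:

```python
import random

random.seed(0)

# Hypothetical "distance" between each name (row) and each date (column);
# smaller means closer. The true name-date pairing is the diagonal.
dist = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.1],
]

def omega(pairing):
    """Aggregate distance when name i is paired with date pairing[i]."""
    return sum(dist[i][pairing[i]] for i in range(len(pairing)))

observed = omega([0, 1, 2])  # Omega for the actual pairing

# Permutation test: how often does a random re-pairing do as well or better?
n_trials = 10000
count = 0
pairing = [0, 1, 2]
for _ in range(n_trials):
    random.shuffle(pairing)
    if omega(pairing) <= observed:
        count += 1

print(f"observed Omega = {observed:.2f}, permutation p-value ~ {count / n_trials:.3f}")
```

In this toy case only 1 of the 6 possible pairings does as well as the observed one, so the estimated p-value hovers near 1/6; WR and R's claim was that for their data the corresponding fraction was extremely small.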

In accepting the paper, the editors of "Statistical Science" commented that the referees doubted the conclusions of the study but could not find anything wrong with the statistical analyses. They were publishing it for others to try to discover what might be going on.

For the first 2 to 3 years after the paper's publication, there was no formal scientific response, though religious groups used the results as new evidence of divine intervention and, of course, there were a number of lively discussions of the work on the web.

The editors could hardly have anticipated that the response to this article would be a best-seller and possibly a movie. However, when you think about the current public attitude towards science, this response is not such an unlikely event.

Drosnin is a writer best known for his best seller, "Citizen Hughes." He became interested in the WR and R study and learned from Rips how to do his own search for hidden messages. Drosnin then did what is called "data mining", looking for important modern events encoded in Genesis. He found Rabin's name and, next to his name, the phrase "assassin that will assassinate". He sent a message to Rabin, telling him of this finding. A year later, when Rabin was assassinated, Drosnin says he became a believer.

Maya Bar-Hillel tells us that the Hebrew for "assassin that will assassinate" is exactly the same as "assassin that will be assassinated" and remarks: "while Drosnin has the ears of the world, I suggest he should use his fame to warn the prison authorities in Israel that Igal Amir (Rabin's assassinator) is in danger for his life!"

Drosnin shows that it is possible to find just about every great moment in recent history including the Gulf War, Watergate, the collision of the comet Shoemaker-Levy with Jupiter, Clinton's election, the Holocaust, etc. He shows us evidence for the final atomic war in the year 2000 or 2006. He covers himself slightly by saying that future events are really probabilities.

While serious study of the WR and R results was slow in coming, Maya Bar-Hillel and Dror Bar-Natan at the Hebrew University, Brendan McKay at the Australian National University, and Arie Levitan and Alec Gindis have now accepted the challenge of the Statistical Science editors and examined the WR and R results closely.

Maya recently visited Dartmouth and talked about this work. She remarked that anyone who understands Hebrew realizes that, along the way, many rather arbitrary decisions had to be made. For example, as in English, there are various ways in Hebrew that Rabbis are addressed, and different forms for their names as well as for the dates of their birth or death. There are many ways that the distance between words could be measured. Maya reported that there is a large (and growing) list of felicitous choices WR and R made, many of which would have detracted from the final result had they been made differently (even though, a priori, you would think it shouldn't matter). It is all too common for experiments to be designed, consciously or unconsciously, by making those choices which produce the results being looked for.

Maya remarked that WR and R were happy to try the same experiment with works where you would not expect to find the names of the Rabbis. For example, they did their experiment on the Hebrew version of "War and Peace" and found nothing significant there. But they were less willing to carry out experiments that would replicate their discovery. They did do one replication, or cross validation, using their second list of Rabbis. They again found significant results. However, they declined to do the more convincing replication of using one of the other four books in the Torah. Since all five of these books are believed to be the direct words of God, similar evidence of divine writing would be expected in the other four. When Maya and her colleagues tried this replication, using the same words used by WR and R, no significant results were found.

McKay and Bar-Natan are preparing a paper for publication that reports the results of a series of experiments that attempted to replicate the WR and R experiment. One was only a slight variation of the WR and R experiment. In a more substantial variation, they replaced the date associated with each Rabbi by the name of the Rabbi's most famous book. All versions of their experiments were carried out for each of the five books of the Torah. They used the same method for computing distances and determining significance, and also another method suggested by Persi Diaconis. In none of their experiments could they find anything that could not be attributed to chance. You can see a detailed description of these experiments and their results at: Report on new ELS tests of Torah

Maya reports that, while the original WR and R list found nothing significant in "War and Peace," a list identical to the original except for some playing around with the choice of names and titles (in a manner that would probably be undetectable to us, and possibly even to experts in the field) did yield significance values in "War and Peace" that match or surpass those reported by WR and R in their original article.

McKay and his colleagues have also provided amusing examples of what can be done by the type of data mining employed by Drosnin. A computer search in the "Law of the Sea Treaty" found such phrases as: "Hear all the law of the sea" and "safe UN ocean convention to enclose tuna" which, as usual, if predicted in advance would have been difficult to attribute to chance. Also they found 59 words related to Chanukah "closely clustered" in a small segment of the Hebrew version of "War and Peace." You can find a discussion of this experiment at: Astounding Discoveries in War and Peace!


(1) Brendan McKay says: "Sorry Maya, but your theory is quite wrong. The code is referring to the intending assassin being assassinated, i.e. that Amir would be killed before carrying out his deed." Who do you think got it right? What other interpretation can you give to this phrase?

(2) Professor Rips has made the following public statement:

I have seen Michael Drosnin's book The Bible Code. There is indeed serious scientific research being conducted with regard to the Bible codes. I do not support Mr Drosnin's work on the codes, or the conclusions he derives. The book gives the impression that I have done joint work with Mr Drosnin. This is not true. There is an impression that I was involved in finding the code relating to Prime Minister Yitzhak Rabin's assassination. I did witness in 1994 Mr Drosnin 'predict' the assassination of Prime Minister Rabin. For me, it was a catalyst to ask whether we can, from a scientific point of view, attempt to use the codes to predict future events. After much thought, my categorical answer is no. All attempts to extract messages from Torah codes, or to make predictions based on them, are futile and are of no value. This is not only my own opinion, but the opinion of every scientist who has been involved in serious codes research.

If there are divine words in the Bible, why should they not be able to be used to predict future events? Why did Rips have to think about it so hard if it is the opinion of every scientist involved in serious codes research?

(3) It often appears that believers in ESP are able to get significant results while non-believers, carrying out the same experiments, fail to get significant results. Do you think this will be the case with Bible codes?

(4) Drosnin gives the impression that he feels that God has encoded an unlimited amount of knowledge in Genesis. Can you see how God might have done this using only finitely many words?

(5) If I showed you that a minor change in one of the subjective decisions made by the experimenters in the WR and R study would make their results not statistically significant, would you reject the results of their study? Would you be influenced by the number of minor changes I had to look at to find a change where the results would not be significant?

Michael Olinick sent us a number of interesting web references to the Bible code story. Among them was the following amusing proof that kiddie TV character Barney is actually the devil:

Barney wants to corrupt our children!

To prove: Barney is satanic

        The Romans had no letter 'U', and used 'V' instead for
        printing, meaning the Roman representation for
        Barney would be: CVTE PVRPLE DINOSAVR


        Extracting the Roman numerals, we have:

        CV    V  L  DI    V

        And their decimal equivalents are:

        CV    V  L  DI    V
       / |    |  |  | \   |
     100 5    5 50 500 1  5

        Adding those numbers produces: 666.

        666 is the number of the Beast.
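The arithmetic is easy to check mechanically, by summing the values of whatever Roman numerals happen to appear in the phrase:

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

phrase = "CUTE PURPLE DINOSAUR".replace("U", "V")  # Roman spelling
total = sum(ROMAN.get(ch, 0) for ch in phrase)     # non-numerals count as 0
print(phrase, "->", total)  # -> CVTE PVRPLE DINOSAVR -> 666
```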


Michael points out that the same strategy works (modulo M) for


   L V I     LL    D  [M]V        V   
  /  | |     |\    |     |        |
50   5 1    50 50  500   5        5

Adding these numbers again gives 666. 

Skeptic in war on cancer accepts decline in death rate.
The New York Times, 29 May, 1997, B10

Cancer undefeated.
New England Journal of Medicine, 29 May 1997, 1569-1574
J. C. Bailar III, H. L. Gornik

In 1971, the Government declared war on cancer, at a cost since of $30 billion. In 1986 Bailar caused a flurry by reporting that cancer rates continued to increase despite all this effort. Now, in this NEJM article, Bailar and Gornik study the overall cancer death rate from 1970 through 1994. They find that this rate continued to increase until 1991 and has decreased about 1% a year since then. They look at the changes in death rates for specific kinds of cancer and the reasons for these changes.

While they acknowledge that progress in treatment has been made for certain specific cancers (especially childhood cancers), they find that the biggest gains have come from prevention and early detection, such as decreased smoking and the introduction of mammography screening. They recommend, therefore, that the government put a larger proportion of its resources into developing new methods of preventing cancer. They feel that this is where the battle can be won.

The NEJM article discusses the problem of measuring the cancer rate in a way that makes comparisons across a period of years meaningful. The authors argue that the incidence rate is affected by too many things that vary in time and are hard to measure. For example, a new method of detecting a particular cancer can suggest a rise in the number of cases when, in fact, nothing about the underlying disease has changed. This occurred when the incidence of prostate cancer doubled between 1974 and 1990, after a new test for prostate cancer was introduced and promoted. There was, however, no noticeable change in mortality from this cancer during this period.

While Bailar and Gornik argue that the death rate is the most meaningful quantity to look at when considering trends, even this rate must be computed with care. Because the numbers of people in different age groups change through the years, it is necessary to use age-specific death rates. These are simple ratios of the number of deaths to population size: the numerators are the numbers of deaths from a specific cancer, or group of cancers, within specific age ranges and demographic groups; the denominators are the corresponding U.S. population sizes as estimated by the Census. The age-adjusted rate is then determined by weighting these age-specific death rates by the proportion of the population in each age group in a particular reference year. The authors chose 1990 as their reference year. This allows you to compare death rates over a period of years, assuming a population distribution similar to that of 1990. The choice of the reference year can make a significant difference in interpreting the results. (See the discussion question.)
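A minimal sketch of this age adjustment, with invented age-specific rates and reference-year weights:

```python
# Hypothetical age-specific cancer death rates (deaths per 100,000) in two years,
# and the proportion of the reference-year population in each age group.
rates_year1 = {"0-39": 10, "40-64": 120, "65+": 900}
rates_year2 = {"0-39": 8,  "40-64": 110, "65+": 950}
reference_weights = {"0-39": 0.55, "40-64": 0.30, "65+": 0.15}

def age_adjusted(rates, weights):
    """Weight each age-specific rate by the reference-year population proportion."""
    return sum(rates[group] * weights[group] for group in rates)

print(age_adjusted(rates_year1, reference_weights))
print(age_adjusted(rates_year2, reference_weights))
# The two adjusted rates answer: "what would the death rate have been
# if both years had had the reference year's age distribution?"
```

Changing `reference_weights` to those of a younger population (say, 1940's) changes both adjusted rates and can change which year looks better, which is exactly the point of the discussion question below.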


In recent reports by the Department of Health and Human Services and the American Cancer Society on cancer death rates, the years 1940 and 1970 were used as reference years. Bailar and Gornik argue that, since death rates for older people have been going up and those for younger people going down, choosing the reference year 1940, when the population was much younger than in 1990, gives an unduly optimistic view of recent trends in cancer death rates. Why is this?

Predictive use of breast-cancer genetic test is disputed.
The Wall Street Journal, 15 May 1997, B1
Michael Waldholz

Defects in the genes BRCA1 and BRCA2 have been identified by cancer researchers as risk factors for breast cancer. Earlier studies estimated that women who carry such defects have an 85% chance of developing breast cancer by age 70. For women in the general population who don't have the defect, the risk is 12%. Since 1995, Myriad Genetics Inc. of Salt Lake City has sold a $2000 test that can detect defects in these genes, and many women who tested positive have elected to undergo preemptive mastectomies.

Now a new report from the National Cancer Institute, presented in the "New England Journal of Medicine", questions the 85% figure, suggesting that the right number is 56%. Mary-Claire King, the University of Washington geneticist who in 1990 identified the BRCA1 gene, faults the new study for basing its findings on interviewees' recollections of their family medical histories. The earlier studies on BRCA1 and BRCA2--which gave the 85% figure--counted families' breast-cancer cases by confirming them in medical records.

The new report looked at more than 5000 Jewish women in the Washington DC area and found that 120 (2.3%) carried a defect in their BRCA1 or BRCA2 genes. The women were asked how many cases of breast cancer had occurred in their families. The risk of developing cancer by age 70 was then projected.


(1) People's difficulty in recalling medical histories is a notorious problem for retrospective studies. Still, if the real risk is 85%, it seems there would be a lot of breast cancer in the families who carry defective genes. Does it surprise you that this would not be remembered?

(2) Dr. Mary Skolnick of Myriad Genetics expressed concern that the new report would convince women that their risk is low enough so that they don't need to consider preemptive mastectomy. How sensitive do you think such a decision is to the level of risk? (If an 85% risk would convince you to have surgery, would an 80% risk also?) Do you think there is a threshold (say 50%) at which there would be a substantial change in people's decisions?

(3) In light of the new study, we are comparing a 56% risk to an 85% risk. Suppose we frame the comparison in terms of risk multiples: the earlier studies suggest the defective genes give a 7-fold increase in risk, and the new study suggests a 4- to 5-fold increase. Does this change your perception?

Statistics: the conceptual approach.
Springer-Verlag, 1997
Gudmund R. Iversen, Mary Gergen

In his article "Statistics in the Liberal Arts Education" in the February 1985 issue of "The American Statistician", Gudmund argued that statistics, being a part of every student's life, should also be a part of every student's liberal education. Students who take at most one statistics course should come out of that course with a general understanding of the basic concepts of statistics. The advent of statistical packages has both made this possible and increased the need for an understanding of what statistical tests really mean and how they are used.

Gudmund has practiced what he preached with his popular introductory statistics course at Swarthmore: Stat 1: Statistical Thinking. This course satisfies Swarthmore's primary distribution requirement. Courses that satisfy this requirement "place particular emphasis on the mode of inquiry in a particular discipline," are limited to 25 students and have a writing component.

The popularity of Stat 1 (about 125 students take it each year) and the limited enrollment, have given Gudmund a lot of opportunities to experiment with and develop the course. This book, written with psychologist Mary Gergen, brings us the results of this rich experience.

Consistent with its aims, the book is written in English rather than in formulas. Formulas play an important role, but from their position at the end of the chapters. Examples are chosen from real studies and carefully selected to make clear to students the role that statistics will play in their lives. This relevance is reinforced by photographs related to the studies and by a wonderful collection of cartoons. The student, being introduced to data analysis, sees a Calvin and Hobbes cartoon with Calvin saying "I love messing with data."

The coverage of basic statistics concepts and tests is remarkably complete, including analysis of variance and multiple regression. Students are often invited to "stop and ponder" and answer a question to test their understanding of what they have just read. Three types of exercise are provided at the end of each chapter: Review, Interpretation, and Analysis. For example, from the chapter on Hypothesis Testing we find such exercises as: Review: What does statistical significance mean? Interpretation: Why can we conclude that the gun-owning percentage in this town is not equal to the national average? Analysis: Given this data, test the null hypothesis that there is no difference in the graduating rates of men and women athletes.

Curiously, the excellence of the discussions of the basic statistical concepts does not carry over to probability. The absence of any discussion of independence and conditional probability can only reinforce future doctors' and lawyers' idea that "when in doubt multiply", when faced with the problem of finding the probability that two events occur.

The final chapter "Statistics in Everyday Life" should be read by every student who takes a statistics course. It is great! This chapter reviews the basic ideas students should have learned and shows them the "problems inherent in developing research designs and how statistics can be used and abused in public forums."

This is an important book that will extend to other colleges the remarkable contribution to liberal education that Gudmund has made at Swarthmore College.

Ask Marilyn.
Parade Magazine, 25 May 1997, p19.
Marilyn vos Savant

A reader asks: "Say you've run into two of the Brown brothers a zillion times. Half of the time, both had blue eyes; the other half, they didn't. What's the smallest number of brothers that must be in the Brown family, and what color are their eyes?"

Marilyn starts by assuming that "the two Brown brothers are a random selection of all Brown brothers." She then argues that there must be three blue-eyed brothers and one brown-eyed brother. Her logic is that there must be at least three Brown brothers, including two with blue eyes and one with brown eyes, or else you'd see blue-eyed pairs all the time. But if this is the whole family, you'd only see two blue-eyed brothers 1/3 of the time. Adding another blue-eyed brother gets the proportion right.
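Her reasoning is easy to check by brute force: in a family with b blue-eyed brothers among n, the chance that a random pair are both blue-eyed is C(b,2)/C(n,2), and we want the smallest family for which this is exactly 1/2:

```python
from fractions import Fraction
from math import comb

def p_both_blue(blue, total):
    """Probability a randomly chosen pair of brothers both have blue eyes."""
    return Fraction(comb(blue, 2), comb(total, 2))

# Search small families for a pair-probability of exactly 1/2.
solutions = [(n, b) for n in range(2, 10) for b in range(n + 1)
             if p_both_blue(b, n) == Fraction(1, 2)]
print(solutions[0])  # -> (4, 3): four brothers, three blue-eyed, one not
```

With three blue-eyed brothers out of four, 3 of the C(4,2) = 6 possible pairs are both blue-eyed, matching the reader's "half of the time"; with only two blue-eyed of three, the fraction is 1/3, as Marilyn's argument says.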


(1) What does Marilyn mean by "the two Brown brothers are a random selection of all Brown brothers"?

(2) Can you reproduce the calculations backing up her reasoning?

Data smog: surviving the information glut.
Technology Review, May/June 1997, p18
David Shenk

This essay describes the gap between the rate at which information can be electronically delivered and the rate at which humans can usefully assimilate it. According to Nelson Thall of the University of Toronto's Marshall McLuhan Center, "we're pushing ourselves to speeds beyond which it appears we were designed to live."

The article declares that the goal of the information industry is to convince consumers that whatever they have is not enough. The resulting push to upgrade software and hardware comes at a big price for individuals and corporations. Moreover, with new technology comes new tasks that people are expected to perform. The author suggests that the resulting stress may be a cause of polling results that say Americans feel they have lost control of their lives.

The data glut is manifested in multiple studies and conflicting points of view on all policy issues. In an interview on NPR's "All Things Considered" regarding the latest cancer data, Philip Taylor of the National Cancer Institute remarked: "If you don't have some level of confusion about how to interpret this study, you should." It seems there is always a way to generate more data, spin the results, and continue the debate. The author asserts that TV shows like "Crossfire" are designed to exploit the entertainment value of what he calls the stat war phenomenon. "Charges fly back and forth as furiously as a ping pong ball. But...the show always ends before viewers have time to gauge the accuracy of the shots." And when we are helplessly overwhelmed by conflicting expert opinions, we may fail to come to conclusions on important issues.


In support of his claim that more knowledge has led to less clarity, the author asks: "Is dioxin as dangerous as we once thought? Do vitamins prevent cancer? Would jobs have been gained or lost under Bill Clinton's comprehensive health care plan?"

But is information the culprit in all these examples? Was there ever any clear answer as to whether vitamins prevent cancer, or whether a given government program would produce a net gain or loss of jobs?

Study lists factors common in teenage suicide.
The Boston Globe, 3 June, 1997, A15
Associated Press

A study in the June edition of the journal "Pediatrics" finds that risk-taking and problem behaviors are associated with higher teenage suicide rates. The report is based on an anonymous questionnaire filled out in 1993 by 3054 Massachusetts high school students. There were 288 suicide attempts reported in the group. A computer model was used to analyze responses to questions about behavior and to assign each individual to one of two groups: attempted suicide or not. The article reports that the model was "right 92% of the time."

Controlling for other factors, students who smoked regularly within the last 30 days were twice as likely to report a suicide attempt. Some other behaviors and increased risks included:

Fights in past 12 months: 1.3 times.
Carrying a gun in the past 30 days: 1.4 times.
Lack of seat belt use: 1.3 times.
Substance abuse before sexual activity: 1.4 times.


(1) If you had guessed that none of the teens would report a suicide attempt, you would have been right in 2766 (= 3054 - 288) cases, or 90.6% of the time. So how impressed should we be that the computer was right 92% of the time? What do you think the article meant here? What else would you like to know?

(2) A number of disparate behaviors seem to produce about the same risk multiples. What do you make of this?

(3) What fraction of attempted suicides do you think will actually be reported? Could there be a relationship between any of the above behaviors and the chance of actually reporting a suicide attempt?
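As a check on question (1), here is the baseline arithmetic, using the counts reported in the article:

```python
total = 3054     # students surveyed
attempts = 288   # reported suicide attempts

# Accuracy of the naive rule that always predicts "no attempt"
baseline = (total - attempts) / total
print(round(100 * baseline, 1))  # 90.6

# The model's reported 92% accuracy barely improves on this naive rule
print(round(92.0 - 100 * baseline, 1))  # 1.4
```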

Unconventional Wisdom
The Washington Post, 18 May 1997, C5
Richard Morin

The World's Happiest Countries.

Social psychologist Ruut Veenhoven of the Netherlands has devised a new way to measure a country's quality of life. His "happy life expectancy" is an attempt to quantify objectively the number of years the average citizen in a country lives happily.

To compute the "happy life expectancy", Veenhoven multiplies the average life expectancy in a country by the percentage of the time residents are happy and satisfied with their lives. Iceland ranked first with a "happy life expectancy" of 62 years, while the U.S. ranked tenth with an expectancy of 57.8 years. The lowest ratings were found in the former Soviet Bloc and Africa.
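Veenhoven's index is a simple product. Here is a minimal sketch; the life-expectancy and happiness figures below are illustrative assumptions, not values from the article:

```python
def happy_life_expectancy(life_expectancy_years, fraction_happy):
    """Years of life weighted by the fraction of time residents report being happy."""
    return life_expectancy_years * fraction_happy

# Illustrative only: a country with a 78-year life expectancy whose residents
# report being happy and satisfied 75% of the time
print(happy_life_expectancy(78, 0.75))  # 58.5
```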

It is important to note that the survey data from several countries are shaky and, because of the absence of solid data in many developing nations, Western democracies are over-represented in Veenhoven's sample of 48 countries.

Beauty News.

Psychologist Michael McCall of Ithaca College recently reported in the "Journal of Applied Social Psychology" that attractive individuals are less likely to be carded (their age checked) in restaurants and bars. McCall passed out photos of attractive men and women in their late teens and early twenties to 108 undergraduates. He asked the undergraduates if they would card the individuals if they worked at a bar or restaurant. Good-looking men and women were less likely to be carded and were judged more likable and trustworthy.


What problems can you see with comparing a measurement of happiness between countries?

Study Finds Secondhand Smoke Doubles Heart Disease.
The New York Times, 20 May 1997, A1
Denise Grady

Harvard researchers reported a study which claimed that second-hand smoke is much more dangerous than previously thought. The study has "broad implications for public health policy and probably direct impact on at least one major lawsuit."

The study was based on the Harvard Nurses Study that began in 1976 with 121,700 women filling out health information surveys every two years. In 1982, the researchers began asking about the women's exposures to smoke and smoking, both in the home and in their workplace. The study included only the 32,046 women who had never smoked and who did not have cancer or heart disease. The women in the study ranged in age from 36 to 61 when the study began. During the ten year period of the study, 153 had heart attacks, with 25 of these fatal. Women who were regularly exposed to second-hand smoke were 91 percent more likely to have a heart attack than those not exposed. Those who were occasionally exposed were estimated to be 58% more likely to have a heart attack.

"There may be up to 50,000 Americans dying of heart attacks from passive smoking each year," assistant professor of health and social behavior Dr. Ichiro Kawachi said. He works in the Harvard School of Public Health and is the main author of the study. The study was published in the journal "Circulation."

This study did not consider the effect of second-hand smoke on lung cancer. Earlier studies had led to an estimate that second-hand smoke could cause 3,000 to 4,000 deaths a year. Also, before the study, research had found that second-hand smoke caused asthma, bronchitis, and middle-ear infections in small children. Previous studies had estimated a smaller increased risk for heart attacks, of 20% to 30%.

Research with both people and animals has shown various ways in which chemicals in second-hand smoke can harm the heart. Smoke reduces a person's oxygen supply, damages arteries, lowers levels of HDL (a beneficial cholesterol), and increases the tendency of blood platelets to stick to one another and form clots--which can cause a heart attack.

This study is not only relevant to general health; it will also serve as a legal aid in several cases against tobacco companies. It will affect the movement for smoke-free areas. Although the federal Occupational Safety and Health Administration has recommended smoke-free workplace laws nationwide, they are not in effect. Smoking regulations vary from state to state and city to city. "This study will be of enormous help to legislative bodies, statewide and locally, who are trying to get limits on smoking," said Edward Sweda, senior lawyer at the Tobacco Control Resource Center at Northeastern University in Boston.

Stanley Rosenblatt, a Miami lawyer, thinks this is an important study. He is representing flight attendants in a class-action suit against tobacco companies. It is the first of its kind and may end up involving over 60,000 former and current flight attendants. They seek billions of dollars in damages for being harmed by second-hand smoke in airplane cabins on flights in which smoking was still legal. Also, this study may "affect negotiations between Northwest Airlines and its flight attendants." Northwest still allows smoking on several of its flights to Japan, to keep up with competition from Japanese airlines that permit smoking. John Austin, Northwest's spokesman, says that the study will "certainly factor in. But it's hard to say what the impact will be."


How much confidence would you have in the 91% estimate?

Making sense of randomness.
Psychological Review, Vol. 104, No.2, 301-318
Ruma Falk and Clifford Konold

To understand this paper, we have to tell you more than you would like to know about what a "random sequence" is.

It is generally believed that the standard probability model for tossing a fair coin describes a process that produces a random sequence of H's and T's. This model gives the same probability to each particular sequence. So, in one sense, any particular sequence is as random as any other. However, many people think that the sequence HHHHH is not as "random" as the sequence HTHHT. Indeed, George Wolford, of our Psychology department, was able to convince a Chance class that he could verify ESP. To do this, he wrote down a sequence of 5 H's and T's and put the result in a sealed envelope. He asked the students to write down such a sequence, concentrating on the sequence in the envelope. The majority matched the sequence in the envelope: HTHHT.

The problem of telling when a specific sequence is random has fascinated mathematicians and psychologists for a long time. For a history of the struggles of mathematicians, see "An Introduction to Kolmogorov Complexity", Li and Vitanyi, Springer-Verlag 1993, and for the struggles of the psychologists see "The perception of randomness," Bar-Hillel and Wagenaar, Advances in Applied Mathematics, 12, 428-454 (1991).

Psychologists have shown that people do not consider all sequences of H's and T's to be equally random and have tried to explain why. Kahneman and Tversky gave this as an example of their "representativeness" heuristic: students choose HTHHT because it "represents" the kind of outcome that typically occurs. However, it has been found that people do not make choices consistent with what might rationally be thought of as typical. For example, if you compute the probability for each possible number of alternations in a sequence of 5 H's and T's, you find that two alternations is the most likely outcome. However, in our ESP example the students most often chose the sequence HTHHT, which has 3 alternations. If they were choosing a representative sequence with respect to alternations, they should choose a sequence like HTTHH, which also has the most likely number of heads and tails. One might try to define a sequence to be "random" if it represents the most likely outcome with respect to a couple of descriptive quantities. We shall see an example of this approach in our next article.
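The claim about alternations can be checked by brute force. This sketch (our own) enumerates all 32 sequences of length 5:

```python
from itertools import product
from collections import Counter

def alternations(seq):
    """Number of adjacent positions where the outcome switches."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

counts = Counter(alternations(s) for s in product("HT", repeat=5))
for k in sorted(counts):
    print(k, counts[k])
# 2 alternations is the most likely count (12 of the 32 sequences),
# yet the sequence students favor has 3:
print(alternations("HTHHT"))  # 3
```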

Gilovich and Tversky did a famous study of basketball fans to see if it is generally understood how many runs of H's or T's there should be in a "typical" random sequence. They found that basketball fans believe players have streaks of successful shots called "hot hands". To test if this was true, they looked at data for players on the Philadelphia 76ers basketball team to see if any of these players had streaks. Even though fans felt strongly they did, Gilovich and Tversky found that standard statistical tests would not reject the hypothesis that the shots were from a simple coin tossing model with probability of heads equal to the player's shooting percentage. They attributed much of the "hot hand" theory to the fact that people do not expect runs as long as those that typically occur in a random sequence. This is the same as saying people expect more alternations than would typically occur (which is what made George Wolford's trick work).
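A small simulation under the simple coin-tossing model the authors tested against shows how long typical runs are. The choice of 20 shots per game and a 50% shooter is our illustrative assumption:

```python
import random
from itertools import groupby

random.seed(1)

def longest_run(seq):
    """Length of the longest block of identical outcomes."""
    return max(len(list(g)) for _, g in groupby(seq))

# Simulate many sequences of 20 shots by a 50% shooter
runs = [longest_run([random.random() < 0.5 for _ in range(20)])
        for _ in range(10000)]
# The average longest "streak" is between 4 and 5 -- longer than most fans expect
print(sum(runs) / len(runs))
```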

The hypothesis testing in the basketball example suggests another way to think about "what is a random sequence?" We might say that a sequence is "random" if it is not rejected by certain standard statistical tests for randomness, such as the runs test. This idea was carried out by Martin-Lof to give a definition of a finite random sequence. His definition turned out to be equivalent to another, apparently very different, definition of randomness proposed by three researchers in the early 60's: R. J. Solomonoff, G. J. Chaitin, and A. N. Kolmogorov. Their idea was to define a sequence to be random if it is difficult to describe--in other words, if it has no obvious patterns.

They gave a formal definition of randomness by first defining the complexity of a sequence as the length of the shortest computer program that would produce the sequence. For truly complex sequences, you cannot do any better than just print out the sequence. This would take a program with length about the same as the length of the original sequence. Such sequences are called 'random' by the complexity definition of randomness. The sequence consisting of the first million digits of Pi is not random because this sequence can be produced by a very short program. It can be proven that most sequences produced by a coin tossing process are random but, curiously, it is very hard to produce examples of random sequences. However, this is an elegant definition, and it seems to agree with our intuition that a random sequence should be one without patterns.
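True program-length complexity is uncomputable, but ordinary file compression gives a crude proxy for "length of the shortest description": patterned sequences compress far more than coin-toss ones. Using zlib here is our illustration, not part of the complexity definition:

```python
import random
import zlib

random.seed(0)

patterned = ("HT" * 500).encode()  # an obvious pattern, length 1000
noisy = "".join(random.choice("HT") for _ in range(1000)).encode()

print(len(zlib.compress(patterned)))  # very short: the pattern is easy to describe
print(len(zlib.compress(noisy)))      # much longer: no pattern to exploit
```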

We return now to the paper by Falk and Konold. These authors were interested in studying how people determine their subjective measure of the randomness of a sequence. The complexity definition of randomness would suggest that people implicitly encode the complexity of a sequence to determine how random it is. Falk and Konold designed and carried out experiments to see whether subjects' behavior is consistent with this hypothesis.

In their experiments Falk and Konold measured subjects' subjective complexity of binary sequences of the type used to judge randomness. In each experiment, they presented subjects with sequences of 21 X's and O's having either 10 or 11 X's. Subjects were given samples of such sequences with an even number of alternations varying from 2 to 20. They used three different methods to quantify subjective complexity, each based on the difficulty of encoding a sequence: the first measured the difficulty of memorizing a sequence, the second the difficulty of copying it, and the third subjects' assessment of the difficulty of copying it.

Previous research suggested two possible hypotheses concerning the subjective complexity of a sequence as a function of the number of alternations in the sequence. Under the first hypothesis, subjective complexity should be correlated with the objective randomness of a sequence as determined by complexity. Under the second hypothesis, it should be correlated with subjects' subjective assessment of the randomness of the sequences.

Under the first hypothesis, subjective complexity, as a function of the number of alternations, should be symmetric with a maximum at the expected number of alternations, 10. Under the second hypothesis, it should be skewed to the right with a maximum around 14 alternations, since subjects' subjective assessment of the randomness of such sequences is typically skewed in this way.

All three experiments supported the second hypothesis, suggesting that people do indeed determine their subjective assessment of randomness through attempting to encode the complexity of the sequence.

For their experiments Falk and Konold needed a measure of the objective complexity of sequences. Complexity, defined as the length of the shortest program to produce the sequence, is difficult to compute. Thus Falk and Konold used, instead, an interesting entropy measure of complexity called "Approximate Entropy" (ApEn), which has been used for some time by psychologists. It is nicely explained in a charming little book by F. Attneave (1959), "Applications of Information Theory to Psychology." As you will see from the next review, it has recently been rediscovered as a measure of the randomness of a finite sequence. Here is how approximate entropy is determined.

Consider the sequence HTTHHTHHTHTT of 12 H's and T's. Compute the empirical distribution for patterns of each length n. For example, let n = 1. There are two possible patterns of length 1, H and T. Since our sequence has 6 H's and 6 T's, the empirical distribution for these patterns is p(1) = (6/12, 6/12). Consider next n = 2. Then there are four possible patterns of length 2: HH, HT, TH, and TT. There are 11 such patterns in the sequence, since each outcome except the last one starts such a pair. The empirical distribution of these patterns (HH, HT, TH, TT) in this sequence is p(2) = (2/11, 4/11, 3/11, 2/11). Find, for each n, the empirical distribution of all possible patterns of length n. Then the kth order approximate entropy is defined as

ApEn(k) = Entropy(p(k)) - Entropy(p(k-1))

where Entropy(p) of a distribution p = (p1, p2, ..., pm) is defined by:

Entropy(p) = - p1*log(p1) - p2*log(p2) - ... - pm*log(pm).

Falk and Konold used second order approximate entropy as their measure of objective randomness.
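Here is a sketch of this computation for the example sequence above. The function names are ours, and natural logarithms are used in the entropy formula:

```python
import math
from collections import Counter

def pattern_dist(seq, n):
    """Empirical distribution of overlapping length-n patterns in seq."""
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    return [c / total for c in counts.values()]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def apen(seq, k):
    """kth-order approximate entropy: Entropy(p(k)) - Entropy(p(k-1))."""
    return entropy(pattern_dist(seq, k)) - entropy(pattern_dist(seq, k - 1))

seq = "HTTHHTHHTHTT"
print(round(apen(seq, 1), 3))  # 0.693 = log 2, since H and T are equally frequent
print(round(apen(seq, 2), 3))  # second-order ApEn, the measure Falk and Konold used
```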


It appears that Falk and Konold are assuming that sequences with the same number of H's and T's and the same number of alternations have about the same approximate entropy. Do you think this would be true for the following two sequences



both of which have 10 H's and 11 T's and 7 alternations? Would you think they would have about the same complexity?

Putting randomness in order.
New Scientist, 10 May 1997
Ian Stewart

Not all (possibly) "random" sequences are created equal.
Proc. of the National Academy of Sciences, Vol. 94, pp. 3513-3518
Steve Pincus and Rudolf E. Kalman

Pincus and Kalman do not feel that the complexity definition of randomness is suitable for finite sequences. They claim that this definition is not practical for short sequences. Also, they see no reason why a sequence should not be random just because it happens to be part of the expansion of Pi and hence easy to compute. They propose a new definition of randomness, using the concept of approximate entropy defined in the previous review: they measure how random a finite sequence is by how close its approximate entropy is to the maximum value it could attain.

This definition of randomness turns out to be equivalent to requiring that the empirical distribution of length-k patterns be as near to uniform as possible. This is because entropy is maximized by the uniform distribution.

Let's see which sequences of length 5 are random by this definition. For the empirical distribution of patterns of length 1 to be as close to uniform as possible, there must be either two or three H's. For the empirical distribution of patterns of length 2, we must have equal numbers of each pattern. Only four sequences--HHTTH, HTTHH, TTHHT, and THHTT--have both properties. (Note that our students' favorite sequence, HTHHT, is not random by this definition of randomness.) Pincus and Kalman require this maximum property to hold only for k less than log(log(n)) + 1, which for n = 5 means only k = 1 and 2, so we have found all the sequences of length five that are random by their definition.
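This enumeration can be checked by brute force: among all 32 sequences of length 5, only four maximize the entropy of both the length-1 and length-2 pattern distributions. The code is our sketch:

```python
import math
from itertools import product
from collections import Counter

def pattern_entropy(seq, n):
    """Entropy of the empirical distribution of overlapping length-n patterns."""
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

seqs = ["".join(s) for s in product("HT", repeat=5)]
best1 = max(pattern_entropy(s, 1) for s in seqs)
best2 = max(pattern_entropy(s, 2) for s in seqs)
tol = 1e-12  # tolerance for floating-point comparison
apen_random = [s for s in seqs
               if pattern_entropy(s, 1) >= best1 - tol
               and pattern_entropy(s, 2) >= best2 - tol]
print(sorted(apen_random))  # ['HHTTH', 'HTTHH', 'THHTT', 'TTHHT']
```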

Pincus and Kalman use their concept of randomness to see how random the decimal expansions of some of our favorite constants are. Pi is the most random, but sqrt(2) seems to be doing better than e. This surprised the authors, since sqrt(2) is an algebraic number and e is transcendental. Again, none of these decimal expansions is random by the complexity definition of randomness.

The "New Scientist" article by Ian Stewart describes more romantic applications of this new notion of a random sequence. These include uncovering evidence that the "male menopause" may exist and showing that the value of stocks as measured by the Standard and Poors 500, are far from random. (The one exception to this happened in the two-week period just prior to the stock market crash of 1987 when the ApEn was nearly maximally irregular)


(1) For a sequence of length 100 to be ApEn random it must have exactly 50 heads. This occurs with probability about .08. Does it make sense to have over 90% of the sequences obtained by tossing a coin 100 times not be random?

(2) How do you feel about ruling out the first n digits in the expansion of Pi being random just because they are easy to compute?

(3) In the New Scientist article we find the following:

In effect, ApEn measures how predictable the data are. Choose some block of data, say 101, and look at what digit comes next. If 101 is usually followed by 1, then the data have a degree of predictability. But if 101 is followed by 0 just as often as 1, then the sequence is unpredictable. ApEn gives the average unpredictability over all possible blocks.

This suggests that approximate entropy is related to the conditional probability of the nth outcome given the first n-1. Can you see why this might be?
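The figure in question (1) can be computed exactly. This is our arithmetic (math.comb requires Python 3.8 or later):

```python
import math

# Probability that 100 fair-coin tosses give exactly 50 heads
p = math.comb(100, 50) / 2 ** 100
print(round(p, 4))  # 0.0796 -- so roughly 92% of length-100 sequences fail this
                    # necessary condition for ApEn randomness
```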

Note: We had hoped to include still another application of the concept of complexity to Psychology. This is the article:

Reconciling simplicity and likelihood principles in perceptual organization.
Psychological Review, Vol. 103, No. 3, 566-581
Nick Chater

Perhaps we will do this in the next Chance News.
