Class 25 and 26: Pears and Beans

Half the class is touring the PEAR (Princeton Engineering Anomalies Research) lab in E-Quad C131. Meanwhile, we are going to sample beans.


Suppose you have a bag full of black and white beans having a proportion p of white beans and a proportion 1-p of black beans. You pick out a bean at random (with replacement) 225 times.

Suppose, instead, you pick out 225 beans without replacement. What proportion of the beans would you expect to be white? Do you expect the standard error to be smaller, bigger, or the same?

Do pollsters pick out "beans" with or without replacement? Why?

Let's try it

We have a bag of white and black beans for you to sample. There are about 35,000 beans, but we do not know the exact number or the proportion p of white beans. We want you to sample without replacement, but to assume that our formulas for standard error are the same as with replacement. In order to get a lot of samples with the same standard error, we want everyone to take a sample of approximately 225 beans.

Count the number of black beans and white beans in your sample. What is your estimate for the proportion of white beans in the bag? What is your estimate for the standard error? (Since you do not know what the actual proportion of white beans is, you'll have to make some estimates even when using your formula.) What range would you give with 95% confidence for the proportion of white beans in the bag?
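If you want to check your arithmetic, here is a minimal Python sketch of the calculation. It assumes the with-replacement formula sqrt(p(1-p)/n) for the standard error, with the sample proportion plugged in for the unknown p, and a range of plus or minus 1.96 standard errors (you may prefer to round that to 2). The function name and the counts of 90 white and 135 black beans are made up for illustration; they are not results from the bag.

```python
import math

def proportion_interval(white, black, z=1.96):
    """Return (estimate, standard error, low, high) for the proportion of white beans."""
    n = white + black
    p_hat = white / n
    # The true p is unknown, so plug the sample proportion into sqrt(p(1-p)/n).
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, p_hat - z * se, p_hat + z * se

# Hypothetical sample: 90 white and 135 black beans (225 in all).
p_hat, se, low, high = proportion_interval(90, 135)
print(f"estimate {p_hat:.3f}, SE {se:.4f}, 95% range {low:.3f} to {high:.3f}")
```

Substitute your own counts for the hypothetical ones to get your own range.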

What do we mean by 95% confident?

Do you think it is reasonable to use the formulas for sampling with replacement even though we are sampling without replacement?
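One way to think about this question is to compare the two standard errors directly. The sketch below assumes the usual finite population correction factor, sqrt((N-n)/(N-1)), which multiplies the with-replacement standard error when you sample n items without replacement from a population of N. The bag size of 35,000 is only our rough guess, and p = 0.5 is a placeholder, since we do not know the true proportion.

```python
import math

def se_with_replacement(p, n):
    """Standard error of a sample proportion when sampling with replacement."""
    return math.sqrt(p * (1 - p) / n)

def finite_population_correction(N, n):
    """Factor that shrinks the standard error when sampling without replacement."""
    return math.sqrt((N - n) / (N - 1))

N = 35_000   # approximate number of beans in the bag (a rough guess)
n = 225      # sample size
p = 0.5      # placeholder proportion of white beans, for illustration only

fpc = finite_population_correction(N, n)
se_with = se_with_replacement(p, n)
se_without = se_with * fpc
print(f"correction factor {fpc:.4f}")
print(f"SE with replacement {se_with:.5f}, without replacement {se_without:.5f}")
```

Under these assumptions the correction factor is very close to 1, because 225 beans are a tiny fraction of the bag. Try a much smaller N to see when the difference starts to matter.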

How would you estimate the total number of beans in the bag?


Comments on journals (batch 4):

Census article from NY Times:

One student wrote, "how can [the Census Bureau] say that they plan to directly count 90% of the population? If they can count 90% directly, and know that they have counted exactly 90%, then they would also know the exact total population, and would not need to calculate a remainder." I agree this is ridiculous, and I blame the NY Times for unclear reporting. As we discussed in class just before spring break, and as you can read in Freedman's Census 2000 report, what the Census Bureau really plans to do is draw up a residence master list for each county, and directly count the people in households until 90% of the HOUSEHOLDS have been accounted for in each county. Then they will take a sample of the remaining 10% of households in each county and pester these households until they have counted all the people in them.

The same student was disturbed by two sentences from the article. First the article said, "At the core of the legal challenge to the 1990 census was the racially disparate undercount, the existence of which no one disputed." Then it said that Robert A. Mosbacher, the Secretary of Commerce under Bush, claimed that "While a statistical adjustment could improve the numerical accuracy of the census, there was no proof that it would improve the distributive accuracy." The student points out that these two claims seem contradictory, unless each state had roughly the same proportion of minorities (which isn't true) or racial undercounts differed greatly by location. Can anyone think of another resolution?

Several students were worried about replacing a head count with a statistical estimate. One student touched upon the danger that a census based on statistical methods might be subject to repeated re-examination and debate, since the same data could always be reinterpreted using different statistical methods and different assumptions.

Fishing and the Census

People had a lot of different ideas about how births and deaths of fish affect the capture-recapture estimate. Here are my thoughts.

Before you can decide how births and deaths affect the estimate, you have to decide whether you are trying to estimate the population at capture time or at recapture time. I will assume you are interested in the population at capture time, as the Census Bureau is.

Death of fish can affect your estimate in different ways, depending on whether the fish with tags are more or less likely to die than fish without tags. If fish with tags and fish without tags die in equal proportions, then the ratio r/t should not be affected by fish death, so c*r/t will still be a good estimate of the population AT CAPTURE TIME (but an overestimate of the population at recapture time).

Birth of fish causes more problems, because fish are not born with tags. So after a bunch of fish are born, the proportion of fish with tags is smaller, so r/t will be bigger, and c*r/t will be an overestimate of the fish population at capture time (but an accurate estimate of the population at recapture time).

Of course, in reality, births and deaths are both happening, not just one or the other. When both are happening, c*r/t will overestimate the population at capture time AND will overestimate the population at recapture time. (Just put together the analyses of the two paragraphs above. Or, as one student put it, "The more flux that occurs, the more 'tagged' individuals you will lose and 'untagged' you gain, both of which will lead to overestimates.") Of course, things get more complicated if fish with tags are more likely to die than fish without, or vice versa.
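The three cases above can be checked with a little arithmetic. The Python sketch below reads the notation as c = number of fish tagged at capture, r = size of the recapture sample, and t = number of tagged fish in the recapture sample, which is how r/t is used in the paragraphs above. It works with expected values only (no random sampling), assumes tagged and untagged fish die in the same proportion, and uses made-up numbers: 1000 fish, 100 of them tagged.

```python
def expected_estimate(N, c, death_frac=0.0, births=0):
    """Return (approximate expected c*r/t, population at recapture time).

    N          -- population at capture time
    c          -- number of fish tagged at capture time
    death_frac -- fraction of fish (tagged and untagged alike) that die
    births     -- number of untagged fish born before recapture
    """
    tagged_alive = c * (1 - death_frac)
    population_at_recapture = N * (1 - death_frac) + births
    # In a random recapture sample, r/t is roughly (population) / (tagged fish).
    r_over_t = population_at_recapture / tagged_alive
    return c * r_over_t, population_at_recapture

# Hypothetical lake: 1000 fish at capture time, 100 of them tagged.
deaths_only = expected_estimate(1000, 100, death_frac=0.2)
births_only = expected_estimate(1000, 100, births=300)
both = expected_estimate(1000, 100, death_frac=0.2, births=300)

for label, (estimate, now) in [("deaths only", deaths_only),
                               ("births only", births_only),
                               ("both", both)]:
    print(f"{label}: estimate {estimate:.0f}, at capture 1000, at recapture {now:.0f}")
```

With deaths only, the estimate matches the population at capture time; with births only, it matches the population at recapture time; with both, it is larger than either.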

The problem with people being born and dying after the census is similar; however, I suspect that the other problems (such as moving) are probably much bigger.

Incidentally, the problem with prostitutes entering and leaving the population can be mapped to the problem of fish being born and dying: prostitutes who are tagged and untagged leave the profession, but only untagged people enter the profession. I think it's interesting to speculate whether or not tagged and untagged prostitutes leave in equal proportions.

Some people seemed confused about how trap-friendly and trap-shy fish affect the capture-recapture estimate. The existence of EITHER kind of fish tends to increase the overlap between your capture sample and your recapture sample, and therefore make your estimate of the population too small.
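Here is one way to see this with numbers. The Python sketch below assumes that being trap-friendly or trap-shy is a fixed trait of a fish, so each fish has the same chance of being caught on both occasions (it does not model fish that change their behavior because they were caught once). It uses expected values rather than random samples, and the group sizes and catch probabilities are invented for illustration.

```python
def expected_estimate_by_group(groups):
    """Approximate expected c*r/t when fish differ in how easy they are to catch.

    groups -- list of (number_of_fish, catch_probability) pairs; each fish is
              assumed to have the same catch probability on both occasions.
    """
    c = sum(n * p for n, p in groups)      # expected number tagged at capture
    r = c                                  # expected size of the recapture sample
    t = sum(n * p * p for n, p in groups)  # expected number caught both times
    return c * r / t

# Hypothetical lake of 1000 fish in each case.
all_alike = expected_estimate_by_group([(1000, 0.1)])
some_friendly = expected_estimate_by_group([(500, 0.1), (500, 0.3)])
some_shy = expected_estimate_by_group([(500, 0.1), (500, 0.02)])

print(f"all alike: {all_alike:.0f}")
print(f"half trap-friendly: {some_friendly:.0f}")
print(f"half trap-shy: {some_shy:.0f}")
```

When all fish are equally catchable, the estimate comes out at the true 1000. In both mixed cases it falls below 1000, because the easier-to-catch fish are overrepresented in both samples and so show up in the overlap more often than a random fish would.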


A few people mentioned an article about the chance of life in outer space, based on the chance that there is a similar sun somewhere, with a similar planet, etc.

One person mentioned a study in which women who took estrogen during and after menopause were less likely to develop Alzheimer's. (Does this contradict the findings in the study of Alzheimer's and writing styles in nuns?)

Several students mentioned that they enjoyed the probability demos. We will try to make them available on the Courseware server so that you can use them on your own.

One student found a study of horse race betting done by Nottingham University's School of Management, which concluded that "women make better gamblers on horse racing than men". He also found an article about a male contraceptive injection that decreases sperm count, and an article about an enzyme that helps control sperm development. Defects in the enzyme's activity could possibly cause male infertility.