- demographic analysis
- capture-recapture (dual system estimator)
- correlation bias
- stratified sampling
- (multiple) regression

Turning now to statistical issues, it is natural to ask how we can know about the undercount, and even produce estimates of its size, when by definition the persons involved were not enumerated in the census. The answer is that there exist comparatively well-accepted estimates for the total U.S. population, based on demographic data: immigration and emigration information, and birth and death records. The demographic data, available for reliable estimation since the 1940 census, confirm that minorities have historically been missed at higher rates (rates for blacks have been roughly 5% greater than for whites). The task of the Census is much more difficult, however, since it must identify people at their primary residence on Census Day (April 1). The demographic data cannot provide the fine detail required to make adjustments at the local level.

To overcome this problem, the Census Bureau developed a plan for a post-enumeration survey (PES) to assess coverage by the enumeration phase. The basic idea is the same as that for the capture-recapture techniques used to estimate the size of animal populations. For example, to estimate the number of fish in a lake, an initial "capture" sample is first caught, tagged and released. A second "recapture" sample is caught some time later. We compute the proportion of tagged fish in the recapture sample and assume that it equals the overall proportion of tagged fish in the lake. The latter proportion is simply the size of the capture sample divided by the (unknown) number of fish in the lake. Thus if 200 fish are initially captured, and 2/3 of the recapture sample have tags, we would estimate that there are 300 fish in the lake (since 200 divided by the unknown total must equal 2/3). The census version of this procedure goes by the name "dual system estimation" (also "coverage-measurement survey"). Here the traditional enumeration corresponds to the original capture sample. The PES is the recapture sample. The "tags" are the census records from the enumeration. Let $E$ denote the number counted in the enumeration, $P$ denote the number in the PES, and $O$ (for overlap) denote the number of people found by both. Then, following the basic capture-recapture recipe outlined above, we would simply estimate the total population $T$ from the relationship $O/P = E/T$, which gives $T = EP/O$.
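The capture-recapture arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the recapture sample size of 60 are assumptions made for the example (the text fixes only the 200 tagged fish and the 2/3 ratio). The formula itself is the standard one, often called the Lincoln-Petersen estimator.

```python
def lincoln_petersen(capture_size, recapture_size, tagged_in_recapture):
    """Estimate population size from one capture-recapture experiment."""
    if tagged_in_recapture <= 0:
        raise ValueError("need at least one tagged animal in the recapture sample")
    # Equate the tagged share of the recapture sample with capture_size / total:
    #   tagged_in_recapture / recapture_size = capture_size / total
    return capture_size * recapture_size / tagged_in_recapture


# Fish example: 200 fish tagged, 2/3 of the recapture sample carry tags.
# The recapture size of 60 is hypothetical; only the 2/3 ratio matters.
fish_estimate = lincoln_petersen(
    capture_size=200, recapture_size=60, tagged_in_recapture=40
)
print(fish_estimate)  # 300.0
```

In census terms the three arguments play the roles of E, P and O respectively.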

The rest of this section summarizes the problems that arise when we try to apply this model to the U.S.
population. David Freedman (*Science*, May 31 1991) gives a good accessible discussion of the complications,
but it should be noted that he was not in favor of adjusting. A somewhat more technical discussion, this
one by proponents of adjustment, can be found in Ericksen, Kadane and Tukey (*JASA*, 1989).

Note that equating the two proportions in the capture-recapture estimation procedure implicitly assumes that all animals have the same chance of being captured, and that being caught in the capture sample and being caught in the recapture sample are independent events. As stated earlier, it is known that the probability of capture differs among geographical and ethnic groups. For this reason, the procedure cannot simply be applied in one shot to the entire population; instead, it is performed separately in smaller areas, with different adjustments made for each. We will discuss this more fully below. Failure of the independence assumption is called "correlation bias." This too will be discussed in more detail below. We record here two additional assumptions. First, the tags must not fall off the animals (or otherwise become difficult to identify) between the time of capture and recapture. Second, the population should remain closed between these times. Thus births and deaths, or migration in and out of the study area, pose additional problems for the analysis. Again, for the census, the tags are the census records for the original enumeration. Matching a record from the post-enumeration sample to a corresponding record from the enumeration is a non-trivial problem, with potential for human and/or computer error. A false match is relatively rare; the usual mistake is a false non-match. Matching is particularly difficult when people have changed addresses. False non-matches inflate the estimate of the undercount, because they make it appear that the Census enumerated a smaller proportion of the population than it actually did.
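The effect of false non-matches can be seen numerically. The sketch below uses made-up counts (900 enumerated, 100 in the survey, 90 true matches, 5 of them missed by the matching step). It simply evaluates the dual system estimate, enumeration count times survey count divided by the number matched, with and without the matching errors.

```python
def dual_system_estimate(enumerated, pes_count, matched):
    """Dual system estimate of the total: enumerated * pes_count / matched."""
    return enumerated * pes_count / matched


# Hypothetical area: the census counts 900 people, the PES finds 100,
# and 90 of those PES people truly have a census record.
enumerated, pes_count, true_matches = 900, 100, 90
baseline = dual_system_estimate(enumerated, pes_count, true_matches)

# Suppose the matching step misses 5 genuine matches (false non-matches).
false_non_matches = 5
with_errors = dual_system_estimate(
    enumerated, pes_count, true_matches - false_non_matches
)

print(baseline)               # 1000.0
print(round(with_errors, 1))  # 1058.8
```

With these numbers the apparent undercount grows from 100 to roughly 159 people, even though nothing about the real population has changed.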

With regard to correlation bias, note that the original census is not a probability sample; it is an attempt at complete enumeration. People who are hard to find the first time are likely to be hard to find in the post-enumeration, which violates the independence assumption. Several methods for estimating the degree of correlation are discussed in the statistical articles referenced here. The notation used in those articles is a bit more elaborate than the above, and is based on the following two-way table:

| | Included in PES? Yes | Included in PES? No |
| --- | --- | --- |
| Included in census? Yes | $n_{11}$ | $n_{12}$ |
| Included in census? No | $n_{21}$ | $n_{22}$ |

Here $n_{11}$ is the number of people found in the census and matched in the PES (it corresponds to the overlap $O$ from above). Then $n_{21}$ is the number found in the PES but not in the Census. These are considered to be omissions. (It is also possible that these people were erroneously counted by the PES; the hope is that by keeping the size of the PES small, this kind of error will be extremely rare.) Next, $n_{12}$ is the number in the Census but not found in the PES. (It may turn out that the PES process exposes such a record as an erroneous enumeration or a double-counting, in which case it is removed from the Census; additional investigation is done to ensure that $n_{12}$ represents real people.) The term $n_{22}$ represents the number not found by either; it is estimated by

$$n_{22} = \theta \, \frac{n_{12} \, n_{21}}{n_{11}},$$

where $\theta = 1$ corresponds to the independence assumption, $\theta > 1$ corresponds to positive correlation bias, and $\theta < 1$ to negative bias. Methods for estimating $\theta$ have not gained wide acceptance, so the default of $\theta = 1$ is used. As described above, the correlation is most likely positive, so ignoring it would tend to underestimate $n_{22}$, leading to conservative adjustments for undercounting (counterbalancing to some degree the problem of false non-matches). Note that with $\theta = 1$, the estimated total population is

$$T = n_{11} + n_{12} + n_{21} + \frac{n_{12} \, n_{21}}{n_{11}} = \frac{(n_{11} + n_{12})(n_{11} + n_{21})}{n_{11}} = \frac{EP}{O},$$

as before.
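A short numerical check may help here. In the sketch below, the cell labels `n12` (in the census only) and `n21` (in the PES only), the multiplier name `theta`, and all counts are assumptions chosen for illustration. The code fills in the unobserved "missed by both" cell as `theta * n12 * n21 / n11` and confirms that, under independence (`theta = 1`), the total agrees with the simple E times P over O formula.

```python
def estimate_missed_by_both(n11, n12, n21, theta=1.0):
    """Estimate the unobserved cell: theta * n12 * n21 / n11."""
    return theta * n12 * n21 / n11


def estimated_total(n11, n12, n21, theta=1.0):
    """Sum the three observed cells and the estimated fourth cell."""
    return n11 + n12 + n21 + estimate_missed_by_both(n11, n12, n21, theta)


# Hypothetical counts: 90 matched, 810 in the census only, 10 in the PES only.
n11, n12, n21 = 90, 810, 10
E, P, O = n11 + n12, n11 + n21, n11

independent = estimated_total(n11, n12, n21)                # theta = 1
positive_bias = estimated_total(n11, n12, n21, theta=2.0)   # assumed value

print(independent, E * P / O)  # 1000.0 1000.0
print(positive_bias)           # 1090.0
```

The second figure shows the direction of the effect: if the true multiplier exceeds 1, the default of 1 yields a smaller total, which is the sense in which the adjustment is conservative.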

Finally, recall that the undercount differs among ethnic groups and geographical areas. The proposed solution is to carry out the dual estimation procedure separately for different groups, and then combine the results to estimate the total population. The basic geographical unit identified by the Census Bureau is the "block." (Cities and towns are divided into tracts, and tracts are subdivided into blocks; there are over 300,000 census blocks in the U.S. averaging roughly 300 households each.) For the Post-Enumeration Survey, the blocks were partitioned into strata based on demographic similarities, and a stratified random sample of 5000 blocks was taken, representing some 165,000 households. In each of the sample blocks, a complete enumeration was attempted, and the results compared with the original census enumeration. The dual system estimator technique can then be used to estimate the total population of each block. This gives estimates for the blocks in the sample, but not yet for the country as a whole. However, each block in the sample gives rise to an "adjustment factor", which is simply the ratio of the dual system estimate to the census count. The simplest procedure would be to apply this "raw" factor to adjust the count for every other block in the stratum. What actually happens is more complicated. Groups called "post-strata" are formed by classifying residents of the blocks according to age, race, sex and ethnicity, factors which are known to affect enumeration rates (there are 1392 categories overall). Then a regression model is used to predict adjustment factors from the variables used to define the strata and post-strata. This has the effect of "smoothing" together results from similarly defined strata, thereby mitigating the effect of any extreme values that might arise in the "raw" adjustment factors due to sampling errors.
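The "simplest procedure" of raw adjustment factors can be sketched as follows. Everything here is hypothetical: the stratum names, the block counts, and in particular the choice to pool the sampled blocks within a stratum before taking the ratio, which is one plausible reading rather than the Bureau's documented method. The regression smoothing step is not shown.

```python
from collections import defaultdict

# Hypothetical sampled blocks: (stratum, census count E, PES count P, matches O).
sample_blocks = [
    ("urban", 100, 50, 40),
    ("urban", 100, 60, 50),
    ("rural", 150, 60, 60),
]


def raw_adjustment_factors(blocks):
    """Pool sampled blocks by stratum; return (sum of DSEs) / (sum of census counts)."""
    census_total = defaultdict(float)
    dse_total = defaultdict(float)
    for stratum, e, p, o in blocks:
        census_total[stratum] += e
        dse_total[stratum] += e * p / o  # block-level dual system estimate
    return {s: dse_total[s] / census_total[s] for s in census_total}


factors = raw_adjustment_factors(sample_blocks)

# Apply each stratum's factor to census counts of blocks outside the sample.
unsampled_blocks = [("urban", 120), ("rural", 90)]
adjusted = [(s, count * factors[s]) for s, count in unsampled_blocks]

print(factors)
print(adjusted)
```

A factor above 1 signals an estimated undercount in that stratum, and a factor of exactly 1 means the survey found no one the census had missed.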

Freedman gives some details on the mechanics of implementing the adjustments. He considers the post-stratum of "black or Hispanic males, 45-64 living in central cities in New England." If the dual system estimate is 10% higher than the enumeration, the adjustment factor is 1.1. If some central city block in New England is found to have 10 such males in the census enumeration, then the count would be adjusted to 1.1 × 10 = 11, by choosing one of the real census records and duplicating it. It turns out that some adjustment factors will be less than 1.0 (resulting from removal of erroneous enumerations), which corresponds to overcounting. Thus if the factor were 0.95 for white males aged 45-64 in such a block, a census count of 20 would be adjusted down to 0.95 × 20 = 19. This would be physically accomplished by selecting a record at random and introducing a "negative" person into a special adjustment category. In such situations, real people wind up being subtracted from official census tables! This points to potential difficulties in winning public acceptance for the adjustment procedures.
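The two worked cases above reduce to a small calculation. In this sketch the function name is hypothetical, and rounding the adjusted count to the nearest whole person is an assumption; the text does not say how fractional results are handled.

```python
def records_to_adjust(census_count, factor):
    """Signed number of records to add: positive = duplicate, negative = offset."""
    adjusted = round(census_count * factor)  # rounding rule is an assumption
    return adjusted - census_count


print(records_to_adjust(10, 1.1))   # 1  -> duplicate one real census record
print(records_to_adjust(20, 0.95))  # -1 -> introduce one "negative" person
```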