On the Conservation of Snow Leopards
You are working with a nature documentary crew looking for snow leopards. Four days in, the guides lead you to the top of a mountain pass dividing two valleys. Snow leopards are sometimes sighted using the pass to move between the valleys.
You carefully place your camera traps and return to camp. Weeks pass with no sightings. You place more camera traps on the other passes and wait. Days come and go but no leopards. Frustrated, you seek out the team’s conservationist. “How many leopards do you think are in this valley?”
“Two,” she says, “on average.”
“Not many for all this space,” you answer, looking out across the rugged miles that surround your mountain camp. You slip into the main tent to check the tapes. You are scrolling through the video feed when, at last, a snow leopard appears on the screen. It crossed the pass last night. It even paused to inspect the camera before moving on. Thrilled, you wake the team and replay the tape. Everyone celebrates. It was a long month waiting.
The conservationist is watching over your shoulder. You look up. “I guess there’s only one in the valley left to find.”
She shrugs. “I still think there’s two out here.”
“But we saw one leave. How can there still be two?”
She shrugs again and smiles. “Two,” she says, “on average.”
Introduction
This is an example of a problem in expectation. The actual number of snow leopards in the valley is unknown. But, on average, there are two leopards in the valley. The question is how this expected number should update after seeing a snow leopard leave the valley.
At first the answer seems obvious. Every time a leopard leaves the valley the total number of leopards in the valley decreases by one. So the expectation should decrease by one.
But should it?
Suppose that the expected number of leopards in the valley is $1/2$ instead of $2$. When one leopard leaves the valley, the expected number of leopards remaining in the valley cannot be $-1/2$—there is no such thing as an anti-leopard. In this case it is obvious that we cannot update the expectation by subtracting one leopard from the expectation.
The subtlety here is in the distinction between expecting that there are a certain number of leopards in the valley and knowing the number. If we had counted all the leopards then we would know the number exactly. This number would match the expectation and would decrease by one when a leopard leaves. If we do not know the number of leopards for sure then the expectation is an average. Knowing the average number is not the same as knowing the actual number of leopards because it leaves the actual number uncertain. This uncertainty means that there is implicitly a probability distribution on the size of the leopard population.
For example, if the expected number of leopards is $1/2$ there is not actually half a leopard. There is no more a fractional leopard than an anti-leopard. It means that there is a nonzero probability that there are zero leopards, and a nonzero probability that there are one or more leopards.
Now we see a leopard leave. This would have been impossible if there were zero leopards in the valley, since leopards cannot appear from thin air. So there was at least one leopard in the valley. The observation event has taught us something: given that we saw a leopard leave, we should ignore the possibility that there had been zero leopards. As a result, the conditional expectation of the number of leopards in the valley before one left should always be at least one. In turn the new expectation can never be negative.
Speaking more broadly, the more leopards in the valley the more likely it is to observe one leaving. Therefore, the observation event carries information about the number of leopards in the valley before the event. The fact we saw one leave means there could not have been zero—there may have been more than we thought. This means we should revise our old expectation upwards before subtracting the individual who left. The more uncertain we were before seeing a leopard leave the more we should revise upwards.
Revising the expectation upward before subtracting the wandering leopard is common sense in other contexts. For example, consider a fisherman who tries a new pond and has the most successful day of his fishing career. The fisherman is likely to return to the pond; the fact that he caught fish there means there are (or at least were) fish in the pond. As long as he continues to catch fish, he is likely to return. Every catch is evidence that the pond contains fish, even as it depletes the pond.
Here we formalize this problem to show how the expectation should be updated after observing a leopard leave. This requires introducing some notation and formalizing the relationship between the rate at which leopards leave the valley and the number of leopards in the valley.
Model
Let $X(t)$ be the number of snow leopards in the valley at time $t$. This is an example of a stochastic process since $X(t)$ is a random variable for each $t$. Let $p(x,t)$ be the probability $\text{Pr}\{X(t) = x\}$.
Then the expected number of leopards at time $t$ is:

$$ \bar{x}(t) = \mathbb{E}[X(t)] = \sum_{x=0}^{\infty} x \, p(x,t). $$
Now we need a model for when snow leopards cross the pass.
Since snow leopards do not wear wrist-watches, and since all our efforts to speak to snow leopards have failed, the timing of each crossing is unpredictable, so it is best modeled as a random variable. This is an example of a counting process. Each time a leopard crosses the pass we count an additional crossing, but the timing of the crossings is random. In order to model the crossings we need to specify a probability distribution that returns a probability (or probability density) for any sequence of crossing times.
A natural way to construct such a distribution is to define an expected, or average, event rate, $\lambda$. If we let $N([t, t+\Delta t])$ be the number of events that occur in the interval $[t, t+\Delta t]$, then this usually means that:

$$ \begin{aligned} \text{Pr}\{N([t,t+\Delta t]) = 0\} & = 1 - \lambda \Delta t + o(\Delta t), \\ \text{Pr}\{N([t,t+\Delta t]) = 1\} & = \lambda \Delta t + o(\Delta t), \\ \text{Pr}\{N([t,t+\Delta t]) > 1\} & = o(\Delta t). \end{aligned} $$

Here $o(\Delta t)$ represents any function of $\Delta t$ that converges to zero faster than $\Delta t$ does as $\Delta t$ goes to zero. A function $f$ is $o(\Delta t)$ if $\lim_{\Delta t \rightarrow 0} f(\Delta t)/\Delta t = 0$. Under these assumptions the expected number of events in any time interval is simply $\lambda$ times the length of the time interval. This sort of counting process, a Poisson process, is widely used to model rare events. For example, this is the precise probabilistic description for the decay of radioactive material.
Suppose the rate $\lambda$ is constant in time. Then the process defined above is a time homogeneous Poisson process. A Poisson process is an example of a counting process. A counting process is a stochastic process, $N(t)$, that represents the number of times an event has occurred between times $0$ and $t$. The times between events in a counting process are the waiting times. Let $W_j$ represent the time between the $j^{\text{th}}$ and $(j+1)^{\text{st}}$ events. The event times $T_j = \sum_{i < j} W_i $ are the times at which the events occur.
A Poisson process is a counting process where the numbers of events in non-overlapping time intervals are independent of each other, and the probability that $n$ events occur in a given time interval depends only on the length of the time interval, not on the time at the start of the interval or the number of events that have occurred in the past. Counting processes that do not possess these properties are not Poisson. For example, the timing of water droplets from a leaky faucet is not a Poisson process, since the drops fall with a characteristic rhythm—they are more likely to occur during some time intervals than others, even if those intervals are of equal length. Nor is the number of drops during an interval independent of the number of drops that came before: if no drops fell during the preceding interval, then it is more likely that a drop will fall during the current interval, since more water has built up on the faucet.
Poisson processes are the most widely used counting process models. Example applications of Poisson processes include queueing, systems of chemical reactions at the cellular scale, and population modeling with discrete birth and death events.
Here we provide four different characterizations of the Poisson process. These characterizations are minimal sets of assumptions about the counting process that guarantee it is Poisson. All four characterizations share the same $0^{th}$ assumption, which ensures the process is a counting process. Formally:
0. $N(0) = 0$, $\lim_{t \rightarrow \infty} N(t) = \infty $ with probability one, $N(t)$ is nondecreasing and right continuous: $ \lim_{s \rightarrow t^+} N(s) = N(t) $, and at the event times $N(t)$ increases by exactly one.
To ensure that a counting process is a Poisson process we need to introduce one of the following three assumptions. Combining any one of these assumptions with assumption 0 ensures that a process is Poisson, and guarantees that the remaining assumptions are also true.1 The assumptions are:
1. For $0 < t_1 < t_2 < ... $, the increments $N(t_1)-N(0), N(t_2) - N(t_1), ...$ are independent and the distribution of $N(t) - N(s)$ depends only on the width of the time interval $t - s$.
2. The waiting times $W_j$ are independent and exponentially distributed with parameter $\lambda$:
$$ \text{Pr}\{W_j = w\} = \lambda \exp(- \lambda w). $$
3. For $0 < t_1 < t_2 < ... $ the increments $N(t_1)-N(0), N(t_2) - N(t_1), ...$ are independent and are Poisson-distributed:
$$ \text{Pr}\{N(t) - N(s) = n\} = \frac{(\lambda(t-s))^n}{n!} \exp(-\lambda(t-s)). $$
Assumptions 2 and 3 are stronger than assumption 1 in that they are explicit assumptions about the distribution of either the waiting times or the increments.
Similarly, if assumption 0 is made, and $\text{Pr}\{N([t,t+\Delta t]) = n\}$ matches the asymptotic behavior we used to define our counting process, then $N(t)$ is a Poisson process. The equivalence of assumptions 1, 2, 3, and the asymptotic form for $\text{Pr}\{N([t,t+\Delta t]) = n\}$ under assumption 0 is partly why Poisson processes are so widely used. Assumptions 0 and 1 are reasonable for many counting processes, and the asymptotic form for $\text{Pr}\{N([t,t+\Delta t]) = n\}$ is the most natural interpretation of a counting process where events occur at an average rate $\lambda$.
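These equivalences are easy to see numerically. Below is a minimal sketch (in Python, assuming NumPy is available; the function names are ours) that simulates a Poisson process by drawing independent exponential waiting times, as in assumption 2, and checks that the resulting counts over a fixed window have the mean and variance of the Poisson distribution promised by assumption 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_events(rate, t_max):
    """Simulate one path by summing exponential waiting times
    (assumption 2) and return N(t_max)."""
    t, n = rng.exponential(1.0 / rate), 0
    while t <= t_max:
        n += 1
        t += rng.exponential(1.0 / rate)
    return n

rate, t_max = 2.0, 3.0
counts = np.array([count_events(rate, t_max) for _ in range(50_000)])

# Assumption 3 predicts N(t_max) ~ Poisson(rate * t_max), whose mean
# and variance are both rate * t_max = 6.
print(counts.mean(), counts.var())
```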
What remains is to specify $\lambda(x)$, the rate at which we expect to see leopards leave the valley if there are $x$ leopards in the valley. It is reasonable to assume that this rate increases the more snow leopards there are in the valley. Note that the actual dependence of $\lambda$ on $x$ depends on how leopards interact while dispersing. A highly social animal is likely to stay near other members of its species, so the rate at which any individual leaves a group may decrease the more individuals are in the group. In this case $\lambda(x)$ will be sublinear (not proportional to $x$), and may even decrease in $x$ for large enough $x$. Territorial animals may actively avoid each other while dispersing, hence $\lambda(x)$ may be superlinear in $x$. In general it is only possible to address the question, “How many do you think are out there now?,” once $\lambda(x)$ is specified. Here we give the solution for a particular $\lambda(x)$, and provide details on the general case as a supplement.
Under most migration models it is reasonable to assume that the rate $\lambda(x)$ is proportional to $x$. This is often assumed for one of two reasons:
- Linear models are easy to treat analytically and often give sufficiently good approximations when $x$ does not vary greatly.
- Linear models match physical systems in which individuals disperse independently of one another.
Both of the reasons are in play here. It is always better to start with a tractable model in order to understand the fundamental components of a problem. Moreover the ubiquity of linear transition rates in applications makes linear rates an important test case. Finally, snow leopards are famously solitary animals—“the only prolonged social contact in snow leopards is that of a female and her dependent offspring . . . no evidence was found to substantiate territoriality”2—so it is not unreasonable to start by modeling their dispersal as independent.
If each individual disperses independently of the other individuals, then:

$$ \lambda(x) = \lambda x $$

for some per capita rate $\lambda$. The per capita (per individual) rate is simply the rate at which any individual is expected to leave the valley.
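The linear form has a simple mechanistic reading: if each of $x$ leopards independently waits an exponential time with rate $\lambda$ before leaving, the time until the first departure is the minimum of $x$ exponentials, which is itself exponential with rate $\lambda x$. A quick check of this fact (an illustrative Python sketch, assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

per_capita_rate = 0.5  # rate at which any single leopard leaves
x = 4                  # leopards currently in the valley

# Each row is one trial: x independent exponential departure times.
# The first departure is the row minimum, which should be
# exponential with rate per_capita_rate * x.
first_departure = rng.exponential(1.0 / per_capita_rate,
                                  size=(100_000, x)).min(axis=1)

print(first_departure.mean())  # close to 1 / (0.5 * 4) = 0.5
```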
We are now equipped to state the question formally.
Problem
Suppose that the transition rate is linear in the number of leopards in the valley, the expected number of leopards in the valley before observing one leave is $\bar{x}$, and one is observed leaving. What is the new expectation?
Solution
If the expected number of leopards in the valley before the event was $\bar{x}$, and the variance in the number of leopards in the valley before the event was $v$, then the expected number after the event is $\bar{x} - 1 + v/\bar{x}$.
Proof
The proof is organized as follows. First we show that if a sufficiently small time window is chosen around the event time, then it can be assumed that only one event occurs during the time window. We then use the asymptotic form for the probability that one event occurred during the window to compute the conditional probability that there were $x$ leopards in the valley. We then average over this distribution to compute the expected number of leopards in the valley after one is seen leaving.
The joint probability of two events is the probability that they both occur. For example, the probability that a randomly drawn American is male and over $6$ feet tall is the joint probability of the event that the American is male and the event that the American is over $6$ feet tall. Joint probability is indicated with the intersection sign $\cap$. The $\cap$ can be read as "and":
$$\text{Pr}\{\text{male and over six feet} \} = \text{Pr}\{\text{male} \cap \text{over six feet}\}.$$
Conditional probability is the probability an event occurs given that another has occurred. For example, the probability an American man is over $6$ feet tall is the conditional probability that an American is over six feet tall given that they are male. Conditional probability is denoted with a vertical bar $|$. The vertical bar, $|$, can be read as "given that":
$$\text{Pr}\{\text{over six feet given male} \} = \text{Pr}\{\text{over six feet}|\text{male}\}.$$
Joint and conditional probabilities are related by the fact that the probability that two events occur, $A$ and $B$, is the probability $B$ occurs given $A$ occurs times the probability $A$ occurs:
$$\text{Pr}\{A \cap B\} = \text{Pr}\{B|A\} \text{Pr}\{A\}.$$
In our example, 14.5 percent of American men are over 6 feet and 49.2 percent of Americans are male. Therefore:
$$\text{Pr}\{\text{male} \cap \text{over six feet}\} = \text{Pr}\{\text{over six feet}|\text{male}\} \text{Pr}\{\text{male}\} = 0.145\times 0.492 = 0.071.$$
This equation can also be used to solve for conditional probabilities from joint probabilities:
$$\text{Pr}\{B | A\} = \frac{\text{Pr}\{B\cap A\}}{\text{Pr}\{A\}}.$$
We use this relation to solve for the conditional probability there were $x$ leopards given that one was observed crossing the pass.
Suppose that an event occurs at time $t$. Consider the time interval $[t - \Delta t, t + \Delta t]$ for small $\Delta t$. By assumption, the probability that more than one event occurs in the interval is $o(\Delta t)$. We condition on at least one such event occurring. The probability of at least one event occurring is proportional to $\Delta t$. This means that the probability of more than one event occurring, conditioned on an event occurring, is proportional to $o(\Delta t)/\Delta t$, which, by definition, converges to zero as $\Delta t$ goes to zero. Therefore, for sufficiently small $\Delta t$ we can assume that only one transition event occurred in the time interval.
Since one event occurred during the interval, $p(x,t+dt)$ is the conditional probability:

$$ p(x,t+dt) = \text{Pr}\{X(t+dt) = x \,|\, \text{one leopard left at time } t\}. $$

Or, equivalently:

$$ p(x,t+dt) = \text{Pr}\{X(t-dt) = x+1 \,|\, \text{one leopard left at time } t\}. $$

That is, the probability there are $x$ leopards in the valley after seeing the event is the probability that there were $x+1$ leopards in the valley before seeing the event, given that an event occurred. To compute this conditional probability we will use Bayes’ rule.
Bayes’ rule is a method for computing the conditional probability of an event based on observations of a different, but related, event. For example, suppose that you and I play a game of dice in which the high roller wins. Given that you win, what is the probability that you rolled a five?
Let’s first find the joint probability that you rolled a five and won. The probability that you rolled a five is $1/6$. The probability that you won with a five is $4/6$, since you cannot have won if I rolled a five or six. Therefore, the joint probability is:
$$ \text{Pr}\{\text{win} \cap \text{rolled a 5}\} = \text{Pr}\{\text{win} | \text{rolled a 5}\} \text{Pr}\{\text{rolled a 5}\} = \frac{4}{6} \times \frac{1}{6} = \frac{4}{36} $$
Here we have computed the joint probability in the direction of causality. There is a probability you rolled a five; given that you rolled a five, there is another probability that you won. Bayes’ rule works by going against causality. It is equally true that:
$$ \text{Pr}\{\text{win} \cap \text{rolled a 5}\} = \text{Pr}\{\text{rolled a 5} | \text{win}\} \text{Pr}\{\text{win}\} $$
Notice that $ \text{Pr}\{\text{rolled a 5} | \text{win}\} $ is the conditional probability we were originally looking for. Thus, if we divide across by the probability that you win, we get:
$$ \text{Pr}\{\text{rolled a 5} | \text{win}\} = \frac{\text{Pr}\{\text{win} \cap \text{rolled a 5}\}}{\text{Pr}\{\text{win}\}}. $$
We can then substitute in our original expression for the joint probability that appears in the numerator:
$$ \text{Pr}\{\text{rolled a 5} | \text{win}\} = \frac{\text{Pr}\{\text{win} | \text{rolled a 5}\} \text{Pr}\{\text{rolled a 5}\}}{\text{Pr}\{\text{win}\}} $$
This gives the conditional probability going backwards in terms of the conditional probability going forwards. We already know what the numerator is because we know the conditional probability going forwards: $4/36$.
To compute the denominator, you have to find the overall probability that you win, which is: [the probability you win and rolled a $1 $] $+$ [the probability you win and rolled a $2 $] $+$ [the probability you win and rolled a $3 $], and so on up to $6$. In mathematical terms:
$$ \begin{aligned} \text{Pr}\{\text{win} \} & = \sum_{x = 1}^{6} \text{Pr}\{\text{win} \cap \text{rolled a } x\} \\ & = \sum_{x = 1}^{6} \text{Pr}\{\text{win} | \text{rolled a } x\} \text{Pr}\{\text{rolled a } x\} \\ & = \sum_{x = 1}^{6} \frac{(x - 1)}{6} \frac{1}{6} = \frac{0 + 1 + 2 + 3 + 4 + 5}{36} = \frac{15}{36}. \end{aligned} $$
Putting it all together:
$$ \text{Pr}\{\text{rolled a 5} | \text{win}\} = \frac{\text{Pr}\{\text{win} | \text{rolled a 5}\} \text{Pr}\{\text{rolled a 5}\}}{\text{Pr}\{\text{win}\}} = \frac{4/36}{15/36} = \frac{4}{15}. $$
Note that the probability you rolled a five given that you won, $4/15$, is greater than the probability of rolling a five, $1/6$. This is because you are more likely to win if you roll a large number. Knowing that you won suggests you rolled a high number.
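Since the sample space here is just the 36 equally likely pairs of rolls, the whole calculation can be checked by enumeration (a sketch in Python; the names are ours):

```python
from fractions import Fraction

# All 36 equally likely (your roll, my roll) outcomes.
pairs = [(you, me) for you in range(1, 7) for me in range(1, 7)]
wins = [pair for pair in pairs if pair[0] > pair[1]]
wins_with_five = [pair for pair in wins if pair[0] == 5]

print(Fraction(len(wins), len(pairs)))           # Pr{win} = 15/36
print(Fraction(len(wins_with_five), len(wins)))  # Pr{rolled a 5 | win} = 4/15
```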
We will use this same technique to compute the conditional probability that there were $ x $ leopards in the valley given that one left from the conditional probability that one leaves if there were $ x $ leopards in the valley. For another example of Bayes’ rule in action check out our article on baseball.
Now, using the fact that $\text{Pr}\{B|A\} = \text{Pr}\{A \cap B\}/\text{Pr}\{A\}$:

$$ p(x,t+dt) = \frac{\text{Pr}\{X(t-dt) = x+1 \,\cap\, \text{one leopard left}\}}{\text{Pr}\{\text{one leopard left}\}}. $$
To compute the joint probability that there were $x+1$ leopards in the valley and one left, we use $\text{Pr}\{A \cap B\} = \text{Pr}\{B|A\}\,\text{Pr}\{A\}$.
The probability $\text{Pr}\{X(t-dt) = x+1\}$ is $p(x+1,t-dt)$ by definition. If $X(t-dt) = x+1$ then the probability one event occurred in the interval is $\lambda (x+1)\, 2\, dt + o(dt)$. Therefore:

$$ \text{Pr}\{X(t-dt) = x+1 \,\cap\, \text{one leopard left}\} = \lambda (x+1)\, p(x+1,t-dt)\, 2\, dt + o(dt). $$
This joint probability is the numerator in the conditional probability we are solving for.
For the denominator we need the probability that one leopard left. To do this, sum the expression given above over all possible $x$:

$$ \text{Pr}\{\text{one leopard left}\} = \sum_{x=0}^{\infty} \lambda (x+1)\, p(x+1,t-dt)\, 2\, dt + o(dt) = \lambda\, \bar{x}(t-dt)\, 2\, dt + o(dt). $$
Then, substituting the numerator and denominator in and simplifying:

$$ p(x,t+dt) = \frac{\lambda (x+1)\, p(x+1,t-dt)\, 2\, dt + o(dt)}{\lambda\, \bar{x}(t-dt)\, 2\, dt + o(dt)}. $$

To finish, take the limit as $dt$ goes to zero:

$$ p(x,t+dt) = \frac{(x+1)\, p(x+1,t-dt)}{\bar{x}(t-dt)}. $$
This is the probability that there are $x$ leopards in the valley given that one left the valley at time $t$. Here $dt$ represents an infinitesimally small time step and is retained to distinguish times immediately preceding and immediately following the transition event.
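In practice this update is one line of array arithmetic: shift the old distribution down by one, weight each entry by its old count, and renormalize by the old mean. A minimal sketch (Python with NumPy; the prior below is an arbitrary example, and the function name is ours):

```python
import numpy as np

def observe_departure(p):
    """Distribution over leopard counts after one is seen leaving,
    for a linear rate: q(x) = (x + 1) p(x + 1) / mean."""
    x = np.arange(len(p))
    mean = np.sum(x * p)
    q = np.zeros_like(p)
    q[:-1] = x[1:] * p[1:] / mean  # shift down by one and reweight
    return q

p = np.array([0.25, 0.5, 0.25])  # prior on 0, 1, or 2 leopards
q = observe_departure(p)
print(q)                         # posterior: [0.5, 0.5, 0.0]
print(np.sum(np.arange(3) * q))  # new expectation: 0.5
```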
A convenient way to think about this equation is that $\lambda(x+1)$ is the rate at which probability flows out of the state [there were $x+1$ leopards at time $t-dt$] and into the state [there are now $x$ leopards at time $t+dt$]. The rate of a probability flow is a probability flux, $J$. The product $\lambda(x+1)\,p(x+1,t-dt)$ is the probability flux $J(x+1)$. Therefore, $p(x,t+dt)$ is proportional to the distribution of probability fluxes $J$, normalized by the total flux and shifted down by one. The animation below shows the initial distribution of leopards transforming into the probability fluxes and then scaling and shifting to recover the distribution after the leopard left, $p(x,t+dt)$.
Now that we have the probability $p(x,t+dt)$ given that an event occurred at time $t$ we can compute the new expectation:

$$ \bar{x}(t+dt) = \sum_{x=0}^{\infty} x\, p(x,t+dt). $$

Substituting in for $p(x,t+dt)$ in terms of the old distribution:

$$ \bar{x}(t+dt) = \frac{1}{\bar{x}(t-dt)} \sum_{x=0}^{\infty} x\, (x+1)\, p(x+1,t-dt). $$

Let $y = x + 1$. Then:

$$ \bar{x}(t+dt) = \frac{1}{\bar{x}(t-dt)} \sum_{y=1}^{\infty} (y-1)\, y\, p(y,t-dt) = \frac{\mathbb{E}[X(t-dt)^2]}{\bar{x}(t-dt)} - 1. $$
To simplify the equation note that the second moment of a random variable (the expected value of its square) is the variance of the random variable plus the square of its mean:
$$ \begin{aligned} \mathbb{E}[X^2] & = \mathbb{E}[(X - \bar{x} + \bar{x})^2] = \mathbb{E}[(X - \bar{x})^2 + 2 (X - \bar{x})\bar{x} + \bar{x}^2] \\ & = \mathbb{E}[(X - \bar{x})^2] + 2 \mathbb{E}[X - \bar{x}]\bar{x} + \bar{x}^2 \\ & = \mathbb{E}[(X - \bar{x})^2] + 2 (\bar{x} - \bar{x})\bar{x} + \bar{x}^2 \\ & = \mathbb{E}[(X - \bar{x})^2] + 0 + \bar{x}^2 \\ & = v + \bar{x}^2. \end{aligned} $$
Thus, the expected value of $X$ squared equals the variance plus the mean squared.
Let $v(t-dt)$ denote the variance in $X(t-dt)$. Then $\mathbb{E}[X(t-dt)^2] = v(t-dt) + \bar{x}(t-dt)^2$. Therefore:

$$ \bar{x}(t+dt) = \bar{x}(t-dt) - 1 + \frac{v(t-dt)}{\bar{x}(t-dt)}. $$
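The formula is easy to verify by brute force: sample a population from any prior, condition on seeing a departure in a short window (which happens with probability proportional to the population size), and average. A Monte Carlo sketch (Python with NumPy; the binomial prior is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Prior: number of leopards is Binomial(10, 0.2), so mean 2, variance 1.6.
x = rng.binomial(10, 0.2, size=1_000_000)

# A departure is seen with probability proportional to x; accepting
# each sample with probability x / x.max() conditions on the event.
seen = rng.random(x.size) < x / x.max()

print((x[seen] - 1).mean())               # Monte Carlo answer
print(x.mean() - 1 + x.var() / x.mean())  # formula: mean - 1 + variance/mean
```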
Discussion
This equation is easy to interpret. The new expectation is the old expectation minus one leopard, since we saw a leopard leave, plus our uncertainty in the number of leopards. We add the uncertainty because seeing a leopard leave is evidence that there may have been more leopards in the valley than we’d thought. Notice that if we had no uncertainty, then we knew the number of leopards in the valley, so the new expectation is just the old expectation minus one.
Here uncertainty is measured by the variance divided by the mean. This ratio is the Fano factor, also called the index of dispersion. (It is not quite the coefficient of variation, which is the standard deviation divided by the mean.) The Fano factor is a natural measure of uncertainty in this context, since it measures the uncertainty relative to the mean. If we think leopards are rare, then the mean is small, and if we are very uncertain about the number of leopards, then the variance is large. This is precisely the case when observing a leopard should change our expectation the most. Accordingly the Fano factor is largest when we expect leopards to be rare, but we are very uncertain about the number of leopards. This occurs when the distribution is skewed positive.
So who was right? Our fictional (idealized wilderness) self or the conservationist?
It depends on the Fano factor. If the conservationist knew the Fano factor then she could answer exactly. It could be known empirically (by studying the population of leopards in many valleys), or could be computed if it is assumed that leopards are distributed according to a one-parameter family of distributions.
Sticking to our modeling approach, let’s see what happens if we pick a distribution. The natural first choice is a Poisson distribution, since many rare items are Poisson-distributed.
The Poisson distribution is the standard distribution used for modeling rare events and the occurrence of rare items. This is a consequence of the law of rare events, or Prekopa’s theorem. Prekopa’s theorem states that if a counting process $X(t)$ has independent increments ($ X(t_4) - X(t_3) $ is independent of $ X(t_2) - X(t_1) $ for $t_1 < t_2 \leq t_3 < t_4$), and the probability that an event occurs at exactly time $t$ is zero for all $t$, then the increments, $N([s,t]) = X(t) - X(s) $, are Poisson-distributed. For this reason the number of rare events occurring in a time interval is often Poisson-distributed.1
In this case we are modeling the number of leopards in a valley, which is not naturally described as a number of events occurring during a time interval. The Poisson distribution is still appropriate for this modeling problem, since the generic steady state distribution for a birth-death process is a product of Poisson distributions.3
A birth-death process is a stochastic process where birth and death events are modeled as time inhomogeneous Poisson processes whose rates depend on the number of leopards in the valley. This is exactly the type of model we used for the emigration events.
Suppose that the leopards are Poisson-distributed at time $t - dt$ with mean $\bar{x}(t-dt)$. This means that:

$$ p(x,t-dt) = \frac{\bar{x}(t-dt)^x}{x!} \exp(-\bar{x}(t-dt)). $$
Remarkably, the Poisson distribution has variance equal to its mean.
To compute the variance in the Poisson distribution start by computing its second moment:

$$ \mathbb{E}[X^2] = \sum_{x=0}^{\infty} x^2\, \frac{\bar{x}^x}{x!} \exp(-\bar{x}) = \bar{x}^2 + \bar{x}. $$
The variance of a distribution is its second moment minus the mean squared so:
$$ v = \mathbb{E}[(X - \bar{x})^2] = \mathbb{E}[X^2] - \bar{x}^2 = \bar{x}^2 + \bar{x} - \bar{x}^2. $$
Cancelling the repeated $\bar{x}^2$ leaves:
$$ v = \bar{x}. $$
Since the Fano factor is the variance divided by the mean, the Fano factor of the Poisson distribution equals one. But then:

$$ \bar{x}(t+dt) = \bar{x}(t-dt) - 1 + \frac{v(t-dt)}{\bar{x}(t-dt)} = \bar{x}(t-dt) - 1 + 1 = \bar{x}(t-dt). $$
The expected number of leopards after observing one leave is the same as the expected number before seeing one leave!
Even more provocatively, no matter how many times we see a leopard leave, our expectation does not change. That is, Poisson-distributed leopards are conserved in expectation.
We can go further: it is not only the expectation that stays the same. If $p(x,t-dt)$ is a Poisson distribution then $p(x,t+dt)$ is also Poisson with mean $\bar{x}(t+dt) = \bar{x}(t-dt)$. In this case not only is the expectation conserved, the entire distribution is conserved! Hence the observation event carries no information about the number of leopards in the valley. This is illustrated by the animation below.
Suppose that immediately before the event the leopards are Poisson-distributed. Then, immediately after observing a leopard leave:
$$ \begin{aligned} p(x,t+dt) & = \frac{(x+1)p(x+1,t-dt)}{\bar{x}(t-dt)} = \frac{x+1}{\bar{x}(t-dt)} \frac{\bar{x}(t-dt)^{x+1}}{(x+1)!} \exp(-\bar{x}(t-dt)) \\ & = \frac{\bar{x}(t-dt)^x}{x!} \exp(-\bar{x}(t-dt)) = p(x,t-dt). \end{aligned} $$
This means that if the leopards are Poisson-distributed before one is observed leaving, the distribution of leopards after observing one leave is unchanged.
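Numerically, the Poisson distribution is a fixed point of the observation update (a sketch assuming SciPy is available for the Poisson pmf; truncating the support is our shortcut):

```python
import numpy as np
from scipy.stats import poisson

mean = 2.0
x = np.arange(50)  # truncate the support; the tail mass is negligible
p = poisson.pmf(x, mean)

# Observation update for a linear rate: q(x) = (x + 1) p(x + 1) / mean.
q = np.zeros_like(p)
q[:-1] = x[1:] * p[1:] / np.sum(x * p)

print(np.abs(q[:-1] - p[:-1]).max())  # ~0: the distribution is unchanged
```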
In this case the conservationist has made the most consistent prediction. The expected number of leopards should stay the same even though one was observed leaving.
What about the expected number of leopards in the neighboring valley? If the expected number in our valley stayed the same, surely the expected number in the neighboring valley does as well?
This is not true. The expected number of leopards in the neighboring valley increases by one, exactly as we might have thought before doing all this math. How is that possible? The expected number in the neighboring valley increases by one, since the rate at which leopards enter a valley is independent of the number of leopards in that valley. It follows that seeing a leopard enter a valley tells us nothing about the number in the valley before the observation event. So all we need to do is add the new leopard to our previous expectation.
This is the key idea. When modifying an expectation to account for an observed event we need to ask: does the observation convey information about our original expectation? If it does, then we modify our old expectation before subtracting or adding the number of leopards entering or leaving. If it doesn’t, then there is no need to revise our expectation.
So, while the expected number of leopards in our valley is conserved the total number of expected leopards is not. Instead it increases by one every time we see a leopard walk out of our valley. An expected leopard has, in fact, entered the pair of valleys directly from the probabilistic ether!
Where did the extra expected leopard come from? How are we spontaneously producing an expected leopard?
The extra leopard was in plain sight all along. He was hidden in the possibility that there were more leopards than $\bar{x}(t-dt)$. When a transition event is observed this probability becomes the probability that there are more than $\bar{x} - 1$ leopards, which is the probability responsible for making $\bar{x}(t+dt)$ greater than $\bar{x}(t-dt) - 1$.
The sudden appearance of a new expected leopard seems strange, since it violates our intuition about the conservation of expectation. A leopard leaving a valley does not change the total number of leopards in the two valleys, but, with these assumptions, the expected number of leopards increases every time we see a leopard move between valleys. Taken in isolation this would mean that seeing the same leopard walk back and forth between the valleys would make our expected total number increase and increase and increase. That is obviously wrong.
The natural balance to this effect is that not seeing leopards is evidence that we should decrease our expectation. After all, if our hypothetical film crew waited a year, then they would conclude that leopards are rare, and if they had to wait a decade, then they might conclude that leopards are (at least locally) extinct. In general, the expected number of leopards should decay continuously in between observation events. Using the same modeling framework it is possible to show that this is, in fact, the case. Moreover, the rate at which the expectation decays between events balances the increase in expectation after each observation event.
Let’s put it all together. While we are waiting to see a leopard our expectation decays slowly. When we finally see one leave our valley we keep the expected number in our valley the same, but add a leopard to the expected number in the neighboring valley. On the other hand, if we see one enter our valley then we keep the expected number in the neighboring valley the same, and increase the expected number in our valley by one. Then we wait again. On the whole the process will keep our expectation near the true number of leopards. This is illustrated in the animation below.
A Little More
We provided the specific solution to the problem for a Poisson distribution above. While this is, arguably, the most relevant distribution for the discussion it is not the only distribution we could have picked.
Suppose that the leopards are geometrically distributed. This is the maximum entropy distribution supported on the natural numbers with mean $\bar{x}$, so it is the choice of distribution that incorporates the least side information about the distribution of leopards. The variance of the geometric distribution is equal to the mean squared minus the mean, so the Fano factor of a geometric distribution is equal to the mean minus one. The expected number after observing a leopard leave is then $2(\bar{x} - 1)$, giving the update rule $\bar{x}(t+dt) = 2(\bar{x}(t-dt)-1)$. The geometric distribution is highly skewed, hence the expected number after observing an individual leave nearly doubles!
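A numerical check of the geometric case (a sketch; we use the geometric distribution on $\{1, 2, \dots\}$ to match the text, and truncate the support where the tail mass is negligible):

```python
import numpy as np

mean = 3.0
p_success = 1.0 / mean
x = np.arange(400)

# Geometric prior on {1, 2, ...} with mean 3 (no mass at x = 0).
p = np.where(x >= 1, (1 - p_success) ** (x - 1.0) * p_success, 0.0)

# Observation update for a linear rate: q(x) = (x + 1) p(x + 1) / mean.
q = np.zeros_like(p)
q[:-1] = x[1:] * p[1:] / np.sum(x * p)

print(np.sum(x * q))  # posterior mean, close to 2 * (3 - 1) = 4
```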
All of the examples given so far are distributions with large variances. Large variance leads to large corrections to the expectation before subtracting off the vagrant leopard. In contrast, what would be the smallest possible correction to the expectation? This is the same as asking, what is the smallest the variance could possibly be given the mean?
It can be proved that, among all random variables supported on the integers with a given mean, there is a unique distribution with minimal variance. This distribution is nonzero only on the two integers given by rounding the mean down and rounding the mean up. The variance of this distribution is the distance from the mean to the mean rounded down times the distance from the mean to the mean rounded up.
This gives the update rule:
$$\bar{x}(t+dt) = \bar{x}(t-dt) - 1 + \frac{\left(\lceil\bar{x}(t-dt)\rceil - \bar{x}(t-dt) \right)\left(\bar{x}(t-dt) - \lfloor \bar{x}(t-dt) \rfloor \right)}{\bar{x}(t-dt)} $$
Note that this update rule keeps the corrections due to the uncertainty small (less than 0.25 divided by the mean). Also notice that if the mean is between zero and one this automatically sets the new expectation to zero. The minimum variance distribution for mean between zero and one assumes that the only possibilities are that there were either zero or one leopard before the transition was observed. Since the transition could not have been observed if there were zero leopards, there had to have been one in the valley originally, and it was seen leaving, so there are now zero.
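As code, the minimum variance update rule is a one-liner (a sketch; `math.ceil` and `math.floor` do the rounding):

```python
import math

def min_variance_update(xbar):
    """New expectation after seeing one leopard leave, when the prior
    is the minimum variance integer distribution with mean xbar."""
    v = (math.ceil(xbar) - xbar) * (xbar - math.floor(xbar))
    return xbar - 1 + v / xbar

print(min_variance_update(2.5))  # 2.5 - 1 + 0.25/2.5 = 1.6
print(min_variance_update(0.7))  # 0.0: at most one leopard, and it left
```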
Consider general $ \lambda(x) $ (not necessarily linear). Then, repeating the same calculation, the distribution after observing an event is:
$$ p(x,t+dt) = \frac{\lambda(x+1)p(x+1,t-dt)}{\bar{\lambda}(t-dt)} $$
where $\bar{\lambda}(t) = \mathbb{E}[\lambda(X(t))] $.
Then, using $\lambda(0) = 0$ (no leopard can leave an empty valley), which lets us extend the sum below to $y = 0$:
$$ \begin{aligned} \bar{x}(t+dt) & = \frac{1}{\bar{\lambda}(t-dt)}\sum_{x=0}^{\infty} x \lambda(x+1) p(x+1,t-dt) \\ & = \frac{1}{\bar{\lambda}(t-dt)}\sum_{y=1}^{\infty} (y-1) \lambda(y) p(y,t-dt) \\ & = \frac{1}{\bar{\lambda}(t-dt)}\sum_{y=0}^{\infty} (y-1) \lambda(y) p(y,t-dt) \\ & = \frac{1}{\bar{\lambda}(t-dt)} \left(\mathbb{E}[X(t-dt) \lambda(X(t-dt))] - \bar{\lambda}(t-dt) \right) \\ & = \frac{1}{\bar{\lambda}(t-dt)} \mathbb{E}[X(t-dt) \lambda(X(t-dt))] - 1. \end{aligned} $$
To simplify note that:
$$ \mathbb{E}[X(t-dt) \lambda(X(t-dt))] = \mathbb{E}[X(t-dt) (\lambda(X(t-dt)) - \bar{\lambda}(t-dt))] + \bar{\lambda}(t-dt) \bar{x}(t-dt) $$
and:
$$ \begin{aligned} & \mathbb{E}[X(t-dt) (\lambda(X(t-dt)) - \bar{\lambda}(t-dt))] \\ & = \mathbb{E}[(X(t-dt) - \bar{x}(t-dt)) (\lambda(X(t-dt)) - \bar{\lambda}(t-dt))] \\ & = \text{cov}[X(t-dt),\lambda(X(t-dt))]. \end{aligned} $$
Substituting in yields the general solution:
$$ \bar{x}(t+dt) = \bar{x}(t-dt) - 1 + \frac{\text{cov}[X(t-dt),\lambda(X(t-dt))]}{\bar{\lambda}(t-dt)}. $$
This recovers our original solution when $\lambda(x)$ is linear, since then the covariance is proportional to the variance in $x$. Also notice that this solution is invariant under scaling $\lambda(x)$ by any constant. This means that the way the expectation changes depends only on how $\lambda(x)$ scales in $x$, not the actual rate. Finally, notice that if $X$ is positively correlated with $\lambda(X)$ then observing an event increases our expectation before a leopard is removed, while if $X$ and $\lambda(X)$ are negatively correlated then it decreases our expectation.
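The covariance formula can be checked against a direct application of the general update for any rate function. A sketch with an arbitrary superlinear ("territorial") rate $\lambda(x) = x^2$ and a made-up prior (all names are ours):

```python
import numpy as np

def observe_event(p, lam):
    """Posterior over counts after seeing one departure, for a general
    rate function lam: q(x) = lam(x + 1) p(x + 1) / E[lam(X)]."""
    x = np.arange(len(p))
    q = np.zeros_like(p)
    q[:-1] = lam(x[1:]) * p[1:] / np.sum(lam(x) * p)
    return q

lam = lambda x: x.astype(float) ** 2  # superlinear rate; lam(0) = 0
p = np.array([0.2, 0.3, 0.3, 0.2])    # arbitrary prior on {0, 1, 2, 3}
x = np.arange(len(p))

xbar = np.sum(x * p)
lbar = np.sum(lam(x) * p)
cov = np.sum((x - xbar) * (lam(x) - lbar) * p)

print(np.sum(x * observe_event(p, lam)))  # direct posterior mean
print(xbar - 1 + cov / lbar)              # covariance formula: same value
```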
In all seriousness
This problem was motivated by a study of chemical signaling at the cellular scale. Cells signal each other by releasing signaling molecules, which diffuse through the inter-cellular medium and bind to receptors on other cells. The rate at which the receptors bind to the signaling molecule is proportional to the number of signaling molecules. The receptors play the same role as the camera traps in the previous examples. The signal released by the transmitting cell is encoded in the number of signaling molecules. The receiving cell receives this signal indirectly through observation events (i.e., binding events at receptors). How well could a receiving cell estimate the number of signaling molecules in solution based on occasional observations of binding events?
The example provided here shows that when the rate of observable events depends on the state of a hidden variable, observing an event carries information about the hidden variable, which should influence our expectation about the hidden variable.
More on Filming Leopards
Planet Earth II Documentary on Filming Leopards
1. Billingsley, Patrick. Probability and Measure. John Wiley & Sons, 2008. pp. 297–307.
2. Jackson, Rodney Malcolm. Home Range, Movements and Habitat Use of Snow Leopard (Uncia uncia) in Nepal. PhD diss., University of London, 1996. pp. 135–136.
3. Anderson, David F., and Thomas G. Kurtz. Stochastic Analysis of Biochemical Systems. Vol. 1. Berlin: Springer, 2015. p. 36.