## Counting Achievable Labelings: Canonical Shattering Coefficients

Verbally defining shattering coefficients seems straightforward at first glance:

Given a hypothesis class H, its nᵗʰ shattering coefficient, denoted Sₙ(H), represents the largest number of labelings achievable by classifiers in H on a sample of n feature vectors.

But what is a “*labeling*”? And what makes it “*achievable*”? Answering those questions will help us lay some groundwork in pursuit of a more formal definition.

In the context of binary classification, a **labeling** of a sample of feature vectors is simply any one of the ways we can assign values from the set { -1, 1 } to those vectors. As a very simple example, consider two one-dimensional feature vectors (i.e., points on a number line), *x*₁ = 1 and *x*₂ = 2.

The possible labelings are any combination of the classification values we can assign the individual feature vectors independent of one another. We can represent each labeling as a vector, where the first and second coordinate represent the values assigned to *x*₁ and *x*₂, respectively. The set of possible labelings is thus { (-1, -1), (-1, 1), (1, -1), (1, 1) }. Note that a sample of size 2 yields 2² = 4 possible labelings — we’ll see how this generalizes to arbitrarily-sized samples soon.
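This enumeration is easy to reproduce in code. Below is a minimal Python sketch (the `all_labelings` helper is our own naming, not a library function):

```python
from itertools import product

def all_labelings(n):
    """Enumerate every possible labeling of a sample of n feature vectors."""
    return list(product((-1, 1), repeat=n))

labelings = all_labelings(2)
print(labelings)       # [(-1, -1), (-1, 1), (1, -1), (1, 1)]
print(len(labelings))  # 4 = 2**2 labelings for a sample of size 2
```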

We say that **a labeling is achievable by a hypothesis class *H*** if there exists a classifier *h* ∈ *H* from which that labeling can result. Continuing with our simple example, suppose we are limited to classifiers of the form *x* ≥ *k*, *k* ∈ ℝ, that is, one-dimensional thresholds such that anything to the right of the threshold is positively classified. The labeling (1, -1) is not achievable by this hypothesis class: *x*₂ being greater than *x*₁ implies that any threshold that classifies *x*₁ positively must do the same for *x*₂. The set of achievable labelings is therefore { (-1, -1), (-1, 1), (1, 1) }.
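We can verify this computationally. The sketch below (our own illustrative code) sweeps a handful of representative thresholds over the sample *x*₁ = 1, *x*₂ = 2 and collects the labelings they produce:

```python
def threshold_labeling(k, xs):
    """Label each point 1 if x >= k, else -1 (a one-dimensional threshold)."""
    return tuple(1 if x >= k else -1 for x in xs)

xs = (1, 2)
# A small grid of thresholds suffices: between any two sample points the
# labeling is constant, so one k per "region" covers every possibility.
achievable = {threshold_labeling(k, xs) for k in (0, 1.5, 3)}
print(sorted(achievable))  # [(-1, -1), (-1, 1), (1, 1)] -- (1, -1) is missing
```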

Having understood the basic terminology, we can start to develop some notation to formally express elements of the verbal definition with which we started.

We stick to representing labelings as vectors as we did in our simple example, with each coordinate representing the classification value assigned to the corresponding feature vector. **There are 2ⁿ possible labelings in total:** there are two possible choices for each feature vector, and we can think of a labeling as a collection of *n* such choices, each made independently of the rest.

**If a hypothesis class *H* can achieve all possible labelings of a sample 𝒞ₙ** (i.e., if the number of *achievable* labelings of 𝒞ₙ equals 2ⁿ), **we say that *H* shatters 𝒞ₙ.** Finally, using the notation from above, we converge on a more rigorous definition of *Sₙ*(*H*):

Sₙ(H) = max over 𝒞ₙ = (x₁, …, xₙ) of |{ (h(x₁), …, h(xₙ)) : h ∈ H }|

In keeping with our explanation of shattering, *Sₙ*(*H*) equalling 2*ⁿ* implies that there exists a sample of size *n* that is shattered by *H*.
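To make the definition concrete, here is a short Python sketch (helper names are ours) that counts the achievable labelings of the one-dimensional threshold class on *n* distinct points; the count comes out to *n* + 1, which falls short of 2ⁿ for every *n* ≥ 2:

```python
def achievable_labelings(xs):
    """Labelings achievable on sample xs by thresholds of the form x >= k."""
    # Representative thresholds: below all points, between consecutive
    # points, and above all points -- one per region where labels are constant.
    pts = sorted(xs)
    ks = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return {tuple(1 if x >= k else -1 for x in xs) for k in ks}

# For n distinct 1-D points, thresholds achieve exactly n + 1 of 2**n labelings.
for n in range(1, 5):
    xs = list(range(n))  # any n distinct points give the maximum
    print(n, len(achievable_labelings(xs)), 2 ** n)
```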

## Estimating Hypothesis Class Expressiveness: Canonical VC Dimension

**The Vapnik–Chervonenkis (VC) dimension is a way to gauge the expressive power of a hypothesis class.** It’s based on the idea of shattering we just defined, and it plays an important role in helping us determine which hypothesis classes are PAC learnable and which aren’t.

Let’s begin by attempting to intuitively define the canonical VC dimension:

Given a hypothesis class H, its VC dimension, denoted VCdim(H), is defined to be the greatest natural number n for which there exists a sample of size n that is shattered by H.

Using *Sₙ*(*H*) enables us to express this much more cleanly and succinctly:

VCdim(H) = max{ n ∈ ℕ : Sₙ(H) = 2ⁿ }

However, this definition isn’t precise. Note that the set of numbers for which the shattering coefficient equals 2ⁿ may be infinite. (Consequently, it is possible that VCdim(*H*) = ∞.) If that’s the case, the set has no well-defined maximum. We address this by taking the supremum instead:

VCdim(H) = sup{ n ∈ ℕ : Sₙ(H) = 2ⁿ }

This rigorous and concise definition is the one we’ll use moving forward.
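For the one-dimensional threshold class from our running example, this definition is easy to evaluate: a single point can be shattered (S₁ = 2 = 2¹), but no sample of two or more points can be. Here is a tiny sketch, using the closed form Sₙ = n + 1 for thresholds and searching a finite range in place of the supremum:

```python
def shattering_coefficient(n):
    """S_n for 1-D thresholds x >= k: n + 1 for n distinct points."""
    return n + 1

# VCdim = sup { n : S_n = 2**n }.  For thresholds, n + 1 == 2**n only holds
# at n = 1 (for n >= 1), so a finite search range is safe here.
vc = max(n for n in range(1, 20) if shattering_coefficient(n) == 2 ** n)
print(vc)  # 1
```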

## Adding Preferences to the Mix: Strategic Shattering Coefficients

Generalizing the canonical notions we just went over so that they work in a strategic setup is fairly straightforward. Redefining shattering coefficients in terms of the *data point best response* we defined in the previous article is practically all we’ll have to do.

Given a hypothesis class H, a preference set R, and a cost function c, the nᵗʰ shattering coefficient of Sᴛʀᴀᴄ⟨H, R, c⟩, denoted σₙ(H, R, c), represents the largest number of labelings achievable by classifiers in H on a set of n potentially-manipulated feature vectors, i.e., n data point best responses.

As a reminder, this is how we defined the data point best response:

Δ(x, r; h) = argmax over z of [ 𝕀(h(z) = 1) ⋅ r − c(z; x) ]

We can tweak the notation we used in our discussion of canonical shattering coefficients to further formalize this:

σₙ(H, R, c) = max over {(x₁, r₁), …, (xₙ, rₙ)} of |{ (h(Δ(x₁, r₁; h)), …, h(Δ(xₙ, rₙ; h))) : h ∈ H }|

The main difference is that each *x* in the sample now has a corresponding *r*. Other than that, substituting the data point best response for the *x* we had in the canonical case works smoothly.
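To see the best response in action, here is a hedged Python sketch: the `best_response` helper, the finite grid of candidate positions, and the absolute-difference cost are all our own illustrative assumptions, not constructs from the paper:

```python
def best_response(x, r, h, candidates, cost):
    """Data point best response: pick z maximizing reward minus manipulation cost.

    Utility of moving from x to z under classifier h with preference r:
    (r if h(z) == 1 else 0) - cost(x, z).  Staying put (z = x) is always an option.
    """
    def utility(z):
        return (r if h(z) == 1 else 0) - cost(x, z)
    return max([x] + list(candidates), key=utility)

h = lambda z: 1 if z >= 2 else -1      # threshold classifier x >= 2
cost = lambda x, z: abs(z - x)         # illustrative linear manipulation cost
grid = [i / 10 for i in range(0, 41)]  # candidate positions in [0, 4]

print(best_response(1.5, 2.0, h, grid, cost))  # 2.0: reward 2 beats cost 0.5
print(best_response(1.5, 0.1, h, grid, cost))  # 1.5: crossing costs more than 0.1
```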

**As a quick sanity check, let’s consider what happens if R = { 0 }.** The realized reward term 𝕀(*h*(*z*) = 1) ⋅ *r* will be 0 across all the data points. **Maximizing utility thus becomes synonymous with minimizing cost.** The best way to minimize the cost incurred by a data point is trivial: **never manipulating its feature vector.** **Δ(*x*, *r*; *h*) ends up always just being *x*,** placing us firmly within the territory of canonical classification.

**It follows that σₙ(*H*, { 0 }, *c*) = *Sₙ*(*H*) for all *H*, *c*.** This is consistent with our observation that the impartial preference class represented by *R* = { 0 } is equivalent to canonical binary classification.
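The same sanity check can be reproduced in code. Again, the `best_response` helper, the candidate grid, and the absolute-difference cost are our own illustrative choices:

```python
# Sanity check: with r = 0 there is no reward for being classified positively,
# so the best response never manipulates, and strategic labelings coincide
# with canonical ones.

def best_response(x, r, h, candidates, cost):
    def utility(z):
        return (r if h(z) == 1 else 0) - cost(x, z)
    return max([x] + list(candidates), key=utility)

cost = lambda x, z: abs(z - x)
grid = [i / 10 for i in range(0, 41)]
xs = (1.0, 2.0, 3.0)

for k in (0.5, 1.5, 2.5, 3.5):
    h = lambda z, k=k: 1 if z >= k else -1
    moved = [best_response(x, 0, h, grid, cost) for x in xs]
    assert moved == list(xs)  # r = 0: every point stays put
    print(tuple(h(x) for x in moved))
```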

## Expressiveness with Preferences: Strategic VC Dimension (SVC)

Having defined the *n*ᵗʰ strategic shattering coefficient, **we can simply swap out the *Sₙ*(*H*) in the canonical definition of the VC dimension for *σₙ*(*H*, *R*, *c*):**

SVC(H, R, c) = sup{ n ∈ ℕ : σₙ(H, R, c) = 2ⁿ }

Based on the example we considered above, we find that SVC(*H*, { 0 }, *c*) = VCdim(*H*) for any *H*, *c*. Indeed, **SVC is to VCdim as the strategic shattering coefficient is to its canonical equivalent:** both are elegant generalizations of non-strategic concepts.

## From SVC to Strategic PAC Learnability: The Fundamental Theorem of Strategic Learning

**We can now use SVC to state the Fundamental Theorem of Strategic Learning,** which relates the complexity of a strategic classification problem to its (agnostic) PAC learnability.

A strategic classification instance Sᴛʀᴀᴄ⟨H, R, c⟩ is agnostic PAC learnable if and only if SVC(H, R, c) is finite. The sample complexity for strategic agnostic PAC learning is m(δ, ε) ≤ Cε⁻² ⋅ (SVC(H, R, c) + log(1/δ)), with C being a constant.
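To get a feel for the bound, we can plug in some numbers. The constant C is left unspecified by the theorem, so C = 1 below is purely illustrative:

```python
import math

def sample_complexity_bound(svc, delta, eps, C=1.0):
    """Upper bound m(δ, ε) <= C * ε⁻² * (SVC + log(1/δ)).

    C is an unspecified constant in the theorem; C = 1 is purely illustrative.
    """
    return C * eps ** -2 * (svc + math.log(1 / delta))

# Example: SVC = 3, confidence δ = 0.05, accuracy ε = 0.1.
print(round(sample_complexity_bound(3, 0.05, 0.1)))  # 600
```

Note how the bound scales: halving ε quadruples the sample requirement, while tightening δ only adds a logarithmic term.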

We won’t elaborate too much on how this can be proven. Suffice it to say that it boils down to a clever reduction to the (well-documented) Fundamental Theorem of *Statistical *Learning, which is essentially the non-strategic version of the theorem. If you’re mathematically inclined and interested in the nuts and bolts of the proof, you can find it in Appendix B of the paper.

**This theorem essentially completes our generalization of classic PAC learning to a strategic classification setting.** It shows that the way we defined SVC doesn’t just make sense in our heads; it actually works as a generalization of VCdim where it matters most. Armed with the Fundamental Theorem, we are well-equipped to analyze strategic classification problems much as we would any old binary classification problem. In my opinion, having the ability to determine whether a strategic problem is theoretically learnable or not is pretty incredible!