Legal analysis requires inferring causality from legal texts. We often seek to identify, for instance, factors stated in judicial opinions that were dispositive of case outcome. Recent advances in causality theory and natural language processing could automate this process, further allowing causal legal questions to be examined more rigorously. This paper introduces causal text inference techniques to a generalist legal audience and illustrates how they apply to law. I consider legal use cases for text-as-outcome, text-as-treatment, and text de-confounding methods, provide a detailed survey on text de-confounding methods, and identify a fundamental problem with inferring causal effects from judicial opinions. I further illustrate how causal text inference may be applied to a novel dataset of 7,046 Supreme Court certiorari petition briefs to examine whether petition origin causally affects certiorari grant rates. I show how covariate balancing on petition texts yields differing causal estimates depending on the text embedding, balancing scheme, and estimation technique used, suggesting that failing to account for text may yield spurious results.
I thank Jim Greiner and Kevin Quinn for helpful advice, David Costigan and Lily Lilliott for research support, and the United States Supreme Court for making petition briefs after 2017 freely accessible.
Some of the greatest legal debates center on causal inference. If indeed law is nothing more than “prophecies” of what courts will do in fact,1 then lawyers should presumably be interested in the factors that truly, causally determine, rather than superficially correlate to, judicial decisions. Likewise, the debate between realism and formalism might be understood as a disagreement on whether predictive but purely associative factors can properly be termed ‘law’, as well as whether certain outcome-causal factors are actually not ‘law’.2
Common law further assumes it possible for lawyers to extract these outcome-causal factors by reading preceding judicial opinions. Were opinions express and complete statements of all such factors, this would be a trivial comprehension exercise. But if expressivity and completeness characterized most opinions,3 the attention (and financial reward) attributed to artful legal analysis would hardly be justified. Law schools devote significant attention, particularly in first year curricula, to teaching students to analyze what opinions say. Legal analysis may thus be thought of, at least partly, as an exercise in drawing causal inferences from legal texts.
The law’s persistent unfamiliarity with causality theory4 has thus been described as both surprising and unhealthy.5 Much of this criticism concerns the use of statistical techniques, particularly regression methods, to answer causal questions in the civil rights context (i.e. did the impugned law cause the disparate impact observed?).6 Others have considered the implications of causality theory for the assessment of epidemiological evidence in courtrooms, the success of criminal justice intervention programs, and the effectiveness of access to justice initiatives.7 However, as the critique concerns methodology, it readily applies to the use of statistical techniques in legal studies generally.8 Meanwhile, causal inference theory in the hard and social sciences has, in recent decades, undergone significant transformation.9 The gap between legal scholarship and modern causality theory, therefore, seems poised to widen.
Seeking to bridge this gap, this paper explores the implications of causality theory for answering causal questions in legal analysis. To clarify, I am not referring to the legal doctrine of causation which generally concerns whether the defendant’s breach of duty is a “legal cause” of the plaintiff’s injury.10 Rather, I am interested in a recurrent question in legal analysis that can be expressed with the template: does a given factor T causally affect a legal outcome Y?
My interest here is further motivated by recent advances in empirical techniques for drawing causal inference from text data.11 These techniques, which I refer to below as “causal text methods”, allow causal effects to be identified in settings where text data is to be included as a key variable. To the extent that extracting causal effects from text is central to legal analysis, causal text methods have profound implications for lawyers.
Part II begins with a theoretical primer on causal inference theory. Part III provides background on causal text methods, in the process considering illustrative legal questions that each method may apply to. I focus on text de-confounding, a technique that attempts to hold texts constant across treatment and control settings, as that appears to have the broadest potential for legal application. In Part IV, I discuss challenges with applying text methods to the legal domain specifically and touch briefly on an important statistical limitation with inferring causality from opinions. Part V demonstrates how causal text methods might be applied to study how the Supreme Court grants certiorari and presents illustrative, albeit preliminary, findings on whether case origin matters.
This Part covers the foundations of causal inference theory. Readers already familiar may wish to skip ahead.
Consider the following questions common in legal scholarship and practice:
(1) Does law and economics training for judges affect how they decide cases?12
(2) How does bench composition affect judicial decisions?13
(3) What determines whether my client will be liable for injuries arising from a traffic accident in which he/she was involved?
(4) Would pleading guilty reduce my client’s sentence?
(5) Would a longer or better written brief increase my client’s odds?
These questions can be reduced to or restated as questions of causality. We want to know whether a different pre-existing state of affairs leads to a different outcome and, if it does, how different that outcome might be. Further, we are often interested in isolating the outcome effect of one particular attribute of that pre-existing world. This leads to the following question: would a pre-existing world with that attribute yield a different outcome from one without it?14
Causality theorists refer to the attribute of interest as the treatment variable, commonly denoted T, and to the outcome of interest as the outcome variable, denoted Y.15 To illustrate, the ‘treatment’ for Question (4) above is pleading guilty while the outcome is sentence. For simplicity, assume the client can only choose whether to plead guilty or not (i.e. there are no partial choices like pleading guilty to some charges or an intermediate bargain), and the only possible sentence is a jail term ranging from zero to three months. We can express the treatment as a binary: T=1 if the client pleads guilty, and T=0 otherwise.16 Y is then a continuous variable ranging from 0 to 3 months. Accordingly, Y(T=1) refers to the outcome jail term when the client pleads guilty, and Y(T=0) the jail term when the client does not.
It appears, then, that we may simply take the difference between Y(T=1) and Y(T=0) as the causal effect of T on Y. This, indeed, is the definition of the causal effect of T for that particular client (referred to as a single “unit” of analysis or one “observation”). In addition to this unit treatment effect, we might also be interested in the average treatment effect (“ATE”) of a given policy on groups of individuals, derived by subtracting the group average Y(T=0) from the average Y(T=1).
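In symbols, the two quantities just described can be written as follows. The unit index i and the group size N are notation introduced here for illustration only:

```latex
\tau_i = Y_i(T=1) - Y_i(T=0),
\qquad
\mathrm{ATE} = \frac{1}{N}\sum_{i=1}^{N} Y_i(T=1) \;-\; \frac{1}{N}\sum_{i=1}^{N} Y_i(T=0)
```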
In practice, however, this cannot be done. It is physically impossible to observe both Y(T=1) and Y(T=0) at the same time. Unlike Schrödinger’s cat, the client may not at once plead both guilty and not guilty. Holland famously called this the Fundamental Problem of Causal Inference: for a given unit, we can only see either the treated or non-treated outcome, never both.17 This reveals that causality is fundamentally, and inevitably, a missing data problem. To infer causality, we need a mechanism that reliably provides information on what the unobserved outcomes would potentially be if the treatment status changed.18
This insight underlies Imbens and Rubin’s influential “potential outcomes” framework.19 To work around the impossibility of observing both outcomes for one unit, we use observations on multiple units, some for whom we observe Y(T=1) (the “treatment group”), and others for whom we observe Y(T=0) (the “control group”).20 If the two groups are, loosely speaking, comparable, then observed outcomes for one can be used to ‘fill in’ missing potential outcomes for the other, that is, to estimate the control group’s potential outcome had they received the treatment, and vice-versa. The control (treatment) outcomes for the treatment (control) group are conventionally denoted with YT(T=0) and YC(T=1) respectively, with the subscripts identifying which observed group the potential outcomes apply to. The comparability requirement can then be expressed as requiring YC(T=0), which we observe, being representative of YT(T=0), which we do not, and vice-versa for YT(T=1) and YC(T=1).
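The 'fill in' logic can be made concrete with a minimal Python sketch. The six observations below are invented for illustration, and the function name `difference_in_means` is my own. The estimate carries a causal reading only if the two groups are in fact comparable.

```python
from statistics import mean

# Hypothetical observations: (pleaded_guilty, jail_term_in_months).
# Each unit reveals only one potential outcome; the other is missing.
observations = [
    (1, 1.0), (1, 1.5), (1, 0.5),   # treatment group: Y(T=1) observed
    (0, 2.0), (0, 3.0), (0, 2.5),   # control group:   Y(T=0) observed
]

def difference_in_means(data):
    """Estimate the ATE as mean observed Y(T=1) minus mean observed Y(T=0)."""
    treated = [y for t, y in data if t == 1]
    control = [y for t, y in data if t == 0]
    return mean(treated) - mean(control)

ate_hat = difference_in_means(observations)
print(ate_hat)  # negative values mean shorter sentences for guilty pleas
```

On these made-up numbers the estimate is a reduction of one and a half months. Nothing in the calculation itself tells us whether the groups were comparable.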
What I loosely termed ‘comparability’ and ‘representativeness’ is formalized in causality theory under two assumptions, both of which must be fulfilled (or reasonably assumed) for legitimate causal inference: (1) the Stable Unit Treatment Value Assumption (“SUTVA”), and (2) unconfoundedness.21 SUTVA requires both that (A) potential outcomes for any unit do not vary with treatment assignment to other units, and (B) there is only one form of the treatment and control.22 Requirement (A) precludes interference across treatment and control groups. To illustrate, suppose two defendants DT and DC were charged with an identical crime. DT pleads guilty and gets 1 month in prison. DC does not and gets 3. We might expect the two months’ difference in sentence to be the ‘benefit’ of DT pleading guilty. But if DT and DC were, in fact, partners in the same crime (which would explain the identical charges), then DT’s guilty plea might have contributed to DC’s longer sentence, and 2 months would probably overestimate the causal effect of DT’s plea. Requirement (B) mirrors the earlier simplifying assumption that there are no ‘intermediate’ levels of pleading guilty. If there were, we would be wrongly comparing units which experienced different levels of treatment as if they had or had not received the same treatment.
While SUTVA relates primarily to the treatment variable T, unconfoundedness relates to all others (though we will see that it retraces to T). In taking the difference between Y(T=1) and Y(T=0), the two pre-existing worlds we are really interested in comparing are the worlds where T=1 and where T=0. In prior exposition I have implicitly assumed that all other attributes of those worlds remained constant, allowing us to isolate T’s effect. If we compared worlds where T and another variable X both differ, we could not tell if any resultant change in outcome is due to the variation in T or in X, unless we know a priori that X does not affect the outcome. This leads us to a definition of “confounders” (i.e. variables that confound identification of the causal effect of T on Y) that those trained in regression methods may find familiar: a confounder is a variable which varies with both T and Y.23 Confounders must be held constant across the compared worlds for legitimate causal inference.
Denoting all confounders in a given pre-existing world as Z and their values as z, what we want, therefore, is to compare Y(T=1, Z=z) with Y(T=0, Z=z). Observe that it is unnecessary to hold constant every factor in the universe. Causal inference could be legitimate as long as treatment assignment is independent of potential outcomes. This is termed the “unconfoundedness” assumption,24 and would explain why the Randomized Controlled Trial (“RCT”), where treatment is randomly assigned to all units, is the gold standard for causal inference.25 Random treatment assignment lets us safely expect confounders to be evenly distributed across treatment and control groups, such that they effectively do not differ across the worlds we want to compare. To be sure, this is not guaranteed, particularly for small sample sizes. Even if we randomly required either DT or DC to plead guilty, they would probably still differ in sentence-material aspects such as age.
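To see why comparing Y(T=1, Z=z) with Y(T=0, Z=z) matters, consider the following Python sketch. It assumes a single binary confounder (a prior record) and uses invented precedents in which pleading guilty reduces the sentence by exactly one month within each stratum. The helper names are hypothetical.

```python
from statistics import mean

# Hypothetical precedents as (z, t, y): z = 1 if the defendant has a prior
# record (the confounder), t = 1 if the defendant pleaded guilty, and y is the
# jail term in months. Defendants without a record plead guilty more often.
precedents = (
    [(0, 1, 1.0)] * 3 + [(0, 0, 2.0)] * 1 +
    [(1, 1, 2.0)] * 1 + [(1, 0, 3.0)] * 3
)

def diff_in_means(rows):
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def stratified_effect(rows):
    """Compare treated and control units within each level of z, then average
    the within-stratum differences weighted by stratum size. Every stratum
    must contain both treated and control units."""
    total = len(rows)
    effect = 0.0
    for z in sorted({z for z, _, _ in rows}):
        stratum = [r for r in rows if r[0] == z]
        effect += len(stratum) / total * diff_in_means(stratum)
    return effect

naive = diff_in_means(precedents)         # ignores the confounder
adjusted = stratified_effect(precedents)  # holds the confounder constant
print(naive, adjusted)
```

The naive comparison reports a reduction of 1.5 months and the stratified comparison reports 1 month. The naive figure is inflated because guilty pleas are concentrated among defendants who would have received shorter sentences anyway.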
“Treatment” connotes how causality requires performing some intervention to create the alternative, counterfactual world used for comparison. Pearl likens this to a “surgical procedure” in which we carefully intervene to set the treatment variable equal to its counterfactual value, leaving all else equal.26 This indeed underpins the familiar legal doctrine of ‘but for’ causation where we examine if, but for the defendant’s breach of duty, the damage would still have occurred.27
Legal causation is, in turn, one type of causal question that can be asked. Pearl and Mackenzie enumerate three levels of the so-called “Ladder of Causation”.28 At the bottom are questions of association, which relate to observation. Here, we are interested in informing our beliefs on Y upon seeing that T is true or not true. The second level comprises questions of intervention, which relate to doing. We want to know how doing or not doing T affects Y. Atop the ladder sit questions of counterfactuals, which relate to imagination. We want to know what Y would have been had T been different, and this can only be achieved in a hypothetical, alternative history. ‘But for’ causation, evidently, is a counterfactual question.
Identifying where a causal question sits on the Ladder is crucial because each level necessitates different approaches and tools. Most significantly, questions at a given level require inputs at or above that level.29 For instance, interventional questions minimally require interventional input, but can also be answered with counterfactual input (though empirically observing counterfactuals is, as the Fundamental Problem of Causal Inference reminds us, impossible).
To illustrate the subtle difference between levels one and two, consider the running question of whether pleading guilty reduces a client’s sentence. This is an interventional question because we want to know how doing the act of pleading guilty affects sentence. But lawyers seem to treat this as an associational question. The common lawyer’s strategy for answering it would probably be to parse precedents for how sentence length historically varied across guilty pleas.30 Strictly speaking, however, such observational data alone can only indicate whether we should expect different sentence outcomes when we observe (i.e. we are told that) a given defendant pleaded/did not plead guilty. Whether pleading guilty reduces a present client’s sentence is different “because it involves not just seeing but changing what is”.31 The reason for this difference was, in fact, explained when we considered confounders. There may have been some other factor X in the precedents parsed that (1) also varied with pleading guilty and (2) was causing differences in sentence. If so, our present client pleading guilty may not cause a reduction in sentence.32
Of course, to say that lawyers merely measure empirical correlations in precedents caricatures the legal method as it overlooks the crucial and complementary role of qualitative, doctrinal analysis. Most lawyers, or at least those worth their billables, know that precedents relied on must be comparable to the client’s case and, by transitivity, to each other. Doctrinal analysis further identifies variables which are a priori unlikely to affect the outcome (that is, unlikely to confound). To illustrate, contract doctrine might suggest that two precedents interpreting an identical clause offer safe comparison, even if the disputes are set in different states, provided the two have similar contract interpretation doctrines. But it is probably unwise to compare two precedents interpreting an identical state constitution article if they come from different states.
In the vocabulary of causal inference, the legal method can be seen as an attempt to test observational legal data against what causality theorists often refer to as a ‘causal model’.33 Causal models are hypothesized relations, specified on the basis of prior domain knowledge, between the outcome, treatment, and other variables we are studying.34 Such models are the critical interventional inputs that allow us to identify interventional causal effects, where possible.35 If our causal model hypothesizes that T does affect Y but its effect is confounded by a single variable X and nothing else, then data on T, Y, and X is necessary but also sufficient for deriving causality. If we did not have data on X (or some variable which adequately proxies for it), or if X is impossible to measure or observe, then we also know it is impossible to identify causality, regardless of how sophisticated our statistical arsenal may get.
Once legal analysis is understood, at least partly, as an exercise in causal reasoning with legal texts, the promise that causal text methods hold for legal analysis becomes clear. Causal text inference extends an established, but growing, body of computational social science research that treats text as a form of quantitative data.36 Such “text-as-data” techniques draw on fields such as corpus linguistics and the digital humanities.37 There is a growing body of scholarship applying these techniques to law,38 though to my knowledge none has directly considered causal legal text inference.39
Put simply, causal text methods ‘plug-in’ the text as one of the three key variables in the causal framework: outcome Y, treatment T, or confounder(s) Z. We might understand the guilty plea question above as a case of text confounding: we want to know the causal effect of pleading guilty on sentence, but are concerned that other factors in preceding cases, as written in judicial opinions, confound identification. Since opinion texts are observed, we could simply control for them by providing them as inputs into the statistical model. Conceptually, we are then examining how sentence lengths differ between two precedents with identical judgments, except that the defendant in only one of them pleaded guilty.
Things are, of course, not so simple. While straightforward at a broad conceptual level, causal text inference quickly runs into theoretical and practical obstacles, both generally and when applied to law specifically.40 Foremost, we cannot mathematically operate on text. The first step must therefore be to convert text to numbers or, in math-speak, to “encode” or “embed” text into a numerical space. Yet, as we shall see, the choice of text representation has significant implications for downstream inference because information in the text could be lost or mutated in translation.41 Further, text-as-data does not obviate, and may in fact exacerbate, the pitfalls of statistical inference.42
In this light, this Part introduces causal text methods generally and highlights potential legal applications. I focus on text de-confounding because it arguably has the greatest scope for legal application.
Text-as-data first requires converting text to data. The method for doing so is canonically known as the “codebook function”, commonly denoted g, which can be broadly conceived of as an arbitrary algorithm that takes text as input and produces numbers as output.43 g derives its name from the physical books that empirical researchers use to specify how data should be encoded. Trusty research assistants hired to parse empirical data out of legal judgments are thus human examples of codebook functions. Manual codebooks are, of course, nothing new. But they are expensive to apply, especially if legal experts (though perhaps not law students) must be hired to do the coding.
One of text-as-data’s primary contributions lies in exploring how automatic content coding can, perhaps imperfectly, substitute for laborious human effort.44 These techniques typically originate from the information and computer sciences, particularly a branch thereof known as natural language processing (“NLP”). Because computers can only perform numerical operations, NLP has developed (independently of causal inference) an array of methods for representing text as numbers. Numerical text representations are broadly known as “text embeddings”. This sub-Part provides background on automatic content coding to facilitate subsequent discussion of causal text methods generally. Readers familiar with this may skip ahead to Part III.B.
An illustrative and widely-used method for embedding documents is a family of encoding techniques built on the Bag-of-Words (“BOW”) model of text which, as its name suggests, conceives of documents simply as bags of words and/or phrases.45 For example, a document with only the words “this is a document” is seen as nothing more than the sum of its parts: “this”, “is”, “a”, and “document”. Documents can then be converted, say, into binary variables that take the value 1 if a given word appears in the document and 0 otherwise. Table 1 illustrates:
Table 1. Example Bag-of-Words Encodings
This produces “document-term matrices”, or tables of numbers capturing the preponderance of each term in each document. Notice that such matrices tend to be extremely high-dimensional (there are many columns), yet sparse (many cells have ‘0’ values). Every word in the corpus requires one column to itself, even if it only appears once in one document. Further, words that appear in every document produce columns with little variation, limiting their utility for downstream analysis. Both high dimensionality and sparsity hinder statistical inference.46
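The binary encoding described above can be sketched in a few lines of Python. The three-document corpus is hypothetical and is not the corpus behind Table 1. Splitting on whitespace is a simplifying assumption, and the function name is my own.

```python
# Hypothetical three-document corpus; not the documents behind Table 1.
corpus = [
    "this is a document",
    "this is a contract",
    "a contract is a promise",
]

def binary_document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if vocabulary[j]
    appears in docs[i] and 0 otherwise."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        present = set(tokens)
        matrix.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, matrix

vocabulary, dtm = binary_document_term_matrix(corpus)
print(vocabulary)
for row in dtm:
    print(row)
```

Even in this tiny corpus, every distinct word claims its own column. The columns for “a” and “is” are all ones and so carry no information for telling the documents apart.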
It is therefore standard NLP practice to, amongst other things, remove non-informative words (called stop words) from the corpus before embedding it. Standard stop word lists typically include what lawyers may call “glue words”47 – syntactic sugar such as “the”, “of”, and “and”.48 Further, term frequency scores are often normalized by document frequency scores, capturing the intuition that documents are best characterized by words common within them, but rare in others.49 Such a technique is known as the Term Frequency/Inverse Document Frequency algorithm, or “TFIDF”.50
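A minimal sketch of these two steps follows, assuming one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency). Other weighting variants exist. The stop word list and corpus are illustrative only.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "is", "the", "of", "and", "this"}  # illustrative list only

corpus = [
    "this is a document",
    "this is a contract",
    "a contract is a promise",
]

def tfidf(docs, stop_words=STOP_WORDS):
    """Drop stop words, then score each term in each document by
    (count / document length) * log(N / number of documents containing it)."""
    tokenized = [[w for w in doc.lower().split() if w not in stop_words]
                 for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tfidf(corpus)
for doc_scores in scores:
    print(doc_scores)
```

In the third document, “promise” outscores “contract” because “contract” also appears in the second document and is therefore less distinctive.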
To further reduce dimensionality, statistical algorithms are often applied to compress document-term matrices into document-topic matrices. Consider an arbitrary compression algorithm that attempts to optimally compress the document-term matrices above from 7 to 3 columns. To minimize information loss (the hallmark of optimal compression), we might expect the algorithm to group similar words together. For simplicity, let us leave aside the technical complexity of defining what “similarity” means, and assume the algorithm understands language like we do. It might yield the following document-topic matrix:
Table 2. Example Document-Topic Matrix
Each column now represents a cluster of words, instead of just one. These word clusters are known in NLP literature as “topics”. The entire algorithm is accordingly a “topic model”. Notice that, in this stylized example, document 1’s embedding is now indistinguishable from document 2’s, while the two documents are distinguishable from document 3 on dimensions (i.e. topics) 1 and 3. These numbers were, of course, cherry-picked to illustrate that the compression, though not lossless, could nonetheless yield informative signals.
Importantly, topic models also report the numerical contribution of each term in the vocabulary to that topic. Such topic-term distributions may look as follows:
Table 3. Example Topic-Term Distribution
The above numbers are arbitrary. The mathematics behind how they are derived is rather involved and will not be presented here. In any event, going into details here would be unhelpful because the computations differ depending on the specific topic model deployed. An entire family of topic models exists, with members that start from different document-term matrices and use different compression algorithms.51 The discussion below will assume only general knowledge of topic models as algorithms that transform text inputs into (1) document-topic matrices reflecting the preponderance of a given ‘topic’ across the corpus, and (2) topic-term distributions reflecting how much a given term characterizes a given topic. I explain further details on topic models where appropriate. Part V will also illustrate the results of Latent Semantic Analysis (“LSA”), a classic topic model, on an actual legal corpus.52 The main point here is that both naïve encoding techniques and the more sophisticated topic models are possible automated codebooks.53
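For readers who want a feel for the two outputs, the sketch below applies a truncated singular value decomposition, the compression step commonly associated with LSA, to a small hypothetical count matrix using NumPy. It is not the pipeline used in Part V. The matrix, term list, and function name are assumptions made for illustration.

```python
import numpy as np

# Hypothetical term-count matrix: rows are documents, columns are terms.
terms = ["contract", "breach", "damages", "sentence", "plea", "guilty"]
doc_term = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 2, 1],
], dtype=float)

def lsa(matrix, n_topics):
    """Truncated SVD: keep the n_topics largest singular values.

    Returns (document-topic matrix, topic-term matrix)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    doc_topic = u[:, :n_topics] * s[:n_topics]   # one row per document
    topic_term = vt[:n_topics, :]                # one row per topic
    return doc_topic, topic_term

doc_topic, topic_term = lsa(doc_term, n_topics=2)
print(np.round(doc_topic, 2))
print(np.round(topic_term, 2))
```

With two topics, the first two documents receive the same embedding even though their word counts differ. This is the kind of lossy but informative compression described above. The signs of the columns are arbitrary, so only relative positions should be interpreted.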
Because more sophisticated codebook functions have been used in causal text methods, brief mention must also be made here of two additional techniques. First, word vector algorithms54 embed individual words into a high dimensional space by statistically modelling collocations between them.55 Famously, word vectors were shown to permit semantically logical arithmetic: subtracting the vector for “man” from that for “king” and adding the vector for “woman” yields a vector close to that for “queen”.56 Once each document word is embedded, computing an overall document embedding can be as simple as taking the sum or average across all words though, once again, more sophisticated techniques exist.57
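The averaging approach mentioned last can be sketched as follows. The three-dimensional vectors are hand-set for illustration and do not come from any trained model. Real word vectors typically have hundreds of dimensions.

```python
# Hypothetical 3-dimensional word vectors, hand-set for illustration only.
word_vectors = {
    "court":  [0.9, 0.1, 0.0],
    "grants": [0.2, 0.8, 0.1],
    "appeal": [0.7, 0.3, 0.2],
}

def document_embedding(text, vectors):
    """Average the vectors of all in-vocabulary words; skip unknown words."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        raise ValueError("no in-vocabulary words in document")
    dims = len(found[0])
    return [sum(vec[d] for vec in found) / len(found) for d in range(dims)]

embedding = document_embedding("Court grants the appeal", word_vectors)
print(embedding)
```

Words missing from the vocabulary, such as “the” here, are silently dropped. This is one way information in the text can be lost during encoding.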
Another recent breakthrough involves embedding documents by training neural networks to predict arbitrarily-removed words in a document, an activity reminiscent of the cloze passages that children use to learn languages.58 These “language models” have quickly become the new state-of-the-art in NLP, rewriting performance benchmarks across diverse tasks including summarization, translation, and question answering.59 The cloze task (and other downstream optimizations) forces the neural network to create informative text representations within its internal layers.60 In particular, studies on the “Bidirectional Encoder Representations from Transformers” model, an influential language model often called “BERT”, note that its internal layers seem to capture syntactic properties of language.61 The trained neural network can then be used to encode text.
As with topic models, I shall not explain word vectors and language models here in detail. It suffices to note that these are likewise possible automated codebooks. Recall that a codebook function is anything which translates texts to computable numbers. Further, different codebook algorithms make different assumptions about the text they process, thus raising different technical considerations.62 The next few sub-Parts explain how encoded text can be used as outcome, treatment, and finally de-confounding variables.
To adapt question (1) in Part II, we might want to know how law and economics training affects judicial writing. A simple approach to answering this would be to narrow down the question as follows: do trained judges start using terms like “efficiency” and “transaction costs” more in their judgments? We might then represent judgments entirely by how frequently each term of interest occurs. Separate statistical models can then be used to test for significant changes in term use before/after training. A more sophisticated approach might compute the total difference across all these terms at once by performing linear algebra on the entire set of term frequencies.63
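The simple approach can be sketched as below, assuming judgments are available as plain strings. The four judgments are invented and the function names are hypothetical. A raw before/after difference is only descriptive and would still need the causal assumptions from Part II to be read as a treatment effect.

```python
from statistics import mean

TERMS = ["efficiency", "transaction costs"]  # terms of interest from the text

def rate_per_1000_words(text, term):
    """Occurrences of `term` per 1,000 words of `text` (case-insensitive).
    Uses substring counting, so "inefficiency" would also match "efficiency"."""
    lowered = text.lower()
    n_words = len(lowered.split())
    return 1000 * lowered.count(term) / n_words if n_words else 0.0

# Hypothetical judgments written before and after training.
before = ["The contract was breached and damages are owed.",
          "The defendant acted unreasonably."]
after = ["Efficiency favors enforcing the contract as written.",
         "High transaction costs justify the default rule; efficiency requires it."]

def mean_change(term):
    """Mean usage rate after training minus mean usage rate before."""
    return (mean(rate_per_1000_words(t, term) for t in after)
            - mean(rate_per_1000_words(t, term) for t in before))

changes = {term: mean_change(term) for term in TERMS}
print(changes)
```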
The example above provides an intuitive illustration of treating text-as-outcome. Notice, however, that it forces us to answer a narrower question which might not map perfectly onto our true question of interest. Moreover, results in the given example would be sensitive to the terms we choose to look for. This may work for law and economics, which has a clearly defined set of technical terms, but might not generalize to other contexts.
Egami et al. use an example with legal flavor to demonstrate a more general text-as-outcome framework.64 They adapt Cohen et al.’s study of whether descriptions of past criminal behavior affect others’ evaluations of how severely an accused should be punished for a subsequent crime.65 Experimental subjects were asked to read descriptions of an accused’s crime. The control group was only given the crime description. The treatment group was given an identical crime description, followed by a description of prior crimes. Both were then asked (1) whether the accused should be jailed, and (2) to describe in at least two sentences why. The response texts generated by the latter question were taken as the outcome of interest and encoded with a topic model.66 To illustrate, one of the topics derived was characterized by words like “deport”, “think”, “prison”, “crime”, and “imprison”. Egami et al. interpreted this as a topic about deportation, so that a high preponderance of this topic in a response suggested the subject considered deportation as a suitable punishment. Other topics were interpreted to concern other punishments like incarceration.67 Egami et al. then computed the ATE as the average difference in topic scores across the treatment and control groups and found that including criminal histories “significantly increases the likelihood that the respondent advocates for more severe punishment or deportation”.68
Notice that the previous experiment was also an example of text-as-treatment: treatment group subjects were given a different text prompt from control group subjects. In that experiment, we were solely interested in one aspect of those prompts – whether it described the defendant’s antecedents. A trivial codebook was implicitly used: we code the text 1 if it contains such a description, and 0 otherwise. In the law and economics example, we encoded judgments using the frequencies of specified law and economics terms.
Text-as-treatment has other legal applications. For instance, we might be interested in the effect of legal briefs on judicial decisions. Long and Christensen study whether more readable briefs improve the chances of winning an appeal.69 As a codebook, they use the Flesch-Kincaid readability score as well as other indexes developed in linguistics. The effect of a statute or case law on some legal outcome can also be fit into a text-as-treatment approach, since all laws are written in text.70 Imagine an experiment where subjects are randomly assigned different expressions of the same proposed law, and asked to evaluate them for fairness, and/or certainty.
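A readability index is a compact example of a codebook function: it maps a brief to a single number. The sketch below implements the standard Flesch-Kincaid grade level formula with a crude vowel-run syllable counter. It is not Long and Christensen’s implementation. The syllable and sentence heuristics are my own simplifying assumptions and will miscount some words and abbreviations.

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
    Sentences are split on '.', '!' and '?', so abbreviations such as
    'U.S.' will be miscounted."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

plain = flesch_kincaid_grade("The court denied the motion.")
dense = flesch_kincaid_grade(
    "Notwithstanding the aforementioned jurisdictional considerations, "
    "certiorari was denied."
)
print(plain, dense)
```

Higher grade levels indicate text that is harder to read, so the second sentence scores far higher than the first.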
Manual codebooks constrain us, however, if we are interested in multiple aspects of the treatment text. Further, the text features we want to investigate might be latent or multi-dimensional, such that they cannot be easily reduced into a defined set of word binaries or linguistic scores. That is, we might be interested in whether “the text” affects the outcome, but are unwilling or unable to reduce it to a set of known, specifiable attributes like readability. Automated codebooks are useful here because they may allow us to pinpoint multiple, latent treatment features. Egami et al. demonstrate this on a dataset likewise of legal interest.71 They investigate whether the text of consumer complaints on financial products submitted to the Consumer Financial Protection Bureau affects how promptly the respondent businesses respond (Egami et al. do not distinguish between positive and negative responses). As above, the codebook outputs a set of topics that the authors then manually interpret as denoting complaints about “loans”, “debt collection”, “mortgage”, “detailed complaint”, and so on. They find that more detailed complaints, and complaints relating to loans in particular, receive prompter responses.72
An important premise with automatically discovering latent text treatment features is that, on top of SUTVA and unconfoundedness, we must further assume that the codebook is sufficient.73 Formally, sufficiency requires that the codebook capture enough outcome-relevant information in the text such that any information left out is orthogonal to that representation. Simply put, the codebook must not leave out any text information capable of confounding the causal effects of the text features that were captured. Thus, the ability of the chosen codebook to capture text information becomes pivotal. This explains my earlier exposition on the more sophisticated word vectors and language models.
Text de-confounding has arguably the most potential application in law because it let us ask questions of the form, “holding these texts constant, how does a given treatment T affect a given outcome Y?” A wide range of legal questions adhere to this template, including arguably the central question in case law analysis: holding the “facts” section of two judgments constant, how does a given legal factor, say, the accused pleading guilty, affect case dispositions?74 Questions on judicial behavior may also be asked: given two similar case briefs, would judge 1 have decided differently from judge 2? Likewise, given two similar case descriptions, would lawyer 1’s rather than lawyer 2’s involvement have led to a different case outcome? We might also be interested in legal-systemic questions, such as whether the originating lower court affects whether the Supreme Court grants certiorari, holding petition briefs constant.75
Given its wide applicability, the second half of this paper focuses primarily on text de-confounding. In this sub-Part, I delve into some detail on text de-confounding to establish the necessary background context. As text de-confounding implicates techniques I have yet to introduce, a brief detour into de-confounding itself is necessary.
As mentioned in Part II, the essence of de-confounding is ensuring comparability between two alternate pre-existing worlds. Technically speaking, what we are after is covariate balance: all “covariates” (i.e. non-treatment, non-outcome variables) should be similarly distributed in both the treatment and control groups.76 In an RCT, random treatment assignment lets us expect this. With observational data, however, covariate balance is rarely guaranteed.77 Covariate imbalance hinders causal inference because results from such data are sensitive to the specific statistical estimation techniques chosen, and often imprecise.78
There are two general strategies to de-confounding. First, we can limit the data we use to regions where the treatment and control groups are most comparable. To recall the running example, to know whether pleading guilty reduces sentence, we might want to study only precedents involving defendants of the same gender as the client. This, of course, requires that we know, or at least have reasons to suspect, that defendants of one gender might be sentenced differently from defendants of another. The logical conclusion of this is to limit the dataset to precedents that are identical on all non-treatment fronts. Specifically, for every precedent in the control group, we look for an identical match in the treatment group. Put differently, we would be trying to compare legal twins.
As with biological twins, however, such perfectly-matched precedents are probably rare in legal practice. As a second-best solution, we might settle for finding the closest available correspondent in the opposing group, perhaps provided it meets a minimum closeness threshold. This then requires a measure of how close an observation is to another. In law, we tend to measure precedent closeness by zooming in on key legal and doctrinal factors – we look for cases with similar questions presented, issues raised, and so on.
The intuition that some factors matter more than others carries into causal inference methods. When attempting to match observations, we are most concerned about covariate balance. Closeness is therefore logically centered on the factors which vary most across treatment and control groups. Notice that this is equivalent to asking which factors best predict whether a unit receives the treatment (itself a level one causality question). To illustrate, if all, and only, male defendants pleaded guilty, then knowing whether the defendant was male would let us perfectly predict treatment status.
Causal inference theory actualizes this intuition in a technique known as propensity score matching (“PSM”). A statistical model is used to predict the probability that a unit receives treatment given the covariates in the dataset, being the “propensity score”. By construction, the propensity score proxies for similarity across the treatment and control groups, weighting imbalanced covariates more highly.
It is also possible to simply take the distance between the covariates using one of many possible mathematical formulas for calculating the difference between two sets of numbers.79 Armed with a closeness measure, we can then match observations that are close enough, trim away observations that are too different from the rest, or both.80
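To make the matching step concrete, the following Python sketch estimates propensity scores with a hand-rolled logistic regression and then performs greedy one-to-one nearest-neighbor matching within a caliper. It is an illustration under stated assumptions, not the procedure used by any study cited here: the toy data (a single binary covariate such as defendant gender), the function names, the learning rate, and the caliper of 0.1 are all hypothetical.

```python
import math

def fit_propensity(X, t, lr=0.5, epochs=2000):
    """Fit a logistic regression for P(T=1 | X) by plain gradient descent."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def propensity(w, x):
    """Predicted probability of treatment for covariate vector x."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def match_with_caliper(scores, t, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching without replacement.
    Treated units with no control inside the caliper are left unmatched."""
    controls = [i for i, ti in enumerate(t) if ti == 0]
    pairs = []
    for i in [i for i, ti in enumerate(t) if ti == 1]:
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs

# Hypothetical toy data: one binary covariate; T = 1 means "pleaded guilty".
X = [[1], [1], [1], [0], [1], [0], [0], [0]]
t = [1, 1, 1, 1, 0, 0, 0, 0]

w = fit_propensity(X, t)
scores = [propensity(w, x) for x in X]
pairs = match_with_caliper(scores, t)
print(pairs)
```

In this toy sample only two treated units find a control within the caliper; the rest are effectively trimmed, which shows how matching and trimming can operate together.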
The second de-confounding strategy applies after the dataset has been fixed (and preferably balanced), and would be familiar to those trained in regression methods. Specifically, it is to include potentially-confounding variables into the estimation model as so-called “controls” so that the model “adjusts” for those variables.81 There are limitations to this approach, however.82 Foremost, which variables to include or exclude falls entirely within the analyst’s discretion. Assuming the analyst has specified the right causal model, the necessary controls to include are self-evident. But that provides little reassurance: if the analyst already knew the right causal model, there would be little point to studying the dataset; if the analyst did not know the right causal model, then she might simply have specified it correctly by luck. The more realistic expectation is for the analyst, who we assume conscientiously studied prior literature, to have specified a causal model which approximates, but does not perfectly reflect, the truth.
Including the wrong variables as controls may, in fact, worsen results.83 A classic example is controlling for mediators.84 Suppose T does causally change Y, but primarily by first increasing a third variable M (short for “mediator”). For instance, suppose pleading guilty signals remorse, and remorse in turn lowers sentence outcomes as much as, if not more than, the guilty plea itself. Empirically, we would notice that M varies with both T and Y: setting T=1 causes M to increase; increasing M decreases Y. If M were included as a control, so that the model tried to vary T while holding M constant, we would end up underestimating the causal effect that T has on Y through M.85
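The mediator problem can be seen in a small simulation. The Python sketch below is illustrative only: the data are synthetic, and the effect sizes (a direct effect of -1, plus an effect of -2 routed through the mediator) are assumptions chosen so that the true total effect of T on Y is -3 by construction. The `ols` helper is a hypothetical, minimal least-squares solver written for this example.

```python
import random

def ols(columns, y):
    """Return OLS coefficients (intercept first) by solving the normal equations."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[r][k] for r in range(k)]

random.seed(0)
n = 20000
# T: guilty plea (1) or not (0), randomly assigned in this synthetic world
T = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]
# M: perceived remorse, raised by pleading guilty
M = [t + random.gauss(0, 1) for t in T]
# Y: sentence, lowered directly by the plea (-1) and by remorse (-2 per unit)
Y = [-1.0 * t - 2.0 * m + random.gauss(0, 1) for t, m in zip(T, M)]

total_effect = ols([T], Y)[1]          # regress Y on T only
controlled_effect = ols([T, M], Y)[1]  # regress Y on T while holding M constant

print(round(total_effect, 2), round(controlled_effect, 2))
```

The regression without the mediator recovers an estimate near the total effect, whereas the regression that controls for M recovers only the direct path and so understates how much pleading guilty changes the sentence overall.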
Note that the two de-confounding techniques are complementary. Balancing the dataset reduces the need to include model controls; controls can address residual confounding that balancing might have missed.
Leading techniques for text de-confounding essentially revolve around matching encoded texts.86 An illustrative baseline procedure is what Roberts et al. call Topically Coarsened Exact Matching (“TCEM”), which in essence encodes text using a topic model before applying so-called coarsened exact matching on the topics. Coarsened exact matching is a matching technique where the analyst manually reduces the granularity of certain covariates by specifying cut points for variables so that observations can be matched within the newly-cut categories rather than on specific numerical values. For example, instead of finding two students with the same raw exam scores, we might cut them into three categories based on percentiles: top 25%, middle 50%, and bottom 25%. CEM then attempts to find at least one pair of treatment/control units per category. Units in unpaired categories (which implies they are themselves unpaired) are then trimmed.
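The coarsening and trimming logic can be sketched in a few lines of Python. This is a simplified illustration rather than Roberts et al.’s implementation: the function names, the cut points, and the toy exam scores are hypothetical, and in a TCEM-style application the covariates would be document-topic proportions rather than a single exam score.

```python
from collections import defaultdict

def coarsen(value, cut_points):
    """Return the index of the bin that `value` falls into."""
    return sum(value >= c for c in cut_points)

def coarsened_exact_match(units, cut_points):
    """units: list of (unit_id, treated, covariates).
    cut_points: one list of cut points per covariate.
    Returns only the strata that contain both treated and control units."""
    strata = defaultdict(list)
    for uid, treated, covs in units:
        key = tuple(coarsen(x, cuts) for x, cuts in zip(covs, cut_points))
        strata[key].append((uid, treated))
    return {key: members for key, members in strata.items()
            if {tr for _, tr in members} == {0, 1}}

# Hypothetical exam scores, cut into bottom / middle / top bands at 50 and 80.
units = [
    ("a", 1, [90]), ("b", 0, [85]),   # top band: one treated, one control
    ("c", 1, [60]), ("d", 0, [65]),   # middle band: one treated, one control
    ("e", 1, [30]), ("f", 1, [40]),   # bottom band: treated only, so trimmed
]
kept = coarsened_exact_match(units, cut_points=[[50, 80]])
matched_ids = sorted(uid for members in kept.values() for uid, _ in members)
print(matched_ids)
```

Coarser cut points retain more units at the cost of looser matches; finer cut points do the opposite, which is the trade-off the analyst controls when specifying them.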
Roberts et al. further develop another technique, Topical Inverse Regression Matching (“TIRM”), which encodes text with a specialized topic model capable of taking the treatment variable as input. This allows the resulting topic-term distributions to accord greater weight to terms that predict treatment assignment, and for the model to thus produce a measure analogous to propensity scores. Roberts et al. then suggest using coarsened exact matching on both the propensity score analog and the document topic scores such that the treatment/control groups are matched on both textual content and treatment propensity. The authors test both TIRM and TCEM on three partially simulated datasets and find that they varyingly perform better/worse at improving covariate balance, depending on the metric used to measure balance.
Mozer et al. build on Roberts et al. above to propose a “general framework for constructing and evaluating text-matching methods”. They first note that because text representations and closeness measures chosen might affect downstream inferences, these choices should be made in light of the specific confounders to be targeted in a given research question.87 In this light, they consider a wider range of representations and measures.88 They show that, depending on the dataset and causal question asked, simply matching on the document-term matrix itself might yield a better-matched sample.89
Mozer et al. then demonstrate how text matching can improve covariate balance in the context of an observational study using a medical setting. The causal question was whether a particular diagnostic heart scan improved adult patient survival rates for patients in critical care with sepsis. The texts were medical notes taken by hospital staff upon the patient’s admission to critical care. Noting that the ideal text de-confounding scheme would “match documents on key medical concepts and prognostic factors that could both impact the choice of using [the heart scan] and the outcome”,90 they preferred matching on the basic document term matrix because this representation would “retain as much information in the text as possible”.91 This strategy produced a better covariate balance than propensity score matching on non-text variables alone.92
Veitch et al. recently proposed an even more computationally-sophisticated method for text de-confounding.93 Of course, computational sophistication does not itself make a technique worth mentioning. I raise this primarily to highlight how text de-confounding is receiving attention across multiple disciplines and has much scope for development. As the technique is highly mathematically involved, I will briefly describe the intuition. In essence, the authors extend the BERT language model introduced above to simultaneously produce (1) document embeddings, (2) propensity scores, and (3) predicted outcomes based on the embeddings. They then use these outputs to compute the ATE.94 Unlike prior work, Veitch et al. encode the text with BERT instead of a topic model. They thus note that their approach “replace[s] the assumption that the topics capture confounding with the assumption that an embedding method [i.e. theirs] can effectively extract predictive information”.95
Next, using BERT embeddings as a starting point, they alter the embeddings further by fitting them to the particular dataset they want to study, a step known in NLP as ‘fine-tuning’.96 In this step, they task the embeddings with predicting both the treatment and outcome variables. To see why, recall that the original BERT is tasked with predicting missing cloze words, and thus develops an internal representation of general language properties. Since confounders are variables that vary across (i.e. predict) both treatment and outcome, when the model is given such an objective, it likewise begins internally representing confounding information in the dataset.
While these document embeddings could, like document-topic scores, then be piped through some matching algorithm, Veitch et al. simply use the same model to estimate causal effects. This is both logical and convenient. Notice that the model has already learnt to predict both treatment (i.e. what we do when estimating propensity scores) and outcome (i.e. what we do in the subsequent inference step).97
Given the expanse of algorithms and techniques covered in a relatively short exposition, it is useful to crystallize key themes in the text de-confounding literature. The central goal of text de-confounding is to keep the treatment and control groups comparable by exploiting information in relevant texts. This, to be sure, conceals much of the complexity behind defining what “comparable” is, especially when texts are involved. Because text is multi-faceted and multi-dimensional, applying traditional de-confounding strategies like PSM is difficult.98 Obstacles begin to arise even when choosing a text representation (i.e. the codebook function). Though numerous methods of varying statistical sophistication have been proposed, there are as yet no clearly superior or inferior methods, only methods more or less suited to the particular causal questions we want to study and their hypothesized confounders. In the next Part, I consider challenges that may arise when applying text de-confounding to the legal setting.
As alluded to in Part III.D, a wide inventory of legal questions might be answered with text de-confounding. This should not be over-stated. The legal setting raises unique challenges on top of those already identified above by general (text) de-confounding literature. This Part highlights this by considering how de-confounding might be used to extract what Ashley called “legal factors”, or “stereotypical patterns of fact that tend to strengthen or weaken a side’s argument in a legal claim”.99 This question is instructive because the corpus of case precedents is arguably the text corpus that Common lawyers work with most frequently.
Consider how we typically reason with precedent. First, we assemble a set of cases similar to the client’s, trimming away those that differ on legally-material grounds. We then read the judgments, paying attention to each case’s facts, procedure, dispositions, and ratio decidendi. If there is no express ratio covering the case (no doubt a common occurrence), we fall back to reasoning by analogy. The standard argument flows as follows:
In case A, legal factor T was present, and the defendant was held liable. This may be contrasted with case B, which is in all material respects similar to case A, except factor T was absent, and the defendant was not liable. Therefore, factor T is decisive for liability. Since T is absent in my client’s case, he/she is probably not liable.
Such reasoning adopts closely the language of causality.100 A control unit (case B) is compared against a treatment unit (case A), and the difference in outcome indicates the treatment variable’s outcome effect. But because precedents are observational inputs, where our intended causal question sits on the Ladder of Causation then determines the extent we can answer it.101 If we are merely interested in predicting likely outcomes for the client based on what we see in the cases, this would likewise be an observational question that precedents themselves are sufficient to answer.
But if we want to advise the client on what to do, we would be interested in interventional causality, a question that requires interventional data to answer. While RCTs are a first-best solution to generate such data, RCTs on legal factors are practically impossible.102 The challenge, therefore, is how we might approximate interventional data with observational precedents. This is a legal analog to Rubin’s Fundamental Problem of Causal Inference. Just as we cannot observe both treatment and control outcomes for the same unit of analysis, we cannot observe both treatment and control outcomes for the same precedent. Nor can we randomize treatment assignment to simulate having observed it.
To overcome this challenge, we must first specify a causal model to test our observational data against.103 This can be obtained from qualitative analysis of the cases. If we further want to derive statistical estimates of the outcome effects, however, then recourse to text de-confounding techniques is likely necessary. Put simply, we want the data to look as if we had randomly assigned legal factors independent of confounding case attributes.
With text de-confounding in mind, we might state the question as such: “holding judgment texts constant, how does varying some legal factor T affect some legal outcome Y?” This conveniently substitutes “potentially-confounding case characteristics” with the judgment texts we are used to working with. The next steps for causal inference seem clear: encode the judgments, match the cases, and calculate ATEs.
But this approach rests on the shaky assumption that judicial opinions are sufficient and accurate sources of information on potentially-confounding case characteristics.104 Judgments are, after all, written to justify the adjudicator’s preferred outcome, so there is the danger of motivated writing.105 Further, because opinions are inevitably written after the authoring judge knows the treatment status (e.g. whether the accused pleaded guilty), equating judgments would not, strictly speaking, equate pre-existing worlds.
In the language of causal inference, judgments are both post-treatment and post-outcome variables. Using such variables for causal inferences is problematic.106 Broadly speaking, if judgments depended on both treatment status and outcome, then trying to keep the former constant while varying the latter is internally contradictory. Thus, as with controlling for mediators,107 controlling for judgments might introduce bias into our results.108 While a legally-trained human reader might be able to ‘read between the lines’ to discount any post-treatment or post-outcome information, an automated codebook probably cannot.109
Alternative text sources of confounding information are therefore necessary. Fortunately, if there is one thing the legal industry has in abundance, it is text. Indeed, before any case outcome can be determined, volumes of legal texts are often generated, taking the form of affidavits, briefs, evidence, and related filings. To the extent that case characteristics are confounders, documents which detail them might be useful de-confounders. Two document types are promising. First are case briefs, which we might expect contain much case information. While parties might be expected to present only favorable factual or legal information, we could simply use the text of both sides’ briefs for de-confounding. An alternative is briefs from neutral parties such as amici (assuming they are actually neutral). In practice, briefs are generally accessible, though perhaps neither freely nor in bulk.110
Second are judgments from lower courts. To be sure, issues may change on appeal. Lower court judgments may also be outcome-motivated. Nonetheless, they are in practice slightly less costly to obtain,111 and would represent an improvement over using judgment texts from the same courts. For these reasons, I focus on using briefs for de-confounding in the remaining third of this paper.
To enliven the techniques discussed above, this Part demonstrates how text de-confounding might be used to explore a question that has received both practical and scholastic attention: what determines how the Supreme Court grants certiorari (abbreviated “cert.” for ease of reading)? I consider in particular whether cert. outcomes differ for cases from state supreme courts (the treatment group) vis-à-vis cases from circuit appeals courts (the control group). Effectively, text de-confounding is used to hypothetically conduct the following experiment: suppose we took two similar cases on petition, and randomly assigned whether they originated from circuit appeals or state supreme courts, would certiorari outcomes differ systematically? If they do, we would have causal evidence that case origin matters.
This question was chosen because it illustrates both the applications and limitations of legal text de-confounding. On applicability, cert. grant has received significant doctrinal and empirical study.112 Further, petition briefs (and not judgments) provide a promising source of de-confounding. Finally, the certiorari question adheres to the template legal analysis question specified in Part I above and thus sheds light on how similar questions could be addressed.
On limitations, notice that whether petition briefs can be considered pre-treatment variables is a non-trivial question. Lawyers (or self-represented litigants) writing them might tailor their briefs to the court of origin. Yet the case characteristics, particularly the factual background, described in those briefs are determined before court assignment. This tension forces us to be particularly careful when describing the relevant causal model.
Further, investigating this question illustrates what I argue are realistic legal data constraints. As will become clearer, although legal texts are generally accessible, dataset imbalance pervades law, necessitating specific countermeasures. Although I obtained data on more than 10,000 Supreme Court certiorari petitions, fewer than 100 of them successfully obtained the writ of certiorari. As one might expect, this imposed numerous constraints on the estimation process that will be made clearer below.113
For these reasons, it must be emphasized that the causal estimates presented below are not intended to provide empirical confirmation of whether case origin causally affects certiorari grant. My focus is on illustrating the procedure for deriving them rather than on their substance. Note, however, that the general statistics presented do permit associational interpretation, and might thus be of interest nonetheless. This illustration can be seen as a proof-of-concept for a more conclusive empirical investigation in future work that makes use of further and better data.
The rest of this Part follows a standard chronology for a causal text inference study.114 Part V.A analyses prior literature and identifies potential confounders. Part V.B describes the dataset and explores the initial covariate balance. Part V.C deals with dataset preparation which, for text methods, involves both text encoding and covariate balancing. Part V.D describes the estimation models and discusses results.
Scholarly interest on how the Supreme Court grants cert. can be traced back to Schubert’s seminal paper on “The Certiorari Game”.115 A political scientist, Schubert hypothesized that the Justices vote in “blocs” motivated by a law-development agenda. According to Schubert, a bloc of four Justices of the time were interested in ensuring the law favored railroad workers over employers. Using game theory, he demonstrated that they would have the following pure strategy:116
(1) Never vote in favor of petitions from railroads;
(2) Always vote in favor of petitions from workers where the lower appellate court reversed a trial judgment in their favor; and
(3) Always vote for the petitioner on the merits.
The qualification to point (2) arises because, under Schubert’s assumptions, non-bloc Justices were more likely to vote in the workers’ favor when the appellate court disagreed with the trial court than when both lower courts were in agreement. Thus, if the bloc voted to grant cert. in the latter case, cert. would be granted (since 4 is sufficient for cert.), but the bloc ran the risk of ‘losing’ the merits-stage vote 5-4 in the railroads’ favor. 12 of the 13 railroad cases which the USSC heard when the bloc voted on cert. following this strategy were decided pro-worker, relative to 8 of 11 cases heard when the bloc did not vote accordingly.
In the Certiorari Game, therefore, cert. votes are a backwards-reasoned function of projected merits stage outcomes. If a Justice prefers that the Supreme Court affirms, she/he would never vote for cert. If a Justice wants the Supreme Court to reverse, she/he would vote for cert. provided the Supreme Court was indeed likely to reverse.
Schubert’s model continues to influence empirical analyses of the USSC’s certiorari behavior. Brenner tests an extension of the model on certiorari votes cast between 1945 and 1957, obtained from Justice Harold Burton’s private docket books, and finds evidence suggesting that some Justices were indeed “skillful players in the new certiorari game”.117 More recent work crystallizes this into an “outcome prediction” theory of cert. grants: Justices vote for cert. when they expect to win at the merits stage and against cert. when they expect to lose (‘winning’ or ‘losing’ defined in terms of a Justice’s desired outcome).118 Sommer takes this a step further, arguing that Justices further consider if they are likely to author the majority opinion if the case proceeds to oral argument.119 The primary explanatory variables used in these studies were therefore Justice ideologies.
However, the outcome prediction theory provides an incomplete account of certiorari behavior. Brenner et al. point out that prior studies suffer from a “missing data problem” because cert. votes for denied petitions are not included in any of them.120 They further criticize Caldeira et al.’s attempt to impute values to the missing variables for reasons beyond the scope of this article.121 In their view, the outcome prediction theory primarily explains certiorari behavior for cases actually granted cert., and is less suited for the vast majority of cases denied cert. The latter, they argue, might be better explained by the “error correction” theory: that justices only vote to grant cert. when they believe the lower court decision is legally (rather than ideologically) erroneous.122
The literature therefore distinguishes between “strategic” determinants of certiorari grants, being those motivated by the Certiorari Game, and “nonstrategic” determinants which revolve around the procedural and substantive characteristics of a case.123 These include case salience, whether there was disagreement in the lower courts, whether the Solicitor General was involved, and the number of amici curiae involved at the petition stage.124 Also relevant is the strength of the petition, in turn measured by whether it was filed in forma pauperis, whether it raises frivolous issues, whether the lower court handed down a written opinion, and whether the petitioner is self-represented.125
The upshot of this brief literature review is that cert. grants are driven by numerous legal and political factors. Against this backdrop, we need to specify the causal model, which most crucially involves identifying potential confounders.
To fix ideas, let us express this in the language of potential outcomes. The outcome Y is 1 if the petition was granted and 0 if it was denied. The treatment (T=1) group comprises petitions on state supreme court judgments; the control group (T=0) comprises petitions on circuit appeals court judgments.126 For ease of reference, I refer to the former as state petitions and the latter as circuit petitions. The ATE we are after is the difference in cert. outcomes for state petitions had they originated from circuit appeals courts, as well as for circuit petitions had they originated from state supreme courts. The potential counterfactual outcomes in both cases are, of course, unobserved.
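In symbols, the estimand just described can be written as follows. The notation is an assumption made for illustration (the subscript i indexes petitions, and Y_i(1) and Y_i(0) denote petition i’s cert. outcome had it originated from a state supreme court or a circuit appeals court respectively); it is meant only as a compact restatement of the paragraph above.

```latex
% Average treatment effect of state (T=1) versus circuit (T=0) origin
\tau_{\mathrm{ATE}} = \mathbb{E}\left[\, Y_i(1) - Y_i(0) \,\right]

% Only one potential outcome is ever observed for each petition:
Y_i = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)
```

For a state petition, Y_i(1) is observed and Y_i(0) is the unobserved counterfactual; for a circuit petition the roles are reversed.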
Recall that confounders are variables which both influence potential outcomes and correlate with treatment assignment.127 Applied here, this means whether cases are assigned to state or circuit courts should be independent of the other cert. relevant factors above. Given that the same Justices vote on petitions regardless of origin, judicial ideology and error correction factors are conceivably treatment independent.128 However, simply as a matter of civil procedure, federal cases are more likely to involve interstate matters, matters involving the federal Constitution, and matters requiring the involvement of the Department of Justice (“DOJ”). Circuit petitions might involve more salient public interest concerns. Apart from implicating more ‘convincing’ issues, circuit petitions might also be better drafted – either because parties are more likely to enlist private attorney help, or because of the DOJ’s involvement (the DOJ, as the statistics below suggest, has extensive experience with petition briefs).
While variables capturing whether attorneys/the DOJ were involved may alleviate confounding, they would be blunt instruments which implicitly assume that all attorney/DOJ-drafted briefs are of the same quality. Moreover, we would still fail to account for differences in case characteristics. Petition briefs are thus useful for de-confounding in two regards: their form de-confounds brief ‘quality’; their substance de-confounds case characteristics including, but not limited to, salience.
To substantiate the intuitions above, I created a dataset from information retrieved from the Supreme Court website’s docket search module.129 The site provides information on every Supreme Court docket number from 2003 onwards. For cases filed from October 2017 onwards, petition briefs are also available in PDFs. I downloaded them all and used PDF extraction software to (1) merge briefs that had to be downloaded in parts,130 and (2) isolate the relevant brief text.131 I excluded about 1500 of the 10,281 downloaded briefs that could not be processed this way as they had not been subject to optical character recognition.132
Using docket numbers as an index, I then linked the brief texts to metadata specified on the Supreme Court website. Variables extracted include case date, the name of the lower court implicated, and the list of litigants, attorneys, attorney offices, and amici curiae involved. I inspected the distribution of unique lower court names raised in the cases and manually mapped them to five main categories.133 Using the list of attorney offices, I coded whether the Solicitor General was involved as petitioner/respondent.134
The final task was to determine petition outcomes. These could not be reliably automatically inferred from the Supreme Court website because where petition outcomes are specified in the docket varies with a case’s procedural history. Instead, I extracted them from Supreme Court journals for the October 2017, 2018, and 2019 Terms135 using a Python script created for this purpose.136 The extraction algorithm reads the journals line by line while keeping a record of the last section header it came across (e.g. “Certiorari Denied”). If a docket number is found, it assigns an outcome to that number based on the section header it appears under.137 While the algorithm is not perfect, the numbers of cert. grants and denials it detects are close to the actual numbers specified in the Journals.138 Note that I do not include summary certiorari dispositions, also known as grant, vacate, and remand orders, as grants.139
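The header-tracking logic described above can be sketched as follows. This is a hypothetical reconstruction for illustration, not the script actually used: the header strings, the docket-number pattern, and the sample journal lines are all assumptions, and real journals would need more headers and more docket formats than are handled here.

```python
import re

# Assumed section headers; None marks sections whose dockets are ignored.
SECTION_OUTCOMES = {
    "certiorari granted": "granted",
    "certiorari denied": "denied",
    "rehearing denied": None,
}
# Assumed docket format such as "17-1234"; other formats are not handled.
DOCKET_RE = re.compile(r"\b\d{2}-\d{1,5}\b")

def extract_outcomes(lines):
    """Map each docket number to the outcome of the section it appears under."""
    outcomes = {}
    current = None  # outcome implied by the last section header seen
    for line in lines:
        header = line.strip().lower()
        if header in SECTION_OUTCOMES:
            current = SECTION_OUTCOMES[header]
            continue
        if current is None:
            continue  # docket numbers outside a cert. section are skipped
        for docket in DOCKET_RE.findall(line):
            outcomes[docket] = current
    return outcomes

# Invented journal excerpt for demonstration only.
sample = [
    "Certiorari Granted",
    "No. 17-1234. Smith v. Jones. C.A. 9th Cir.",
    "Certiorari Denied",
    "No. 17-5678. Doe v. Roe. Sup. Ct. Ohio.",
    "No. 17-9012. Poe v. Moe. C.A. 2d Cir.",
    "Rehearing Denied",
    "No. 16-4321. Old v. Case.",
]
outcomes = extract_outcomes(sample)
print(outcomes)
```

A parser of this kind is only as reliable as its list of headers, which is one reason to compare its counts against the totals reported in the journals themselves.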
To summarize, the above yields a dataset that captures, for each Supreme Court docket number from Oct 2017 onwards:
Whether cert was granted
The date the case was docketed
The number of petitioners/respondents/amici involved through the entire case (this is not limited to the petition stage for cases that proceed beyond it)
The number of attorneys representing each side, again through the entire case
Whether the Solicitor General’s office was one of those attorneys
The lower court it originates from, as well as the court’s broad category
The text of the petition brief submitted.
For subsequent analysis, I use only data on cases from either state supreme or circuit appeals courts. These respectively comprised 1,058 (10.29%) and 7,618 (74.10%) of the 10,281 cases I obtained metadata on. Of these 8,676 total cases, 7,046 had useable brief texts.140
Table 4 summarizes the dataset. Note that the bottom five variables are binary, so their averages may be interpreted as percentages.
Table 4. Covariate Balance in the Unmatched Sample
Notes: Statistics are based on all 8,676 state/circuit petitions. “Control” refers to circuit petitions while “Treat” refers to state petitions. “SD” stands for standard deviation. “SMD” refers to the standardized mean difference across treatment and control groups, also known as Cohen’s d, which provides a natural, scale-free measure to assess the difference between treatment and control covariate distributions.141 I use a pooled-variance denominator. As a general rule of thumb, SMDs 0.2 and below are ‘small’, around 0.5 are ‘medium’, and above 0.8 are ‘large’.142
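For readers unfamiliar with the SMD, a minimal Python sketch of the calculation is given below. It assumes the degrees-of-freedom-weighted pooled variance conventionally used for Cohen’s d, which may differ from the exact pooling used for Table 4; the input values are invented and are not drawn from the dataset.

```python
import math
import statistics

def smd(treated, control):
    """Standardized mean difference with a pooled-variance denominator."""
    n1, n0 = len(treated), len(control)
    v1 = statistics.variance(treated)   # sample variance (n - 1 denominator)
    v0 = statistics.variance(control)
    pooled_var = ((n1 - 1) * v1 + (n0 - 1) * v0) / (n1 + n0 - 2)
    return (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(pooled_var)

# Invented covariate values for a treated and a control group.
d = smd([2, 4, 6], [1, 2, 3, 4, 5])
print(round(d, 3))
```

Because the difference in means is divided by a standard deviation, the result is unit-free and can be compared across covariates measured on different scales.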
The average (state or circuit) petition involves about 1 petitioner, 1 respondent, and 0.22 amici. 44.8% of petitioners are represented by at least one attorney, with state petitioners being slightly more likely to be represented (47.1% versus 44.5%). A larger proportion (78.8%) of respondents are represented, with circuit respondents being noticeably more likely than state respondents to be represented (80.2% versus 69.4%).
The DOJ is almost never involved as a petitioner, having only appeared as such in 0.3% of circuit petitions and in exactly none of the state petitions. It also rarely appears as an amicus (about 0.5% of both circuit and state petitions). However, the DOJ is a very frequent respondent, appearing as such in 44.8% of all petitions from Oct 2017 onwards. This covers 51.2% of circuit petitions, but only 0.2% of state petitions. Taken together, these statistics further suggest the DOJ is seldom involved in state petitions in any capacity.
The preceding is consistent with our legal intuition above and reinforces concerns on covariate imbalance. While there do not appear to be wide disparities in the number of parties involved, the gaps in respondent representation and DOJ involvement correspond clearly to potential confounders.
I demonstrate a simple document term matrix representation based on TFIDF as well as a topic model representation based on Latent Semantic Analysis.143 Both are broadly based on the Bag-of-Words approach, explained in Part III.1.
These methods were chosen primarily for their simplicity and transparency, particularly to non-experts. Both have been used to analyze legal corpora, albeit from a predictive perspective.144 Note that whether they are conceptually the best text representations for this question is equivocal. As Part III.D.2 explained, the ‘right’ text encoding method depends on the dataset, particularly the sources of confounding we mean to address. Here, textual form and substance both matter. While BOW representations are suitable for matching on substance,145 because BOW discards syntax and word order, it seems inappropriate for matching on form. Language model representations, which, as mentioned above, can account for syntax, may be more suitable for this context.146 However, language models involve neural network methods which require special expertise even to interpret; practical application entails further technicalities.147
I thus chose to proceed with BOW models for now. For similar reasons, I follow standard NLP practice (outlined in Part III.A) in preparing the text representations, even though a more tailored approach might be better suited to this setting.148 Implementation details on both representations created may be found in Appendix VII.B.
For the TFIDF representation, I used Mozer et al.’s best performing model for the medical corpus, namely cosine similarity matching.149 For the LSA representation, I used propensity score matching instead, as that allows me to demonstrate how text and non-text methods compare on the same closeness metric.150 As above, I followed standard PSM practices, save one deviation given that I had more control than treatment units: to allow more control observations to be used, I matched, with replacement, each treatment unit to three control units. Implementation details on matching can be found in Appendix VII.B.
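The one-to-three matching with replacement described above can be illustrated with a small nearest-neighbor sketch. This is not the matching library’s implementation: the function name, the optional `caliper` argument, and the toy propensity scores are assumptions made for illustration.

```python
def match_with_replacement(treated_scores, control_scores, k=3, caliper=None):
    """For each treated unit, pick the k controls with the closest scores.

    Controls may be reused across treated units (matching with replacement).
    """
    matches = {}
    for t_id, t_score in treated_scores.items():
        ranked = sorted(
            control_scores,
            key=lambda c_id: (abs(control_scores[c_id] - t_score), c_id),
        )
        matches[t_id] = [
            c_id
            for c_id in ranked[:k]
            if caliper is None or abs(control_scores[c_id] - t_score) <= caliper
        ]
    return matches


# Toy propensity scores: two state (treated) and five circuit (control) petitions.
treated = {"s1": 0.80, "s2": 0.75}
controls = {"c1": 0.78, "c2": 0.70, "c3": 0.82, "c4": 0.10, "c5": 0.20}
matches = match_with_replacement(treated, controls, k=3)
unique_controls = {c for chosen in matches.values() for c in chosen}
```

Here both treated units draw on the same three ‘state-like’ controls, so six matches are formed from only three unique control units. The same mechanism explains why a matched sample can rest on far fewer unique control observations than its nominal size suggests.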
Four matching schemes were thus tested in total: TFIDF with cosine similarity (for ease of reference, “TFIDF+Cos”), propensity score matching with only non-text covariates (“non-text PSM”),151 propensity score matching with only LSA-encoded texts (“LSA PSM”), and propensity score matching with both LSA-encoded texts and non-text variables (“LSA+non-text” or “Full” PSM).152 For each scheme, Table 5 summarizes the number of petitions successfully matched as well as the number of unique petitions from each group used in the matched sample relative to the unmatched dataset.
Table 5. Number of (Unique) Sample Observations Before and After Matching
All matching schemes successfully identify sufficiently close (within the matching parameters I set) circuit petitions for all state petitions. However, the non-text PSM achieves this essentially by re-using only 84 unique circuit petitions to match the 885 state petitions. These 84 circuit petitions are, in other words, extremely ‘state-like’ in terms of the non-text variables. However, once we consider brief texts, a range of other circuit petitions are picked up.
Figure 1 illustrates the covariate balance achieved by each method on the non-text variables only. All matching schemes alleviate the potentially confounding imbalance in how often the solicitor general appears as respondent in state versus circuit petitions. Most also addressed imbalance in respondent representation, though the full PSM model seemingly swings it to the other side. This is noteworthy because the TFIDF and LSA-only models were not directly supplied with such information, suggesting that information in the brief texts proxies for this. While the non-text PSM seemingly achieves a good balance, this is not surprising given that it attempted to match these same covariates only. Since only 84 unique circuit petitions were used, little weight should be placed on this.
Figure 1: Covariate Balance Under Various Text and Non-text Matching Schemes
Notes: Dots are point estimates for the standardized mean difference between treatment and control means. Error bars represent 95% confidence intervals. If the intervals touch the dotted line at 0, the difference is not statistically significant. Notice that within the initial, unmatched sample (represented by gray intervals) there are significant differences for variables denoting respondent representation and Solicitor General involvement as respondent. For most variables, the colored intervals, each representing one matching scheme tested, fall closer to zero than the gray intervals, indicating a better covariate balance.
Moreover, recall that the text itself should also be balanced. To illustrate how different the starting texts were, Figures 2 and 3 below present the difference in the distribution of linearized propensity scores153 for the LSA and full PSM models respectively.
Figure 2: Difference in Propensity Score Distributions Estimated by the LSA PSM, Before and After Matching
Figure 3. Difference in Propensity Score Distributions Estimated by the Full PSM, Before and After Matching
Recall that propensity scores are the probability that a given unit receives treatment given the data. The difference in propensity scores suggests that state and circuit petitions are written differently, though I cannot disentangle whether this is due to writing style or case substance (or both). Notably, LSA PSM correctly classifies 88.64% of petition origins. Put differently, an algorithm given only the LSA-encoded texts could reliably separate them into state versus circuit petitions.154 Accuracy on the non-text PSM and the Full PSM was 75.59% and 92.37% respectively. As explained above, a difference in brief texts is a possible confounder because it relates to factors such as case salience and petition strength. In this light, the next sub-Part demonstrates the difference that text matching makes on the causal estimates we derive.
Finally, I estimated causal effects from the matched datasets. For illustration, I present results from two methods for calculating ATEs. The first simply takes the difference in treatment and control group means.155 Conceptually, this is equivalent to taking the difference within each matched pair of treatment and control observations.156 This method critically assumes that matching has accounted for all confounding, such that each matched pair effectively represents one set of potential outcomes (that is, both Y(T=1|Z=z) and Y(T=0|Z=z)). However, text matching probably has not removed all confounding here. Some important non-text covariates remain unbalanced (see Figure 1 above). This is unsurprising since, as explained above, I chose possibly less appropriate text representations for the sake of illustration.
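As a point of reference, the first estimator can be sketched as follows. This is a minimal, unweighted illustration: it assumes a normal-approximation interval with unpooled variances, which may not be the exact ‘conventional’ interval used for the figures, and it ignores the weights that matching with replacement would ordinarily require. The function name and toy outcomes are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def diff_in_means(treated_outcomes, control_outcomes, z=1.96):
    """Difference in group means with a normal-approximation 95% interval."""
    estimate = mean(treated_outcomes) - mean(control_outcomes)
    std_err = sqrt(
        variance(treated_outcomes) / len(treated_outcomes)
        + variance(control_outcomes) / len(control_outcomes)
    )
    return estimate, (estimate - z * std_err, estimate + z * std_err)


# Toy grant indicators: 2 of 100 treated and 5 of 100 control petitions granted.
treated_y = [1] * 2 + [0] * 98
control_y = [1] * 5 + [0] * 95
ate_hat, (ci_low, ci_high) = diff_in_means(treated_y, control_y)
```

With so few granted petitions in either group, the interval in this toy example spans zero even though the point estimate is a three-percentage-point reduction.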
In this light, I use a second estimation method known as Peters-Belson regression.157 Peters-Belson illustrates the potential outcomes framework well as it takes the Fundamental Problem of Causal Inference rather literally. The technique broadly involves three steps. First, a model is fit only on control group data, and used to impute missing potential outcomes for the treatment group. Second, a model is fit on treatment group data and used to impute missing potential outcomes for the control group. Finally, with all missing potential outcomes ‘filled in’, the ATE can be calculated as the average difference between all treatment group outcomes and all control group outcomes.
The model used to impute treatment/control group outcomes can be varied, as any algorithm capable of classifying outcomes may be used. Inputs to the model can be tailored to account for residual confounders. Here, I use a Bayesian logistic regression to impute outcomes.158 Across all matching schemes, I provide only three variables to the model: petitioner representation, respondent representation, and Solicitor-General involvement as respondent. These were the variables that generally remained imbalanced across all matching schemes. Keeping variable inputs constant also better isolates the effect of each scheme.
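The three steps can be made concrete with a short sketch. To keep it self-contained, the imputation model below is a simple stratum mean (the grant rate among units sharing the same covariate profile), swapped in for the Bayesian logistic regression used in this paper. The function names and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_stratum_model(rows):
    """Return a predictor giving the mean outcome within each covariate profile.

    A stand-in imputation model; unseen profiles fall back to the group mean.
    """
    by_profile = defaultdict(list)
    for covariates, outcome in rows:
        by_profile[covariates].append(outcome)
    overall = mean(outcome for _, outcome in rows)
    table = {profile: mean(values) for profile, values in by_profile.items()}
    return lambda covariates: table.get(covariates, overall)


def peters_belson_ate(treated_rows, control_rows):
    """Impute each unit's missing potential outcome, then average Y(1) - Y(0)."""
    predict_control = fit_stratum_model(control_rows)  # step 1: fit on controls
    predict_treated = fit_stratum_model(treated_rows)  # step 2: fit on treated
    effects = []
    for covariates, outcome in treated_rows:
        effects.append(outcome - predict_control(covariates))  # observed Y(1)
    for covariates, outcome in control_rows:
        effects.append(predict_treated(covariates) - outcome)  # observed Y(0)
    return mean(effects)  # step 3: average over all units


# Covariates: (petitioner represented, respondent represented, SG as respondent).
treated_rows = [((1, 1, 0), 1), ((1, 1, 0), 0), ((0, 1, 0), 0), ((0, 1, 0), 0)]
control_rows = [((1, 1, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((0, 1, 0), 1)]
ate = peters_belson_ate(treated_rows, control_rows)
```

Any classifier that outputs outcome probabilities could replace `fit_stratum_model` without changing the rest of the procedure, which is what makes the technique easy to adapt to residual confounders.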
One caveat with the causal estimates that follow should be noted. Because the Supreme Court rarely grants cert., the outcome variable is highly imbalanced. Across the two-and-a-half Terms sampled, only 68 (0.89%) circuit petitions and 8 (0.76%) state petitions were granted.159 Like covariate imbalance, outcome imbalance is problematic for statistical analysis, though in different ways. Just as statistics generally relies on having a sufficient number of observations, many statistical techniques, particularly more sophisticated analyses, rely on having sufficient minority outcome group observations as well.160
Note however that this does not prejudice the previous illustration of the balancing process because the outcome variable is not used therein. In fact, it is important that outcome variables remain hidden to both algorithm and analyst, until after covariate balancing is complete, to prevent us from selecting observations in a manner that favors particular outcomes (thereby influencing causal estimates).161 I raise outcome imbalance at this point to emphasize this. The results presented in Figure 4 below should be interpreted in this light.
Figure 4. Average Treatment Effect Estimates for Matched and Unmatched Samples
Notes: Dots are point estimates for the ATE while error bars represent 95% confidence intervals. Where the intervals cross the zero mark, the ATE is not statistically significant at the 5% level. Intervals for the difference in means are calculated in the conventional manner,162 while intervals for the Bayesian regression are based on the 0.025 and 0.975 percentiles of ATE estimates.
These results yield conflicting signals on whether petition origin influences cert. grants. At face value, the naïve differences in means generally suggest there is no effect. But the Full PSM instead suggests a statistically significant 2% reduction in grant rates for state petitions. The theoretically superior Peters-Belson estimates are even less aligned. There, the unmatched and TFIDF+Cos estimates now suggest a significant reduction in grant rates of slightly less than 1%, while the Full PSM suggests an even greater 3+% reduction. The results, therefore, do not corroborate. Instead, ATE estimates differ significantly with and without matching, given different matching schemes, and given different estimation models.
Given the data and methodological constraints explained above, such volatility is unsurprising. It is important not to attach causal significance to any of these estimates. Instead, the primary takeaway of this illustration lies precisely in the volatility. Specifically, the fact that text matching makes a significant difference suggests that failing to account for confounding texts might lead us to spurious estimates in this particular context. Thus, although insufficient to identify causality, these preliminary estimates provide an indication, which future work can build on, of the statistical countermeasures we likely need should a better dataset be available. My results further reinforce two central themes in this paper: (1) the importance of a causal framework, particularly a carefully-specified causal model, when working with observational legal data;163 and (2) the importance of accounting for text confounding in causal legal analysis.164
To be sure, the observed estimate volatility is likely also driven by dataset, rather than methodological, factors, particularly the small number of granted petitions. To the extent that this sample illustrates a realistic legal dataset, however, my results highlight the challenges with statistically deriving causal estimates from observational legal data. Text matching (when done optimally, not illustratively as I have done here) arguably represents a best-efforts attempt at doing so. Yet even such an attempt might not suffice given the additional challenges and tradeoffs required. For instance, because covariate balancing generally requires discarding observations, it reduces the number of treatment/control units, possibly worsening the lack of (minority outcome) data that we can expect in law.165
Causal text inference methods hold great promise for law, a field which, possibly more than any other, relies on (manually) extracting predictive and causal information from specialized texts. Common legal questions might be re-cast as questions of text causality, including the fundamental legal question of whether a given factor affects case outcomes. As causal text methods in the computational (social) sciences become more established and sophisticated, their potential for legal applicability should grow. The case for causal legal text inference, which this paper aimed to make, is evident. Its adoption would not only allow us to more rigorously test old answers to old questions. We may find ourselves now able to answer new questions entirely.
There is some way to go, however. Text methods are relatively new, and far from canon even outside law. Adapting them to law might raise new challenges, and legal datasets may prove unwieldy for their purposes. Each step – choosing a text representation, computing a closeness measure, determining a balancing scheme – involves its own variables and parameters. We do not (yet) have clear guidance on what works for law. I have but sketched a framework to begin studying this, and illustrated a fraction of the ways in which legal text matching may be done.
Yet if we are truly interested in capturing “a glimpse of [the law’s] unfathomable process”,166 to understand what truly drives judicial decisions, and to accord in empirical legal studies the same primacy that text holds in doctrinal analysis, the thought and effort necessary to operationalize legal text inference should not deter us. We should, on the contrary, wonder if de-confounding on text may yield results that challenge received wisdom, and bring us that much closer to the true, causal mechanism of law.
Table 6 below presents the number of petitions received by the Supreme Court from each originating court type. Data for this was retrieved from the Supreme Court’s docket search feature and manually standardized into categories. It is subject to all caveats explained in Part V.B.167 Circuit appeals courts include all eleven numbered circuits and the D.C. Circuit but exclude the Federal Circuit. The Federal Circuit category contains only the United States Court of Appeals for the Federal Circuit and was intentionally kept separate from other circuit appeals courts due to its specialized docket. State supreme courts include apex courts for all fifty states. Territorial Supreme Court includes the apex courts of American Samoa, Northern Mariana Islands, Guam, Puerto Rico, Virgin Islands, and the District of Columbia.
Table 6. Originating Court Frequencies and Grant Rates, Oct 2017 – Dec 2019
All code used for preparing the dataset was written in Python 3 and is on file with the author. This Appendix outlines key implementation details for text matching. Note that all library code referenced in the following sub-sections was used with all default settings except those otherwise specified.
I first used spaCy168 to tokenize the texts and remove tokens that were entirely stop words, punctuation, whitespace, or digits. I further removed all tokens shorter than 3 characters. Remaining tokens were lemmatized and lowercased.
A second round of pre-processing was necessary because the above (standard) approach had trouble dealing with punctuation within words (as they would not be matched to stop words or lemmatized). Using basic string methods, I removed all hyphens, periods, and apostrophes occurring within tokens before again lemmatizing and lowercasing them. I used the NLTK169 WordNet lemmatizer for this step.
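A minimal sketch of this second cleaning pass follows. It is illustrative only: the `lemmatize` argument is a placeholder for whichever lemmatizer is used (it defaults to leaving tokens unchanged), and re-applying the length and digit filters after stripping punctuation is an assumption rather than a documented step.

```python
import re

# Hyphens, periods, and straight or curly apostrophes inside a token.
INTERNAL_PUNCT = re.compile(r"[-.'’]")


def second_pass(tokens, lemmatize=lambda tok: tok, min_len=3):
    """Strip punctuation inside tokens, then lemmatize and lowercase them."""
    cleaned = []
    for tok in tokens:
        tok = INTERNAL_PUNCT.sub("", tok)
        tok = lemmatize(tok).lower()
        # Assumed re-filter: drop tokens left too short or purely numeric.
        if len(tok) >= min_len and not tok.isdigit():
            cleaned.append(tok)
    return cleaned


cleaned_tokens = second_pass(["Non-frivolous", "U.S.C.", "court's", "a.b", "19-22"])
```

A real lemmatizer can be supplied by passing any callable that maps a token string to its lemma.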
I used scikit-learn’s170 TfidfVectorizer class with all default settings but two. First, I set min_df=3, thus removing words occurring in fewer than 3 documents. This was to remove lint tokens that arose from inevitable imperfections in the PDF extraction step.171 Second, I set max_df=0.95, thus removing words that appear in more than 95% of all documents. This was primarily to reduce the dimensionality of the resulting document-term matrix and keep the computational load manageable. As explained in Part III.A, such ubiquitous words should not in any event be highly informative of textual context.
This yielded a matrix of 73,086 token TFIDF scores across all 7,046 petition briefs. I fed this into the sparse_dot_topn172 library which allows for efficient cosine similarity matching and wrote customized code to extract the matched docket numbers. For each state petition, I extracted the top 3 closest circuit petitions by cosine similarity, provided that the similarity scores differed by less than 0.0546, being the 0.1th quantile of the distribution of all cosine distances.173
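The overall TFIDF+Cos pipeline can be illustrated on a toy corpus using only the standard library. This sketch is not the code used for the paper: it re-implements a smoothed, L2-normalized TFIDF weighting by hand rather than calling the libraries named above, and the `min_similarity` cut-off is a simplified, hypothetical stand-in for the quantile-based threshold described in the text.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs, min_df=1, max_df=1.0):
    """Build L2-normalised TFIDF vectors (as dicts) from tokenised documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Keep terms that are neither too rare (min_df) nor too common (max_df).
    vocab = {t for t, d in df.items() if d >= min_df and d / n <= max_df}
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in vocab)
        weights = {t: c * (log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
        norm = sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def top_matches(state_vecs, circuit_vecs, k=3, min_similarity=0.0):
    """For each state petition, keep up to k sufficiently similar circuit petitions."""
    matches = {}
    for s_id, s_vec in state_vecs.items():
        scored = sorted(
            ((cosine(s_vec, c_vec), c_id) for c_id, c_vec in circuit_vecs.items()),
            reverse=True,
        )
        matches[s_id] = [c_id for sim, c_id in scored[:k] if sim >= min_similarity]
    return matches


docs = {
    "state-1": ["habeas", "counsel", "ineffective", "trial"],
    "circ-1": ["habeas", "counsel", "ineffective", "appeal"],
    "circ-2": ["patent", "claim", "infringement", "appeal"],
    "circ-3": ["habeas", "sentence", "trial", "appeal"],
}
ids = list(docs)
vecs = dict(zip(ids, tfidf_vectors([docs[i] for i in ids])))
state = {i: v for i, v in vecs.items() if i.startswith("state")}
circuit = {i: v for i, v in vecs.items() if i.startswith("circ")}
text_matches = top_matches(state, circuit, k=3, min_similarity=0.05)
```

In the toy corpus the patent petition shares no vocabulary with the state habeas petition, so it falls below the cut-off and only two of the three candidate circuit petitions are retained.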
For LSA, I fed the TFIDF matrix above into scikit-learn’s TruncatedSVD (i.e. truncated singular value decomposition) class and extracted the top 100 principal components (i.e. topics). The topic scores were then normalized using the Normalizer class. Figure 5 visualizes the top 9 topics derived.
Figure 5. Term Distributions for Top 9 Topics from LSA Representation of Petition Briefs
The LSA document-topic matrix, along with base covariates such as the number of petitioners and respondents involved, was then fed varyingly into the pymatch174 Matcher class to achieve the non-text, text-only, and LSA + non-text PSM schemes explained above.175 Propensity scores were estimated through the library’s fit_scores method with balanced samples set to true and using only one model. As with TFIDF+Cos, I matched each state petition with replacement to 3 circuit petitions.