AQR: Design of Social Research

SOC-GA 1301 – Fall 2017

Robert Max Jackson

Assisting: Christina Nelson

Christina's office hours: Wed 1-3pm,
Office 4177 (Puck Building)



Notes: Control Variables

“Control” variables have come up repeatedly in our first several weeks as a strategy, an issue, and a question. While the meaning and use of controls will presumably receive considerable attention in your statistics classes, a basic understanding is important for research design. Candidates for “control” variables include any measured condition or process that is not part of the causal relationships we seek to measure and test, but that we suspect may influence or condition those relationships.

In a standard experimental design, the research has a treatment group and a control group. The treatment group (which may comprise patients receiving a drug, plots of land receiving fertilizer, or whatever other unit of analysis is relevant) is exposed to whatever we are testing, while the control group is not. In a drug trial, for example, the control group may receive a placebo while the treatment group receives the drug being tested. If the experiment is double-blind, neither the subjects nor the researchers know who received the drug and who received the placebo. If we are testing a fertilizer, we might divide a large plot of land into small squares, then alternate between our test fertilizer and a comparison fertilizer on the squares to create a checkerboard pattern (you might want to consider why this could be preferable to random assignment of the squares). The control group therefore consists of research subjects, whatever they may be, that are randomly or systematically selected from the same pool that provides the treatment subjects. The control group undergoes exactly the same sequence of experiences as the treatment subjects, except that it does not receive the treatment. The goal is to give the researcher confidence that any difference between the measured outcomes for the two groups can only be due to the treatment, because the two groups are otherwise the same: they have similar composition (as a consequence of random or systematic selection) and similar experiences.
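To see why random selection tends to give the two groups similar composition, a short simulation may help. This is only an illustrative sketch: the background trait (an age between 20 and 70), the sample size, and the function name randomize_groups are assumptions made up for the example, not anything drawn from a real study.

```python
import random

def mean(values):
    return sum(values) / len(values)

def randomize_groups(n=10000, seed=7):
    """Randomly split n hypothetical subjects into treatment and control groups."""
    rng = random.Random(seed)
    # Assumed background trait that we do not manipulate (age in years).
    ages = [rng.uniform(20, 70) for _ in range(n)]
    # Shuffle the subject indices, then cut the shuffled list in half.
    subjects = list(range(n))
    rng.shuffle(subjects)
    treatment = subjects[: n // 2]
    control = subjects[n // 2:]
    # Return the average age in each group.
    return (mean([ages[i] for i in treatment]),
            mean([ages[i] for i in control]))

treatment_age, control_age = randomize_groups()
print(round(treatment_age, 1), round(control_age, 1))
```

With a sample this large, the two averages should differ only slightly, even though age played no part in the assignment. The same logic applies to traits we never measured, which is what makes random assignment so valuable.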

With observational data from surveys, censuses, registries, and the like, used for research where experimentation is implausible for varied reasons, we have no way to replicate the control versus treatment contrast of an experiment. So, if we simply look at the relationships between what we believe are critical causes and the outcomes in question, we face a troubling possibility: the relationships we measure could be due to causes outside the set that interests us.

While we cannot eliminate the influence of “other” conditions and processes on the outcomes that concern us, we can sometimes collect data on those troubling conditions along with the outcomes and the research-targeted, causal conditions.  We are particularly concerned with those “other” influences that might have direct or indirect causal relations with both our outcomes (dependent variables) and our target causes (independent variables).  These “other” influences might create a “spurious” relationship between our causal variables and our outcome variables that we would mistakenly infer to be causal.  Or, they might suppress or disguise a causal relationship so that we do not recognize it. 
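A small simulation can make the idea of a “spurious” relationship concrete. The sketch below is illustrative only: the variable names z, x, and y, the sample size, and the effect sizes are all assumptions chosen for the example. Here z (the “other” influence) affects both x (our target cause) and y (our outcome), while x has no effect on y at all. We compare the simple regression slope of y on x with the slope obtained after removing the linear influence of z from both variables, which is equivalent to including z in the regression.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Least-squares slope of y on a single predictor x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    b = slope(x, y)
    mx, my = mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def simulate_confounding(n=5000, seed=1):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # the "other" influence
    x = [zi + rng.gauss(0, 1) for zi in z]    # target cause, driven by z
    y = [zi + rng.gauss(0, 1) for zi in z]    # outcome, driven by z only (x has no effect)
    naive = slope(x, y)                       # ignores z entirely
    # Controlling for z: relate the parts of x and y that z does not explain.
    adjusted = slope(residuals(z, x), residuals(z, y))
    return naive, adjusted

naive, adjusted = simulate_confounding()
print(round(naive, 2), round(adjusted, 2))
```

Under these assumed settings the naive slope comes out at roughly 0.5 even though x does nothing to y, while the adjusted slope is close to zero. The apparent effect of x is produced entirely by z.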

To offset these possibilities, we want to use the data we have on these “other” variables to “control” for their confounding influences. We operationalize these controls in varied ways, such as adding the control variables to our regression equations or stratifying our data (e.g., by sex or ethnicity) and applying our data analysis separately to each stratum. Within a stratum, the control variable (the potentially influential condition or process that is not the focus of our analysis) is held constant. Which strategy we choose depends on the data, on the causal model, and on the statistical procedures we are using. You will examine these choices in your statistics courses.
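Stratification can be sketched in a few lines. The example below is a hypothetical illustration, not a recommended procedure: the two strata (coded 0 and 1), the exposure probabilities, and the size of the stratum effect are all invented for the demonstration. Exposure is more common in stratum 1, and the outcome depends only on the stratum, so any pooled difference between exposed and unexposed cases is produced by the stratum rather than by exposure.

```python
import random

def mean(values):
    return sum(values) / len(values)

def difference(rows):
    """Mean outcome of exposed cases minus mean outcome of unexposed cases."""
    exposed = [y for _, t, y in rows if t == 1]
    unexposed = [y for _, t, y in rows if t == 0]
    return mean(exposed) - mean(unexposed)

def simulate_strata(n=4000, seed=3):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.randint(0, 1)                   # hypothetical stratum (the control variable)
        p_exposed = 0.8 if g == 1 else 0.2      # exposure is more common in stratum 1
        t = 1 if rng.random() < p_exposed else 0
        y = 2.0 * g + rng.gauss(0, 1)           # outcome depends on the stratum only
        rows.append((g, t, y))
    pooled = difference(rows)                   # ignores the stratum
    # Holding the stratum constant: repeat the comparison inside each stratum.
    within = {g: difference([r for r in rows if r[0] == g]) for g in (0, 1)}
    return pooled, within

pooled, within = simulate_strata()
print(round(pooled, 2), {g: round(d, 2) for g, d in within.items()})
```

The pooled comparison shows a sizable difference between exposed and unexposed cases, but inside each stratum the difference should be close to zero. Adding the stratum as a variable in a regression equation pursues the same goal by different means.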

Note that this does not mean we add in control variables indiscriminately.  Whatever the statistical technique we are using, we do not want to add in unneeded variables.  To do so invites confusion for ourselves and our audience. 

Realistically, scholarly journals in sociology and related disciplines have published many papers where the authors have apparently chosen control variables using an uncritical, opportunistic strategy. They match the variables available in the data they use with implicit lists of variables often used as controls in sociological research. As a common strategy, researchers run their analyses with an extended set of these “control” variables, eventually keeping two subgroups: (1) those that produce statistically significant coefficients (or the like) that seem to bolster the “discovery” character of the research and (2) those that show little or no effect, but that other scholars might question if absent (e.g., “why didn’t you control for race?”). From the perspective of serious scholarship and scientific integrity, this practice is deeply flawed. From the perspective of efficient production of publications within the scope of common, current practices, it is a practical adaptation.