
Program Evaluation Methods



Chapter 3 - EVALUATION DESIGNS

3.1 Introduction

An evaluation design describes the logic model that will be used to gather evidence on results that can be attributed to a program. The basic principle of experimentation was illustrated in Figure 2; it involved comparing two groups, one of which was exposed to the program, and attributing the differences between the groups to the program. This type of design is referred to as the ideal evaluation design. As discussed earlier, it can seldom be fully realized in practice. Nevertheless, it is a useful construct to use for comparison and explanatory purposes. The ideal evaluation design can also be illustrated as follows.

 

                        Measurement Before    Exposure to Program    Measurement After

Treatment Group         01                    X                      03

Control Group           02                                           04

In this chart, "0" represents a measurement or observation of a program result and "X" represents exposure to the program. Subscripts on the symbols distinguish different measurements or treatments. Each 0 represents an estimate (such as an estimated average) based on observations of the members of a group. Expressions such as 03 - 04 should therefore be interpreted as conceptual comparisons rather than differences between individual observations. The diagram also indicates whether an observation is made before or after exposure to the program. This notation is used throughout the chapter to illustrate the various designs schematically.

In the ideal evaluation design, the outcome attributed to the program is clearly 03 - 04. This is because 01 = 02 and so 03 = 04 + X (the program), or 03 - 04 = X. Note that, in this case, 01 and 02 are not required to determine the net outcome of the program since they are assumed to be equal. Thus, the ideal design could actually be represented as follows.

 

                        Exposure to Program    Measurement After

Treatment Group         X                      03

Control Group                                  04

 

However, the evaluator may be interested in the relative change that has occurred, in which case the pre-program measurement is essential.
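To make the arithmetic concrete, the following is a minimal illustrative sketch (in Python, using entirely hypothetical outcome values and a hypothetical program effect) of the ideal design in its simplified form: two equivalent groups are created at random, one is exposed to the program, and the post-program difference 03 - 04 recovers the program effect.

    import random

    random.seed(1)

    # Illustrative only: a hypothetical outcome for a population, split at random
    # into two equivalent groups; the program adds a fixed amount to each treated
    # member's outcome.
    TRUE_EFFECT = 5.0
    population = [random.gauss(50, 10) for _ in range(10000)]

    random.shuffle(population)
    treatment, control = population[:5000], population[5000:]

    o3 = sum(x + TRUE_EFFECT for x in treatment) / len(treatment)  # measured after exposure
    o4 = sum(x for x in control) / len(control)                    # measured, no exposure

    print(f"03 - 04 = {o3 - o4:.2f} (true program effect = {TRUE_EFFECT})")

Because the two groups are equivalent by construction, the post-program difference alone isolates the program effect, which is the point made above.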

The significance of the ideal design is that it provides the underlying logic of program attribution for all the evaluation designs described in this chapter: causal inferences are made by comparing otherwise identical groups before and after a program. Indeed, the common characteristic of all designs is the use of comparison. What distinguishes the various designs is the degree to which the comparison is made between groups that are identical in every respect save for exposure to the program.

The most rigorous designs, called experimental or randomized designs, ensure the initial equivalence of the groups by creating them through the random assignment of participants to a "treatment" or separate "control" group. This process ensures that the groups to be compared are equivalent; that is, the process ensures that the expected values (and other distribution characteristics) of 01 and 02 are equal. Experimental or randomized designs are discussed in Section 3.2.

"In-between" designs, called quasi-experimental designs, are discussed in Section 3.3. These designs come close to experimental designs in that they use comparison groups to make causal inferences, but they do not use randomization to create treatment (or experimental) and control groups. In these designs, the treatment group is usually already given. One or more comparison groups are selected to match the treatment group as closely as possible. In the absence of randomization, group comparability cannot be assumed, and so the potential for incomparability must be dealt with. Nevertheless, quasi-experimental designs are the best that can be hoped for when randomization is not possible.

At the other end of the scale are implicit designs, which are typically weak in terms of measuring changes and attributing them to a program. An illustration of an implicit design would look like this.

 

                        Exposure to Program    Measurement After

Treatment Group         X                      01

 

With implicit designs, a measurement is made after exposure to the program and assumptions are made about conditions before the program. Any change from what was assumed to exist before the program is attributed to the program. In other words, it is assumed that an unspecified comparison group would experience no change, or at least not all of the change observed in the treatment group. Implicit designs are discussed in greater detail in Section 3.4.

While these different types of design reflect differing levels of rigour in determining results, they also reflect a basic difference between experimental programs and regular (non-experimental) programs. Most government programs exist to provide benefits to participants and assume that the program does, in fact, work. Participation in programs is typically determined through eligibility criteria. This differs substantially from experimental or pilot programs, which are put in place to test the theory underlying a program and to determine its effectiveness. Participants in such programs receive benefits, but these considerations are secondary to testing the efficacy of the program. Consequently, participants are often chosen to maximize the conclusiveness of program results and not necessarily with regard to eligibility criteria.

These two purposes, providing benefits and testing the program theory, almost always conflict. Program managers typically see the purpose of their programs as delivering benefits, even if the program is a pilot. Evaluators and planners, on the other hand, will prefer to implement the program as an experiment to determine beforehand whether it is worth expanding. In practice, most programs are non-experimental, so evaluators must frequently resort to non-experimental evaluation designs.

This chapter discusses the three types of evaluation design mentioned above. Specific designs for each type are described and their advantages and limitations outlined. While categorizing evaluation designs into three types (randomized, quasi-experimental and implicit) facilitates the discussion that follows, the boundaries that separate one type from the next are not always fixed. Quasi-experimental designs, in particular, blend into implicit designs. Nevertheless, the distinctions are useful and in most cases indicative of differing levels of rigour. Moving from a randomized design to an implicit one, the evaluator must be concerned with an increasing number of threats to the validity of causal inferences.

References: Evaluation Design

Abt, C.G., ed. The Evaluation of Social Programs. Thousand Oaks: Sage Publications, 1976.

Boruch, R.F. "Conducting Social Experiments," Evaluation Practice in Review. V. 34 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1987.

Campbell, D.T. and J.C. Stanley. Experimental and Quasi-experimental Designs for Research. Chicago: Rand-McNally, 1963.

Cook, T.D. and D.T. Campbell. Quasi-experimentation: Designs and Analysis Issues for Field Settings. Chicago: Rand-McNally, 1979.

Datta, L. and R. Perloff. Improving Evaluations. Thousand Oaks: Sage Publications, 1979, Section II.

Globerson, Aryé, et al. You Can't Manage What You Don't Measure: Control and Evaluation in Organizations. Brookfield: Gower Publications, 1991.

Rossi, P.H. and H.E. Freeman. Evaluation: A Systematic Approach, 2nd ed. Thousand Oaks: Sage Publications, 1989.

Trochim, W.M.K., ed. Advances in Quasi-experimental Design and Analysis. V. 31 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1986.

Watson, Kenneth. "Program Design Can Make Outcome Evaluation Impossible: A Review of Four Studies of Community Economic Development Programs," Canadian Journal of Program Evaluation. V. 10, N. 1, April-May 1995, pp. 59-72.

Weiss, C.H. Evaluation Research. Englewood Cliffs, NJ: Prentice-Hall, 1972, Chapter 4.


3.2 Randomized Experimental Designs

Experimental designs are the most rigorous approach available for establishing causal relations between programs and their results. When successfully applied, they furnish the most conclusive evidence of program impacts. Unfortunately, experimental designs are impossible to implement for many government programs after the program has been running for some time. Nevertheless, they are important for two reasons.

First, they represent the closest approximation to the ideal evaluation design described above. As such, even when it is not feasible to implement an experimental design, less rigorous designs are often judged by the extent to which they come close to an experimental design. It is therefore important to understand their advantages and limitations.

Second, in spite of the practical difficulties involved, experimental designs can be and have been used to evaluate many programs. For instance, an experimental design was used to evaluate educational programs that prevent adolescent alcohol use and abuse. Treatment and control groups were constructed (classes receiving and not receiving the program) and measures were obtained on attitude, knowledge, beliefs, intentions and actual drinking (Schlegel, 1977).

Experimental or randomized designs are characterized by a random assignment of potential participants to the program and comparison groups to ensure their equivalence. They are experiments in the sense that program participants are chosen at random from potential candidates. There are a large number of experimental designs, four of which are described below:

  • classical randomized comparison group design,
  • post-program-only randomized comparison group design,
  • randomized block and Latin square designs, and
  • factorial designs.

Note that a randomized design is not the same as random sampling. Whereas a randomized design involves randomly assigning members of a target population to either the control or the treatment group, random sampling means using a probability scheme to select a sample from a population. Random sampling from two different populations would not yield equivalent groups for the purpose of an experimental evaluation.

Classical Randomized Comparison Group Design

This classic experimental design can be illustrated as follows, where the "R" means random allocation.

 

                        Measurement Before    Exposure to Program    Measurement After

Treatment Group (R)     01                    X                      03

Control Group (R)       02                                           04

In this design, potential program participants from the target population are randomly assigned either to the experimental (program) group or to the comparison group. Measurements are taken before and after (pre-program and post-program), and the net program outcome is, schematically, (03-04) - (01-02).

Random allocation (or randomization) implies that every member of the target population has a known probability of being selected for either the experimental or the comparison group. Often these probabilities are equal, in which case each member has the same chance of being selected for either group. As a result of randomization, the experimental and control groups are statistically equivalent: the expected values of 01 and 02 are equal. However, the actual pre-program measures obtained may differ owing to chance. Pre-program measurement therefore allows for a better estimate of the net outcome by accounting for any chance differences between the groups (01 and 02) that exist despite the randomization process. In this design, the program intervention (or treatment) is the only difference, other than chance, between the experimental and control groups.
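The net outcome formula (03-04) - (01-02) can be illustrated with a minimal sketch (in Python, with hypothetical scores, a hypothetical common drift and a hypothetical program effect): random allocation creates the two groups, and subtracting the pre-program difference removes any chance imbalance between them.

    import random

    random.seed(2)

    # Illustrative only: hypothetical pre-program scores, a common change that
    # both groups experience anyway, and a hypothetical program effect.
    TRUE_EFFECT = 4.0
    DRIFT = 1.5
    candidates = [random.gauss(60, 12) for _ in range(2000)]

    random.shuffle(candidates)                                   # random allocation (R)
    treat_pre, control_pre = candidates[:1000], candidates[1000:]

    treat_post = [x + DRIFT + TRUE_EFFECT + random.gauss(0, 3) for x in treat_pre]
    control_post = [x + DRIFT + random.gauss(0, 3) for x in control_pre]

    mean = lambda xs: sum(xs) / len(xs)
    o1, o2, o3, o4 = map(mean, (treat_pre, control_pre, treat_post, control_post))

    # Net outcome: (03 - 04) - (01 - 02) removes any chance pre-program difference.
    print(f"net outcome = {(o3 - o4) - (o1 - o2):.2f} (true effect = {TRUE_EFFECT})")

The common drift affects both groups equally and therefore cancels out, leaving an estimate close to the assumed program effect.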

Post-Program-Only Randomized Comparison Group Design

One of the drawbacks of the classical randomized design is that it is subject to a testing bias. There is a threat to validity in that the pre-program measurement itself may affect the behaviour of the experimental group, the control group, or both. This testing bias can potentially affect the validity of any causal inferences the evaluator may wish to make. To avoid this scenario, the evaluator may wish to drop the pre-program measurement. Graphically, such a design would look as follows:

 

                        Exposure to Program    Measurement After

Treatment Group (R)     X                      01

Control Group (R)                              02

 

A post-program randomized design can be highly rigorous. However, one should keep in mind that, despite the randomization process, it is possible that the two groups constructed will differ significantly in terms of the measures of interest; one cannot, therefore, be completely certain of avoiding initial group differences that could affect the evaluation results.
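The possibility of chance differences mentioned above can be illustrated with a short sketch (in Python, with a hypothetical population and hypothetical sample sizes): repeating the random allocation many times shows how large the purely chance difference between the two groups' means can be when samples are small.

    import random

    random.seed(3)

    # Illustrative only: a small hypothetical population is split at random many
    # times, and the chance difference between the two group means is recorded.
    population = [random.gauss(50, 10) for _ in range(200)]

    diffs = []
    for _ in range(1000):
        random.shuffle(population)
        group_a, group_b = population[:100], population[100:]
        diffs.append(sum(group_a) / 100 - sum(group_b) / 100)

    largest = max(abs(d) for d in diffs)
    print(f"largest chance difference over 1000 allocations: {largest:.2f}")

With no program at all, some allocations still produce noticeable differences between the groups, which is why randomization alone cannot guarantee initial equivalence in any single evaluation.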

Randomized Block and Latin Square Designs

To make it less likely that the measured net effect of a program is the result of sampling error, one should use as large a sample as possible. Unfortunately, this can be extremely costly. To address this problem, randomization and matching (blocking) should be combined where it is necessary to use relatively small sample sizes. Matching consists of dividing the population from which the treatment and control groups are drawn into "blocks" that are defined by at least one variable that is expected to influence the impact of the program.

For instance, if those in an urban environment were expected to react more favourably to a social program than rural inhabitants, two blocks could be formed: an urban block and a rural block. Randomized selection of the treatment and control groups could then be performed separately within each block. This process would help ensure a reasonably equal participation of both urban and rural inhabitants. In fact, blocking should always be carried out if the variables of importance are known.

Groups can, of course, be matched on more than one variable. However, increasing the number of variables rapidly increases the number of blocks and ultimately the required sample size. For instance, if the official language spoken (English or French) is also expected to influence the impact of our program, the following blocks must be considered: English urban, English rural, French urban and French rural. Because each block requires a treatment and control group, eight groups are required and minimum sample size levels must be observed for each of these. Fortunately, the number of groups can be reduced by using such methods as the Latin Square design. However, these methods can be used only if the interaction effects between the treatment and the control variables are relatively unimportant.
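The mechanics of blocking can be illustrated with a short sketch (in Python, using the hypothetical urban/rural and English/French blocking variables mentioned above and an arbitrary population size): random assignment to treatment or control is carried out separately within each block.

    import random
    from collections import defaultdict

    random.seed(4)

    # Illustrative only: a hypothetical target population with two blocking
    # variables; assignment is randomized within each of the four blocks.
    population = [
        {"id": i,
         "area": random.choice(["urban", "rural"]),
         "language": random.choice(["English", "French"])}
        for i in range(400)
    ]

    blocks = defaultdict(list)
    for person in population:
        blocks[(person["area"], person["language"])].append(person)

    assignment = {}
    for block, members in blocks.items():
        random.shuffle(members)                       # randomize within the block
        half = len(members) // 2
        for person in members[:half]:
            assignment[person["id"]] = "treatment"
        for person in members[half:]:
            assignment[person["id"]] = "control"

    for block, members in sorted(blocks.items()):
        n_treat = sum(assignment[p["id"]] == "treatment" for p in members)
        print(block, f"{n_treat} treatment / {len(members) - n_treat} control")

Within-block randomization guarantees that each block contributes roughly equally to the treatment and control groups, which is the balance that simple randomization cannot guarantee with small samples.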

Factorial Designs

In the classical and randomized block designs, only one experimental or treatment variable was involved. Yet, programs often employ a series of different means to stimulate recipients toward an intended outcome. When evaluators want to sort out the separate effects of the various methods of intervention used, they can use a factorial design. A factorial design not only determines the separate effects of each experimental variable, it can also estimate the joint net effects (the interaction effect) of pairs of experimental variables. This is important because interaction effects are often observed in social phenomena. For instance, the joint impact of increasing the taxes on tobacco and of increasing the budget for non-smoking advertising may be greater than the sum of the separate impacts of the two interventions.
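The separate and joint effects in a 2 x 2 factorial design can be estimated from the four cell means, as in the following sketch (in Python, using the tobacco-tax and advertising example with entirely hypothetical effect sizes, including an assumed interaction).

    import random

    random.seed(5)

    # Illustrative only: a 2 x 2 factorial design crossing a tax increase and an
    # advertising campaign, with assumed effects and a positive interaction.
    def simulate_cell(tax, ads, n=500):
        effect = -2.0 * tax - 1.5 * ads - 1.0 * tax * ads   # hypothetical effects
        return [20 + effect + random.gauss(0, 4) for _ in range(n)]

    mean = lambda xs: sum(xs) / len(xs)
    cells = {(tax, ads): mean(simulate_cell(tax, ads)) for tax in (0, 1) for ads in (0, 1)}

    # Main effects: average change from turning one factor on, over both levels
    # of the other factor.  Interaction: how much the two effects reinforce each other.
    tax_main = ((cells[1, 0] + cells[1, 1]) - (cells[0, 0] + cells[0, 1])) / 2
    ads_main = ((cells[0, 1] + cells[1, 1]) - (cells[0, 0] + cells[1, 0])) / 2
    interaction = (cells[1, 1] - cells[1, 0]) - (cells[0, 1] - cells[0, 0])

    print(f"tax main effect: {tax_main:.2f}")
    print(f"ads main effect: {ads_main:.2f}")
    print(f"interaction:     {interaction:.2f}")

The interaction term measures the extra effect of applying both interventions together, beyond the sum of their separate effects, which is exactly the joint impact described above.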

Strengths and Weaknesses

  • Experimental designs offer the most rigorous methods of establishing causal inferences about the results of programs. They do this by eliminating threats to internal validity by using a control group, randomization, blocking and factorial designs. The main drawback of experimental designs is that they are often difficult to implement.

Unfortunately, randomization (the random assignment to treatment and control groups) is often not possible. For instance:

  • when the whole target population is already receiving the program, there is no basis for forming a control group;
  • when the program has been under way for some time, definite differences probably exist between those who have benefited from the program (the potential treatment group) and those who have not (the potential control group); and
  • when it would be illegal or unethical to grant the benefits of the program to some people (the treatment group) while withholding them from others (the control group).

Clearly, the majority of government programs fall into at least one of the above categories, making randomization extremely difficult, except perhaps where the program is treated as a real experiment, that is, as a pilot program.

  • Experimental designs are still subject to all the threats to external validity and some of the threats to internal validity.

The difficulty of generalizing conclusions about program results is not automatically eliminated by an experimental design. Randomization for generalization purposes is a different issue from the random assignment of experimental and comparison groups. The former requires that the original target population from which the two groups are created be itself selected at random from the population of potential recipients (the population of subjects to whom the evaluators may wish to generalize their results).

In addition, several threats to internal validity still remain important despite the implementation of a randomized selection process:

  • differential mortality (drop-out from the program and control groups) could bias the original randomization; and
  • diffusion of treatment between the two groups could contaminate the results.

Furthermore, the classical experimental design raises questions:

  • changes in instrumentation could still bias the measurements taken; and
  • the reaction to testing could result in different behaviour between the experimental and control groups.

As these last two issues are primarily the result of pre-testing, the post-program-only randomized comparison group design (mentioned earlier) avoids these threats. It should nevertheless be clear that, despite the strengths of experimental designs, the results of such designs should still be interpreted carefully.

References: Randomized Experimental Designs

Boruch, R.F. "Conducting Social Experiments," Evaluation Practice in Review. V. 34 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1987, pp. 45-66.

Boruch, R.F. "On Common Contentions About Randomized Field Experiments." In Gene V. Glass, ed. Evaluation Studies Review Annual. Thousand Oaks: Sage Publications, 1976.

Campbell, D. "Considering the Case Against Experimental Evaluations of Social Innovations," Administrative Science Quarterly. V. 15, N. 1, 1970, pp. 111-122.

Eaton, Frank. "Measuring Program Effects in the Presence of Selection Bias: The Evolution of Practice," Canadian Journal of Program Evaluation. V. 9, N. 2, October-November 1994, pp. 57-70.

Trochim, W.M.K., ed. "Advances in Quasi-experimental Design and Analysis," V. 31 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1986.


3.3 Quasi-experimental Designs

When randomization is not possible, it may be feasible to construct a comparison group that is similar enough to the treatment group to make some valid inferences about results attributable to the program. In this section, quasi-experimental designs are characterized as those that use a non-randomized comparison group to make inferences on program results. This comparison group could be either a constructed group, which was not exposed to the program, or a reflexive group, namely the experimental group itself before exposure to the program.

Three general types of quasi-experimental designs are discussed here:

  • pre-program/post-program designs,
  • historical/time series designs and
  • post-program-only designs.

These are presented in roughly descending order of rigour, although in all cases the degree of equivalence between the experimental and comparison groups will be the overriding determinant of the design's strength.

3.3.1 Pre-program/Post-program Designs

There are two basic designs in this category: the pre-program/post-program non-equivalent design and the one group pre-program/post-program design. The former uses a constructed comparison group and the latter uses a reflexive comparison group.

Pre-program/Post-program Non-equivalent Comparison Group Design

This design, structurally similar to the classical experimental design, uses pre-program and post-program measurements on the program group and a comparison group:

 

                        Measurement Before    Exposure to Program    Measurement After

Treatment Group         01                    X                      03

Control Group           02                                           04

The comparison group is selected so that its characteristics of interest resemble those of the program group as closely as possible. The degree of similarity between the groups is determined through pre-program comparison. To the extent that matching is carried out and is properly specified (that is, it is based on variables that influence the outcome variables), this design approaches the rigour of randomized comparison group design and the threats to internal validity can be minimal. Unfortunately, it is usually difficult to match perfectly on all variables of importance. This means that, typically, at least one rival explanation for observed net program impacts will remain, namely that the two groups were unequal to begin with.
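One common way of constructing such a comparison group is matching on the variables of interest. The following is a minimal sketch (in Python, with a single hypothetical matching variable and hypothetical values); it is an illustration of one possible matching approach, not a prescribed procedure.

    import random

    random.seed(6)

    # Illustrative only: each program participant is matched, without replacement,
    # to the non-participant closest on one hypothetical matching variable.
    participants = [random.gauss(55, 8) for _ in range(50)]
    non_participants = [random.gauss(50, 10) for _ in range(500)]

    comparison_group = []
    available = list(non_participants)
    for score in participants:
        closest = min(available, key=lambda x: abs(x - score))
        comparison_group.append(closest)
        available.remove(closest)          # match without replacement

    mean = lambda xs: sum(xs) / len(xs)
    print(f"participants mean:        {mean(participants):.1f}")
    print(f"matched comparison mean:  {mean(comparison_group):.1f}")
    print(f"all non-participants:     {mean(non_participants):.1f}")

Matching brings the comparison group's mean on the matching variable close to that of the program group, but any unmatched variable remains a potential rival explanation, as noted above.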

One-group Pre-program/Post-program Design

This simple design is frequently used despite its inherent weaknesses. This may be because it closely resembles the ordinary concept of a program result: pre-program to post-program change. One-group pre-program/post-program designs can be illustrated as follows:

 

                        Measurement Before    Exposure to Program    Measurement After

Treatment Group         01                    X                      02

There are many threats to the internal validity of this design. Any number of plausible explanations could account for observed differences between 02 and 01. This is because the comparison group in this case is simply the treatment group before being exposed to the program; it is a reflexive comparison group. The lack of an explicit comparison group means that most of the threats to internal validity are present. History may be a problem since the design does not control for events outside the program that affect observed results. Normal maturation of the program population itself may also explain any change. As well, the change may be a regression artefact; 01 may be atypically low, so that 02 - 01 is measuring chance fluctuation rather than a change resulting from the program. Finally, testing, instrumentation and mortality could be problems.

The sole advantage of this design is its simplicity. If the evaluator can achieve enough control over external factors, this design furnishes reasonably valid and conclusive evidence. In the natural sciences, a laboratory setting typically gives enough control of external factors; social science research tends to be far less controllable.

3.3.2 Historical/Time Series Designs

Historical or time series designs are characterized by a series of measurements over time, both before and after exposure to the program. Any of the pre-program/post-program designs already described could be extended to become a historical design. This means that historical designs that have only a few before-and-after measurements are subject to all of the threats to internal validity that the corresponding single measurement design faces. A more complete set of measures, on the other hand, allows the evaluator to eliminate many of these threats by analyzing pre- and post-program trends.

Two historical designs are described below:

  • the basic time series design, and
  • the time series design with a non-equivalent comparison group.

Basic Time Series Design

A common historical design is the basic time series design, in which any number of before-and-after measurements can be made. It can be illustrated as follows:

 

                        Measurement Before    Exposure to Program    Measurement After

Treatment Group         01 02 03 04           X                      05 06 07 08

Using this design, an evaluator can identify the effects of a given program by a change in the pattern of observations measured before and after exposure. With adequate time series data, this design can be fairly rigorous, ruling out many threats to internal validity, particularly maturation and testing effects. Other threats remain, such as those related to history, because time series designs cannot eliminate the possibility that something other than the program caused a change between measurements taken before and after exposure.
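The idea of detecting a change in the pattern of observations can be sketched with a simple segmented ("interrupted") regression, shown below in Python with hypothetical data: an assumed underlying trend, an assumed post-program level shift, and arbitrary series lengths. This is an illustration of the logic only, not a full time series analysis.

    import numpy as np

    rng = np.random.default_rng(7)

    # Illustrative only: a series with a hypothetical trend and a hypothetical
    # level shift after the program starts.
    n_pre, n_post, true_shift = 12, 12, 6.0
    t = np.arange(n_pre + n_post)
    after = (t >= n_pre).astype(float)
    y = 40 + 0.5 * t + true_shift * after + rng.normal(0, 2, t.size)

    # Design matrix: intercept, underlying trend, and a post-program indicator.
    X = np.column_stack([np.ones_like(t, dtype=float), t.astype(float), after])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(f"estimated post-program level shift: {coef[2]:.2f} (true shift = {true_shift})")

Because the pre-program observations establish the trend, a maturation effect of the kind discussed above is separated from the program-related shift; a coincident historical event, however, would still be confounded with the program.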

Time Series Design With Non-equivalent Comparison Group

Historical designs can be improved by adding comparison groups. Consider the time series design with a non-equivalent comparison group shown below:

 

                        Measurement Before        Exposure to Program    Measurement After

Treatment Group         01 02 03 04 05            X                      011 012 013 014 015

Control Group           06 07 08 09 010                                  016 017 018 019 020

Since both the experimental and comparison groups should experience the same external factors, it is unlikely that an observed change will be caused by anything but the program. As with any design using a non-equivalent comparison group, however, the groups must be similar enough in terms of the characteristics of interest. When this condition is met, historical designs can be quite rigorous.
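The way a comparison series removes shared external factors can be sketched as follows (in Python, with hypothetical series in which both groups experience the same external shock and only the treatment group experiences the program). The particular estimator, a simple difference of pre-to-post level shifts, is an illustrative choice, not the only analysis possible.

    import numpy as np

    rng = np.random.default_rng(8)

    # Illustrative only: both series share a trend and an external shock; only
    # the treatment series contains the hypothetical program effect.
    n_pre, n_post = 10, 10
    t = np.arange(n_pre + n_post)
    after = t >= n_pre

    shock = 3.0          # external event affecting both groups after the program starts
    true_effect = 5.0

    treat = 50 + 0.4 * t + shock * after + true_effect * after + rng.normal(0, 1.5, t.size)
    comp = 48 + 0.4 * t + shock * after + rng.normal(0, 1.5, t.size)

    def level_shift(y):
        # Post-program mean minus pre-program mean for one series.
        return y[after].mean() - y[~after].mean()

    estimate = level_shift(treat) - level_shift(comp)
    print(f"estimated program effect: {estimate:.2f} (true effect = {true_effect})")

Since the external shock shifts both series equally, it cancels in the difference, leaving an estimate close to the assumed program effect.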

A number of strengths and weaknesses of historical designs can be identified.

  • Historical designs using adequate time series data can eliminate many threats to internal validity.

This is true because, when properly carried out, a historical design allows for some kind of an assessment of the maturation trend before the program intervention.

  • Historical designs can be used to analyze a variety of time-dependent program effects.

The longitudinal aspect of these designs can be used to address several questions: Is the observed effect lasting or does it diminish over time? Is it immediate or delayed, or is it seasonal in nature? Some type of historical design is called for whenever these types of questions are important.

  • Adequate data may not be available for carrying out the required time series analysis.

Numerous data problems may exist with historical designs. In particular, the time series available are often much shorter than those usually recommended for statistical analysis (there are not enough data points); different data collection methods may have been used over the period being considered; and the indicators used may have changed over time.

  • Special time series analysis is usually required for historical designs.

Ordinary least squares regression is generally inappropriate for time series data because the observations are correlated over time. A number of specialized techniques are required (see, for example, Cook and Campbell, 1979, Chapter 6; Fuller, 1976; Jenkins, 1979; and Ostrom, 1978).

3.3.3 Post-program-only Designs

In post-program-only designs, measurements are carried out only after exposure to the program, eliminating testing and instrumentation threats. However, since no pre-program information is available, serious threats to validity exist even where a control group is used. Two such designs are described below.

Post-program-only with Non-equivalent Control Group Design

A post-program-only design with non-equivalent control group is illustrated below.

 

                        Exposure to Program    Measurement After

Treatment Group         X                      01

Control Group                                  02

 

Selection and mortality are the major threats to internal validity in a post-program-only design. There is no way of knowing if the two groups were equivalent before exposure to the program. The differences between 01 and 02 could, consequently, reflect only an initial difference and not a program impact. Furthermore, the effect of drop-outs (mortality effect) cannot be known without pre-program measures. Even if the two groups had been equivalent at the outset, 01 or 02 will not account for the program's drop-outs and so biased estimates of program effects could result.

Post-program-only Different Treatments Design

A somewhat stronger post-program-only design is as follows.

 

                        Exposure to Program    Measurement After

Treatment Group 1       X1                     01

Treatment Group 2       X2                     02

Treatment Group 3       X3                     03

Treatment Group 4       X4                     04

 

In this design, different groups are subjected to different levels of the program. This may be accomplished through, say, regional variation in program delivery and benefits. If sample sizes are large enough, a statistical analysis can be performed to relate the various program levels to the results observed (the 0 measurements), while controlling for other variables.
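A minimal sketch of such an analysis is shown below in Python, using a hypothetical program-intensity variable, a hypothetical control variable and assumed effect sizes; the regression relates the level of program delivered to the outcome observed while holding the control variable constant.

    import numpy as np

    rng = np.random.default_rng(9)

    # Illustrative only: regions receive program levels 1 to 4; one hypothetical
    # covariate (firm size) is controlled for in the regression.
    n = 400
    level = rng.choice([1.0, 2.0, 3.0, 4.0], size=n)      # program intensity by region
    firm_size = rng.normal(100, 20, size=n)               # assumed control variable
    outcome = 10 + 2.5 * level + 0.1 * firm_size + rng.normal(0, 3, size=n)

    X = np.column_stack([np.ones(n), level, firm_size])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

    print(f"estimated effect per unit of program level: {coef[1]:.2f} (assumed true value = 2.5)")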

As in the previous design, selection and mortality are major threats to internal validity.

Strengths and Weaknesses

  • Quasi-experimental designs take creativity and skill to design, but can give highly accurate findings.

An evaluation can often do no better than a quasi-experimental design. When equivalence of the treatment and control groups cannot be established through randomization, the best approach is to use all available prior knowledge to choose the quasi-experimental design that is least subject to confounding effects. Indeed, a properly executed quasi-experimental design can provide findings that are more reliable than those of a poorly executed experimental design.

  • Quasi-experimental designs can be cheaper and more practical than experimental designs.

Because quasi-experimental designs do not require randomized treatment and control groups, they can be less expensive and easier to implement than experimental designs.

  • Threats to internal validity must be accounted for individually when quasi-experimental designs are used.

The extent to which threats to internal validity are a problem depends largely on the success of the evaluator in matching the experimental and control groups. If the key variables of interest are identified and matched adequately, internal validity threats are minimized. Unfortunately, it is often impossible to match all the variables of interest.

In selecting the appropriate evaluation design, evaluators should look at the various quasi-experimental designs available and assess the major threats to validity embodied in each. The appropriate design will eliminate or minimize major threats, or at least allow the evaluator to account for their impact.


3.4 Implicit Designs

Implicit designs are probably the most frequently used designs, but they are also the least rigorous. Often, no reliable conclusions can be drawn from such a design. Conversely, an implicit design may be all that is required in cases where the program can be argued logically to have caused the outcome. This design is basically a post-program design with no control group. Schematically, it looks as follows.

 

                        Exposure to Program    Measurement After

Treatment Group         X                      01

 

As represented here, the magnitude of the program effect cannot be determined (since there is no pre-program measure), nor can anything definitive be said about attribution (01 could be the result of any number of factors). In its worst form, this design entails asking participants if they "liked" the program. Grateful testimonials are then offered as evidence of the program's success. Campbell (1977), among others, criticizes this common evaluation approach.

While this design often owes its popularity to poorly thought-out evaluation planning, it is sometimes the only design that can be implemented: for instance, when no pre-program measures exist and no obvious control group is available. In such cases, the best should be made of a bad situation by converting the design into an implicit quasi-experimental design. Three possibilities are

  • the theoretical control group design,
  • the retrospective pre-program measure design and
  • the direct estimate of difference design.

Each is described below.

Post-program-only with Theoretical Comparison Group Design

By assuming the equivalence of some theoretical control group, this design looks like a post-program-only non-equivalent control group design:

 

                              Exposure to Program    Measurement After

Treatment Group               X                      01

Theoretical Control Group                            02*

 

The difference is that the 02* measurement is assumed rather than observed. The evaluator might be able to assume, on theoretical grounds, that the result in the absence of any program would be below a certain level. For example, in a program to increase awareness of the harmful effects of caffeine, the knowledge of the average Canadian (02*) could be assumed to be negligible in the absence of a national information program. As another example, consider determining the economic benefit of a government program or project. In the absence of the program, it is often assumed that the equivalent investment left in the private sector would yield an average social rate of return of 10 per cent (the 02* in this case). Thus, the rate of return on the government investment project (01) could be compared with the private sector norm of 10 per cent (02*).

Post-program-only with Retrospective Pre-program Measure Design

In this case, pre-program measures are obtained, but after exposure to the program, so that the design resembles the pre-program/post-program design:

 

                        Retrospective Before    Exposure to Program    Measurement After

Treatment Group         01                      X                      02

For example, the following two survey questions might be asked of students after they have participated in an English course.

1. Rate your knowledge of English before this course on a scale of 1 to 5.

2. Rate your knowledge of English after completing this course on a scale of 1 to 5.

Thus, students are asked for pre-program and post-program information, but only after having completed the course. Differences between the scores could be used as an indication of program effectiveness.
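The computation behind this design is simply the mean difference between the paired self-ratings, as sketched below in Python with entirely hypothetical survey responses on the 1-to-5 scale.

    from statistics import mean

    # Illustrative only: each student's retrospective "before" rating and
    # "after" rating, both collected after the course.
    retrospective_before = [2, 3, 1, 2, 2, 3, 1, 2]   # hypothetical responses (1-5)
    after = [4, 4, 3, 3, 4, 5, 2, 4]

    differences = [post - pre for pre, post in zip(retrospective_before, after)]
    print(f"mean self-reported improvement: {mean(differences):.2f} points on a 5-point scale")

Because both ratings are collected after the program, the estimate rests entirely on participants' recollection, which is why this design remains weak on attribution.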

Post-program-only with Difference Estimate Design

This is the weakest of the implicit designs and can be illustrated as follows.

 

                        Exposure to Program    Measurement After

Treatment Group         X                      0 = (02 - 01)

 

Here, the respondent directly estimates the incremental effect of the program. For instance, firm representatives might be asked how many jobs resulted from a grant, or students in an English course might be asked what or how much they learned. This design differs from the retrospective pre-program design in that respondents directly answer the question "What effect did the program have?"

Strengths and Weaknesses

  • Implicit designs are flexible, versatile and practical to implement.

Because of their limited requirements, implicit designs are always feasible. Program participants, managers or experts can always be asked about the results of the program. Indeed, this may be a drawback in that "easy" implicit designs are often used where, with a little more effort and ingenuity, more rigorous implicit or even quasi-experimental designs might have been implemented.

  • Implicit designs can address virtually any issue and can be used in an exploratory manner.

Program participants or managers can be asked any question about the program. While obviously weak in dealing with more objective estimates of program outcomes and attribution, an implicit design may well be able to answer questions about program delivery. In the case of a service program, for example, implicit designs can address questions about the extent of client satisfaction. Furthermore, a post-program survey may be used to identify a number of program outcomes that can then be explored using other evaluation research strategies.

  • Implicit designs offer little objective evidence of the results caused by a program.

Conclusions about program results drawn from implicit designs require major assumptions about what would have happened without the program. Many major threats to internal validity exist (such as history, maturation and mortality) and must be eliminated one by one.

  • Where attribution (or incremental change) is a significant evaluation issue, implicit designs should not be used alone; rather, they should be used with multiple lines of evidence.

3.5 Use of Causal Models in Evaluation Designs

Section 2.2 and this chapter have stressed the conceptual nature of the ideal or classical evaluation design. In this design, the possible cause of a particular program outcome is isolated through the use of two groups that are equivalent in all respects except for the presence of the program. Based on this ideal design, alternative designs that allow results to be attributed to programs were described, along with the degree to which each supports causal inference and the threats to internal validity associated with each.

An alternative way of addressing the issues of causal inference involves the use of a causal model: an equation that describes the marginal impact of a set of selected independent variables on a dependent variable. While quasi-experimental designs focus on comparisons between program recipients and one or more control groups, causal models focus on the variables to be included in the model, both endogenous (intrinsic to the program) and exogenous (outside the program), and on their postulated relationships. For quasi-experimental designs, the program is of central interest; for causal models, the program is only one of several independent variables that are expected to affect the dependent variable.

Take, for example, the evaluation of an industrial support program that compares export sales by firms that are program recipients and sales by firms that are not. In this case, a causal model would take into account variables such as the industrial sector in which the firm operates, the size of the firm, and whether the firm was a program beneficiary. Using regression analysis, the evaluator could then determine the marginal impact of each of these variables on a firm's export sales.
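A minimal sketch of such a model is shown below in Python, with hypothetical data for the industrial support example: export sales are regressed on firm size, a sector indicator and a program participation indicator, all with assumed effect sizes, so that the program appears as one explanatory variable among several.

    import numpy as np

    rng = np.random.default_rng(11)

    # Illustrative only: hypothetical firm characteristics and an assumed
    # marginal impact of program participation on export sales.
    n = 500
    size = rng.normal(200, 50, n)                    # firm size (assumed scale)
    sector = rng.integers(0, 2, n).astype(float)     # 0/1 sector indicator (assumed)
    program = rng.integers(0, 2, n).astype(float)    # 1 if the firm received support

    exports = 100 + 0.8 * size + 30 * sector + 25 * program + rng.normal(0, 20, n)

    X = np.column_stack([np.ones(n), size, sector, program])
    coef, *_ = np.linalg.lstsq(X, exports, rcond=None)

    print(f"estimated marginal impact of the program: {coef[3]:.1f} (assumed true value = 25)")

The coefficient on the program indicator is the estimated marginal impact of participation, holding the other variables in the model constant.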

Similarly, an evaluation of a program that provides grants to cultural organizations in various communities might compare (a) changes in attendance at cultural events over time in communities receiving large grants per capita and (b) attendance changes in those with lower grants. A causal model involving the effects of the community's socio-economic profile, cultural infrastructure and historical attendance patterns on current attendance levels could be generated. The data thereby derived could be used in place of or in addition to the comparison approach which has been discussed thus far.

In practice, most evaluators will want to use both causal and comparative approaches to determine program results. Quasi-experimental designs can be used to construct and manipulate control groups and, thereby, to make causal inferences about program results. Causal models can be used to estimate the marginal impact of variables that affect program success. Bickman (1987) and Trochim (1986) offer useful advice on how best to make use of causal models in evaluative work.

Causal models are best suited to situations where sufficient empirical evidence has confirmed, before the evaluation, the existence of a relationship between the variables of interest. In the absence of an a priori model, the evaluator should employ matching (blocking), as described in sections 3.2.2 and 3.3.2, to capture data for variables thought to be important. In addition, statistical analyses can be used to control for selection or history biases, rendering the conclusions about program impacts more credible.

Evaluators who use causal models should consult Chapter 7 of Cook and Campbell's book, Quasi-experimentation (1979), for a discussion of the pitfalls to avoid in attempting to make causal inferences based on "passive observation" (where there is no deliberate formation of a control group). Two of the more common pitfalls mentioned are inadequate attention to validity threats and the use of structural models that are suitable for forecasting but not for causal inference.

References: Causal Models

Bickman, L., ed. Using Program Theory in Program Evaluation. V. 33 of New Directions in Program Evaluation. San Francisco: Jossey-Bass, 1987.

Blalock, H.M., Jr., ed. Causal Models in the Social Sciences. Chicago: Aldine, 1971.

Blalock, H.M., Jr. Measurement in the Social Sciences: Theories and Strategies. Chicago: Aldine, 1974.

Chen, H.T. and P.H. Rossi. "Evaluating with Sense: The Theory-Driven Approach," Evaluation Review. V. 7, 1983, pp. 283-302.

Cook, T.D. and D.T. Campbell, Quasi-experimentation. Chicago: Rand-McNally, 1979, chapters 4 and 7.

Cordray, D.S. "Quasi-experimental Analysis: A Mixture of Methods and Judgement." In Trochim, W.M.K., ed. Advances in Quasi-experimental Design and Analysis. V. 31 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1986, pp. 9-27.

Duncan, B.D. Introduction to Structural Equation Models. New York: Academic Press, 1975.

Goldberger, A.S. and D.D. Duncan. Structural Equation Models in the Social Sciences. New York: Seminar Press, 1973.

Heise, D.R. Causal Analysis. New York: Wiley, 1975.

Mark, M.M. "Validity Typologies and the Logic and Practice of Quasi-experimentation." In Trochim, W.M.K., ed. Advances in Quasi-experimental Design and Analysis. V. 31 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1986, pp. 47-66.

Rindskopf, D. "New Developments in Selection Modeling for Quasi-experimentation." In Trochim, W.M.K., ed. Advances in Quasi-experimental Design and Analysis. V. 31 of New Directions for Program Evaluation. San Francisco: Jossey-Bass, 1986, pp. 79-89.

Simon, H. "Causation." In D.L. Sill, ed. International Encyclopedia of the Social Sciences, V. 2. New York: Macmillan, 1968, pp. 350-355.

Stolzenberg, J.R.M. and K.C. Land. "Causal Modeling and Survey Research." In Rossi, P.H.,et al., eds. TITLE MISSING. Orlando: Academic Press, 1983, pp. 613-675.

Trochim, W.M.K., ed. "Advances in Quasi-experimental Design and Analysis." V. 31 of New Directions in Program Evaluation. San Francisco: Jossey-Bass, 1986.

3.6 Summary

Choosing the most appropriate evaluation design is difficult. It is also the most important part of selecting an evaluation strategy, since the accuracy of the evidence produced in any evaluation rests, in large part, on the strength of the design chosen. Because of this, the evaluator should try to select as strong a design as possible, bearing in mind time, money and practicability constraints. The design selected should be the one that comes as close to the ideal (experimental) design as is feasible. As the evaluator moves from experimental to quasi-experimental to implicit designs, the rigour of the evaluation design and the credibility of the findings will suffer. Regardless of the design chosen, it is desirable that the causal model approach be incorporated into the evaluation design, to the extent possible, to support the credibility of the findings.

Often, a relatively weak design is all that is possible. When this is the case, evaluators should explicitly identify any and all major validity threats affecting the conclusions, thus appropriately qualifying the evaluation's findings. As well, evaluators should search in earnest for additional designs that can support the conclusions reached, reduce any validity threats, or do both.

  • In summary, evaluators should explicitly identify the type of evaluation design used for each evaluation strategy.

Sometimes, evaluations are carried out without a clear understanding of which design is being used. As a result, the credibility of the resulting evidence is weakened since the basis of "proof" is not well understood. By identifying the design explicitly, the evaluator makes it possible to discuss the major relevant threats openly, and to develop logical arguments or other counter-evidence to reduce, eliminate or account for the impact of these threats. The result is a better evaluation.

  • For each research design used, the evaluator should list each of the major plausible threats to validity that may exist and discuss the implications of each threat.

The literature disagrees about which threats to validity are generally eliminated by which designs. Cronbach (1982), in particular, questions many of the statements on validity threats made by the more traditional writings of Cook and Campbell (1979). Such debates, however, are less frequent when specific evaluations and their designs are being discussed. In any particular case, it is usually clear whether there are plausible alternative explanations for any observed change.


