Institutional Research

Institutional Research - Definitions and Explanations

All research conducted by the OIRE is governed by the Association for Institutional Research (AIR) Code of Ethics.

Facts and Figures: Where do they come from?

All colleges in the Connecticut Community College System use a system called Banner for tracking information on students, staff, courses and finances. Information is constantly entered and updated. At the start of each fall semester, the data are “frozen.” The freeze, or census, date is set by the College. On that day, a full set of Banner data is extracted. This snapshot of data is used to satisfy state and federal reporting requirements and to provide a point of comparison for year-over-year and semester-to-semester studies. Banner data are also frozen at the start and end of all other semesters. For state and federal reporting purposes, the fall census-date extract is most often used.

Non-credit student data presented on the OIRE pages are derived from a full year’s worth of data because non-credit enrollment information is not frozen.

A Note About Percentages

Many of the OIRE pages contain tables that can easily be copied and pasted into Microsoft Excel. The tables were prepared in Excel, which automatically rounds each displayed figure up or down to the number of decimal places shown. When Excel sums the percentages, however, it adds the true (unrounded) figures. As a result, the rounded percentages shown in a table will occasionally appear to add up to slightly more or less than the total displayed.
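The effect is easy to reproduce outside Excel. The sketch below is only an illustration: it uses made-up counts (not OIRE data) and Python rather than Excel. It rounds each percentage to one decimal place for display, then compares the sum of the displayed figures with the sum of the true figures.

```python
# Hypothetical category counts, chosen only to illustrate the rounding effect.
counts = [2, 2, 2, 1]
total = sum(counts)

# True percentages keep full precision; displayed percentages are rounded.
true_pcts = [100 * c / total for c in counts]
shown_pcts = [round(p, 1) for p in true_pcts]

sum_of_true = round(sum(true_pcts), 1)    # total computed from unrounded figures
sum_of_shown = round(sum(shown_pcts), 1)  # total a reader gets by adding the display

print("Displayed percentages:", shown_pcts)
print("Total from true figures:", sum_of_true)
print("Total from displayed figures:", sum_of_shown)
```

With these counts the displayed figures add to 100.1 even though the true total is 100.0.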

For an explanation of statistical terms, look at our Statistics Primer.

Learn more about services provided through Institutional Research and Effectiveness.

Institutional Research - Statistics Primer

This page offers some basic information about common statistical analyses you might see in reports published by the Office of Institutional Research or other research providers. It is organized into three sections:

Common Terms
Common Statistical Tests
Further Reading

Common Terms

bell curve - See normal distribution.

bivariate analysis - The analysis of two variables simultaneously to determine if there is a relationship between them.

case - A specific instance of the general thing being studied in scientific research. For example, if cities were being studied, New York might be a case. In database terms, a case is the same as a record.

categorical variable - A variable measured at the nominal or ordinal level of measurement; also known as a discrete variable.

continuous variable - A variable measured at the interval or ratio level of measurement.

control variable - A variable that is held constant in an attempt to further clarify the relationship between two other variables. For example, if one found a relationship between level of education and level of prejudice, it might be useful to control for gender since the relationship between education and prejudice might differ for men and women.

correlation - A quantitative measure of how strongly two variables relate to each other. In colloquial terms, correlation is used as a way to say that two variables are simply related (e.g., geographic location and political party affiliation are correlated). The most common type of correlation is the Pearson correlation. Correlation and causation are not the same thing.

dependent variable - The variable whose values are (partially) affected by one or more other variables in a statistical analysis. The values of the dependent variable depend on the values of independent variables. For example, in a study of whether gender relates to playing on a sporting team, sports participation would be the dependent variable, also known as the outcome or response variable.

independent variable - A variable whose values are taken as given and presumed to affect the values on a dependent variable. For example, in a study of whether gender relates to playing on a sporting team, gender would be the independent variable, also known as the explanatory or predictor variable.

interval measure - A variable in which 1) the values can be rank-ordered, 2) the intervals between adjacent values are equal, and 3) there is no absolute zero point. Examples are SAT scores, IQ scores, and net worth.

mean - A descriptive statistic that measures the "central tendency" of a distribution of values for a given variable. It is computed by summing all observed values and dividing by the number of observations. The mean is sensitive to extremely high or low values (as compared to the rest of the observations).

median - A descriptive statistic that measures the "central tendency" of a distribution of values for a given variable. It represents the "middle" value in a rank-ordered set of observations. For example, in the set 1, 5, 9, 20, 50, the median is 9, while in 1, 5, 9, 20, the median is 7 (the average of the two middle values, 5 and 9). The median is preferred when the mean is being drastically affected by extreme values. In a normal distribution, the mean and median are the same.

mode - A descriptive statistic that measures the "central tendency" of a distribution of values for a given variable. It represents the most frequently observed value in the set of observations and is used for nominal measures. For example, in a group of college students composed of 100 majors in literature, 50 in sociology, 30 in philosophy, and 20 in economics, the modal category is literature.
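The three measures of central tendency defined above can be checked with Python's standard `statistics` module. This is a minimal sketch that reuses the example values from the median and mode entries; the variable names are illustrative only.

```python
from statistics import mean, median, mode

odd_set = [1, 5, 9, 20, 50]
even_set = [1, 5, 9, 20]

# With an odd number of values, the median is the middle value.
median_odd = median(odd_set)
# With an even number of values, it is the average of the two middle values.
median_even = median(even_set)
# The mean is pulled upward by the extreme value 50.
mean_odd = mean(odd_set)

# Majors from the mode example: the mode is the most frequent category.
majors = (["literature"] * 100 + ["sociology"] * 50
          + ["philosophy"] * 30 + ["economics"] * 20)
modal_major = mode(majors)

print(median_odd, median_even, mean_odd, modal_major)
```

In the first set the mean (17) is nearly twice the median (9), which shows how a single extreme value affects the mean but not the median.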

multivariate analysis - The analysis of the simultaneous relationships among three or more variables.

nominal measure - A variable in which the values are simply different from each other and cannot be rank-ordered. Examples are race, gender, and marital status.

nonsampling error - The statistical imprecision in an estimated population parameter that cannot be attributed to the sample used to make the estimate. This imprecision is unavoidable, and difficult or impossible to quantify, so care must be taken to minimize it as much as possible. Sources of nonsampling error in survey research include poorly worded questions, misunderstood questions, question ordering, question response options, incorrectly checked boxes, data entry errors, nonresponse, the provision of false data, and so on.

normal distribution - A symmetrical, bell-shaped curve plotted on two axes. The y-axis represents the number of cases with a particular value on a single variable, and the x-axis represents the value itself. In a normal distribution, the mean, median, and mode are all the same, and 68.3% of the cases fall within one standard deviation of the mean, 95.4% within two standard deviations, and 99.7% within three standard deviations.

null hypothesis - The hypothesis that is directly tested in statistical significance testing. It states that there is no relationship between the variables being analyzed. If one can statistically reject the null hypothesis, then one can conclude with relatively high certainty that the observed relationship is not due to sampling error, but to other reasons, such as theorized causes or unexamined confounding variables.

ordinal measure - A variable in which the values can be rank-ordered, but have no standard unit of measurement. Examples are movie ratings (thumbs up or thumbs down), socioeconomic status (low, middle, or high), and level of appreciation for coffee (love it, like it, or hate it).

population - The set of individuals or other things from which a sample is drawn. Ideally (though infrequently), a population is fully enumerated to ensure the selection of a well-drawn sample. A population is sometimes called a universe.

ratio measure - A variable in which 1) the values can be rank-ordered, 2) the intervals between adjacent values are equal, and 3) there is an absolute zero point. Examples are the Kelvin temperature scale, people's salaries, and the number of children a couple has.

regression coefficient - A measure expressing how an independent variable relates to a dependent variable in a regression model. In linear regression (the most common kind), a coefficient is shown in unstandardized and standardized form. The unstandardized coefficient, B, indicates how much change occurs in the dependent variable when there is a one-unit change in the independent variable. The standardized coefficient, Beta, indicates the same thing, but in terms of the z-scores for both variables rather than their original units. Comparing Beta coefficients is the way to assess the predictive strength of one independent variable against others.
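To make the difference between B and Beta concrete, the sketch below fits a one-predictor linear regression by least squares. The data are hypothetical (days at a retreat and test scores), and the `slope` helper is written here for illustration rather than taken from any statistics package.

```python
from statistics import mean, stdev

# Hypothetical data: days at a retreat (x) and fluency test scores (y).
days = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

def slope(x, y):
    """Least-squares slope for a one-predictor linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Unstandardized coefficient: change in score per additional day.
b = slope(days, scores)
# Standardized coefficient: the same slope expressed in standard deviations.
beta = b * stdev(days) / stdev(scores)

print(f"B = {b:.2f}, Beta = {beta:.2f}")
```

Here B is 0.6 points per day, while Beta is about 0.77 standard deviations of score per standard deviation of days. With a single predictor, Beta equals the Pearson correlation between the two variables.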

sample - A set of cases drawn from and analyzed to estimate the parameters of a population. A simple random sample from an enumerated list of the population of interest is ideal. However, such samples are difficult to obtain, so other sampling techniques are often employed. These fall into two categories: probability sampling (e.g., systematic, stratified, multistage cluster, and probability-proportional-to-size sampling) and nonprobability sampling (e.g., quota, convenience, purposive, and snowball sampling).

sampling error - The statistical imprecision in an estimated population parameter that results from using a random sample to make the estimate. The imprecision comes from the fact that the sample used for a particular estimate is only one of a large number of samples of the same size that could have been selected. If one drew multiple samples of a given size from the same population, the composition of the samples would differ due to random chance and the estimates based on the samples would differ as well. For example, if the true population mean was 50, one sample might provide an estimate of 49, another of 35, still another of 55, and so on up to the maximum number of samples that could be drawn. Statistical significance testing deals with the distribution of all of these many estimated means to determine how likely it would be to get the one particular estimate from the one sample that was selected.
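A small simulation can make sampling error visible. The sketch below assumes a made-up population of 10,000 values with a mean near 50 and draws five random samples of 25 cases each; the seed, population size, and sample size are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population with a true mean close to 50.
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = mean(population)

# Each sample gives its own estimate of the population mean.
sample_means = [mean(rng.sample(population, 25)) for _ in range(5)]

print(f"Population mean: {population_mean:.1f}")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} estimate: {m:.1f}")
```

Each sample estimate differs somewhat from the population mean and from the other estimates, even though every sample was drawn the same way from the same population.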

standard deviation - A unit of measurement that describes how dispersed or spread out a group of values is around their mean. In a normal distribution or bell curve, 68.3% of the cases fall within one standard deviation of the mean, 95.4% within two standard deviations, and 99.7% within three standard deviations.
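The shares of cases within one, two, and three standard deviations can be verified with `NormalDist` from Python's standard `statistics` module. The helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {100 * share_within(k):.2f}%")
```

The two-standard-deviation share is about 95.45%, which is why it appears in print as either 95.4% or 95.5% depending on how it is rounded.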

statistical significance - The probability, p, that an observed relationship between two variables could be attributed to sampling error or random chance alone. By convention, a relationship between two variables is called statistically significant when p < .05. In other words, when there is a relatively small chance (less than 5 in 100) that the observed relationship could be caused by sampling error, then one has identified a statistically significant relationship. The value of p is affected by the size of the sample and the strength of the observed relationship. Thus, it is common for trivial or weak relationships to be statistically significant in large samples. Similarly, strong relationships might not be statistically significant if a small sample is used. In any case, statistical significance should not be confused with substantive significance. See the Further Reading section below for more information about statistical significance testing.

substantive significance - The extent to which a relationship between two variables has an important or practical effect in the real world. For example, a researcher might find a statistically significant relationship between whether students take a math refresher course and their scores on a math placement test, but if the observed relationship between the two variables is such that taking the course results in an average increase of only a point or two, then decision-makers might conclude that the increase is not substantively significant (especially in relation to other concerns, such as the cost of providing math refresher courses).

univariate analysis - The analysis of a single variable for purposes of description.

variable - An attribute or characteristic of a case that is capable of assuming any of a set of values. Examples are colors of cars, breeds of dogs, and salaries of people. In database terms, a variable is the same as a field.

z-score - A standardized unit of measurement that is defined relative to the mean of a variable. This relativity lets researchers compare scores on variables that have different units of measurement. If a value has a z-score of 1, then it is one standard deviation above the mean. For example, if the mean score on a test is 85 with a standard deviation of 5 points, then a student scoring 90 will have a z-score of 1, while a student scoring 80 will have a z-score of -1.
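The z-score calculation in the example above is a single subtraction and division. A minimal sketch, with an illustrative function name:

```python
def z_score(value, mean, standard_deviation):
    """Distance of a value from the mean, in standard deviation units."""
    return (value - mean) / standard_deviation

# Test scores from the example: mean of 85, standard deviation of 5.
z_for_90 = z_score(90, mean=85, standard_deviation=5)
z_for_80 = z_score(80, mean=85, standard_deviation=5)

print(z_for_90, z_for_80)
```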

Common Statistical Tests

The table below shows some common statistical tests used for different combinations of independent and dependent variables at various levels of measurement. Brief descriptions of the tests follow the table. More detailed information about test assumptions, null hypotheses tested, sampling distributions used, and computation of test statistics can be found in any undergraduate or graduate textbook on statistics (see the Further Reading section below).

**Determining Appropriate Statistical Tests**

| Independent Variable | Dependent Variable: Categorical (Two Values) | Dependent Variable: Categorical (Over Two Values) | Dependent Variable: *Continuous* |
| --- | --- | --- | --- |
| Categorical (Two Values) | chi-square (crosstab) or Pearson correlation | chi-square (crosstab) | t-test |
| Categorical (Over Two Values) | chi-square (crosstab) | chi-square (crosstab) | analysis of variance (ANOVA) |
| *Continuous* | logistic regression | multinomial logistic regression | Pearson correlation or linear regression |

analysis of variance (ANOVA) - Used to test for differences among the means of three or more groups. For example, one would use ANOVA to see if the average contribution to disaster relief differed among liberals, moderates, and conservatives. In this case, the amount of money donated is the dependent variable (measured at the ratio level) and political orientation is the independent variable (measured at the nominal level). If the ANOVA's F statistic is statistically significant, then one can say that at least one mean differs from one of the others. One cannot say which means differ. To make that determination, one uses a post-hoc test (e.g., Bonferroni) to statistically compare the pairs. In this example, there are three pairs to test (i.e., liberal/moderate, liberal/conservative, and moderate/conservative).
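The F statistic compares the variation between group means with the variation within groups. The sketch below computes it by hand for three small groups of invented donation amounts; in practice a statistics package would also report the p-value.

```python
from statistics import mean

# Hypothetical donations (in dollars) for three political orientations.
groups = {
    "liberal": [10, 12, 14],
    "moderate": [20, 22, 24],
    "conservative": [15, 17, 19],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# Variation of the group means around the grand mean.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# Variation of individual values around their own group mean.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_statistic = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

A large F, such as the 18.75 produced by these made-up figures, means the group means differ by much more than the spread within groups would suggest. It does not say which pairs of groups differ.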

chi-square (crosstab) - Used to see whether two categorical variables relate to each other. For example, the crosstab below fictitiously shows how people's race relates to the type of music they like best. By putting race (independent variable) in the columns, music type (dependent variable) in the rows, and comparing the column percentages across the rows, we see that whites are most likely to choose rock as their favorite music, blacks are most likely to choose R&B, and Hispanics and those of other races are most likely to choose "other" types of music (e.g., jazz, world, or classical).

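**Fictitious Data on Race and Musical Preference**

Columns are race categories. Each cell shows the count followed by the column percentage.

| Music | White | Black | Hispanic | Other | Total |
| --- | --- | --- | --- | --- | --- |
| *Rock* | 20 (50%) | 2 (10%) | 4 (20%) | 4 (20%) | 30 (30%) |
| *R&B* | 4 (10%) | 12 (60%) | 4 (20%) | 4 (20%) | 24 (24%) |
| *Other* | 16 (40%) | 6 (30%) | 12 (60%) | 12 (60%) | 46 (46%) |
| Total | 40 (100%) | 20 (100%) | 20 (100%) | 20 (100%) | 100 (100%) |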

In the example above, the differences in the percentages are fairly large, making the interpretation of the crosstab easy. Interpreting smaller crosstabs, like a 2 x 2 table composed of two binary variables, is also easy. However, difficulties arise when tables are large, differences are small, or both. To help interpret such tables, looking at the statistical significance of the chi-square statistic is useful. If the chi-square is statistically significant, then one knows there is at least one statistical difference in the crosstab. On the other hand, if it is not statistically significant, then from a statistical perspective there are no differences to be found.
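As an illustration, the chi-square statistic for the fictitious crosstab above can be computed by comparing each observed count with the count expected if race and musical preference were unrelated. This is a hand-rolled sketch; the critical value of 12.592 is the standard tabled value for 6 degrees of freedom at the .05 level.

```python
# Observed counts from the fictitious crosstab.
# Columns: White, Black, Hispanic, Other.
observed = {
    "Rock":  [20, 2, 4, 4],
    "R&B":   [4, 12, 4, 4],
    "Other": [16, 6, 12, 12],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
n = sum(row_totals)

chi_square = 0.0
for row, row_total in zip(rows, row_totals):
    for count, col_total in zip(row, col_totals):
        # Expected count if the two variables were unrelated.
        expected = row_total * col_total / n
        chi_square += (count - expected) ** 2 / expected

df = (len(rows) - 1) * (len(col_totals) - 1)
CRITICAL_VALUE_05_DF6 = 12.592  # tabled chi-square value, df = 6, p = .05

print(f"chi-square = {chi_square:.2f}, df = {df}")
print("Significant at .05:", chi_square > CRITICAL_VALUE_05_DF6)
```

For these data the statistic is about 26.8, well above the critical value, so the relationship would be called statistically significant. Several expected counts here are below 5, a situation in which analysts often treat the chi-square result with extra caution.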

linear regression - A technique that allows one to examine how a set of continuous variables relates to a continuous dependent variable. Linear regression assumes that the effects of the independent variables are additive, and that the relationships between the independent and dependent variables are linear. One use of regression analysis would be if a researcher were interested in how scores on a test of English fluency are affected by the number of days spent on an English immersion retreat. The control variables used might include how many years a person has been learning English and how old a person is. The primary output generated through regression analysis is a table of regression coefficients. These allow one to estimate how much effect one variable has on the dependent variable (independent of the effects of the control variables).

logistic regression - A technique that allows one to examine how a set of continuous variables relates to a dichotomous (a.k.a. binary, indicator, or dummy) dependent variable. Logistic regression assumes the effects of the independent variables are additive (on the log-odds scale). One use of logistic regression would be if a researcher were interested in whether developing lung cancer is dependent on whether people worked in a chemical manufacturing plant. The control variables used might include how many years people smoked and whether they have a history of lung cancer in the family. In contrast to the regression coefficients produced by linear regression, those in logistic regression relate to the odds of something being the case versus not being the case (e.g., having lung cancer versus not having lung cancer). Output produced in logistic regression analysis includes odds ratios, which are defined as e, the base of the natural logarithm, raised to the power of B. In other words, an odds ratio is e^B, where e is approximately 2.72 and B is a regression coefficient produced in the logistic regression analysis. The odds ratio is interpreted as the factor by which the odds of something being the case are changed with a one-unit change in an independent variable. For example, if the odds ratio associated with working in a chemical manufacturing plant were 2, then the odds of having lung cancer for someone who worked in a plant would be two times the odds for someone who did not.
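The relationship between a logistic regression coefficient and its odds ratio can be sketched in a few lines. The coefficient and the 10% baseline probability below are assumptions chosen for illustration, not results from any study.

```python
import math

def odds_ratio(b):
    """Odds ratio for a logistic regression coefficient B."""
    return math.exp(b)

def odds_to_probability(odds):
    return odds / (1 + odds)

# A coefficient of ln(2), about 0.693, corresponds to an odds ratio of 2.
b = math.log(2)

# Hypothetical baseline: a 10% probability among people who did not work in a plant.
baseline_probability = 0.10
baseline_odds = baseline_probability / (1 - baseline_probability)

exposed_odds = baseline_odds * odds_ratio(b)
exposed_probability = odds_to_probability(exposed_odds)

print(f"Odds ratio: {odds_ratio(b):.2f}")
print(f"Probability rises from {baseline_probability:.1%} to {exposed_probability:.1%}")
```

Doubling the odds does not double the probability: under these assumed figures the probability moves from 10% to about 18%.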

multinomial logistic regression (MLR) - An extension of logistic regression that allows one to analyze dependent categorical variables with three or more value categories (rather than dependent variables that are simply binary). The interpretation of MLR coefficients is the same as in logistic regression, though one of the categories in the dependent variable is taken as the reference category throughout the entire analysis. For example, if a market researcher were studying alcohol preference, he or she might have an outcome variable with four potential values: beer, wine, liquor, and dislikes alcohol. The researcher might want to compare the odds of liking one of the first three categories to disliking alcohol altogether. In that case, the last category would be the reference category, and the regression output would result in three tables of coefficients (for beer, wine, and liquor, respectively).

Pearson correlation - Used to test whether a linear relationship exists between two continuous variables and measure how strong the relationship is. The Pearson correlation coefficient, r, ranges from -1 to 1, where -1 is a perfect negative relationship (as one variable goes up, the other goes down), 0 is no linear relationship between the variables, and 1 is a perfect positive relationship (as one variable goes up, so does the other one). For example, one could use a Pearson correlation to see if there is a relationship between the number of years of education people have and their income. If one found a statistically significant value of r = .21, then one would conclude that there is a weak positive relationship between the two variables: people with more education tend to have higher incomes than people with less education, and vice versa. The correlation is weak because r² = .04, which means that each variable explains only 4% of the variation in the other. In other words, 96% of the variation must be explained by other factors.
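For readers who want to see the calculation, the sketch below computes r from its definition and then squares it. The education and income figures are invented and produce a much stronger correlation than the r = .21 in the example above; the last lines square .21 to show where the 4% figure comes from.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data: years of education and income in thousands of dollars.
education = [10, 12, 14, 16, 18]
income = [30, 28, 45, 40, 52]

r = pearson_r(education, income)
print(f"r = {r:.2f}, r squared = {r ** 2:.2f}")

# The weak correlation from the text: r = .21 explains about 4% of the variation.
r_squared_from_text = 0.21 ** 2
print(f"{r_squared_from_text:.2f}")
```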

t-test - Used to test for differences between two group means, or between the mean of one group and a given number, such as a known population parameter or empirical constant. For example, a researcher would use a t-test to find out if exam scores differed for men and women, or determine whether the estimates of the speed of light from a series of physics experiments differed from 299,792,458 m/s. If the t statistic in either example were statistically significant, then one would conclude that there was a statistical difference between the average scores of men and women, or that the average estimate of the speed of light statistically differed from the scientifically accepted constant.
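A two-group t statistic can be computed by hand as the difference between the group means divided by its standard error. The sketch below uses invented exam scores and the pooled-variance (equal variances assumed) form of the test, which is one of several variants; the critical value of 2.306 is the standard two-tailed tabled value for 8 degrees of freedom at the .05 level.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical exam scores for two groups.
men = [78, 82, 85, 90, 75]
women = [88, 92, 85, 95, 90]

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances, with its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, df

t_statistic, df = pooled_t(men, women)
CRITICAL_T_05_DF8 = 2.306  # tabled two-tailed t value, df = 8, p = .05

print(f"t = {t_statistic:.2f}, df = {df}")
print("Significant at .05:", abs(t_statistic) > CRITICAL_T_05_DF8)
```

With these made-up scores, t is about -2.56, which exceeds the critical value in absolute terms, so the eight-point difference in means would be called statistically significant.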

