Basic Concepts for Data Analysis

Data Desk is a general tool that can work with many kinds of data from a wide variety of sources, so it uses general terminology. Many fields and some kinds of programs use specialized terms for the same ideas. The definitions below note some of these alternatives:

RELATIONS are usually represented in statistics programs as rectangular tables of data in which each row represents a case and each column represents a variable. They may be referred to as a data table or data frame. The term “tidy data” has been used to describe data in this form. Formally, a relation is the set of cases, the set of variables, and the values recorded for each case on each variable. A spreadsheet is a typical representation of a single relation. Some writers use the term rectangular dataset to mean the same thing.
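
As a rough illustration of the formal definition, a relation can be sketched in Python as a set of cases, a set of variables, and one value per case per variable. The data and the names used here (cases, variables, values, row, is_rectangular) are hypothetical and chosen only for this example; they do not describe how Data Desk stores data.

```python
# Hypothetical example data; none of these names come from Data Desk.
cases = ["Ann", "Bob", "Cy"]          # one entry per case (row)
variables = ["age", "height_cm"]      # one entry per variable (column)

# One list of values per variable, ordered the same way as `cases`.
values = {
    "age": [34, 27, 45],
    "height_cm": [165.0, 180.5, 172.0],
}


def is_rectangular(values, n_cases):
    """Return True if every variable holds exactly one value per case."""
    return all(len(column) == n_cases for column in values.values())


def row(case_index):
    """Collect the values recorded for one case across all variables."""
    return {name: values[name][case_index] for name in variables}


print(is_rectangular(values, len(cases)))
print(row(1))
```

Reading down a list in values gives a variable; calling row gives a case. Both views describe the same relation.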

Most statistics and graphics operations make sense only for values that are related, that is, for values measured on the same individuals in the same order. Data Desk often requires that variables analyzed together be in the same relation. Even if two variables are, in fact, recorded for the same cases, you must place their icons in the same relation to analyze them together.
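
The rule can be pictured with a small Python sketch in which each variable carries a label naming its relation, and values are paired case by case only when the labels match. The Variable class, its relation field, and the paired_values function are assumptions made for this illustration; they are not part of Data Desk.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    name: str
    relation: str   # hypothetical label for the relation that holds this variable
    values: list


def paired_values(x, y):
    """Pair values case by case, but only for variables in the same relation."""
    if x.relation != y.relation:
        raise ValueError(
            f"{x.name} and {y.name} are in different relations "
            f"({x.relation!r} and {y.relation!r})"
        )
    return list(zip(x.values, y.values))


# Hypothetical data: two variables on the same patients, one on products.
height = Variable("height", "patients", [165, 180, 172])
weight = Variable("weight", "patients", [60, 82, 70])
price = Variable("price", "products", [9.5, 3.0, 12.0])

print(paired_values(height, weight))
```

Pairing height with price raises an error here even though both hold three values, because equal length alone does not show that the values refer to the same cases.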

VARIABLES are usually represented as columns of values. In a database program, a variable would be a field. In a spreadsheet, a variable would commonly be a column. The term variable is quite standard in statistics. It suggests that the values gathered together represent some underlying phenomenon.

CASES are the individuals to which the data values refer. A case may be called by another name according to the circumstances. Thus we speak of a survey’s respondent, a psychology experiment’s subject or participant, a study’s observation, or a period in time-sequenced data. Each of these entities would commonly be a single case in a relation. In a database program, a case would be a record with values for each field. When variables are represented as columns of values, cases are represented as rows across adjacent variables. That’s how they appear when you open variables from the same relation in Data Desk.
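
The correspondence between database records and rows across columns can be sketched in Python. The survey data and the function names to_columns and to_records are hypothetical, and the sketch assumes every record has the same fields.

```python
# Hypothetical survey data: each record is one case (a respondent).
records = [
    {"respondent": "R1", "score": 7},
    {"respondent": "R2", "score": 9},
]


def to_columns(records):
    """Turn a list of records (cases) into one list of values per variable."""
    names = list(records[0])
    return {name: [rec[name] for rec in records] for name in names}


def to_records(columns):
    """Turn per-variable columns back into one record per case."""
    names = list(columns)
    n_cases = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(n_cases)]


columns = to_columns(records)
print(columns)
print(to_records(columns) == records)
```

The two layouts carry the same information: a record is a row read across the columns, and a column is one field read down the records.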

In statistics, the group of cases is often a sample drawn from a larger population. Data Desk makes no particular assumptions about the sample, except that the inferential statistics it computes assume that the sample is representative of the population and usually that it has been selected with suitable randomization.

VALUES are the elements that fill the variable-by-case structure. Each variable in a relation has a value (which may be “missing”) for each case, and each case has a value for each variable.
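
A short Python sketch shows why a missing value still occupies a position in the structure. Here None stands in for “missing”, and the mean is taken over the values that are present. The marker, the data, and the function names are assumptions for this example; Data Desk's own handling of missing values may differ.

```python
MISSING = None  # hypothetical marker for a missing value

# One value per case; the second case has no recorded score.
scores = [7, MISSING, 9, 8]


def present(values):
    """Return only the values that are not missing."""
    return [v for v in values if v is not MISSING]


def mean_of_present(values):
    """Average the non-missing values; return MISSING if there are none."""
    kept = present(values)
    return sum(kept) / len(kept) if kept else MISSING


print(len(scores))              # the variable still has one entry per case
print(mean_of_present(scores))
```

Keeping the placeholder preserves the alignment between cases and values, so the third score still belongs to the third case.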