Derived Variables

A Derived Variable is a variable defined by a function or expression in terms of other variables. Derived variables are used in the same way as any other variable. However, whenever any argument of a derived variable changes, the values of the derived variable update immediately. And, of course, the fact of that change is communicated to all windows (and other derived variables) in which the updated variable is an argument. Thus the update flows through your analysis.

Then Data>New>Derived Variable command opens a blank derived variable and opens its editing window. Type any expression there. You can use other variables in your expressions by enclosing their names in single quotation marks. Drag a variable into the editing window as a quick way to enter its name in the expression you are typing at the current text editing point.

The Show Numbers command in the derived variable’s editing window calculates the value of the expression and shows the results.

The correct relation for a derived variable may not be clear to Data Desk until the variable is evaluated. Therefore, Data Desk places the derived variable in the Derived folder in the File Cabinet. You may move it to the correct relation (because you are likely to know that.) Drag the Icon alias from the editing window as a quick way to do that.

Data Desk parses your expression when you close the derived variable editing window (or when you request the values with an evaluation or by using the variable.) At that time, it may report errors if your expression is not correctly phrased.

Data Desk permits two (or more) variables to have the same name. If you’ve used such a variable, Data Desk may ask you to identify which instance of the variable is the one you intende in your expression. Select the icon of the instance you intend and close the editing window (to evaluate the expression). Or simply rename your variables to reduce confusion.

Derived variables are also created with the Manip>Transform commands. These commands apply a single function to all selected variables and create a new derived variable for each. Because it knows the variables and the function, Data Desk can place the resulting derived variables in their proper Relation. These functions are the common way to re-express a variable or to compute simple functions of variables.

Evaluate Derived Variable

Select the icon of a derived variable and choose Data>Evaluate Derived Variable to calculate the values for a derived variable. Data Desk will make a new ordinary variable that holds the values. This instance of the derived variable will not update when the derived variable updates. It is an ordinary variable and can be edited or otherwise modified.

Transform Variables

The Manip>Transform menus hold common transformations of variables. To use these command, first select variables to transform. Then choose the command. The resulting derived variables are placed in their appropriate relation.

Common Re-expressions

You may wish to re-express variables to improve the shape of their distribution, the consistency of spread across groups, the linearity of relationships, or the additivity of a model. The most common re-expressions (from the ‘ladder of powers’) are found at the top of the Transform menu for convenience.

If you are not certain of the best re-expression, consider the Dynamic re-expressions as an effective way to find one.

The re-expressions found here are:

√y

Log(y) (Base-10 logarithm)

1/y

1/√y

Arithmetic Functions

y + x addition

y ‘ x subtraction

y * x multiplication

y / x division

y DIV x integer divide; truncates result to an integer

y MOD x remainder of y/x

Lesser(y, x) pairwise minimum

Greater(y, x) pairwise maximum

neg (y) = -y

Rounding Functions

ABS Absolute value; |y|.

INT Integer part, sometimes denoted [y]. The whole number nearer to zero or equal to the argument value. Int(-2.5) = -2.0. Int(2.5) = 2.

Floor The whole number less than or equal to the argument value. Floor(-2.5) = -3.0.

Ceiling The whole number greater than or equal to the argument value. Ceiling(-2.5) = -2.0.

Sign, SGN The sign of its argument. Returns ‘1, 0, or 1 according to whether its argument is negative, zero, or positive, respectively.

RoundEven The value rounded to the nearest even whole number. The fraction .5 rounds to the nearest even whole number: RoundEven(2.5) = 2.0. RoundEven(-2.5) = -2.0.

RoundUp The value rounded up to the nearest whole number. The fraction .5 rounds up to the next largest whole number: RoundUp(2.5) = 3.0. RoundUp(-2.5) = -2.0.

RoundDown The value rounded down to the nearest whole number. The fraction .5 rounds down to the next smallest whole number: RoundDown (2.5) = 2.0. RoundDown(-2.5) = -3.0.

Logical Functions

Logical values are maintained internally as Boolean values. Data Desk translates Boolean values to numbers according to the rule FALSE = 0, TRUE = 1, producing indicator variables suitable for use as selectors.

y < x Returns TRUE (1) for cases in which y < x.

y <= x Returns TRUE (1) for cases in which y <= x.

y = x Returns TRUE (1) for cases in which y = x.

y >= x Returns TRUE (1) for cases in which y >= x.

y > x Returns TRUE (1) for cases in which y > x.

y <> x Returns TRUE (1) for cases in which y <> x.

NOT Negates the truth value of its argument. When applied to variables rather than expressions, it first converts the variable to a logical expression using the rule that 0 means FALSE and anything else means TRUE.

y AND x Logical AND of two Boolean arguments. When applied to variables rather than expressions, it first converts the variables to logical expressions using the rule that 0 means FALSE and anything else means TRUE.

y OR x Logical OR of two Boolean arguments. When applied to variables rather than expressions, it first converts the variables to logical expressions using the rule that 0 means FALSE and anything else means TRUE.

y XOR x Logical exclusive OR of two Boolean arguments. When applied to variables rather than expressions, it first converts the variables to logical expressions using the rule that 0 means FALSE and anything else means TRUE.

y DIFF x Logical difference of two Boolean arguments; y DIFF x is the same as y AND (NOT x).

Collapsing Functions

Collapsing functions take a variable as an argument, but produce a single value. Often they are used as part of a larger expression typed into a derived variable.

Min The minimum value of the argument.

Max The maximum value of the argument.

Sum Sum of cases in the variable.

SSQ Sum of squares of cases in the variable.

Mean Sample average or mean of the variable.

StDev, Sdev Standard deviation.

StdError Standard error of the mean.

Variance Variance of the variable.

NumNumeric The number of numeric values .

NumNonNumeric The number of nonnumeric values.

NumCases The total number of cases.

Corr(y,x) The Pearson Product correlation for the two selected variables.

Probability Functions

Distr commands expect the statistic and any required degrees of freedom and they return the value on the distribution curve. Cum Distr commands expect the statistic and any required degrees of freedom and they return the area under the distribution curve (the prob value). InvCum Distr commands expect the probability value and any required degrees of freedom and they return the statistic.

ZDistr(y) Returns the value on the distribution curve for the z-statistic specified as y.

CumZDistr(y) Returns the probability under the distribution curve, from -inf to y, for the z-statistic specified as y.

InvCumZDistr(y) Returns the z-statistic for the probability value specified as y.

CumTDistr(y, df) Returns the probability under the student’s t distribution curve, from -inf to y, for the specified t-statistic and degrees of freedom.

InvCumTDistr(y, df) Returns the t-statistic for the specified probability value and degrees of freedom.

CumChiDistr(y, df) Returns the probability under the chi-square distribution curve, from -inf to y, for the specified chi-square statistic and degrees of freedom.

InvCumChiDistr(y, df) Returns the chi-square statistic for the specified probability value and degrees of freedom.

CumFDistr(y, df1, df2) Returns the probability under the F distribution curve, from -inf to y, for the specified F statistic and degrees of freedom.

InvCumFDistr(y, df1, df2) Returns the F statistic for the specified probability value and degrees of freedom.

BinomDistr(y, n, p) Returns the binomial probability for the number of successes specified as y, the number of trials specified as n and the probability of success specified as p.

CumBinomDistr(y, n, p) Returns the sum of the binomial probabilities from, y=0 to n, for the number of successes specified as y, the number of trials specified as n and the probability of success specified as p.

PoisDistr(y, lambda) Returns the Poisson probability for number of occurrences specified as y and the average number of occurrences specified as lambda.

CumPoisDistr(y, lambda) Returns the sum of the Poisson probabilities for y=0 to y, for number of occurrences specified as y and the average number of occurrences specified as lambda.

String Functions

String functions operate on text and treat their arguments as text, even if they are numerals. For example, you might want to create a variable that holds only the values to the right of the decimal point. Or you might want to separate the first three characters of a long category code.

Data Desk’s string functions are of two types. The first two commands return, for each case in the variable, a single number. The last three commands return, for each case in the variable, the extracted string.

Len(y) returns the length, in characters, of the variable y.

Example: If the variable Name contains the string John, Len (‘Name’) returns 4.

Pos(y, search string, start position) searches the variable y starting at the character position defined by the third argument, start position, for the first occurence of the string defined by the second argument, search string, and returns the position of the first character of that string. The search string needs to be either a variable or a string contained in double quotes. Pos returns 0 if the string is not found. The first character in the string is position 1, so if you want to search from the beginning of the string the starting position should be 1.

Example: If the variable Name contains the string John, Pos(‘Name’, “oh”, 1) returns 3.

Left (y, k) returns the k leftmost characters from the variable y.

Example: If the variable Name contains the string John, Left(‘Name, 2) returns “Jo”.

Right (y, k) returns the k rightmost characters from the variable y.

Example: If the variable Name contains the string John, Right(‘Name, 2) returns “hn”.

Mid (y, start position, k) searches the variable y starting at the character position defined by the third argument, start position, and returns the k characters from that starting position.

Example: if the variable Name contains the string John, Mid (‘Name’, 2, 2) returns “oh”

Trigonometric Functions

All trigonometric functions work in radians. Trigonometric functions are available from the Manip>Transform>Trigonometric submenu. Here are the supported functions:

sin

cos

tan

arcsin

arccos

arctan

sinh

cosh

tanh

arcsinh

arccosh

arctanh

Date and Time Functions

These functions perform numeric calculations on date and time data. Where any of these functions refers to a date in mm/dd/yy form, Data Desk consults the International settings for your operating system to determine whether to use the alternative dd/mm/yy.

Days(date) Converts dates to days from Jan 1, 1904. For example, 1/1/93 becomes 32509. Years after 2010 must include the century digit (i.e. 2011 as opposed to 11).

Months(date) Converts dates to months from Jan 1, 1904.

Quarters(date) Converts dates to quarters from Jan 1, 1904.

Year(date) Converts dates to decimal years. Decimal year is the fraction of the year, and thus offers a continuous scale suitable for statistical analysis.

Seconds(time) Converts times into seconds since midnight.

Dynamic Functions

Dynamic functions use Sliders to control parameters. By moving the slider control, you can dynamically change the parameters of the function. Use derived variables defined with dynamic functions in plots or analyses in which the Turn on Automatic Update command has been selected from the window’s Global hyperview.

Box-Cox Transformation

These are a family of re-expressions for data that include the common re-expressions at the top of the menu, but slide continuously through the powers (with the logarithm at the 0 power.) The command generates both a derived variable and a slider to control the power. Use the derived variable in a plot or analysis and set the windows to automatic update. Sliding the control will then continuously change the results windows. Plots will move dynamically, making it easy to select the best power for a re-expression.

Mix X and Y

Blends two variables with mixing proportions controlled by a slider. The command creates a slider and derived variable. The blend ranges from 100% of variable X and 0% of variable Y through 50% of each to 0% of variable X and 100% of variable Y. The variables are standardized by subtracting their means and dividing by their standard deviations before mixing, then returned to original scale after mixing.

With a mixed variable you can study the effect of replacing one variable with another in a model. Plot the fit or residuals (and set the plots to automatic update.)

Tukey’s lambda

Tukey’s lambda family of functions are appropriate for counted fractions and percentages. The argument must be values between 0.0 and 1.0. The value of lambda is bounded between 0 and 1 as well. The Lambda family includes reexpressions close to the logit, probit, and arcsine. As with Box-Cox transformations, you can move the slider control and watch plots change smoothly.

Lag

A lagged time series variable is offset by some number of time periods. But it can be difficult to determine the best lag. Lag variables shift up or down by a number of cases specified by the lag parameter in a slider. Negative lags shift up, positive lags shift down. Lags are always integer amounts, so the lagged variable changes only when the lag slider crosses an integer.

Relational Functions

Relational functions provide facilities for looking up values across relations. They thus provide the basic operations on which to build a Relational Data Analysis.

GetCase(y, x)

Each value of x is taken to specify a case number in y. The corresponding case value of y is returned. A constant or an expression evaluating to a constant may take the place of x.

Thus, GetCase (‘income’, 5) returns the fifth value in the variable ‘income’. Non-integral case numbers are tuncated. Case numbers of zero or less or case numbers greater than the number of cases in y return the missing value code.

LookUp(y, x)

If k is a constant, and y is a variable, LookUp(y, k) is the case number of a case of y for which y = k.

If ‘text’ is a quoted string then LookUp(y, ‘text’) is the case number of a case of y for which TextOf(y) = ‘text’.

If no match is found, LookUp returns missing.

If x is a variable, then LookUp(y, x) returns for each element of x, the case number of a case of y for which y equals the corresponding value in x.

LookUpLast(y, x)

If k is a constant, and y is a variable sorted in ascending order, LookUpLast(y, k) is the case number of the last case of y for which y <= k.

If x is a variable, then LookUpLast(y, x) returns a value for each element of x.

LookUpFirst(y, x)

If k is a constant, and y is a variable sorted in ascending order, LookUpFirst(y, k) is the case number of the first case of y for which y >= k.

If x is a variable, then LookUpFirst(y, x) returns a value for each element of x.

LookUpFloor(y, x)

If k is a constant, and y is a variable sorted in ascending order, LookUpFloor(y, k) is the case number of the case before the last case of y for which y <= k.

If x is a variable, then LookUpFloor(y, x) returns a value for each element of x.

LookUpCeil(y, x)

If k is a constant, and y is a variable sorted in ascending order, LookUpCeil(y, k) is the case number of the case before the last case of y for which y >= k.

If x is a variable, then LookUpCeil(y, x) returns a value for each element of x.

LookUpSorted(y, x)

When the data in y is text and is sorted in ascending alphabetic order, this function provides faster lookups than the LookUp command.

Miscellaneous functions

Concatenate

Concatenate concatenates the text of its two arguments adjacently. Thus, ‘Data’&’Desk’ yields ‘DataDesk’. The function operates on two text strings or on two variables (interpretd as holding text).

CumSum

Cumulative Sum returns a variable in which the first case is the same as the first case of the argument variable. The second case is the sum of the first and second cases. The third case is the sum of the first three, and so on.

NScores

Returns values where the ith NScore is the median of the sampling distribution of the ith order statistic based on a sample of size n drawn from a standard Normal distribution. The Nscores are the horizontal axis of a Normal Probability Plot.

Rank

Returns the rank of each case. Lowest number or earliest alphabetic gets rank 1. Ranks either numerically or alphabetically according to the type of its argument.

ZScores

Returns the standardized values, based on the equation (x-mean(x))/StDev(x), for each case.

Cross

The Cross command creates a HotResult variable, named Cross, holding structural information about all combinations of categories of the selected variables. The selected variables are treated as category variables and interpreted according to their text rather than numeric values. Each combination of categories from two or more variables is commonly called a ‘cell.’ The information held in the Cross variable is the ‘cross’ part of concepts such as ‘cross-tabulation.’ That is, it holds a case for each possible combination of categories from the selected variables whether or not any observation falls in each cell.

gamma

Returns the value of the gamma function for the selected argument. The argument can be a single value or a variable.

lngamma

Returns the value of the natural log of the gamma function for the selected argument. The argument can be a single value or a variable.

Numeric

Returns 1 for all values that are numeric and 0 for all others. Helpful for creating selector variables that identify missing values. Argument can also be a scalar, for example: Numeric(3) returns 1 but Numeric (0/0) returns 0, because zero divided by zero returns a missing value.

CaseNum

Returns the case number for each case. If the argument is a scalar, it returns a variable that counts from 1 to the value. For example CaseNum(500), returns a variable holding 1,2,3,…500.

NameOf

Returns the name of the selected variable. If the argument is a variable socket, it returns the name of the variable currently plugged into the socket.. This command is useful for Action Language Programs.

CoerceToRelation(y, x)

If y and x are in two different relations but have the same number of cases, creates a derived variable that returns the values of y. The derived variable resides in x’s relation. If y is a scalar and x is a vector, returns a vector in x’s relation with each case equal to the value of y.

NumCats

Returns the number of categories in the selected variable. The argument must be a variable and not an expression.