Derived Variables
A Derived Variable is a variable defined by a function or expression in terms of other variables. Derived variables are used in the same way as any other variable. However, whenever any argument of a derived variable changes, the values of the derived variable update immediately. And, of course, the fact of that change is communicated to all windows (and other derived variables) in which the updated variable is an argument. Thus the update flows through your analysis.
Then Data>New>Derived Variable command opens a blank derived variable and opens its editing window. Type any expression there. You can use other variables in your expressions by enclosing their names in single quotation marks. Drag a variable into the editing window as a quick way to enter its name in the expression you are typing at the current text editing point.
The Show Numbers command in the derived variable’s editing window calculates the value of the expression and shows the results.
The correct relation for a derived variable may not be clear to Data Desk until the variable is evaluated. Therefore, Data Desk places the derived variable in the Derived folder in the File Cabinet. You may move it to the correct relation (because you are likely to know that.) Drag the Icon alias from the editing window as a quick way to do that.
Data Desk parses your expression when you close the derived variable editing window (or when you request the values with an evaluation or by using the variable.) At that time, it may report errors if your expression is not correctly phrased.
Data Desk permits two (or more) variables to have the same name. If you’ve used such a variable, Data Desk may ask you to identify which instance of the variable is the one you intende in your expression. Select the icon of the instance you intend and close the editing window (to evaluate the expression). Or simply rename your variables to reduce confusion.
Derived variables are also created with the Manip>Transform commands. These commands apply a single function to all selected variables and create a new derived variable for each. Because it knows the variables and the function, Data Desk can place the resulting derived variables in their proper Relation. These functions are the common way to re-express a variable or to compute simple functions of variables.
Evaluate Derived Variable
Select the icon of a derived variable and choose Data>Evaluate Derived Variable to calculate the values for a derived variable. Data Desk will make a new ordinary variable that holds the values. This instance of the derived variable will not update when the derived variable updates. It is an ordinary variable and can be edited or otherwise modified.
Transform Variables
The Manip>Transform menus hold common transformations of variables. To use these command, first select variables to transform. Then choose the command. The resulting derived variables are placed in their appropriate relation.
Common Re-expressions
You may wish to re-express variables to improve the shape of their distribution, the consistency of spread across groups, the linearity of relationships, or the additivity of a model. The most common re-expressions (from the ‘ladder of powers’) are found at the top of the Transform menu for convenience.
If you are not certain of the best re-expression, consider the Dynamic re-expressions as an effective way to find one.
The re-expressions found here are:
√y
Log(y) (Base-10 logarithm)
1/y
1/√y
Arithmetic Functions
y + x addition
y ‘ x subtraction
y * x multiplication
y / x division
y DIV x integer divide; truncates result to an integer
y MOD x remainder of y/x
Lesser(y, x) pairwise minimum
Greater(y, x) pairwise maximum
neg (y) = -y
Rounding Functions
ABS Absolute value; |y|.
INT Integer part, sometimes denoted [y]. The whole number nearer to zero or equal to the argument value. Int(-2.5) = -2.0. Int(2.5) = 2.
Floor The whole number less than or equal to the argument value. Floor(-2.5) = -3.0.
Ceiling The whole number greater than or equal to the argument value. Ceiling(-2.5) = -2.0.
Sign, SGN The sign of its argument. Returns ‘1, 0, or 1 according to whether its argument is negative, zero, or positive, respectively.
RoundEven The value rounded to the nearest even whole number. The fraction .5 rounds to the nearest even whole number: RoundEven(2.5) = 2.0. RoundEven(-2.5) = -2.0.
RoundUp The value rounded up to the nearest whole number. The fraction .5 rounds up to the next largest whole number: RoundUp(2.5) = 3.0. RoundUp(-2.5) = -2.0.
RoundDown The value rounded down to the nearest whole number. The fraction .5 rounds down to the next smallest whole number: RoundDown (2.5) = 2.0. RoundDown(-2.5) = -3.0.
Logical Functions
Logical values are maintained internally as Boolean values. Data Desk translates Boolean values to numbers according to the rule FALSE = 0, TRUE = 1, producing indicator variables suitable for use as selectors.
y < x Returns TRUE (1) for cases in which y < x.
y <= x Returns TRUE (1) for cases in which y <= x.
y = x Returns TRUE (1) for cases in which y = x.
y >= x Returns TRUE (1) for cases in which y >= x.
y > x Returns TRUE (1) for cases in which y > x.
y <> x Returns TRUE (1) for cases in which y <> x.
NOT Negates the truth value of its argument. When applied to variables rather than expressions, it first converts the variable to a logical expression using the rule that 0 means FALSE and anything else means TRUE.
y AND x Logical AND of two Boolean arguments. When applied to variables rather than expressions, it first converts the variables to logical expressions using the rule that 0 means FALSE and anything else means TRUE.
y OR x Logical OR of two Boolean arguments. When applied to variables rather than expressions, it first converts the variables to logical expressions using the rule that 0 means FALSE and anything else means TRUE.
y XOR x Logical exclusive OR of two Boolean arguments. When applied to variables rather than expressions, it first converts the variables to logical expressions using the rule that 0 means FALSE and anything else means TRUE.
y DIFF x Logical difference of two Boolean arguments; y DIFF x is the same as y AND (NOT x).
Collapsing Functions
Collapsing functions take a variable as an argument, but produce a single value. Often they are used as part of a larger expression typed into a derived variable.
Min The minimum value of the argument.
Max The maximum value of the argument.
Sum Sum of cases in the variable.
SSQ Sum of squares of cases in the variable.
Mean Sample average or mean of the variable.
StDev, Sdev Standard deviation.
StdError Standard error of the mean.
Variance Variance of the variable.
NumNumeric The number of numeric values .
NumNonNumeric The number of nonnumeric values.
NumCases The total number of cases.
Corr(y,x) The Pearson Product correlation for the two selected variables.
Probability Functions
Distr commands expect the statistic and any required degrees of freedom and they return the value on the distribution curve. Cum Distr commands expect the statistic and any required degrees of freedom and they return the area under the distribution curve (the prob value). InvCum Distr commands expect the probability value and any required degrees of freedom and they return the statistic.
ZDistr(y) Returns the value on the distribution curve for the z-statistic specified as y.
CumZDistr(y) Returns the probability under the distribution curve, from -inf to y, for the z-statistic specified as y.
InvCumZDistr(y) Returns the z-statistic for the probability value specified as y.
CumTDistr(y, df) Returns the probability under the student’s t distribution curve, from -inf to y, for the specified t-statistic and degrees of freedom.
InvCumTDistr(y, df) Returns the t-statistic for the specified probability value and degrees of freedom.
CumChiDistr(y, df) Returns the probability under the chi-square distribution curve, from -inf to y, for the specified chi-square statistic and degrees of freedom.
InvCumChiDistr(y, df) Returns the chi-square statistic for the specified probability value and degrees of freedom.
CumFDistr(y, df1, df2) Returns the probability under the F distribution curve, from -inf to y, for the specified F statistic and degrees of freedom.
InvCumFDistr(y, df1, df2) Returns the F statistic for the specified probability value and degrees of freedom.
BinomDistr(y, n, p) Returns the binomial probability for the number of successes specified as y, the number of trials specified as n and the probability of success specified as p.
CumBinomDistr(y, n, p) Returns the sum of the binomial probabilities from, y=0 to n, for the number of successes specified as y, the number of trials specified as n and the probability of success specified as p.
PoisDistr(y, lambda) Returns the Poisson probability for number of occurrences specified as y and the average number of occurrences specified as lambda.
CumPoisDistr(y, lambda) Returns the sum of the Poisson probabilities for y=0 to y, for number of occurrences specified as y and the average number of occurrences specified as lambda.
String Functions
String functions operate on text and treat their arguments as text, even if they are numerals. For example, you might want to create a variable that holds only the values to the right of the decimal point. Or you might want to separate the first three characters of a long category code.
Data Desk’s string functions are of two types. The first two commands return, for each case in the variable, a single number. The last three commands return, for each case in the variable, the extracted string.
Len(y) returns the length, in characters, of the variable y.
Example: If the variable Name contains the string John, Len (‘Name’) returns 4.
Pos(y, search string, start position) searches the variable y starting at the character position defined by the third argument, start position, for the first occurence of the string defined by the second argument, search string, and returns the position of the first character of that string. The search string needs to be either a variable or a string contained in double quotes. Pos returns 0 if the string is not found. The first character in the string is position 1, so if you want to search from the beginning of the string the starting position should be 1.
Example: If the variable Name contains the string John, Pos(‘Name’, “oh”, 1) returns 3.
Left (y, k) returns the k leftmost characters from the variable y.
Example: If the variable Name contains the string John, Left(‘Name, 2) returns “Jo”.
Right (y, k) returns the k rightmost characters from the variable y.
Example: If the variable Name contains the string John, Right(‘Name, 2) returns “hn”.
Mid (y, start position, k) searches the variable y starting at the character position defined by the third argument, start position, and returns the k characters from that starting position.
Example: if the variable Name contains the string John, Mid (‘Name’, 2, 2) returns “oh”
Trigonometric Functions
All trigonometric functions work in radians. Trigonometric functions are available from the Manip>Transform>Trigonometric submenu. Here are the supported functions:
sin
cos
tan
arcsin
arccos
arctan
sinh
cosh
tanh
arcsinh
arccosh
arctanh
Date and Time Functions
These functions perform numeric calculations on date and time data. Where any of these functions refers to a date in mm/dd/yy form, Data Desk consults the International settings for your operating system to determine whether to use the alternative dd/mm/yy.
Days(date) Converts dates to days from Jan 1, 1904. For example, 1/1/93 becomes 32509. Years after 2010 must include the century digit (i.e. 2011 as opposed to 11).
Months(date) Converts dates to months from Jan 1, 1904.
Quarters(date) Converts dates to quarters from Jan 1, 1904.
Year(date) Converts dates to decimal years. Decimal year is the fraction of the year, and thus offers a continuous scale suitable for statistical analysis.
Seconds(time) Converts times into seconds since midnight.
Dynamic Functions
Dynamic functions use Sliders to control parameters. By moving the slider control, you can dynamically change the parameters of the function. Use derived variables defined with dynamic functions in plots or analyses in which the Turn on Automatic Update command has been selected from the window’s Global hyperview.
Box-Cox Transformation
These are a family of re-expressions for data that include the common re-expressions at the top of the menu, but slide continuously through the powers (with the logarithm at the 0 power.) The command generates both a derived variable and a slider to control the power. Use the derived variable in a plot or analysis and set the windows to automatic update. Sliding the control will then continuously change the results windows. Plots will move dynamically, making it easy to select the best power for a re-expression.
Mix X and Y
Blends two variables with mixing proportions controlled by a slider. The command creates a slider and derived variable. The blend ranges from 100% of variable X and 0% of variable Y through 50% of each to 0% of variable X and 100% of variable Y. The variables are standardized by subtracting their means and dividing by their standard deviations before mixing, then returned to original scale after mixing.
With a mixed variable you can study the effect of replacing one variable with another in a model. Plot the fit or residuals (and set the plots to automatic update.)
Tukey’s lambda
Tukey’s lambda family of functions are appropriate for counted fractions and percentages. The argument must be values between 0.0 and 1.0. The value of lambda is bounded between 0 and 1 as well. The Lambda family includes reexpressions close to the logit, probit, and arcsine. As with Box-Cox transformations, you can move the slider control and watch plots change smoothly.
Lag
A lagged time series variable is offset by some number of time periods. But it can be difficult to determine the best lag. Lag variables shift up or down by a number of cases specified by the lag parameter in a slider. Negative lags shift up, positive lags shift down. Lags are always integer amounts, so the lagged variable changes only when the lag slider crosses an integer.
Relational Functions
Relational functions provide facilities for looking up values across relations. They thus provide the basic operations on which to build a Relational Data Analysis.
GetCase(y, x)
Each value of x is taken to specify a case number in y. The corresponding case value of y is returned. A constant or an expression evaluating to a constant may take the place of x.
Thus, GetCase (‘income’, 5) returns the fifth value in the variable ‘income’. Non-integral case numbers are tuncated. Case numbers of zero or less or case numbers greater than the number of cases in y return the missing value code.
LookUp(y, x)
If k is a constant, and y is a variable, LookUp(y, k) is the case number of a case of y for which y = k.
If ‘text’ is a quoted string then LookUp(y, ‘text’) is the case number of a case of y for which TextOf(y) = ‘text’.
If no match is found, LookUp returns missing.
If x is a variable, then LookUp(y, x) returns for each element of x, the case number of a case of y for which y equals the corresponding value in x.
LookUpLast(y, x)
If k is a constant, and y is a variable sorted in ascending order, LookUpLast(y, k) is the case number of the last case of y for which y <= k.
If x is a variable, then LookUpLast(y, x) returns a value for each element of x.
LookUpFirst(y, x)
If k is a constant, and y is a variable sorted in ascending order, LookUpFirst(y, k) is the case number of the first case of y for which y >= k.
If x is a variable, then LookUpFirst(y, x) returns a value for each element of x.
LookUpFloor(y, x)
If k is a constant, and y is a variable sorted in ascending order, LookUpFloor(y, k) is the case number of the case before the last case of y for which y <= k.
If x is a variable, then LookUpFloor(y, x) returns a value for each element of x.
LookUpCeil(y, x)
If k is a constant, and y is a variable sorted in ascending order, LookUpCeil(y, k) is the case number of the case before the last case of y for which y >= k.
If x is a variable, then LookUpCeil(y, x) returns a value for each element of x.
LookUpSorted(y, x)
When the data in y is text and is sorted in ascending alphabetic order, this function provides faster lookups than the LookUp command.
Miscellaneous functions
Concatenate
Concatenate concatenates the text of its two arguments adjacently. Thus, ‘Data’&’Desk’ yields ‘DataDesk’. The function operates on two text strings or on two variables (interpretd as holding text).
CumSum
Cumulative Sum returns a variable in which the first case is the same as the first case of the argument variable. The second case is the sum of the first and second cases. The third case is the sum of the first three, and so on.
NScores
Returns values where the ith NScore is the median of the sampling distribution of the ith order statistic based on a sample of size n drawn from a standard Normal distribution. The Nscores are the horizontal axis of a Normal Probability Plot.
Rank
Returns the rank of each case. Lowest number or earliest alphabetic gets rank 1. Ranks either numerically or alphabetically according to the type of its argument.
ZScores
Returns the standardized values, based on the equation (x-mean(x))/StDev(x), for each case.
Cross
The Cross command creates a HotResult variable, named Cross, holding structural information about all combinations of categories of the selected variables. The selected variables are treated as category variables and interpreted according to their text rather than numeric values. Each combination of categories from two or more variables is commonly called a ‘cell.’ The information held in the Cross variable is the ‘cross’ part of concepts such as ‘cross-tabulation.’ That is, it holds a case for each possible combination of categories from the selected variables whether or not any observation falls in each cell.
gamma
Returns the value of the gamma function for the selected argument. The argument can be a single value or a variable.
lngamma
Returns the value of the natural log of the gamma function for the selected argument. The argument can be a single value or a variable.
Numeric
Returns 1 for all values that are numeric and 0 for all others. Helpful for creating selector variables that identify missing values. Argument can also be a scalar, for example: Numeric(3) returns 1 but Numeric (0/0) returns 0, because zero divided by zero returns a missing value.
CaseNum
Returns the case number for each case. If the argument is a scalar, it returns a variable that counts from 1 to the value. For example CaseNum(500), returns a variable holding 1,2,3,…500.
NameOf
Returns the name of the selected variable. If the argument is a variable socket, it returns the name of the variable currently plugged into the socket.. This command is useful for Action Language Programs.
CoerceToRelation(y, x)
If y and x are in two different relations but have the same number of cases, creates a derived variable that returns the values of y. The derived variable resides in x’s relation. If y is a scalar and x is a vector, returns a vector in x’s relation with each case equal to the value of y.
NumCats
Returns the number of categories in the selected variable. The argument must be a variable and not an expression.