## Explain Files

Explain Files are a quick-start resource meant to help you get the most out of Data Desk as quickly as possible.

This is a Boxplot side by side window. If it is empty, drag one or more quantitative variables into the window. Data Desk will make a vertical Boxplot for each variable, all on the same vertical scale.

### Working with Boxplots side by side

Boxplots and Dotplots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the variables, and dotplots show more detail and make it easier to identify and work with (e.g. assign colors or symbols to) individual points. Boxplots show the interquartile range (IQR) of each variable as the size of the box and the median of each variable as a horizontal line across the box. They nominate possible outliers in terms of the IQR and show them as individual points. Because this outlier nomination is local to each variable, it is an effective way to identify points that might be unusual for that variable even if they are not unusual when viewed overall. You can identify any point with the query (?) tool.
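As a rough illustration of nominating outliers in terms of the IQR, here is a minimal Python sketch. The 1.5 × IQR fences (Tukey's common convention) and the quartile method are assumptions made for the example; this text does not state the exact rule Data Desk applies. The function name `nominate_outliers` and the sample data are hypothetical.

```python
import statistics

def nominate_outliers(values, k=1.5):
    """Return (q1, median, q3, outliers) using IQR-based fences.

    k=1.5 is Tukey's conventional multiplier, assumed here for illustration.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return q1, median, q3, outliers

# Two hypothetical variables plotted side by side.
small = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 11.5]
large = [80, 95, 110, 120, 135, 150, 160]

# 11.5 is nominated within its own variable...
print(nominate_outliers(small)[3])
# ...but not when all the values are pooled together.
print(nominate_outliers(small + large)[3])
```

The second call shows why a nomination that is local to each variable can flag a point that would go unnoticed in the pooled data.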

### Boxplot HyperViews

In addition to adding or removing boxes, the global HyperView offers a ZScore Transformation. This is useful when plotting variables that have very different scales. For each variable, Data Desk subtracts its mean and divides by its standard deviation.
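The z-score calculation can be sketched in a few lines of Python. This is an illustration only: the use of the sample standard deviation (n - 1 divisor) is an assumption, and the function name `zscores` and the data are hypothetical.

```python
import statistics

def zscores(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 divisor): an assumption
    return [(v - mean) / sd for v in values]

# Two hypothetical variables on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 30000.0, 40000.0, 50000.0, 60000.0]

z_heights = zscores(heights_cm)
z_incomes = zscores(incomes)
print(z_heights)
print(z_incomes)
```

After the transformation both variables have mean 0 and standard deviation 1, so their boxplots can share one vertical scale.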

### Learning from Boxplots

Boxplots offer a convenient comparison of the distributions of quantitative variables. Look for trends in their variability (comparing the IQRs) and identify and examine the nominated possible outliers. The Modify > Lines > Show Lines/Hide Lines command alternately draws lines that connect the points for each case or hides them (even though the individual points are not displayed in the boxplots).

This is a Boxplot y by x window. If it is empty, drag a quantitative variable onto the y-axis and a categorical variable that names groups onto the x-axis. Data Desk will make a Boxplot for each group.

### Working with Boxplots

Boxplots and Dot plots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the groups and dot plots show more detail and facilitate identifying and working with (e.g. assigning colors or symbols to) individual points. Boxplots show the InterQuartile Range (IQR) of each group as the size of the box and the median of each group as a horizontal line across the box. They nominate possible outliers in terms of the IQR and show them as individual points. Because this outlier nomination is local to each group, it is an effective way to identify points that might be unusual for that group even if they are not unusual when viewed overall.

### Boxplot HyperViews

In addition to adding or removing boxes, the global HyperView offers to Drop scales. This generates hotresult variables with the max and min for each group. These can be useful in other calculations. The HyperView attached to the name of the categorical (x-axis) variable offers bar charts and pie charts. The HyperView attached to the quantitative (y-axis) variable offers a histogram and normal probability plot. You can identify any outlying point with the query (?) tool.
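To make the idea of per-group maximum and minimum values concrete, here is a small Python sketch of that calculation. It only mimics the arithmetic; the function name `group_extremes` and the data are hypothetical and say nothing about how Data Desk stores its hotresult variables.

```python
from collections import defaultdict

def group_extremes(groups, values):
    """Return {group: (min, max)} for a categorical and a quantitative variable."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    return {g: (min(vs), max(vs)) for g, vs in by_group.items()}

# Hypothetical data: a group name and a measurement for each case.
groups = ["A", "B", "A", "C", "B", "A", "C"]
values = [4.2, 7.1, 3.9, 10.4, 6.5, 5.0, 9.8]

extremes = group_extremes(groups, values)
print(extremes)
```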

### Learning from Boxplots

Boxplots offer a convenient comparison of the distribution of a quantitative variable across categories of a categorical variable. Look for clusters and outliers. The vertical style of Boxplots overprints multiple points if they appear at the same place. The ? tool will indicate the number of overprints and will identify them successively on multiple clicks. Consider making a histogram of the y-axis variable (from its HyperView) and then selecting all the points in one of the categories with the rectangle selector, lasso selector, or knife tool to see the sub-histogram for that group against the overall distribution.

This is a Cluster Analysis window. If it is empty, just drag quantitative variables into it singly or in groups. A cluster tree graph will appear. Calc > Calculation Options > Cluster Analysis Options lets you choose the clustering method.

### Working with Cluster Analysis

Click on any tree node to select all cases "below" that node. Cases are selected in all graphics and editing windows. Plot colors and symbols operate on the points at the bottom (left) of the cluster tree, so it is easy to select a cluster and assign a color or symbol to it.

### Cluster Analysis HyperViews

The Save Distances command records the distances of the successive nodes in the cluster tree.

### Learning from Cluster Analysis

Clustering is an exploratory method best used along with other methods. Select clusters and examine them in other analyses. Consider making an indicator variable for the selected cases with the Modify > Selection > Record Hot Set command.

This is a Contingency Table window. If there is no content, drag variables into the window, dropping them in the title area to specify which will define the rows of the table and which will define the columns. You can drag other variables in and drop them in these locations to replace these at any time. Contingency tables treat the variables as categorical even if they contain numerals. The table will have a row for each individual category of the row variable, a column for each category of the column variable, and counts of the number of cases falling into the combination of categories in the body of the table.

HyperView commands include opening Contingency Table Options (also available from the Calc menu under Calculation Options). These include specifying whether marginal values should be displayed and whether the displayed values should include counts, proportions, or both. Options also include a chi square statistic to summarize the degree of association between the variables and the expected values and standardized residuals that go into the chi square calculation. (Chi square = the sum of the squared standardized residuals, so these can reveal just where the table shows an association between the variables.)
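The relationship between expected values, standardized residuals, and the chi square statistic can be sketched in Python as follows. This is an illustrative calculation, not Data Desk's own code; the function name `chi_square_table` and the counts are hypothetical, and the residuals are taken to be (observed - expected) / sqrt(expected), which is the form whose squares sum to chi square.

```python
import math

def chi_square_table(counts):
    """Return (expected, residuals, chi_square, df) for a table of counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(counts, expected)
    ]
    chi_square = sum(z * z for row in residuals for z in row)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, residuals, chi_square, df

# Hypothetical 2 x 2 table of counts.
table = [[30, 10],
         [20, 40]]
expected, residuals, chi_square, df = chi_square_table(table)
print(expected)
print(residuals)
print(chi_square, df)
```

Cells with the largest residuals in absolute value are the ones that contribute most to the statistic.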

### Working with Contingency Tables

Click on any cell or margin value to drop down a HyperView menu that offers to select the cases represented by that value, to create an indicator (or dummy) variable that is one for the cases in that cell and zero for the others, or to record a 0/1 variable and make it the Selector for subsequent commands. Choose Compute Counts from the global HyperView menu to record hotresult variables that name the row categories and column categories, and the counts.

### Learning from Contingency Tables

A chi square statistic with a low p-value can be interpreted as evidence that the two variables are not statistically independent. Depending on your interpretation of the variables, the test may indicate whether the distributions of counts among several categories are alike or different. When a test is statistically significant, it is almost always worthwhile to examine the standardized residuals to understand which cells contributed to the larger chi square value. If one or two cells stand out with larger residuals, you can click on them to select those cases for further examination in other displays. (All case selections always appear in all displays and editing windows.)

### Special Features of Contingency Tables

If the number of categories in a variable exceeds the categories limit (default 50), Data Desk will warn you before proceeding to make the table. This could happen if you select a quantitative variable rather than a categorical variable. You can adjust the categories limit at Data Desk > Preferences > Categories Limit…

This is a Correlation Table window. If it shows no values, drag variables into the window and drop them anywhere. You can drag additional variables into the window at any time to add them to the correlation table. The drop-down menu at the top lets you choose among Pearson product-moment correlation, Kendall's tau, Spearman's rho, and covariances. The table shows pairwise correlations computed for all cases with numeric values on both variables. Correlation tables can take you directly to related analyses. The HyperView found by clicking the triangle in the upper left of the window offers features and related commands. Correlation tables are versatile with many special features.

### Working with Correlation Tables

Click on the row or column name of any variable to Select or Locate its icon, to make a Histogram or Normal Probability Plot of the variable, or to remove it from the table. Click on any correlation to make a Scatterplot of the two variables. Because correlations are symmetric, Data Desk offers to plot either variable on the y-axis. If any value in one of the variables in the table is modified, the table will offer to recalculate by displaying a red exclamation mark where the global HyperView menu usually appears in the upper left corner of the window.

### Correlation HyperViews

The global HyperView offers to make a Plot Matrix of all the variables. This is the natural visualization of a correlation table. Turn on Automatic Update to have the table immediately recomputed if any data value is changed.

### Learning from Correlations

Correlations summarize the association between pairs of variables. Pearson correlation measures linear association. Kendall's tau measures monotonicity. Spearman's rho is the Pearson correlation of the ranks, and is less sensitive to possible outliers. Covariances, unlike the other association measures available in this table, measure association using the original measurement units of the variables rather than re-scaling them. It is a good idea to examine the scatterplot corresponding to any correlation that is of importance or interest. Pearson correlation is sensitive to outliers and is only appropriate when the association is linear. You can check both with the scatterplot. Kendall's tau and Spearman's rho do not require a linear relationship and are less sensitive to outliers.
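The difference between these measures shows up clearly on a relationship that is monotone but curved. The Python sketch below is illustrative only: the function names and data are hypothetical, ties in the ranks are given their average rank, and the Kendall calculation is the simple tau-a form, which may differ from the variant Data Desk computes when ties are present.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# Hypothetical data: y rises with x, but along a curve (y = x cubed).
x = [1, 2, 3, 4, 5, 6]
y = [1, 8, 27, 64, 125, 216]

print(pearson(x, y))        # below 1 because the relationship is not linear
print(spearman(x, y))       # 1 because the ranks agree exactly
print(kendall_tau_a(x, y))  # 1 because every pair is concordant
```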

### Special Features of Correlation Tables

Correlations are computed for all cases that have numeric data in both variables. Thus the correlations in the table may be for different numbers of cases and for different cases if data are missing for different cases in different variables.

Pearson correlation is closely associated with linear regression. A convenient way to get to a regression is to make the offered scatterplot and then choose Regression from the scatterplot's HyperView menu. This has the advantage of offering a check for nonlinearity and outliers along the way.

When developing a multiple regression model, it can be helpful to save the residuals and drag them into a correlation table of the available predictors. This will show which remaining predictors are correlated with the residuals and are thus good candidates to include in your model. When a variable is added to the model, the residuals will update and the correlation table will offer to update to reflect the change.

This is a Dotplot side by side window. If it is empty, drag one or more quantitative variables into the window. Data Desk will make a vertical dotplot for each variable, all on the same vertical scale.

### Working with Dotplots side by side

Dotplots and Boxplots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the variables and dotplots show more detail and facilitate identifying and working with (e.g. assigning colors or symbols to) individual points.

### Dotplot HyperViews

In addition to adding or removing boxes, the global HyperView offers a ZScore Transformation. This is useful when plotting variables that have very different scales. For each variable, Data Desk subtracts its mean and divides by its standard deviation. This makes it easier to compare distributions within the variables. Because these are different variables, each case is represented in each of the dotplots, so when a point is selected, it highlights in each dotplot. You can identify any point with the query (?) tool.

### Learning from Dotplots

Dotplots offer a convenient comparison of the distributions of quantitative variables. Look for trends in the variability of the variables and for possible outliers. The vertical style of dotplots overprints multiple points if they appear at the same place. The ? tool will indicate the number of overprints and will identify them successively on multiple clicks. The Modify > Lines > Show Lines/Hide Lines command alternately draws lines that connect the points for each case or hides them. With lines shown, this is a kind of parallel coordinate plot.

This is a Frequency Breakdown window. If it is empty, drag a categorical variable into it. Frequency breakdowns show, for each category, the count of cases falling in that category, the percentage of cases in the category, and a cumulative percentage of cases. Drag a new variable over the variable name to replace it.
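The three columns of a frequency breakdown are simple to reproduce. The following Python sketch is only an illustration; the function name `frequency_breakdown` and the data are hypothetical, and listing the categories in sorted order is an assumption about presentation.

```python
from collections import Counter

def frequency_breakdown(categories):
    """Return rows of (category, count, percent, cumulative percent)."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category in sorted(counts):  # sorted order is an assumption
        percent = 100.0 * counts[category] / total
        cumulative += percent
        rows.append((category, counts[category], percent, cumulative))
    return rows

# Hypothetical categorical variable.
colors = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
breakdown = frequency_breakdown(colors)
for row in breakdown:
    print(row)
```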

### Working with Frequency Breakdowns

Click on the column heads to save and locate a hotresult variable holding the values of the column. As hotresults, these will update if the variable is changed or replaced. The HyperView attached to the variable name offers to make a bar chart or pie chart of the variable. HyperViews attached to any of the values in the table offer to select that category in all other displays and editing windows.

### Frequency Breakdown HyperViews

The Frequency Options command in the global HyperView offers a variety of alternative calculations for a frequency window. These include the expected values and standardized residuals for a chi square test of homogeneity (the hypothesis of equal cell counts).

This is a histogram window.

If it is blank, then drag a variable's icon into the window and drop it anywhere. The variable must contain at least one numeric case.

To change the displayed variable, drag a new variable icon onto the axis label at the bottom of the plot, and drop it there.

### How does it work?

The lasso, box, brush, and pointer tools select bars of the histogram. Hold Shift to select multiple bars. The knife tool selects ranges of bars. Selected cases link to other displays and to variable editing windows, and are selected there as well.

Cases selected in other displays highlight in the histogram as a smaller histogram within the bars of the histogram.

Drag the size box in the lower left to resize the window.

Option-drag the size box to re-scale the plot.

### HyperViews

Automatic Update redraws the histogram immediately whenever the data or selection changes.

Plot scale lets you specify the plot scale numerically.

Compute Bar counts saves hotresult variables holding the left edge value of each bar, the bar counts, and the cumulative count (low to high).
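For readers who want to see what those three saved variables contain, here is a minimal Python sketch of the same bookkeeping. It is an illustration under stated assumptions: bars have equal width, each bar covers values from its left edge up to (but not including) the next edge, and every value is at least `start`. The function name `bar_counts` and the data are hypothetical.

```python
import math

def bar_counts(values, start, width):
    """Return (left_edges, counts, cumulative_counts) for equal-width bars."""
    n_bars = int(math.floor((max(values) - start) / width)) + 1
    counts = [0] * n_bars
    for v in values:
        counts[int(math.floor((v - start) / width))] += 1
    left_edges = [start + i * width for i in range(n_bars)]
    cumulative = []
    running = 0
    for c in counts:  # accumulate from the lowest bar to the highest
        running += c
        cumulative.append(running)
    return left_edges, counts, cumulative

# Hypothetical data and plot scale.
data = [1, 2, 2, 3, 5, 7, 8, 8, 9]
left_edges, counts, cumulative = bar_counts(data, start=0, width=2)
print(left_edges)
print(counts)
print(cumulative)
```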

### What does it tell me?

Histograms show the distribution of a variable. Look for:

Shape: Is the histogram symmetric; does the right side approximately mirror the left? Does the histogram have one main mode (hump) or more than one? Are there any possible outliers standing away from the body of the data?

Center: Roughly what is the middle value?

Spread: How spread out are the values? (Does that central value summarize the overall values well?)

Alert: Rescaling can change the impression that you get from a histogram. Don't be overly impressed by minor variation in bar heights; those are probably not separate modes and may disappear with a different plot scale.

If your histogram is skewed (lopsided), you may want to re-express your data. (See Histogram tricks.)

### What next?

If the histogram shows two or more modes, consider whether they represent subgroups in your data that should be analyzed separately. It is usually worthwhile to identify such subgroups and figure out why they are different.

### Histogram Tricks and Details

To find a re-expression that makes a skewed histogram more nearly symmetric, try the following:

Select the icon of your variable

Choose Manip > Transform > Dynamic > Box-Cox

This will create a new derived variable and a slider.

Drag the derived variable into the histogram to replace the original variable. The histogram will not change at first.

Choose Turn on Automatic Update in the global HyperView menu.

Now, when you slide the slider, the histogram will show how the distribution changes with each re-expression power. Look especially at the "simple" powers of ½, 0, -½, and -1. (Note: the "0" power corresponds to taking logarithms.)
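To see numerically what the slider is doing, the sketch below applies the textbook Box-Cox family to a skewed variable and reports a simple skewness measure for each of the "simple" powers. The formula (y^p - 1) / p, with the logarithm at p = 0, is the standard definition and is assumed here; Data Desk's exact scaling may differ. The function names and data are hypothetical, and the values must be positive.

```python
import math

def box_cox(values, power):
    """Textbook Box-Cox re-expression; power 0 means the natural log."""
    if power == 0:
        return [math.log(v) for v in values]
    # Dividing by the power keeps the values in their original order,
    # even for negative powers.
    return [(v ** power - 1) / power for v in values]

def skewness(values):
    """Moment-based skewness: 0 for a symmetric batch of values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64]

for power in (1, 0.5, 0, -0.5, -1):
    print(power, round(skewness(box_cox(raw, power)), 3))
```

For these data the logarithm (power 0) gives a symmetric result, while the raw values (power 1) are strongly skewed to the right.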

This is a Linear Model window. If there is no analysis, drag your response variable into the indicated place at the top of the table. Drag your factor or predictor variables into the indicated place.

At any time, you can drag additional variables into the window at these locations for either the factors or response. Click on the name of a variable to remove it from the model.

There can be more than one response variable, in which case, the analysis is a multivariate linear model, and the Type of analysis will indicate that.

The response variable can be a binary categorical variable, in which case you should click on the type of analysis and choose Logistic.

If all factors are categorical (discrete) and the response variable(s) are quantitative, the analysis is a multivariate ANOVA (MANOVA).

If all factors are quantitative (continuous) and the response variable(s) are quantitative, the analysis is a multivariate regression.

If the factors are a mix of quantitative and categorical and the response variable(s) are quantitative, the analysis is a multivariate analysis of covariance (ANOCOV).

The multivariate linear model is an extraordinarily general analysis with many special versions. You may want to consult the Data Desk documentation for further information.

### Working with Linear Models

In the Factors panel, specify for each factor whether it is a fixed or random effect and whether it is continuous (quantitative) or discrete (categorical).

Indicate nesting by dragging a line between a factor and the parentheses next to the factor in which it is nested.

Select either Type I (sequential) or Type III (partial) sum of squares.

The Design Help button shows how to specify common designs.

The Interactions Sub-panel allows you to call for all interactions up to a specified level, to select two terms and add their interaction, or to select and remove an interaction. It also computes and displays the maximum df available for each interaction (fewer df may ultimately be available as the model is computed, due to collinearities and empty cells), the basis for the expected mean squares, and the denominator to be used for the appropriate F-test.

The Up and Down buttons allow re-ordering of factors and interactions for sequential sums of squares.

The modifications panel accommodates a selector variable to analyze a subset of the data and a variance variable to make the analysis a weighted analysis.

The Results panel opens to reveal output tables appropriate to the type of analysis.

A sub-panel opens to offer coefficients, expected cell means, and post-hoc tests.

The results for a multivariate analysis include a selection among the most common multivariate tests. For a multivariate analysis, the analysis for a specific response is itself an ANOVA or ANOCOV.

### Linear Model HyperViews

The global HyperView menu offers a variety of diagnostic displays and calculations similar to those offered for regression.

HyperViews on each variable offer histograms and normal probability plots.

Each panel has a window icon. The HyperView attached to that offers to pull the panel out into a separate window (so the analysis can fit more easily on a small screen) and to make a static copy of it (for comparison with an alternative model).

ANOVA/ANOCOV panels behave in the same way as ANOVA windows.

The tables of results for multivariate tests offer HyperViews with appropriate supplementary information such as eigenvalues associated with the selected factor.

This is a Pie Chart window. If it is empty, drag the icon of a variable into the window and drop it there. Drag a new variable into the window to plot it instead.

A pie chart treats the displayed variable as naming categories even if its contents are numbers. It divides a circle into segments that correspond in size to the relative frequency of each category named in the variable. The segments are colored and a color key is provided to the right of the circle.

### Working with Pie Charts

Select the cases in any sector of the pie chart by clicking on it or on its color square in the key. There is no need to choose a plot tool. Cases selected in a pie chart highlight in all other displays and editing windows.

### Pie Chart HyperViews

The natural alternative display for the same variable is a Bar chart, which is offered in the HyperView.

The natural associated table is a Frequency Table, also offered in the HyperView.

The option of using patterns instead of colors is available, and useful for publication when colors will not be available.

### Learning from Pie Charts

Pie charts have been maligned because the human eye finds it harder to perceive the relative size of angles than the relative heights of bars in a bar chart. However, they are compact and easily read. They are most useful when displaying a variable that divides some "whole" into segments.

### Special Features of Pie Charts

The colors assigned to the pie chart slices are selected to be as different from each other as possible. They are the same colors that Data Desk will choose if the displayed variable is used to color points in another display with the Modify > Colors > Add Group command. Thus, a pie chart of the variable can serve as a quick key to the assigned colors.

This is a Principal Components window. If it is empty, drag some variables into the window individually or in groups.

### Working with Principal Components

The principal components results report the eigenvalues, the eigenvectors, and an unrotated factor matrix.

### Principal Components HyperViews

The global HyperView offers Principal Components Options. These include a choice of basing the analysis on correlations or on covariances. Generally, it is best to use correlations unless the variables are measured on comparable scales. The Options also offer a choice of results to be saved. Results are saved in a new folder that is placed in the Data folder found in the "file" at the upper right of the Data Desk window.

### Learning from Principal Components

Locate the PCs folder in the Data folder of the File icon. Two folders hold the columns of the U and V' matrices of the Singular Value Decomposition (SVD) of the matrix made up of the columns of data: X = UDV', where D is a diagonal matrix of the singular values (the square roots of the eigenvalues of X'X).
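The link between the SVD and the eigenvalues and eigenvectors reported in the window can be written out in one line. The derivation below is standard linear algebra; it assumes the columns of X have been centered (and also standardized when the analysis is based on correlations), which is an assumption about the preprocessing rather than something stated here.

```latex
X = U D V', \qquad U'U = I, \qquad V'V = I
\;\;\Longrightarrow\;\;
X'X = V D\, U'U\, D V' = V D^{2} V'
```

So the columns of V are the eigenvectors of X'X, and the squared diagonal entries of D are its eigenvalues. Dividing X'X by n - 1 gives the covariance (or correlation) matrix, which rescales the eigenvalues but leaves the eigenvectors unchanged.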

A rotating plot of the columns of U is the same as a rotating plot of the columns of X (the original data) except for the orientation of the axes. For more than 3 variables, the rotating plot of the columns of U may show a more "interesting" orientation of the data.

### Special Features of Principal Components

The U and V columns are themselves derived variables, so you can open them to see the linear combinations of the argument variables.

This is a Linear Regression window. If it is empty, drag your response variable into the top line of the table where indicated. Drag predictor variables, either individually or as a group, into the bottom row of the table where indicated. When both parts of the analysis are specified, the regression is computed and the rest of the table will fill in.

You can drag a new variable into the response variable row to replace the response variable. You can drag additional variables into the predictor area to add them to the model. Alternatively, you can drag a new variable directly over the name of an existing predictor to replace it in the regression model.

Linear Regression fits a linear model to predict or describe the response variable in terms of the predictor variables.

The values in the regression table are:

The number of cases in the model. A case that is missing a numeric value in any of the variables in the model will be excluded from the calculation.

R-squared gives the percentage of the variance of the response variable accounted for by the regression model.

R-squared (adjusted) is a value adjusted for the number of cases and variables, suitable for comparing regression models with different numbers of predictors.

F-ratio is a global indicator of whether the response can be modeled by the predictors.

The final sub-table names each predictor and gives:

• Its coefficient in the model.

• The estimated standard error of that coefficient.

• The t-ratio for testing the standard null hypothesis that the true value of the coefficient in this model is zero.

• The p-value of that t-test.
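The values in the regression table can be reproduced by hand for the one-predictor case. The Python sketch below is an illustration, not Data Desk's implementation: the function name `simple_regression` and the data are hypothetical, missing values are represented by `None`, and the p-value is left out because it needs the t distribution, which is not in the Python standard library.

```python
import math

def simple_regression(x, y):
    """Fit y = intercept + slope * x by least squares and summarize the fit."""
    # Exclude any case with a missing value (None) in either variable.
    cases = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(cases)
    k = 1  # number of predictors
    mean_x = sum(a for a, _ in cases) / n
    mean_y = sum(b for _, b in cases) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in cases)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in cases)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in cases)
    sst = sum((b - mean_y) ** 2 for _, b in cases)
    s2 = sse / (n - k - 1)  # residual variance estimate
    se_slope = math.sqrt(s2 / sxx)
    return {
        "n": n,
        "r_squared": 1 - sse / sst,
        "adj_r_squared": 1 - s2 / (sst / (n - 1)),
        "f_ratio": ((sst - sse) / k) / s2,
        "intercept": intercept,
        "slope": slope,
        "se_slope": se_slope,
        "t_ratio": slope / se_slope,
    }

# Hypothetical data; the last case is missing its response value.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, None]

fit = simple_regression(x, y)
for name, value in fit.items():
    print(name, value)
```

With a single predictor the F-ratio equals the square of the slope's t-ratio, which is a handy check on the arithmetic.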

### Working with Regressions

Regression is one of the most versatile statistical models and is widely used. Data Desk offers great flexibility for building, diagnosing, and understanding regressions.

Drag potential predictor variables into your model to add them to the model or replace existing predictors. Remove predictors by clicking on them and choosing the Remove Predictor command.

HyperView menus are attached to various parts of the regression table.

The menu attached to variable names offers to locate or select the variables, display them with Histograms or Normal probability plots, or make a Scatterplot of the response variable against that predictor.

The HyperView attached to each coefficient offers to make the partial regression plot that corresponds to that coefficient or to "drop coefficients" into hotresult variables that can be used in other calculations.

The HyperView attached to the standard errors offers to drop the coefficients.

The HyperView attached to each t-ratio offers a plot of studentized residuals or of residuals vs that variable, and also offers to drop the t-ratio in a hotresult.

The HyperView attached to each p-value offers to drop the p-value as a hotresult.

The HyperViews attached to the Sum of Squares values and the df values drop hotresults containing those values. They are provided primarily to be available for other calculations.

The HyperView attached to the adjusted R-squared drops that value as a hotresult. This is provided primarily for use in automatically optimizing the regression model by maximizing the adjusted R-squared value.

### Regression HyperViews

The global HyperView offers a variety of displays and diagnostic statistics:

Scatterplot residuals vs predicted values.

Scatterplot studentized residuals vs predicted values. This command computes the externally studentized residuals and plots them against the predicted values. Studentized residuals are adjusted to all have the same standard errors, so this plot may be more appropriate for assessing whether the regression assumption of constant variance around the model is satisfied.

Potential-Residual plot. This is a diagnostic plot that can help identify influential cases. It is a good idea to identify and understand any cases that stand apart from the rest of the data in this display.

Assign Variance Variable. To compute a weighted regression, select a variable that holds the estimated variances of the cases and then choose this command. The reciprocal of the variances will be the weights in the model.

Turn On Automatic Update. As in other Data Desk windows, this causes the regression to update immediately upon any change to a variable in the model. See the discussion of special features for some ideas on using this capability.

Compute> This is a submenu offering to compute a variety of diagnostic and related statistics. The computed statistics are saved as hotresult variables, so all will update if the regression model is updated. Computed statistics include:

Predicted values

Residuals

Leverages

Externally studentized residuals

DFFITS

Cook's D

Hadi's Influence

Likelihood

Mahalanobis distances (based on the predictors)

Prob Plot> This is a submenu offering to make a Normal Probability Plot of any of a variety of related diagnostic statistics. The available plots include:

Residuals

Leverages

Externally studentized residuals

Cook's D

Hadi's influence

Likelihood

For many of these measures the best indication of an extraordinary case is that it stands away from the other values. A Normal probability plot offers a good way to look for that and one in which individual cases can be easily identified with the ? tool.
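Several of the diagnostics named above have simple closed forms when there is a single predictor. The Python sketch below computes leverages, externally studentized residuals, and Cook's D from their textbook formulas. It is an illustration only: the function name `regression_diagnostics` and the data are hypothetical, and Data Desk's own calculations (which handle any number of predictors) are not shown in this text.

```python
import math

def regression_diagnostics(x, y):
    """Leverage, externally studentized residual, and Cook's D per case."""
    n = len(x)
    p = 2  # coefficients in the model: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)
    rows = []
    for a, e in zip(x, residuals):
        leverage = 1 / n + (a - mean_x) ** 2 / sxx
        # Internally studentized residual (uses the overall s2).
        internal = e / math.sqrt(s2 * (1 - leverage))
        # Externally studentized residual (as if the case were left out of s2).
        external = internal * math.sqrt((n - p - 1) / (n - p - internal ** 2))
        cooks_d = internal ** 2 * leverage / (p * (1 - leverage))
        rows.append({"leverage": leverage, "studentized": external, "cooks_d": cooks_d})
    return rows

# Hypothetical data: the last case sits far from the others in x.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 12.0]

diagnostics = regression_diagnostics(x, y)
leverages = [row["leverage"] for row in diagnostics]
for row in diagnostics:
    print(row)
```

The leverages always sum to the number of coefficients in the model, and the case with the extreme x value stands well away from the rest, which is the pattern to look for in the plots.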

### Learning from Regression

Regressions are found for a variety of reasons and can be part of other analyses.

If you are particularly interested in a coefficient in a multiple regression, you should make and examine the partial regression plot available by clicking on the coefficient. That plot displays the relationship between the response variable and the predictor in question after removing the linear effects of the other predictors in the model. You can interpret it as you would a simple scatterplot. You should identify and understand any cases that stand away from the body of the data and you should be concerned if the relationship looks nonlinear.

No regression is complete without an examination of the residuals, so either a plot of residuals vs predicted or of studentized residuals vs predicted is highly recommended.

### Special Features of Regression

Many of Data Desk's abilities are particularly valuable when you build and interpret a linear regression.

Consider some of the following options:

If you have found (or suspect) a curved relationship between the response variable and the predictors (for example, if the scatterplot of residuals is curved), select the response variable. (You can do that from the HyperView attached to its name in the regression table.) Then choose

Manip > Transform > Dynamic > Box-Cox

Data Desk will make a new derived variable and a slider.

Drag the derived variable into the regression table to replace the original response variable.

If you haven't already, make a scatterplot of the residuals and/or a Normal probability plot of the residuals.

Set those plots to Automatic Update using the commands in their HyperView menus.

Now, sliding the control on the slider will re-express the response variable. A slider value of 1 is the raw data. A value of ½ takes a square root. A value of 0 specifies a (natural) logarithm. A value of -½ specifies a negative reciprocal root, and -1 specifies a negative reciprocal. As you slide, the regression model is continuously re-computed along with the residuals, predicted values, and any other statistics you have computed or plotted. Plots of those values set to automatic update will change smoothly to reflect the change. You may set the regression table itself to Automatic Update, but that isn't necessary.

With this trick, it is easy to see and understand the effects on your regression model of re-expressing the response variable and to find an optimal re-expression function.

A similar trick can help you to choose between two potential predictors. Choose both variables and select Manip > Transform > Dynamic > Mix X and Y.

Data Desk makes a derived variable and slider. The slider is bounded at 0 and 1. At 0 the derived variable is equal to the X variable. At 1 it is equal to the Y variable. Between those values it is a weighted combination of scaled versions of these variables. With this variable you can "slide" from one variable to the other and watch the consequences in the plots you have set to automatically update. This can be remarkably informative. You may, for example, see a cluster of points that move together in one of the plots, helping to identify those cases as related to each other.
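One plausible reading of "a weighted combination of scaled versions" is a linear blend of the two standardized variables. The Python sketch below illustrates that reading only. Both the choice of z-scores as the scaling and the linear weights are assumptions, and the function names and data are hypothetical.

```python
import statistics

def standardize(values):
    """Scale a variable to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def mix(x, y, t):
    """Blend scaled x and scaled y; t=0 gives scaled x, t=1 gives scaled y."""
    if not 0 <= t <= 1:
        raise ValueError("the slider value t must lie between 0 and 1")
    zx, zy = standardize(x), standardize(y)
    return [(1 - t) * a + t * b for a, b in zip(zx, zy)]

# Two hypothetical candidate predictors.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [30.0, 10.0, 50.0, 20.0, 40.0]

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, [round(v, 3) for v in mix(x, y, t)])
```

Scaling both variables first keeps the blend from being dominated by whichever variable happens to have the larger units.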

You may discover a case that you conclude should be omitted from the analysis. (Perhaps, for example, you identified the case with the ? tool and then used the Special > Web Search Query function to learn more about the case, concluding that it was in some important way different from the other cases.) One convenient way to set the case aside without losing it is to create a special indicator variable that is 1 for that case and 0 for all the others.

Select the case in any display.

Open a variable that names the cases (if you have one).

Choose Modify > Selection > Record as Indicators

Data Desk will make an indicator variable for each selected case, naming the variable with the name of the case.

Now drag those indicators into the regression model to add them as predictors.

The effects of these cases will be removed from the analysis.

The p-value associated with the t-test on each indicator's coefficient provides a statistical test of whether that case is in fact an outlier from the regression model.
