Explain Files

Explain Files are a quick-start resource meant to help you get the most out of Data Desk as quickly as possible.

This is a Boxplot side by side window. If it is empty, drag one or more quantitative variables into the window. Data Desk will make a vertical Boxplot for each variable, all on the same vertical scale.

Working with Boxplots side by side

Boxplots and Dotplots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the variables and Dotplots show more detail and facilitate identifying and working with (e.g. assigning colors or symbols to) individual points. Boxplots show the InterQuartile Range (IQR) of each variable as the size of the box and the median of each variable as a horizontal line across the box. They nominate possible outliers in terms of the IQR and show them as individual points. Because this outlier nomination is local to each variable, it is an effective way to identify points that might be unusual for that variable even if they are not unusual when viewed overall. You can identify any point with the query (?) tool.
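As an illustration of the outlier nomination described above, here is the conventional 1.5 × IQR fence rule in Python (a sketch only; Data Desk's exact quantile and fence conventions may differ, and the function names are illustrative):

```python
# Sketch of boxplot outlier nomination using the standard 1.5*IQR fences.
def iqr_fences(values):
    """Return (lower, upper) fences beyond which points are nominated."""
    xs = sorted(values)
    n = len(xs)

    def quartile(p):
        # simple linear-interpolation quantile
        k = p * (n - 1)
        lo = int(k)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def nominate_outliers(values):
    """Points outside the fences are shown as individual points."""
    lo, hi = iqr_fences(values)
    return [v for v in values if v < lo or v > hi]
```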

Boxplot HyperViews

In addition to adding or removing boxes, the global HyperView offers a ZScore Transformation. This is useful when plotting variables that have very different scales. For each variable, Data Desk subtracts its mean and divides by its standard deviation.
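The ZScore Transformation amounts to the standardization in this short sketch (the function name is illustrative):

```python
def zscores(values):
    """Standardize a variable: subtract the mean, divide by the sample SD."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    sd = var ** 0.5
    return [(v - mean) / sd for v in values]
```

After this transformation every variable has mean 0 and standard deviation 1, so variables with very different scales plot on a common axis.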

Learning from Boxplots

Boxplots offer a convenient comparison of the distributions of quantitative variables. Look for trends in their variability (comparing the IQRs) and identify and examine the nominated possible outliers. The Modify > Lines > Show Lines/Hide Lines command alternately draws or hides lines that connect the points for each case (even though the individual points are not displayed in the boxplots).

This is a Boxplot y by x window. If it is empty, drag a quantitative variable onto the y-axis and a categorical variable that names groups onto the x-axis. Data Desk will make a Boxplot for each group.

Working with Boxplots

Boxplots and dotplots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the groups and dotplots show more detail and facilitate identifying and working with (e.g. assigning colors or symbols to) individual points. Boxplots show the InterQuartile Range (IQR) of each group as the size of the box and the median of each group as a horizontal line across the box. They nominate possible outliers in terms of the IQR and show them as individual points. Because this outlier nomination is local to each group, it is an effective way to identify points that might be unusual for that group even if they are not unusual when viewed overall.

Boxplot HyperViews

In addition to adding or removing boxes, the global HyperView offers to Drop Scales. This generates hotresult variables with the max and min for each group. These can be useful in other calculations. The HyperView attached to the name of the categorical (x-axis) variable offers bar charts and pie charts. The HyperView attached to the quantitative (y-axis) variable offers a histogram and normal probability plot. You can identify any outlying point with the query (?) tool.

Learning from Boxplots

Boxplots offer a convenient comparison of the distribution of a quantitative variable across categories of a categorical variable. Look for clusters and outliers. The vertical style of boxplots overprints multiple points if they appear at the same place. The ? tool will indicate the number of overprints and will identify them successively on multiple clicks. Consider making a histogram of the y-axis variable (from its HyperView) and then selecting all the points in one of the categories with the rectangle selector, lasso selector or knife tool to see the sub-histogram for that group against the overall distribution.

This is a Cluster Analysis window. If it is empty, just drag quantitative variables into it singly or in groups. A cluster tree graph will appear. Calc > Calculation Options > Cluster Analysis Options lets you choose the clustering method.

Working with Cluster Analysis

Click on any tree node to select all cases "below" that node. Cases are selected in all graphics and editing windows. Plot colors and symbols operate on the points at the bottom (left) of the cluster tree, so it is easy to select a cluster and assign a color or symbol to it.

Cluster Analysis HyperViews

The Save Distances command records the distances of the successive nodes in the cluster tree.

Learning from Cluster Analysis

Clustering is an exploratory method best used along with other methods. Select clusters and examine them in other analyses. Consider making an indicator variable for the selected cases with the Modify > Selection > Record Hot Set command.

This is a Contingency Table window. If there is no content, drag variables into the window, dropping them in the title area to specify which will define the rows of the table and which will define the columns. You can drag other variables in and drop them in these locations to replace these at any time. Contingency tables treat the variables as categorical even if they contain numerals. The table will have a row for each individual category of the row variable, a column for each category of the column variable, and counts of the number of cases falling into the combination of categories in the body of the table. HyperView commands include opening Contingency Table Options (also available from the Calc menu under Calculation Options.) These include specifying whether marginal values should be displayed and whether the displayed values should include counts, proportions, or both. Options also include a chi square statistic to summarize the degree of association between the variables and the expected values and standardized residuals that go into the chi square calculation. (Chi square = the sum of the squared standardized residuals, so these can reveal just where the table shows an association between the variables.)
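The relationship noted in the parenthetical (chi square equals the sum of the squared standardized residuals) can be sketched in Python; the function name is illustrative, and the table is given as a list of rows of counts:

```python
def chi_square(table):
    """Pearson chi-square for a contingency table (list of rows of counts).

    Returns (chi_square, standardized_residuals). Each residual is
    (observed - expected) / sqrt(expected), where expected is
    row_total * column_total / grand_total.
    """
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    resid = [[(table[i][j] - row_tot[i] * col_tot[j] / total)
              / (row_tot[i] * col_tot[j] / total) ** 0.5
              for j in range(len(col_tot))]
             for i in range(len(row_tot))]
    chi2 = sum(r * r for row in resid for r in row)
    return chi2, resid
```

Large individual residuals point to the cells where the table departs most from independence.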

Working with Contingency Tables

Click on any cell or margin value to drop down a HyperView menu that offers to select the cases represented by that value, to create an indicator (or dummy) variable that is one for the cases in that cell and zero for the others, or to record a 0/1 variable and make it the Selector for subsequent commands. Choose Compute Counts from the global HyperView menu to record hotresult variables that name the row categories and column categories, and the counts.

Learning from Contingency Tables

A chi square statistic with a low p-value can be interpreted as evidence that the two variables are not statistically independent. Depending on your interpretation of the variables, the test may indicate whether the distributions of counts among several categories are alike or different. When a test is statistically significant, it is almost always worthwhile to examine the standardized residuals to understand which cells contributed to the larger chi square value. If one or two cells stand out with larger residuals, you can click on them to select those cases for further examination in other displays. (All case selections always appear in all displays and editing windows.)

Special Features of Contingency Tables

If the number of categories in a variable exceeds the categories limit (default 50), Data Desk will warn you before proceeding to make the table. This could happen if you select a quantitative variable rather than a categorical variable. You can adjust the categories limit at Data Desk > Preferences > Categories Limit…

This is a Correlation Table window. If it shows no values, drag variables into the window and drop them anywhere. You can drag additional variables into the window at any time to add them to the correlation table. The drop-down menu at the top lets you choose among Pearson product-moment correlation, Kendall's tau, Spearman's rho, and covariances. The table shows pairwise correlations computed for all cases with numeric values on both variables. Correlation tables can take you directly to related analyses. The HyperView found by clicking the triangle in the upper left of the window offers features and related commands. Correlation tables are versatile, with many special features.

Working with Correlation Tables

Click on the row or column name of any variable to Select or Locate its icon, to make a Histogram or Normal Probability Plot of the variable, or to remove it from the table. Click on any correlation to make a Scatterplot of the two variables. Because correlations are symmetric, Data Desk offers to plot either variable on the y-axis. If any value in one of the variables in the table is modified, the table will offer to recalculate by displaying a red exclamation mark where the global HyperView menu usually appears in the upper left corner of the window.

Correlation HyperViews

The global HyperView offers to make a Plot Matrix of all the variables. This is the natural visualization of a correlation table. Turn on Automatic Update to have the table immediately recomputed if any data value is changed.

Learning from Correlations

Correlations summarize the association between pairs of variables. Pearson correlation measures linear association. Kendall's tau measures monotonicity. Spearman's rho is the Pearson correlation of the ranks, and is less sensitive to possible outliers. Covariances, unlike the other association measures available in this table, measure association using the original measurement units of the variables rather than re-scaling them. It is a good idea to examine the scatterplot corresponding to any correlation that is of importance or interest. Pearson correlation is sensitive to outliers and is only appropriate when the association is linear. You can check both with the scatterplot. Kendall's tau and Spearman's rho do not require a linear relationship and are less sensitive to outliers.
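For illustration, here are minimal pure-Python versions of the three correlation measures. This sketch uses the no-ties (tau-a) form of Kendall's tau and simple integer ranks for Spearman's rho; Data Desk's implementations handle ties and are not shown here:

```python
def pearson(x, y):
    """Pearson product-moment correlation (linear association)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def kendall_tau(x, y):
    """Kendall's tau-a: concordant minus discordant pairs, over all pairs."""
    def sign(v):
        return (v > 0) - (v < 0)
    n = len(x)
    s = sum(sign((x[i] - x[j]) * (y[i] - y[j]))
            for i in range(n) for j in range(i + 1, n))
    return 2 * s / (n * (n - 1))

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rk, i in enumerate(order):
            r[i] = rk + 1
        return r
    return pearson(ranks(x), ranks(y))
```

On a monotone but curved relationship (e.g. y = x squared for positive x), tau and rho are exactly 1 while the Pearson correlation falls short of 1, illustrating the distinction drawn above.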

Special Features of Correlation Tables

Correlations are computed for all cases that have numeric data in both variables. Thus the correlations in the table may be for different numbers of cases, and for different cases, if data are missing for different cases in different variables. Pearson correlation is closely associated with linear regression. A convenient way to get to a regression is to make the offered scatterplot and then choose Regression from the scatterplot's HyperView menu. This has the advantage of offering a check for nonlinearity and outliers along the way. When developing a multiple regression model, it can be helpful to save the residuals and drag them into a correlation table of the available predictors. This will show which remaining predictors are correlated with the residuals and are thus good candidates to include in your model. When a variable is added to the model, the residuals will update and the correlation table will offer to update to reflect the change.
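The residual-screening idea in the paragraph above can be sketched as follows. This is a hypothetical helper, not a Data Desk feature: it ranks candidate predictors by the size of their correlation with the current residuals, which is what the correlation table shows you visually:

```python
import numpy as np

def next_predictor(residuals, candidates):
    """Rank remaining predictors by |correlation| with the residuals.

    `candidates` maps predictor name -> list of values. The first name
    returned is the strongest candidate to add to the model.
    """
    def corr(a, b):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float((a * b).mean())
    return sorted(candidates,
                  key=lambda k: -abs(corr(residuals, np.asarray(candidates[k], float))))
```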

This is a Dotplot side by side window. If it is empty, drag one or more quantitative variables into the window. Data Desk will make a vertical dotplot for each variable, all on the same vertical scale.

Working with Dotplots side by side

Dotplots and Boxplots work similarly. Switch between these two views with the Add Boxes/Remove Boxes command in the global HyperView menu. Generally, boxplots give a better overview of the relationship of the variables and dotplots show more detail and facilitate identifying and working with (e.g. assigning colors or symbols to) individual points.

Dotplot HyperViews

In addition to adding or removing boxes, the global HyperView offers a ZScore Transformation. This is useful when plotting variables that have very different scales. For each variable, Data Desk subtracts its mean and divides by its standard deviation. This makes it easier to compare distributions within the variables. Because these are different variables, each case is represented in each of the dotplots, so when a point is selected, it highlights in each dotplot. You can identify any point with the query (?) tool.

Learning from Dotplots

Dotplots offer a convenient comparison of the distributions of quantitative variables. Look for trends in the variability of the variables and for possible outliers. The vertical style of dotplots overprints multiple points if they appear at the same place. The ? tool will indicate the number of overprints and will identify them successively on multiple clicks. The Modify > Lines > Show Lines/Hide Lines command alternately draws lines that connect the points for each case or hides them. With lines shown, this is a kind of parallel coordinate plot.

This is a Frequency Breakdown window. If it is empty, drag a categorical variable into it. Frequency breakdowns show, for each category, the count of cases falling in that category, the percentage of cases in the category, and a cumulative percentage of cases. Drag a new variable over the variable name to replace it.

Working with Frequency Breakdowns

Click on the column heads to save and locate a hotresult variable holding the values of the column. As hotresults, these will update if the variable is changed or replaced. The HyperView attached to the variable name offers to make a bar chart or pie chart of the variable. HyperViews attached to any of the values in the table offer to select that category in all other displays and editing windows.

Frequency Breakdown HyperViews

The Frequency Options command in the global HyperView offers a variety of alternative calculations for a frequency window. These include the expected values and standardized residuals for a chi square test of the hypothesis of equal cell counts.
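For the hypothesis of equal cell counts, the expected values and standardized residuals mentioned above work out as in this sketch (function name illustrative):

```python
def equal_counts_test(counts):
    """Chi-square components for the hypothesis of equal cell counts.

    Each expected value is n/k; each standardized residual is
    (observed - expected) / sqrt(expected); the chi-square statistic
    is the sum of the squared residuals.
    """
    n, k = sum(counts), len(counts)
    exp = n / k
    resid = [(c - exp) / exp ** 0.5 for c in counts]
    return sum(r * r for r in resid), resid
```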

This is a histogram window. If it is blank, then drag a variable's icon into the window and drop it anywhere. The variable must contain at least one numeric case. To change the displayed variable, drag a new variable icon onto the axis label at the bottom of the plot, and drop it there.

How does it work?

The lasso, box, brush, and pointer tools select bars of the histogram. Hold Shift to select multiple bars. The knife tool selects ranges of bars. Selected cases link to other displays and to variable editing windows, and are selected there as well. Cases selected in other displays highlight in the histogram as a smaller histogram within the bars of the histogram. Drag the size box in the lower left to resize the window. Option-drag the size box to re-scale the plot.

HyperViews

Turn On Automatic Update redraws the histogram immediately whenever the data or selection change. Plot Scale lets you specify the plot scale numerically. Compute Bar Counts saves hotresult variables holding the left edge value of each bar, the bar counts, and the cumulative count (low to high).

What does it tell me?

Histograms show the distribution of a variable. Look for:
Shape: Is the histogram symmetric; does the right side approximately mirror the left? Does the histogram have one main mode (hump) or more than one? Are there any possible outliers standing away from the body of the data?
Center: Roughly what is the middle value?
Spread: How spread out are the values? (Does that central value summarize the overall values well?)
Alert: Rescaling can change the impression that you get from a histogram. Don't be overly impressed by minor variation in bar heights; those are probably not separate modes and may disappear with a different plot scale. If your histogram is skewed (lopsided), you may want to re-express your data. (See Histogram Tricks and Details.)

What next?

If the histogram shows two or more modes, consider whether they represent subgroups in your data that should be analyzed separately. It is usually worthwhile to identify such subgroups and figure out why they are different.

Histogram Tricks and Details

To find a re-expression that makes a skewed histogram more nearly symmetric, try the following:
Select the icon of your variable.
Choose Manip > Transform > Dynamic > Box-Cox. This will create a new derived variable and a slider.
Drag the derived variable into the histogram to replace the original variable. The histogram will not change at first.
Choose Turn On Automatic Update in the global HyperView menu.
Now, when you slide the slider, the histogram will show how the distribution changes with each re-expression power. Look especially at the "simple" powers of ½, 0, -½, and -1. (Note: the "0" power corresponds to taking logarithms.)
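The ladder of re-expression powers can be sketched in Python. This is a simplified stand-in for the Box-Cox slider, assuming positive data: power 0 takes logarithms, and negative powers are negated to preserve the order of the values:

```python
import math

def reexpress(values, power):
    """Ladder-of-powers re-expression for positive values.

    power 0 means natural log; negative powers are negated so that
    larger data values still map to larger re-expressed values.
    """
    if power == 0:
        return [math.log(v) for v in values]
    if power < 0:
        return [-(v ** power) for v in values]
    return [v ** power for v in values]
```

Sliding the power down the ladder (1, ½, 0, -½, -1) progressively pulls in a long right tail.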

This is a Linear Model window. If there is no analysis, drag your response variable into the indicated place at the top of the table. Drag your factor or predictor variables into the indicated place. At any time, you can drag additional variables into the window at these locations for either the factors or response. Click on the name of a variable to remove it from the model. There can be more than one response variable, in which case the analysis is a multivariate linear model, and the Type of analysis will indicate that. The response variable can be a binary categorical variable, in which case you should click on the type of analysis and choose Logistic. If all factors are categorical (discrete) and the response variable(s) are quantitative, the analysis is a multivariate ANOVA (MANOVA). If all factors are quantitative (continuous) and the response variable(s) are quantitative, the analysis is a multivariate regression. If the factors are a mix of quantitative and categorical and the response variable(s) are quantitative, the analysis is a multivariate analysis of covariance (ANOCOV). The multivariate linear model is an extraordinarily general analysis with many special versions. You may want to consult the Data Desk documentation for further information.

Working with Linear Models

In the Factors panel, specify for each factor whether it is a fixed or random effect and whether it is continuous (quantitative) or discrete (categorical). Indicate nesting by dragging a line between a factor and the parentheses next to the factor in which it is nested. Select either Type I (sequential) or Type III (partial) sum of squares. The Design Help button shows how to specify common designs. The Interactions Sub-panel allows you to call for all interactions up to a specified level, to select two terms and add their interaction, or to select and remove an interaction. It also computes and displays the maximum df available for each interaction (fewer df may ultimately be available as the model is computed due to collinearities and empty cells.), the basis for the expected mean squares, and the denominator to be used for the appropriate F-test. The Up and Down buttons allow re-ordering of factors and interactions for sequential sums of squares. The modifications panel accommodates a selector variable to analyze a subset of the data and a variance variable to make the analysis a weighted analysis. The Results panel opens to reveal output tables appropriate to the type of analysis. A sub-panel opens to offer coefficients, expected cell means, and post-hoc tests. The results for a multivariate analysis include a selection among the most common multivariate tests. For a multivariate analysis, the analysis for a specific response is itself an ANOVA or ANOCOV.

Linear Model HyperViews

The global HyperView menu offers a variety of diagnostic displays and calculations similar to those offered for regression. HyperViews on each variable offer histograms and normal probability plots. Each panel has a window icon. The HyperView attached to that icon offers to pull the panel out into a separate window (so the analysis can fit more easily on a small screen) and to make a static copy of it (for comparison with an alternative model.) ANOVA/ANOCOV panels behave in the same way as ANOVA windows. The tables of results for multivariate tests offer HyperViews with appropriate supplementary information such as eigenvalues associated with the selected factor.

This is a Pie Chart window. If it is empty, drag the icon of a variable into the window and drop it there. Drag a new variable into the window to plot it instead. A pie chart treats the displayed variable as naming categories even if its contents are numbers. It divides a circle into segments that correspond in size to the relative frequency of each category named in the variable. The segments are colored and a color key is provided to the right of the circle.

Working with Pie Charts

Select the cases in any sector of the pie chart by clicking on it or on its color square in the key. There is no need to choose a plot tool. Cases selected in a pie chart highlight in all other displays and editing windows.

Pie Chart HyperViews

The natural alternative display for the same variable is a Bar chart, which is offered in the HyperView. The natural associated table is a Frequency Table, also offered in the HyperView. The option of using patterns instead of colors is available, and useful for publication when colors will not be available.

Learning from Pie Charts

Pie charts have been maligned because the human eye finds it harder to perceive the relative size of angles than the relative heights of bars in a bar chart. However, they are compact and easily read. They are most useful when displaying a variable that divides some "whole" into segments.

Special Features of Pie Charts

The colors assigned to the pie chart slices are selected to be as different from each other as possible. They are the same colors that Data Desk will choose if the displayed variable is used to color points in another display with the Modify > Colors > Add Group command. Thus, a pie chart of the variable can serve as a quick key to the assigned colors.

This is a Principal Components window. If it is empty then drag some variables into the window individually or in groups.

Working with Principal Components

The principal components results report the eigenvalues, the eigenvectors, and an unrotated factor matrix.

Principal Components HyperViews

The global HyperView offers Principal Components Options. These include a choice of basing the analysis on correlations or on covariances. Generally, it is best to use correlations unless the variables are measured on comparable scales. The Options also offer a choice of results to be saved. Results are saved in a new folder that is placed in the Data folder found in the File icon at the upper right of the Data Desk window.

Learning from Principal Components

Locate the PC's folder in the Data folder of the File icon. Two folders hold the columns of the U and V' matrices of the Singular Value Decomposition (SVD) of the matrix made up of the columns of data; X = UDV' where D is a diagonal matrix of the singular values. A rotating plot of the columns of U is the same as a rotating plot of the columns of X (the original data) except for the orientation of the axes. For more than 3 variables, the rotating plot of the columns of U may show a more "interesting" orientation of the data.
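The decomposition can be sketched with NumPy. As an assumption for this sketch, the columns are centered first, the usual convention for principal components; Data Desk's correlation-versus-covariance option changes the scaling but not the structure:

```python
import numpy as np

def svd_components(X):
    """SVD of the centered data matrix: Xc = U D V'.

    Columns of V are the loadings (eigenvectors); the columns of U,
    scaled by the singular values in D, are the component scores.
    """
    Xc = X - X.mean(axis=0)                    # center each column
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U, d, Vt.T
```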

Special Features of Principal Components

The U and V columns are themselves derived variables, so you can open them to see the linear combinations of the argument variables.

This is a Linear Regression window. If it is empty, drag your response variable into the top line of the table where indicated. Drag predictor variables (either individually or as a group) into the bottom row of the table where indicated. When both parts of the analysis are specified, the regression is computed and the rest of the table will fill in. You can drag a new variable into the response variable row to replace the response variable. You can drag additional variables into the predictor area to add them to the model. Alternatively, you can drag a new variable directly over the name of an existing predictor to replace it in the regression model. Linear Regression fits a linear model to predict or describe the response variable in terms of the predictor variables. The values in the regression table are:

• The number of cases in the model. A case that is missing a numeric value in any of the variables in the model will be excluded from the calculation.
• R-squared, which gives the percentage of the variance of the response variable accounted for by the regression model.
• R-squared (adjusted), a value adjusted for the number of cases and variables, suitable for comparing regression models with different numbers of predictors.
• F-ratio, a global indicator of whether the response can be modeled by the predictors.

The final sub-table names each predictor and gives:


• its coefficient in the model,
• the estimated standard error of that coefficient,
• the t-ratio for testing the standard null hypothesis that the true value of the coefficient in this model is zero, and
• the p-value of that t-test.
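For readers who want to see where these values come from, here is a compact OLS sketch (illustrative only, with hypothetical names): it computes R-squared, adjusted R-squared, and the t-ratios for a design with an intercept:

```python
import numpy as np

def regression_table(X, y):
    """Headline regression values: R-squared, adjusted R-squared, t-ratios.

    X is an n x k matrix of predictors; a column of ones is added
    for the intercept. t-ratios test each coefficient against zero.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])       # design with intercept
    p = A.shape[1]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta                            # residuals
    sse = float(e @ e)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / sst
    r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
    se = np.sqrt(sse / (n - p) * np.diag(np.linalg.inv(A.T @ A)))
    return r2, r2_adj, beta / se                # t-ratios = coef / std error
```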

Working with Regressions

Regression is one of the most versatile statistical models and is widely used. Data Desk offers great flexibility for building, diagnosing, and understanding regressions. Drag potential predictor variables into your model to add them to the model or replace existing predictors. Remove predictors by clicking on them and choosing the Remove Predictor command. HyperView menus are attached to various parts of the regression table. The menu attached to variable names offers to locate or select the variables, display them with Histograms or Normal probability plots, or make a Scatterplot of the response variable against that predictor. The HyperView attached to each coefficient offers to make the partial regression plot that corresponds to that coefficient or to "drop" coefficients into hotresult variables that can be used in other calculations. The HyperView attached to the standard errors offers to drop the coefficients. The HyperView attached to each t-ratio offers a plot of studentized residuals or of residuals vs that variable, and also offers to drop the t-ratio in a hotresult. The HyperView attached to each p-value offers to drop the p-value as a hotresult. The HyperViews attached to the Sum of Squares values and the df values drop hotresults containing those values. They are provided primarily to be available for other calculations. The HyperView attached to the adjusted R-squared drops that value as a hotresult. This is provided primarily for use in automatically optimizing the regression model by maximizing the adjusted R-squared value.

Regression HyperViews

The global HyperView offers a variety of displays and diagnostic statistics:
Scatterplot residuals vs predicted values.
Scatterplot studentized residuals vs predicted values. This command computes the externally studentized residuals and plots them against the predicted values. Studentized residuals are adjusted to all have the same standard errors, so this plot may be more appropriate for assessing whether the regression assumption of constant variance around the model is satisfied.
Potential-Residual plot. This is a diagnostic plot that can help identify influential cases. It is a good idea to identify and understand any cases that stand apart from the rest of the data in this display.
Assign Variance Variable. To compute a weighted regression, select a variable that holds the estimated variances of the cases and then choose this command. The reciprocals of the variances will be the weights in the model.
Turn On Automatic Update. As in other Data Desk windows, this causes the regression to update immediately upon any change to a variable in the model. See the discussion of special features for some ideas on using this capability.
Compute> This is a submenu offering to compute a variety of diagnostic and related statistics. The computed statistics are saved as hotresult variables, so all will update if the regression model is updated. Computed statistics include: predicted values, residuals, leverages, externally studentized residuals, DFFITS, Cook's D, Hadi's influence, likelihood, and Mahalanobis distances (based on the predictors).
Prob Plot> This is a submenu offering to make a Normal Probability Plot of any of a variety of related diagnostic statistics. The available plots include: residuals, leverages, externally studentized residuals, Cook's D, Hadi's influence, and likelihood. For many of these measures the best indication of an extraordinary case is that it stands away from the other values. A Normal probability plot offers a good way to look for that and one in which individual cases can be easily identified with the ? tool.
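Two of the Compute> statistics, leverages and externally studentized residuals, can be sketched as follows using standard textbook formulas (Data Desk's own calculations may differ in detail, and the function name is illustrative):

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverages and externally studentized residuals for OLS.

    X is an n x k matrix of predictors; a column of ones is added for
    the intercept. Each studentized residual uses the error variance
    estimated with that case left out, so a single extreme case cannot
    inflate its own denominator.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T       # hat matrix
    h = np.diag(H)                              # leverages
    e = y - H @ y                               # ordinary residuals
    p = A.shape[1]
    sse = float(e @ e)
    # leave-one-out error variance for each case
    s2_i = (sse - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))             # externally studentized
    return h, t
```

The leverages sum to the number of fitted parameters, which is a handy sanity check.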

Learning from Regression

Regressions are fit for a variety of reasons and can be part of other analyses. If you are particularly interested in a coefficient in a multiple regression, you should make and examine the partial regression plot available by clicking on the coefficient. That plot displays the relationship between the response variable and the predictor in question after removing the linear effects of the other predictors in the model. You can interpret it as you would a simple scatterplot. You should identify and understand any cases that stand away from the body of the data, and you should be concerned if the relationship looks nonlinear. No regression is complete without an examination of the residuals, so either a plot of residuals vs predicted or of studentized residuals vs predicted is highly recommended.

Special Features of Regression

Many of Data Desk's abilities are particularly valuable when you build and interpret a linear regression. Consider some of the following options: If you have found (or suspect) a curved relationship between the response variable and the predictors (for example, if the scatterplot of residuals is curved), select the response variable. (You can do that from the HyperView attached to its name in the regression table.) Then choose Manip > Transform > Dynamic > Box-Cox. Data Desk will make a new derived variable and a slider. Drag the derived variable into the regression table to replace the original response variable. If you haven't already, make a scatterplot of the residuals and/or a Normal probability plot of the residuals. Set those plots to Automatic Update using the commands in their HyperView menus. Now, sliding the control on the slider will re-express the response variable. A slider value of 1 is the raw data. A value of ½ takes a square root. A value of 0 specifies a (natural) logarithm. -½ specifies a negative reciprocal root, and -1 specifies a negative reciprocal. As you slide, the regression model is continuously re-computed along with the residuals, predicted values, and any other statistics you have computed or plotted. Plots of those values set to automatic update will change smoothly to reflect the change. You may set the regression table itself to Automatic Update, but that isn't necessary.
With this trick, it is easy to see and understand the effects on your regression model of re-expressing the response variable and to find an optimal re-expression function.
A similar trick can help you to choose between two potential predictors. Choose both variables and select Manip > Transform > Dynamic > Mix X and Y.
Data Desk makes a derived variable and slider. The slider is bounded at 0 and 1. At 0 the derived variable is equal to the X variable. At 1 it is equal to the Y variable. Between those values it is a weighted combination of scaled versions of these variables. With this variable you can "slide" from one variable to the other and watch the consequences in the plots you have set to automatically update. This can be remarkably informative. You may, for example, see a cluster of points that move together in one of the plots, helping to identify those cases as related to each other.
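What the Mix X and Y slider computes can be sketched roughly as below. This version mixes standardized copies of the two variables, which is an assumption for illustration: Data Desk's exact scaling of the endpoints may differ, and the function name is hypothetical:

```python
def mix_xy(x, y, t):
    """Weighted mix of standardized copies of x and y.

    t = 0 gives (scaled) x, t = 1 gives (scaled) y, and values in
    between blend the two, as the slider does.
    """
    def scale(v):
        m = sum(v) / len(v)
        s = (sum((u - m) ** 2 for u in v) / (len(v) - 1)) ** 0.5
        return [(u - m) / s for u in v]
    xs, ys = scale(x), scale(y)
    return [(1 - t) * a + t * b for a, b in zip(xs, ys)]
```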
You may discover a case that you conclude should be omitted from the analysis. (Perhaps, for example, you identified the case with the ? tool and then used the Special > Web Search Query function to learn more about the case, concluding that it was in some important way different from the other cases.) One convenient way to set the case aside without losing it is to create a special indicator variable that is 1 for that case and 0 for all the others.
Select the case in any display.
Open a variable that names the cases (if you have one).
Choose Modify > Selection > Record as Indicators.
Data Desk will make an indicator variable for each selected case, naming the variable with the name of the case.
Now drag those indicators into the regression model to add them as predictors.
The effects of these cases will be removed from the analysis.
The p-value associated with the t-test on each coefficient is a statistical test of whether the case is in fact an outlier from the regression model.
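The indicator-variable trick can be sketched with NumPy (function names are illustrative). With the dummy column added, the flagged case's residual becomes zero, so that case no longer influences the fit of the other coefficients:

```python
import numpy as np

def add_case_indicator(X, case):
    """Append a 0/1 indicator column that is 1 only for row `case`."""
    n = X.shape[0]
    d = np.zeros((n, 1))
    d[case, 0] = 1.0
    return np.hstack([X, d])

def ols_residuals(X, y):
    """Residuals from an OLS fit with an intercept column added."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta
```

The t-test on the indicator's coefficient is then the outlier test the text describes: it asks whether that case's own offset from the model is significantly different from zero.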