Jeromy Anglim's Blog: Psychology and Statistics

Monday, November 2, 2009

Issues in Model Building and Parameter Estimation | Case Studies in Psychology

In this post I discuss issues related to Model Testing and Parameter Estimation. I focus on the role of this process in the scientific development of knowledge. I was motivated to write this post in an attempt to integrate the set of modelling issues that I was encountering across a range of psychological research topics including psychological tests, learning curves, social networks, and well-being. The post provides links to additional resources and presents some of my own observations on model testing and parameter estimation. I make the disclaimer that my ideas are still evolving on this topic.

  • Benjamin Bolker (Ecological Models and Data in R) provides a good introduction to modelling (an early copy of his book is online). Relative to some books on modelling, it is highly accessible. I recommend it particularly to researchers in psychology who want to branch into less standard modelling scenarios. The book assumes only a rusty understanding of calculus, probability, and statistics. It is written for a researcher who is more interested in theory testing than in modelling. It also has the benefit of showing how the ideas can be implemented in R. Thus, an interested researcher can gain hands-on practice. The whole book is relevant to the present discussion, but see particularly: introduction to modelling and optimisation.
  • I previously commented on Jones et al.'s Introduction to Scientific Programming and Simulation Using R. It provides a good introduction to modelling and simulation. Jones et al.'s book is more academic in its presentation style, whereas Bolker is more pragmatic and conversational.
  • I previously gave a very introductory talk on modelling for a psychological research audience.
  • Zhang provides a tutorial on parameter estimation with an emphasis on processes in vision. The tutorial characterises parameter estimation as made up of four optimisation problems: "criterion: the choice of the best function to optimize (minimize or maximize); estimation: the optimization of the chosen function; design: optimal design to obtain the best parameter estimates; modeling: the determination of the mathematical model which best describes the system from which data are measured." [italics added] - Zhang - A Glance over Parameter Estimation
I now present an assortment of points, observations, and case studies related to model testing and parameter estimation organised under Zhang's four headings of criterion, estimation, design, and modelling.

Criterion

I don't have much to say regarding the pros and cons of alternative loss functions. In many domains that interest me, default options are fairly well accepted. For example, nonlinear regression tends to use least squares. Confirmatory factor analysis tends to use maximum likelihood. I'd also be wary of models that yield vastly different parameter estimates based on different loss functions.

Outliers: Related to choosing a loss function to minimise is the issue of what to do with outliers. How should they be identified? What should be done with them once they are identified? Some methods involve altering or removing the outliers from the raw data that is to be modelled. Other approaches retain the data, but are less affected by outliers (e.g., minimising absolute error as opposed to squared error). I found Zhang's discussion of robust estimators useful. Dealing with outliers requires both pragmatic decisions (e.g., does it make a difference what is done?) and theoretical decisions (e.g., do outliers represent a process to be modelled?).
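
As a minimal sketch of the difference between squared and absolute error, consider estimating a single typical completion time. The data and function names below are made up for illustration. For one location parameter, the mean minimises summed squared error and the median minimises summed absolute error, so a single slow trial pulls the first estimate much further than the second.

```python
from statistics import mean, median

def squared_loss(estimate, data):
    """Sum of squared errors for a single location estimate."""
    return sum((x - estimate) ** 2 for x in data)

def absolute_loss(estimate, data):
    """Sum of absolute errors for a single location estimate."""
    return sum(abs(x - estimate) for x in data)

# Hypothetical task completion times in seconds; the last trial is an outlier.
times = [10.0, 11.0, 9.0, 10.0, 12.0, 40.0]

# The mean minimises squared loss; the median minimises absolute loss.
least_squares_estimate = mean(times)
least_absolute_estimate = median(times)

print(least_squares_estimate, least_absolute_estimate)
```

With these illustrative numbers the least-squares estimate sits well above every typical trial, while the least-absolute-error estimate stays among them.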

Case Study - Skill Acquisition: I am in the process of fitting functions to the learning curves of individuals. Outliers occur frequently. They can arise, for example, when a participant gets distracted or makes a major error. On these trials task completion time is substantially slower than is typically the case. I assume that the observed data are drawn from a mixture distribution. Most trials are drawn from one distribution, in which observed performance largely reflects the individual's skill and general level of effort. However, there is a second distribution, which applies when the individual makes a major error and which yields substantially slower task completion times. From examining individual differences it is evident that individuals differ in all of the factors stated above: some individuals are more skilled than others; some individuals make major errors more frequently than others; and some errors are more disruptive than others. This then raises the issue of how to identify outliers and what to do with them once they are identified. One argument is that outliers reflect a distinct process and should therefore be removed. Another argument is that outliers are an intrinsic part of performance that needs to be modelled.
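
The mixture idea can be sketched as a small simulation. Everything here is an assumption chosen for illustration rather than an estimate from my data: typical trials are drawn from a normal distribution around 10 seconds, and major-error trials add a uniformly distributed delay. Two hypothetical participants differ only in how often they make a major error.

```python
import random

def simulate_trial_times(n_trials, p_error, rng):
    """Draw completion times from a two-component mixture (hypothetical values)."""
    times = []
    for _ in range(n_trials):
        # Typical trials: performance reflects skill and general effort.
        time = rng.gauss(10.0, 1.0)
        if rng.random() < p_error:
            # Major-error trials: add a large, variable delay.
            time += rng.uniform(10.0, 30.0)
        times.append(time)
    return times

rng = random.Random(1)  # fixed seed so the sketch is reproducible
careful = simulate_trial_times(500, 0.02, rng)
error_prone = simulate_trial_times(500, 0.20, rng)

# Count trials slower than 15 seconds, far outside the typical distribution.
slow_careful = sum(t > 15.0 for t in careful)
slow_error_prone = sum(t > 15.0 for t in error_prone)
print(slow_careful, slow_error_prone)
```

Whether the slow trials are then discarded or modelled as a second component is the theoretical decision discussed above; the simulation only makes the assumed data-generating process explicit.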

Estimation

For many applications in psychology, software sorts out the estimation problem. For example, multiple regression, logistic regression, SEM in Amos, maximum likelihood in factor analysis, and multidimensional scaling in SPSS all tend to converge on a solution to parameter estimation using software defaults. Many of these estimation algorithms do use an iterative search over the parameter space, yet the default settings are often sufficient.

Yet, sometimes things go wrong with these standard analyses. The analyst is then confronted with messages about matrices being not positive definite, singular gradients, or under-identification. With many of the above-mentioned techniques there are fairly standard ways of resolving estimation problems, such as getting more cases, reducing the complexity of the model, or reducing the number of free parameters.

Case Study - Nonlinear Regression: However, there are some models where parameter estimation requires more thought. More recently I've been doing a lot of nonlinear regression. This has required me to think more deeply about: (1) the behaviour of a loss function over a parameter space; (2) generating starting values; and (3) what to do when a model fails to converge. I'll save discussion of this adventure for a separate post.
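
One simple way to explore a loss function over a parameter space and to generate starting values is a coarse grid search. The sketch below is an illustration of that idea only, not a description of the approach I ended up using. The three-parameter exponential curve, the parameter names, and the grid values are all assumptions, and the data are generated without noise from known parameters so the result can be checked.

```python
import math
from itertools import product

def exponential_curve(trial, asymptote, gain, rate):
    """Predicted completion time: asymptote + gain * exp(-rate * trial)."""
    return asymptote + gain * math.exp(-rate * trial)

def sum_squared_error(params, trials, times):
    """Least-squares loss for one candidate parameter vector."""
    return sum((y - exponential_curve(x, *params)) ** 2
               for x, y in zip(trials, times))

def grid_start_values(trials, times, grids):
    """Return the grid point with the smallest loss, for use as starting values."""
    return min(product(*grids),
               key=lambda p: sum_squared_error(p, trials, times))

# Noise-free hypothetical data generated from known parameters (5, 20, 0.3).
trials = list(range(1, 31))
times = [exponential_curve(x, 5.0, 20.0, 0.3) for x in trials]

grids = (
    [0.0, 2.5, 5.0, 7.5, 10.0],  # candidate asymptotes
    [10.0, 20.0, 30.0],          # candidate gains
    [0.1, 0.3, 0.5, 1.0],        # candidate rates
)
start = grid_start_values(trials, times, grids)
print(start)
```

With real, noisy data the best grid point would only be a rough starting value to hand to an iterative optimiser, and a model that still fails to converge from several such starts is a hint that the loss surface or the model itself needs a closer look.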

Design

Design is key: In psychological research, design is a major challenge. If adequate data are not collected, parameter estimation and model testing are difficult or even impossible.

Case study - personality and well-being: I worked on an article looking at the relationship between personality and well-being. The data included cross-sectional information measuring various dimensions of personality and well-being. The paper was motivated by questions concerning the nature of the causal system that underlies the relationship between personality, affective experience, life events, the life course, and well-being of various kinds. To what extent is the good life due to stable individual differences? To what extent is it a function of life experiences? And so on. We used a covariance modelling approach to compare the fit and parameter estimates that result from placing various constraints on the correlation matrix between personality variables and well-being variables. While the results highlighted several interesting unique pairings between Big 5 personality and types of well-being, the process also suggested several challenges of modelling psychological phenomena. First, the cross-sectional self-report measures represent a snapshot of a dynamic set of variables, only some of which are even conceptually measured by any given study. Second, self-report provides only one lens for viewing the system of variables of theoretical interest. Finally, the gulf between the theoretical system of variables and the measured variables is vast, and one way of viewing study design is as an attempt to minimise that gulf.

Case study - skill acquisition: I am interested in fitting various mathematical functions to the learning curve at the individual level. Several challenges arise in estimating the learning function at this level. In particular, substantial improvement often occurs in the first few trials of practice. The consequence is that this important early period of learning is measured less reliably than the relatively long period of slow or negligible improvement that happens later in practice. This helps to explain why many researchers shift to the group level of analysis, where the pooling of observations from multiple cases helps to improve the reliability of measurement of this early period. However, moving to the group level changes the phenomenon of interest. From a design perspective, having more difficult tasks with longer periods of training may increase the duration of the initial period of learning and thereby increase the reliability of estimating this initial learning period. Designs can also be made

Modelling

The correct model is not known: In Zhang's discussion the model is assumed to be known. This is rarely the case in psychology. From my own casual observations of fields like ecology, pharmacokinetics, and others, some models appear to be well established. While the degree to which a model testing framework has established itself varies across psychology, in my own experience, the amount of choice can become a hindrance. Thus, this section discusses issues related to choosing models, comparing models, evaluating models, and linking results back to theoretical testing.

Fit, parsimony, and theoretically meaningful parameters: As a starting point, I often say that a good model has: (a) good fit to observed data (i.e., small sum of squared residuals; large r-squared); (b) parsimony (i.e., few parameters that are fitted to data); (c) theoretically meaningful parameters. Some fit statistics combine parsimony and fit (e.g., AIC, BIC). I discuss theoretical meaningfulness below. And of course, this is just a starting point.
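
To make the trade-off between fit and parsimony concrete, here is a minimal sketch of r-squared together with the least-squares forms of AIC and BIC. These forms assume normally distributed errors and drop an additive constant that is the same for every model fitted to the same data. Conventions differ on whether the error variance is counted in the number of parameters k, so treat k here as an assumption. The residual sums of squares in the example are made up.

```python
import math

def r_squared(observed, predicted):
    """Proportion of variance in the observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

def aic_least_squares(ss_res, n, k):
    """AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(ss_res / n) + 2 * k

def bic_least_squares(ss_res, n, k):
    """BIC up to an additive constant: n * ln(RSS / n) + k * ln(n)."""
    return n * math.log(ss_res / n) + k * math.log(n)

# Hypothetical comparison on n = 50 observations: the complex model fits
# slightly better but uses three more fitted parameters.
n = 50
simple = {"ss_res": 120.0, "k": 2}
complex_ = {"ss_res": 118.0, "k": 5}

aic_simple = aic_least_squares(simple["ss_res"], n, simple["k"])
aic_complex = aic_least_squares(complex_["ss_res"], n, complex_["k"])
bic_simple = bic_least_squares(simple["ss_res"], n, simple["k"])
bic_complex = bic_least_squares(complex_["ss_res"], n, complex_["k"])
print(aic_simple < aic_complex, bic_simple < bic_complex)
```

In this made-up comparison the small improvement in fit does not pay for the extra parameters, so both statistics favour the simpler model (lower values are better).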

The role of theory: Theory says whether predictions made by a model outside the range of data are plausible. Theory influences which processes are of interest and which processes are deemed to be external to the domain of interest. Theory guides measurement.

Case study - skill acquisition: When modelling task completion time as a function of practice, any model that predicts a task completion time of less than zero is wrong. There will also be a range of task completion time predictions that are highly unlikely. For example, I recently studied time to complete various text editing changes over the course of practice. The quickest task completion time in the sample was around two seconds. Thus, predictions of times less than one second even after an infinite amount of practice are highly unlikely. However, it is also clear that defining the point where times move from plausibly quick to implausibly quick is subjective and difficult to define given only the data. I can think of a hypothetical experiment looking at 100 metre running times

Good models should generate theoretically plausible parameters. Thus, if there are parameters within the model that would yield a theoretically plausible model, but the model fitted to the data suggests implausible parameters, it is likely that the model is wrong in some way.
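
A plausibility check of this kind can be sketched in a few lines. The three-parameter power function, the parameter names, and both sets of "fitted" values below are assumptions for illustration. The one-second floor echoes the text editing example above and is a theoretical judgement, not something the data supply.

```python
def power_curve(trial, asymptote, gain, rate):
    """Predicted completion time: asymptote + gain * trial ** (-rate)."""
    return asymptote + gain * trial ** (-rate)

def plausibility_problems(asymptote, gain, rate, fastest_plausible_time):
    """List the reasons a set of fitted parameters looks theoretically implausible."""
    problems = []
    if rate <= 0:
        problems.append("times do not decrease towards the asymptote")
    if gain < 0:
        problems.append("performance gets slower with practice")
    if asymptote < 0:
        problems.append("asymptote is below zero")
    elif asymptote < fastest_plausible_time:
        problems.append("asymptote is faster than the plausible floor")
    return problems

FASTEST_PLAUSIBLE_TIME = 1.0  # seconds; a theoretical judgement call

good_fit = (2.5, 20.0, 0.5)   # hypothetical estimates: asymptote, gain, rate
odd_fit = (-3.0, 30.0, 0.2)   # hypothetical estimates with a negative asymptote

print(plausibility_problems(*good_fit, FASTEST_PLAUSIBLE_TIME))
print(plausibility_problems(*odd_fit, FASTEST_PLAUSIBLE_TIME))
```

The second set of parameters still predicts positive times over any realistic number of trials, for example `power_curve(100, *odd_fit)`, and only goes negative far beyond the range of the data, which is exactly the situation where theory rather than fit has to flag the problem.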

The data is rarely sufficient. In any psychological study a huge number of theoretically relevant variables are not measured. This may be because the variables could not be measured or just that they were not measured.

Purpose - Approximation or specification: Modelling is done for different purposes (see Bolker's discussion for some dimensions of modelling approach). From my own experience I have found the distinction between precise definition versus reasonable approximation an important one. In psychological data, many processes often produce the observed data. Modelling only the dominant process will yield imperfect fit. Assumed peripheral effects cause systematic deviations from the approximate model. Which model we should prefer depends on a number of factors.

Case study - CFA for Psychological Test Construction: Multiple-factor psychological scales are often modelled using confirmatory factor analysis. A standard model might posit that each item is a function of one latent factor and a unique and uncorrelated error component. Latent factors are typically allowed to correlate, and the unique error for each item is assumed to be uncorrelated with the other unique error terms (see the model on page 4 of the following as an example). However, in all cases that I have observed, even if this theoretical model is a good approximation, there are almost always smaller sources of systematic variance. This issue becomes more apparent when there are more items per scale (e.g., 10 items per scale instead of 3; note also that more items per scale is desirable for reliability and breadth of construct measurement). Such systematic sources of variance might be explained by cross-loadings (i.e., items that load on more than one latent factor), correlated unique errors (e.g., when items share some common words or are more alike than other items within a set), or by additional latent factors. Some may argue that these problems should be systematically removed from a scale. While I think this is a noble aim, if these sources are small, pursuing it may distract from how well the simpler model approximates the data. The quest for the perfect model of the data, involving correlated errors, cross-loadings, and new latent factors, opens a Pandora's box of options for model improvement that are less grounded in theory than the one-factor-per-item CFA. This then leads to issues of over-fitting. There also is rarely enough data to choose between models. My main point here is that sometimes it is more important to model the dominant process in the data, and that therefore a simpler model with poorer fit can still have value.
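
The standard model described above implies a specific covariance matrix for the items: the loadings times the factor covariances times the transposed loadings, plus a diagonal matrix of unique variances. The sketch below computes this for a hypothetical two-factor, six-item scale. All loadings and the factor correlation are made-up values, and the items are assumed to be standardised.

```python
def implied_covariance(loadings, factor_cov, unique_var):
    """Model-implied item covariances: loadings * factor_cov * loadings' + diag(unique_var)."""
    n_items = len(loadings)
    n_factors = len(factor_cov)
    sigma = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            common = sum(
                loadings[i][a] * factor_cov[a][b] * loadings[j][b]
                for a in range(n_factors)
                for b in range(n_factors)
            )
            # Unique errors are uncorrelated, so they only add to the diagonal.
            sigma[i][j] = common + (unique_var[i] if i == j else 0.0)
    return sigma

# Hypothetical simple structure: each item loads on exactly one factor.
loadings = [
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.6],
]
factor_cov = [[1.0, 0.3], [0.3, 1.0]]  # factors allowed to correlate
unique_var = [1 - row[0] ** 2 - row[1] ** 2 for row in loadings]

sigma = implied_covariance(loadings, factor_cov, unique_var)
print(round(sigma[0][1], 3), round(sigma[0][3], 3))
```

Under this model, the covariance between two items on different factors is fully determined by their loadings and the factor correlation. Observed covariances that depart systematically from these implied values are what cross-loadings, correlated unique errors, or extra factors would be brought in to absorb.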

Roberts and Pashler's Critique: Roberts and Pashler (2000, How Persuasive is a Good Fit?, p. 358) write: "A good fit reveals nothing about the flexibility of the theory (how much it cannot fit), the variability of the data (how firmly the data rule out what the theory cannot fit), or the likelihood of other outcomes (perhaps the theory could have fit any plausible result), and a reader needs all 3 pieces of information to decide how much the fit should increase belief in the theory." Roberts and Pashler (2000) suggest some better ways to assess theories involving parameter estimation: (1) "determine the predictions"; (2) "show the variability of the data"; (3) "show that there are plausible results the theory cannot fit." They summarise by encouraging researchers to more frequently ask the following questions: "what would disprove my theory?" and "what theories do these data rule out?"
