In an earlier post about when something is true, I said something is true when it explains and predicts measurable/observable reality consistently. My goal here is to discuss, broadly, how complicated measurement/observation really is. Throughout, you’re getting my personal opinion on a lot of issues in psychological science: I often take a critical position. Some scientists agree with me and some don’t, but either way, I wanted to be clear that these are my opinions.
There are enormous obstacles to discovering “core truths” about human psychology, including philosophical questions about whether such truths exist. Here, I take the position there are some kinds of truth we can learn, and I also think the obstacles are not impossible to overcome. I do think that we would all benefit from carefully considering some basic issues more thoroughly. We must constantly challenge the validity of our constructs, a fancy word that basically means “concept” (e.g., happiness is a construct, and so is extraversion). In particular, we must be critical of how we measure our constructs and how much information those measures really provide. If we don’t, the best case is that we’re making an imprecise triangulation of the truth; at worst, we’re describing things that may be mathematically consistent but don’t relate to actual human psychology, the thing we claim to be studying.
Observing simple gravity on Earth may seem straightforward: junk falls. But the concept of observation can get complicated. Consider how exactly you should measure the time it takes for objects to fall:
- Unit: are we measuring in seconds? minutes? jiffies?
- Observers: is the measurement collected by a person on the ground with a stopwatch? the person dropping the object? a sophisticated laser? How do we calibrate the equipment? If I want to perform the same study, how do I ensure my equipment is calibrated the same way as the original? Do we have a universal clock that everyone agrees measures the “correct” unit of seconds? That’s not a trivial problem! In modern times, we have ensembles of atomic clocks that effectively vote on the time and determine it through consensus.
- Outcomes: are we measuring time until first contact with the ground? What counts as “first contact”? Electrons don’t actually touch in day-to-day life, and touching becomes something of a non-concept at small scales anyway. The range of electromagnetic influence is also infinite, so objects are theoretically always in some kind of “contact.” We’ll probably have to pick some boundary that counts as a finish line, and people may argue about how we select that boundary or whether it’s meaningful. This is even more of an issue with psychological constructs: the boundaries of what we mean are very important.
- Difficulty of observation: some events are rare; some are incredibly difficult to measure. Imagine needing to build miles of particle accelerator just for a hope of measuring tiny, tiny particles we haven’t seen before. Maybe. In the psychological domain, observation is tricky because people often know they’re being observed!
For observations of gravity at the scale of human-sized reality, physicists have most of these issues solved. I think what often separates psychological science from the hard sciences is that we struggle with all of these at once. And we never have perfect experimental control: I’m not allowed to assign my participants to random parenting conditions at birth, and I can’t randomize their political beliefs or their religion either. Here are some other examples:
- We’re assuming that a lot of abstract concepts are real and CAN be measured (e.g., that “love” is a thing, that I can measure how much of it you feel, and that it’s distinct from the idea of liking).
- Just because we can construct sentences we understand doesn’t mean they have a clear definition. E.g., I want to study “good conversations.” Easy to say, but what does that mean? Is the quality of a conversation determined by the people talking or by “objective” outside observers? Is it measured by how much they like each other at the end? Whether they talk again later? The length of time they spoke? How do we factor in the “depth” of the conversation? Does what counts as a “good conversation” depend on demographics (cultural differences related to generation, location, shared history)? Maybe there’s no such thing as a good conversation at all: imagine two people who went through a conversation and hated it. Perhaps I could take two new people, have them engage in exactly the same conversation, and these new people would absolutely love it. So maybe it’s really about the people, not the conversation.
- There are standard units of time, but we lack truly standardized units for many of our phenomena (what is the standard unit of feeling happy? I vote for 7.5 smiley faces out of 7.5). This complicates comparing studies that use different kinds of measures. It also doesn’t help that we know people respond to scales differently (asking on a 1 to 7 scale can produce quite different answers than 1 to 100).
- On that point, math doesn’t always make sense for our measurements: is a 10 on a happiness scale 2 times as much as a 5? Rarely, maybe never. Does one person’s 10 equate to another person’s 10? Probably not. So person A’s 5 may actually be “better” than person B’s 6. Even within a person, how consistently do they use the scale? Is my 5 on Monday truly the same as my 5 on Saturday?
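The last two bullets can be sketched in code. Suppose (purely hypothetically — the response curves below are invented for illustration) two people share the same underlying 0-to-1 “latent happiness” but map it onto a 1-to-7 scale differently: one rates harshly, one leniently. Then one person’s 5 really can reflect a stronger feeling than the other person’s 6.

```python
# Hypothetical response curves: how each person converts a latent feeling
# (0 = none, 1 = maximum) into a 1-to-7 scale rating. The exponents are
# made up; the point is only that the mappings differ between people.

def harsh_rater(latent):
    # Person A compresses the top of the scale (needs a lot of feeling
    # before giving high numbers)
    return 1 + 6 * latent ** 1.5

def lenient_rater(latent):
    # Person B inflates ratings (small feelings already earn high numbers)
    return 1 + 6 * latent ** 0.5

# Invert each curve: what latent feeling produces a given rating?
latent_for_a5 = ((5 - 1) / 6) ** (1 / 1.5)  # A reports a 5 -> ~0.76
latent_for_b6 = ((6 - 1) / 6) ** (1 / 0.5)  # B reports a 6 -> ~0.69

print(f"A's latent feeling behind a rating of 5: {latent_for_a5:.3f}")
print(f"B's latent feeling behind a rating of 6: {latent_for_b6:.3f}")
# A's "5" reflects a stronger latent feeling than B's "6"
```

So under these (made-up) response curves, comparing raw numbers across people quietly assumes everyone uses the scale the same way, which is exactly the assumption the bullet above is questioning.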
That’s just a sampling of what makes psychological science hard, and some people think it’s impossible. As someone who does psychological research, I currently don’t think it’s impossible, but I’m sympathetic: we haven’t always done a good job of tackling these foundational issues. In fairness, they are really difficult to solve. But still, as the people claiming to do psychological science, it’s literally our job to solve them.
If you’re into this kinda thing, these problems actually all relate to core statistical and research concepts.
- Content and construct validity: do our observations (measurements) actually capture the construct/concept we are trying to capture, or are they missing pieces of it? It would be weird, for example, to claim I’m measuring weight by only asking for your height. There actually is a relationship between weight and height, so at least I’m not asking about something totally irrelevant like your favorite color. But clearly I’m missing the most important aspect of the thing I want to study, and there are ways in which height relates to stuff that weight doesn’t. By using imperfect measures of the thing we really mean, we may be misled. Personally, I believe this is a big problem in social psychology: e.g., we study “culture” mostly by asking for people’s ethnic ancestry or country of origin. This doesn’t necessarily mean the results are invalid: even mediocre measurements still provide “signal” about the truth, provided you gather a lot of data. But I’ll have to talk about that later.
- Test-retest Reliability: if you say your personality on some scale is a 7 today, by the theory of personality, you should give a very similar answer tomorrow. And five weeks from now. And two months from now. Similarly, a scale that measures “weight” but randomly adds or subtracts 0 to 100 pounds is not a very good scale. It would be hard to interpret a single measurement (though you could still interpret the average of many measurements). The same applies to our measures. They need to be consistent to be interpretable.
- Internal reliability: if we’re averaging a bunch of things together in some way, do those things actually go together? E.g., if I’m measuring your height with two different measuring tapes, they may disagree a little (measurement error), but I’m pretty confident they’re measuring the same basic concept (distance), and therefore it’s fair to average the two answers together into a single (ideally more accurate) score. Internal reliability, however, is a mathematical kind of reliability: we calculate how reliable our scales are, and mathematical reliability doesn’t mean the scale is a good scale. Sometimes people forget this.
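The unreliable bathroom scale from the test-retest bullet is easy to simulate (all numbers below are invented). A single reading is nearly useless, but because the noise averages out, the mean of many readings converges on the true weight — which is also why even noisy measures can carry “signal” if you gather a lot of data.

```python
import random

random.seed(1)

TRUE_WEIGHT = 150  # hypothetical true weight in pounds

def bad_scale():
    # Unreliable scale: adds uniform noise between -100 and +100 pounds
    return TRUE_WEIGHT + random.uniform(-100, 100)

one_reading = bad_scale()
many_readings = [bad_scale() for _ in range(10_000)]
average = sum(many_readings) / len(many_readings)

print(f"single reading: {one_reading:.1f} lb")        # could be wildly off
print(f"average of 10,000 readings: {average:.1f} lb")  # close to 150
```

The catch, of course, is that you rarely get 10,000 measurements of the same person, and averaging only cancels noise that is random rather than systematic.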
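Internal reliability is usually quantified with Cronbach’s alpha, which compares the items’ individual variances to the variance of their sum: alpha = k/(k−1) × (1 − Σ item variances / variance of total). A toy simulation (the sample size, loadings, and noise level are all made up): four items driven by one shared latent trait produce a high alpha. Note, though, that alpha would be just as high even if that latent trait weren’t the construct we claim to be measuring — which is the “mathematical reliability isn’t validity” point above.

```python
import random

random.seed(2)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Simulate 500 people answering 4 items that all tap one latent trait
n, k = 500, 4
latent = [random.gauss(0, 1) for _ in range(n)]
items = [[latent[i] + random.gauss(0, 0.8) for i in range(n)]
         for _ in range(k)]

# Each person's total score across the 4 items
totals = [sum(item[i] for item in items) for i in range(n)]

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)
alpha = k / (k - 1) * (1 - sum(variance(it) for it in items)
                       / variance(totals))
print(f"Cronbach's alpha: {alpha:.2f}")  # high: the items "go together"
```

The items cohere mathematically because they share *some* common cause; alpha alone can’t tell you whether that common cause is the thing named on the questionnaire.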
I can only speak about social psychology, but we all tend to study our own things. So we have similar constructs but we call them different names and use different measures. Sometimes our measures have passed some test of validity and reliability, but sometimes they’re still just convenient instead of accurate (my pet peeve is treating race as an indicator of culture in the US; another pet peeve is that I dislike the phrase “pet peeve”). With the proliferation of technology and computing, I think we can do a lot better than most people realize. It’s just a matter of introducing these advances to a field that isn’t primarily made of statisticians or computer scientists.
Maybe someday I’ll actually take on the job of sifting through our myriad measures and seeing what’s what myself. I actually think that’d be pretty neat. And as a final thought, even if I came across as very critical, I actually am fairly optimistic that these problems can be solved. Whether they will be is up to us (the scientists). And I think we are improving over time. Just like there’s a lag between scientific advances and communication to the broader public, there’s a lag between advances within psychological science and communication to other psychologists.