Thursday, June 15, 2017

N-best evaluation for hiring and promotion

How can we create incentive-compatible evaluation of scholarship? Here's a simple proposal, discussed around a year ago by Sanjay Srivastava and floated in a number of forms before that (e.g., here):
The N-Best Rule: Hiring and promotion committees should solicit a small number (N) of research products and read them carefully as their primary metric of evaluation for research outputs. 
I'm far from the first person to propose this rule, but I want to consider some implementational details and benefits that I haven't heard discussed previously. (And just to be clear, this is me describing an idea I think has promise – I'm not talking on behalf of anyone or any institution).

Why do we need a new policy for hiring and promotion? How do two conference papers on neural networks for language understanding compare with five experimental papers exploring bias in school settings or three infant studies on object categorization? Hiring and promotion in academic settings is an incredibly tricky business. (I'm focusing here on evaluation of research, rather than teaching, service, or other aspects of candidates' profiles.) How do we identify successful or potentially successful academics, given the vast differences in research focus and research production between individuals and areas? Two different records of scholarship simply aren't comparable in any sort of direct, objective manner. The value of any individual piece of work is inherently subjective, and the problem of subjective evaluation is only compounded when an entire record is being compared.

To address this issue, hiring and promotion committees typically turn to heuristics like publication or citation numbers, or journal prestige. These heuristics are widely recognized to promote perverse incentives. The most common, counting publications, leads to an incentive to do low-risk research and "salami slice" data (publish as many small papers on a dataset as you can, rather than combining work to make a more definitive contribution). Counting citations or H indices is not much better – these numbers are incomparable across fields, and they lead to incentives for self-citation and predatory citation practices (e.g., requesting citation in reviews). Assessing impact via journal ranks is at best a noisy heuristic and rewards repeated submissions to "glam" outlets. Because they do not encourage quality science, these perverse incentives have been implicated as a major factor in the ongoing replicability/reproducibility issues that are facing psychology and other fields.

Details of the proposed N-best policy. Evaluation processes for individual candidates should foreground – weight most heavily – the evaluation of N discrete products, with N a number varying by career stage. I'm referring to these as "products" rather than "papers," because it's absolutely critical to the proposal that they can be unpublished manuscripts (e.g., preprints), thesis chapters, datasets, or other artifacts, when these are appropriate contributions for the position. All other aspects of an applicant's file, including institutional background, lab, other publication record, and letters of recommendation, should serve as context for those pieces of scholarship. And these products must be read by all members of the evaluating committee.

How does this proposal differ from current evaluation standards? Evaluation for other qualities – teaching, service – will still be critical, especially if these are part of the position description. I'm only talking about research. But the N-best policy would change current procedures for research evaluation in a number of respects:
  • Currently papers are solicited, but there is no expectation that they will be read. Under this policy, it would be required that they be read by all committee members.
  • The CV is currently the primary assessment tool, rewarding publishing a lot and in high-profile outlets. Under this policy, the CV would be explicitly de-emphasized, and references to "productivity" would not be allowed in statements about hiring.* 
  • Under current standards, letters of recommendation are solicited for hiring and promotion, but there is little guidance on what these letters should do (other than sing the applicant's praises or compare the applicant in general terms to other scholars). Under N-Best, the explicit goal of letters would be to contextualize the submitted scholarship and its contributions to the broader research enterprise, so as to mitigate the problem of scholarship outside of the committee's expertise. Other general statements, e.g. about productivity, brilliance, etc. would be discounted. 
  • We currently weigh job talks very heavily in hiring decisions. This practice leads to strong biases for good presenters. Under N-best, the goal would be to use interviews/job talks to assess the quality of the submitted research products. If the evaluation of a small number of distinct research findings is the nexus of the assessment, then what someone wore or whether they were "charismatic" (e.g., good looking and friendly) becomes a bit harder to confuse with the task at hand. 
What should N be? In a single-parameter model, setting that parameter correctly is critical. Here are some initial thoughts for reasonable parameter values at an R1 university.
  • Postdoc hiring: 1 - 2 products. Having done two good projects in a PhD is enough to show that you are able to initiate and complete projects. Some PhDs only yield a single product, but typically this product will be comprehensive or impressive enough that its contribution should be clear. 
  • Applicants for tenure-track, research-intensive positions: 3 products. Three good products seems to me enough to give some intimation of a coherent set of methods and interests. A research statement will typically be necessary for contextualizing this work and describing future plans. 
  • Tenure files: 4 - 5 products. If you have done five really good things, I think you deserve tenure at a major research university. Some of these products could even review other work, giving the opportunity to foreground a synthetic contribution or a broader program of research; this synthesis could also be the function of a research statement.
How can we apply this policy fairly to large applicant groups? Many academic jobs receive hundreds of applicants, but some heuristics could make the reading load bearable (if still heavy): 
  • The whole committee need only read products from a short list; for the full set, reading work can be divided amongst members. 
  • The committee can ask for ranked products, so only the first is assessed in a first pass.
  • The committee can rank applicants on explicit non-research criteria (e.g., area of interest, teaching, service, etc.) prior to evaluating research to narrow the set of candidates for whom papers must be read.
There's still no question that this policy will result in more work. But I'd argue that this work is balanced by the major benefits that accrue. 

Benefits of the N-best policy. The primary benefit is that evaluation will be tied directly to the quantity that we want to prioritize: good science. If we want more quality science getting done, we need to hire and promote people who do that science, as opposed to hiring and promoting people who do things that are sometimes noisily correlated with doing good science (e.g., publishing in top tier journals). Unfortunately, that means that we need to read their papers and use our best judgment. Our judgment is unlikely to be perfect, but I don't think there's any reason to think it's worse than the judgment of a journal editor!

A second major benefit of N-best is that – if we're actually reading the research – it need not be published in any particular journal. It can easily be a preprint. Hence, N-best incentivizes preprint posting, with the concomitant acceleration of scientific communication. Of course, publication outlet will still likely play a role in biasing evaluation. But good research will shine when read carefully, even if it's not nicely typeset, and candidates can weigh the prospect of being evaluated on strong new work against the risks.

A final benefit of N-best and the use of preprints in the N is for early career researchers. Under the current laggy publication system, a typical US PhD student will have to have finished a project by the end of their third year or before in order for the paper to be accepted by the time they are looking for postdocs in the middle of their fifth year. This process can of course be sped up by luck, but it can also be slowed down such that even early work from a PhD doesn't "come out" until years later (both happened in my case!). Such small sample fluctuations caused by slow journals or chance rejections can shape students' careers dramatically (and often not for the better). Explicitly allowing preprints allows candidates to be evaluated in terms of the work they've done, not the work they've pushed through the publication process.**

Conclusions. If we want to hire and promote good scientists, we need to read their science and decide that it's good.*** 

* I've certainly said that candidates are "productive" before! It's an easy thing to say. Productivity is probably correlated at some level with quality. It's just not the same thing. If you can actually assess quality, that's what you should be assessing.

** E.g., work they've gotten past the slow judgment of an editor who should be writing decision letters instead of blog posts!

*** Any other evaluation will fall prey to Goodhart's law.


  1. Thanks for a very interesting post. I agree with the benefits you've described. I thought it would be worthwhile to point out a potential set of costs. Obviously a key question is who gets to decide what constitutes "good science". At one extreme, it might be the president of the university. A risk of adopting this procedure is that it will bias hiring towards fields (or individual research programs) that are easy to understand and appreciate, and away from those that are more inaccessible to a broad audience (but no less valuable). At the opposite extreme, evaluations of "good science" might be left to those individuals who are closest to the relevant field / program of research. But this approach introduces the risk that personal relationships, nepotism, rivalry etc. will infect the hiring process. Hiring decisions will frequently be made by somebody who was an advisor to, collaborator with, competitor of, etc., the candidates in question.

    This tension need not doom the N-best approach, and some potential solutions come to mind. There is a goldilocks approach where "good science" is evaluated by those who are close-but-not-too-close. There is a democratic approach where it is evaluated by a mixture of the close and the far. There is a recusal approach where people with prior cooperative or competitive relationships are not allowed to influence the process. Combine all of these, and what you get looks an awful lot like current tenure and promotion practices.

    A second set of concerns with the "best science" approach is that the answer to the question "is X better science than Y" is relatively more likely to be infected by predictable biases (gender, race, nationality, halo, etc.) than the answer to the question "is X citations more citations than Y"? (Of course, there are plenty of documented ways in which the number of citations is, itself, subject to the same pernicious biases).

    In any event, in light of the inherent tensions in deciding who gets to make the determination about what is "good science", and the potential role of bias, it becomes clearer why relatively objective, numeric measures based on citation would have been chosen: not merely to save time and mental energy on the part of evaluators, but because they could be viewed as a more objective and reliable process. Or at least as a check on the system that asks, "Does a relatively objective metric concur with our 'best science' evaluation?"

    -Fiery Cushman

  2. Hi Fiery, thanks very much for the comment and for engaging! I agree with you that context is important in assessing scientific quality. That's why I'm suggesting that letters and job talks be focused on providing that kind of context. Citation numbers can also be useful for this purpose when used appropriately.

    But I disagree that current hiring/promotion looks like this. First, we tend to assess people, not products. This leads to bias at the level of gender, race, looks, etc. Second, letters and talks are not used to assess or contextualize individual pieces of work as consistently as they should be. Third, citation numbers are used inappropriately, with little awareness of the need for controls for subfield, publication date, etc.

    I am responding to some generalizations from the replication crisis: people who have published highly impactful research can look like stars, but if you examine individual papers in a journal club, the research will come apart at the seams. Put a group of smart people in a room for a guided discussion of a couple of papers and, my contention is, you will often come out with a very different assessment of a body of work than if you read a CV, read some general praise in letters, and watch a nice, well-practiced talk.

  3. Hi Michael, this is an immensely sensible proposal, which would address many of the perverse incentives in the current system for evaluating academics for hiring and promotion.

    It's interesting to note that the Research Excellence Framework (REF) in the UK uses almost exactly the system you propose. The REF is a 5-yearly exercise in which all research-active staff (faculty and senior researchers) at all UK universities are evaluated. Everyone submits their top 4 outputs in a specified 5-year window; these outputs are then evaluated (effectively peer-reviewed) by a panel of experts in the discipline, and a rating between 0 and 4 stars is assigned to each researcher based on this evaluation. The overall evaluation of a department is then computed as an aggregate of these individual star ratings. Crucially, government funding is then allocated based on this aggregate.

    The next REF will happen in 2020, and most of the process will be the same as the one I just described, but the number of outputs evaluated per researcher will vary (probably between 2 and 6, with an average of 4). This is also something you suggest.

    The important point is that the REF evaluation panels are explicitly discouraged from using proxy metrics such as impact factors and citation counts; they are instructed to evaluate each output on its quality. Also, outputs do not have to be publications; they can be preprints, but also datasets, software, patents, etc.

    Needless to say, evaluating every researcher at every university is an immense effort, and it's a very costly exercise. However, it's one that has driven up overall research quality, in my view. Not only because it incentivises researchers to aim for quality rather than quantity, but also because hiring and promotion committees apply REF criteria in their decisions.

    1. Frank, this is very interesting, thank you!

      I wasn't familiar with all the details of the REF, and I didn't know that the committees actually read the materials. (I knew that the assessments were controversial, though).

      I guess it's a separate question how often such assessments should be done and how funding should be allocated - but it would be interesting to hear people's experiences with the actual judgements that are made by the committees.

    2. Talking to REF panel members, what happens is that each member is assigned a stack of submitted outputs to read and assign star rankings to. Of course this isn't full peer-review, and it's not blind, and we can't be sure that panel members won't be influenced by the prestige of publication outlets, citation counts, etc. However, in principle their remit is to judge the quality of the science underlying the submitted output, independently of where it has been published.

      You are right, the REF is controversial for various reasons (it puts a huge burden on universities, it effectively creates a "transfer market" for top academics, departments try to game the system in various ways, "impact" is now a criterion in addition to excellence, etc.). The way panels operate, however, is not generally a controversial aspect of the REF.

      Anyway, given that hiring committees in the UK are guided by REF results, this is in some ways a real-life experiment with n-best evaluation along the lines you suggest.

    3. Thanks, that's fascinating context! I appreciate you sharing. I would love to hear others' thoughts on how REF committees operate - perhaps there are even some thoughts about how evaluation results compare to traditional metrics (e.g., correlation, or lack thereof, with H-index and citations), though I'm sure data confidentiality is a huge issue.

  4. Hi Mike. Fascinating as usual, and I agree with your description of some of the ills of our current system. But I'm not yet convinced by your proposed solution. Because for postdocs, and especially for tenure track hires, we are not trying to evaluate the quality of the work the candidate has done, but the quality of the work that the candidate will do. I think that's the reason for the premium on job talks, chalk talks, and letters of recommendation. Because the quality of past work is only noisily correlated with the quality of future work, and subject to major confounds, including the unknown contribution of the PI. So we look to letters to describe the independence of the candidate; and to the talks for evidence of the ability to craft and develop a future-oriented coherent scientific path. Skills as a communicator are not irrelevant, either; a PI needs to be able to communicate to their own lab, their broader audience, and the public, in order to succeed at their job. I am sure we do this evaluation noisily at best; but I'm not sure reading papers would help us do it any better.
    --Rebecca Saxe

  5. Hey Rebecca, thanks very much for engaging (and as always great to hear from you)! There are a couple of points here:

    1. Talks as evidence for future plans. I agree that we're trying to judge future success, but what makes us think that there is any correlation between talk quality and future productivity? I guess the thought is that we judge the ideas in the talk and decide if they are good. My worry here is that we can do that just as well from a research statement, but without the bias. And the research statement is much closer to the eventual evaluation metric for an academic - which is almost all written. If we want people who write good papers/grants etc. then we should hire people who are good at precisely those things. Adding in the signal from the talks just adds bias - and other skills, see below. (I do agree that letters can contextualize the candidate's contributions to prior work/independence in doing that work, I think that's important).

    2. Communication skills. I agree that the best way to judge someone's communication skills is to see them communicate! But in principle we could actually separate that from research productivity, e.g. having people do a guest teaching slot on material that they didn't produce. That would perhaps be more probative about their general science communication abilities (and many teaching schools do something similar). The worry is that by conflating communication and research, we can often get worse research that's communicated better (perhaps by someone more charismatic and fitting with our own biases).

    Overall, this argument comes from my own feeling that I'm a poor "judge of character" when it comes to research, and that others may not be much better. So I'm looking for something a little more structured than the "I'll know it when I see it" that feels like what we do now in our holistic evaluations.