Action selection requires a policy that maps states of the world to a distribution over actions. The amount of memory needed to specify the policy (the policy complexity) increases with the state-dependence of the policy. If there is a capacity limit for policy complexity, then there will also be a trade-off between reward and complexity, since some reward will need to be sacrificed in order to satisfy the capacity constraint. This paper empirically characterizes the trade-off between reward and complexity for both schizophrenia patients and healthy controls. Schizophrenia patients adopt lower complexity policies on average, and these policies are more strongly biased away from the optimal reward-complexity trade-off curve compared to healthy controls. However, healthy controls are also biased away from the optimal trade-off curve, and both groups appear to lie on the same empirical trade-off curve. We explain these findings using a cost-sensitive actor-critic model. Our empirical and theoretical results shed new light on cognitive effort abnormalities in schizophrenia.

People diagnosed with schizophrenia are typically less willing to exert cognitive and physical effort to obtain rewards.

One obstacle to a unified understanding of cognitive effort abnormalities in schizophrenia is the heterogeneity of the constructs.

The N-back task, in contrast, is effortful in the sense that it taxes representational resources needed for storing information in memory. In other words, it incurs an informational cost.

We pursue this hypothesis further using a different task and a theoretical framework that makes the informational costs explicit. Collins and Frank (2012) introduced the task we analyze here, in which subjects learn stimulus-response mappings under varying set sizes, taxing the interplay between working memory and reinforcement learning.

Gershman (2020) developed a rate distortion framework for policy complexity, which provides the theoretical foundation for our analyses.

A key goal of this paper is to understand to what extent differences in cognitive effort between patients and controls, as well as differences between individuals within these groups, can be understood as a rational trade-off. Specifically, an individual may choose to avoid cognitive effort based on their subjective preference for reward relative to the effort cost. Observing that schizophrenia patients exert less effort does not allow us to say whether they perceive cognitive effort as more costly relative to reward, or whether they are failing to optimize the trade-off between reward and effort. In the latter case, schizophrenia patients may in fact be willing to exert more effort, but they fail to identify their subjectively optimal level of effort. Rate distortion theory provides us with the theoretical tools to address how close schizophrenia patients and healthy controls are to the optimal reward-complexity trade-off. If they adhere closely to the optimal trade-off curve, then we have a basis for claiming that any differences in policy complexity between the two groups reflect a rational trade-off.

All code and data to reproduce the analyses in this paper can be obtained at:

We model an agent that visits states (denoted by $s$) and chooses actions (denoted by $a$) according to a policy $\pi(a|s)$. The policy complexity is the mutual information between states and actions:

$$I(S;A) = \sum_{s} P(s) \sum_{a} \pi(a|s) \log \frac{\pi(a|s)}{P(a)},$$

where $P(a) = \sum_{s} P(s)\pi(a|s)$ is the marginal probability of choosing action $a$.

The agent’s goal is to earn as much reward as possible, subject to the constraint that the policy complexity cannot exceed a capacity limit. Formally, the resource-constrained optimization problem is defined as follows:

$$\pi^* = \operatorname*{argmax}_{\pi} V^{\pi} \quad \text{subject to} \quad I(S;A) \leq C,$$

where $V^{\pi} = \sum_{s} P(s) \sum_{a} \pi(a|s) Q(s,a)$ is the average reward under policy $\pi$, $Q(s,a)$ is the expected reward for taking action $a$ in state $s$, and $C$ is the channel capacity.

This constrained problem can be converted into an unconstrained one by forming a Lagrangian with Lagrange multipliers $\beta$ (enforcing the capacity constraint) and $\lambda(s)$ (enforcing normalization of the policy in each state).

The optimal policy thus takes the form of a softmax function, with a frequency-dependent bias (perseveration) term:

$$\pi^*(a|s) \propto \exp\left[\beta Q(s,a) + \log P^*(a)\right].$$

The Lagrange multiplier $\beta$ governs the trade-off: its inverse, $\beta^{-1}$, is the marginal increase in reward obtained per unit increase in policy complexity. Geometrically, this is the slope of the optimal reward-complexity curve for a particular resource constraint (see below).

The perseveration term implicitly depends on the optimal policy:

$$P^*(a) = \sum_{s} P(s) \pi^*(a|s).$$

To find the optimal policy, we can use a variation of the Blahut-Arimoto algorithm from rate distortion theory, which alternates between updating the policy given the marginal action distribution and updating the marginal given the policy, until convergence.
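This alternating scheme can be sketched as follows (our illustration under standard policy-compression assumptions; function and variable names are ours, not the paper's code):

```python
import numpy as np

def blahut_arimoto(Q, p_s, beta, n_iter=200):
    """Optimize a capacity-limited policy for trade-off parameter beta.

    Q:   (n_states, n_actions) expected reward matrix
    p_s: (n_states,) state distribution
    Alternates the softmax policy update and the marginal-action update.
    Returns the policy, its average reward, and its complexity (nats).
    """
    p_a = np.full(Q.shape[1], 1.0 / Q.shape[1])  # initial marginal over actions
    for _ in range(n_iter):
        # policy update: pi(a|s) proportional to P(a) * exp(beta * Q(s,a))
        logits = beta * Q + np.log(p_a)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # marginal update: P(a) = sum_s P(s) * pi(a|s)
        p_a = p_s @ pi
    V = np.sum(p_s[:, None] * pi * Q)                 # average reward
    I = np.sum(p_s[:, None] * pi * np.log(pi / p_a))  # policy complexity
    return pi, V, I
```

Sweeping `beta` from 0 upward traces out the optimal reward-complexity curve: at `beta = 0` the policy ignores the state entirely (zero complexity), and as `beta` grows the policy approaches the unconstrained reward-maximizing one.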

The previous section presented a computational-level account of policy optimization under an information-theoretic capacity limit. For convenience, we assumed direct access to the reward function, and computed the optimal policy using the Blahut-Arimoto algorithm. However, these idealizations are not plausible as process models. Real agents need to learn the reward function from experience, and the Blahut-Arimoto algorithm may be computationally intractable when the state space is large (because it requires marginalization over all states according to Eq. 8).

To derive a more cognitively plausible process model, we start from the observation that the Lagrangian optimization problem in Eq. 5 can be expressed in terms of an expectation over states:

$$\mathcal{L}(\pi) = \mathbb{E}_{P(s)}\left[\sum_{a} \pi(a|s) \left(\beta Q(s,a) - \log \frac{\pi(a|s)}{P(a)}\right)\right].$$

This formulation allows us to construct an “actor-critic” learning rule using the stochastic policy gradient algorithm. The actor is a parametrized policy with the same softmax form as the optimal policy,

$$\pi_{\theta}(a|s) \propto \exp\left[\beta \theta(s,a) + \log P(a)\right],$$

where $\theta(s,a)$ denotes the actor parameters and $P(a)$ is the agent’s running estimate of the marginal action distribution.

Given an observed reward $r$, the agent computes a cost-sensitive prediction error,

$$\delta = r - \beta^{-1} \log \frac{\pi_{\theta}(a|s)}{P(a)} - V(s),$$

where the middle term is the complexity cost of the chosen action and $\delta$ is the prediction error of the “critic.” The critic’s value estimate is updated according to

$$V(s) \leftarrow V(s) + \alpha_{V} \delta,$$

with critic learning rate $\alpha_{V}$. The actor parameters are updated by stochastic gradient ascent,

$$\theta(s,a') \leftarrow \theta(s,a') + \alpha_{P} \delta \beta \left[\mathbb{1}(a' = a) - \pi_{\theta}(a'|s)\right],$$

with learning rate $\alpha_{P}$.

We fit four free parameters, including the learning rates $\alpha_{V}$ and $\alpha_{P}$, to each subject’s choice data.
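To make the cost-sensitive actor-critic concrete, here is a minimal one-step sketch. The parameter names $\alpha_{V}$, $\alpha_{P}$, and $\beta$ follow the text, but the task, the marginal-tracking rule, and the exact update expressions are our illustrative reconstruction, not the fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_actor_critic(reward_fn, n_states=3, n_actions=3, n_trials=3000,
                     alpha_V=0.1, alpha_P=0.1, beta=3.0):
    """Cost-sensitive actor-critic on a one-step task (illustrative sketch)."""
    theta = np.zeros((n_states, n_actions))    # actor parameters
    V = np.zeros(n_states)                     # critic value estimates
    p_a = np.full(n_actions, 1.0 / n_actions)  # marginal action distribution
    for _ in range(n_trials):
        s = rng.integers(n_states)
        logits = beta * theta[s] + np.log(p_a)  # softmax with perseveration bias
        pi_s = np.exp(logits - logits.max())
        pi_s /= pi_s.sum()
        a = rng.choice(n_actions, p=pi_s)
        r = reward_fn(s, a)
        cost = np.log(pi_s[a] / p_a[a]) / beta  # complexity cost of this choice
        delta = r - cost - V[s]                 # cost-sensitive prediction error
        V[s] += alpha_V * delta                 # critic update
        # actor update: policy-gradient step on the chosen state's parameters
        theta[s] += alpha_P * delta * beta * ((np.arange(n_actions) == a) - pi_s)
        p_a = 0.99 * p_a + 0.01 * pi_s          # slowly track the marginal
    return theta, V, p_a

# deterministic toy task: action s is correct in state s
theta, V, p_a = run_actor_critic(lambda s, a: float(s == a))
```

Because the complexity cost is subtracted inside the prediction error, learned values are diminished for high-complexity policies, which is exactly the property tested in the model-comparison analysis below.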

If individuals can be characterized by a fixed channel capacity, then the trade-off parameter $\beta$ should not itself be fixed: it must adapt so that policy complexity stays within capacity as task demands change. We therefore also consider an adaptive model in which $\beta$ is adjusted trial by trial.

where $\beta_{0}$ is the initial value of $\beta$.
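One simple gradient rule for adapting $\beta$ toward a fixed channel capacity $C$ is the following (our illustration; the meta-learning rate $\alpha_{\beta}$ and the running complexity estimate $\hat{I}_t$ are our notation, not necessarily the paper's exact update):

```latex
\beta_{t+1} = \beta_{t} + \alpha_{\beta} \left( C - \hat{I}_{t} \right)
```

When the running complexity estimate exceeds capacity, $\beta$ decreases and the policy flattens toward the marginal action distribution; when there is slack, $\beta$ increases, permitting a more state-dependent policy.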

We also examined a reduced-form variant of the adaptive model in which we fix $\alpha_{P}$.

We will refer to the three model variants as the fixed model, the adaptive model, and the reduced adaptive model.

We applied the theory to a data set originally reported in Collins et al. (2014).

Two groups of subjects (schizophrenia patients and healthy controls) completed the experiment. The schizophrenia group (henceforth denoted SZ) consisted of 49 people (35 males and 14 females) with a DSM-IV diagnosis of schizophrenia.

To construct the empirical reward-complexity curve, we computed for each subject the average reward and the mutual information between states and actions. From the collection of points in this two-dimensional space, we could estimate an empirical reward-complexity curve. While there are many ways to do this, we found 2nd-order polynomial regression to yield a good fit. To estimate mutual information, we used the technique introduced by Hutter.
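For illustration, here is a simple plug-in estimate of the state-action mutual information from a count matrix. The Hutter estimator used in the paper is a Bayesian treatment that corrects for the bias of finite samples; this sketch omits that correction:

```python
import numpy as np

def plugin_mutual_information(counts):
    """Plug-in MI estimate (in nats) from a state-by-action count matrix."""
    joint = counts / counts.sum()            # empirical joint P(s, a)
    p_s = joint.sum(axis=1, keepdims=True)   # marginal over states
    p_a = joint.sum(axis=0, keepdims=True)   # marginal over actions
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(joint / (p_s * p_a))
    return np.nansum(terms)                  # zero-count cells contribute 0
```

A subject whose choices are fully determined by the stimulus attains the maximum complexity, $\log$ of the set size; a subject who ignores the stimulus attains zero.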

How close are subjects to the optimal reward-complexity trade-off curve?

Another aspect of bias, captured in

We now turn to the critical question raised in the Introduction: do subjects in the two groups occupy different points along the same trade-off curve, or do they occupy different trade-off curves? To answer this question, we fit a parametric model (2nd-order polynomial regression) to the reward-complexity values, separately for the two groups and for each set size. This modeling demonstrated that the two groups have essentially the same trade-off curves. We show this in two ways. First, none of the parameter estimates differ significantly between groups for any of the set sizes.

As a first step towards understanding why the empirical and optimal trade-off curves diverge, we simulated a process model of policy optimization (see Materials and Methods). This model is a cost-sensitive version of the actor-critic model that has been studied extensively in neuroscience and computer science. The key idea is that the agent is penalized for policies that deviate from the marginal distribution over actions (i.e., the probability of taking a particular action, averaging over states). This favors less complex policies, because the penalty will be higher to the extent that the agent’s policy varies across states. Mechanistically, the model works like a typical actor-critic model, with the difference that the policy complexity penalty is subtracted from the reward signal.

We fit the fixed actor-critic model to the choice data using maximum likelihood estimation, and then simulated the fitted model on the task, applying the same analyses to these simulations.

We found a significant difference between groups for two parameters. First, the inverse temperature ($\beta$) was lower in the schizophrenia group. Second, the actor learning rate ($\alpha_{P}$) was higher in the schizophrenia group.

Putting these various observations together, we conclude that the deviation from the optimal trade-off curve exhibited by subjects (particularly those with low policy complexity) can be explained as a consequence of suboptimal learning. This suboptimality is more pronounced in the schizophrenia group, which had higher actor learning rates that in turn produced greater bias. This fits with the theoretical observation that convergence of actor-critic algorithms depends on the actor learning much more slowly than the critic.

Since we are making claims about learning, it is important to validate that our model captures the key patterns in the empirical learning curves.

Next, we address the assumption that $\alpha_{P}$ is held fixed.

Finally, we address whether the cost term in the prediction error (Eq. 12) is necessary to quantitatively model the data: is value updating sensitive to policy complexity? To answer this question, we fit a variant of the model without the cost term. Bayesian model comparison strongly disfavored this model (PXP close to 0). Thus, value updating does indeed seem to be sensitive to policy complexity, such that high complexity policies diminish the learned value.

In this paper, we analyzed data from a deterministic reinforcement learning task in which the number of stimuli (the set size) varied across blocks. Both schizophrenia patients and healthy controls achieved reward-complexity trade-offs that were strongly correlated with the optimal trade-off curve, but nonetheless deviated from the optimal curve for subjects with low complexity policies. In general, schizophrenia patients had lower complexity policies and hence were more biased away from the optimal curve. However, both groups of subjects appeared to lie on the same empirical reward-complexity curve. In other words, even though the schizophrenia patients were more biased than healthy controls, they did not exhibit a qualitatively different reward-complexity trade-off.

One implication of this conclusion is that insensitivity to reward in schizophrenia might reflect a quasi-rational trade-off rather than a cognitive impairment.

How exactly does the brain solve the optimization problem? The problem is intractable for large state spaces, necessitating approximate algorithms. In particular, we formalized an actor-critic model that optimizes the cost-sensitive objective function based on trial-by-trial feedback. This model builds on earlier actor-critic models of reinforcement learning in the basal ganglia.

The finding that learning rates are elevated in schizophrenia is unusual, given that past reinforcement learning studies have not reported such a finding.

While our model was able to capture several key aspects of the experimental data, it failed to completely capture other aspects. In particular, we found that (1) the model did not learn as quickly as human subjects, and (2) the model did not capture the growth of bias with set size. These mismatches suggest that other modeling assumptions may be necessary to fully characterize performance in this task.

Unlike earlier models of memory-based reinforcement learning applied to the same data, our model does not posit a separate capacity-limited working memory store; instead, the capacity limit applies directly to the policy itself.

As discussed in the Introduction, policy complexity is one of several forms of cognitive effort that have been studied in schizophrenia patients. Some earlier work operationalized cognitive effort in terms of task difficulty.

As pointed out by Culbreth et al. (

For this analysis, we made weak assumptions about the form of the empirical trade-off curve by using linear interpolation. Later, we adopt stronger parametric assumptions.

Holding set size constant, average reward for an individual subject is also higher on blocks in which the subject has higher policy complexity (positive average Spearman rank correlation across subjects).

We chose to use the BIC rather than the Akaike Information Criterion (AIC) to score models because we found that AIC performed worse at model recovery, exhibiting a bias towards the independent model even when simulated data were generated by the joint model.
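The two criteria differ only in how they penalize the parameter count $k$ given $n$ observations and maximized log-likelihood; a generic sketch (not tied to any particular model here):

```python
import numpy as np

def bic(log_lik, k, n):
    """Bayesian Information Criterion; the penalty grows with log(n)."""
    return k * np.log(n) - 2.0 * log_lik

def aic(log_lik, k):
    """Akaike Information Criterion; constant penalty of 2 per parameter."""
    return 2.0 * k - 2.0 * log_lik
```

Lower scores are better. Once $n > e^{2} \approx 7.4$, the BIC's per-parameter penalty $\ln n$ exceeds the AIC's constant 2, so the BIC more strongly disfavors heavily parametrized models.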

The apparent drop in bias at set size 6 is likely a statistical fluke, because rerunning the simulation with different random seeds frequently eliminates the drop.

Although Collins and colleagues implement a slot model, they are not explicitly committed to a slot assumption.

We are indebted to Anne Collins for making her data available. This research was supported by the Center for Brains, Minds and Machines (funded by NSF STC award CCF-1231216) and a Graduate Research Fellowship from the NSF.

The authors have no competing interests to declare.