There's been much grumbling about how the RLHF done with ChatGPT seems to have collapsed it into a bland writing style ('corporate speak'). Do you think a different persona could have been much less bland as a writer, or thanks to the inherent limitations of RLHF, just boring/limited in a different way? (e.g. I know CharacterAI was struggling with effusively friendly 'villain' chatbots)

nostalgebraist:

It’s hard to speak with any confidence on questions like this. There’s a lot about RLHF that we just don’t know yet.

However, I don’t think this is merely a result of the “persona” assigned to ChatGPT.

Why not? Because the same problem afflicts other preference-tuned models that don’t “have a persona” in the way ChatGPT (and Claude) do.

If you ask text-davinci-003 to write fiction, it tends to use a disappointingly bland style, much like ChatGPT when asked the same thing.

text-davinci-003 was tuned with RLHF to “follow instructions” in a “helpful, truthful, and harmless” manner. (Cf. the annotator instructions in Figure 10 here.)

However, it wasn’t tuned to roleplay a character with a specific persona. text-davinci-003 doesn’t say things to you; it doesn’t talk about itself; it just writes the text you asked for in your instruction.

Which OpenAI models have this problem? An incomplete list, from my own brief tests:

  • Pure language models like davinci and code-davinci-002 do not have the problem.
  • (Despite the name, code-davinci-002 in particular is great at creative writing. code-davinci-002 is probably the best OpenAI API model overall, if you know what you’re doing.)
  • text-davinci-002 has the problem. It was tuned on a similar dataset to text-davinci-003, but with a non-RLHF method (“FeedME,” essentially supervised finetuning on highly rated samples).
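The FeedME-style selection mentioned above can be sketched in a few lines. Everything here is a toy illustration (the sample names, the 1–7 rating scale, and the threshold are all made up, and real ratings come from human labelers, not a random function); the point is just that selection happens before finetuning, with no reward model or RL step:

```python
import random

random.seed(0)

def toy_rate(sample):
    # Stand-in for a human rating on a 1-7 scale.
    # In the real setup this would be a labeler's judgment.
    return random.uniform(1, 7)

# Pretend these are model completions for various prompts.
samples = [f"completion_{i}" for i in range(1000)]

# FeedME-style selection: keep only highly rated samples, then
# finetune on them with the ordinary language-modeling loss
# (no reward model, no RL).
RATING_THRESHOLD = 6.0
finetuning_set = [s for s in samples if toy_rate(s) >= RATING_THRESHOLD]

print(f"kept {len(finetuning_set)} of {len(samples)} samples for finetuning")
```

The upshot: FeedME and RLHF consume the same kind of human preference data, which is why the blandness showing up in both is evidence about the data rather than the tuning algorithm.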

So I think the problem results from the human preference data used to tune the instruction-tuned models.

This is not entirely distinct from the “persona” we see in ChatGPT:

  • The preference data encourages responses that are “helpful, truthful and harmless”
  • The persona is something like “a friendly chatbot programmed to be helpful, truthful and harmless”

But the evidence above shows that the friendly chatbot character isn’t necessary for the problem. Tuning to encourage “helpful, truthful and harmless” instruction-following is apparently sufficient.

Presumably, there is some way to collect preference data that doesn’t make the model less creative, or less capable of stylistic variety, when it’s tuned on that data? There are finetuned models that don’t have this problem, so it isn’t the mere act of finetuning that causes it; it’s something about the data used.

So the obvious explanation is that blandness is low-variance. Exactly how that would cause blandness to reach fixation, I’m unsure. If you rule out anything rated as bad by 10% of raters, you get outputs that are palatable to more than 90% of them, but those outputs are probably rated worse in quality than unfiltered outputs simply sorted by average rating.
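The argument above can be checked with a toy simulation. All the numbers here are assumptions chosen to illustrate the mechanism, not measurements of any real rater pool: a “bland” piece gets mediocre but tightly clustered scores, a “bold” piece gets better scores on average but with much higher variance across raters. The >10%-disapproval filter then keeps the bland piece and rejects the better one:

```python
import random
import statistics

random.seed(0)

def rater_scores(mean, spread, n_raters=100):
    # Toy model: each rater's score is drawn around the piece's
    # "true" quality. Numbers are made up purely for illustration.
    return [random.gauss(mean, spread) for _ in range(n_raters)]

# Bland: mediocre but inoffensive (low variance across raters).
# Bold: better on average but divisive (high variance).
bland = rater_scores(mean=6.0, spread=0.5)
bold = rater_scores(mean=7.5, spread=3.0)

BAD_CUTOFF = 5.0  # a score below this counts as a "bad" rating

def frac_bad(scores):
    return sum(s < BAD_CUTOFF for s in scores) / len(scores)

# The filter from the text: rule out anything rated bad by >10% of raters.
for name, scores in [("bland", bland), ("bold", bold)]:
    passes = frac_bad(scores) <= 0.10
    print(f"{name}: mean={statistics.mean(scores):.2f}, "
          f"frac_bad={frac_bad(scores):.2f}, passes filter: {passes}")
```

Under these assumed distributions, the filter selects the piece with the lower mean rating, which is the fixation-of-blandness worry in miniature.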

I guess this suggests aggregating preference data as something non-boolean, and perhaps permitting things with a bimodal rating pattern as long as the ratio between the strength of the positive reactions and the strength of the negative reactions is high enough. Sounds tricky and underdefined, though.
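One way to make that rule concrete, as a sketch only (the 1–7 scale, the neutral midpoint, and the ratio threshold are all arbitrary choices, and this is one possible aggregation rule, not a tested method): score each rating by its signed distance from a neutral midpoint, and accept a sample when total positive intensity sufficiently outweighs total negative intensity:

```python
NEUTRAL = 4.0  # midpoint of an assumed 1-7 scale

def accept(ratings, ratio_threshold=3.0):
    # Non-boolean aggregation: weigh how strongly people liked it
    # against how strongly people disliked it, rather than just
    # counting dissenters.
    positive = sum(r - NEUTRAL for r in ratings if r > NEUTRAL)
    negative = sum(NEUTRAL - r for r in ratings if r < NEUTRAL)
    if negative == 0:
        return positive > 0
    return positive / negative >= ratio_threshold

# Divisive: most raters love it, a minority strongly dislikes it.
# A >10%-disapproval filter would reject this; the ratio rule accepts it.
divisive = [7, 7, 7, 6, 7, 6, 7, 1, 2]

# Panned: the negative reactions dominate, so the ratio rule rejects it.
panned = [2, 3, 2, 6, 2, 3]

print(accept(divisive))
print(accept(panned))
```

The underdefined part the text flags is real: the midpoint, the intensity measure, and the threshold all have to come from somewhere, and each choice changes which bimodal samples survive.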

ebbythemes