Step-DPO

Step-wise Direct Preference Optimization

Natural Language Processing | Introduced 2000 | 2 papers