When fine-tuning large language models like Google’s T5, choosing the right optimizer and learning rate is crucial for achieving good performance. AdaFactor is an optimizer known for its memory efficiency, which makes it attractive for large models. However, users sometimes encounter unexpected issues when adopting AdaFactor, particularly around learning rates. This article explores a common problem encountered when fine-tuning the original T5 model with AdaFactor in the Hugging Face Transformers library and the AWS SageMaker environment.
The issue arises when users attempt to switch from the default AdamW optimizer to AdaFactor. The Hugging Face Trainer class simplifies this process with the adafactor = True argument in Seq2SeqTrainingArguments. This flag automatically swaps out AdamW for AdaFactor, as seen in the Trainer’s optimizer-creation code:
optimizer_cls = Adafactor if self.args.adafactor else AdamW
if self.args.adafactor:
    optimizer_cls = Adafactor
    optimizer_kwargs = {"scale_parameter": False, "relative_step": False}
else:
    optimizer_cls = AdamW
    optimizer_kwargs = {
        "betas": (self.args.adam_beta1, self.args.adam_beta2),
        "eps": self.args.adam_epsilon,
    }
optimizer_kwargs["lr"] = self.args.learning_rate
if self.sharded_ddp == ShardedDDPOption.SIMPLE:
    self.optimizer = OSS(
        params=optimizer_grouped_parameters,
        optim=optimizer_cls,
        **optimizer_kwargs,
    )
else:
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
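Tracing through this logic, with adafactor = True the non-sharded branch boils down to the following construction (a paraphrase of the code above for illustration, not additional library source):

# Equivalent construction when args.adafactor is True (non-sharded branch):
self.optimizer = Adafactor(
    optimizer_grouped_parameters,
    scale_parameter=False,       # disable AdaFactor's parameter-scale-based LR scaling
    relative_step=False,         # disable AdaFactor's built-in relative-step schedule
    lr=self.args.learning_rate,  # use the externally supplied learning rate
)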
By also setting learning_rate = 1e-3 in Seq2SeqTrainingArguments, users expect to configure AdaFactor with this learning rate, just as they would with AdamW. The expectation is that this setup works seamlessly, without manually configuring a separate learning rate scheduler such as AdafactorSchedule.
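For reference, here is a minimal sketch of that intended configuration. The argument names come from the Transformers API; the output path, model, and datasets are placeholders:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Minimal sketch of the intended setup; paths, model, and datasets are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-finetuned",   # placeholder path
    adafactor=True,                # swap AdamW for AdaFactor
    learning_rate=1e-3,            # intended AdaFactor learning rate
    per_device_train_batch_size=8,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,                   # e.g. an original t5-base checkpoint
    args=training_args,
    train_dataset=train_dataset,   # placeholder
    eval_dataset=eval_dataset,     # placeholder
)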
However, in practice, especially within environments like AWS SageMaker and the Hugging Face Training DLC, some users have observed a perplexing issue: the training logs in CloudWatch report a learning rate stuck at zero. Concurrently, the evaluation loss (eval_loss) remains consistently high and unchanging across evaluation steps. This indicates that the model is not learning effectively, despite the intended optimizer and learning rate settings.
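One way to narrow this down is to check whether the zero learning rate is only a logging artifact or the optimizer’s actual state. A hypothetical sanity check, assuming access to the Trainer instance once the optimizer has been created (for example, after training has started):

# Hypothetical sanity check: print what AdaFactor actually holds per parameter group.
# PyTorch optimizers expose their constructor defaults through param_groups.
for group in trainer.optimizer.param_groups:
    print(group["lr"], group["scale_parameter"], group["relative_step"])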
Interestingly, switching back to AdamW with a learning rate of 1e-4 or 5e-5 often resolves the issue, leading to successful fine-tuning. This raises questions about the interaction between AdaFactor, the original T5 architecture, and the Hugging Face implementation, particularly in distributed training environments.
One hypothesis is that the observed behavior might be specific to the original T5 base model and not applicable to later T5 versions like T5 v1.1, mT5, or ByT5. These later versions might have architectural differences or pre-training schemes that make them more compatible with AdaFactor out-of-the-box within the current Hugging Face Transformers framework.
Further investigation and community input are needed to fully understand this issue. Is AdaFactor’s default configuration in Hugging Face Transformers less suitable for the original T5 architecture? Are there specific learning rate ranges or other hyperparameters that need adjustment when using AdaFactor with the original T5? Sharing experiences and insights on fine-tuning Google T5 with AdaFactor, especially concerning learning rate configurations, can help the community develop best practices and avoid potential pitfalls.
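For anyone running such experiments, the Adafactor docstring in Transformers lists a full set of explicit hyperparameters for external-learning-rate use, which makes a reasonable starting point. A sketch follows; the 1e-3 value matches the constant fine-tuning rate used in the T5 paper, but all of these should be treated as values to tune rather than verified optima:

from transformers.optimization import Adafactor

# Explicit external-LR configuration, following the Adafactor docstring.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)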