This page provides resources and code for the research paper "LEVER: Learning to Verify Language-to-Code Generation with Execution". LEVER is a simple and effective approach to improving code generation with code language models (CodeLMs): a learned verifier uses the execution results of CodeLM-generated programs to verify and rerank them, which substantially improves the accuracy and reliability of the generated code. Combined with Codex (code-davinci-002), LEVER achieves state-of-the-art (SOTA) results on the Spider, WikiTableQuestions, GSM8k, and MBPP benchmarks.
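To make the core idea concrete, here is a schematic sketch (plain Python, not the repository's actual API) of the reranking rule described in the paper: each candidate program is scored by the product of the generator's probability and the verifier's probability, and candidates that execute to the same result pool their scores. All field and function names below are illustrative.

```
# Schematic sketch of LEVER's execution-aware reranking (illustrative only;
# this is not the repository's API). Each candidate carries the generator's
# probability, the verifier's probability, and its execution result.
from collections import defaultdict

def rerank(candidates):
    """candidates: list of dicts with keys 'program', 'exec_result',
    'lm_prob', 'verifier_prob' (all names are placeholders)."""
    # Pool the joint scores of candidates that execute to the same result.
    pooled = defaultdict(float)
    for c in candidates:
        pooled[c["exec_result"]] += c["lm_prob"] * c["verifier_prob"]
    best_result = max(pooled, key=pooled.get)
    # Return the highest-scoring program among those producing the best result.
    return max(
        (c for c in candidates if c["exec_result"] == best_result),
        key=lambda c: c["lm_prob"] * c["verifier_prob"],
    )["program"]
```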
Stay Updated with LEVER
Latest Updates
- 2023-07-05: Explore our interactive online demo now available on Huggingface Spaces! Experience LEVER in action and test its capabilities firsthand.
- 2023-07-04: LEVER model weights (verifiers trained on Codex outputs) for all four benchmark datasets are now available on Hugging Face. Find these resources in the Model Checkpoints section below.
- 2023-07-03: Initial code release for LEVER is now public. Access the codebase to implement and experiment with LEVER.
- 2023-04-24: LEVER has been accepted to ICML’23! The paper will be presented at the International Conference on Machine Learning in 2023.
These updates showcase the ongoing development and recognition of LEVER in the machine learning community.
Getting Started with LEVER
To utilize LEVER, follow these setup instructions to install the necessary environment and download the required data.
Installation Guide
- Environment Setup: It's highly recommended to create a dedicated conda environment to manage dependencies and avoid conflicts with other Python projects. Use the following commands to create and activate a new conda environment named `lever`:

```
conda create -n lever python=3.8
conda activate lever
```

This ensures a clean and consistent environment for running LEVER.
- Dependency Installation: Navigate to the LEVER directory in your terminal and install all required Python packages using pip. The dependencies are listed in the `requirements.txt` file:

```
pip install -r requirements.txt
```

This command installs all libraries needed to run LEVER.
Platform Note: The pipelines have been tested on Linux machines. If you are using a different operating system, you may need to build the `tree-sitter` parsers for your platform.
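For reference, the parsers can be built from the Python bindings. The sketch below is a minimal example; the grammar checkout path is an assumption (clone the grammars you need first), and `Language.build_library` is only available in older `py-tree-sitter` releases (it was removed in 0.22+).

```
# Minimal sketch for building a tree-sitter parser library on a non-Linux
# platform. The grammar checkout path below is an assumption; clone the
# grammar(s) you need and point to them. Requires py-tree-sitter < 0.22,
# where Language.build_library is still available.
from tree_sitter import Language

Language.build_library(
    "build/my-languages.so",         # platform-specific shared library to produce
    ["vendor/tree-sitter-python"],   # path(s) to cloned grammar repositories
)
```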
Data Acquisition
To reproduce the results presented in the paper, you need to download the verification data.
License Information: The verification data is shared under the CC-BY-NC 4.0 license. Please adhere to the terms of this license. For the original datasets (Spider, WikiTableQuestions, GSM8k, MBPP), please refer to their respective dataset pages for licensing and download instructions.
- Download Data: Download the LEVER verification data from this link.
- Create Data Directory: In the main `lever` directory, create a folder named `data`, navigate into it, and unzip the downloaded file there:

```
cd lever
mkdir data
cd data
unzip lever_verification_data.zip
```
After these steps, your `data` directory structure should resemble the following:

```
data
├── gsm8k
│   └── ...
├── mbpp
│   └── ...
├── spider
│   └── ...
└── wikitq
    ├── wikitq_codex_verification_dev.jsonl
    ├── wikitq_codex_verification_test.jsonl
    └── wikitq_codex_verification_train.jsonl
```
This structure ensures that LEVER can correctly locate and access the necessary data files.
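As an optional sanity check that the data landed in the right place, a short snippet like the one below can read the first record of one of the files listed above. Only the file path comes from the layout; no assumptions are made about the record fields.

```
# Optional sanity check: read the first record of one verification file and
# list its keys. The path comes from the directory layout above; the fields
# themselves are not assumed here.
import json
from pathlib import Path

path = Path("data/wikitq/wikitq_codex_verification_dev.jsonl")
with path.open() as f:
    first = json.loads(f.readline())
print(f"{path}: keys of first record -> {sorted(first.keys())}")
```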
Optional Configuration
Enhance your LEVER experience with these optional setups for API keys and experiment tracking.
- OpenAI API Key (For Codex): If you plan to use Codex models with LEVER, you need to set up your OpenAI API key. You can either add the following line to your `~/.bashrc` file (or equivalent shell configuration file) or include it directly in your inference commands:

```
export OPENAI_API_KEY=<your key, should start with "sk-">
```

Replace `<your key, should start with "sk-">` with your actual OpenAI API key.

- Experiment Logging with Weights & Biases (W&B): For experiment tracking and logging, LEVER supports integration with Weights & Biases.
- W&B Account: First, you need to set up a W&B account. Follow the instructions here to create an account and log in via the command line.
- Configuration: To enable W&B logging, prepend `export EXP_NAME=` followed by your desired experiment name to your Python commands. For example:

```
export EXP_NAME=lever-reproduce-mbpp;
```
- YAML Configuration: Modify the `trainer.logger+` fields in the YAML configuration file you intend to use, updating the `entity` and `project` fields with your W&B account details:

```
trainer:
  logger+:
    entity: <your_wandb_entity>           # your wandb username or team name
    project: <your_wandb_project_name>    # project name for wandb
```
This setup enables detailed tracking of your LEVER experiments on the W&B platform.
- Python Import Issues: If you encounter Python import errors (e.g., `ModuleNotFoundError`), ensure that the Python path is correctly set. From the main `lever` directory, run:

```
export PYTHONPATH=`pwd`
```
This command adds the current directory to the Python path, resolving potential import issues.
By completing these setup steps, you will have a robust environment ready to run and explore LEVER for execution-based code generation verification.
Utilizing LEVER for Code Verification
Explore various use cases of LEVER, from reproducing published results to applying LEVER to new models and datasets.
Accessing Pre-trained Model Checkpoints
For ease of reproducibility and direct application, we provide access to pre-trained model weights for all four datasets on the Hugging Face Model Hub. These checkpoints allow you to quickly replicate our results and utilize LEVER’s capabilities without training from scratch.
Request for InCoder and CodeGen Models: If you require trained model weights for InCoder and CodeGen models, please open a feature request here. We prioritize requests based on community demand.
Inference with LEVER
Apply pre-trained LEVER models to existing datasets or integrate them with outputs from different code Large Language Models (LLMs). LEVER’s transfer learning capabilities, as demonstrated in the paper, make it surprisingly effective across different models and datasets.
Reproducing Published Results
To reproduce the results from the LEVER paper, use the provided trained models on the prepared datasets. After setting up the data as described in the Data section, execute the following command, replacing `<dataset>` with `spider`, `wikitq`, `gsm8k`, or `mbpp`:
```
python finetuning/trainer.py validate --config finetuning/training_configs/verification/<dataset>_verification.yaml --trainer.accelerator gpu --trainer.gpus 1
```
For example, to run LEVER on the Spider development set using CPUs (approximately 7 minutes on an M1 Max MacBook Pro), use:
```
python finetuning/trainer.py validate --config finetuning/training_configs/verification/spider_verification.yaml --trainer.accelerator cpu --trainer.gpus 4 --data.val_batch_size 4
```
Feel free to modify the YAML configuration files to adjust parameters according to your needs. The fields are designed to be self-explanatory for ease of customization.
Note on MPS: We encountered issues running T5 models with MPS. If you have a solution or workaround, contributions are welcome via issues or pull requests.
Integrating with New LLMs
Leverage LEVER with new LLMs on existing datasets by first generating candidate programs using your chosen LLMs. Example YAML configurations for GSM8k with Codex, InCoder, and CodeGen models are available in `finetuning/training_configs/few_shot/`.
To incorporate a new LLM into the few-shot generation pipeline, you need to modify `finetuning/lightning_modules/models/seq2seq_model_util.py`. Add your model within the `elif model_name == "<your_model_name>":` block in the `initialize_model_and_tokenizer` function:
```
...
elif model_name.startswith("codex-"):
    ...
###### Add your model here ##########
elif model_name == "<your_model_name>":
    """Initialize and return the tokenizer and the model for your new LLM"""
    # Load tokenizer and model for your LLM here
    pass  # Replace with your model loading code
######## End adding model ##############
else:
    print(f"unknown model: {model_name}")
    raise NotImplementedError
```
Remember to update the helper functions `is_model_gpt_style` and `is_encoder_only_model` in the same file so that they correctly identify your new model's architecture.
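As a rough illustration of what the new branch might load, here is a hedged sketch using a generic Hugging Face causal LM. The model name `my-org/my-code-llm` is a placeholder, and the exact return order should be matched to the neighboring branches of `initialize_model_and_tokenizer`.

```
# Hedged sketch of loading a generic Hugging Face causal LM for the new branch.
# "my-org/my-code-llm" is a placeholder model name; match the return order to
# the other branches of initialize_model_and_tokenizer before using this.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_my_code_llm(model_name: str = "my-org/my-code-llm"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Decoder-only code LLMs often ship without a pad token.
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
```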
After these modifications, run few-shot generation with your new model using:
```
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/<your_new_llm_config>.yaml
```
Training LEVER for New Datasets
While pre-trained LEVER models offer excellent transferability, training LEVER on a new dataset may be necessary for optimal performance. This involves implementing new dataset classes, generating training data, and training your own LEVER model.
Implementing Dataset Classes for Few-Shot Generation
To adapt LEVER to a new dataset, start by creating new dataset classes for few-shot generation. Refer to `FewShotMathQADataset` and `FewShotMathQADataModule` in `finetuning/lightning_modules/datasets/mathqa_reader.py` for implementation examples, and pay close attention to the functions that need to be overridden to handle your specific dataset format and requirements.
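The skeleton below only illustrates the general shape of such a dataset using plain PyTorch primitives; the real classes should subclass the same bases and override the same methods as `FewShotMathQADataset` / `FewShotMathQADataModule`, and all field and method names here are placeholders.

```
# Generic shape only: mirror the base classes and overridden methods of
# FewShotMathQADataset / FewShotMathQADataModule in mathqa_reader.py.
# Field and method names below are placeholders.
import json
from torch.utils.data import Dataset

class FewShotMyTaskDataset(Dataset):
    def __init__(self, file_path: str, prompt_prefix: str = ""):
        # One example per line (JSONL assumed here for illustration).
        with open(file_path) as f:
            self.examples = [json.loads(line) for line in f]
        self.prompt_prefix = prompt_prefix

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Build a few-shot prompt for the code LLM from one raw example;
        # "question" is a placeholder for your dataset's natural-language field.
        return {"prompt": self.prompt_prefix + ex["question"], "metadata": ex}
```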
Running Few-Shot Generation for Training Data
Generate candidate programs for your new dataset using few-shot generation. This process needs to be applied to both training and development/test datasets. Use the following command template:
```
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/<your_new_dataset_config>.yaml
```
Implementing Dataset Classes for Verification
Create dataset classes specifically for the verification step. For guidance, see `MathQAEndVerificationDataset` and `MathQAEndVerificationDataModule` in `finetuning/lightning_modules/datasets/mathqa_reader.py`. These classes should handle the candidate programs and their execution results for your dataset.
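For intuition, the sketch below shows one plausible way, following the paper's high-level description, to serialize a candidate program and its execution result into a single text sequence for the T5 verifier. The actual formatting used by `MathQAEndVerificationDataset` may differ, and all names are placeholders.

```
# Illustrative only: serialize (question, program, execution result) into one
# text sequence for the verifier. The real formatting in
# MathQAEndVerificationDataset may differ.
def build_verifier_input(question: str, program: str, exec_result: str) -> str:
    return f"question: {question} program: {program} execution result: {exec_result}"

# Example usage with made-up values:
print(build_verifier_input("How many days are in a leap year?", "print(366)", "366"))
```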
Training Your LEVER Model
Train LEVER using the generated training data. Execute the training process with the following command, using your dataset-specific configuration file:
```
python finetuning/trainer.py fit --config finetuning/training_configs/verification/<your_new_dataset_config>.yaml
```
Evaluating LEVER on Dev/Test Data
Evaluate your trained LEVER model on the development and test datasets. Validation results are typically displayed during training after each epoch if the dev data path is specified in your YAML configuration. Alternatively, you can run a separate validation using:
```
python finetuning/trainer.py validate --config finetuning/training_configs/verification/<your_new_dataset_config>.yaml --model.load_ckpt_file <path_to_ckpt>
```
Replace `<path_to_ckpt>` with the path to your trained model checkpoint file to evaluate performance on your dataset.
By following these steps, you can effectively train and deploy LEVER for execution-based verification on new code generation datasets, enhancing the reliability and accuracy of language-to-code models.
References and Citations
This research builds upon and adapts code from the following repositories:
- https://github.com/Yale-LILY/NLP4Code (Apache-2.0 License)
- https://github.com/microsoft/TraceCodegen (MIT License)
If you utilize the code or data from this repository in your research or applications, please cite the LEVER paper:
```
@inproceedings{ni2023lever,
  title={{Lever: Learning To Verify Language-to-code Generation With Execution}},
  author={Ni, Ansong and Iyer, Srini and Radev, Dragomir and Stoyanov, Ves and Yih, Wen-tau and Wang, Sida I and Lin, Xi Victoria},
  booktitle={{Proceedings of the 40th International Conference on Machine Learning (ICML'23)}},
  year={2023}
}
```
This citation acknowledges the work and contributions of the LEVER research and development team.