This page provides resources and code for the research paper "LEVER: Learning to Verify Language-to-Code Generation with Execution". LEVER is a simple and effective approach to improving code generation with code language models (CodeLMs): a learned verifier uses the execution results of CodeLM-generated programs to verify and rerank them, which substantially improves the accuracy and reliability of the generated code. Combined with Codex (code-davinci-002), LEVER achieves state-of-the-art (SOTA) results on the Spider, WikiTableQuestions, GSM8k, and MBPP benchmarks.
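To make the core idea concrete, here is a schematic sketch (plain Python, not the repository's actual API) of the reranking rule described in the paper: each candidate program is scored by the product of the generator's probability and the verifier's probability, and candidates that execute to the same result pool their scores. All field and function names below are illustrative.

```
# Schematic sketch of LEVER's execution-aware reranking (illustrative only;
# this is not the repository's API). Each candidate carries the generator's
# probability, the verifier's probability, and its execution result.
from collections import defaultdict

def rerank(candidates):
    """candidates: list of dicts with keys 'program', 'exec_result',
    'lm_prob', 'verifier_prob' (all names are placeholders)."""
    # Pool the joint scores of candidates that execute to the same result.
    pooled = defaultdict(float)
    for c in candidates:
        pooled[c["exec_result"]] += c["lm_prob"] * c["verifier_prob"]
    best_result = max(pooled, key=pooled.get)
    # Return the highest-scoring program among those producing the best result.
    return max(
        (c for c in candidates if c["exec_result"] == best_result),
        key=lambda c: c["lm_prob"] * c["verifier_prob"],
    )["program"]
```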
Stay Updated with LEVER
Latest Updates
- 2023-07-05: Explore our interactive online demo now available on Huggingface Spaces! Experience LEVER in action and test its capabilities firsthand.
- 2023-07-04: LEVER model weights (verifiers trained on Codex outputs) for all four benchmark datasets are now available on Hugging Face. Find these resources in the Model Checkpoints section below.
- 2023-07-03: Initial code release for LEVER is now public. Access the codebase to implement and experiment with LEVER.
- 2023-04-24: LEVER has been accepted to ICML’23! The paper will be presented at the International Conference on Machine Learning in 2023.
These updates showcase the ongoing development and recognition of LEVER in the machine learning community.
Getting Started with LEVER
To utilize LEVER, follow these setup instructions to install the necessary environment and download the required data.
Installation Guide
- Environment Setup: It's highly recommended to create a dedicated conda environment to manage dependencies and avoid conflicts with other Python projects. Use the following commands to create and activate a new conda environment named `lever`:

```
conda create -n lever python=3.8
conda activate lever
```

This ensures a clean and consistent environment for running LEVER.
- Dependency Installation: Navigate to the LEVER directory in your terminal and install all required Python packages using pip. The dependencies are listed in the `requirements.txt` file:

```
pip install -r requirements.txt
```

This command installs all libraries needed to run LEVER.
Platform Note: The pipelines have been tested on Linux machines. If you are using a different operating system, you may need to build the `tree-sitter` parsers for your platform.
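For reference, the parsers can be built from the Python bindings. The sketch below is a minimal example; the grammar checkout path is an assumption (clone the grammars you need first), and `Language.build_library` is only available in older `py-tree-sitter` releases (it was removed in 0.22+).

```
# Minimal sketch for building a tree-sitter parser library on a non-Linux
# platform. The grammar checkout path below is an assumption; clone the
# grammar(s) you need and point to them. Requires py-tree-sitter < 0.22,
# where Language.build_library is still available.
from tree_sitter import Language

Language.build_library(
    "build/my-languages.so",         # platform-specific shared library to produce
    ["vendor/tree-sitter-python"],   # path(s) to cloned grammar repositories
)
```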
Data Acquisition
To reproduce the results presented in the paper, you need to download the verification data.
License Information: The verification data is shared under the CC-BY-NC 4.0 license. Please adhere to the terms of this license. For the original datasets (Spider, WikiTableQuestions, GSM8k, MBPP), please refer to their respective dataset pages for licensing and download instructions.
- Download Data: Download the LEVER verification data from this link.
- Create Data Directory: In the main `lever` directory, create a folder named `data`, navigate into it, and unzip the downloaded file there:

```
cd lever
mkdir data
cd data
unzip lever_verification_data.zip
```
After these steps, your `data` directory structure should resemble the following:

```
data
├── gsm8k
│   └── ...
├── mbpp
│   └── ...
├── spider
│   └── ...
└── wikitq
    ├── wikitq_codex_verification_dev.jsonl
    ├── wikitq_codex_verification_test.jsonl
    └── wikitq_codex_verification_train.jsonl
```
This structure ensures that LEVER can correctly locate and access the necessary data files.
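As an optional sanity check that the data landed in the right place, a short snippet like the one below can read the first record of one of the files listed above. Only the file path comes from the layout; no assumptions are made about the record fields.

```
# Optional sanity check: read the first record of one verification file and
# list its keys. The path comes from the directory layout above; the fields
# themselves are not assumed here.
import json
from pathlib import Path

path = Path("data/wikitq/wikitq_codex_verification_dev.jsonl")
with path.open() as f:
    first = json.loads(f.readline())
print(f"{path}: keys of first record -> {sorted(first.keys())}")
```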
Optional Configuration
Enhance your LEVER experience with these optional setups for API keys and experiment tracking.
- OpenAI API Key (For Codex): If you plan to use Codex models with LEVER, you need to set up your OpenAI API key. You can either add the following line to your `~/.bashrc` file (or equivalent shell configuration file) or include it directly in your inference commands:

```
export OPENAI_API_KEY=<your key, should start with "sk-">
```

Replace `<your key, should start with "sk-">` with your actual OpenAI API key.

- Experiment Logging with Weights & Biases (W&B): For experiment tracking and logging, LEVER supports integration with Weights & Biases.
- W&B Account: First, you need to set up a W&B account. Follow the instructions here to create an account and log in via the command line.
- Configuration: To enable W&B logging, prepend `export EXP_NAME=` followed by your desired experiment name to your Python commands. For example:

```
export EXP_NAME=lever-reproduce-mbpp;
```
- YAML Configuration: Modify the `trainer.logger+` fields in the YAML configuration file you intend to use, updating the `entity` and `project` fields with your W&B account details:

```
trainer:
  logger+:
    entity: <your_wandb_entity>           # your wandb username or team name
    project: <your_wandb_project_name>    # project name for wandb
```
This setup enables detailed tracking of your LEVER experiments on the W&B platform.
- Python Import Issues: If you encounter Python import errors (e.g., `ModuleNotFoundError`), ensure that the Python path is correctly set. From the main `lever` directory, run:

```
export PYTHONPATH=`pwd`
```
This command adds the current directory to the Python path, resolving potential import issues.
By completing these setup steps, you will have a robust environment ready to run and explore LEVER for execution-based code generation verification.
Utilizing LEVER for Code Verification
Explore various use cases of LEVER, from reproducing published results to applying LEVER to new models and datasets.
Accessing Pre-trained Model Checkpoints
For ease of reproducibility and direct application, we provide access to pre-trained model weights for all four datasets on the Hugging Face Model Hub. These checkpoints allow you to quickly replicate our results and utilize LEVER’s capabilities without training from scratch.
Request for InCoder and CodeGen Models: If you require trained model weights for InCoder and CodeGen models, please open a feature request here. We prioritize requests based on community demand.
Inference with LEVER
Apply pre-trained LEVER models to existing datasets or integrate them with outputs from different code Large Language Models (LLMs). LEVER’s transfer learning capabilities, as demonstrated in the paper, make it surprisingly effective across different models and datasets.
Reproducing Published Results
To reproduce the results from the LEVER paper, use the provided trained models on the prepared datasets. After setting up the data as described in the Data section, execute the following command, replacing `<dataset>` with `spider`, `wikitq`, `gsm8k`, or `mbpp`:
```
python finetuning/trainer.py validate --config finetuning/training_configs/verification/<dataset>_verification.yaml --trainer.accelerator gpu --trainer.gpus 1
```
For example, to run LEVER on the Spider development set using CPUs (approximately 7 minutes on an M1 Max MacBook Pro), use:
```
python finetuning/trainer.py validate --config finetuning/training_configs/verification/spider_verification.yaml --trainer.accelerator cpu --trainer.gpus 4 --data.val_batch_size 4
```
Feel free to modify the YAML configuration files to adjust parameters according to your needs. The fields are designed to be self-explanatory for ease of customization.
Note on MPS: We encountered issues running T5 models with MPS. If you have a solution or workaround, contributions are welcome via issues or pull requests.
Integrating with New LLMs
Leverage LEVER with new LLMs on existing datasets by first generating candidate programs using your chosen LLMs. Example YAML configurations for GSM8k with Codex, InCoder, and CodeGen models are available in `finetuning/training_configs/few_shot/`.
To incorporate a new LLM into the few-shot generation pipeline, you need to modify `finetuning/lightning_modules/models/seq2seq_model_util.py`. Add your model within the `elif model_name == "<your_model_name>":` block in the `initialize_model_and_tokenizer` function:
```
...
elif model_name.startswith("codex-"):
    ...
###### Add your model here ##########
elif model_name == "<your_model_name>":
    """Initialize and return the tokenizer and the model for your new LLM"""
    # Load tokenizer and model for your LLM here
    pass  # Replace with your model loading code
######## End adding model ##############
else:
    print(f"unknown model: {model_name}")
    raise NotImplementedError
```
Remember to update the helper functions `is_model_gpt_style` and `is_encoder_only_model` in the same file so that they correctly identify your new model's architecture.
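As a rough illustration of what the new branch might load, here is a hedged sketch using a generic Hugging Face causal LM. The model name `my-org/my-code-llm` is a placeholder, and the exact return order should be matched to the neighboring branches of `initialize_model_and_tokenizer`.

```
# Hedged sketch of loading a generic Hugging Face causal LM for the new branch.
# "my-org/my-code-llm" is a placeholder model name; match the return order to
# the other branches of initialize_model_and_tokenizer before using this.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_my_code_llm(model_name: str = "my-org/my-code-llm"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Decoder-only code LLMs often ship without a pad token.
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
```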
After these modifications, run few-shot generation with your new model using:
```
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/<your_new_llm_config>.yaml
```
Training LEVER for New Datasets
While pre-trained LEVER models offer excellent transferability, training LEVER on a new dataset may be necessary for optimal performance. This involves implementing new dataset classes, generating training data, and training your own LEVER model.
Implementing Dataset Classes for Few-Shot Generation
To adapt LEVER to a new dataset, start by creating new dataset classes for few-shot generation. Refer to `FewShotMathQADataset` and `FewShotMathQADataModule` in `finetuning/lightning_modules/datasets/mathqa_reader.py` for implementation examples, and pay close attention to the functions that need to be overridden to handle your specific dataset format and requirements.
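The skeleton below only illustrates the general shape of such a dataset using plain PyTorch primitives; the real classes should subclass the same bases and override the same methods as `FewShotMathQADataset` / `FewShotMathQADataModule`, and all field and method names here are placeholders.

```
# Generic shape only: mirror the base classes and overridden methods of
# FewShotMathQADataset / FewShotMathQADataModule in mathqa_reader.py.
# Field and method names below are placeholders.
import json
from torch.utils.data import Dataset

class FewShotMyTaskDataset(Dataset):
    def __init__(self, file_path: str, prompt_prefix: str = ""):
        # One example per line (JSONL assumed here for illustration).
        with open(file_path) as f:
            self.examples = [json.loads(line) for line in f]
        self.prompt_prefix = prompt_prefix

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Build a few-shot prompt for the code LLM from one raw example;
        # "question" is a placeholder for your dataset's natural-language field.
        return {"prompt": self.prompt_prefix + ex["question"], "metadata": ex}
```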
Running Few-Shot Generation for Training Data
Generate candidate programs for your new dataset using few-shot generation. This process needs to be applied to both training and development/test datasets. Use the following command template:
```
python finetuning/trainer.py validate --config finetuning/training_configs/few_shot/<your_new_dataset_config>.yaml
```
Implementing Dataset Classes for Verification
Create dataset classes specifically for the verification step. For guidance, see `MathQAEndVerificationDataset` and `MathQAEndVerificationDataModule` in `finetuning/lightning_modules/datasets/mathqa_reader.py`. These classes should handle the candidate programs and their execution results for your dataset.
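For intuition, the sketch below shows one plausible way, following the paper's high-level description, to serialize a candidate program and its execution result into a single text sequence for the T5 verifier. The actual formatting used by `MathQAEndVerificationDataset` may differ, and all names are placeholders.

```
# Illustrative only: serialize (question, program, execution result) into one
# text sequence for the verifier. The real formatting in
# MathQAEndVerificationDataset may differ.
def build_verifier_input(question: str, program: str, exec_result: str) -> str:
    return f"question: {question} program: {program} execution result: {exec_result}"

# Example usage with made-up values:
print(build_verifier_input("How many days are in a leap year?", "print(366)", "366"))
```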
Training Your LEVER Model
Train LEVER using the generated training data. Execute the training process with the following command, using your dataset-specific configuration file:
```
python finetuning/trainer.py fit --config finetuning/training_configs/verification/<your_new_dataset_config>.yaml
```
Evaluating LEVER on Dev/Test Data
Evaluate your trained LEVER model on the development and test datasets. Validation results are typically displayed during training after each epoch if the dev data path is specified in your YAML configuration. Alternatively, you can run a separate validation using:
```
python finetuning/trainer.py validate --config finetuning/training_configs/verification/<your_new_dataset_config>.yaml --model.load_ckpt_file <path_to_ckpt>
```
Replace `<path_to_ckpt>` with the path to your trained model checkpoint file to evaluate performance on your dataset.
By following these steps, you can effectively train and deploy LEVER for execution-based verification on new code generation datasets, enhancing the reliability and accuracy of language-to-code models.
References and Citations
This research builds upon and adapts code from the following repositories:
- https://github.com/Yale-LILY/NLP4Code (Apache-2.0 License)
- https://github.com/microsoft/TraceCodegen (MIT License)
If you utilize the code or data from this repository in your research or applications, please cite the LEVER paper:
```
@inproceedings{ni2023lever,
  title={{Lever: Learning To Verify Language-to-code Generation With Execution}},
  author={Ni, Ansong and Iyer, Srini and Radev, Dragomir and Stoyanov, Ves and Yih, Wen-tau and Wang, Sida I and Lin, Xi Victoria},
  booktitle={{Proceedings of the 40th International Conference on Machine Learning (ICML'23)}},
  year={2023}
}
```
This citation acknowledges the work and contributions of the LEVER research and development team.