A UI to Compare Prompts with Multiple LLMs

MLflow’s Tracking UI, starting from version 2.7, offers a powerful interface for prompt engineering. This user-friendly UI allows you to experiment with various Large Language Models (LLMs) via the MLflow AI Gateway, tweak parameter configurations, and refine prompts without writing any code. It empowers you to develop models for diverse tasks, such as question answering and document summarization, and compare their performance. The integrated Evaluation UI facilitates side-by-side comparisons of multiple models based on a given input set, enabling you to select the most effective model. Every model created within the prompt engineering UI is saved in the MLflow Model format, ready for batch or real-time inference deployment. Furthermore, all configurations, including prompt templates, LLM selections, and parameters, are meticulously tracked as MLflow Runs.

An example of the MLflow Experiment page.

Connecting Your LLM and Experimenting with Prompts

To use the prompt engineering UI, you need an MLflow AI Gateway endpoint and a connected MLflow Tracking Server. Set up a completions or chat endpoint with the MLflow AI Gateway, then point your tracking server at it by setting the MLFLOW_DEPLOYMENTS_TARGET environment variable to the gateway's URI before starting the server.
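For reference, here is a minimal sketch of how the pieces fit together in Python, assuming a gateway (deployments server) is already running locally on port 7000 with a completions or chat endpoint configured; the URI and the sanity check below are illustrative, not required steps.

```python
import os

from mlflow.deployments import get_deploy_client

# Assumption: a gateway (deployments server) is already running locally,
# e.g. started from a config file, and listening on port 7000.
GATEWAY_URI = "http://localhost:7000"

# Point the tracking server at the gateway. In practice this variable is set
# in the environment of the process that runs `mlflow server`, before startup.
os.environ["MLFLOW_DEPLOYMENTS_TARGET"] = GATEWAY_URI

# Optional sanity check: list the endpoints the gateway exposes
# (availability of this call may depend on your MLflow version).
client = get_deploy_client(GATEWAY_URI)
for endpoint in client.list_endpoints():
    print(endpoint)
```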

Within a chosen MLflow Experiment, create a new run and select the “using Prompt Engineering” option. This opens a playground where you can pick an endpoint, enter a prompt template, and evaluate it against different LLMs and parameter settings.
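Under the hood, the playground fills in your prompt template and sends it to the selected endpoint. The equivalent programmatic call looks roughly like the sketch below; the gateway URI, endpoint name, and template are illustrative assumptions.

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("http://localhost:7000")  # assumed gateway URI

# A hypothetical prompt template with one variable, mirroring what you would
# type into the playground.
template = "Summarize the following article in two sentences:\n\n{{article}}"
article = "MLflow 2.7 adds a no-code prompt engineering UI on top of the AI Gateway."

response = client.predict(
    endpoint="completions",  # assumed endpoint name from your gateway config
    inputs={
        "prompt": template.replace("{{article}}", article),
        "temperature": 0.1,
        "max_tokens": 256,
    },
)
print(response)
```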

The prompt modal allows users to select an endpoint and evaluate prompts.

Iterative Prompt Refinement and Model Comparison

The UI supports iterative refinement of your prompts: you can modify the prompt template, adjust parameters such as temperature, and re-evaluate with different inputs. Each iteration can be saved as a new MLflow Run, preserving its configuration for later reference and comparison. The Evaluation UI then lets you compare the outputs of different runs side by side, so the choice of prompt and LLM configuration is backed by data.
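Because each saved iteration is an ordinary MLflow Run, a comparable run can also be logged by hand. The sketch below is illustrative: the experiment name, parameter names, and artifact layout approximate what the UI records rather than reproducing its exact keys.

```python
import mlflow

# Illustrative values; in the UI these come from the playground form.
template = "Summarize the following article in two sentences:\n\n{{article}}"
article = "MLflow 2.7 adds a no-code prompt engineering UI on top of the AI Gateway."
output = "MLflow 2.7 introduces a prompt engineering UI built on the AI Gateway."  # model output

mlflow.set_experiment("prompt-engineering-demo")  # hypothetical experiment name

with mlflow.start_run(run_name="summarization-v2"):
    # Record the configuration you would otherwise set in the playground.
    mlflow.log_params({"endpoint": "completions", "temperature": 0.1, "max_tokens": 256})
    mlflow.log_text(template, "prompt_template.txt")

    # Store the evaluated inputs and outputs so runs can be compared later;
    # "eval_results_table.json" follows the artifact name used in MLflow's examples.
    mlflow.log_table(
        data={"article": [article], "output": [output]},
        artifact_file="eval_results_table.json",
    )
```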

The Evaluation UI displays the output for a new input, allowing comparison with previous runs.

Programmatic Access and Deployment

Beyond the UI, you can programmatically access the evaluation data using mlflow.load_table(), enabling deeper analysis and integration with your existing workflows. The generated MLflow Model can be deployed for real-time serving using mlflow models serve, allowing you to integrate your refined prompts into production applications. Furthermore, mlflow.evaluate() allows for metric-based evaluation of your models, providing quantitative insights into their performance.
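As a rough sketch of that programmatic workflow, the snippet below loads the logged evaluation table and runs a metric-based evaluation. The run IDs, dataset, column names, and model type are placeholder assumptions to adapt to your own experiment.

```python
import mlflow
import pandas as pd

# Load the evaluation table logged for one or more runs. The artifact name
# follows MLflow's examples; replace the run IDs with your own.
eval_table = mlflow.load_table(
    "eval_results_table.json",
    run_ids=["<run_id_1>", "<run_id_2>"],
    extra_columns=["run_id"],
)
print(eval_table.head())

# Metric-based evaluation of a saved model against a labeled dataset.
# Assumes a CSV with an input column (e.g. "question") and a "ground_truth" column.
results = mlflow.evaluate(
    model="runs:/<run_id_1>/model",
    data=pd.read_csv("evaluation_data.csv"),
    targets="ground_truth",
    model_type="question-answering",
)
print(results.metrics)
```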

MLflow allows metric-based evaluation of models, providing quantitative insights into performance.

In conclusion, MLflow’s prompt engineering UI provides a comprehensive platform for developing, evaluating, and deploying LLM-powered applications. Its intuitive interface, coupled with programmatic access and deployment capabilities, makes it a valuable tool for anyone working with LLMs.
