Creating Datasets

Step-by-step guide for creating datasets in PropulsionAI

In PropulsionAI, datasets are the foundation for training your models. You can start from prebuilt data or generate new data from live interactions, and there are four ways to get a dataset ready for fine-tuning:

1. Upload a Prebuilt Dataset

The most straightforward way to get started is by uploading a prebuilt dataset. PropulsionAI supports JSONL, JSON, and CSV formats. Here’s how:

  • Step 1: Navigate to the "Datasets" section of your project.

  • Step 2: Click "New" and enter a name and description for your dataset.

  • Step 3: Click "Upload Dataset" and select your dataset file (JSONL, JSON, or CSV).

  • Step 4: Map the columns in your dataset file to the columns supported by PropulsionAI. This step ensures that the data is correctly interpreted for training.

  • Step 5: Once the upload completes, you can manage, search, and tag items within the dataset to prepare it for training.
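
Before uploading, it can help to check that the file is well formed. The following is a minimal sketch, not part of PropulsionAI itself: it converts CSV text to JSONL and checks that every JSONL record is a JSON object with the same keys. The column names `prompt` and `response` are hypothetical and chosen only for illustration; the columns PropulsionAI supports are the ones shown in the mapping step.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text (with a header row) into JSONL, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [json.dumps(row, ensure_ascii=False) for row in reader]
    return "\n".join(lines) + ("\n" if lines else "")


def check_jsonl(jsonl_text: str) -> list[str]:
    """Return a list of problems; an empty list means the records look consistent."""
    problems = []
    expected_keys = None
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        if expected_keys is None:
            expected_keys = set(record)  # the first record defines the columns
        elif set(record) != expected_keys:
            problems.append(f"line {number}: keys differ from the first record")
    return problems


# Hypothetical column names, for illustration only.
sample_csv = "prompt,response\nWhat is 2+2?,4\nCapital of France?,Paris\n"
sample_jsonl = csv_to_jsonl(sample_csv)
```

Calling `check_jsonl(sample_jsonl)` on the converted sample returns an empty list. A file with mismatched keys or a malformed line returns one message per problem, which is easier to fix locally than after the upload.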

2. Record Using Deployments

Recording data directly from your model’s deployment is an efficient way to gather real-world usage data for further fine-tuning:

  • Step 1: When creating a deployment in the "Deployments" section, you’ll have the option to record data directly to a dataset.

  • Step 2: Choose to create a new dataset or select an existing one to capture interactions as they happen.

  • Step 3: Review and tag the recorded data, then use it to improve your model so that it is more accurate and better aligned with actual usage.

3. Add Items Manually

If you have specific data points that you want to include, or if you need to augment an existing dataset, you can add items manually:

  • Step 1: In the "Datasets" section, instead of uploading, click on "Add Item."

  • Step 2: Manually enter the data you want to include, one item at a time.

  • Step 3: This method is especially useful for refining datasets with targeted examples or for testing purposes.

4. Record Using SDK

If you already use another platform such as OpenAI, PropulsionAI’s SDK lets you record ongoing interactions into a dataset:

  • Step 1: Integrate the PropulsionAI SDK into your existing application.

  • Step 2: Use the SDK to capture conversations or interactions happening on another platform and record them directly into a PropulsionAI dataset.

  • Step 3: Review and refine the recorded data, then use it to fine-tune an open-source model tailored to your needs.
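
The recording pattern described above can be sketched in plain Python. This example does not use the PropulsionAI SDK, whose interface is not shown on this page; it only illustrates the general idea of wrapping a call to another provider so that each prompt and response is captured. The names `record_interactions` and `fake_provider`, and the `prompt`/`response` fields, are hypothetical, and the records go to a local JSONL file rather than to a PropulsionAI dataset.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def record_interactions(chat_fn: Callable[[str], str], log_path: Path) -> Callable[[str], str]:
    """Wrap chat_fn so every prompt/response pair is appended to log_path as JSONL."""
    def wrapped(prompt: str) -> str:
        response = chat_fn(prompt)
        with log_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
        return response
    return wrapped


def fake_provider(prompt: str) -> str:
    """Stand-in for a call to an external LLM provider (hypothetical)."""
    return prompt.upper()


# Record two interactions into a temporary file, then read them back.
log_file = Path(tempfile.mkdtemp()) / "recorded.jsonl"
chat = record_interactions(fake_provider, log_file)
chat("hello")
chat("how are you?")
records = [json.loads(line) for line in log_file.read_text(encoding="utf-8").splitlines()]
```

The wrapper leaves the application's calling code unchanged, because `chat` takes and returns the same values as the wrapped function. In a real integration the SDK would take the place of the file write.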

Uploading a prebuilt dataset (Option 1) and recording from deployments (Option 2) are both effective ways to gather data for your models. Either approach gives you a solid foundation for training models that match your specific needs.

With these options, you’re well on your way to building high-quality datasets that power your custom LLMs in PropulsionAI. Whether you start with prebuilt data or collect new data on the go, PropulsionAI provides the flexibility and tools you need to succeed.
