Fixing Image Token Errors In AgentCPM-GUI SFT Training

Nov 14, 2025 by Admin 55 views

Fixing Image Token Errors in AgentCPM-GUI SFT Training

Hey there, coding enthusiasts! Are you wrestling with the Exception: image start token != image end tokens error during your SFT (Supervised Fine-tuning) phase with AgentCPM-GUI? Don't worry, you're not alone! This is a common hiccup, and we'll break down the issue and how to resolve it. This guide is tailored for those using the OpenBMB framework and specifically targets the problem where your image tokens are out of sync during the conversation processing stage. We'll delve into the root cause, provide a detailed explanation of the error message, and offer practical solutions to get your SFT training back on track. We'll also explore best practices to avoid these issues in the future, ensuring a smoother and more efficient training process.

Understanding the Error: Image Token Mismatch

The core of the problem lies within the conversation_to_ids function, likely in the part of your code responsible for handling image tokens. The error Exception: image start token != image end tokens, (4,3) is quite telling. It signifies that the number of image start tokens doesn't match the number of image end tokens in your data. In simple terms, your system is finding a different amount of start tags, which are like the beginning markers for an image, than it's finding end tags, which signal the close of an image. This mismatch throws off the system because it expects a clear beginning and end for each image it processes. This problem can happen due to many reasons, such as inconsistent formatting or errors in your data preparation.

Why this matters: In your specific example, the error cropped up when you were feeding your data to the model. The data, as you provided, has the image tag placed within the user's question. This tag is crucial for the model to understand that it needs to process image-based content. The OpenBMB framework, and similar models, often use special tokens to denote where an image starts and ends. When these tokens are mismatched, the model becomes confused, and your training fails.

Decoding Your Data and the Error Message

Let's analyze your data snippet to see where the problem originates:

{
  'id': '0', 'image': 'GUI_Agent/data/data_source/ac/screenshot_10115_0.png', 'conversations': [
    {'role': 'system', 'content': '# Role\n...'}, 
    {'role': 'user', 'content': '<Question>Open the California Pizza app, then add a Miami Beast pizza in large size with a thin crust and make sure it is gluten-free, and add to the cart.</Question>\n当前屏幕截图：<image>'}, 
    {'role': 'assistant', 'content': '{...}'}
  ], 'episode_id': 87, 'step_id': 0
}

Looking closely, you can see that the image tag <image> is placed within the user content. This placement might be correct, depending on how your system is designed to handle image inputs. However, the error suggests there's a problem with how the system identifies and processes these tags.

The error message Exception: image start token != image end tokens, (4,3) tells us the system detected 4 start tokens and 3 end tokens. We need to identify where these tokens are and why there's a difference. Common causes are incorrect formatting, missing closing tags, or errors in your data preprocessing steps. The specifics depend on how your OpenBMB setup tokenizes the data. It is important to know the image tag that the system is using to identify the start and end of the image.

Troubleshooting Steps and Solutions

Here’s a practical, step-by-step approach to fix this image token mismatch:

Inspect Your Data:
- Carefully Review Your JSON: Go through your dataset, especially the conversations part. Ensure that every <image> tag has a corresponding closing tag. The closing tag is a tag that signals the end of an image.
- Check for Typos: A simple typo in the start or end token can cause the mismatch. For example, if you're using <image> as the start tag, make sure the end tag is consistent (e.g., </image>).
Verify Tokenization:
- Understand Tokenization: OpenBMB and other frameworks use tokenizers to convert text and images into numerical tokens that the model can understand. The tokenizer determines how image tags are handled.
- Examine Tokenizer Output: Use the tokenizer on a sample of your data to see how it processes the image tags. This will reveal the exact tokens used for image start and end. This is a crucial step to check if the framework is correctly interpreting the tags.
Adjust Data Preprocessing:
- Data Cleaning: If you find inconsistencies in your data, clean it up. Standardize the image tag format and ensure all tags are properly closed.
- Preprocessing Scripts: Review any preprocessing scripts you use to prepare your data. Make sure they correctly handle image tags and don't introduce errors.
Check Your Model Configuration:
- Model-Specific Handling: Some models have specific ways of handling images. Review your model configuration to see if it requires any special settings for image inputs.
Debugging:
- Print Intermediate Results: Add print statements to your code to check the state of the tokens at various stages of the conversation_to_ids function. This will help you pinpoint the exact location of the error.
Review the AgentCPM-GUI Readme: Ensure that you have followed the AgentCPM-GUI's readme instructions to the letter.

Example of Data Structure

Here’s how a corrected example might look. This is a sample, and the specific format will depend on your setup, but it highlights the correct use of image tags:

{
  'id': '0', 'image': 'GUI_Agent/data/data_source/ac/screenshot_10115_0.png', 'conversations': [
    {'role': 'system', 'content': '# Role\n...'}, 
    {'role': 'user', 'content': '<Question>Open the California Pizza app, then add a Miami Beast pizza in large size with a thin crust and make sure it is gluten-free, and add to the cart. Current screen shot: <image> '}, 
    {'role': 'assistant', 'content': '{...}'}
  ], 'episode_id': 87, 'step_id': 0
}

In this revised structure, the start and end tokens should align correctly, provided your tokenizer and preprocessing steps are configured to handle them properly. The image tag is properly placed within the user's content, and the format is consistent.

Avoiding Future Issues

To prevent future image token mismatches, follow these best practices:

Data Validation: Implement data validation checks in your data loading and preprocessing pipelines. Validate that all image start tokens have corresponding end tokens.
Consistent Formatting: Maintain a consistent format for your data. Ensure that image tags are always used correctly and consistently.
Automated Testing: Create automated tests that check for common errors, such as token mismatches, before you start training.
Regular Review: Regularly review your data and preprocessing scripts to catch any potential problems early.

Conclusion

Fixing the image start token != image end tokens error requires a methodical approach. By carefully examining your data, verifying your tokenization process, and adjusting your preprocessing steps, you can resolve the issue and ensure your SFT training runs smoothly. Remember to consistently format your data and implement validation checks to prevent future token mismatches. Keep experimenting and learning, and you'll become a pro at handling these challenges! Happy coding, and may your training runs be error-free!