Is Vibe Coding ML a Viable Approach?

Vibe coding has become so popular recently that it already has its own Wikipedia page. If you haven’t come across it before, vibe coding is an emerging development technique that uses natural language to write code and build applications. This approach has already demonstrated strong results in several areas of software development. Machine Learning (ML), however, involves a wide range of tasks, including:

  • data visualization,
  • model training,
  • evaluation,
  • deployment.

While it’s clear that coding agents can accelerate routine programming tasks, the more relevant question is: can they truly support core ML workflows?

For the purposes of this article, I’ll be using Cursor as an agentic code editor. I’ve also experimented with coding agents in VSCode, but in my experience, Cursor provides a smoother and more efficient experience, particularly for ML-related development.

Datasets

For my tests, I used two distinct datasets: the plant-disease dataset from DiaMOS and the Unemployment Duration dataset. These datasets differ significantly in structure and complexity, which allowed me to evaluate how well coding agents handle both straightforward and more challenging ML problems.

The plant-disease dataset presents a relatively simple supervised learning task: a 4-class classification problem. Each image corresponds to a type of plant leaf, labeled with one of four disease categories. This makes it ideal for testing basic training pipelines, image preprocessing, and explainability methods such as Grad-CAM or saliency maps. Class distributions are moderately imbalanced, but overall, the dataset is clean and visually consistent, allowing for fast prototyping and model iteration.

In contrast, the Unemployment Duration dataset introduces a much more complex challenge. It involves a survival analysis problem, where the target is not a fixed label but the duration of unemployment, along with censoring indicators that describe whether the event (finding a job) occurred or not. The dataset includes multiple one-hot encoded variables, numeric features (e.g., age, wage), and temporal elements. This structure requires a deeper understanding of the semantics behind each variable, particularly the censoring logic, to avoid misleading results during analysis and modeling.

These two datasets offer complementary perspectives: one focused on efficient classification of image data, and the other on the intricacies of censored time-to-event prediction. They serve as a strong foundation for evaluating the effectiveness and limitations of agent-driven machine learning workflows.

Exploratory Data Analysis

The exploratory data analysis (EDA) prompt was quite straightforward:

Create EDA for plant-disease from scratch as a jupyter notebook.

To have the output returned directly as a Jupyter notebook, I had to use the Sonnet-4 model.

The generated notebook was impressive. Although there were some minor issues with plot alignment, it produced a useful set of basic EDA outputs. These included visualizations for each class, class distribution statistics, as well as image size ratios and frequency distributions. From this analysis, it was immediately clear that the dataset was imbalanced.

class-distribution
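
For reference, the core of such an EDA boils down to a few lines of Python. Here is a minimal sketch, assuming the common `dataset/<class_name>/*.jpg` directory layout (the actual DiaMOS structure may differ):

```python
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

DATA_DIR = Path("dataset")  # assumed layout: dataset/<class_name>/*.jpg

# Count images and collect sizes per class
class_counts = Counter()
sizes = []
for img_path in DATA_DIR.glob("*/*"):
    if img_path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
        continue
    class_counts[img_path.parent.name] += 1
    with Image.open(img_path) as img:
        sizes.append(img.size)  # (width, height)

print("Class distribution:", dict(class_counts))
print("Most common image sizes:", Counter(sizes).most_common(5))

# Bar chart of the class distribution – imbalance is visible at a glance
plt.bar(class_counts.keys(), class_counts.values())
plt.title("Class distribution")
plt.ylabel("Number of images")
plt.show()
```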

The model also provided actionable suggestions based on its findings. These were intended to improve model performance by addressing potential issues in the dataset.

This approach significantly speeds up the initial EDA phase by automatically generating tailored code and results based on the dataset. However, it is still important for ML practitioners to verify the outputs. While image datasets lend themselves well to this kind of automated analysis, more complex data structures may lead to incorrect insights or flawed visualizations.

Potential Issues:

  • Class imbalance detected. The model suggested addressing this through data augmentation or a weighted loss function.
  • Image sizes were relatively consistent, which simplifies preprocessing.

Recommendations:

  1. Apply data augmentation to increase dataset size and improve generalization.
  2. Use a stratified train/validation split to preserve class distribution.
  3. Resize all images to consistent dimensions (e.g., 224×224) to standardize input.
  4. Normalize images using ImageNet statistics, especially when applying transfer learning.
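
Recommendations 1, 3, and 4 map almost directly onto standard torchvision transforms. A minimal sketch of that preprocessing (the specific augmentations here are my choice, not necessarily what Cursor generated):

```python
from torchvision import transforms

# ImageNet statistics, as recommended when applying transfer learning
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # consistent input dimensions
    transforms.RandomHorizontalFlip(),        # simple augmentation
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```

The stratified split from recommendation 2 comes up again later, in the PoC section.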

Potential problems

A good example of an EDA result that is not fully correct comes from the unemployment dataset. Below is a summary of the model’s outputs across two iterations using the same dataset.

Cursor's output

First iteration:

🔍 KEY FINDINGS:

  • UI recipients: 8.0 weeks average
  • Non-UI recipients: 4.1 weeks average
  • Difference: 3.9 weeks (longer for UI recipients)

Second iteration:

⏱️ DURATION INSIGHTS:

  • Average duration: 6.2 weeks
  • Median duration: 5.0 weeks
  • Duration variability: 5.6 weeks (std)
  • Range: 1-28 weeks
  • Distribution: Right-skewed (skewness = 1.52)

🔍 KEY FINDINGS:

  • UI effect: +3.9 weeks (longer for UI recipients)
  • Age correlation: 0.153 (positive relationship)
  • Wage correlation: 0.040 (higher wages = longer search)
  • Replacement rate correlation: -0.030

As you can see, the results vary between runs, even though the dataset remains unchanged and the numbers come from code that the model itself generated. More importantly, both outputs are incorrect.

The core issue stems from the dataset's structure, which includes four one-hot encoded censoring variables that define the nature of the unemployment spell's outcome.

distribution of censoring variables

These variables indicate which type of event concluded (or failed to conclude) the unemployment period. Unfortunately, the model treats them as independent input features, rather than as outcome-related indicators, even though their intended role is clearly explained in the accompanying documentation.

For instance, if the recorded duration is 14 weeks and censor4 is set to 1, it means the subject remained unemployed at the end of the observation window, and the event (finding a job) never occurred. In such cases, the true duration is right-censored – the unemployment spell continues beyond the 14 weeks recorded in the dataset.
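
To make the censoring logic concrete, here is a minimal sketch using lifelines’ Kaplan–Meier estimator. The column names (`spell` for the recorded duration in weeks, `censor4` for “still unemployed”) are assumptions about the CSV layout:

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Column names are assumptions about the file layout, not verified against it
df = pd.read_csv("unemployment_duration.csv")

durations = df["spell"]             # observed duration in weeks
event_observed = 1 - df["censor4"]  # 1 = found a job, 0 = right-censored
# (a simplification: everything except "still unemployed" counts as the event)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed, label="all spells")

print("Median time to re-employment:", kmf.median_survival_time_)
kmf.plot_survival_function()  # unlike a plain mean, this respects censoring
```

A plain average of the recorded durations, which is essentially what the generated code reported, systematically underestimates how long unemployment lasts, because censored spells are still ongoing when observation stops.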

The explanatory variables in the dataset are as follows:

unemployment duration dataset

AutoML vs Vibe Coding

Numerous AutoML frameworks available today enable the creation of entire machine learning pipelines with minimal manual input and limited ML knowledge. While this approach significantly lowers the entry barrier, it often results in black-box models with limited flexibility, making it harder to debug, extend, or interpret the results, especially in complex or regulated environments.

In contrast, vibe coding leverages the power of natural language prompts and large language models to assist developers who already possess machine learning expertise, or at least can verify and validate results. This approach has the potential to speed up ML pipeline creation while still maintaining full control over the code, architecture, and interpretability. Developers can extend or adapt the pipeline as needed, striking a balance between automation and transparency. A high-level comparison of both approaches is outlined below:

| Aspect | AutoML | Vibe Coding |
| --- | --- | --- |
| Goal | Automate end-to-end ML pipeline | Generate ML code via natural language prompts |
| Flexibility | Low–medium; predefined workflows | High; adaptable to any coding scenario |
| Ease of Use | Very easy; minimal coding required | Requires coding skills and ML understanding |
| Best For | Fast prototyping, structured data, business use | Custom ML workflows, experimentation, PoCs |
| Learning Value | Low; abstracts most complexities | High; encourages hands-on ML learning and iteration |

Vibe coding PoC: image classifier

For my tests, I used automatic model selection and the following initial prompt:

Build a lightweight image classification model in PyTorch to classify leaf images into 4 categories. Use one of small image classification models. Load images from a directory, and include preprocessing, training, and evaluation steps. Keep the code modular, efficient, and well-commented.

Using the plant-disease dataset, Cursor generated code that handled data loading, basic preprocessing, a train-validation split (though without a fixed random seed), model training, and evaluation. The reported training accuracy reached 87%, which at first glance seems impressive, doesn’t it?

loss metrics

Plots show the loss metrics on the transformed dataset

However, there are a few issues worth highlighting:

  1. Overfitting risk: As seen in the plotted losses, training and validation losses begin to diverge at a certain point, indicating that the model is starting to overfit. Training should likely have been stopped earlier.
  2. Wasted computation: A significant portion of the training epochs likely contributed no further performance gain and could have been skipped to save time and resources.
  3. Unseeded split: The dataset split wasn’t seeded, meaning we cannot reliably reproduce results or test the trained model, as we don’t know which data was used in each set.
  4. Metric limitations: While an accuracy of 87% may seem impressive, it can be misleading in a 4-class classification problem, especially if the class distribution is imbalanced. Without visibility into the class proportions, there's a risk that the model has simply learned to predict the dominant class, which might account for 87% of the samples. In such cases, accuracy alone is insufficient. More informative metrics include the F1 score, ROC curves (per class), or a full classification report that provides precision, recall, and support for each class.
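
Points 3 and 4 above are cheap to fix. A minimal sketch of what that could look like with scikit-learn (the variable names are illustrative, not taken from the generated code):

```python
import torch
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split

SEED = 42
torch.manual_seed(SEED)  # reproducible weight initialization and shuffling

def stratified_split(image_paths, labels, test_size=0.2):
    """Seeded, stratified split that preserves the class proportions."""
    return train_test_split(
        image_paths, labels,
        test_size=test_size, stratify=labels, random_state=SEED,
    )

def full_report(y_true, y_pred, class_names):
    """Per-class precision/recall/F1 plus macro F1 – more informative than accuracy alone."""
    print(classification_report(y_true, y_pred, target_names=class_names, digits=4))
    print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```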

After a few iterations, I successfully integrated early stopping with a patience of 3 epochs and configured the training process to save the best-performing model. The agent was unable to make these adjustments on its own and required manual guidance to produce correct results.
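
For reference, the early-stopping and best-model logic I guided the agent towards is conceptually just a handful of lines. A simplified sketch, not the exact code from the session:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate, max_epochs=50, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs; keep the best weights."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)

        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}")
                break

    model.load_state_dict(best_state)  # restore the best-performing checkpoint
    return model
```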

These refinements led to strong performance on the test set, achieving an accuracy of 0.867 and an F1 score of 0.869.

The complete classification report is as follows:

            precision    recall  f1-score   support

     healthy     0.7500    0.8571    0.8000         7
        spot     0.7634    0.8680    0.8124       197
        slug     0.9307    0.8660    0.8972       388
        curl     1.0000    0.9000    0.9474        10

    accuracy                         0.8671       602
   macro avg     0.8610    0.8728    0.8642       602
weighted avg     0.8750    0.8671    0.8691       602

On the positive side, Cursor did an excellent job constructing the DataLoader, including the appropriate normalization using MobileNet’s ImageNet statistics in the preprocessing step. Overall, the experiment confirmed that this kind of classification is possible, so Cursor managed to rapidly create a somewhat working PoC.
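
Given the MobileNet normalization, the backbone was presumably one of torchvision’s MobileNet variants. Here is my own reconstruction of how such a lightweight classifier is typically set up (not Cursor’s exact code):

```python
import torch.nn as nn
from torchvision import models

def build_leaf_classifier(num_classes: int = 4) -> nn.Module:
    """Lightweight transfer-learning classifier for the 4 leaf-disease classes."""
    weights = models.MobileNet_V3_Small_Weights.IMAGENET1K_V1
    model = models.mobilenet_v3_small(weights=weights)

    # Replace the final classification layer with a 4-class head
    in_features = model.classifier[3].in_features
    model.classifier[3] = nn.Linear(in_features, num_classes)
    return model
```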

AutoML PoC

I also evaluated how the results of a vibe-coded solution compare to those of an AutoML approach. For this comparison, I used AutoKeras, which automatically constructs image classification models using a selection of pretrained architectures such as EfficientNet, ResNet, Xception, and MobileNet.

I configured AutoKeras to run 5 trials, and the best model it produced achieved 67% accuracy. However, this result is misleading. The AutoML process appeared to be overly focused on minimizing validation loss, which caused it to converge on a model that simply predicted the majority class. In this case, accuracy is not a meaningful metric, as the model learned to classify nearly all samples as a single dominant class.

Interestingly, AutoKeras decided that none of the available pretrained architectures were suitable and instead opted to train a vanilla CNN from scratch. The performance of this model is summarized in the classification report below:

             precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.00      0.00      0.00         7
           2       0.67      1.00      0.81       304
           3       0.00      0.00      0.00       132

    accuracy                           0.67       451
   macro avg       0.17      0.25      0.20       451
weighted avg       0.45      0.67      0.54       451

This output clearly highlights the class imbalance issue and the model’s failure to generalize across all classes. Despite a superficially decent accuracy score, the macro F1 score and per-class metrics reveal significant performance gaps.

Beyond the modeling limitations and errors described above, the process itself was not entirely straightforward. The dataset was large enough that using cloud-based AutoML tools became impractical due to upload and processing constraints. Additionally, AutoKeras does not handle data loading, so I had to create a custom data loader, apply the necessary preprocessing steps, and manually split the dataset into training, validation, and test sets.
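
For comparison, the AutoKeras part itself is short once the arrays are prepared. A minimal sketch, with the manual loading and splitting assumed to happen elsewhere:

```python
import autokeras as ak
from sklearn.metrics import classification_report

def run_automl(x_train, y_train, x_test, y_test, trials: int = 5):
    """Run an AutoKeras image-classification search on pre-loaded numpy arrays.

    AutoKeras does not load images from directories here, so the arrays
    (and the train/validation/test split) must be prepared beforehand.
    """
    clf = ak.ImageClassifier(max_trials=trials, overwrite=True, seed=42)
    clf.fit(x_train, y_train, validation_split=0.2, epochs=10)

    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))

    return clf.export_model()  # underlying Keras model of the best trial
```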

Experiments

Then, I returned to the plant-disease dataset and used the prompt:

Create experiments for leaves model.

The agent’s code proposal was suitable for basic validation and included three tiers of experiments designed to test various configurations of the training pipeline. The experiments looked as follows:

  • Simple Experiments tested combinations of three learning rates, three batch sizes, and three optimizers.
  • Quick Experiments expanded on this by exploring a broader range of hyperparameter combinations and introducing data augmentation techniques.
  • Full Grid Search conducted an exhaustive sweep of all possible combinations of the defined learning rates, batch sizes, and optimizers.

This setup serves as a solid starting point for initial experimentation. All experiment types are executed as single-epoch runs, and the code functions correctly, returning accuracy metrics for various parameter configurations. It also includes basic logging of the parameters used during testing.

The results for the simple experiments are as follows:

🎯 Best Learning Rate: 0.001 (81.73%)
📦 Best Batch Size: 32 (83.89%)
⚡ Best Optimizer: SGD (84.72%)

However, it lacks dataset version tracking and relies solely on grid search. Although there is an option to limit the number of tests, this still falls short in terms of efficiency and flexibility.
A more efficient alternative to grid search is random search or a model-based sampler such as TPE (Tree-structured Parzen Estimator), which optimize results without exhaustively testing all combinations. Tools like Hydra, combined with the Optuna Sweeper, further streamline experiment management and enable advanced, parallelized hyperparameter tuning.
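
As an illustration of that route, here is a generic Optuna sketch (not the Hydra + Optuna Sweeper configuration itself); `train_and_evaluate` stands in for the existing single-epoch training routine, and the candidate values are my assumptions:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space mirrors the agent's grid: learning rate, batch size, optimizer
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    optimizer_name = trial.suggest_categorical("optimizer", ["adam", "sgd", "rmsprop"])

    # Placeholder: plug in the existing training/evaluation routine here
    accuracy = train_and_evaluate(lr=lr, batch_size=batch_size, optimizer=optimizer_name)
    return accuracy

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=20)  # far fewer runs than the full grid
print("Best params:", study.best_params, "Best accuracy:", study.best_value)
```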

Explainability

I tested explainability for an NLP-based classification model by introducing SHAP to analyze feature impact. However, the agent failed to grasp the pipeline fully. It used sample text inputs that didn't correspond to the actual dataset and even generated examples in the wrong language. Despite multiple attempts, the agent couldn't produce a meaningful result. It’s possible that with more precise and detailed prompting, it could work, but in my view, crafting such a specialized prompt is more time-consuming than implementing the explainability manually.

Switching focus back to the plant-disease dataset, I wanted to evaluate the explainability of the previously generated image classification model. I asked the agent to produce Feature Activation Maps, Grad-CAM, and Saliency Maps.

The results were functional, but the process was inefficient. The agent generated over 1,000 lines of code, spread across multiple versions of similar scripts (e.g., grad_cam.py, simple_grad_cam.py, working_gradcam.py, etc.). The process took around 30 minutes. In reality, these visualizations could have been implemented in under 50 lines of concise, readable code. While the final output was technically correct, the overhead introduced by the agent reduced its practical usefulness.

maps comparison
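
To back up the “under 50 lines” claim, a basic saliency map really is this short. A minimal sketch, assuming a trained PyTorch model and an already preprocessed input tensor:

```python
import matplotlib.pyplot as plt
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Gradient of the top-class score w.r.t. the input pixels.

    `image` is a single preprocessed tensor of shape (C, H, W).
    """
    model.eval()
    x = image.detach().clone().unsqueeze(0).requires_grad_(True)  # add batch dim

    scores = model(x)
    top_class = scores.argmax(dim=1).item()
    scores[0, top_class].backward()            # backprop the winning logit

    # Max over channels gives one importance value per pixel
    return x.grad.abs().max(dim=1).values.squeeze(0)

# Usage sketch:
# sal = saliency_map(model, image_tensor)
# plt.imshow(sal.cpu(), cmap="hot"); plt.axis("off"); plt.show()
```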

Deployment

Another test I conducted involved asking Cursor to assist with deploying a more complex machine learning pipeline.

The pipeline under evaluation combined several components. First, PaddleOCR scanned receipts and parsed them into a line-by-line text format. Then, product descriptions were encoded into embeddings using the all-MiniLM-L6-v2 sentence transformer. Finally, the embeddings were passed into an XGBoost classifier that predicted one of six predefined product categories.

Cursor was unable to generate a fully working solution without substantial manual intervention. It selected incorrect models (using the original version instead of a locally fine-tuned one), applied different OCR configurations, and presented results inconsistently across iterations. Achieving a functional implementation required step-by-step interaction, involving prompt refinement and manual correction of various components.

Eventually, after a few iterations, the pipeline was successfully deployed. The solution used FastAPI for serving the model and Gunicorn as the application server. Cursor also generated a simple user interface to allow image uploads and visualize the output. While the setup was far from production-ready, it provided a solid proof-of-concept and demonstrated the viability of the approach.

product-classification
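
For illustration, the serving layer itself can stay quite thin. Below is a heavily simplified sketch of such an endpoint; the model paths, category names, and the exact PaddleOCR result structure are my assumptions, and error handling is omitted:

```python
import tempfile

import xgboost as xgb
from fastapi import FastAPI, File, UploadFile
from paddleocr import PaddleOCR
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Loaded once at startup; paths and label set are illustrative
ocr = PaddleOCR(lang="en")
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # the real pipeline should point to the fine-tuned copy
classifier = xgb.XGBClassifier()
classifier.load_model("product_classifier.json")
CATEGORIES = ["food", "household", "cosmetics", "electronics", "clothing", "other"]

@app.post("/classify-receipt")
async def classify_receipt(image: UploadFile = File(...)):
    # Persist the upload so PaddleOCR can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        tmp.write(await image.read())
        tmp.flush()
        result = ocr.ocr(tmp.name)

    # Extract recognized text line by line (result layout depends on the PaddleOCR version)
    lines = [item[1][0] for item in result[0]]

    embeddings = encoder.encode(lines)      # one embedding per receipt line
    predictions = classifier.predict(embeddings)

    return {line: CATEGORIES[int(pred)] for line, pred in zip(lines, predictions)}
```

In a setup like this, Gunicorn would typically run the app through Uvicorn workers, e.g. `gunicorn -k uvicorn.workers.UvicornWorker main:app`.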

This is debatable, but Cursor did seem to get lost several times during the debugging process. In some cases, writing the code manually, or via a plain chat interface, might have been a faster way to reach a working solution, mainly because the code itself could be shorter than a sufficiently detailed prompt. This was likely due to the large context and complexity of the pipeline. However, one clear advantage was that the generated visual presentation of the results looked clean and intuitive.

Conclusions

Coding agents are undoubtedly a key part of the future of programming. However, skilled machine learning practitioners remain essential to supervise and guide the development process. These tools offer clear productivity benefits when used by someone with domain expertise. Still, they can easily produce misleading or incorrect results if used by individuals who lack a deep understanding of the underlying methods.

In practice, coding agents behave more like next-generation IDEs – most effective when autocompleting supervised code or generating well-defined functions based on precise, expert-level instructions.

There are several reasons why coding agents are not yet capable of producing fully production-ready applications. A key limitation is their incomplete contextual understanding, which often surfaces in subtle but important ways. For instance, when I asked Cursor to perform multiple related tasks, it failed to reuse previously generated code – even when the relevant context was provided – and instead recreated new code from scratch each time. This behavior raises valid concerns about code maintainability, consistency, and overall workflow efficiency.

In my experience, Cursor struggled with environment management. It frequently ignored the existing virtual environment and attempted to install packages globally. As a result, it would reinstall dependencies unnecessarily, rather than leveraging tools like uv or correctly handling virtual environments, which introduced inefficiencies during execution.

On the positive side, Cursor performed especially well in data analysis and visualization. Even when more detailed explanations were required, the visualizations were accurate, clear, and well-presented. Additional strengths include project structuring, data loading, and proof-of-concept generation for relatively simple datasets. These capabilities can be further improved by providing detailed, step-by-step prompts.

To sum up, I plan to continue incorporating coding agents into my daily workflow and encourage you to explore how they can enhance yours as well.

Technical review: Adam Wawrzyński, Łukasz Lenart
