renegade-otter 2 days ago

In every single system I have worked on, tests were not just tests - they were their own parallel application, one that required careful architecture and constant refactoring to keep it from getting out of hand.

"More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code. Sometimes I spend more time on the test code than the actual code (probably normal).

Also, I feel like people would be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of UX. Using an autocomplete to "bang it out" seems foolish.

  • jeswin 2 days ago

    > Using an autocomplete to "bang it out" seems foolish.

    Based on my own experience, I find the widespread scepticism on HN about AI-assisted coding misplaced. There will be corner cases, there will be errors, and there will be bugs. There will also be apps for which AI is not helpful at all. But that's fine - nobody is saying otherwise. The question is only about whether it is a _significant_ net saving on the time spent across various project types. The answer to that is a resounding Yes.

    The entire set of tests for a web framework I wrote recently was generated with Claude and GPT. You can see them here: https://github.com/webjsx/webjsx/tree/main/src/test

    On average, these tests are better than tests I would have written myself. The project was written mostly by AI as well, like most other stuff I've written since GPT-4 came out.

    "Using an autocomplete to bang it out" is exactly what one should do - in most cases.

    • thanksgiving a day ago

      I want to share my own experience from a code base I briefly worked on about ten years ago: a module where basically all the unit test assertions were commented out. The meta point is that there should still be someone responsible for the code an LLM generated, and there should still be at least one more person who does a decent code review at some point. Otherwise, the unit tests being there is useless, just like the example I gave above with the assertions removed.

    • beepbooptheory a day ago

      Ok, but looking at those tests for just a second (for createElement), you might want to go through them again, or ask the computer or whatever. For example, edgeCases.test.ts is totally redundant - you are running the exact same tests in children.test.ts.

      Edit: such an LLM repo... why did it feel the need to recreate these DOM types? Is your AI just trying to maximize LoC? It just seems like such a pain and a potential source of real trouble when these are already available. https://github.com/webjsx/webjsx/blob/main/src%2FjsxTypes.ts

      • jeswin 18 hours ago

        Actually, the file you identified is the (only) one that's mostly human-written. It came from a previous project; I may be able to get rid of it.

        But generally, the tests are very useful. My point is that there will be redundancies, and maybe even bugs - and that's fine, because the time needed to fix these would be much less than what it would have taken to write them from scratch.

  • swatcoder 2 days ago

    Fully agreed.

    It's bad enough when human team members are submitting useless, brittle tests with their PRs just to satisfy some org pressure to write them. The lazy ones provide a false sense of security even though they neglect critical scenarios, the unstable ones undermine trust in the test output because they intermittently raise false alarms that nobody has time to debug, and the pointless ones do nothing but reify the architecture so that it becomes too laborious to refactor anything.

    As contextually aware generators, there are doubtless good uses for LLMs in test development, but (as with many other domains) they threaten to amplify an already troubling problem with low-quality, high-volume content spam.

  • BeetleB 2 days ago

    Mostly agree.

    My first thought when I read this post was: Is his goal to test the code, or validate the features?

    The first problem is he's providing the code, and asking for tests. If his code has a bug, the tests will enshrine those bugs. It's like me writing some code, and then giving it to a junior colleague, not providing any context, and saying "Hey, write some tests for this."

    This is backwards. I'm not a TDD guy, but you should think of your test cases independently of your code.

    • _puk 2 days ago

      But in a system that exists without tests (this is the real world after all), the current functionality is already enshrined in the app.

      Adding tests that capture the current state of things means that when such a bug is uncovered, the tests can easily be updated to the correct functionality to prove the bug prior to fixing it. That is a much better place to be than the status quo.

      The horse may have bolted from the barn, but we can at least close the farm gate in the hopes of recapturing it eventually.

    • renegade-otter a day ago

      Right! AI is going to help you write passing tests - not BREAK your code, which is the whole point of writing tests.

      • GuB-42 a day ago

        Tests are not just for breaking your code. Writing passing tests is great for regression testing, which I think is the most important kind of unit testing.

        If your goal is to break your code, try fuzzing. For some reason, it seems that the only people who do it are in the field of cybersecurity. Fuzzing can do more than find vulnerabilities.
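
        For example, property-based testing (a close cousin of fuzzing) is easy to try from an ordinary test suite. A minimal sketch in Python with the hypothesis library - the function under test is a made-up example:

            # pip install hypothesis pytest
            from hypothesis import given, strategies as st

            def normalize_username(name: str) -> str:
                # Made-up function under test: trim whitespace, lowercase.
                return name.strip().lower()

            @given(st.text())
            def test_normalize_is_idempotent(name):
                # Property: normalizing twice is the same as normalizing once.
                once = normalize_username(name)
                assert normalize_username(once) == once

            @given(st.text())
            def test_normalize_has_no_outer_whitespace(name):
                # Property: the result never starts or ends with whitespace.
                result = normalize_username(name)
                assert result == result.strip()

        Hypothesis will throw thousands of weird inputs (including odd Unicode) at those properties and shrink any failure to a minimal counterexample - much closer to "trying to break the code" than a hand-picked example ever is.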

    • sumedh 2 days ago

      > not providing any context

      You can provide the context to an AI model though - you can share the source with it.

  • danmaz74 a day ago

    I subscribe to the concept of the "pyramid of tests" - lots of simpler unit tests, fewer integration tests, and very few end-to-end tests. I find that using LLMs to write unit tests is very useful. If the code I just wrote has good naming for classes, methods and variables, useful comments where necessary, and I already have other tests the LLMs can use as examples of how I test things, I usually just need to read the generated tests and sometimes add some test cases, writing only the "it should 'this and that'" part for cases which weren't covered.

    An added bonus is that if the tests aren't what you expect, that often helps you realize the code isn't as clear as it should be.

    • holbrad a day ago

      I also subscribe to a testing pyramid, but IMO the common one is upside down.

      You should have a few very granular unit tests where they make the most sense (known dangerous areas, or places where they are very easy to write, e.g. analysis).

      Then more library/service tests: I read in an old config file and check that it has the values I expect.

      Integration/system tests should be the most common: I spin up the entire app in a container and use the public API to test the application as a whole.

      Then, most importantly, automated UI tests: I run the standard customer workflows and either they work or they don't.

      The nice thing is that when you strongly rely on UI and public API tests you can have very strong confidence that your core features actually work. And when there are bugs they are far more niche. And this doesn't require many tests at all.

      (We've all been in the situation where the 50,000 unit tests pass and the application is critically broken)
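
      A rough sketch of what I mean by a container-level test, in Python with pytest and requests - the image name and endpoints are placeholders for whatever your app actually exposes:

          # pip install pytest requests
          import subprocess
          import time

          import pytest
          import requests

          APP_IMAGE = "myorg/myapp:latest"    # placeholder image name
          BASE_URL = "http://localhost:8080"  # placeholder port mapping

          @pytest.fixture(scope="session")
          def app_container():
              # Start the whole application in a container for the test session.
              cid = subprocess.run(
                  ["docker", "run", "-d", "--rm", "-p", "8080:8080", APP_IMAGE],
                  capture_output=True, text=True, check=True,
              ).stdout.strip()
              # Poll a (hypothetical) health endpoint until the app is up.
              for _ in range(30):
                  try:
                      if requests.get(f"{BASE_URL}/health", timeout=1).ok:
                          break
                  except requests.ConnectionError:
                      pass
                  time.sleep(1)
              yield BASE_URL
              subprocess.run(["docker", "stop", cid], check=False)

          def test_customer_can_create_and_fetch_order(app_container):
              # Exercise one real customer workflow end to end via the public API.
              created = requests.post(f"{app_container}/orders", json={"sku": "abc", "qty": 1})
              assert created.status_code == 201
              fetched = requests.get(f"{app_container}/orders/{created.json()['id']}")
              assert fetched.ok and fetched.json()["qty"] == 1

      A handful of tests like that cover a lot of real behaviour, because they only pass if the whole stack actually works.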

  • viraptor 2 days ago

    Pretty much this, and I prefer the opposite direction: "Here's the new test case from me, make the code pass it" is a decent workflow with Aider.

    I get that occasionally there are some really trivial but important tests that take time and would be nice to automate. But that's a minority in my experience.

  • skissane 2 days ago

    > "More tests" is not the goal - you need to write high impact tests, you need to think about how to test the most of your app surface with least amount of test code.

    Are there ways we can measure this?

    One idea that I’ve had is to collect code coverage separately for each test. If a test isn’t covering any unique code or branches, maybe it is superfluous - although not necessarily: it can make sense to separately test all the boundary conditions of a function, even if doing so doesn’t hit any unique branches.
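
    Something like this sketch is what I have in mind - per-test coverage data can come from coverage.py's dynamic contexts (e.g. pytest-cov's --cov-context=test); the set arithmetic and the example data below are just illustrative:

        from collections import defaultdict

        def tests_with_no_unique_coverage(per_test_lines):
            """per_test_lines maps a test id to the set of (file, lineno) pairs it covered."""
            # Count how many tests cover each line.
            line_counts = defaultdict(int)
            for lines in per_test_lines.values():
                for line in lines:
                    line_counts[line] += 1
            # A test is a candidate for removal if every line it covers
            # is also covered by at least one other test.
            return [
                test for test, lines in per_test_lines.items()
                if lines and all(line_counts[line] > 1 for line in lines)
            ]

        # Made-up example data:
        per_test = {
            "test_parse_ok":      {("parser.py", 10), ("parser.py", 11)},
            "test_parse_empty":   {("parser.py", 10), ("parser.py", 15)},
            "test_parse_ok_copy": {("parser.py", 10), ("parser.py", 11)},
        }
        print(tests_with_no_unique_coverage(per_test))
        # -> ['test_parse_ok', 'test_parse_ok_copy'] (neither covers a unique line)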

    Maybe prefer a smaller test which covers the same code to a bigger one. However, if a test is very DRY, it can sometimes be more brittle, since it can be non-obvious how to update it to handle a code change. Updating a repetitive test can be laborious, but it is at least reasonably obvious how to do so.

    Could an LLM evaluate test quality, if you give it a prompt containing some expert advice on good and bad testing practices?

    • fijiaarone 2 days ago

      Sometimes you actually have to think, or hire someone who can. Go join the comments section on the Goodhart's Law post to go on about measuring magical metrics.

      • skissane 2 days ago

        > Sometimes you actually have to think, or hire someone who can.

        I'm perfectly capable of thinking. Thinking about "how can I create a system which reduces some of my cognitive load on testing so I can spend more of my cognitive resources on other things" is a particularly valuable form of thinking.

        > Go join the comments section on the Goodhart's Law post to go on about measuring magical metrics.

        That problem arises when managers take a metric and turn it into a KPI. That doesn't happen to all metrics. I can think of many metrics I've personally collected that no manager ever once gazed upon.

        The real measure of a metric's value, is how meaningful a domain expert finds it to be. And if the answer to that is "not very" – is that an inherent property of metrics, or a sign that the metric needs to be refined?

        • jaredsohn 2 days ago

          Good tests reduce your cognitive load; you can have more confidence that code will work and spend less time worrying that someone will break it.

          BTW, I think the above are the best metrics to use for tests. Actually measuring them can be hard, but I think keeping track of when functionality doesn't work and when people break your code is a good start.

          And I think all of this should be measured in terms of doing the right thing business-logic-wise, weighing the importance of what needs testing by the business impact when things don't work.

  • dngit 2 days ago

    Great point on focusing on high-impact tests. I agree that LLMs risk giving a false sense of coverage. Maybe a smart strategy is to have the LLM generate the boilerplate tests while we focus on custom edge cases.

    • idoco a day ago

      Absolutely with you on the need for high-impact tests. I find that humans are still way better at coming up with the tests that actually matter, while AI can handle the implementation faster—especially when there’s a human guiding it.

      Keeping a human in the loop is essential, in my experience. The AI does the heavy lifting, but we make sure the tests are genuinely useful. That balance helps avoid the trap of churning out “dumb” tests that might look impressive but don’t add real value.

  • nrnrjrjrj 2 days ago

    There is an art to writing tests, especially getting abstraction levels right. For example, do you integration-test hitting the password field with 1000 cases, or do that as a unit test - and does doing it as a unit test sufficiently cover this?

    AI could do all this thinking in the future, but not yet, I believe!

    On top of that, the codebase is likely a mess of bad practice already (never seen one that isn't! That is life), so often part of the job is leaving the campground a bit better than you found it.

    LLMs can help now on last mile stuff. Fill in this one test. Generate data for 100 test cases. Etc.

  • bryanrasmussen 2 days ago

    >Sometimes I spend more time on the test code than the actual code (probably normal).

    This seems like the kind of thing that should be highly dependent on the kind of project one is doing. If you have an MVP and your test code is taking longer than the actual code, then the test code is clearly antagonistic to the whole concept of an MVP.

  • aoeusnth1 a day ago

    Detecting regressions is the goal. If LLMs can do that for free or cheap, that’s good. It doesn’t have to be complicated.

  • idoco a day ago

    Totally agree, especially about the need for well-architected, high-impact tests that go beyond just coverage. At Loadmill, we found out pretty early that building AI to generate tests was just the starting point. The real challenge came with making the system flexible enough to handle complex customer architectures. Think of multiple test environments, unique authentication setups, and dynamic data preparation.

    There’s a huge difference between using an LLM to crank out test code and having a product that can actually support complex, evolving setups long-term. A lot of tools work great in demos but don’t hold up for these real-world testing needs.

    And yeah, this is even trickier for higher-level tests. Without careful design, it’s way too easy to end up with “dumb” tests that add little real value.

mastersummoner 2 days ago

I actually tested Claude Sonnet to see how it would fare at writing a test suite for a background worker. My previous experience was with some version of GPT via Copilot, and it was... not good.

I was, however, extremely impressed with Claude this time around. Not only did it do a great job off the bat, but it taught me some techniques and tricks available in the language/framework (Ruby, Rspec) which I wasn't familiar with.

I'm certain that it helped having a decent prompt, asking it to consider all the potential user paths and edge cases, and also having a very good understanding of the code myself. Still, this was the first time for me I could honestly say that an LLM actually saved me time as a developer.

  • shadowmanifold a day ago

    This latest update to Sonnet is super impressive.

    We are really already past the point of being able to discuss these matters in large groups, though.

    The herd speaks as if all LLMs on all programming languages are basically the same.

    It is an absurdity. Talking to the herd is mostly for entertainment at this point. If I actually want to learn something, I will ask Sonnet.

  • throwa5456435 a day ago

    All this makes me think making software engineers redundant is really the "killer app" of LLMs. This is where the AI labs are spending most of their effort - it's the best marketing for their product, after all. Fear sells better than greed (loss aversion), making engineers take notice and unable to dismiss it.

    Despite some of the comments on this thread, and despite not wanting it to be true, I must admit LLMs are impressive. Software engineers and ML specialists have finally invented the thing that disrupts their own jobs substantially, either via a large reduction in hours and/or a reduction in staff. As the hours a software engineer spends coding diminish by large factors, so too (especially in this economy) will the hours an organization needs to pay an engineer for, up to the point where anyone can create code and learn from an LLM as you have just done. Once everybody is special, no one is; fundamentally, employment and the value of things created from software come from scarcity, just like everything else in our current system.

    I think there are probably only a few years left where software engineers are around - or at least seen as a large part of an organization, with large teams, etc. Yes, AI software will have bugs, and yes, it won't be perfect, but you can get away with just one or two engineers for a whole org to fix the odd blip of an LLM. It feels like people are picking on minor things at this point, which, while true, are costs a business will consider "meh" while the gains of removing engineers are substantial.

    I want to be wrong; but every time I see someone "learning from LLMs", saving lots of time doing stuff, saving hundreds of hours, etc., I think: it's only 2-3 years in, and already it has come this far.

    • fragmede a day ago

      > Yes, AI software will have bugs, and yes, it won't be perfect, but you can get away with just one or two engineers for a whole org to fix the odd blip of an LLM.

      Maybe. A lot of places have headcount limits on software devs because of budget constraints. As in, the reason they don't hire more is that they can't afford it, not that there is a shortage of code to write and bugs to give. The more optimistic view is that the nature of being a software engineer will adjust to the increased productivity and focus on the parts of the job that LLMs can't do, with a market for experts who are skilled at removing "the odd blip from an LLM". Expertise will also move into areas where there's less or insufficient training data for a particular niche. One way to future-proof yourself is to find the places where LLMs frequently make up non-existent libraries or are bad at code in a particular language, and specialize in that.

mkleczek 2 days ago

I am very sceptical of the usefulness of LLM (or any AI) code generation, and my scepticism does not really have anything to do with AI itself.

In the past I've been involved in several projects deeply using MDA (Model Driven Architecture) techniques which used various code generation methods to develop software. One of the main obstacles was the problem of maintaining the generated code.

IOW: how should we treat generated code?

If we treat it in the same way as code produced by humans (ie. we maintain it), then the maintenance cost grows (super-linearly) with the amount of code we generate. To make matters worse for LLMs: since the code they generate is buggy, it means we have more buggy code to maintain. Code review is not the answer, because code review's power to find bugs is very weak.

This is unlike compilers (that also generate code) because we don't maintain code generated by compilers - we regenerate it anytime we need.

The fundamental issue is: for a given set of requirements the goal is to produce less code, not more. _Any_ code generation (however smart it might be) goes against this goal.

EDIT: typos

  • smokel a day ago

    I agree. Adding unit tests without a good reason comes at a cost.

    Refactoring is harder, especially if it's not clear why a test is in place. I've seen many developers disable tests simply because they could not understand how, or why, to fix them.

    I'm hopeful that LLMs can provide guidance in removing useless tests or simplifying things. In an ideal future they may even help in formulating requirements or design documentation.

    • mkleczek a day ago

      > I'm hopeful that LLMs can provide guidance in removing useless tests or simplifying things. In an ideal future they may even help in formulating requirements or design documentation

      I am very sceptical here as well. The biggest problem with formulating requirements or design documentation is translation from informal to formal language. In other words... writing programs.

      LLMs are good at generating content that doesn’t provide useful information (ie. has low information content). Their usefulness right now comes from the fact that people are used to reading a lot of text and distilling information from it (ie. all the useless e-mails formulated in corporate language, all the multi-page requirement documents formulated in human-readable form). The job of a software engineer is to extract information from low-information-content text and write it down in a formal language.

      In this context:

      What I expect in the long run is that people will start to value concise, high-information-content text. And obviously, that cannot be generated by any LLM, because an LLM cannot provide any information by itself. There is really no point in: provide short, high-information-content text (ie. a prompt) to an LLM -> receive long, low-information-content text from the LLM -> extract information from that long text.

  • mvdtnz 2 days ago

    You should NEVER modify generated code. All of our generated code is prepended with a big comment that says "GENERATED CODE DO NOT MODIFY. This code could be regenerated at any time and any changes will be lost."

    If you need to change behaviour of generated code you need to change your generator to provide the right hooks.

    Obviously none of this applies to "AI" generated code because the "AI" generator is not deterministic and will hallucinate different bugs from run to run. You must treat "AI" generated code as if it was written by the dumbest person you've ever worked with.

    • mkleczek 2 days ago

      That's exactly my point :)

    • fragmede 2 days ago

      The reason you don't modify generated code is that it gets clobbered upon regeneration. The reason it's okay to modify LLM-generated code is that it gets fed back into the LLM for subsequent modification.

DeathArrow 2 days ago

It's hard to generate tests for typical C# code. Or for any context where you have external dependencies.

If you have injected services in your current service, the LLM doesn't know anything about those, so it makes poor guesses. You have to bring them into context so they can be mocked properly.

You end up spending a lot of time guiding the LLM, so it's not measurably faster than writing tests by hand.

I want my prompt to be: "write unit tests for XYZ method" without having to accurately describe in the prompt what the method does, how it does it and why it does it. Writing too many details in the prompt takes the same time as writing the code myself.

Github Copilot should be better since it's supposed to have access to your entire code base. But somehow it doesn't look at dependencies and just uses its knowledge of the codebase for stylistic purposes.

It's probably my fault, there are for sure better ways to use LLMs for code, but I am probably not the only one who struggles.

nazgul17 2 days ago

Should we not, instead, write tests ourselves and have LLMs write the code to make them pass?

  • jayd16 2 days ago

    Just ask it to do both.

    • sdesol 2 days ago

      And remember to always challenge the response with both the same and different models. No joke. Just continue the conversation for the example in the blog and ask the LLM "Do you see anything wrong with the code?" and it will spit out "Yes" and explain why.
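
      If you want to make that a habit, it is easy to script. A minimal sketch with the openai Python client - the model names and the review question are just examples:

          # pip install openai
          from openai import OpenAI

          client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

          def cross_check(code: str, models=("gpt-4o", "gpt-4o-mini")) -> dict:
              # Ask more than one model to critique the same generated code.
              question = "Do you see anything wrong with this code? Be specific.\n\n" + code
              reviews = {}
              for model in models:
                  resp = client.chat.completions.create(
                      model=model,
                      messages=[{"role": "user", "content": question}],
                  )
                  reviews[model] = resp.choices[0].message.content
              return reviews

          # Usage: print(cross_check(open("generated_test.py").read()))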

tsv_ a day ago

Each time a new LLM version comes out, I give it another try at generating tests. However, even with the latest models, tailored GPTs, and well-crafted prompts with code examples, the same issues keep surfacing:

- The models often create several tests within the same equivalence class, which barely expands test coverage

- They either skip parameterization, creating multiple redundant tests, or go overboard with 5+ parameters that make tests hard to read and maintain

- The model seems focused on "writing a test at any cost", often resorting to excessive mocking or monkey-patching without much thought

- The models don’t leverage existing helper functions or classes in the project, requiring me to upload the whole project context each time or customize GPTs for every individual project

Given these limitations, I primarily use LLMs for refactoring tests, where the IDE isn’t as efficient:

- Extracting repetitive code in tests into helpers or fixtures

- Merging multiple tests into a single parameterized test (see the sketch after this list)

- Breaking up overly complex parameterized tests for readability

- Renaming tests to maintain a consistent style across a module, without getting stuck on names
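
As an example of the second item, here is roughly what that refactor ends up looking like with pytest.mark.parametrize - slugify is a made-up function standing in for whatever is under test:

    import pytest

    from mymodule import slugify  # hypothetical function under test

    # Before: three near-identical tests (test_slugify_spaces, test_slugify_upper,
    # test_slugify_symbols). After: one parameterized test covering the same cases.
    @pytest.mark.parametrize(
        "raw, expected",
        [
            ("Hello World", "hello-world"),  # spaces become dashes, case folded
            ("ALREADY-OK", "already-ok"),    # only case folding needed
            ("a b!c", "a-b-c"),              # punctuation stripped
        ],
    )
    def test_slugify(raw, expected):
        assert slugify(raw) == expected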

  • deeviant a day ago

    I find all of the points you raise common in human-written tests as well.

gengstrand a day ago

I went with a more clinical approach and used models that were available half a year ago, but I was also interested in using LLMs to write unit tests. You can learn the details of that experiment at https://www.infoq.com/articles/llm-productivity-experiment/ but the net of what I found was that LLMs improve developer productivity in the form of unit test creation, though only marginally. Perhaps that is why I find myself a bit skeptical of the claims of significant improvement in that Assembled blog.

iambateman 2 days ago

I did this for Laravel a few months ago and it’s great. It’s basically the same as the article describes, and it has definitely increased the number of tests I write.

Happy to open source if anyone is interested.

  • frays 2 days ago

    I'd certainly be interested to read more about your experience!

simonw 2 days ago

If you add "white-space: pre-wrap" to the elements containing those prompt examples you'll avoid the horizontal scrollbar (which I'm getting even on desktop) and make them easier to read.

  • johnjwang 2 days ago

    Thanks for the suggestion -- I'll take a look into adding this!

satisfice 2 days ago

As with nearly all articles about AI doing "testing" or any other skilled activity, the last part of this one admits that it is an unreliable method. What I don't see in this article-- which I suspect is because they haven't done any-- is any description of a competent and reasonably complete testing process for this method of writing "tests." What they probably did is try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort by AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLM's are good at instantiating typical or normal patterns (based on its training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea that has an important hazard associated with it: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

  • johnjwang 2 days ago

    Author here: Yes, there are certain functions where writing good tests will be difficult for an LLM, but in my experience I've found that the majority of functions that I write don't need anything out of the ordinary and are relatively straightforward.

    Using LLMs allows us to have much higher coverage than if we didn't use it. To me and our engineering team, this is a pretty good thing because in the time prioritization matrix, if I can get a higher quality code base with higher test coverage with minimal extra work, I will definitely take it (and in fact it's something I encourage our engineering teams to do).

    Most of the base tests that we use were created originally by some of our best engineers. The patterns they developed are used throughout our code base and LLMs can take these and make our code very consistent, which I also view as a plus.

    re: Complacency: We actually haven't found this to be the case. In fact, we've seen more tests being written with this method. Just think about how much easier it is to review a PR and make edits vs write a PR. You can actually spend your time enforcing higher quality tests because you don't have to do most of the boilerplate for writing a test.

    • youoy 2 days ago

      I would say that the complacency lies in equating good tests with good coverage. I agree that writing tests is one of the best use cases for LLMs, and it definitely saves engineers a lot of time. But if you follow them too blindly, it is easy to get carried away by how easy it is to write tests that focus on coverage instead of actually testing the more valuable things. Which is what the previous comment was pointing at:

      > which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.

    • satisfice 2 days ago

      Have you systematically tested this approach? It sounds like you are reporting on your good vibes. Your writing is strictly anecdotal.

      I’ve been working with AI, too. I see what I’m guessing is the same unreliability that you admit in the last part of your article. For some reason, you are sanguine about it, whereas I see it as a serious problem.

      You say you aren’t complacent, but your words don’t seem to address the complacency issue. “More tests” does not mean better testing, or even good enough testing.

      Google “automation bias” and tell me what policies, procedures, or training are in place to avoid it.

  • wenc 2 days ago

    I do use LLMs to bootstrap my unit testing (because there is a lot of boilerplate in unit tests and mocks), but I tend to finish the unit tests myself. This gives me confidence that my tests are accurate to the best of my knowledge.

    Having good tests allows me to be more liberal with LLMs on implementation. I still only use LLMs to bootstrap the implementation, and I finish it myself. LLMs, being generative, are really good for ideating different implementations (they propose implementations that I would never have thought of), but I never take any implementation as-is -- I always step through it and finish it off manually.

    Some might argue that it'd be faster if I wrote the entire thing myself, but it depends on the problem domain. So much of what I do involves implementing code for unsolved problems (I'm not writing CRUD apps, for instance) that I really do get a speed-up from LLMs.

    I imagine folks writing conventional code might spend more time fixing LLM mistakes and thus think that LLMs slow them down. But this is not true for my problem domain.

  • simonw 2 days ago

    The answer to this is code review. If an LLM writes code for you - be it implementation or tests - you review it before you land it.

    If you don't understand how the code works, don't approve it.

    Sure, complacent developers will get burned. They'll find plenty of other non-AI ways to burn themselves too.

    • hitradostava 2 days ago

      100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be? Code going to production should go through review.

      I do think that LLMs will increase the volume of bad code though. I use Cursor a lot, and occasionally it will produce perfect code, but often I need to direct and refine, and sometimes throw away. But I'm sure many devs will get lazy and just push once they've got the thing working...

      • sdesol 2 days ago

        > 100% agree. We don't expect human developers to be perfect, so why should we expect AI assistants to be?

        I think the issue is that we are currently being sold that it is. I'm blown away by how useful AI is, and how stupid it can be at the same time. Take a look at the following example:

        https://app.gitsense.com/?doc=f7419bfb27c896&highlight=&othe...

        If you click on the sentence, you can see how dumb Sonnet-3.5 and GPT-4 can be. Each model was asked to spell-check and grammar-check the sentence 5 times, and you can see that GPT-4o-mini was the only one that got it right all 5 times. The other models mostly got it comically wrong.

        I believe LLM is going to change things for the better for developers, but we need to properly set expectations. I suspect this will be difficult, since a lot of VC money is being pumped into AI.

        I also think a lot of mistakes can be prevented if you ask in your prompt for an explanation of how and why it did what it did. For example, the prompt that was used in the blog post should include "After writing the test, summarize how each rule was applied."

        • simonw 2 days ago

          "I think the issue is that we are currently being sold that it is."

          The message that these systems are flawed appears to be pretty universal to me:

          ChatGPT footer: "ChatGPT can make mistakes. Check important info."

          Claude footer: "Claude can make mistakes. Please double-check responses."

          https://www.meta.ai/ "Messages are generated by AI and may be inaccurate or inappropriate."

          etc etc etc.

          I still think the problem here is science fiction. We have decades of sci-fi telling us that AI systems never make mistakes, but instead will cause harm by following their rules too closely (paperclip factories, 2001: A Space Odyssey etc).

          Turns out the actual AI systems we have make mistakes all the time.

          • DanHulton 2 days ago

            But on the other other hand, there are the commercials generated to sell new models or new model features, which FREQUENTLY lie about actual capabilities, fake demos, and don't end with an equivalent amount of time going over how actual usage may be shit and completely unlike the advertisement.

            I'd say parent is absolutely correct - we ARE being sold (quite literally, through promotional material, i.e. ads) that these models are way more capable than they actually are.

          • sdesol 2 days ago

            You do have to admit, the footer is extremely small and it's also not in the most prominent place. I think most "AI companies" probably don't go into a sales pitch saying "It's awesome, but it might be full of shit".

            I do see your science fiction angle, but I think the bigger issue is the media, VCs, etc. are not clearly spelling out that we are nowhere near science fiction AI.

          • jazzyjackson 2 days ago

            I appreciate the footer on Kagi Assistant: "Assistant can make mistakes. Think for yourself when using it" - a reminder that there's a tendency to outsource your own train of thought.

            • sdesol 2 days ago

              I would have to imagine 90+ percent of people use LLMs and AI to outsource their thought, and most will not heed this warning. OpenAI might say "Check important info." but they know most people probably won't do a Google search or visit their library to fact-check things.

      • mvdtnz 2 days ago

        > We don't expect human developers to be perfect, so why should we expect AI assistants to be?

        What absolute nonsense. What an absurd false equivalence. It's not that we expect perfection or even human level performance from "AI". It's that the crap that comes out of LLMs is not even at the level of a first year student. I've never in my entire life reviewed the code of a junior engineer and seen them invent third party APIs from whole cloth. I've never had a junior send me code that generates a payload that doesn't validate at the first layer of the operation with zero manual testing to check it. No junior has ever asked me to review a pull request containing references to an open source framework that doesn't exist anywhere in my application. Yet these scenarios are commonplace in "AI" generated code.

        • simonw 2 days ago

          That problem genuinely doesn't matter to me at all.

          If an LLM hallucinates a method that doesn't exist I find out the moment I try and run the code.

          If I'm using ChatGPT Code Interpreter (for Python) or Claude analysis mode (for JavaScript) I don't even have to intervene: the LLM can run in a loop, generating code, testing that it executes without errors and correcting any mistakes it makes.

          I still need to carefully review the code, but the mistakes which cause it not to run at all are by far the least amount of work to identify.

          • mvdtnz 2 days ago

            Yes I've seen the dreck you produce with LLMs. Not a shining endorsement in my eyes.

            https://news.ycombinator.com/item?id=41929174

            • simonw 2 days ago

              Which of those did you think were dreck?

              I think the source code for tools like this one is genuinely good code: https://github.com/simonw/tools/blob/main/extract-urls.html

              What do you see that's wrong with that?

              • mvdtnz 2 days ago

                It's a toy. It doesn't do useful work. The code is fine for the pathetically small sample but that coding style does not scale to real software scales.

                • sdesol a day ago

                  > style does not scale to real software scales

                  I think those that dismiss AI completely will fall behind, and those that turn it into a crutch will pay for it in the years to come. I truly believe AI is game changing, as I used it to create standalone functions and get answers that saved me a day or two of research and reading. I've never worked with the cheerio library before but it answered everything I needed to know, among other things. It wasn't perfect though, as it (can't remember the model) wasted some time for me regarding the SQLite library for Node.js.

                  I think the issue we have right now is that we are treating LLMs as a final solution (mainly due to investors) instead of thinking of them as a new interface, with quirks that cannot be taken lightly. It's a bit extreme, but I think junior developers should not be allowed to use LLMs. An LLM is a power tool for developers who can easily spot BS and/or have the confidence and knowledge to fix the BS it misses.

                • simonw a day ago

                  The purpose of my "14 things I built in the last week" post was not to demonstrate large software - it was to show how the cost of building small applications has effectively fallen close to zero for me.

                  I can knock out small but useful applications in genuinely less time than it would take me to Google for an existing solution to the same problem.

                  You can call them dreck if you like. I call (most of) them useful solutions.

                • NitpickLawyer a day ago

                  |_____| <- dreck code

                  ... ... ... |_____| <- it's good code, but toy problem

                  I guess we all see where the goalposts will be tomorrow. Good code, good problem, I don't like the language. Or something :)

                  • mvdtnz a day ago

                    "Dreck" means worthless rubbish. Code that solves useless toy problems is worthless rubbish.

                    • signatoremo a day ago

                      If you were to share your code with us, are you sure we wouldn't find any worthless rubbish? And if we did, can you be certain it took you less than 5 minutes to build, like the GP asserted?

                      Take SQLite Wasm as an example:

                      https://simonwillison.net/2024/Oct/21/claude-artifacts/#sqli...

                      Perhaps you don't use databases every day, but a web-based SQL client is very common and very far from worthless, let alone rubbish. Imagine a developer being able to stand up this module in 5 minutes and use it as the starting point for further work.

                      • mvdtnz 15 hours ago

                        Are you getting offended on behalf of an "AI"? Seriously mate?

apwell23 2 days ago

I would love to use it to change code in ways that still compile and see if a test fails. The coverage metric sometimes doesn't really tell you whether some piece of code is actually covered or not.

  • sesm 2 days ago

    Coverage metrics can tell you whether lines of code were executed, but they can't tell you whether the execution result was checked.

  • taberiand 2 days ago

    I believe that's called mutation testing. Using an LLM to perform the mutation sounds like a great idea.
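
    For anyone unfamiliar, here is a toy sketch of the classic (non-LLM) version: flip comparison operators with Python's ast module, rerun the suite, and flag mutants that still pass. The module name and test command are placeholders:

        import ast
        import copy
        import pathlib
        import subprocess

        SOURCE = pathlib.Path("calc.py")  # placeholder: module under test

        class FlipComparisons(ast.NodeTransformer):
            # Swaps the Nth swappable comparison operator to create one mutant.
            SWAPS = {ast.Lt: ast.GtE, ast.Gt: ast.LtE, ast.Eq: ast.NotEq}

            def __init__(self, target_index):
                self.target_index = target_index
                self.seen = -1

            def visit_Compare(self, node):
                self.generic_visit(node)  # also reach comparisons nested inside this one
                if type(node.ops[0]) in self.SWAPS:
                    self.seen += 1
                    if self.seen == self.target_index:
                        node.ops[0] = self.SWAPS[type(node.ops[0])]()
                return node

        original = SOURCE.read_text()
        tree = ast.parse(original)
        total = sum(1 for n in ast.walk(tree)
                    if isinstance(n, ast.Compare) and type(n.ops[0]) in FlipComparisons.SWAPS)

        survivors = []
        try:
            for i in range(total):
                mutant = FlipComparisons(i).visit(copy.deepcopy(tree))
                SOURCE.write_text(ast.unparse(mutant))  # requires Python 3.9+
                # If the whole suite still passes, this mutant "survived" - a gap in the tests.
                if subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0:
                    survivors.append(i)
        finally:
            SOURCE.write_text(original)  # restore the original file no matter what

        print(f"{len(survivors)} of {total} mutants survived")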

    • rgmerk a day ago

      LLMs are not suitable for mutation testing. Mutation testing needs to be fast to be useful (because you need to generate and test a lot of mutated versions); an LLM-based mutator would be extremely slow as well as error-prone.

      • taberiand a day ago

        Setting aside LLMs, why does mutation testing need to be fast? It would be fine to have mutation tests run slowly, out-of-band of the main CI pipeline. They aren't mission-critical; they're smoke tests for your unit tests.

        Also you only need to generate a set of mutations for any particular unit once, and then again when the test code or the code under test changes.

        • rgmerk 17 hours ago

          Because mutation testing (should) generate a lot of mutants, which you then need to run your unit tests against.