
Learning how to use GenAI as a programming tool

Published by marco on

The article Exploring Generative AI by Birgitta Böckeler (MartinFowler.com) is chock-full of helpful tips. It collects the eight newsletters, totaling 25 pages, that she wrote throughout 2023. I include some of my own thoughts, but most of this article consists of citations.

A lot of my analysis and notes boils down to this: you need to know what you’re doing to use these tools. They can help you build things that you don’t understand, but that’s no basis for medium- or long-term solutions. I’ve written a lot more about the need for expertise in How important is human expertise?

The following are the dimensions of my current mental model of tools that use LLMs (Large Language Models) to support coding.

“Assisted tasks”

  • Finding information faster, and in context
  • Generating code
  • “Reasoning” about code (Explaining code, or problems in the code)
  • Transforming code into something else (e.g. documentation text or diagram)
“These are the types of tasks I see most commonly tackled when it comes to coding assistance, although there is a lot more if I would expand the scope to other tasks in the software delivery lifecycle.”
“In this particular case of a very common and small function like median, I would even consider using generated code for both the tests and the function. The tests were quite readable and it was easy for me to reason about their coverage, plus they would have helped me remember that I need to look at both even and uneven lengths of input. However, for other more complex functions with more custom code I would consider writing the tests myself, as a means of quality control. Especially with larger functions, I would want to think through my test cases in a structured way from scratch, instead of getting partial scenarios from a tool, and then having to fill in the missing ones.”
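To make the median example concrete, here is roughly the kind of generated test suite she is describing, with separate cases for even and uneven input lengths. This is my own sketch, not code from the article; the stats_utils module and the behaviour for an empty list are assumptions.

# Hypothetical tests of the kind a coding assistant might generate for a
# median() function. Note the separate cases for odd and even input lengths.
import pytest

from stats_utils import median  # assumed module containing the function under test


def test_median_of_odd_length_list():
    assert median([3, 1, 2]) == 2


def test_median_of_even_length_list_averages_the_middle_pair():
    assert median([1, 2, 3, 4]) == 2.5


def test_median_of_single_element():
    assert median([7]) == 7


def test_median_of_empty_list_raises():
    # Assumes the implementation raises ValueError for empty input.
    with pytest.raises(ValueError):
        median([])

Reading tests like these is quick, and they make the even/uneven distinction hard to forget; reviewing them is exactly the effort that still has to be paid.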
“The tool itself might have the answer to what’s wrong or could be improved in the generated code − is that a path to make it better in the future, or are we doomed to have circular conversation with our AI tools?”
“[…] generating tests could give me ideas for test scenarios I missed, even if I discard the code afterwards. And depending on the complexity of the function, I might consider using generated tests as well, if it’s easy to reason about the scenarios.”
“For the purposes of this memo, I’m defining “useful” as “the generated suggestions are helping me solve problems faster and at comparable quality than without the tool”. That includes not only the writing of the code, but also the review and tweaking of the generated suggestions, and dealing with rework later, should there be quality issues.”
  • […]
  • Boilerplate: Create boilerplate setups like an ExpressJS server, or a React component, or a database connection and query execution.
  • Repetitive patterns: It helps speed up typing of things that have very common and repetitive patterns, like creating a new constructor or a data structure, or a repetition of a test setup in a test suite. I traditionally use a lot of copy and paste for these things, and Copilot can speed that up.
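The boilerplate point is easy to picture. Her examples are from the JavaScript world (an ExpressJS server, a React component); a Python equivalent is the connect-query-close dance below, which an assistant can type out in one suggestion. The table and column names are invented.

# A sketch of database-connection boilerplate of the kind an assistant can
# produce in one go. Uses sqlite3 from the standard library so the example
# is self-contained; the users table is invented.
import sqlite3


def fetch_user_names(db_path: str) -> list[str]:
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.execute("SELECT name FROM users ORDER BY name")
        return [row[0] for row in cursor.fetchall()]
    finally:
        connection.close()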

Interesting. I’ve just always used the existing templates or made my own expansion templates. At least then it makes exactly what I want—and even leaves the cursor in the right position afterwards.

Another thought I had is that the kind of programmer this helps doesn’t generalize common patterns at all. Otherwise, the suggestions wouldn’t be useful, because they can’t possibly take advantage of those highly specialized patterns. Or maybe they can, if the patterns are included in the context. It seems unlikely, if only because the sample size is too small to influence the algorithm sufficiently. But maybe enough weight can be given to the immediate context to make that work somehow.

At that point, though, you’re just spending all of your time coaxing your LLM copilot into building the code that you already knew you wanted. This practice seems like it would end up discouraging generalization and abstraction—unless it can grok your API (as I’ve noted above).

This is an age-old problem that is maybe solved, once and for all. The problem is that when you generalize a solution, it becomes much easier, more efficient, and more economical to maintain, but it can end up being more difficult to understand. If the API is well-made and addresses a problem domain with a complexity that the programmer is actually capable of understanding, then the higher-level API may be easier to use, and perhaps even maintain.

However, a non-generalized solution is sometimes easier for a novice or less-experienced programmer to understand and extend. It’s questionable whether you’d want your code being extended and maintained by someone who barely—or doesn’t—understand it, but that situation is sometimes thrust on teams and managers.
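A trivial example of the trade-off, mine rather than anything from the article: the spelled-out version is easy for anyone to follow and extend, while the generalized version is shorter and cheaper to maintain but asks the reader to understand the indirection.

# Non-generalized: explicit and repetitive, but readable line by line.
def discounted_book_price(price: float) -> float:
    return price * 0.9


def discounted_dvd_price(price: float) -> float:
    return price * 0.8


# Generalized: one function plus a table of rates. Less code to maintain,
# but now the reader has to understand the lookup and its failure modes.
DISCOUNT_RATES = {"book": 0.9, "dvd": 0.8}


def discounted_price(category: str, price: float) -> float:
    return price * DISCOUNT_RATES[category]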

This autocomplete-on-steroids effect can be less useful though for developers who are already very good at using IDE features, shortcuts, and things like multiple cursor mode. And beware that when coding assistants reduce the pain of repetitive code, we might be less motivated to refactor.
You can use a coding assistant to explore some ideas when you are getting started with more complex problems, even if you discard the suggestion afterwards.
“The larger the suggestion, the more time you will have to spend to understand it, and the more likely it is that you will have to change it to fit your context. Larger snippets also tempt us to go in larger steps, which increases the risk of missing test coverage, or introducing things that are unnecessary.”

On the other hand,

“[…] when you do not have a plan yet because you are less experienced, or the problem is more complex, then a larger snippet might help you get started with that plan.”

This is not unlike using StackOverflow or any other resource. There’s no getting around knowing what you’re doing, at least a little bit. You can’t bootstrap without even a bootstrap.

“Experience still matters. The more experienced the developer, the more likely they are to be able to judge the quality of the suggestions, and to be able to use them effectively. As GitHub themselves put it: “It’s good at stuff you forgot.” This study even found that “in some cases, tasks took junior developers 7 to 10 percent longer with the tools than without them.””
Using coding assistance tools effectively is a skill that is not simply learned from a training course or a blog post. It’s important to use them for a period of time, experiment in and outside of the safe waters, and build up a feeling for when this tooling is useful for you, and when to just move on and do it yourself.

This is just like any other tool. There is no shortcut to being good at something complex. The only tasks for which there are shortcuts are the non-complex ones. In that case, you should be asking yourself why your solutions involve so much repetitive programming.

“We have found that having the right files open in the editor to enhance the prompt is quite a big factor in improving the usefulness of suggestions. However, the tools cannot distinguish good code from bad code. They will inject anything into the context that seems relevant. (According to this reverse engineering effort, GitHub Copilot will look for open files with the same programming language, and use some heuristic to find similar snippets to add to the prompt.) As a result, the coding assistant can become that developer on the team who keeps copying code from the bad examples in the codebase.”
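The quoted reverse-engineering effort found that Copilot pulls similar-looking snippets from other open files of the same language into the prompt. Here is a rough sketch of what such a similarity heuristic could look like, my guess at the shape using simple token overlap, not Copilot’s actual code.

# Guess at the shape of a "similar snippet" heuristic: slide a window over
# every other open file, score each window by token overlap (Jaccard
# similarity) with the text around the cursor, and put the best match into
# the prompt, whether it is good code or bad code.


def _tokens(text: str) -> set[str]:
    return set(text.split())


def jaccard_similarity(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_matching_snippet(cursor_context: str, open_files: list[str],
                          window_lines: int = 20) -> str:
    """Return the window from the open files that looks most like the cursor context."""
    best_score, best_snippet = 0.0, ""
    for content in open_files:
        lines = content.splitlines()
        for start in range(max(1, len(lines) - window_lines + 1)):
            window = "\n".join(lines[start:start + window_lines])
            score = jaccard_similarity(cursor_context, window)
            if score > best_score:
                best_score, best_snippet = score, window
    return best_snippet

The point of the sketch is the comment at the top: nothing in a heuristic like this distinguishes the good examples in the codebase from the bad ones.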

That will be so much fun, especially if you can get an echo chamber of lower-skilled programmers approving each other’s pull requests. 😉

“We also found that after refactoring an interface, or introducing new patterns into the codebase, the assistant can get stuck in the old ways. For example, the team might want to introduce a new pattern like “start using the Factory pattern for dependency injection”, but the tool keeps suggesting the current way of dependency injection because that is still prevalent all over the codebase and in the open files. We call this a poisoned context, and we don’t really have a good way to mitigate this yet.”
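To picture the poisoned context, here is an invented before/after. The codebase (and the open files) are full of the old style, so that is what keeps getting suggested, even after the team has agreed on the factory approach.

class SmtpMailer:
    def send(self, to: str, body: str) -> None:
        print(f"sending to {to}: {body}")


# Old, prevalent style: the dependency is constructed inline. This is what
# the assistant keeps autocompleting, because it is everywhere in the context.
class SignupServiceOld:
    def __init__(self) -> None:
        self.mailer = SmtpMailer()


# New, desired style: the dependency comes from a factory, so tests and
# configuration can swap it out.
def default_mailer_factory() -> SmtpMailer:
    return SmtpMailer()


class SignupService:
    def __init__(self, mailer_factory=default_mailer_factory) -> None:
        self.mailer = mailer_factory()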
Using a coding assistant means having to do small code reviews over and over again. Usually when we code, our flow is much more about actively writing code, and implementing the solution plan in our head. This is now sprinkled with reading and reviewing code, which is cognitively different, and also something most of us enjoy less than actively producing code. This can lead to review fatigue, and a feeling that the flow is more disrupted than enhanced by the assistant.
Automation Bias is our tendency “to favor suggestions from automated systems and to ignore contradictory information made without automation, even if it is correct.” Once we have had good experience and success with GenAI assistants, we might start trusting them too much.
“[…] once we have that multi-line code suggestion from the tool, it can feel more rational to spend 20 minutes on making that suggestion work than to spend 5 minutes on writing the code ourselves once we see the suggestion is not quite right.”
Once we have seen a code suggestion, it’s hard to unsee it, and we have a harder time thinking about other solutions. That is because of the Anchoring Effect, which happens when “an individual’s decisions are influenced by a particular reference point or ‘anchor’”. So while coding assistants’ suggestions can be great for brainstorming when we don’t know how to solve something yet, awareness of the Anchoring Effect is important when the brainstorm is not fruitful, and we need to reset our brain for a fresh start.
“The framing of coding assistants as pair programmers is a disservice to the practice, and reinforces the widespread simplified understanding and misconception of what the benefits of pairing are.”
“Pair programming however is also about the type of knowledge sharing that creates collective code ownership, and a shared knowledge of the history of the codebase. It’s about sharing the tacit knowledge that is not written down anywhere, and therefore also not available to a Large Language Model. Pairing is also about improving team flow, avoiding waste, and making Continuous Integration easier. It helps us practice collaboration skills like communication, empathy, and giving and receiving feedback. And it provides precious opportunities to bond with one another in remote-first teams.”
“LLMs rarely provide the exact functionality we need after a single prompt. So iterative development is not going away yet. Also, LLMs appear to “elicit reasoning” (see linked study) when they solve problems incrementally via chain-of-thought prompting. LLM-based AI coding assistants perform best when they divide-and-conquer problems, and TDD is how we do that for software development.”
“Some examples of starting context that have worked for us:”
  • ASCII art mockup
  • Acceptance Criteria
  • Guiding Assumptions such as:
    • “No GUI needed”
    • “Use Object Oriented Programming” (vs. Functional Programming)
“For example, if we are working on backend code, and Copilot is code-completing our test example name to be, “given the user… clicks the buy button”, this tells us that we should update the top-of-file context to specify, “assume no GUI” or, “this test suite interfaces with the API endpoints of a Python Flask app”.”
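Here is what such a top-of-file starting context might look like for the Flask case, written as comments that steer later suggestions. The app factory, endpoint, and acceptance criterion are all assumptions for illustration.

# Starting context for this test module:
# - Assume no GUI.
# - This test suite interfaces with the API endpoints of a Python Flask app.
# - Acceptance criterion: POST /orders returns 201 and the id of the created order.

import pytest


@pytest.fixture
def client():
    # Assumed application factory; the name and signature are illustrative.
    from shop.app import create_app
    return create_app(testing=True).test_client()


def test_given_the_user_posts_a_valid_order_then_201_is_returned(client):
    response = client.post("/orders", json={"sku": "ABC-123", "quantity": 1})
    assert response.status_code == 201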
“Copilot often fails to take “baby steps”. For example, when adding a new method, the “baby step” means returning a hard-coded value that passes the test. To date, we haven’t been able to coax Copilot to take this approach.”
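For reference, the baby step they could not coax out of Copilot looks like this when done by hand: the first implementation hard-codes exactly what the single existing test demands, and only the next test forces real logic. The names are illustrative.

# The failing test written first...
def test_shipping_is_free_above_the_threshold():
    assert shipping_cost(order_total=120.00) == 0.00


# ...and the baby-step implementation that makes it pass: a hard-coded value.
# A second test (say, an order below the threshold) would force real logic.
def shipping_cost(order_total: float) -> float:
    return 0.00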

Knowing a bit about how LLMs work, I don’t see how you could really train one to do TDD, because TDD is an iterative process. The model doesn’t know what TDD is, nor does the way it’s built have any mechanism for learning how to do it. Nor does it know what coding is, for that matter. It’s just a really, really good guesser. Everything it does is hallucination. It’s just that some of it is useful.

“As a workaround, we “backfill” the missing tests. While this diverges from the standard TDD flow, we have yet to see any serious issues with our workaround.”

Changing how you program because of the tool is something you should do deliberately. This is a slippery slope.

For implementation code that needs updating, the most effective way to involve Copilot is to delete the implementation and have it regenerate the code from scratch. If this fails, deleting the method contents and writing out the step-by-step approach using code comments may help. Failing that, the best way forward may be to simply turn off Copilot momentarily and code out the solution manually.
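What the comment-scaffold fallback might look like in practice: the method body is deleted and the plan is spelled out as comments for the assistant to fill in. The function and its steps are invented.

# An emptied-out function whose comments describe the step-by-step approach,
# as a prompt for the assistant to regenerate the body.
def reconcile_invoices(invoices, payments):
    # 1. Index the payments by invoice id.
    # 2. For each invoice, sum the amounts of its payments.
    # 3. Mark the invoice paid if the sum covers the total, otherwise partial.
    # 4. Return the invoices that are still unpaid or only partially paid.
    ...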

Jaysus. That’s pretty grim.

“The common saying, “garbage in, garbage out” applies to both Data Engineering as well as Generative AI and LLMs. Stated differently: higher quality inputs allow for the capability of LLMs to be better leveraged. In our case, TDD maintains a high level of code quality. This high quality input leads to better Copilot performance than is otherwise possible.”
“Model-Driven Development (MDD). We would come up with a modeling language to represent our domain or application, and then describe our requirements with that language, either graphically or textually (customized UML, or DSLs). Then we would build code generators to translate those models into code, and leave designated areas in the code that would be implemented and customized by developers.”
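A toy version of that idea, just to show the shape: a tiny model, a generator that turns it into code, and a marked region left for hand-written customization. Entirely invented, and far simpler than any real MDD toolchain.

# Minimal model-driven generation: a model of an entity, a generator that
# emits a class from it, and a protected region for developer customization.
MODEL = {
    "entity": "Customer",
    "fields": [("name", "str"), ("email", "str")],
}


def generate_entity(model: dict) -> str:
    params = ", ".join(f"{name}: {typ}" for name, typ in model["fields"])
    lines = [f"class {model['entity']}:", f"    def __init__(self, {params}):"]
    for name, _ in model["fields"]:
        lines.append(f"        self.{name} = {name}")
    lines += [
        "",
        "    # --- BEGIN CUSTOM CODE (preserved by the generator) ---",
        "    # --- END CUSTOM CODE ---",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_entity(MODEL))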
“That unreliability creates two main risks: It can affect the quality of my code negatively, and it can waste my time. Given these risks, quickly and effectively assessing my confidence in the coding assistant’s input is crucial.”
Can my IDE help me with the feedback loop? Do I have syntax highlighting, compiler or transpiler integration, linting plugins? Do I have a test, or a quick way to run the suggested code manually?
“I have noticed that in CSS, GitHub Copilot suggests flexbox layout to me a lot. Choosing a layouting approach is a big decision though, so I would want to consult with a frontend expert and other members of my team before I use this.”

That’s because you care about architecture. Review was always important, but more so when code is being written by something you never hired.

“How long-lived will this code be? If I’m working on a prototype, or a throwaway piece of code, I’m more likely to use the AI input without much questioning than if I’m working on a production system.”
“[…] it’s also good to know if the AI tool at hand has access to more information than just the training data. If I’m using a chat, I want to be aware if it has the ability to take online searches into account, or if it is limited to the training data.”
“To mitigate the risk of wasting my time, one approach I take is to give it a kind of ultimatum. If the suggestion doesn’t bring me value with little additional effort, I move on. If an input is not helping me quick enough, I always assume the worst about the assistant, rather than giving it the benefit of the doubt and spending 20 more minutes on making it work.”
“GitHub Copilot is not a traditional code generator that gives you 100% what you need. But in 40-60% of situations, it can get you 40-80% of the way there, which is still useful. When you adjust these expectations, and give yourself some time to understand the behaviours and quirks of the eager donkey, you’ll get more out of AI coding assistants.”