Marcos Rico

The Problem with GitHub

At Palmier we're building the next generation of code repositories, designed for AIs. AIs are data hungry, yet a lot of valuable metadata is lost in current repositories. With the new wave of AI coding tools like Copilot and Cursor, one important piece of metadata we want to capture is the conversation a human has with the AI, stored alongside the code it produces. We believe this will:

  • Help developers understand why an AI wrote a piece of code long after it's been written. Developers often become less careful when using AI coding tools, leading to a poorer understanding of the codebase—a problem that worsens as the codebase grows and tech debt increases.

  • Enable teammates to build off AI-generated code and grasp the intent behind changes.

  • Improve AI's understanding of the code written, leading to more accurate and reliable results. At least, that’s the hypothesis we’re trying to test.


The problem? Storing extra metadata with code turned out to be far more complicated than we expected. In this blog post, I want to share the challenges we've faced, especially with GitHub, and how we're navigating them.


The Unexpected Complexity of Storing Metadata


Our first major hurdle was figuring out how to differentiate AI-generated code from human-generated code.


Our first idea was to track the rate at which code changes occurred in a file. If a significant amount of code was added in a short time, we could reasonably assume it was generated by an AI (unless you type really fast). If the changes met this criterion, we’d check if they matched suggestions from AI tools like Copilot and, if so, store the conversation. But this approach required extensive keystroke monitoring, and the potential for false positives was too high for an MVP.


Instead, we decided to wait until a developer finishes making changes and then check if those changes match the output of an AI conversation. In an ideal world, developers commit changes often and describe them in detail, but in practice, many (including us) are too lazy to commit frequently with detailed descriptions. So, we created a one-click commit tool with auto-generated summaries based on diffs.


This tool actually encouraged us to commit more frequently and, consequently, made it easier to identify AI-generated code. With it in place, every commit became an opportunity to check whether the changes matched the output from recent AI interactions. If they did, we considered the code likely AI-generated and retrieved the corresponding conversation.


Initial association of code and metadata


We thought we had found a straightforward solution: link AI conversations with the code they produced and store them together. Simple, right? Not quite.


The Problem with GitHub


Our initial idea was to store the conversation history in the commit messages. Since we were already using commits to detect AI-generated code and summarizing changes in commit messages, it seemed logical to append the AI conversation there. After all, who really pays close attention to commit messages anyway?


It turns out that having massive chunks of text in commit messages is incredibly distracting. Imagine scrolling through your commit log and finding one commit that occupies your entire screen, or running git blame to track changes and having a Wikipedia-length description pop up.


But the biggest issue we faced was with GitHub itself, and their concept of pull requests.


When merging a pull request, GitHub provides three options: merge commit, squash and merge, or rebase and merge. Two of these options—squash and rebase—rewrite the commit history for the branch being merged. All three create a new commit or modify the existing ones. This poses a problem for our metadata storage.


If we have metadata associated with individual commits and those commits get squashed, the original commit hashes—and our links to the metadata—are lost. Moreover, after a merge, a commit now represents a large set of changes, making it difficult to associate specific metadata with specific pieces of code.


We realized that git commits and commit messages aren’t designed for storing fine-grained metadata tied to small code segments. Git excels at tracking changes and using git blame to pinpoint when a line of code was modified, but it falls short when we try to attach detailed metadata to those changes.


The Solution?


Our goal from the beginning has been to attach the most relevant metadata to the smallest possible blocks of code: the finer-grained the association, the more accurate the retrieval. Git naturally breaks diffs into "chunks" (hunks, in git's terminology), so we took advantage of that.


For each chunk in a git diff, we attempt to match it with any relevant AI conversations. If we find a match, we store the metadata alongside that specific chunk, not the entire commit.


But the same problem still stands: where and how do we store this metadata if not in commit messages?


We started by envisioning the ideal user experience. When a developer highlights a line or block of code, they should instantly see all associated metadata related to that code. To achieve this, we needed an efficient way to search and retrieve the metadata based on the selected code.


Our solution is to create embeddings of all the code chunks and store them in a vector database, along with the metadata and related commit data. This allows us to perform semantic searches based on the highlighted code. When a user selects a piece of code, we use git blame to find the originating commit(s), filter our embeddings to those associated with that commit, and then search for the chunk that matches the highlighted code. We can then retrieve the metadata tied to that chunk.
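To make the retrieval flow concrete, here's a sketch with an in-memory list standing in for the vector database and pre-computed vectors standing in for a real embedding model (all names and data shapes are hypothetical):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_metadata(query_embedding, blame_commits, index):
    """index: list of {'commit', 'embedding', 'metadata'} records.
    Filter to chunks from the commits git blame returned, then take
    the chunk closest to the highlighted code's embedding."""
    candidates = [e for e in index if e["commit"] in blame_commits]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: cosine(e["embedding"], query_embedding))
    return best["metadata"]
```

Filtering by blame commit first keeps the semantic search cheap: we only compare against the handful of chunks that could plausibly have produced the highlighted lines.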


But what about merges and the rewriting of commit histories?


We realized that we didn’t need to adhere strictly to GitHub’s model of branches and merges. Instead, we could create a simple one-to-many mapping between the new merge commit hash and the original commits that were part of the pull request. When git blame returns a merge commit hash, we use our mapping to trace back to the original commits and retrieve the relevant metadata. We store every single commit how it was originally created, all on the same layer.


Flow of how we retrieve and store metadata


Reflections on GitHub’s Limitations


While attempting to attach just one set of metadata, we realized something important: GitHub, as it stands, isn't equipped to handle the storage of detailed, line-by-line metadata like AI conversations. Its handling of commits and merges, while excellent for traditional version control, complicates efforts to attach granular metadata to code. It's not built for AI.


Our solution isn’t perfect, and we anticipate refining it as we encounter edge cases and performance issues. Storing accurate and up-to-date metadata with rapidly changing code is inherently challenging. We’re still investigating whether this metadata will ultimately be useful to AIs, but the process has underscored the limitations of existing tools like GitHub for this purpose.


Moving Forward


As we continue to develop our repository for AIs, we’re really excited about what this could look like! By capturing and linking AI conversations directly with the code they produce, we hope to enhance both human and AI understanding of complex codebases, and that's only the tip of the iceberg in metadata that can be stored.


Thanks for taking the time to read, I'd love to hear your thoughts on metadata + code, our approach and all its flaws, and what you envision the repository of the future to look like!


You can try Palmier today at www.palmier.io. Feel free to reach out to us at founders@palmier.io.
