Did you ever get a new gig and have to understand a repo? Maybe not just one repo, but maybe six or sixty repos?
Context is everything, and part of understanding a product, a team, or a codebase is getting the right understanding of its progression over time. How did it get here? How has it changed, and what were the big inflection points? How can I contribute to this project? Is it flourishing, or is it time to put this one out to pasture?
Lots of AI tools for code, at the moment, are focused on zero-to-one generation of boilerplate code. Can You Git To That (CYGTT) will organize, classify, and visualize data about your GitHub repo so that you can gain valuable context on the project's history and development processes, as well as the lifecycle of the code itself.
The screenshot above is a representative view of the output from this repo. Click to see the full screenshot.
First, copy/clone the repo to your local directory. Currently tested under Python 3.11. You can set up a venv, if you like; then run `pip install -r requirements.txt` to install the necessary libraries.
To run it, modify the settings in the tab-delimited `config.txt`. The file `example.py` is a bare-bones example that shows the paths and imports needed to run the app. You can run it with `python3 example.py` from the root of your local copy of the repo.
What's gonna happen? Well, first, the app reads `config.txt` and collects and generates a bunch of data about your repo. To collect data, it uses PyGithub to query the GitHub API (you'll need a GitHub personal access token; more info below), ultimately moving that data into a SQLite database in the `output` directory. To generate, the app uses an LLM (either OpenAI or Ollama, as configured in `config.txt`) to write plain-language summaries of commit diffs and to classify and tag file changes, storing the results in the database. The approximate accrued cost of your LLM usage is calculated and shown via logging output. As a reference, generating the example screenshot above -- processing this repo and its changes solely with `gpt-4o-mini` (7/2024) via the API -- cost just under $0.04.
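As a rough sketch of how that kind of cost accounting works (the rates below are gpt-4o-mini's published API pricing as of mid-2024, and the per-diff token counts are purely illustrative, not CYGTT's actual numbers):

```python
# Illustrative LLM cost estimate. Rates are gpt-4o-mini API pricing
# as of mid-2024 -- verify current pricing before relying on these numbers.
INPUT_COST_PER_M = 0.15   # USD per 1M input tokens
OUTPUT_COST_PER_M = 0.60  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost for a batch of LLM calls."""
    return (input_tokens / 1_000_000) * INPUT_COST_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_COST_PER_M

# e.g. summarizing ~200 commit diffs at roughly 1000 input / 100 output tokens each:
print(f"${estimate_cost(200 * 1000, 200 * 100):.4f}")  # → $0.0420
```

Which lands in the same ballpark as the ~$0.04 figure above.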
Once this process completes successfully, the next step is to run `flask --app web.server run -p 5001` from the directory containing the `web` directory (the repo root), which will serve reports from http://127.0.0.1:5001.
You'll need a GitHub access token set up in your environment variables so that CYGTT can get to the repos that you own.
To get an access token: in the upper-right corner of any page on GitHub, click your profile photo, then click **Settings** in the dropdown menu. On the page that appears, in the left sidebar (look all the way at the bottom of the sidebar; it's easy to overlook), click **Developer settings**. In the next left sidebar, under **Personal access tokens**, you can either create a new fine-grained personal access token with specific permissions, or a personal access token (classic).
For a **Fine-grained personal access token**, select the following permissions:
- Commit statuses: Read-only
- Contents: Read-only
- Pull requests: Read-only
Note that if you use a **Fine-grained access token**, you may have to own the repo as well, or ask permission to access it.
Using a GitHub **Personal Access Token (Classic)** instead, set up in the same way, will seemingly let you access any repo you have permission to access via the API.
Then set the token value in your system environment as `CYGTT_GITHUB_ACCESS_TOKEN`.
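For example, on macOS or Linux you can export it from your shell profile (the token value below is a made-up placeholder, not a real token):

```shell
# Add to ~/.bashrc, ~/.zshrc, or similar; replace the placeholder with your real token.
export CYGTT_GITHUB_ACCESS_TOKEN="ghp_yourTokenHere"
```

On Windows, set it via System Properties → Environment Variables instead.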
If you'd like to run this against a local open-source LLM, instead of using (and paying for) an API such as OpenAI's, you can use Ollama quite effectively. To optimize Ollama models for reading larger code files, you may need to extend the default context window from 2048 tokens to 8192 tokens or more. Here's how to tweak Ollama for a larger context window; use your preferred, already installed Ollama model name in place of `<model_name>` in the instructions below.
1. Export the model's current configuration:

   ```shell
   ollama show <model_name> --modelfile > model_conf.txt
   ```

2. Open `model_conf.txt` in a text editor and:
   - Add the line `PARAMETER num_ctx 8192`.
   - To make sure updates keep the change, replace the line starting with `FROM` with `FROM <model_name>:latest`.

3. Save and close the file.

4. Create a new model with the updated configuration:

   ```shell
   ollama create <new_model_name> -f model_conf.txt
   ```
Now you can call `<new_model_name>` in your `config.txt`, if you're using Ollama for CYGTT.
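After those edits, the relevant lines of `model_conf.txt` should look something like the following (using `llama3` purely as an illustrative model name; leave the other exported lines, such as `TEMPLATE` and any existing `PARAMETER` entries, as they are):

```
FROM llama3:latest
PARAMETER num_ctx 8192
```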
To add more languages to your Tree-sitter setup, you need to manually clone the language grammar repositories into a designated directory and then modify your script to include these languages in the build process.
1. Clone the language grammar repositories:

   For each language you want to add, clone the corresponding Tree-sitter grammar repository into the `vendor` directory. For example, to add JavaScript:

   ```shell
   git clone https://github.com/tree-sitter/tree-sitter-javascript vendor/tree-sitter-javascript
   ```
2. Update the `build_language_library` function:

   Modify your `languages` dictionary to include the new language(s) you cloned:

   ```python
   languages = {
       'python': 'vendor/tree-sitter-python',
       'javascript': 'vendor/tree-sitter-javascript',
       # Add more languages here
   }
   ```
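Putting that together, the build step might look like the sketch below. Note this assumes the older py-tree-sitter API (`Language.build_library` was removed in py-tree-sitter 0.22+), and the function and output-path names here are illustrative rather than CYGTT's exact internals:

```python
# Sketch: compile the cloned Tree-sitter grammars into one shared library.
# Assumes py-tree-sitter < 0.22, where Language.build_library is available.
languages = {
    'python': 'vendor/tree-sitter-python',
    'javascript': 'vendor/tree-sitter-javascript',
    # Add more languages here
}

def build_language_library(languages: dict, output_path: str = 'build/languages.so') -> str:
    """Build a single shared library from the grammar repos in `languages`."""
    # Imported lazily so this sketch can load even without tree_sitter installed.
    from tree_sitter import Language
    Language.build_library(output_path, list(languages.values()))
    return output_path
```

With the library built, each language can then be loaded via `Language(output_path, name)` for parsing.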
- If you're going to use OpenAI's models, you will need to set up an `OPENAI_API_KEY` in your environment, as usual. I would recommend the `gpt-4o-mini` model as a fast, accurate, and cheap choice. To be even cheaper, but probably not quite as fast nor as accurate, run Ollama locally.
- What's next? I'm mostly collecting ideas/todos in issues. Feel free to take a peek and opine/ideate/complain.
- The big picture: Expand indexing of code and diffs to make code and changes searchable, by providing smart context to the LLM. Being able to ask "what changed around the sixth of January such that the entire app is now in jeopardy?" and get a solid answer, for example.
- The name of this project is based on the Funkadelic song "Can You Get To That" off the Maggot Brain album (1971). Graphics used here were created with Recraft.ai, and take their inspiration from my related project Give Up The Func.
- Details on changing Ollama context size found at Nurgo Software, for their product "Brain Soup".
- Adam Tornhill's Your Code As A Crime Scene is a great resource, and the origin of a git-as-forensics approach. If you don't want to tackle a DIY approach here, consider Adam's company Code Scene.
- The code example above, to run the web view, uses port 5001 instead of the default 5000, as 5000 sometimes seems to conflict on macOS. Change it to whatever you want or need.
- Solid and fun-to-read article on RAG across multiple data sources, including lots of SQL tables, by Ryan Nguyen.