How do you actually start understanding a large codebase?
Posted by radjeep@reddit | ExperiencedDevs | 61 comments
I’m trying to become a better engineer and feeling pretty stuck with something basic: reading large codebases.
Quick background: I’ve spent a few years as a data scientist. Built Flask endpoints, Streamlit apps, worked a bit with GCP / Vertex AI. But I haven’t really done heavy engineering work (apart from some early Java bugfixes with a lot of help).
Now I’ve got a chance to work more closely with engineering teams, but the size and complexity of the codebase is intimidating me.
A concrete example: I was asked to implement prefix KV caching. There’s already a KVCache class that I’m supposed to reuse, but I can’t even begin to reason about how it behaves across the different places it’s used. There’s a lot of abstraction (interfaces, dependency injection, etc.) and I get lost trying to follow the flow.
I’ve tried reading top-down, following function calls, even using AI tools to walk through the code, but once things get abstract, I lose track.
I’m not just looking for “ask AI to explain it”, more like -
- how do you approach a large unfamiliar codebase?
- do you start from entrypoints or specific use-cases?
- how do you trace execution without understanding everything?
Also, are there tools (AI or otherwise) that actually help you navigate and map out codebases better?
Right now it feels like everything depends on everything else and I don’t know where to get a foothold.
Would love to hear how others approach this.
Tricky_Tesla@reddit
In chunks via sequence diagrams.
radjeep@reddit (OP)
This might be a stupid question, is there a way to automatically generate sequence diagrams from (python) code or do I have to draw them out manually?
Chuu@reddit
I can't remember the name of the tool but there is a commercial tool out there that went open source about five years ago that I found to be absolutely excellent when I last took on a brand new codebase 5 years ago. With some googling you might find it.
subma-fuckin-rine@reddit
sourcetrail?
Chuu@reddit
I am 90% sure this is it.
Realistic_Yogurt1902@reddit
You could use AI to do it for you. It's pretty good with Python code
curiouscirrus@reddit
Why is this comment not higher? In 2026, no one should be doing any of this shit manually. Review and validate it, sure, but let AI take the lead here.
Veuxdo@reddit
It'll make them. But whether they are both accurate and useful is another story.
Isofruit@reddit
Just write yourself Mermaid graphs. They're widespread enough that a decent chunk of software supports them, e.g. Obsidian and GitHub Markdown.
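On the question of generating these automatically from Python: one option is to record calls at runtime and print them as Mermaid text. The following is a minimal sketch using the standard `sys.setprofile` hook. The helper name `trace_to_mermaid` and the `Service` and `Cache` classes are made up for illustration, and it only covers the one code path you actually run.

```python
import sys


def trace_to_mermaid(func, *args, **kwargs):
    """Run func and return a Mermaid sequenceDiagram of the Python calls it makes."""
    lines = ["sequenceDiagram"]
    stack = ["caller"]  # participant names, innermost last

    def owner(frame):
        # Use the class name for methods, the module name for plain functions.
        self_obj = frame.f_locals.get("self")
        if self_obj is not None:
            return type(self_obj).__name__
        return frame.f_globals.get("__name__", "module").split(".")[-1]

    def profiler(frame, event, arg):
        if event == "call":
            callee = owner(frame)
            lines.append(f"    {stack[-1]}->>{callee}: {frame.f_code.co_name}()")
            stack.append(callee)
        elif event == "return" and len(stack) > 1:
            stack.pop()
        # c_call / c_return events (built-in functions) are ignored on purpose.

    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return "\n".join(lines)


# Hypothetical classes, only here to have something to trace.
class Cache:
    def get(self, key):
        return None


class Service:
    def __init__(self, cache):
        self.cache = cache

    def handle(self, key):
        return self.cache.get(key)


diagram = trace_to_mermaid(Service(Cache()).handle, "k")
print(diagram)
```

Pasting the printed text into anything that renders Mermaid gives a diagram with `caller`, `Service` and `Cache` as participants. On a real codebase you would probably want to filter frames by module, otherwise library internals drown out the flow you care about.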
garbageInGarbageOot@reddit
Take a pencil and paper. Start reading code and make notes about what the modules do, their relationships to each other, the major flows.
bbaallrufjaorb@reddit
does pencil and paper work better than typing the notes out, or voice transcription?
i really hate writing, hurts my hand after a while and my writing looks like chicken scratch
garbageInGarbageOot@reddit
It’s good to create diagrams that describe the parts of software and their relationships.
Safe-Ball4818@reddit
Stop trying to read the whole thing and just attach a debugger to a specific flow you need to change. Trace the execution path line by line until the abstraction makes sense in that one context.
Delphicon@reddit
I’ve learned the most by trying to build it myself. Obviously not the whole thing but just add stuff that I don’t understand how it works. I typically write it a little bit “in my own words” too so I can’t just copy the code.
It doesn’t take long before you start to get a feel for the skeleton of the thing which makes the codebase feel more like a bunch of small chunks of code instead of one big blob.
Writing code is always easier than reading code so the easiest way to learn the code is usually to write the code yourself.
kpolli64@reddit
Start from its entry point, i.e. for an API, start from the endpoint function. You might find some dead code along the way.
Never-Trust-Me@reddit
Vibe code it in a week and then refactor everything over the course of the next 3 months /s
Top comment already nailed it. I just came to talk crap :p
eng_lead_ftw@reddit
everyone will tell you to read the code, trace the flows, draw diagrams. that's all correct but misses the harder half: understanding WHY the code is the way it is.
i can read a codebase and understand what it does in a week. understanding why certain decisions were made - why this module is overcomplicated, why that service exists separately, why there's a weird special case in the payment flow - takes months. and that context usually isn't in the code or the docs. it's in people's heads or old slack threads.
the best teams i've joined had structured product context - not just architecture docs, but knowledge about customer edge cases, business decisions that shaped the code, and why things were built that way. when that exists, onboarding goes from months to weeks.
where are you getting stuck more - the technical architecture or the product/business context behind it?
makonde@reddit
AI is very good at this. Ask it better questions, ask it to draw various types of Mermaid diagrams, etc. You can ask it to draw data flow diagrams showing everywhere that the data going into the cache comes from. This sounds like Java, and AI is very good at Java because of the Java spec, the strong typing, etc. Use the plan feature of any agent and it will probably one-shot a working solution, then adjust it based on what you like or don't like.
Ask it to implement the feature in
PressureHumble3604@reddit
Also, if you can, ask Copilot to explain things to you.
duckypotato@reddit
Nothing beats either building new features into it or debugging. But outside of that:
No developer brain is capable of holding the entire context of how the code works, the best you can do is understand either a vertical slice or at a particular layer of abstraction. So, narrow it down to a single feature or set of features that you can use to understand the patterns OR try to look at the entire system from a high level.
More concretely: learn one thing at a time and source dive / read docs. Is dependency injection confusing you? Figure out what tool is being used for it and understand how it works. Same goes for Queue systems, caching tools, etc. go one at a time and understand how they work in isolation. Nothing helped me understand queue workers more than literally reading the source code for a queue library.
Basically just try to take it a step at a time and understand one new thing a day.
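To make the dependency injection point concrete, here is a minimal sketch of what a DI container does underneath. This is not the API of any particular library: `Container`, `register`, and `resolve` are hypothetical names, and real tools add scopes, lifetimes, and auto-wiring on top of the same idea.

```python
class Container:
    """Toy DI container: maps an interface to a factory that builds the implementation."""

    def __init__(self):
        self._factories = {}

    def register(self, interface, factory):
        # factory receives the container so it can resolve its own dependencies
        self._factories[interface] = factory

    def resolve(self, interface):
        return self._factories[interface](self)


# Hypothetical interface, implementation, and consumer.
class Cache:
    def get(self, key):
        raise NotImplementedError


class DictCache(Cache):
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)


class Service:
    def __init__(self, cache: Cache):
        # Service only knows the interface; the container picks the concrete class.
        self.cache = cache


container = Container()
container.register(Cache, lambda c: DictCache())
container.register(Service, lambda c: Service(c.resolve(Cache)))

service = container.resolve(Service)
print(type(service.cache).__name__)
```

Once you see that the wiring lives in the `register` calls, the question "which class is this really?" becomes "where is the registration?", and that is usually one or two files in the codebase.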
JohnWangDoe@reddit
i feel like most of the time i'm an archaeologist
AromaticNovel4028@reddit
reminds me of when i tried something similar last year
zergea@reddit
Manual approach:
- Run doxygen or equivalent
- Browse docs to get basic lay of the land
- Look for entry points that are frequently used
- How their Main configures or loads dependencies
deadbeefisanumber@reddit
This might get some hate but I found that LLMs are great at explaining and diagramming large codebases. Don't trust them 100 percent, but they sure are a very big accelerator.
hell_razer18@reddit
you need to have a purpose. either feature or prod issue oncall so you can have context of business logic
AndyKJMehta@reddit
Fix a few small trivial bugs in the system and go through the full design-dev-test-deploy cycle.
Nemosaurus@reddit
I don't get it until I start breaking things. hopefully locally
warmuuh@reddit
Besides what others said: hand-draw a call/dependency diagram, leaving out unnecessary details. You build and internalise an abstract model of the app and you have a reference for later, to understand where you are in the big picture... And hand-draw because it forces you to complete the picture manually and slowly, helping you to learn...
termd@reddit
Start with boxes and arrows to understand the overall system. Start from the end user, end with your service and dependencies. You can have 1 overall, then 1 for flow. For example I created an overall, a precheckout, a checkout, a post checkout, and an alternate ingress diagram for my team.
Now look at your apis and the inputs and outputs and think about what each one needs, what it does, and what it returns.
Now make a sequence diagram of how all the apis you own work. Don't try to memorize this one, just refer to it when you need it.
You can just ask any ai to do all of those things for you nowadays. They're less good at system diagrams but VERY good at how your actual service and packages work. You might need to help it out if your team doesn't own the client facing part and explain how that works.
I'd do a code search and identify all the places KV cache is used. Feed each into claude and ask it to explain what the use case is. Save each one.
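If the codebase is Python, the "find all the places it is used" step can be done with a plain text search, or slightly more precisely with the standard `ast` module so that comments and strings don't match. A minimal sketch, with `find_usages` as a made-up helper name:

```python
import ast
from pathlib import Path


def find_usages(root, name):
    """Return sorted (path, line) pairs where `name` is imported or referenced."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse as Python source
        for node in ast.walk(tree):
            if isinstance(node, ast.Name) and node.id == name:
                hits.append((str(path), node.lineno))        # e.g. KVCache(...)
            elif isinstance(node, ast.Attribute) and node.attr == name:
                hits.append((str(path), node.lineno))        # e.g. cache_mod.KVCache
            elif isinstance(node, (ast.Import, ast.ImportFrom)) and any(
                alias.name == name or alias.asname == name for alias in node.names
            ):
                hits.append((str(path), node.lineno))        # import lines
    return sorted(hits)
```

Calling `find_usages(".", "KVCache")` from the repository root gives a list of files and line numbers to read through, or to hand to whatever tool you use for explanations, one call site at a time.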
RedditMapz@reddit
Piece by piece. My practical approach would be to identify the communication system first. Usually, well-constructed software has a module that focuses on translating inputs (like clicking a button) and transforming them into some other signal or form of information to be dispatched to a different module, which executes the functionality linked to that input. If you can find this middle schism you can sail the code's information flow and discover functionality on your own.
To start, just choose a named input (like a button). Find it in code through a universal find and walk through the code either forwards or backwards until you understand the information path for that one button (ideally with a debugger). That should inform you of where you can intersect the code to peek into the functionality of all inputs. Lastly, just be curious while following the information flow and jump into other modules/methods that are used along the way. Keep notes and make diagrams if there are none, or try to find them in the documentation if it exists.
Now, if the code is a shitty ball of mud with nonsensical paths and architecture, then you are in wild waters. In that type of company just try to not sink 🤷🏽♂️.
throwaway_0x90@reddit
The way I usually start things like this, is that I tinker with it in a test-environment. Change random things and see what breaks.
headinthesky@reddit
Hopefully there are tests that you can also start from in these cases
PurepointDog@reddit
"See what breaks" is truly the best technique
ConflictPotential204@reddit
a NASA engineer with 35 years working on multiple space programs once told me the only way you can eat an elephant is one bite at a time. You shouldn't try to rush your understanding of a large repo. You have to practice mentally filtering out the noise so you can stay in scope for the task at hand and digest what you learned piece-by-piece. Eventually you'll start to pick up on higher level patterns and the big picture will make sense.
foxj36@reddit
I used to think I should be able to understand and remember large code bases. I would get frustrated after a year or two when I still didn't know them. Then I worked with the best engineer I've ever met. He had spent 25 years developing the codebase we were working on. In meetings he would frequently say, "I would have to look through X layer again to get the full picture" or "I think we can do this but I don't quite remember how Y works, let me look at it and get back to you." I learned a lot in my 2 years with him.
BiebRed@reddit
Ship new features. Everything you add will require you to interact with some part of the existing code, and it won't work until you understand the interfaces you have to use.
Even if you're not called upon to ship features, pretend you are. Assuming you have the time available in your day, look at a feature request or bug fix ticket assigned to a developer, and figure out how to implement it. Then check on that developer's PR and see how they did it. Check for differences between what you would have done and what they did, and try to understand them. If possible, ask the developer to clarify details.
CajunBmbr@reddit
Something to add to your analysis methods or tools is Graphify.
Antsolog@reddit
I think a lot of excellent ideas are already in the thread so I’ll go with more specific concrete things:
Chuu@reddit
Having had to get up to speed on large codebases many times in my career, I have to say that using AI to explore codebases is by far the best way I've come across by a mile.
I know you're specifically looking for non-AI answers but I would not use them exclusively.
CrushgrooveSC@reddit
It depends on the nature of the program / system.
In large service-oriented architectures (good luck) I usually begin with whatever observability tools can provide some sort of fan-out diagram or traffic pattern analysis and then work that into something akin to a distributed flame graph grouped by service. Work backward to the ingress controller from there.
In a real application my goal is to get from whatever the main hot parts are all the way back to ‘main()’. I don’t worry too much about ‘start:’ unless it’s embedded.
bbaallrufjaorb@reddit
interesting approach to work backwards, i’ve never thought of that. i usually look for the entry point and trace through a known function to the end. like if i know “this service can do an account transfer” i’ll find the entry point and then trace it through til it’s done
gotta try yours next time
CrushgrooveSC@reddit
The issue with going forwards when the program context is unknown is basically the halting problem.
How many potentially infinite loops? How many spawned threads or processes will you encounter? How many external service calls? Distributed loops? Lambdas? DB trigger functions? Etc.
If you go backwards, it’s just like… make breakpoint. Read call stack backwards. Etc
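When attaching a debugger is awkward, the same backwards view can be captured in code. A minimal Python sketch using the standard `traceback` module: the decorator name `record_callers` and the three functions are hypothetical, standing in for a hot spot and the path that leads to it.

```python
import functools
import traceback


def record_callers(log):
    """Decorator: on each call, append the chain of caller names (nearest first) to `log`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # extract_stack() lists frames oldest first and ends with this wrapper,
            # so drop the last entry and reverse to read the stack backwards.
            frames = traceback.extract_stack()[:-1]
            log.append([frame.name for frame in reversed(frames)])
            return func(*args, **kwargs)
        return wrapper
    return decorate


calls = []


@record_callers(calls)
def write_to_cache(key, value):  # hypothetical hot spot
    return (key, value)


def handle_request():
    return write_to_cache("k", "v")


def main():
    return handle_request()


main()
print(calls[0][:2])  # the two nearest callers
```

Reading `calls` after running a few scenarios shows every route that actually reaches the hot spot, without having to guess at the forward paths.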
AlexanderTroup@reddit
It's all about thin slices of the codebase. Figure out how one particular feature works. If it's too big, then summarise what a particular function is doing and build a hand-figured map of the components.
I'd also recommend getting help from people who have worked in the codebase before. They can help with the intuitive side of things, although there's really no avoiding getting in there and working stuff out.
If you need to track what you've learned, tests can be a genuinely good way to test your understanding and also document what's happening.
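As a sketch of what such tests can look like, here is a small set of characterization tests in Python's `unittest`. The `KVCache` below is a stand-in with an assumed `put`/`get` interface; in practice you would import the real class and write down whatever behaviour you observe, one test per thing you learned.

```python
import unittest


class KVCache:
    """Stand-in for the real class; its put/get interface is an assumption."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class KVCacheCharacterization(unittest.TestCase):
    """Each test records one observed behaviour, named so it reads like a note."""

    def test_missing_key_returns_none(self):
        self.assertIsNone(KVCache().get("absent"))

    def test_put_overwrites_existing_value(self):
        cache = KVCache()
        cache.put("k", 1)
        cache.put("k", 2)
        self.assertEqual(cache.get("k"), 2)

    def test_instances_do_not_share_state(self):
        first, second = KVCache(), KVCache()
        first.put("k", 1)
        self.assertIsNone(second.get("k"))
```

If one of these fails against the real class, that is useful too: it means your mental model was wrong in a specific, now-documented way.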
If the code is too dense to understand, that can be a sign that it's poorly designed, and in that case it's worth thinking about if you should just simplify that part of the system and clear out the old stuff.
I firmly believe that no problem is so complex that the code can't be clean. Even with horrendous but necessary algorithms, you can confine it to one part of the system and have the rest be well named and organised, so if it's too obfuscated, taking time to clarify the functions through either better names or better grouping of functionality can help.
Time in the codebase helps too. When you actually have features to build, limit your learning to the parts your feature needs. It clears itself up eventually.
stagedgames@reddit
trace everything using your IDE tools. find references, find definitions, find implementations. for interfaces in dependency injection, find the implementation, for methods that aren't obvious, find definition, and for methods that don't make sense in how they're used, find implementation. find a happy path, verify that it is not orphaned code and let your debugger be your guide.
bigorangemachine@reddit
Sometimes an app does "one thing" and that's pretty easy. Start with that one thing and follow it to the frontend. If it was an eCommerce store you'd start with a catalogue item.
The project I was on had a lot of things it was doing. For that I just really started with the frontend and traced it back to the backend.
cmpthepirate@reddit
"Something basic" lol would hate to see what you consider complicated 😅
SoulTrack@reddit
Nowadays I use Claude Code to make diagrams and describe what business cases the code potentially handles.
vom-IT-coffin@reddit
The endpoints / models.
Typical-Positive6581@reddit
Maintaining existing features and adding new ones will get you that knowledge
ivancea@reddit
I think it depends on the objective. For example, if you have a very clear and manually testable objective, I would start reading/touching that part. Then expand from there.
For another example, in my team we develop a DB query language. The first task for newcomers is usually something like "implement a new function X for the language (e.g. POW(a,b))". It's simple to understand, and easy to test (You can just use it in a query). From there, you'll start learning, step by step, while enlarging your influence radius within the project, until you understand it all. Sometimes, you have to jump and dive into a new unconnected place though, and that's it.
To rationalize it: start from a visible part of the system, and expand. That's my usual approach. But of course, the objective decides how you do it
orbit99za@reddit
I draw pictures and diagrams from the explanations of people before me, from the BAs to the programmers.
I find understanding what the program does, and why helps to discover what functions are used for what.
For example, the system has a barcode scanner, i find that code then try to backtrack where the information comes from and where it goes.
Visual Studio is excellent for this, because you can click on a method call and jump to it.
Realistic_Yogurt1902@reddit
For the majority of server-side applications, I start by understanding two opposite parts:
* input
* output
Then, everything in between is a black box for me.
The next step depends on the feature I am working on: understanding where the feature code sits inside this black box.
Then, input and output of the feature code, rinse and repeat until you fully understand your feature input and your feature output.
Client applications are a bit different. On one side, they have a state, and you should always remember it. On the other side, the majority of such applications are pretty simple compared to the backend.
Just my best guess, based on your description:
Input: What exactly do you need to cache? Where is this data finally prepared? Most probably - inject your cache prefix there.
Output: probably nothing, if you just write to cache, you probably need to emit some metrics about success/failed writes, and that's pretty much it.
Dependency Injection - if you have it in the application, it is very important to understand how it works.
P.S. Current AI tools are really great for such investigations and answer the question "how does X work"?
HopadilloRandR@reddit
By taking it apart.
Kind-Armadillo-2340@reddit
The AI answers are getting downvoted but at this point just pointing Cursor at the directory and asking questions is probably the fastest way.
The trick is what questions do you ask it? The most important things are the inputs and outputs:
From there it’s good to learn about the code structure:
You can get pretty in depth with it but it’s faster than the old way of manually tracing api endpoints.
CodeGrumpyGrey@reddit
I tend to start by identifying the core data entities and how they are wired up. 99% of the time, that means digging into the database first and working out how things are stored in there and how actions in the application change that. Roughly my process is
k032@reddit
I think it really just takes time, you aren't going to be an expert and know all the ins and outs day one.
Eventually you just start owning sections or features. Just asking questions (to coworkers or AI) for features and parts as you need them.
Professional_Mix2418@reddit
$ claude /init
😎
throwaway0134hdj@reddit
Use a tool like Sourcetrail to visualize the codebase as an interactive dependency graph.
And I know it’s rather difficult to find this bc most projects don’t have a single entry point, but ask members of the team where the program “starts” and for the “entry points”.
Get an understanding of the tech stack by looking at the package.json/requirements.txt or whatever dependencies tooling they use.
After that, look at the database schema, figure out how entities are mapped, and understand those relationships.
Looking at the tests can also be a great way to understand the codebase. As well as the git blame/history.
Get someone on the team to explain their understanding first and then use the dependency graph tools. And poke around.
boring_pants@reddit
"Look at the tests" and "use the debugger" are my two main tips.
Assuming the code base has decent test coverage there are probably tests using the KVCache class. So look at how they do it.
Alternatively, find a place where the class is currently being used, put a breakpoint there, and step through it in the debugger. That's an excellent way to poke through abstraction and indirection. Just step into the call and you can see exactly which function actually ended up being called, and with which parameters.
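In Python, the "which function actually ended up being called" question can also be answered directly at a breakpoint (for example one set with the built-in `breakpoint()`), by asking the object for its concrete class and walking its method resolution order. A minimal sketch: `resolve_method`, the `Cache` interface, and its two subclasses are all hypothetical.

```python
from abc import ABC, abstractmethod


def resolve_method(obj, method_name):
    """Return the name of the class in obj's MRO that actually defines `method_name`."""
    for cls in type(obj).__mro__:
        if method_name in vars(cls):
            return cls.__name__
    raise AttributeError(method_name)


# Hypothetical interface and implementations.
class Cache(ABC):
    @abstractmethod
    def get(self, key):
        ...


class InMemoryCache(Cache):
    def get(self, key):
        return None


class LoggingCache(InMemoryCache):
    pass  # inherits get() unchanged


def build_cache() -> Cache:
    # Stands in for whatever the DI wiring hands you at runtime.
    return LoggingCache()


cache = build_cache()
print(type(cache).__name__, resolve_method(cache, "get"))
```

The type annotation only promises a `Cache`; the two printed names tell you the concrete class you were given and the class whose `get` will really run, which is the same information you get by stepping into the call.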
WhitelabelDnB@reddit
You could ask any coding agent (eg Claude, Codex, GHCP) to help you with that specific task and it would do a great job, especially since KVCache is already in the codebase.
For the specific example you've given, language is relevant to some degree too. You've mentioned there's a class. Is this a fully OOP codebase, with directories for classes/models, interfaces, services, etc? If there are, then the KV should be isolated in a service, and you should be able to focus your efforts there.
AI tends to hallucinate when it's pressed for an answer but doesn't have the information. Exploring a codebase is a great example of where even older, cheaper, faster models can do a great job, as long as the harness is good, because all of the information is already there and it can answer its own questions. Just give it a go. Ask for proof and citations.