How do you actually start understanding a large codebase?

Posted by radjeep@reddit | ExperiencedDevs | 61 comments

I’m trying to become a better engineer and feeling pretty stuck with something basic: reading large codebases.

Quick background: I’ve spent a few years as a data scientist. Built Flask endpoints, Streamlit apps, worked a bit with GCP / Vertex AI. But I haven’t really done heavy engineering work (apart from some early Java bugfixes with a lot of help).

Now I’ve got a chance to work more closely with engineering teams, but the size and complexity of the codebase is intimidating me.

A concrete example: I was asked to implement prefix KV caching. There’s already a KVCache class that I’m supposed to reuse, but I can’t even begin to reason about how it behaves across the different places it’s used. There’s a lot of abstraction (interfaces, dependency injection, etc.) and I get lost trying to follow the flow.

I’ve tried reading top-down, following function calls, even using AI tools to walk through the code, but once things get abstract, I lose track.

I’m not just looking for “ask AI to explain it”, more like -

how do you approach a large unfamiliar codebase?
do you start from entrypoints or specific use-cases?
how do you trace execution without understanding everything?

Also, are there tools (AI or otherwise) that actually help you navigate and map out codebases better?

Right now it feels like everything depends on everything else and I don’t know where to get a foothold.

Would love to hear how others approach this.

[-]

Tricky_Tesla@reddit

In chunks via sequence diagrams.

[-]

radjeep@reddit (OP)

This might be a stupid question, is there a way to automatically generate sequence diagrams from (python) code or do I have to draw them out manually?

[-]

Chuu@reddit

I can't remember the name of the tool, but there is a commercial tool that went open source about five years ago. I found it absolutely excellent when I took on a brand-new codebase around the same time. With some googling you might find it.

[-]

subma-fuckin-rine@reddit

sourcetrail?

[-]

Chuu@reddit

I am 90% sure this is it.

[-]

Realistic_Yogurt1902@reddit

You could use AI to do it for you. It's pretty good with Python code

[-]

curiouscirrus@reddit

Why is this comment not higher? In 2026, no one should be doing any of this shit manually. Review and validate it, sure, but let AI take the lead here.

[-]

Veuxdo@reddit

It'll make them. But whether they are both accurate and useful is another story.

[-]

Isofruit@reddit

Just write Mermaid graphs yourself. They're widespread enough that a decent chunk of software supports them, e.g. Obsidian and GitHub Markdown.
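
For the Python question above, one rough way to get a first draft is to record calls at runtime and print them as Mermaid text. This is only a sketch: it uses the standard `sys.setprofile` hook, `trace_to_mermaid` and the three demo functions are made-up names, and it only sees Python-level calls on the current thread.

```python
import sys


def trace_to_mermaid(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return a Mermaid sequence diagram of its Python calls."""
    lines = ["sequenceDiagram"]
    stack = ["caller"]  # participants currently on the call stack

    def profiler(frame, event, arg):
        name = frame.f_code.co_name
        # Ignore C-level events and synthetic frames such as <listcomp> or <lambda>.
        if event not in ("call", "return") or name.startswith("<"):
            return
        if event == "call":
            lines.append(f"    {stack[-1]}->>{name}: call")
            stack.append(name)
        elif len(stack) > 1:
            callee = stack.pop()
            lines.append(f"    {callee}-->>{stack[-1]}: return")

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(previous)  # always restore the previous hook
    return "\n".join(lines)


# Hypothetical functions standing in for a real request flow.
def load_config():
    return {"size": 2}


def build_cache():
    return dict.fromkeys(range(load_config()["size"]))


def handle_request():
    return len(build_cache())


diagram = trace_to_mermaid(handle_request)
print(diagram)
```

Paste the printed text into anything that renders Mermaid. On a real flow the raw trace is usually too noisy, so expect to filter by module or trim it by hand.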

[-]

garbageInGarbageOot@reddit

Take a pencil and paper. Start reading code and make notes about what the modules do, their relationships to each other, the major flows.

[-]

bbaallrufjaorb@reddit

does pencil and paper work better than typing the notes out, or voice transcription?

i really hate writing, hurts my hand after a while and my writing looks like chicken scratch

[-]

garbageInGarbageOot@reddit

It’s good to create diagrams that describe the parts of software and their relationships.

[-]

Safe-Ball4818@reddit

Stop trying to read the whole thing and just attach a debugger to a specific flow you need to change. Trace the execution path line by line until the abstraction makes sense in that one context.

[-]

Delphicon@reddit

I’ve learned the most by trying to build it myself. Obviously not the whole thing, just the parts whose workings I don’t understand. I typically write it a little bit “in my own words” too so I can’t just copy the code.

It doesn’t take long before you start to get a feel for the skeleton of the thing which makes the codebase feel more like a bunch of small chunks of code instead of one big blob.

Writing code is always easier than reading code so the easiest way to learn the code is usually to write the code yourself.

[-]

kpolli64@reddit

Start from its entry point, e.g. for an API, start from the endpoint function. You might also discover some dead code.

[-]

Never-Trust-Me@reddit

Vibe code it in a week and then refactor everything over the course of the next 3 months /s

Top comment already nailed it. I just came to talk crap :p

[-]

eng_lead_ftw@reddit

everyone will tell you to read the code, trace the flows, draw diagrams. that's all correct but misses the harder half: understanding WHY the code is the way it is.

i can read a codebase and understand what it does in a week. understanding why certain decisions were made - why this module is overcomplicated, why that service exists separately, why there's a weird special case in the payment flow - takes months. and that context usually isn't in the code or the docs. it's in people's heads or old slack threads.

the best teams i've joined had structured product context - not just architecture docs, but knowledge about customer edge cases, business decisions that shaped the code, and why things were built that way. when that exists, onboarding goes from months to weeks.

where are you getting stuck more - the technical architecture or the product/business context behind it?

[-]

makonde@reddit

AI is very good at this. Ask it better questions, ask it to draw various types of Mermaid diagrams, etc. You can ask it to draw data flow diagrams of everywhere the data that goes into the cache comes from. This sounds like Java; AI is very good at Java because of the Java spec, the strong typing, etc. Use the plan feature of any agent and it will probably one-shot a working solution, then adjust it based on what you like or don't like.

Ask it to implement the feature in

[-]

PressureHumble3604@reddit

check tests
write tests
build something and iterate

Also, if you can, ask Copilot to explain things to you.

[-]

duckypotato@reddit

Nothing beats either building new features into it or debugging. But outside of that:

No developer brain is capable of holding the entire context of how the code works, the best you can do is understand either a vertical slice or at a particular layer of abstraction. So, narrow it down to a single feature or set of features that you can use to understand the patterns OR try to look at the entire system from a high level.

More concretely: learn one thing at a time and source dive / read docs. Is dependency injection confusing you? Figure out what tool is being used for it and understand how it works. Same goes for Queue systems, caching tools, etc. go one at a time and understand how they work in isolation. Nothing helped me understand queue workers more than literally reading the source code for a queue library.

Basically just try to take it a step at a time and understand one new thing a day.

[-]

JohnWangDoe@reddit

i feel like most of the time i'm an archaeologist

[-]

AromaticNovel4028@reddit

reminds me of when i tried something similar last year

[-]

zergea@reddit

Manual approach:
Run doxygen or equivalent
Browse docs to get the basic lay of the land
Look for entry points that are frequently used
Look at how their Main configures or loads dependencies

[-]

deadbeefisanumber@reddit

This might get some hate, but I've found that LLMs are great at explaining and diagramming large codebases. Don't trust them 100 percent, but they sure are a very big accelerator.

[-]

hell_razer18@reddit

you need to have a purpose. either feature or prod issue oncall so you can have context of business logic

[-]

AndyKJMehta@reddit

Fix a few small trivial bugs in the system and go through the full design-dev-test-deploy cycle.

[-]

Nemosaurus@reddit

I don't get it until I start breaking things. hopefully locally

[-]

warmuuh@reddit

Besides what others said: hand-draw a call/dependency diagram, leaving out unnecessary details. You build and internalise an abstract model of the app and you have a reference for later, to understand where you are in the big picture... And hand-draw because it forces you to complete the picture manually and slowly, helping you to learn...

[-]

termd@reddit

Start with boxes and arrows to understand the overall system. Start from the end user, end with your service and dependencies. You can have 1 overall, then 1 for flow. For example I created an overall, a precheckout, a checkout, a post checkout, and an alternate ingress diagram for my team.

Now look at your apis and the inputs and outputs and think about what each one needs, what it does, and what it returns.

Now make a sequence diagram of how all the apis you own work. Don't try to memorize this one, just refer to it when you need it.

"Also, are there tools (AI or otherwise) that actually help you navigate and map out codebases better?"

You can just ask any ai to do all of those things for you nowadays. They're less good at system diagrams but VERY good at how your actual service and packages work. You might need to help it out if your team doesn't own the client facing part and explain how that works.

"I was asked to implement prefix KV caching. There’s already a KVCache class that I’m supposed to reuse, but I can’t even begin to reason about how it behaves across the different places it’s used. There’s a lot of abstraction (interfaces, dependency injection, etc.) and I get lost trying to follow the flow."

I'd do a code search and identify all the places KVCache is used. Feed each into Claude and ask it to explain what the use case is. Save each one.
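
If the codebase is Python, the code search step can be a bit more precise than grep by walking the syntax tree, so comments and strings don't show up as hits. A minimal sketch, where `find_references` is a made-up helper and the two demo files are hypothetical stand-ins for a real repo:

```python
import ast
import tempfile
from pathlib import Path


def find_references(root, name):
    """Yield (path, line_number, source_line) for lines in *.py files under root that mention name."""
    for path in sorted(Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        hits = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Name) and node.id == name:
                hits.add(node.lineno)  # plain use, e.g. KVCache()
            elif isinstance(node, ast.Attribute) and node.attr == name:
                hits.add(node.lineno)  # qualified use, e.g. cache.KVCache
            elif isinstance(node, (ast.ClassDef, ast.FunctionDef)) and node.name == name:
                hits.add(node.lineno)  # the definition itself
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                if any(alias.name.split(".")[-1] == name for alias in node.names):
                    hits.add(node.lineno)
        lines = source.splitlines()
        for lineno in sorted(hits):
            yield path, lineno, lines[lineno - 1].strip()


# Demo on a throwaway directory with two hypothetical modules.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cache.py").write_text("class KVCache:\n    pass\n", encoding="utf-8")
    Path(tmp, "service.py").write_text(
        "from cache import KVCache\n\ncache = KVCache()\n", encoding="utf-8"
    )
    found = [(p.name, n, text) for p, n, text in find_references(tmp, "KVCache")]

for hit in found:
    print(hit)
```

Each hit is a candidate call site to read or to hand to an assistant one at a time. It will miss dynamic lookups such as `getattr` or names wired up by a DI container's config, so treat the list as a starting point.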

[-]

RedditMapz@reddit

Piece by piece. My practical approach would be to identify the communication system first. Usually, well-constructed software has a module that focuses on translating inputs (like clicking a button) into some other signal or form of information, which is then dispatched to a different module that executes the functionality linked to that input. If you can find this middle schism you can sail the code's information flow and discover functionality on your own.

To start, just choose a named input (like a button). Find it in the code through a universal find and walk through the code either forwards or backwards until you understand the information path for that one button (ideally with a debugger). That should tell you where you can intercept the code to peek into the functionality of all inputs. Lastly, just be curious while following the information flow and jump into other modules/methods that are used along the way. Keep notes and make diagrams if there are none, or try to find them in the documentation if it exists.

Now, if the code is a shitty ball of mud with nonsensical paths and architecture, then you are in wild waters. In that type of company just try to not sink 🤷🏽‍♂️.

[-]

throwaway_0x90@reddit

"A concrete example: I was asked to implement prefix KV caching. There’s already a KVCache class that I’m supposed to reuse, but I can’t even begin to reason about how it behaves across the different places it’s used. There’s a lot of abstraction (interfaces, dependency injection, etc.) and I get lost trying to follow the flow."

The way I usually start with things like this is to tinker with it in a test environment. Change random things and see what breaks.

[-]

headinthesky@reddit

Hopefully there are tests that you can also start from in these cases

[-]

PurepointDog@reddit

"See what breaks" is truly the best technique

[-]

ConflictPotential204@reddit

a NASA engineer with 35 years working on multiple space programs once told me the only way you can eat an elephant is one bite at a time. You shouldn't try to rush your understanding of a large repo. You have to practice mentally filtering out the noise so you can stay in scope for the task at hand and digest what you learned piece-by-piece. Eventually you'll start to pick up on higher level patterns and the big picture will make sense.

[-]

foxj36@reddit

I used to think I should be able to understand and remember large codebases. I would get frustrated after a year or two when I still didn't know them. Then I worked with the best engineer I've ever met. He had spent 25 years developing the codebase we were working on. In meetings he would frequently say, "I would have to look through X layer again to get the full picture" or "I think we can do this but I don't quite remember how Y works, let me look at it and get back to you." I learned a lot in my 2 years with him.

[-]

BiebRed@reddit

Ship new features. Everything you add will require you to interact with some part of the existing code, and it won't work until you understand the interfaces you have to use.

Even if you're not called upon to ship features, pretend you are. Assuming you have the time available in your day, look at a feature request or bug fix ticket assigned to a developer, and figure out how to implement it. Then check on that developer's PR and see how they did it. Check for differences between what you would have done and what they did, and try to understand them. If possible, ask the developer to clarify details.

[-]

CajunBmbr@reddit

Something to add to your analysis methods or tools is Graphify.

[-]

Antsolog@reddit

I think a lot of excellent ideas are already in the thread so I’ll go with more specific concrete things:

Read Working Effectively With Legacy Code by Michael Feathers. It’s a foundational book for dealing with stuff like this.
1. Do you have something that will spit out auto tracing for you to look at? If so then I’d go feed some of those traces to AI and try to see how things work (call graphs) and work my understanding from there. If you don’t have that then:
2. You have a class/module which presumably is in use by other parts of the system. Find references to it in the codebase and see how it is used. If it’s not used then worst case it may be hidden as an interface to something else, so look for any interfaces that the class implements.
3. (2) is meant to try and “seed context” into your head for this step. Find a series of steps to hit one thing in the module (it can just be a constructor or a method). Assuming you have a local or dev environment which can be broken into, something I do is attach a debugger and set a breakpoint in various functions and then just run the system until I crash in one of my exceptions. This gives me a reproducible way to hit the code I’m trying to learn.
4. If you don’t know how to attach a debugger (I recommend learning), 3 is possible by adding exceptions or log lines to the codebase. I would still recommend using a debugger and setting breakpoints though.
5. Step through the code / read your log and in a separate space keep notes (handwritten or typed, doesn’t matter) about why things are happening.
6. Feed those questions/data to your team to double check or even to AI to bounce ideas off of. Note that AI may hallucinate answers horribly here and shouldn’t be trusted 100%.
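
For the log-line route, a small decorator is one way to avoid sprinkling print statements everywhere. This is a rough sketch in Python using only `functools` and `logging`; `log_calls`, `ListHandler` and `lookup` are made-up names, and the list handler exists only so the demo can show what was captured.

```python
import functools
import logging

logger = logging.getLogger("code_tour")
logger.setLevel(logging.INFO)


def log_calls(func):
    """Log arguments, return value, and exceptions for every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("-> %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("!! %s raised", func.__name__)
            raise  # never swallow the error, only record it
        logger.info("<- %s returned %r", func.__name__, result)
        return result
    return wrapper


class ListHandler(logging.Handler):
    """Collects formatted messages in memory (demo only)."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)


@log_calls
def lookup(key):
    # Hypothetical function you want to observe.
    return key.upper()


lookup("abc")
for message in handler.messages:
    print(message)
```

Decorate only the handful of functions on the path you care about and remove the decorators afterwards; the resulting log is the raw material for the notes in the next step.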

[-]

Chuu@reddit

Having had to get up to speed on large codebases many times in my career, I have to say that using AI to explore codebases is by far the best way I've come across by a mile.

I know you're specifically looking for non-AI answers but I would not use them exclusively.

[-]

CrushgrooveSC@reddit

It depends on the nature of the program / system.

In large service-oriented architectures (good luck) I usually begin with whatever observability tools can provide some sort of fan-out diagram or traffic pattern analysis and then work that into something akin to a distributed flame graph grouped by service. Work backward to the ingress controller from there.

In a real application my goal is to get from whatever the main hot parts are all the way back to ‘main()’. I don’t worry too much about ‘start:’ unless it’s embedded.

[-]

bbaallrufjaorb@reddit

interesting approach to work backwards, i’ve never thought of that. i usually look for the entry point and trace through a known function to the end. like if i know “this service can do an account transfer” i’ll find the entry point and then trace it through til it’s done

gotta try yours next time

[-]

CrushgrooveSC@reddit

The issue with going forwards when the program context is unknown is basically the halting problem.

How many potentially infinite loops? How many spawned threads or processes will you encounter? How many external service calls? Distributed loops? Lambdas? DB trigger functions? Etc.

If you go backwards, it’s just like… make breakpoint. Read call stack backwards. Etc
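
Without a debugger, the same backwards read can be approximated in Python by dumping the call stack from inside the function you care about. A minimal sketch using the standard `traceback` module; `who_called_me` and the three demo functions are hypothetical names:

```python
import traceback


def who_called_me(limit=5):
    """Return the names of the functions on the current call path, innermost first."""
    # extract_stack() lists frames oldest-first; the last entry is this helper itself.
    stack = traceback.extract_stack()[:-1]
    return [frame.name for frame in reversed(stack)][:limit]


# Hypothetical call chain standing in for a real request path.
def cache_get(key):
    chain = who_called_me()
    print(" <- ".join(chain))
    return chain


def lookup_user(user_id):
    return cache_get(f"user:{user_id}")


def handle_request():
    return lookup_user(42)


chain = handle_request()
```

The first entries read exactly like a debugger's call stack: the function itself, then its caller, then the caller's caller. Anything past that depends on how the program was started.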

[-]

AlexanderTroup@reddit

It's all about thin slices of the codebase. Figure out how one particular feature works. If it's too big, then summarise what a particular function is doing and build a hand-figured map of the components.

I'd also recommend getting help from people who have worked in the codebase before. They can help with the intuitive side of things, although there's really no avoiding getting in there and working stuff out.

If you need to track what you've learned, tests can be a genuinely good way to test your understanding and also document what's happening.
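
As a sketch of what that can look like: write down what you believe the code does as assertions, then run them to find out whether you were right. The `KVCache` below is a hypothetical stand-in (a tiny LRU cache), not the class from the original post; in practice you would import the real one and keep only the test class.

```python
import unittest
from collections import OrderedDict


class KVCache:
    """Hypothetical stand-in: a small cache that evicts the least recently used key."""
    def __init__(self, capacity):
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading marks the key as recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used key


class KVCacheCharacterization(unittest.TestCase):
    """Each test records one belief about how the cache behaves today."""

    def test_missing_key_returns_default(self):
        self.assertIsNone(KVCache(2).get("nope"))

    def test_oldest_entry_is_evicted_when_full(self):
        cache = KVCache(2)
        for key in ("a", "b", "c"):
            cache.put(key, key.upper())
        self.assertIsNone(cache.get("a"))
        self.assertEqual(cache.get("c"), "C")

    def test_reading_a_key_protects_it_from_eviction(self):
        cache = KVCache(2)
        cache.put("a", 1)
        cache.put("b", 2)
        cache.get("a")
        cache.put("c", 3)
        self.assertEqual(cache.get("a"), 1)
        self.assertIsNone(cache.get("b"))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(KVCacheCharacterization)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A failing test here is useful information: it means the mental model was wrong, and the test name tells you which part.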

If the code is too dense to understand, that can be a sign that it's poorly designed, and in that case it's worth thinking about if you should just simplify that part of the system and clear out the old stuff.

I firmly believe that no problem is so complex that the code can't be clean. Even with horrendous but necessary algorithms, you can confine it to one part of the system and have the rest be well named and organised, so if it's too obfuscated, taking time to clarify the functions through either better names or better grouping of functionality can help.

Time in the codebase helps too. When you actually have features to build, limit your learning to the parts your feature needs. It clears itself up eventually.

[-]

stagedgames@reddit

Trace everything using your IDE tools: find references, find definitions, find implementations. For interfaces in dependency injection, find the implementation; for methods that aren't obvious, find the definition; and for methods whose usage doesn't make sense, find the implementation. Find a happy path, verify that it is not orphan code, and let your debugger be your guide.
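
When the IDE's "find implementation" gives several candidates, you can also ask the running object directly. A small Python sketch, where `Cache`, `InMemoryCache`, `LoggingCache` and `describe` are all hypothetical names used only for illustration:

```python
from abc import ABC, abstractmethod


class Cache(ABC):
    """Hypothetical interface that callers depend on."""
    @abstractmethod
    def get(self, key):
        ...


class InMemoryCache(Cache):
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)


class LoggingCache(InMemoryCache):
    """Adds put() but inherits get() from InMemoryCache."""
    def put(self, key, value):
        self._data[key] = value


def describe(obj, method_name):
    """Report the concrete class of obj and which class actually defines the method."""
    cls = type(obj)
    method = getattr(cls, method_name)
    return {
        "concrete_class": cls.__qualname__,
        "defined_in": method.__qualname__,
        "mro": [c.__name__ for c in cls.__mro__],
    }


injected = LoggingCache()  # imagine this instance arrived via a DI container
info = describe(injected, "get")
print(info)
```

Dropping a call like this at a breakpoint or in a log line answers "which `get` am I really about to step into?" without reading the container configuration first.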

[-]

bigorangemachine@reddit

Sometimes an app does "one thing" and that's pretty easy. Start with that one thing and follow it to the frontend. If it was an eCommerce store you'd start with a catalogue item.

The project I was on had a lot of things it was doing. For that I just really started with the frontend and traced it back to the backend.

[-]

cmpthepirate@reddit

"Something basic" lol would hate to see what you cover complicated 😅

[-]

SoulTrack@reddit

Nowadays I use Claude Code to make diagrams and describe what business cases the code potentially handles.

[-]

vom-IT-coffin@reddit

The endpoints / models.

[-]

Typical-Positive6581@reddit

Maintaining existing features and adding new ones will get you that knowledge

[-]

ivancea@reddit

I think it depends on the objective. For example, if you have a very clear and manually testable objective, I would start reading/touching that part. Then expand from there.

For another example, in my team we develop a DB query language. The first task for newcomers is usually something like "implement a new function X for the language (e.g. POW(a,b))". It's simple to understand, and easy to test (You can just use it in a query). From there, you'll start learning, step by step, while enlarging your influence radius within the project, until you understand it all. Sometimes, you have to jump and dive into a new unconnected place though, and that's it.

To rationalize it: start from a visible part of the system, and expand. That's my usual approach. But of course, the objective decides how you do it

[-]

orbit99za@reddit

I draw pictures and diagrams from the explanations of people who came before me, from the BAs to the programmers.

I find that understanding what the program does, and why, helps to discover which functions are used for what.

For example, if the system has a barcode scanner, I find that code and then try to trace where the information comes from and where it goes.

Visual Studio is excellent for this, because you can click on a method call and jump to it.

[-]

Realistic_Yogurt1902@reddit

For the majority of server-side applications, I start by understanding two opposite parts:
* input
* output

Then, everything in between is a black box for me.
The next step depends on the feature I am working on: understand where the feature code sits inside this black box.
Then look at the input and output of the feature code; rinse and repeat until you fully understand your feature's input and output.

Client applications are a bit different. On one side, they have a state, and you should always remember it. On the other side, the majority of such applications are pretty simple compared to the backend.

"A concrete example: I was asked to implement prefix KV caching. There’s already a KVCache class that I’m supposed to reuse, but I can’t even begin to reason about how it behaves across the different places it’s used. There’s a lot of abstraction (interfaces, dependency injection, etc.) and I get lost trying to follow the flow."

Just my best guess, based on your description:
Input: What exactly do you need to cache? Where is this data finally prepared? Most probably, inject your cache prefix there.
Output: probably nothing. If you just write to the cache, you probably need to emit some metrics about successful/failed writes, and that's pretty much it.

Dependency injection: if the application uses it, it is very important to understand how it works.

P.S. Current AI tools are really great for such investigations and for answering the question "how does X work?"

[-]

HopadilloRandR@reddit

By taking it apart.

[-]

Kind-Armadillo-2340@reddit

The AI answers are getting downvoted, but at this point just pointing Cursor at the directory and asking questions is probably the fastest way.

The trick is what questions do you ask it? The most important things are the inputs and outputs:

What are the API endpoints? What do they return?
How does the codebase store data?

From there it’s good to learn about the code structure:

What is the module breakdown of the codebase?
What’s the test coverage? Any gaps?
What are the obvious pieces of tech debt?

You can get pretty in depth with it but it’s faster than the old way of manually tracing api endpoints.

[-]

CodeGrumpyGrey@reddit

I tend to start by identifying the core data entities and how they are wired up. 99% of the time, that means digging into the database first and working out how things are stored in there and how actions in the application change that. Roughly my process is

Dig into the DB and identify how things are stored
Identify API endpoints/key UI actions and work through how data flows through them
Deep dive into specific areas to identify details of how a specific piece of functionality works
Repeat from the top as required.
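
The first step can often be scripted. As a rough sketch, assuming a SQLite database (other engines expose the same information through their own catalog views), the snippet below lists tables, columns and foreign keys. `describe_schema` and the two demo tables are made-up names.

```python
import sqlite3


def describe_schema(conn):
    """Return {table: {"columns": [...], "foreign_keys": [(column, ref_table, ref_column)]}}."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        foreign_keys = [
            (row[3], row[2], row[4])
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        schema[table] = {"columns": columns, "foreign_keys": foreign_keys}
    return schema


# Demo on an in-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    """
)
schema = describe_schema(conn)
for table, info in schema.items():
    print(table, info)
```

The foreign keys are the interesting part: they give the edges of the entity diagram before any application code has been read.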

[-]

k032@reddit

I think it really just takes time, you aren't going to be an expert and know all the ins and outs day one.

Eventually you just start owning sections or features. Just asking questions (to coworkers or AI) for features and parts as you need them.

[-]

Professional_Mix2418@reddit

$ claude /init

😎

[-]

throwaway0134hdj@reddit

Use a tool like Sourcetrail to visualize the codebase as an interactive dependency graph.

And I know it’s rather difficult to find this bc most projects don’t have a single entry point, but ask members of the team where the program “starts” and for the “entry points”.

Get an understanding of the tech stack by looking at the package.json/requirements.txt or whatever dependency tooling they use.

After that, look at the database schema, figure out how entities are mapped, and understand those relationships.

Looking at the tests can also be a great way to understand the codebase. As well as the git blame/history.

Get someone on the team to explain their understanding first and then use the dependency graph tools. And poke around.

[-]

boring_pants@reddit

"Look at the tests" and "use the debugger" are my two main tips.

Assuming the code base has decent test coverage there are probably tests using the KVCache class. So look at how they do it.

Alternatively, find a place where the class is currently being used, put a breakpoint there, and step through it in the debugger. That's an excellent way to poke through abstraction and indirection. Just step into the call and you can see exactly which function actually ended up being called, and with which parameters.

[-]

WhitelabelDnB@reddit

You could ask any coding agent (eg Claude, Codex, GHCP) to help you with that specific task and it would do a great job, especially since KVCache is already in the codebase.

For the specific example you've given, language is relevant to some degree too. You've mentioned there's a class. Is this a fully OOP codebase, with directories for classes/models, interfaces, services, etc? If there are, then the KV should be isolated in a service, and you should be able to focus your efforts there.

AI tends to hallucinate when it's pressed for an answer, but doesn't have the information. Exploring a codebase is a great example of where even older, cheaper, faster models can do a great job, as long as the harness is good, because all of the information is already there and it can answer its own questions. Just give it a go. Ask for proof and citations.