Sprint planning has just started. You - a software developer - are going through the assigned tickets. The pressure is on to get everything done in time, and that little voice inside your head goes:
“These are just small changes, they don’t need an actual design… right? Right!”
… and while that may “work” for some of the tasks, eventually one of these “small changes” will have unclear requirements that only get clarified once it hits production, will involve a complete redesign of the darkest corners of your monolith, or will simply require a lot more work than originally thought.
This post is for that moment.
I’m a big believer that architecture is not a one-off event where you summon everyone into a room, draw boxes on a whiteboard, and then carve the diagram into stone tablets.
Architecture is rinse and repeat:
Make a plan
Get feedback early from stakeholders
Ship something
Reality disagrees
Adjust
Repeat until the system and the business are both mostly happy.
The points compiled below are meant to be a lightweight guardrail, not heavy bureaucracy. For a small feature, you might just skim it mentally. For something bigger, like a new service, a new database, or a change in a not-so-well-documented area in the system, you probably want to write this down and run it past a few humans (including security).
Use your judgment. Some items are “must document”. Others can stay in the back of your head.
Architecture Is Iterative - “Draw The Rest of the Owl” Is a Trap

Before we start, this point is worth hammering home.
Odds are, you’re not going to get the architecture perfect on the first attempt. And that is fine. The goal is to make the cheapest possible mistakes as early as possible:
Get a rough design out quickly (Architectural Decision Record, Google Doc, Slack, whatever).
Identify main actors: teams, external systems, vendors, and users.
Ask for early feedback from other Engineering teams, Security, Product, or any team that will own or support this in Production.
Expect to iterate. Good architecture is usually the result of several rounds of refinement, not a single stroke of genius.
Note: As Martin Fowler beautifully points out here, it is just as worthwhile to invest in architecture for internal tasks. Internal quality actually reduces cost over time: high-quality internals make the system easier and cheaper to change.
AI‑Driven Development - Superpower, Not Autopilot

Bear with me before we dive into the list, as this point is also crucial.
At Zenity, we’re firm believers in AI‑driven development. It’s an incredible accelerator for design, coding, and documentation, but only if you treat it as a powerful assistant, not as the architect of record. Challenge its assumptions, don’t follow it blindly, and use it with caution!
From experience, these are tasks AI generally performs well:
Clarifying ideas and requirements
AI is great at formatting, refining requirements, and acting as a curious rubber duck that asks questions you might miss. Just don’t let it guess what the feature should be - that’s (still) on us, humans.
Exploring code and systems
If the feature involves a large codebase or legacy code, your preferred AI-assistant can give you good insights. Don’t hesitate to use it as an exploratory tool whenever possible, for navigating the code, understanding behavior, or even generating documentation.
Brainstorming designs and tradeoffs
Ask for alternatives, migration strategies, rollback approaches, and then ask it to play devil’s advocate.
These are - again, from my experience - things to look out for:
Overconfident nonsense (hallucinations)
Always validate AI suggestions, especially if they sound too good to be true.
Overengineering everything
Agents tend to overengineer and add unnecessary complexity to solutions, even more so if you set inaccurate scalability goals. Keep checks and balances between the two of you, and guide the agent toward the simplest solution that still meets the requirements.
Ignoring real‑world constraints
Depending on the context your AI is exposed to, it may have no idea who’s on your team, how mature your infra is, or what your organizational constraints are. A “perfect” architecture that your team can’t realistically build, run, or own is still a bad architecture.
Best practices that I adhere to are:
Being in the driver’s seat
“Because AI said so” is not an acceptable answer.
Being explicit with context and intent
Invest time and effort in your prompts. Good prompts make good outputs.
Using AI to challenge, not just assist
Compare different models, and keep challenging assumptions: “What’s missing? What will hurt in 6-12 months? Where are the likely bottlenecks? Any security red flags?”
Used well, AI makes this whole checklist faster and more effective. Treat it as a strong collaborator, keep your critical thinking switched on, and you can get the best of both worlds.
And now, without further ado, the checklist:
1. Start With the Functional Requirements (Always)

Before security, cost, or scalability, lock down the use case:
Who is the user? What are they trying to achieve? What are the key flows / scenarios?
What are the hard constraints? (latency, regulatory, uptime, etc.).
Just as important: what is explicitly out of scope?
Don't be afraid to ask questions. There is no such thing as a stupid question.
If you can’t summarize the feature in a few clear sentences, your architecture will be fuzzy too.
2. Security - Bring It In Early, Not as a Final Boss
At Zenity, we run internal security reviews as part of our design process. That’s not a formality, it’s how we make sure we’re building Secure by Design, instead of duct taping security on later.
A few things to think through:
Data classification
What data are we storing or processing? Is it personal data, secrets, credentials, or anything regulated (e.g. GDPR)?
Threat modeling
Who could realistically attack this? What are they after - data exfiltration, account takeover?
Authentication & Authorization
How do we authenticate users/services? How do we enforce least privilege? Do we need new roles/permissions? (see the sketch after this list)
Supply chain & dependencies
Any new third-party services, SDKs, or open-source components? Do they meet our compliance/security requirements?
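To make the least-privilege question above concrete, here is a minimal sketch of the kind of explicit, fail-closed authorization check a new endpoint or service boundary should have. The role and permission names are hypothetical, and in a real system the mapping would come from your identity provider or policy engine rather than a hard-coded dict:

```python
from enum import Enum

class Permission(Enum):
    REPORTS_READ = "reports:read"      # hypothetical permission names
    REPORTS_EXPORT = "reports:export"

# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS: dict[str, set[Permission]] = {
    "viewer": {Permission.REPORTS_READ},
    "analyst": {Permission.REPORTS_READ, Permission.REPORTS_EXPORT},
}

def authorize(role: str, required: Permission) -> None:
    """Fail closed: deny unless the role explicitly grants the permission."""
    if required not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role!r} is missing {required.value!r}")

authorize("analyst", Permission.REPORTS_EXPORT)    # allowed
# authorize("viewer", Permission.REPORTS_EXPORT)   # would raise PermissionError
```

The point isn’t the implementation; it’s that a design that names its roles and permissions up front is much easier to review for least privilege.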
3. Cost - What’s This Going to Cost (Now and Later)?
For clarity, cost is best split into two categories:
Run cost
Cloud resources: compute, storage, network, managed services.
Licenses and SaaS subscriptions.
Change cost
Engineering time: how hard will this be to maintain and evolve?
Developers often overlook these points in particular. Money matters! Frameworks like AWS Well-Architected explicitly highlight cost optimization as one of the core pillars of good architecture, alongside security, reliability, and performance.
Good questions to ask:
What’s the approximate monthly infra cost in each environment? (see the sketch below)
Are we introducing an expensive managed service “just in case”?
Can we reuse existing infrastructure instead of starting something new?
Are we creating complexity that will be expensive to change later?
Note: Agents, and AI in general can help expedite this research, or at the very least give an initial rough estimate of costs. Gemini and ChatGPT (GPT-5.1 Pro) are my preferred partners in crime, with thorough and more reliable answers to “research costs and alternatives for this feature” type prompts.
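Whether you do that research yourself or with an AI assistant, a back-of-envelope calculation is usually enough to start the conversation. A minimal sketch, where every price and instance count is made up for illustration:

```python
# Back-of-envelope monthly run cost per environment.
# All numbers below are hypothetical; plug in your provider's actual pricing.
HOURS_PER_MONTH = 730
STORAGE_PRICE_PER_GB_MONTH = 0.08

environments = {
    "dev":  {"instances": 1, "hourly_rate": 0.10, "storage_gb": 50},
    "prod": {"instances": 3, "hourly_rate": 0.10, "storage_gb": 500},
}

for name, env in environments.items():
    compute = env["instances"] * env["hourly_rate"] * HOURS_PER_MONTH
    storage = env["storage_gb"] * STORAGE_PRICE_PER_GB_MONTH
    print(f"{name}: ~${compute + storage:,.0f}/month "
          f"(compute ${compute:,.0f}, storage ${storage:,.0f})")
```

Even a rough number like this makes the “do we really need that managed service?” discussion concrete.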
4. Scalability - How Far Does This Need to Go?

Not everything needs to reach “planetary scale” levels, but your design should at least address:
What’s the expected load in the coming 6 months?
Note: in some markets, especially new ones, it is very hard to predict scale. The best example currently is AI, which is exploding at an exponential rate. At Zenity, we are aware of the challenge and adhere to the old adage "hope for the best, prepare for the worst". This also shows how important the iterative aspect of software architecture is - constantly adjusting is key.
Where could the bottlenecks be: DB writes / third-party API rate limits / a single worker process?
Can we scale:
Horizontally (more instances)? Vertically (bigger instances)? Asynchronously (queues, batch jobs - see the sketch below)?
Tie this back to the use case and business expectations. Overengineering for millions of users when you have 500 is just adding unnecessary complexity.
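If asynchronous scaling is on the table, it helps reviewers to see where the queue sits and what scales independently. A minimal stand-in using only Python's standard library; a real design would use a managed broker (SQS, RabbitMQ, Kafka, etc.):

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for a managed queue

def worker() -> None:
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut this worker down
            break
        # ... the slow part: heavy DB writes, third-party API calls, etc.
        jobs.task_done()

# The request path only enqueues and returns quickly; throughput is scaled
# by adding workers (horizontally), not by making callers wait.
workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()

for job_id in range(100):
    jobs.put(job_id)

jobs.join()                      # wait for the backlog to drain
for _ in workers:
    jobs.put(None)
for t in workers:
    t.join()
```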
5. Monitoring & Observability - Prefer Metrics Over Logs

Google’s SRE guidance recommends monitoring a small set of signals for every service - the “Four Golden Signals”: latency, traffic, errors, and saturation. You can build a lot of effective alerting and dashboards from just those.
When designing a new feature or service, keep the following in mind:
Define key metrics
Latency for critical endpoints / Error rate (HTTP 5xx, failed jobs) / Throughput (requests/sec, jobs/min) / Resource saturation (CPU, memory, DB connections). A sketch follows at the end of this section.
Plan your dashboards & alerts
What dashboards do we need for On-call/Product/Analytics?
What conditions should page a human vs. just log an event?
As part of your design, you want a short Observability section:
Metrics we’ll capture.
How we’ll visualize them.
Alert thresholds (even if rough to start with).
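For example, if you use Prometheus-style metrics, the “metrics we’ll capture” part of that section can be sketched in a few lines. The metric and endpoint names below are placeholders, not a prescribed naming scheme:

```python
from prometheus_client import Counter, Gauge, Histogram

REQUEST_LATENCY = Histogram(
    "export_request_seconds", "Latency of the export endpoint", ["endpoint"]
)
REQUEST_ERRORS = Counter(
    "export_request_errors_total", "Failed export requests", ["endpoint", "code"]
)
JOBS_IN_FLIGHT = Gauge("export_jobs_in_flight", "Export jobs currently running")

def handle_export_request() -> None:
    with REQUEST_LATENCY.labels(endpoint="/export").time():   # latency + traffic
        JOBS_IN_FLIGHT.inc()                                   # saturation
        try:
            ...  # the actual work
        except Exception:
            REQUEST_ERRORS.labels(endpoint="/export", code="500").inc()  # errors
            raise
        finally:
            JOBS_IN_FLIGHT.dec()
```

The histogram’s request count doubles as a throughput signal, so all four golden signals are covered by three metrics.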
6. KPIs - What Does Success Look Like?

Shipping alone isn’t success. Outcomes are.
What will we measure to decide if this is working?
What’s the target we’re trying to hit?
Fewer support tickets / Faster response time for a specific flow / Fewer on-call incidents
Note: At the end of the day we are trying to make someone's life better (Customer/On-call/Product/Engineering). Try to boil it down to concrete, measurable goals.
It goes without saying, but KPIs also apply to internal tasks. Your own Engineering team might be the user impacted by the feature, so make sure it is possible to measure whether we actually accomplished what we committed to do.
A good rule of thumb when working on a design is to:
List the top 1-3 KPIs for this change
Plan how we’ll measure them (instrumentation, events, dashboards)
Decide over what time window we’ll evaluate success
7. Backwards Compatibility - Don't Break It If It's Not a Must

Make sure to always look at the bigger picture.
Can this be a breaking change?
Can we avoid it being a breaking change? If so, how?
How do we minimize impact? Who will communicate it to customers, and how?
The aim should be to always provide backwards compatibility, but if that’s not possible then overcommunicate and make no assumptions. Lay out a plan and make sure everyone in direct contact with customers - Support, Customer Success, Project Managers, etc. - is on the same page.
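When compatibility is achievable, the usual trick is to make the change additive: keep the old field or endpoint populated while the new one rolls out, and remove the old path only after clients have migrated. A minimal sketch with hypothetical field names:

```python
from dataclasses import asdict, dataclass

@dataclass
class ExportResult:
    download_url: str  # new field introduced by this feature
    file_path: str     # old field kept (and still populated) so existing
                       # clients don't break; deprecate in the docs, remove later

def to_response(result: ExportResult) -> dict:
    # Old clients keep reading file_path, new clients read download_url.
    return asdict(result)
```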
8. Testing & Deployment Plan
“Merging to main and hoping” is not a deployment strategy!
Testing strategy
What gets unit, integration, and e2e tests?
Should we run stress/performance testing?
Deployment strategy
Big bang, blue/green, or canary?
Any feature flags required? (see the sketch at the end of this section)
How do we avoid downtime? (or is some downtime acceptable?)
Rollout plan
Should we gradually roll out the feature? Who gets access first?
Is any knowledge transfer needed for the Support team?
It doesn’t have to be long, just enough detail so that another engineer can look at the design and say “Yes, I understand how we’re going to release this safely.”
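A feature flag with a percentage-based canary doesn’t need a fancy platform to be written into the design. A minimal sketch of deterministic bucketing; the flag name and helper are hypothetical, not a specific vendor’s API:

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic bucketing: a given user always gets the same answer,
    so the canary population stays stable while we watch the dashboards."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Start small, watch error rate and latency, then widen the rollout.
if is_enabled("new-export-pipeline", user_id="user-123", rollout_percent=5):
    ...  # new code path
else:
    ...  # existing behavior
```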
9. Rollback Plan - What If This Goes Sideways?

Every non-trivial change should have a clear rollback plan:
If we deploy this and it breaks, what’s the fastest way to remediate it?
Revert to previous version / Turn off a feature flag / Switch traffic back in a blue/green deployment?
Make sure that whoever has to deal with it - SRE, on-call - is informed in advance.
How do we handle it?
Database migrations: can we roll them back safely?
Schema changes: can old + new versions of the app run side by side? (see the expand/contract sketch below)
Can we roll back only for a subset of users/regions?
Write this down. Future you (or your on-call teammate) will thank you.
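For database changes in particular, the usual way to keep rollback possible is an expand/contract migration: add the new structure first, dual-write and backfill, and only remove the old structure once the rollout is proven and rollback is no longer needed. A sketch of the phases, with made-up table and column names:

```python
# Expand/contract migration phases. Each phase ships separately so the app
# can still be rolled back between phases.

EXPAND = [
    # Phase 1: additive only - the old app version keeps working untouched.
    "ALTER TABLE exports ADD COLUMN storage_url TEXT NULL",
]

BACKFILL = [
    # Phase 2: the app dual-writes both columns; backfill existing rows in batches.
    "UPDATE exports SET storage_url = legacy_path WHERE storage_url IS NULL",
]

CONTRACT = [
    # Phase 3: only after the new version is stable (and rollback is no longer
    # needed) do we drop the old column.
    "ALTER TABLE exports DROP COLUMN legacy_path",
]
```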
10. Does the Plan Fit the Timeline? (And What Gives If It Doesn’t?)

The classic project triangle: scope, time, quality. If the deadline is fixed and you’re forced to ship, something has to give.
As part of the architecture discussion:
Check whether the design is realistic for the given timeline.
Be explicit about tradeoffs.
The key is to have a negotiable plan. If the timeline is non-negotiable, then scope or quality tradeoffs must be visible and discussed.
Conclusion
You don’t need a 20-page document for every change, but you should run through most of this list mentally. Write things down when a change involves real risk, impact, or long-term consequences. Over time, as you practice these skills, they’ll become second nature and you’ll naturally expand your design process based on the needs of the task.
Note: AI has already changed the day-to-day work of developers - debugging, (vibe) coding, design reviews, you name it. At Zenity we are proud to be an AI-driven company, whether through building Claude Code Plugins to automate repetitive tasks, or through optimizing context-token usage to maintain AI efficiency and reduce cost. The same principle applies to everything listed above: AI can enhance every stage of the design process. For example, an orchestrator agent (with specialized sub-agents) can guide developers through complex architectural decisions. The sky is the limit, and many (most) tasks can now be drastically improved by adding AI to the picture.
