Conditional Access policies - how do you test without nuking production?

Posted by Either-Act-3406@reddit | sysadmin | 29 comments

We need to roll out CA policies across 1200 users. Microsoft docs say use report-only mode first, but that only tells you what WOULD happen, not what WILL actually break.

Our environment:

Mix of Windows 10/11, some BYOD
3-4 legacy apps that barely work with modern auth
Remote workers across 6 countries
No separate test tenant that mirrors production

Can't test emergency access accounts properly without actually locking ourselves out. Can't simulate real user impact without affecting real users.

What's your approach? Deploy to small group first? Use some third-party tool?

[-]

FunctionPitiful@reddit

Use this KQL Query:

// Conditional Access failures (enforced and report-only) per user and policy
let lookback = 5d;
union isfuzzy=true
    (SigninLogs | extend SignInType = "Interactive"),
    (AADNonInteractiveUserSignInLogs | extend SignInType = "NonInteractive")
| where TimeGenerated > ago(lookback)
// The two tables can type these columns differently (string vs dynamic), so the
// union may suffix them with _string/_dynamic; pick whichever variant exists.
| extend CAPolicies = coalesce(
    todynamic(column_ifexists("ConditionalAccessPolicies_string", "")),
    column_ifexists("ConditionalAccessPolicies_dynamic", dynamic(null)),
    column_ifexists("ConditionalAccessPolicies", dynamic(null)))
| extend StatusObj = coalesce(
    todynamic(column_ifexists("Status_string", "")),
    column_ifexists("Status_dynamic", dynamic(null)),
    column_ifexists("Status", dynamic(null)))
| extend DeviceObj = coalesce(
    todynamic(column_ifexists("DeviceDetail_string", "")),
    column_ifexists("DeviceDetail_dynamic", dynamic(null)),
    column_ifexists("DeviceDetail", dynamic(null)))
| extend LocationObj = coalesce(
    todynamic(column_ifexists("LocationDetails_string", "")),
    column_ifexists("LocationDetails_dynamic", dynamic(null)),
    column_ifexists("LocationDetails", dynamic(null)))
| where isnotempty(CAPolicies) and tostring(CAPolicies) != "[]"
// One row per policy evaluated for each sign-in
| mv-expand CAPolicies
| extend
    PolicyName = tostring(CAPolicies.displayName),
    PolicyResult = tostring(CAPolicies.result),
    GrantControls = tostring(CAPolicies.enforcedGrantControls)
// Keep only policies that blocked or interrupted, enforced or report-only
| where PolicyResult in ("failure", "reportOnlyFailure", "interrupted", "reportOnlyInterrupted")
| extend
    ErrorCode = tostring(StatusObj.errorCode),
    FailureReason = tostring(StatusObj.failureReason),
    IsCompliant = tobool(DeviceObj.isCompliant),
    IsManaged = tobool(DeviceObj.isManaged),
    Country = tostring(LocationObj.countryOrRegion)
| summarize
    HardFailures = countif(PolicyResult == "failure"),
    ReportOnlyFailures = countif(PolicyResult == "reportOnlyFailure"),
    Interrupted = countif(PolicyResult == "interrupted"),
    ReportOnlyInterrupted = countif(PolicyResult == "reportOnlyInterrupted"),
    DistinctIPs = dcount(IPAddress),
    Apps = make_set(AppDisplayName, 10),
    ClientApps = make_set(ClientAppUsed, 10),
    GrantControlsHit = make_set(GrantControls, 10),
    ErrorCodes = make_set(ErrorCode, 10),
    FailureReasons = make_set(FailureReason, 5),
    Countries = make_set(Country, 5),
    LegacyAuthSeen = countif(ClientAppUsed in ("Other clients", "IMAP", "POP", "SMTP", "Exchange ActiveSync", "Authenticated SMTP", "Exchange Web Services")),
    NonCompliantDevice = countif(IsCompliant == false),
    UnmanagedDevice = countif(IsManaged == false),
    LastSeen = max(TimeGenerated)
    by UserPrincipalName, PolicyName, SignInType
// Rough triage label per user/policy pair
| extend Severity = case(
    HardFailures > 0 and LegacyAuthSeen > 0, "High - hard fail + legacy auth",
    HardFailures > 0, "Medium - hard fail",
    ReportOnlyFailures > 0, "Tuning - report-only fail",
    "Low - interrupt only")
| order by HardFailures desc, ReportOnlyFailures desc, Interrupted desc, UserPrincipalName asc

[-]

bjc1960@reddit

I have "conditional access phobia". I stare 20 minutes wondering. I use the what-if tool, deployment groups etc.

We have about 40 rules.

[-]

You’re thinking about this like a binary switch when it’s really risk segmentation. Build dynamic groups (BYOD, legacy auth users, high risk geos, admins) and apply progressively stricter CA layers per group. Use sign in logs to identify apps still using legacy protocols before enforcement. The biggest mistake is deploying “Require compliant device + MFA + geo restrictions” in one go. Roll out by blast radius. Measure failed sign ins daily. Adjust. Repeat. CA rollouts fail because people chase completeness instead of controlled impact.
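A rough sketch of that legacy-protocol check in KQL, assuming sign-in logs are exported to a Log Analytics workspace. The client-app list reuses the one from the query earlier in the thread, plus IMAP4 and POP3 as an assumption; compare it against the ClientAppUsed values your tenant actually logs before trusting the counts. The 30-day window is also just a placeholder.

```kql
// Sketch: which apps and users still sign in with legacy client types?
// Assumption: interactive sign-ins only (SigninLogs); widen to other tables if needed.
let lookback = 30d;  // assumption: adjust to your workspace retention
let legacyClients = dynamic([
    "Other clients", "IMAP", "IMAP4", "POP", "POP3", "SMTP",
    "Exchange ActiveSync", "Authenticated SMTP", "Exchange Web Services"]);
SigninLogs
| where TimeGenerated > ago(lookback)
| where ClientAppUsed in (legacyClients)
| summarize
    SignIns = count(),
    Users = dcount(UserPrincipalName),
    SampleUsers = make_set(UserPrincipalName, 10),
    LastSeen = max(TimeGenerated)
    by AppDisplayName, ClientAppUsed
| order by SignIns desc
```

Anything that shows up here is a candidate for its own group and a later enforcement phase, rather than being swept into the first strict policy.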

[-]

its_FORTY@reddit

Create a proper representation of your user base mix in a restricted OU structure that you can test against.

[-]

Useful-Process9033@reddit

To be fair, the legacy app angle makes this harder than a standard CA rollout. Modern auth apps are predictable, but that custom PHP app using basic auth against Entra is the thing that will break silently, and you won't know until accounting can't run payroll.

[-]

Vektor0@reddit

OP is a wannabe entrepreneur looking for AI opportunities. That's why the post is amateur and nonsensical.

[-]

its_FORTY@reddit

It’s definitely a peculiar question to make a post about, since it’s basically sysadmin 101 level change management. None of the risks/considerations he mentioned are even outside what most of us here manage on a daily basis.

[-]

FearAndGonzo@reddit

I think maybe what people are trying to get at is where do you link the OU to the CA policy in the scenario you offer.

[-]

andrew181082@reddit

Groups in Entra, not OUs

[-]

its_FORTY@reddit

Yes, I could see a valid test plan that goes that route as well. It might even be easier to selectively target the accounts you want to test against, and less likely to hit any that you don't.

[-]

Select-Holiday8844@reddit

I second this arrangement.

A representative OU structure wouldn't need to be large. Maybe 20-50 users. A pilot group.

[-]

Feisty-Swordfish-796@reddit

I just test with my own normal user account first. If that works without any issues, I test with some people in IT. If that still works, I test with our test users in different departments working from different countries, both from the offices and from home. If everything is still working fine, I put it in report-only for the whole company to catch users with issues before rolling it out. After I am done testing I activate it for the whole company, and something strange normally breaks anyway.

[-]

Useful-Process9033@reddit

This is the way. Graduated rollout with real accounts beats any simulation tool. The what-if tool is useful for sanity checks but it will never catch the weird edge cases like a legacy app that silently falls back to basic auth. Real users hitting real apps is the only test that matters.

[-]

Patient-Stuff-2155@reddit

We have test user accounts and a break glass global admin with security key login.

[-]

itguy9013@reddit

Scope to your test accounts first. Test workflows. Then push to production.

[-]

gixxer-kid@reddit

Test against test users.

Enable a log analytics workspace so you can look at insights.

Test. Test. Test.
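Once sign-in logs land in the workspace, something like the following gives a per-policy view of report-only results. It's a minimal sketch: it only reads the interactive SigninLogs table, and the 7-day window is an arbitrary assumption.

```kql
// Sketch: report-only outcomes per Conditional Access policy
let lookback = 7d;  // assumption: pick a window that covers your pilot
SigninLogs
| where TimeGenerated > ago(lookback)
| mv-expand Policy = ConditionalAccessPolicies   // one row per evaluated policy
| extend
    PolicyName = tostring(Policy.displayName),
    Result = tostring(Policy.result)
| where Result startswith "reportOnly"           // e.g. reportOnlySuccess, reportOnlyFailure
| summarize
    SignIns = count(),
    Users = dcount(UserPrincipalName)
    by PolicyName, Result
| order by PolicyName asc, SignIns desc
```

A policy with a large reportOnlyFailure or reportOnlyInterrupted count relative to reportOnlySuccess is the one to look at before flipping it to On.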

[-]

thewunderbar@reddit

report only mode is your friend.

And then testing on a few users is your friend.

No different than anything else we've done for 25+ years. Test on a small group first. This isn't rocket surgery.

[-]

inarius1984@reddit

Break Glass account/process (document this very well, practice the process, etc.) and exclude from CA in case of unforeseen consequences somehow. Don't want to get locked out of your own M365 tenant.

[-]

trebuchetdoomsday@reddit

lol pull the ripcord and deploy on All Users including yourself for max permafuck /s

[-]

Knyghtlorde@reddit

Pilot rings.

Smaller group, larger group, everyone else.

[-]

ProfessionalWorkAcct@reddit

Crawl Walk Run

[-]

TechAdminDude@reddit

Report-only is the right starting point, but you're correct that it doesn't tell the full story, especially with legacy apps and BYOD in the mix, which you probably have.

Mine your sign-in logs first. Before deploying anything, pull 30 days of sign-in logs and filter by the apps you're worried about. You'll see exactly which client types, auth methods, and locations are in use. Surprises here are cheaper than surprises in production.
Use the What-If tool surgically. It's limited but useful for spot-checking specific user + app + location combos. Run it against your known problematic users before go-live.
Pilot order matters. IT team > low-risk department > remote workers last. Legacy app users get their own phase with a longer soak time in report-only.
Emergency access accounts!!!!! (This is mega important.) Create two break-glass accounts, exclude them from ALL CA policies by UPN (not group), and store creds in a physical safe. Test them monthly: actually authenticate, don't just check they exist.
For the legacy apps: if they don't support modern auth, you're looking at Entra app proxy (super easy to set up) or accepting the exclusion and mitigating with other controls.
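The log-mining step could look roughly like this in KQL, assuming the sign-in logs are in a Log Analytics workspace with at least 30 days of retention. The app names in appsOfInterest are hypothetical placeholders; swap in the display names of the apps you're worried about.

```kql
// Sketch: 30 days of sign-ins for a handful of apps, grouped by client type
let lookback = 30d;
let appsOfInterest = dynamic(["Legacy Payroll App", "Old Intranet"]);  // hypothetical names
union isfuzzy=true
    (SigninLogs | extend SignInType = "Interactive"),
    (AADNonInteractiveUserSignInLogs | extend SignInType = "NonInteractive")
| where TimeGenerated > ago(lookback)
| where AppDisplayName in (appsOfInterest)
| summarize
    SignIns = count(),
    Users = dcount(UserPrincipalName),
    AuthRequirements = make_set(AuthenticationRequirement, 5),
    Countries = make_set(Location, 10)
    by AppDisplayName, ClientAppUsed, SignInType
| order by AppDisplayName asc, SignIns desc
```

Only string-typed columns are used on purpose, so the union of the two tables doesn't run into the string-vs-dynamic column mismatch.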

For what it's worth, I've had this exact "report-only doesn't show real impact" problem and ended up building a tool that analyses your existing policies and flags conflicts, gaps, and risky exclusions before you deploy. Happy to share a link if useful, fire me a DM, but the manual approach above will get you most of the way there.

[-]

FearlessAwareness469@reddit

I'm kind of with everyone else here. Just turn them on. Then use sign-in logs to troubleshoot. Also use a separate policy for each thing. We have one for requiring MFA. One for persistence. One for session duration. A separate group of those for IT. One for blocking non-US without MFA, etc.

[-]

Unfair-Plastic-4290@reddit

you dont. click with conviction and hope for the best.

[-]

Ciddie@reddit

There is literally a 'report only' toggle. The only gotcha is it's not compatible with macOS, so you should be OK. Pilot it in report-only for a few hours/days, then scope to a pilot security group and roll out that way. Ensure your break glass account is excluded; I like to exclude our office (as an MSP) as well via trusted locations. Things that may trip you up are enterprise apps using SSO: the sign-in logs in report-only mode should help you identify whether they are going to be an issue.

If you want to be safer around BYOD you can exclude filtered devices and filter for Entra registered. It should give any device that's currently got access a pass; just be careful, as it could give malicious devices that are already registered a continued way in.

[-]

Select-Holiday8844@reddit

They explained the report-only toggle is not doing it for them in the first line under the title of this post.

> "We need to roll out CA policies across 1200 users. Microsoft docs say use report-only mode first, but that only tells you what WOULD happen, not what WILL actually break."

Good points on the break glass account and excluding filtered devices advice.

[-]

KavyaJune@reddit

Have you tried the What-If tool in CA policy? With What-If, you can simulate scenarios before enforcing them in production.

https://blog.admindroid.com/what-if-tool-to-test-conditional-access-policies-in-entra-id/

[-]

AppIdentityGuy@reddit

Deploy in report-only mode and then deploy to a small set of test users. But you need a strategy with clearly defined goals. Take a look at the MS template policies.

Also maybe consider engaging with some outside expertise to assist.

[-]

andrew181082@reddit

Start with the what if tool with a policy which is off.

Then report-only on a small group (if they are on Android or iOS, the users will still get a popup, so warn them)

Review the logs and then communicate heavily before deploying to everyone

Make sure your break glass accounts are tested and working too
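One way to check that the break glass accounts really are being tested is to look for a recent successful sign-in from each. A minimal sketch, assuming interactive sign-ins are in SigninLogs; the UPNs are hypothetical placeholders and the 35-day window is an assumption meant to cover a monthly test with some slack.

```kql
// Sketch: has each break glass account had a successful sign-in recently?
let breakGlass = datatable(UserPrincipalName: string) [
    "breakglass1@contoso.example",   // hypothetical UPNs, lower case to match the join key below
    "breakglass2@contoso.example"
];
breakGlass
| join kind=leftouter (
    SigninLogs
    | where TimeGenerated > ago(35d)
    | extend UserPrincipalName = tolower(UserPrincipalName)
    | summarize
        LastSuccess = maxif(TimeGenerated, ResultType == "0"),   // "0" means success
        FailedAttempts = countif(ResultType != "0")
        by UserPrincipalName
  ) on UserPrincipalName
| project UserPrincipalName, LastSuccess, FailedAttempts
| extend Tested = isnotnull(LastSuccess)
```

The left outer join keeps accounts that never signed in during the window, so a row with Tested = false is the one to follow up on.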