Lucky Guess or Experience? You Be the Judge!
Posted by bobarrgh@reddit | talesfromtechsupport | View on Reddit | 26 comments
I had a situation today which caused me a little panic, until I was able to think about it clearly.
On one of our website servers, there is a fairly strong and sometimes persnickety caching mechanism. It is so persnickety that when we make an edit to a page -- such as a blog post -- we have to check the page in incognito mode. Otherwise, if we are logged into the CMS and visit the page in regular mode, the update will appear for us, but it won't appear for others until the cache is cleared. However, I don't know what the cache retention policy is, so usually we just clear the cache after an update and move on.
Today, a change was made to a page and it was passed over to me for my QA review, so I checked it in a new incognito session. The update had been made and everything was happy, so I reported up the chain that the update had been verified.
About 15 minutes later, the account person responsible for that website chatted me and said that she was not seeing the update. She has been bitten with cache issues before, so when she chatted me, she said that she had tried Chrome in both regular and incognito mode, and had also tried Safari. The update was not showing up on any of her browser instances.
I had someone else double-check for me, and that person was able to see the updates.
It was somewhat reminiscent of a problem I had encountered several years ago when I was at another company. In that instance, we had a weird load balancer situation, and a person would get assigned to one of the two load balancer URLs. So, instead of randomly getting Server1 or Server2, if you were assigned to Server1, it took a random, cosmic event of the universe to get you switched over to Server2. (Yeah, I know, that's not how load balancers are supposed to work. Don't care, that was about 8-10 years ago.)
Anyway, I knew that was not the issue in this case, because we don't have a load balancer, but something was preventing the user from seeing the updates, even though others could see it.
We got on a conference call and she even showed me that she was starting with a new incognito session. I even had her send me the URL she was using, thinking that maybe there were two instances of this page, but with different URLs.
Nope. Same URL, new incognito session, hard refreshing two or three times ... update still not showing for her.
Then, she happened to mention, "I even tried it on my phone, and I'm still not seeing the update."
Everything is pointing to a stubborn cache somewhere between her and the website. She is about 175 miles away from me under a different ISP, so we definitely are not going through the same intermediate hops.
Then I asked her, "Is your phone going through your home's wifi?"
Turns out, it was, so she turned off that setting on her phone and hit the page using her phone's data connection. Hmmm ... the updates are appearing ... how nice!
From what I can tell, either her #WifiRouterModemThingie has some sort of stubborn cache mechanism, or, one of the hops she is going through has the stubborn cache.
So ... lucky guess or experience? You be the judge.
(Also, does anyone else have any suggestions on how I can check where the cache mechanism could be located? The user on the other end is not technical, so doing a tracert is not really an option.)
Valheru78@reddit
This is actually how a load balancer can work if you have persistent sessions enabled.
bobarrgh@reddit (OP)
It's been a while and I've slept once or twice since then, but I think we didn't have persistent sessions enabled, and it was still quite sticky. But, I do appreciate your feedback.
frymaster@reddit
another thing might have been if the choice of destination back-end was based on a hash of the source IP or similar - then the only way you'd end up on a different back-end would be if there was a change in the number of back-end instances (due to failures, maintenance, and scaling for load)
Valheru78@reddit
Well, you reminded me of an issue which was quite the opposite: people kept being switched to a different server, and then their shopping basket would be empty. After debugging, it turned out we needed persistent sessions enabled. It was my first load balancer experience, so I won't ever forget it. Took us three days to figure out 😅
TheCollegeIntern@reddit
HTTP Archive (HAR) captures are what I use to try to solve stuff like this.
dreaminginteal@reddit
I used to work for the load balancer group of a large multinational tech corp. Before I joined, they had instances where they would get bit-flip errors causing issues with their device. Turns out that the culprit was literally cosmic rays occasionally flipping bits in memory!!
I would not have wanted to be the person in charge of troubleshooting that one...
fluffy_in_california@reddit
Several years ago I saw a fantastic talk about random bitflips being used in DNS hijacking with an actual proof of concept demonstration.
You can register a name that is just one bit different from a very popular name, and a tiny, tiny percentage of the people connecting to the correct domain... get your IP address instead.
It can be leveraged into a credentials hijack.
cracksation@reddit
You don't happen to still have a link to that talk on hand, do you? That sounds really interesting and I'd be interested in checking it out if you're able to share.
HammerOfTheHeretics@reddit
I remember a similar problem with a Cisco switching ASIC I worked on years ago. Occasional particle decays in the chip packaging would cause particular bits in memory to 'latch on', which would corrupt the hardware forwarding tables. We had to add a detector to the hardware driver that locked off the affected table entries. Fun times.
dreaminginteal@reddit
I wonder if that was the same incident? Hmm....
HammerOfTheHeretics@reddit
Probably not. This was the ASIC that powered the Catalyst 4000 and 4500 series of gigabit ethernet switches. But I think the basic problem with energetic particles screwing with nanometer scale integrated circuits affected a lot of products. Physics is a harsh mistress.
ManWhoIsDrunk@reddit
Random bitflips are weird...
And since a gigabyte is 8 billion bits, even a billion-to-one chance per bit will happen 8 times per gigabyte on average. So it's definitely something one has to account for when dealing with large volumes of data.
HelpfulPuppydog@reddit
Luck or skill, whatever gets the job done, and you go on to the next ticket.
s-mores@reddit
Could have an ISP cache.
If there's someone with a good idea for checking these kinds of caches I'm also interested.
fluffy_in_california@reddit
Use a cache buster string to catch it. Easiest way is to simply append a garbage parameter to the raw URL: ?cachebust=-923i4knsdf
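A minimal sketch of that in Python (the parameter name `cachebust` is just the one from the example above; any unused name works):

```python
import urllib.parse
import uuid

def cache_bust(url: str) -> str:
    """Append a throwaway query parameter so caches along the way
    treat the URL as one they've never seen before."""
    parts = urllib.parse.urlsplit(url)
    query = urllib.parse.parse_qsl(parts.query)
    query.append(("cachebust", uuid.uuid4().hex))  # any garbage value works
    return urllib.parse.urlunsplit(
        parts._replace(query=urllib.parse.urlencode(query))
    )

print(cache_bust("https://example.com/blog/post"))
```

If the page shows the update with the busted URL but not without it, something between the browser and the origin is caching the plain URL.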
bobarrgh@reddit (OP)
Thank you for the reminder to use a cache buster parameter!
On that previous site I mentioned, I had a standing policy that whenever we deployed a JS or CSS file to the site, the references to all the JS and CSS files in the global page template had to be updated with something like "?ts=202410241630" (i.e., timestamp 10/24/2024 4:30 PM) so that fresh copies of all the JS and CSS files would get pulled in.
The reason why this was timely was because I had another situation on the same website as the one that had the problem this morning where an update was not getting seen even though the cache had been cleared in the CMS. I retried the URL my content editor had updated and I put in a query parameter of "?foo=bar", and, lo and behold, it worked!
So, definitely there is something happening after it leaves our server. It wasn't the exact same thing that happened with my user this morning, but it certainly does show that something funky is going on.
Thanks again for the reminder.
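That template-stamping policy could be automated with something like this (a sketch, assuming the asset references live in a simple HTML template and aren't already versioned):

```python
import re
import time

def stamp_assets(html: str, ts: str) -> str:
    """Append a ?ts= timestamp to every .js/.css reference in a page
    template, forcing browsers to fetch fresh copies after a deploy."""
    return re.sub(
        r'((?:src|href)="[^"?]+\.(?:js|css))"',
        rf'\1?ts={ts}"',
        html,
    )

template = '<script src="/site.js"></script><link href="/site.css" rel="stylesheet">'
print(stamp_assets(template, time.strftime("%Y%m%d%H%M")))
```

The `[^"?]+` in the pattern skips references that already carry a query string, so re-running it on an already-stamped template won't double-stamp.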
Loading_M_@reddit
The more common option I've seen is including the file's hash in the URL. CDNs do this, although I believe it's more to allow serving multiple versions of the same file.
JS and CSS includes can (and should) specify their hash in the HTML, so the browser can check it got the right file.
raip@reddit
Sounds like you might be using a CDN.
HINDBRAIN@reddit
Just use the version number?
evanldixon@reddit
This. Caching is good if the thing isn't changing, which it won't unless you release a new version
Valheru78@reddit
This is the way.
K1yco@reddit
One thing I've learned is that if you can't figure something out, some times you just have to try something silly/dumb, and it turns out to be the issue.
Customer was having a weird issue with a few programs that kept closing. We tried just about everything and couldn't figure it out, so I said "well, let's just unplug your game controller".
Once that happened, the programs stopped closing.
AshleyJSheridan@reddit
I've had this before with a mobile phone carrier. They were caching what they deemed as cacheable assets (CSS and images mostly). It was pretty annoying, because I had to then go around and add in cache-busting parts to the URLs for basically everything.
I actually turned this into an interview question, where I asked the interviewee to list out the types of caching involved in a website and talk through each they knew of. I wasn't using this as a trick question, more to gauge their level of knowledge.
Ricama@reddit
Not the a... I mean not luck, skill: you were looking for a point of commonality between the two machines.
ttlanhil@reddit
That's happening on just one server, and not others?
That'd be concerning - all servers should be set up the same
To deal with the problem directly - it might not be the caching itself, it might be cache headers (which tell the browser, and CDNs or caching proxies in between, whether it's okay to cache).
If you can check network tab in developer tools when you're getting a cached response (i.e. your own incognito mode checks), I'd suggest looking for a cache-control header that's not set correctly (you don't want a high max-age for pages that you update regularly)
Or you might see a HTTP 304 (which is the server telling the browser "show the version you previously had, it hasn't changed")
Common if the server doesn't realise the page has changed (because it's not set up to always pass the request through to the CMS server), or if the time on the server is wrong.
When you're logged in to the CMS, you'll be sending a session cookie, which can bypass caching (I'm simplifying a little)
If the server is giving incorrect cache information, then it's perfectly valid for any step along the way to be caching it, giving you the odd results (and might also be possible for the phone to detect a network change, and hence invalidate its own cache)
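To make the header check concrete, here's a small helper for eyeballing Cache-Control values copied out of the network tab (the header string below is just an example):

```python
def parse_cache_control(header: str) -> dict:
    """Parse a Cache-Control header into a directive -> value dict."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        # valueless directives like "no-store" become True
        directives[name.lower()] = int(value) if value.isdigit() else (value or True)
    return directives

cc = parse_cache_control("public, max-age=3600, s-maxage=600")
if cc.get("max-age", 0) > 60:
    print("max-age is", cc["max-age"], "- probably too high for a page you update often")
```

A high max-age (or a missing no-cache/must-revalidate) on a frequently edited page is exactly what lets an intermediate cache hold onto stale copies.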
As for tracert - you mostly can in reverse!
Get the user to visit https://example.com/?q=findmephone on their phone, and equivalent on desktop. Then check the logs on the server for their IP address. Something simple enough to type, but distinct enough you can easily find it in the logs.
You probably won't get responses from right at their end, but you can probably get up to the phone vs broadband ISP level
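Once the marker request is in the access logs, finding it is a one-liner. A sketch, assuming a common/combined log format where the client IP is the first field (the sample lines are invented):

```python
def find_marker_hits(log_lines, marker="findmephone"):
    """Return client IPs of requests carrying the marker query string.
    Assumes the IP is the first whitespace-separated field."""
    return [line.split()[0] for line in log_lines if marker in line]

logs = [
    '203.0.113.7 - - [24/Oct/2024:16:30:01] "GET /?q=findmephone HTTP/1.1" 200 512',
    '198.51.100.2 - - [24/Oct/2024:16:30:05] "GET /blog/post HTTP/1.1" 200 4096',
]
print(find_marker_hits(logs))  # ['203.0.113.7']
```

If the marker request never reaches the logs at all, that's strong evidence an intermediate cache answered it without ever contacting the origin.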
Of course, if you have remote desktop tools and the user is due a coffee break, you may be able to do all that diagnostic directly as well.
Good luck! Caching is one of the Big Fun Problems
ilovemybaldhead@reddit
This has happened to me. I have a WordPress site, it has some cache management. I always clear the cache when I make a change because of experiences similar to yours. This one time the change didn't take effect, even though I checked it from different Chrome profiles, different browsers, different machines... then I used a VPN, and bingo! The change was there.
I hate caching. I would rather wait the extra second and know I'm getting up-to-the-second data.