Regex Are Not the Problem. Strings Are.
Posted by Mirko_ddd@reddit | programming | View on Reddit | 31 comments
I think this point of view may seem controversial, but it follows a historical precedent that is easy to relate to (the Joda-Time case) and shows how the same idea could apply to the world of regular expressions, much like the move away from manual SQL in raw strings with the advent of jOOQ.
NoLemurs@reddit
Unpopular opinion, but regexes almost never belong in code.
Regexes are a fantastic user-facing tool to allow power users to have a lot of control, but in those cases, the input kind of has to be strings.
If you're dealing with a situation where your regex is defined at compile time, you're almost always using the wrong tool.
Mirko_ddd@reddit (OP)
Totally valid point for user-facing search tools, and Sift doesn't try to replace that use case. But compile-time validation logic is everywhere in enterprise code, and writing it as a raw String means zero compiler guarantees, silent failures, and a backtracking engine you can't swap out. That's the gap Sift addresses.
NoLemurs@reddit
I 100% agree that raw regexes in code are usually a bad idea.
But the solution isn't to wrap your regex (a special purpose language) in some second abstraction layer. The solution is to just write code in the actual programming language you're using to do whatever the regex does.
The resulting code might be a little longer than the regex version, but it will be much more maintainable, and won't require you to understand multiple levels of abstraction just to do what your programming language can already do perfectly well.
Mirko_ddd@reddit (OP)
I agree with your premise, but not the conclusion.
You are 100% right that raw regex strings don't belong in compiled code. They are cryptic and unreadable. But writing a 50-line custom parser just to validate a UUID or a VAT number is pure over-engineering.
The regex engine is the right tool; the string syntax is the problem.
That’s exactly why Sift exists. It gives you the performance of the regex engine, but replaces the unreadable string with a strongly-typed, compile-time safe AST.
NoLemurs@reddit
UUIDs and VAT numbers are great examples. You absolutely shouldn't be using a regex for these.
For UUIDs use java.util.UUID. The code is cleaner, the error handling will be better. There's really no downside.
For the VAT number, the spec is a two-digit country code followed by 2-13 characters where the specific format depends on the country code. If all you're doing is validating that you've got two ASCII letters followed by 2-13 ASCII letters or digits, this is trivial to code.
On the other hand, if you want to actually validate the VAT number, making sure the country code is valid and the format matches the country code, I don't even want to imagine what sort of monstrosity of a regex you'd need to construct to do that correctly. This absolutely needs to be code, and probably you want to use a library for it.
The regex is the wrong choice for both of these use cases.
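The plain-code approach described above can be sketched like this (illustrative class and method names; the VAT check covers only the structural rule quoted, two letters plus 2-13 alphanumerics, not per-country formats):

```java
import java.util.UUID;

public class PlainValidation {
    // Validate a UUID without regex, via java.util.UUID.
    static boolean isUuid(String s) {
        try {
            UUID.fromString(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Structural VAT check only: two uppercase ASCII letters
    // followed by 2-13 ASCII letters or digits.
    static boolean looksLikeVat(String s) {
        if (s == null || s.length() < 4 || s.length() > 15) return false;
        for (int i = 0; i < 2; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') return false;
        }
        for (int i = 2; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (!ok) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isUuid("123e4567-e89b-12d3-a456-426614174000")); // true
        System.out.println(looksLikeVat("IT12345678901")); // true
        System.out.println(looksLikeVat("1X12345678901")); // false
    }
}
```

One caveat worth knowing: in many JDK versions UUID.fromString is lenient about component lengths (it accepts strings like "1-1-1-1-1"), which is one reason people reach for a stricter regex when the exact 8-4-4-4-12 shape matters.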
Mirko_ddd@reddit (OP)
You make a fair point about java.util.UUID for pure parsing. But using UUID.fromString() just to validate user input is a known anti-pattern in Java, because it relies on catching IllegalArgumentException for control flow, which kills performance under load. Plus, a class parser can't extract a UUID from a log line or a mixed payload. Regex is for pattern matching.
And to make it even better, you don't even have to write the UUID AST yourself. Sift comes with a built-in catalog for standard formats. You literally just write .followedBy(SiftCatalog.UUID) and you get a perfectly optimized, compile-time safe UUID regex without looking at a single string.
But your VAT example is actually the absolute best argument FOR Sift!
You wrote: 'I don't even want to imagine what sort of monstrosity of a regex you'd need to construct to do that correctly.'
That is exactly my point! In raw strings, a full EU VAT validator is an unmaintainable monstrosity. But with Sift, it's not a monstrosity at all. You just compose small, readable rules: anyOf(italianVatRule, germanVatRule, frenchVatRule). You get the blazing fast, native execution of the regex engine without having to write or maintain the 'monstrosity' of the string syntax. You just described the exact pain point Sift was created to eliminate.
If you have a few minutes this weekend, I'd love to invite you to take a quick look at the repository. Even if it doesn't completely change your mind about regex in code, I genuinely think you might appreciate the architectural approach to solving this specific pain point. As engineers, I think we both share that love for exploring new ways to tackle old problems.
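The claim above about pulling a UUID out of a log line can be sketched with plain java.util.regex (the Sift calls are from the thread; this just shows the underlying pattern-matching use case):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UuidExtraction {
    // Standard 8-4-4-4-12 hex UUID shape.
    static final Pattern UUID_PATTERN = Pattern.compile(
            "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
            + "[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    // Extract the first UUID embedded in a mixed payload, or null if none.
    static String firstUuid(String logLine) {
        Matcher m = UUID_PATTERN.matcher(logLine);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "2024-01-01 INFO request 123e4567-e89b-12d3-a456-426614174000 handled";
        System.out.println(firstUuid(line));
    }
}
```

This is something UUID.fromString genuinely cannot do, since it expects the whole string to be a UUID; find() is the regex engine's advantage here.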
tes_kitty@reddit
Ok... But how would this look like? [0-8aceXZ-]{3}
Mirko_ddd@reddit (OP)
You couldn't write it, because it's malformed. I don't know if this was a trap or if you're simply validating my point (typos can happen in strings). If you tell me what you want to validate, I can write a snippet though.
tes_kitty@reddit
Malformed in what way? I tried the [0-8aceXZ-] part in sed and it works as expected.
Mirko_ddd@reddit (OP)
yup, I misread. I wrote the snippet you asked for, sorry
tes_kitty@reddit
Looks simple enough... But why did you 'upperCaseLetters()' in your example instead of just 'range('A','Z')'? It makes changes later more complicated if the range changes. 'range()' is universal and easier to adapt.
Mirko_ddd@reddit (OP)
funny enough, you can write it both ways, but for me 'upperCaseLetters()' is so much easier to read; it's more intentional. Here I explicitly used range() because it's a custom one (0-8), but if it were 0-9 I would have used 'digits()'.
tes_kitty@reddit
Silly question... Is there also a 'romanDigits()'?
Mirko_ddd@reddit (OP)
you mean "I II III IV V" etc?
tes_kitty@reddit
Yes.
Mirko_ddd@reddit (OP)
I googled roman numerals; I was a kid the last time I heard about them.
for sure it is longer than
but it is self-documenting, and you can write it following a bit of logic without messing with special chars and parentheses.
Mirko_ddd@reddit (OP)
how would you write it in raw regex?
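For reference, a commonly cited raw-regex answer to that question, covering roman numerals 1-3999 (a sketch, not from the thread):

```java
import java.util.regex.Pattern;

public class RomanNumerals {
    // Thousands, hundreds, tens, units - each group encodes both the
    // additive and subtractive forms (e.g. CM = 900, CD = 400, IX = 9).
    // The (?=.) lookahead rejects the empty string.
    static final Pattern ROMAN = Pattern.compile(
            "(?=.)M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})");

    static boolean isRoman(String s) {
        return ROMAN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRoman("XIV"));    // true
        System.out.println(isRoman("MMXXIV")); // true
        System.out.println(isRoman("IIII"));   // false
    }
}
```

It works, but it is a good illustration of both sides of this thread: compact once you know the idiom, and opaque if you don't.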
mtetrode@reddit
Looks to me like a valid regex. Three chars from the set 0 to 8, a, c, e, X, Z, -
HighRelevancy@reddit
The hyphen probably should be escaped even though most implementations will not try to parse it as a range if it's first or last.
Usually works, but it's stupid that it does.
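The hyphen placement point can be checked directly with java.util.regex (a quick sketch; both forms compile, but the escaped one stays a literal even if the class is later reordered):

```java
import java.util.regex.Pattern;

public class HyphenClass {
    public static void main(String[] args) {
        // Trailing hyphen: treated as a literal because it cannot start a range.
        Pattern trailing = Pattern.compile("[0-8aceXZ-]{3}");
        // Escaped hyphen: a literal regardless of where it sits in the class.
        Pattern escaped = Pattern.compile("[0-8aceXZ\\-]{3}");

        System.out.println(trailing.matcher("0a-").matches()); // true
        System.out.println(escaped.matcher("0a-").matches());  // true
        System.out.println(trailing.matcher("9bq").matches()); // false
    }
}
```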
tes_kitty@reddit
I tried it with 'sed' before posting here.
Still wondering how that would be written in Sift.
fearswe@reddit
That's perfectly valid regex.
tdammers@reddit
Some remarks though:
Introducing a type-safe API to the same regular expression language does nothing about that ReDoS vulnerability. Whether you write ^(a+)+$, or this:
Sift.fromStart()
    .oneOrMoreOf(oneOrMoreOf(literal("a")))
    .andNothingElse()
...doesn't change the semantics of the resulting query; when running it over the string "aaaaaaaaaaaaaaaaaaaaaaa!", the interpreter will still go into a deep recursion. The only thing you win, really, is that your regular expression is parsed at compile time, so regex syntax errors will stop your build rather than throwing at runtime.
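For what it's worth, java.util.regex offers a pattern-level mitigation for this exact case: possessive quantifiers give up their backtracking positions, so an equivalent pattern fails fast instead of retrying exponentially many splits (a sketch illustrating the point above; the Sift API is from the thread):

```java
import java.util.regex.Pattern;

public class Redos {
    public static void main(String[] args) {
        // a++ is possessive: once it consumes the run of a's, the engine
        // will not backtrack into it, so a trailing '!' is rejected
        // immediately rather than after exponentially many retries.
        Pattern safe = Pattern.compile("(a++)+");

        System.out.println(safe.matcher("aaaa").matches());  // true
        System.out.println(safe.matcher("aaaa!").matches()); // false, and fast
    }
}
```

This supports the commenter's point: the fix lives in the pattern's semantics (or in the engine), not in whether the pattern is spelled as a string or as a typed API.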
"The first version? Nobody can read it without running it." - this is blatantly false. I actually find it easier to read than the more verbose Sift API - it's well defined, it's much more compact, and it uses the same basic regex syntax as every other regex tool I've used over the past 30 years. It's not hard; you just need to know the syntax, just like with any other language. Sure, once you do complex stuff with regular expressions, the language's lack of abstraction and other quality-of-life features will start to bite you, but this one is still perfectly straightforward, you just need to read it left-to-right:
^ = start at the beginning of the string, don't skip anything;
[A-Z] = any character between A and Z, inclusive;
{3} = exactly three of those;
- = a literal dash character;
\d = a digit;
{4,8} = 4-8 of those;
(?:...) = treat this as a group, but don't capture it;
_TEST = the literal character sequence "_TEST";
? = the preceding thing is optional;
$ = match end of input (i.e., no further input must follow).
"With raw regex, changing the engine is a rewrite." - yes, a rewrite of the engine. Not of the regular expression itself. This is a design issue that has nothing to do with the choice of surface language (regular expression string vs. typed API). Many regular expression APIs out there do the right thing and allow you to select different interpreter backends through the same API; and in fact, I would argue that this would actually be easier to implement with the much smaller, simpler API surface of a traditional regex engine, which leaves the entire parsing, AST, and interpretation code completely invisible to the programmer, whereas a structured API like Sift, by necessity, exposes part of the AST through the API. I'm not saying regular expressions as we know them are necessarily the best choice, but I am saying that the ability to swap out the interpreter backend has nothing to do with classic regular expressions vs. a Sift-style structured API.
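Assembling the symbols walked through above into one pattern, with a couple of checks (a sketch using java.util.regex):

```java
import java.util.regex.Pattern;

public class SerialPattern {
    // The regex dissected above: three uppercase letters, a dash,
    // 4-8 digits, and an optional literal "_TEST" suffix.
    static final Pattern P = Pattern.compile("^[A-Z]{3}-\\d{4,8}(?:_TEST)?$");

    public static void main(String[] args) {
        System.out.println(P.matcher("ABC-1234").matches());          // true
        System.out.println(P.matcher("ABC-12345678_TEST").matches()); // true
        System.out.println(P.matcher("AB-1234").matches());           // false
    }
}
```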
"You wrote a complex pattern. How do you document it for the junior developer joining your team tomorrow? Sift does it for you." - This is nice, but it also solves a problem that you shouldn't have to begin with. Regular expressions shouldn't be used as poor man's parsers; if you need a parser, write a parser. Regular expressions are great for small, regular things, such as tokenizers - something like
identifierRE = "[a-z_][a-z0-9_]*", for example, is a perfectly fine way of defining what an identifier token looks like in a language you want to parse; it's concise, readable, "self-documenting" via the variable name, and also pretty easy to debug and test. If you need something more complex than this, then you probably need an actual parser; that parser should have a typed, structured API, similar to Sift, but it should also be a little bit more powerful than Sift, being able to parse non-regular grammars, resolve ambiguities, provide elaborate error information, and come with primitives and combinators for things like speculative parsing / explicit backtracking, full-blown recursion, complex repetitions, etc. If you've ever used one of the Parsec-style parser-combinator libraries in Haskell, you'll understand what I'm talking about - these things are a joy to use, and while parser-generator toolchains tend to produce better runtime performance, they are still plenty fast for almost all of your everyday parsing needs.Email validation is actually a classic example of how not to use regular expressions. In practice, the best regular expression for validating email addresses is this:
^.+@.+$. That is: it must contain an@character, there must be other characters before it, and there must be other characters after it. This will still allow lots of invalid email addresses, but guess what, that's fine. Your email address can be invalid in all sorts of ways that you cannot tell from the address itself anyway: the destination server may not resolve, it may be down, the address might not exist on the server, the mailbox may be full, mail may disappear in/dev/nullon the other end, something along the chain might not accept an address that is formally valid, the recipient may have lost access to their mailbox, etc. What you're really interested in is just two things: first, is this something that has any chance at all of being an email address we can try sending stuff to; and second, when I send an email to this address, is there someone on the other end reading it. The simple regex above takes care of the first question; for the second, what you do is you send a confirmation link to that address, and when the user clicks it, you mark the address as "confirmed" (because now you know that emails you send there can be read). OTOH, if you're writing an actual email client, then a regular expression won't be enough anyway - you need a parser."The only question is: will Java lead this change, or will we keep writing business logic in strings for another 60 years?" I'm sorry to inform you that Java isn't going to lead this change. Haskell has been doing this kind of stuff for decades (according to Hackage, version 2.0 of
parsec, the first production-ready library to offer this, has been uploaded in 2006, 20 years ago) - people hardly ever use regular expressions in Haskell, because the parser libraries are just so convenient. They are technically overkill for something a regular expression can do, but there's so little ceremony involved that using them just for the regular expression subset of their functionality is still worth it most of the time. Sift looks decent, but compared to Haskell libraries like Megaparsec, it's still fairly limited, and far from "leading the change".HighRelevancy@reddit
No, but it does make it more visibly questionable. If that regex were a segment of a long one, I'd never see it coming.
"I have 30 years experience doing it the hard way" is not an argument for the hard way not being hard.
And 30 years of experience really should be telling you that all your "you shouldn't use regex for that" is a nice idea but it's not how it pans out in reality.
tdammers@reddit
My point here is that the syntax is only "hard" or "unreadable" because you haven't learned it, not because it's intrinsically difficult. I haven't been "doing it the hard way for 30 years" - I have been doing it the hard way for a few months, and then the hard way became the easy way.
There's a lot to say for an API like Sift, but if I had to compare them in terms of how easy they are to use, then for me personally, traditional regular expressions would win by a huge margin. Not because they're intrinsically better necessarily, but because I already know them, so I can use them without reading a tutorial or checking a reference manual all the time. I also have to read fewer characters to get their meanings - notice how the Sift example in the article takes six times more code to express the same regular expression semantics.
This doesn't just mean it takes more time to type (which is actually mostly irrelevant, since you can use autocompletion etc.); the more important issue is that it takes up more screen real estate (meaning you have less context within view while reading it), and your brain has to process more tokens to extract its meaning. Once you know that ^ means "from the start", there is no value in the extra characters needed to spell out fromStart(), which means that ^ is 11 times more efficient at expressing the same thing. And when you're dealing with larger codebases, this kind of difference definitely matters - reviewing a 1100-line change typically takes significantly more time and effort than reviewing a 100-line change, even when they are functionally equivalent.
It does for me.
30 years of experience have taught me that building things on top of a broken foundation is a fool's errand, so I'll generally fix the foundation before building anything on top of it. If I see a codebase that uses regular expressions to parse complex inputs, I'll fix that. If I'm building something that uses regular expressions, and things get out of hand, I'll take a step back, revisit my design decisions, and rewrite the code to use a proper parser.
30 years of experience have also gotten me into a position where I have the authority to make such decisions; I rarely have to deal with managers or clients who insist I keep those bad design decisions and work with them somehow - when I put my foot down and say "this code needs to be rewritten, and here's how we should do that" (which I generally only do when I think it is feasible), then that's usually what happens, and it usually ends well.
It also helps that after 30 years of doing this stuff, I have gotten quite good at doing things the right way without getting lost in unnecessary abstractions, so when I do this, I'm often still faster than an inexperienced junior dev doing it the wrong way (and then frantically trying to squash the resulting bugs one by one).
So yeah, it usually does pan out like that in reality for me, but I am of course aware that it's not like that for everyone.
Mirko_ddd@reddit (OP)
I don't know if you know about jOOQ. I would bet anything that people said exactly the same thing about writing SQL strings manually. Try googling it; you may be surprised to find that even your bank probably runs it.
What I want to point out is that string validation is weak. You may be a regex genius, but the world is full of teams, not single engineers. And typos happen. Code maintenance by different devs also happens.
So if jOOQ became a standard, I can see room for adoption of libraries like Sift (or even better than Sift).
It's not about me versus good engineers; it's about making things simpler and harder to break.
Mirko_ddd@reddit (OP)
first of all, thank you for reading.
That said, the Sift syntax you wrote is wrong XD.
If you actually tried writing that in an IDE, the Type-State pattern would immediately suggest the correct modifiers designed exactly to prevent ReDoS, like .preventBacktracking(). You literally have to go out of your way to write vulnerable code.
Second, you're missing the core architectural feature: Sift doesn't just spit out a string for java.util.regex. It builds an AST and allows you to swap the underlying execution engine without touching your business logic.
If you are validating input that is vulnerable to deep recursion, you just plug in the officially supported RE2/J backend. It runs in linear time O(n) using a DFA. Catastrophic backtracking becomes mathematically impossible. ReDoS solved.
To demonstrate that the raw regex is 'perfectly straightforward to read', you had to write a 10-line paragraph translating every single symbol into plain English.
That English translation you just wrote? That is exactly what the Sift API is.
With 30 years of experience, you can mentally decompile (?:...) into 'non-capturing group' in milliseconds. But the junior or mid-level dev reviewing that PR tomorrow morning cannot. Sift doesn't exist to replace regex for veterans like you; it exists so teams don't have to mentally parse symbols or write the exact translation paragraph you just provided to explain what the code does.
If a developer needs to validate a complex custom product serial code (e.g., starts with 3 letters, followed by a variable-length year, a dash, and a specific checksum format), writing a full-blown custom parser for that is textbook over-engineering. But writing it in raw regex means the next junior developer won't be able to read it without running it.
Sift bridges that gap. It gives Java developers the structured, 'self-documenting' joy of a parser-combinator API, while still compiling down to the lightweight, standard regex engine they already use in their stack. It solves a very real, everyday business problem.
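The serial-code format above is only described in prose, so the specifics in this sketch (a 2-4 digit year and a 4-character hex checksum) are made-up assumptions, purely to make the example concrete:

```java
import java.util.regex.Pattern;

public class SerialCode {
    // Hypothetical serial format: 3 uppercase letters + 2-4 digit year
    // + dash + 4 hex checksum characters. The year width and checksum
    // shape are assumptions, not from the thread.
    static final Pattern SERIAL = Pattern.compile("^[A-Z]{3}\\d{2,4}-[0-9A-F]{4}$");

    public static void main(String[] args) {
        System.out.println(SERIAL.matcher("ABC2024-9F3A").matches()); // true
        System.out.println(SERIAL.matcher("AB2024-9F3A").matches());  // false
    }
}
```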
Here I agree 100%, it was a simple example of an unreadable regex converted in a single utility method.
when I asked if Java will 'lead the change', I meant in the mainstream enterprise industry. Fortune 500 companies run on Java/C# as far as I know, not Haskell, and developers there are still writing raw regex strings every day.
ff3ale@reddit
Nice, now make it cross platform / cross language / easily (de)serializable
Mirko_ddd@reddit (OP)
as u/HighRelevancy said.
I am an Android/Java developer, and I wanted to share my two cents with colleagues who can validate (or not) my intuition. If someone finds this useful and it sparks interest in making something similar for their language, they're welcome.
it is. and it is intentional
HighRelevancy@reddit
Brother, it's Java. It's cross platform. That's the whole point of Java.
What, someone can't release a library without porting it to seven different languages immediately? What weird criteria.
You wouldn't, it's Java. Are there other Java libraries you normally "pass to shell scripts"? What are you talking about?
Wtf is this comment?
Huxton_2021@reddit
To see what people who take their regex seriously came up with some years ago:
https://docs.raku.org/language/grammars
cbarrick@reddit
This is a flawed premise.
Many (most?) regexes don't originate from your code. They originate from external files that your system ingests.
You can add a regex builder class library if you want one, but you're always going to need a regex language compiler as the more important primitive.
Also, the article reads like AI. Don't get me wrong, it's fine to use LLMs to help polish your writing. But a little dab goes a long way. This one is a bit overcooked.