The dangers of single line regular expressions

thunderbong | 85 points

Seems to me this is more about the danger of passing anything derived from user input into the TEMPLATE side of a templating engine. Why in the world would you ever do that?!?

Obviously if you pass data into the variable side of the engine, you hardly have to worry about it at all, since it's already going into a place that was designed for handling arbitrary and possibly-hostile input and been battle-tested at doing it correctly in Production for many years. If you pass it into the template side, you're betting that you can be as good as dozens of templating engine writers working for a decade at doing that, in exchange for, well, I can't really think of any possible legitimate advantage for doing that.
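
To make the template-side versus variable-side distinction concrete, here is a minimal Python sketch. The article's case is Ruby/ERB; `str.format` merely stands in for a templating engine here, and `Account`, `api_key` and the strings are hypothetical.

```python
class Account:
    """Hypothetical object that happens to be in the rendering context."""

    def __init__(self, name, api_key):
        self.name = name
        self.api_key = api_key


account = Account("alice", "s3cr3t")
user_input = "{0.api_key}"  # hostile input

# Template side: the user's string is interpreted as the template,
# so it can reach into whatever objects the template is rendered with.
leaked = user_input.format(account)

# Variable side: the template is fixed and the user's string only
# fills a slot. It is never re-interpreted.
safe = "Hello, {}!".format(user_input)

print(leaked)  # s3cr3t
print(safe)    # Hello, {0.api_key}!
```

The same input is harmless in the second call because it is only ever treated as data.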

ufmace | 11 days ago

In my experience `$` does reliably mean end of string for regular expressions, unless you specifically ask for "multiline" mode.

Ruby seems to be in multiline mode all the time?

    $ python -c 'import re; print("yes" if re.match(r"^[a-z ]+$", "foobar") else "no")'
    yes
    $ python -c 'import re; print("yes" if re.match(r"^[a-z ]+$", "foo\nbar") else "no")'
    no
    $ python -c 'import re; print("yes" if re.match(r"^[a-z ]+$", "foo\nbar", re.M) else "no")'
    yes

    $ perl -le 'print "foobar" =~ /^[a-z ]+$/ ? "yes" : "no"'
    yes
    $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/ ? "yes" : "no"'
    no
    $ perl -le 'print "foo\nbar" =~ /^[a-z ]+$/m ? "yes" : "no"'
    yes

    $ node -e 'console.log(/^[a-z ]+$/.test("foobar") ? "yes" : "no")'
    yes
    $ node -e 'console.log(/^[a-z ]+$/.test("foo\nbar") ? "yes" : "no")'
    no
    $ node -e 'console.log(/^[a-z ]+$/m.test("foo\nbar") ? "yes" : "no")'
    yes

    $ ruby -e 'if "foobar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
    yes
    $ ruby -e 'if "foo\nbar" =~ /^[0-9a-z ]+$/i then puts "yes" else puts "no" end'
    yes

EDIT: this is documented behavior for Ruby. What other languages call multiline mode is the default; you're supposed to anchor with \A and \z instead (\Z also works but tolerates a single trailing newline). They do have an `/m`, but it only affects the interpretation of `.`

https://docs.ruby-lang.org/en/master/Regexp.html#class-Regex...
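
One caveat that is easy to check in Python: even without multiline mode, `$` also matches just before a single trailing newline, so "end of string" is only approximately true. A small sketch (the helper name `is_match` is made up for illustration):

```python
import re


def is_match(pattern, text):
    """Return True if `pattern` matches at the start of `text`."""
    return re.match(pattern, text) is not None


# `$` accepts the position right before a final newline.
dollar_trailing = is_match(r"^[a-z ]+$", "foobar\n")

# `\Z` in Python matches only at the very end of the string.
z_trailing = is_match(r"^[a-z ]+\Z", "foobar\n")

# re.fullmatch requires the whole string to match, no anchors needed.
full_trailing = re.fullmatch(r"[a-z ]+", "foobar\n") is not None

print(dollar_trailing, z_trailing, full_trailing)  # True False False
```

So for validation, `re.fullmatch` or `\Z` is probably the safer choice than `$`.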

neilk | 11 days ago

Alternatively, don't validate and then use the original. Instead, pull out the acceptable input and use that.

Even better, compare that to the original and fail validation if they're not identical, but that requires maintaining a higher level of paranoia than may be reasonable to expect.
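
A rough Python sketch of both ideas, assuming the same `[a-z ]` allowlist used elsewhere in the thread (the function names are hypothetical):

```python
import re

# Assumed allowlist, borrowed from the examples in this thread.
ALLOWED = re.compile(r"[a-z ]+")


def extract_acceptable(raw):
    """Return the leading run of allowed characters, or None if there is none."""
    m = ALLOWED.match(raw)
    return m.group(0) if m else None


def extract_strict(raw):
    """Paranoid variant: fail unless the extracted text is the entire input."""
    cleaned = extract_acceptable(raw)
    if cleaned is None or cleaned != raw:
        raise ValueError("input contains characters outside the allowlist")
    return cleaned


print(extract_acceptable("foo\n<%= evil %>"))  # foo
print(extract_strict("foo bar"))               # foo bar
```

Either way, downstream code only ever sees the extracted value, never the original string.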

sfink | 11 days ago

This was interesting and new to me, but as other commenters indicate, part of the problem is that we're trying to find the bad thing rather than trying to verify it is the good thing

There's a related concept of "failing open vs failing closed" (fail open: fire exit, fail closed: ranch gate)

In Jurassic Park (an amazing book/film for understanding system failures), when the power goes out, the fence is functionally an open gate

In this case, we shouldn't assume that we can enumerate all possible bad strings (even with a regex)

wrsh07 | 11 days ago

I think it is a surprise that a partial match returns true.

But I guess this is why Python has so many ways of matching a pattern against a string (match, find, findall, I think - they are hard to remember)
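
For reference, the functions in Python's `re` module are `match`, `search`, `fullmatch` and `findall`. A quick sketch of how they differ on the thread's input:

```python
import re

PATTERN = r"[a-z]+"
TEXT = "foo\nbar"

prefix = re.match(PATTERN, TEXT)          # anchored at the start only; matches "foo"
anywhere = re.search(PATTERN, "42 foo")   # scans forward; matches "foo"
whole = re.fullmatch(PATTERN, TEXT)       # None: the newline is not in the class
pieces = re.findall(PATTERN, TEXT)        # every non-overlapping match

print(prefix.group(0), anywhere.group(0), whole, pieces)
# foo foo None ['foo', 'bar']
```

Only `fullmatch` refuses a partial match, which is usually what validation wants.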

wodenokoto | 11 days ago

Escape the output based on the context a string is being used in versus trying to sanitize for all use cases on input.

This will guarantee that you’re safe no matter how a piece of content is used tomorrow (just need a new escaping function for that content type), and prevent awkward things like not letting users use “unsafe” strings as input. JSX and XHP are example templating systems that understand context and escape appropriately.

If a user wants their title to be “hello%0a%3C%25%3D%20File.open%28%27flag.txt%27%29.read%20%25%3E”, so be it.

Use input validation / parsing to ensure data types aren’t violated, but not as an output safety mechanism.
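
As a small illustration in Python, using only the standard library (the `title` value is a made-up hostile string): the stored value stays untouched, and each output context gets its own escaping function.

```python
import html
import json
from urllib.parse import quote, unquote

# Stored exactly as the user typed it.
title = '<%= File.open("flag.txt").read %>'

as_html = html.escape(title)    # for an HTML text node or attribute
as_url = quote(title, safe="")  # for a URL path segment or query value
as_json = json.dumps(title)     # for embedding in a JSON document

print(as_html)  # &lt;%= File.open(&quot;flag.txt&quot;).read %&gt;

# Each escaping is reversible, so nothing about the input is lost.
assert html.unescape(as_html) == title
assert unquote(as_url) == title
assert json.loads(as_json) == title
```

A new output context tomorrow only needs a new escaping function, not a change to what was stored.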

ec109685 | 11 days ago

Raku (Perl 6) was a chance for Larry Wall to fix some of the limitations of the Perl regex syntax. As you would expect from the Perl heritage, it behaves similarly.

    ~ > raku -e 'say "foobar"   ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'    
    yes
    ~ > raku -e 'say "foo\nbar" ~~ /^ <[a..z ]> +$/ ?? "yes" !! "no"'  
    no
    ~ > raku -e 'say "foo\nbar" ~~ /^^<[a..z ]>+$$/ ?? "yes" !! "no"'
    yes

- ^^ and $$ are the raku flavour of multiline mode

- ~~ the smartmatch operator binds the regex to the matchee and much more

- character classes are now <[...]> (plain [...] does what (...) does in math)

- perl's triadic x ? y : z becomes x ?? y !! z

We can have whitespace in our regexen now (and comments and multiline regexen)

    my $regex =  rx/ \d ** 4            #`(match the year YYYY) 
                 '-'
                 \d ** 2                # ...the month MM 
                 '-'
                 \d ** 2 /;             # ...and the day DD 
 
    say '2015-12-25'.match($regex);     # OUTPUT: «「2015-12-25」␤»

librasteve | 11 days ago

More like "the danger of thinking you can trivially validate user-supplied input" before evaluating the string.

jlv2 | 11 days ago

I pretty much always consider regular expressions the wrong solution. They're notoriously hard to get right.

There's a whole lot of faulty expressions out there for validating email addresses. I prefer to do less validation and let it fail. If the email address is wrong, whatever service you're using for sending emails will just reject it. If you really do need to validate email addresses, use something somebody else wrote that does it properly.

If you're working with some exotic format for which there isn't already an open source library, do what this guy says: parse it, don't try to validate it with regex: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
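
A minimal Python sketch of the "parse, don't validate" idea, with no regex involved. `DisplayName` and `render_greeting` are hypothetical names, and the allowlist is an assumption borrowed from this thread:

```python
import string
from dataclasses import dataclass

# Assumed allowlist: lowercase ASCII letters and spaces.
_ALLOWED = frozenset(string.ascii_lowercase + " ")


@dataclass(frozen=True)
class DisplayName:
    """A string that has already been checked; built via parse()."""

    value: str

    @classmethod
    def parse(cls, raw: str) -> "DisplayName":
        if not raw or not set(raw) <= _ALLOWED:
            raise ValueError("display name must be lowercase letters and spaces")
        return cls(raw)


def render_greeting(name: DisplayName) -> str:
    # Takes the parsed type, so callers cannot hand it a raw string by accident
    # (a type checker would flag that).
    return f"hello {name.value}"


print(render_greeting(DisplayName.parse("foo bar")))  # hello foo bar
```

The check happens once at the boundary, and the type carries the proof of it through the rest of the program.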

cedws | 11 days ago

Ruby 4 should do what every other sane programming language does and require users to opt into multi-line mode via the /m flag.

The fact that Ruby has this behavior at all is a major security issue.

ezekg | 11 days ago

I once had to explain this class of security vulnerability to IC5-IC7 senior engineers.

0. There is no universal regex language but many.

1. Perl-like ones (Ruby, Perl, and PCRE1/2) contain additional hidden traps.

2. You must match untrusted input rigorously and assume it includes invalid Unicode, control characters, and other oddities.

3. You should replicate frontend and backend validations to ensure they are always exactly consistent and correct, preferably through fuzzing and/or property testing.
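
Point 3 can be approximated cheaply without a fuzzing library. The sketch below exhaustively checks every short string over a small assumed alphabet (including a newline and a NUL) and compares two regex validators against a plain-Python reference. All names are hypothetical, and this is exhaustive small-input checking rather than real property testing.

```python
import re
from itertools import product

# Small assumed alphabet: two letters, a space, a newline and a NUL.
ALPHABET = "a z\n\x00"


def reference(s):
    """Spec: non-empty, lowercase ASCII letters and spaces only."""
    return len(s) > 0 and all(c in "abcdefghijklmnopqrstuvwxyz " for c in s)


def loose(s):
    return re.match(r"^[a-z ]+$", s) is not None


def strict(s):
    return re.fullmatch(r"[a-z ]+", s) is not None


def disagreements(candidate, max_len=3):
    """Every string up to max_len where `candidate` differs from the reference."""
    found = []
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            s = "".join(chars)
            if candidate(s) != reference(s):
                found.append(s)
    return found


print(disagreements(strict))          # []
print("a\n" in disagreements(loose))  # True
```

The `$`-anchored validator disagrees with the reference exactly on inputs that end in a newline, which is the kind of divergence a frontend/backend consistency check should surface.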

banish-m4 | 11 days ago

If this is sufficient for rendering the text as neon:

    @neon = "Glow With The Flow"
    erb :'index'

What exactly is `@neon = ERB.new(params[:neon]).result(binding)` even supposed to be doing?

Why wouldn't it just be:

    @neon = params[:neon]
    erb :'index'

hombre_fatal | 11 days ago

> Hire me for a penetration test

When does the blogspam end?

sublinear | 11 days ago

Regular expressions make me sad about our industry.

If you read the early papers, you get a very clear language for pattern matching on sequences. They have really nice properties - the compilation to finite automata gives you decidable equality and decidable minimisation. As in you can compile equivalent regex to exactly the same state machine however they were expressed.

At some point Perl happened and that seems to have sent us down a path to encoding the regular expression in an illegible subset of ASCII. The backtracking implementation cost us negation and intersection. What should be linear time matching becomes exponential.

Emacs will let you write regex in s-expressions at which point they're much easier to read. Everywhere else has gone with "looks like Perl but has different semantics, which we kind of document, be lucky".

I started writing tests to check that regex I'd begrudgingly converted to the perl style behaved the same under different engines and the divergence is rough. Granted I was parsing regex with regex which is possibly a path to insanity but things like a literal [ were a real puzzle to match on different implementations.

I don't know that the horrible syntax layered on top of the semantic beauty is due to Perl, but it looks likely from a superficial standpoint.

JonChesterfield | 11 days ago
