Is it possible to detect 100% of SQLi with a simple regex?

  • I'm wondering if it is possible to detect 100% of the possible SQLi attacks using a simple regex.

    In other words, using very simple PHP code as an example:

    if (preg_match("/select/i", $input)) {
        attack_log("Possible SELECT SQLi detected.")
    }
    

    The questions are:

    • Will that regex catch all possible SQLi attacks that use SELECT? If not, is it possible to change that regex so that it is going to detect all injections that rely on SELECT?
    • Is it possible to change that regex to so that it will catch all possible SQLi, so not only SELECT statements but also all the rest? I'm afraid that to achieve this I would need to add every possible SQL keyword to the regex, including "AND" and "OR".
    • Supposing it's not possible or feasible to detect all SQLi by trying to match all the possible SQL keywords, is there a limited subset of keywords that would allow me to detect the vast majority of possible attacks?

    @reed The problem is that StackExchange isn't an open forum for discussing ideas. It's a Q&A site with restrictions on the types of questions that can be asked, in order to promote content that is useful to both those asking the questions *and* future readers. Questions like this, where the goal (intended or otherwise) is to ruminate over ideas rather than get a single concrete answer, aren't considered to be on-topic here. Since any correct answer would need to start with "don't do this, use parameters instead" it doesn't seem like there's much value that can come of it.

    @Polynomial, the questions I asked are specific. Does it catch all SQLi that use SELECT? Yes, no, why? I believe these kinds of question are what is behind most IDS.

    Unless it is some sort of learning/academic exercise, I would discourage that approach. Instead of detecting/parsing user input, use the database mechanisms that avoid sqli PHP prepared statements

    No, a simple `preg_match("/select/i", $input)` will NOT detect all sqli. The `SELECT` keyword would be one strong warning indeed, but it would give you false positives (for example a password containing SELECT) and would fail when the sqli is ofuscated.

    I was sure there was a duplicate that I could point you towards, but I never found one. There should be one.

    Can you show how `$input` is going to be used in the query, please?

    @bradbury9 For intrusion detection (one of the tags on the question) I could see a basic SQLi check (regex or otherwise) being used as a flag for suspicious activity that leads to an account/ip ban instead of ignoring it because you're SQLi proof (via prepared statements) and risking some other attack vector the adversary uses will work before they trigger a ban. Something like that would be separate from protecting the application against SQLi where the approach is, as you said, completely inappropriate.

    Attempting to parse semantic information out of a recursive language with regular expressions? What are you trying to do, summon Codethulu or something?!?

    @reed Imagine the frustration of your pour users being frustrated they can't submit a post/comment/biography/whatever saying _"I recommend you select the best tool for the job."_ ;)

    Although there's no simple regex that can do this - and no regex can do it completely reliably - there's still a multi-billion dollar industry based on this idea: web application firewalls

    If you're okay with some false positives, then this one will always catch all SQL injection: `.*`

    You don't even need a regex, just `return true`. Pretty high false-positive rate though.

    Just convert strings from your users to hex. Instead of `select 'yeet'`, send it as `select x'79656574'`

    `I'm afraid that to achieve this I would need to add every possible SQL keyword to the regex` yup, and then some

    Does "SQLi" mean "SQL injection attacks"? If so, why don't you just say that?

    "He defeated all SQL Injection attacks with one simple trick! Penetration Testers hate him!"

  • schroeder

    schroeder Correct answer

    2 years ago

    Keyword filtering for SQLi is not a good technique. There are too many ways to bypass it.

    Crazy things like sel/**/ect might work, for instance. Or playing games with substr(). And then there's EXEC('SEL' + 'ECT 1').

    There are many guides on how to bypass common filtering techniques.

    But then you might ask if there is a superset of things to filter for (like select and /**/ and substr and EXEC), but then the list gets very, very long, and you still might not get a comprehensive list.

    The better approach is to understand the range of acceptable inputs and protect those or to make it ineffective to use SQLi through proper design.

    I guess this is exactly the kind of answer I needed, and that link is especially helpful to understand how complex all this can get. Thanks.

    Not to mention the false positive rate. The more things in your list, the more likely you'll match something in the query data - name, user name, email address, etc. Imagine if someone tried to register as [email protected] or something along those lines.

    @alex.forencich ... hence needing to know what's normal for input ...

    Also even if you managed to catch all injection attempts using select, injection of SQL *fragments* that modify existing queries to do something else then they were supposed to is probably even more important class of attacks—and these don't have characteristic keyword.

    @alex.forencich And sometimes the set of valid data may overlap with the set of malicious inputs. https://xkcd.com/327/

    "proper design" It would *greatly* improve the answer to mention what that proper design is, the parameterized queries part in particular.

    @jpmc26 Proper design, IMO, is not appropriate as a SE question. It's so broad with such a huge amount of already published material that anyone who can't figure it out really shouldn't be doing it at all.

    @Nelson All that of "published material" mostly boils down to, "Use parameterized queries for any value that can't be hard coded directly into the query text, especially user input." (Literally. The only thing beyond this is the question of what to do if you need to have some kind of dynamic value for an identifier, and that isn't necessary in the vast majority of cases.) No harm would come from including a brief summary of that nature. It's really not as big a topic as you seem to think.

    @jpmc26: While using parametrized queries is the *procedure of implementation* you should definitely follow, I think it is worthwhile to understand *why* it is inherently better than the proposed method. The proposed method attempts to spot anything that could become dangerous when it gets parsed and evaluated by the DB engine, whereas with parametrized queries, user input is presumably rendered unparseable in the first place (by getting converted into e.g. a plain string literal), while taking into account the specific syntax quirks of the very DB engine used to run the query.

    Also, this doesn't catch attacks that may disrupt the service, such as `"; drop table users; --` which, simply, deletes the (hypothetical) users' table. This is far more devastating. Or, just a simple `"; update users set password=""; --`.

    I would add to response that "if you have a problem and use regex, you have two problems now". Basically he wants to trade SQL injection for Regex injection.... With that approach he might make application vulnerable for both issues at the same time, and god knows what could be possible with combination of those two.

    @Mateusz I heavily disagree with your statement. The idea is to test is a certain string matches the regular expression. There's nothing coming from the user side, in the regular expression. Just the value to check. But, yes, it is still 2 problems.

    Also consider how much simpler it is to just use standard libraries to secure your queries with proper parameterization. Not to mention they've thought of a heck of a lot more edge cases than we have, even the best of us.

    `but then the list gets very, very long` - no, the list would be literally *INFINITELY LARGE*, lest you consider the max query length, which in MySQL is 67 megabytes __BY DEFAULT__, but even considering a max query length of 67 MEGABYTES, the list will be *EXTREMELY* large indeed.

License under CC-BY-SA with attribution


Content dated before 7/24/2021 11:53 AM