How are anti viruses so fast?
The common anti-virus (to my knowledge) uses a kind of brute force type method where they get the hash of the file and compare it to thousands of known virus' hash. Is it just they have servers with super fast SSD and they upload the hashes to that and search really fast or am I completely wrong with my understanding of how they work?
Why do you think looking up a hash in a list of thousands is slow? With the correct data structure, a modern computer can do such searches millions of times per second. A list of thousands of hashes easily fits in memory, no fast SSD needed.
Hash lookup is on average *O(1)* (only in very rare occasions it works in *O(n)*). Nevertheless the point of a hash is to perform *literal comparisons*. So that would mean that a hacker could add a pointless two bytes to the virus, and make easily 65'536 versions of the file by assigning all possible values to the two bytes.
AV software is still one of the biggest performance drags in PCs :) So fast is relative here...
You may also be interested in this as an example of how you can search for many substrings in a file all at once.
AhoCorasick is a multi-regex that can pattern match on many thousands of patterns simultaneously; a state machine that emits multiple events. With a first-pass of a regex, files get selected for deeper inspection. I know that's how we did firewalling and intrusion detection at wire speed; at Check Point, which also had an AV division.
If you want to test hashes for set membership in really huge sets, use a bloom filter. Though I would imagine that AhoCorasick is more relevant to the main AV detect. A bloom filter (real world example) would be a set like, every hostname associated with divorce lawyers. It's an opaque bit structure that hides what is in it, and can give false positives; particularly as it get really full.
The lookup of Hashes is not the problem, calculation of Hashes much more so. But pure signature based AV is not a good solution anymore anyway. (And the more advanced methods are even more demanding)
You're assuming it's fast--in my experience that's not the case. It's just most of the time they look at the file and quickly figure out that it's a file they have already checked. That's where the hashes come from. I also suspect they have a table of hashes for common software. It's nowhere near as fast when it hits something that hasn't already been cleared.
Disclosure: I work for an anti-virus vendor.
Because most anti-virus engines were born as protecting endpoints, and even now for many of them endpoint protection is major part of business, the modern anti-virus engines are optimized for scanning endpoints. This optimization includes many things, such as:
Not scanning the files which couldn't contain infections which can infect your computer;
Remembering which files were scanned, and not scanning them again unless the files were modified;
Optimizing scans for the file types when possible - for example when scanning executables, only certain parts of it need to be scanned. This minimizes disk reads, and improves performance.
A common misconception is that AV engines use hashes. They generally do not, for three reasons:
- First is that getting around hash detection is very easy and doesn't require modifying the actual malicious code at all;
- Second is that using hashes does not allow you to implement any kind of proactive protection - you will only be able to detect malware which you have seen;
- Calculating a hash requires the whole file to be read, while for some files (such as executables) this is not necessary. And reading the whole file on non-SSD hard drives is expensive operation - most AV engines should scan a large clean executable file faster than calculating hash on it
a random number put in the middle of a message in an executable when that file is stored makes each file have a different hash.
I've heard many times about this "hash detection". Thanks for confirming this is non sense.
I don't think this answer fully addresses the main question of _how_ the performance is achieved. In particular, "most AV engines should scan a large clean executable file faster than calculating hash on it" seems to merely perpetuate the OP's question. How exactly is it possible to scan a large file faster than computing a hash/checksum from it? Both operations would appear to be `O(n)`.
@aroth Because "for example when scanning executables, only certain parts of it need to be scanned. This minimizes disk reads, and improves performance."
The question remains how you detect that a file on disk has changed since the last scan. It is trivial to reset file modification date, so that is not an indicator; and a sophisticated infection could conceivably leave the file size unchanged. That leaves only a hash as a criteria.
@PeterA.Schneider Aren't write operations monitorable, for lack of a better word?
Another reason hashing the file would not be suitable is that it would be trivially easy to subvert, as most file types, including executables, will tolerate adding random bytes at the end of the file with no change in how the file works. Or at the start of the file, for ZIP/JAR/etc.
@PeterA.Schneider it is true that a file could be modified without changing size and setting modification time back. Virus scanners ought to re-scan files at regular intervals (even if not every time). Not the whole executable would need to be scanned.
@PeterA.Schneider there are many possible ways. For example, since most anti-viruses already include the file system driver intercepting file operations, such driver may simply reset the flag in the internal database of already scanned files.
You say that reading the file from HDD would be expensive operation. But how do you scan an executable without reading it?
@GeorgeY. So if a user deactivates their AV application for a short period, does the AV determine that all files need to be rescanned (or checked by hash???)... or does it actually leave some minimal form of monitoring active... or does it end up undermining the long-term security leaving most AV to believe files should still be the same when they no longer may be?
@aroth Two operations might both be O(n), but one might still be 100 times faster than the other.
why is it necessary to have a malware scanner like Malware Bytes? why don't AV vendors just include that in their virus scanning? or do they? And why is a separate anti-ransomware necessary?
@ycomp it is not necessary. All major AV vendors detect adware, spyware and other malware - not just viruses. And separate anti-ransomware is not necessary either.
@GeorgeY. but what happens when the AV vendor has a separate anti-ransomware tool (I use bitdefender av)? do they cripple the free av in regards to ransomware detection then?
@TomášZato you can scan most clean executables without reading the **whole** file from disk. Obviously you need to read **some** data.
@ycomp There's difference between detecting ransomware binaries (this is what AV engine is doing), and preventing ransomware from encrypting files (this may be an additional component).
so does removing ransomware fall into the AV sphere? or the anti-ransomware software sphere generally? not preventing encrypting, just removing it once detected before it can even try to encrypt
@aroth This might be one of those cases where big-O notation hides important constant factors...
Since a couple of people have asked about scanning executables vs checking a hash on them: To compute the hash on a file, you have to read *the entire file*; however, proper executable scanning does not need to read the whole file, just key parts. If you have a 100+MB executable, that can mean the difference between reading 2MB of it or reading 100+MB -- multiply that by however many large executables happen to live on a system, and you should be able to see where the performance improvement comes from.