Calculating entropy of a string

  • We're calculating entropy of a string in a few places on Stack Overflow as a signifier of low quality.



    I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the "dumbest thing that works".



    /// <summary>
    /// returns the # of unique characters in a string as a rough
    /// measurement of entropy
    /// </summary>
    public static int Entropy(this string s)
    {
        var d = new Dictionary<char, bool>();
        foreach (char c in s)
            if (!d.ContainsKey(c)) d.Add(c, true);
        return d.Count();
    }


    Is there a better / more elegant / more accurate way to calculate the entropy of a string?



    Efficiency is also good, though we never call this on large strings so it is not a huge concern.


    Why a `Dictionary` as opposed to a `List`?

    Or, SortedList<>?

    Pull the entropy-finding source from any compression algorithm, e.g. Huffman.

    Your question reminded me of a Dr. Dobb's article I read about 20 years ago. Fortunately, it's available online. It includes simple C code: http://www.drdobbs.com/security/184408492

    Jeff, please tell me you are not trying to use this code to make it even harder to post short comments like “Yes.” by preventing users from adding dots or dashes...

    Why are you not using a lambda?

    I don't know what you want to do with it, but one way to estimate the entropy of data is to compress it and take the length of the result. The length of the compressed data is an upper bound on the entropy. The better the compression program, the better the estimate.

    Technically, a known string has no entropy; a process of producing strings has an entropy. What you are doing is hypothesizing a space of processes, estimating which process produced this string, and giving the entropy of that process.

    How about just compressing it with gzip, or whatever is easily available? That's precise and elegant, and likely very optimised.

    @NRS, ReSharper will do that at the end :) :)

  • Won't this work too?



    string name = "lltt";
    int uniqueCharacterCount = name.Distinct().Count();


    will return 2


    Given that Distinct probably uses a HashSet, I think this is the most concise and clear implementation.

    This is missing the point. The goal is to calculate the entropy of the string, not find a fancy way to count characters. Counting characters was one attempt (in the OP): counting characters more elegantly is not significantly better at calculating entropy.

  • public static int Entropy(this string s)
    {
        HashSet<char> chars = new HashSet<char>(s);
        return chars.Count;
    }

    I am always amazed at how simple algorithms can get, or, as in this case, *completely vanish*, by just using the right data structures. My favorite example is computing a histogram of discrete values, which is literally just `new MultiSet(sourceData)`.
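
    .NET has no built-in multiset type, but a rough LINQ equivalent of that histogram one-liner might look like this (`sourceData` is the hypothetical input from the comment above):

        // Histogram of discrete values via LINQ grouping; a stand-in for the
        // imagined MultiSet, since .NET has no built-in multiset type.
        var histogram = sourceData.GroupBy(x => x)
                                  .ToDictionary(g => g.Key, g => g.Count());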

    What about different characters that represent the same glyph, or glyphs that cannot be represented with a single character?

    @dfhwze That `s` parameter is a stream of tokens. In the implementation provided in this answer each character is already a token, so there is no need to preprocess it; in your case you would need to "tokenize" your input first. (And the parameter wouldn't be a `string`, more like `IEnumerable` or similar.)

  • I also came up with this, based on Shannon entropy.




    In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits.
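
    Concretely, for each character c occurring with frequency p(c), the Shannon entropy is H = -Σ p(c) · log2 p(c), measured in bits per character; that is exactly what the code below computes.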




    It is a more "formal" calculation of entropy than simply counting letters:



    /// <summary>
    /// returns bits of entropy represented in a given string, per
    /// http://en.wikipedia.org/wiki/Entropy_(information_theory)
    /// </summary>
    public static double ShannonEntropy(string s)
    {
        var map = new Dictionary<char, int>();
        foreach (char c in s)
        {
            if (!map.ContainsKey(c))
                map.Add(c, 1);
            else
                map[c] += 1;
        }

        double result = 0.0;
        int len = s.Length;
        foreach (var item in map)
        {
            var frequency = (double)item.Value / len;
            result -= frequency * (Math.Log(frequency) / Math.Log(2));
        }

        return result;
    }


    Results are:



    "abcdefghijklmnop" = 4.00
    "Hello, World!" = 3.18
    "hello world" = 2.85
    "123123123123" = 1.58
    "aaaa" = 0

    There are some subtleties here. What you're calculating there isn't the entropy of the string but the entropy of a character in the string. You should consider whether to include a pseudo-character with frequency 1 for the string terminator (a unary number has some content), and whether to multiply by the length of the string.
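
    For illustration, a sketch of the whole-string variant this comment describes (my naming, building on the ShannonEntropy method above):

        // Hypothetical helper: total information content of the string in bits,
        // i.e. per-character Shannon entropy scaled by the string's length.
        public static double TotalShannonEntropy(string s) =>
            ShannonEntropy(s) * s.Length;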

    Sorry, just noticed this - it is equivalent to the code I later posted. Jeff, this is definitely a better solution; I think the most upvoted answer to this question is missing the point.

    Here we see another case where a frequency data structure would be useful. `var map = new FrequencyTable(); foreach (char c in s) { map.Add(c); }`

    If you're not using the keys, would it not be more clear to do `foreach (var value in map.Values)`?

    Not that it'd be a huge thing, but I'd lift the `Math.Log(2)` calculation out of the loop.

    By the way, this could also be used for fraudulent vote detection. Past a certain number of votes, if there are only votes for "user x", the entropy of the different votes is zero: there is no additional information, and the votes should be discarded, whether they're upvotes or downvotes. It could be difficult to find a meaningful threshold, though.

    Instead of the above, just compress the string using a few algorithms. The above seems to correspond to Huffman coding. LZ would allow character tuples to be factored into the entropy (so `ababab` has less entropy than `aabbab`). The neat thing is you could build more than one compression model (keyboard-position based for asdf, for example). Picking among models needs to be given a "cost" in theory (since enough models make anything low-entropy).

    Are you sure about using natural logarithms for binary systems? I need to calculate this, but as I am no mathematician, this is a hard topic for me.

    @LinuxSecurityFreak I'm a bit late, but Log2(x) == LogY(x)/LogY(2)
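
    In C# that identity is also exposed directly by the two-argument overload of `Math.Log`; a quick check:

        // log2(x) computed two ways: by change of base, and via the
        // built-in Math.Log(value, base) overload.
        double viaChangeOfBase = Math.Log(8) / Math.Log(2);  // 3
        double viaOverload     = Math.Log(8, 2);             // 3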

  • In theory you can measure entropy only from the point of view of a given model. For instance, the digits of pi are well distributed, but is the entropy actually high? Not at all, since the infinite sequence can be compressed into a small program that calculates all the digits.



    I won't dig further into the math side, since I'm not an expert in the field. But I want to suggest a few things that can make for a very simple but practical model.



    Short strings



    To start, distribution. Comparing identical characters is in some sense a special case of this, but the generalization is to build a frequency table and check the distribution.



    Given a string of length N, how many 'A' chars should I expect on average, given my model (which could be the English distribution, or a natural distribution)?



    But then what about "abcdefg"? There is no repetition here, but this is not random at all.
    So what you want here is to also take the first derivative of the string, and check the distribution of that derivative.



    It is as trivial as subtracting each character from the one that follows it, so in our example string this turns into: "abcdefg" => 1,1,1,1,1,1



    Now what about "ababab"...? This will appear to have a better distribution, since the derivative is 1,-1,1,-1,..., so what you actually want here is to take the absolute value.
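
    As a sketch of that first-derivative test (my naming, not code from this answer; C# since the rest of the thread uses it):

        // Absolute differences between consecutive characters: "abcdefg"
        // yields 1,1,1,1,1,1 while "ababab" yields 1,1,1,1,1 once the
        // absolute value flattens the alternating sign.
        public static IEnumerable<int> FirstDerivative(string s)
        {
            for (int i = 1; i < s.Length; i++)
                yield return Math.Abs(s[i] - s[i - 1]);
        }

    The resulting sequence can then be run through the same frequency-table check as the raw characters.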



    Long strings



    If the string is long enough, the no-brainer approach is: try to compress it, and calculate the ratio between the size of the compressed output and that of the input.
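
    A minimal sketch of that ratio, assuming the framework's `GZipStream` as the compressor (any compressor would do):

        using System.IO;
        using System.IO.Compression;
        using System.Text;

        // Compressed-size / original-size ratio as a rough entropy estimate:
        // highly repetitive strings compress well and score low. Assumes a
        // non-empty input string.
        public static double CompressionRatio(string s)
        {
            byte[] input = Encoding.UTF8.GetBytes(s);
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                    gzip.Write(input, 0, input.Length);
                return (double)output.ToArray().Length / input.Length;
            }
        }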


    It's tricky ... `asdfghjkl;` is a pretty crappy string as well.

    @Sam: your string will actually be flagged as low entropy by the first-derivative test. Of course here you are changing the model to one based on the position of chars on a keyboard, which is also a good model. You can add it to the mix as well, of course.

    Very interesting approach. Keep in mind our entropy tests are mainly there for really short strings. Here is the classic example of where we use it ( http://stackoverflow.com/review/low-quality-posts?pagesize=15&filter=day ) in conjunction with a few other algorithms.

    You cannot tell from looking at a string whether it was randomly produced (abc). If you pick 3 characters from a uniform distribution, abc, aaa, zzz, zur and apk all have equal probability. Of course, in your example, you chose abcdefg intentionally and not randomly, but this doesn't prove a random generator couldn't have produced it.

  • How about actually computing entropy? Also, it's not clear that character-level entropy will help, but here goes. It's in my mother tongue C++, but surely you can convert this to Java using Array instead of std::vector.



    #include <cmath>
    #include <vector>

    float CharacterEntropy(const char *str) {
        std::vector<unsigned> counts(256);
        for (const char *i = str; *i; ++i)
            ++counts[static_cast<unsigned char>(*i)];
        unsigned int total = 0;
        for (unsigned i = 0; i < 256; ++i)
            total += counts[i];
        float total_float = static_cast<float>(total);
        float ret = 0.0f;
        for (unsigned i = 0; i < 256; ++i) {
            if (counts[i] == 0)
                continue;  // 0 * log(0) is taken as 0; logf(0) would give -inf
            float p = static_cast<float>(counts[i]) / total_float;
            ret -= p * logf(p);
        }
        return ret / static_cast<float>(M_LN2);  // convert nats to bits
    }

    be careful that 0 * log(0) -> 0

    It's not Java - I guess it's C#. In Java it is 'String' not 'string'. :)

  • Similar to zngu's answer, I think that rather than just counting the number of characters, it would be better to calculate the character-entropy of the message:



    public double CalculateEntropy(string entropyString)
    {
        Dictionary<char, int> characterCounts = new Dictionary<char, int>();
        foreach (char c in entropyString.ToLower())
        {
            if (c == ' ') continue;
            int currentCount;
            characterCounts.TryGetValue(c, out currentCount);
            characterCounts[c] = currentCount + 1;
        }

        // Math.Log is the natural logarithm here, so the result is in
        // nats rather than bits.
        IEnumerable<double> characterEntropies =
            from c in characterCounts.Keys
            let frequency = (double)characterCounts[c] / entropyString.Length
            select -1 * frequency * Math.Log(frequency);

        return characterEntropies.Sum();
    }


    It seems to work well with both code and text, but note that it is not calculating the actual entropy of the string, only the entropy of the character-distribution; sorting the characters within the string should reduce the entropy of the string, but it does not reduce the result of this function.



    Here are some tests:



    private void CalculateEntropyTest(object sender, EventArgs e)
    {
        string[] testStrings = {
            "Hello world!",
            "This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs",
            String.Join("", "This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs".ToCharArray().OrderBy(o => o).Select(o => o.ToString()).ToArray()),
            "Won't this work too?\nstring name = \"lltt\";\nint uniqueCharacterCount = name.Distinct().Count();\nwill return 2",
            "Pull the entropy finding source from any compression algotithm, i.e. Huffman",
            "float CharacterEntropy(const char *str) {\n std::vector<unsigned> counts(256);\n for (const char *i = str; *i; ++i)\n ++counts[static_cast<unsigned char>(*i)];\n unsigned int total = 0;\n for (unsigned i = 0; i < 256; ++i)\n total += counts[i];\n float total_float = static_cast<float>(total);\n float ret = 0.0;\n for (unsigned i = 0; i < 256; ++i) {\n float p = static_cast<float>(counts[i]) / total_float;\n ret -= p * logf(p);\n }\n return p * M_LN2;\n}",
            "~~~~~~No.~~~~~~",
            "asdasdasdasdasdasd",
            "abcdefghijklmnopqrstuvwxyz",
            "Fuuuuuuu-------",
        };
        foreach (string str in testStrings)
        {
            Console.WriteLine("{0}\nEntropy: {1:0.000}\n", str, CalculateEntropy(str));
        }
    }





    Results:

    Hello world!

    Entropy: 1.888



    This is a typical english sentence containing all the letters of the english language - The quick brown fox jumped over the lazy dogs

    Entropy: 2.593



    -TTaaaaaaabccccddeeeeeeeeeeeeeeffgggggghhhhhhhiiiiiiiijk
    lllllllmnnnnnnnnnooooooppqrrrsssssssttttttttuuuvwxyyz

    Entropy: 2.593



    Won't this work too?
    string name = "lltt";
    int uniqueCharacterCount = name.Distinct().Count();
    will return 2

    Entropy: 2.838



    Pull the entropy finding source from any compression algotithm, i.e. Huffman

    Entropy: 2.641



    float CharacterEntropy(const char *str) {
    std::vector<unsigned> counts(256);
    for (const char *i = str; *i; ++i)
    ++counts[static_cast<unsigned char>(*i)];
    unsigned int total = 0;
    for (unsigned i = 0; i < 256; ++i)
    total += counts[i];
    float total_float = static_cast<float>(total);
    float ret = 0.0;
    for (unsigned i = 0; i < 256; ++i) {
    float p = static_cast<float>(counts[i]) / total_float;
    ret -= p * logf(p);
    }
    return p * M_LN2;
    }

    Entropy: 2.866



    ~~~~~~No.~~~~~~

    Entropy: 0.720



    asdasdasdasdasdasd

    Entropy: 1.099



    abcdefghijklmnopqrstuvwxyz

    Entropy: 3.258



    Fuuuuuuu-------

    Entropy: 0.892







    Actually, I think it would be better to do some frequency analysis, but I don't know anything about the frequencies of symbols used in code. The best place to determine that would be the Stack Overflow data dump - I'll have to get back to you after it finishes downloading, in 2 years.



    1. I don't understand the point of the bool. You never appear to set it to false, so we can use a List<T> instead.

    2. Given that you want just unique items, we can just use HashSet<T> instead.



    Haven't tested, but this method should be equivalent and faster:



    /// <summary>
    /// returns the # of unique characters in a string as a rough
    /// measurement of entropy
    /// </summary>
    public static int Entropy(this string s)
    {
        var hs = new HashSet<char>();
        foreach (char c in s)
            hs.Add(c);
        return hs.Count;
    }

    While I agree that using a `HashSet` is clearer than using a `Dictionary` and just ignoring its values, I don't see any reason why it would be faster.

  • Why not divide the number of unique characters in the given string by the total number of characters in that string? That would give a more accurate measure of entropy.



    For example, going by your formula, an entropy of 3 for a string of 5 characters might be fine, but an entropy of 3 for a string of 8 characters is poor. Your formula wouldn't be able to differentiate between the two results, whereas the formula given above would, giving a more accurate measure. A sketch of it follows.
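
    A minimal sketch of that normalized measure (my naming; it is just the unique-character ratio and needs `System.Linq`):

        // Fraction of the string's characters that are unique: 1.0 means all
        // distinct, values near 0 mean heavy repetition.
        public static double NormalizedEntropy(this string s) =>
            s.Length == 0 ? 0.0 : (double)s.Distinct().Count() / s.Length;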


  • I think antirez is right in suggesting that an entropy approach needs a model. So assuming we're talking English, examining the string's character distribution and how closely it aligns with the "average" English distribution will likely show whether the text is mostly English. But is this what you want to achieve? Presumably many things are code or pseudo-code. Compression is a great idea, but it'll give the highest entropy for random text - is high entropy bad? Low entropy would indicate lots of repetition, maybe verbosity, but one can write long drawn-out sentences with frilly words and transmit little information (e.g. this comment).


  • I just whipped this algorithm together, so I have no idea how good it is. I fear that it will cause an overflow exception if used on very long strings.



    Key concepts of this algorithm:




    • When encountering a character for the first time, the maximum value is added to the un-normalized entropy total. The "maximum value" is the length of the string.

    • If a character is encountered again, then we count the number of positions between this occurrence and the last occurrence, then we subtract the total number of times this character has appeared in the string. We then add that value to the un-normalized entropy total.

    • Once the final un-normalized entropy total has been calculated, it is divided by the length of the string in order to "normalize" it.



      public static int Entropy(this string s)
      {
          int entropy = 0;

          var mapOfIndexByChar = new Dictionary<char, CharEntropyInfo>();

          int index = 0;
          foreach (char c in s)
          {
              CharEntropyInfo charEntropyInfo;
              if (mapOfIndexByChar.TryGetValue(c, out charEntropyInfo))
              {
                  // If this character has occurred previously, then only add the number of characters from
                  // the last occurrence to this occurrence, and subtract the number of previous occurrences.
                  // Many repeated characters can actually result in the entropy total being negative.
                  entropy += ((index - charEntropyInfo.LastIndex) - charEntropyInfo.Occurrences);

                  // update the last index and number of occurrences of this character
                  mapOfIndexByChar[c] = new CharEntropyInfo(index, charEntropyInfo.Occurrences + 1);
              }
              else
              {
                  // each newly found character adds the maximum possible value to the entropy total
                  entropy += s.Length;

                  // record the first index of this character
                  mapOfIndexByChar.Add(c, new CharEntropyInfo(index, 1));
              }
          }

          // divide the entropy total by the length of the string to "normalize" the result
          return entropy / s.Length;
      }

      struct CharEntropyInfo
      {
          int _LastIndex;
          int _Occurrences;

          public int LastIndex
          {
              get { return _LastIndex; }
          }
          public int Occurrences
          {
              get { return _Occurrences; }
          }

          public CharEntropyInfo(int lastIndex, int occurrences)
          {
              _LastIndex = lastIndex;
              _Occurrences = occurrences;
          }
      }



    A quick test:



    var inputs = new[]{
        "Hi there!",
        "Hi there, bob!",
        "ababababababababababababab",
        @"We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality.

    I whipped up this simple method which counts unique characters in a string, but it is quite literally the first thing that popped into my head. It's the ""dumbest thing that works""."
    };

    foreach (string s in inputs)
    {
        System.Console.WriteLine("{1}: \"{0}\"", s, s.Entropy());
    }


    Resulting entropy values:




    • 7: "Hi there!"

    • 10: "Hi there, bob!"

    • -4: "ababababababababababababab"

    • 25: "We're calculating entropy of a string ..."


Licensed under CC-BY-SA with attribution

