Thursday, January 21, 2016

Passwordocalypse

This is the second year in a row that I've seen an article decrying our collective cyber stupidity because of the awful passwords we use to protect ourselves. And this is the second year in a row that I've rolled my eyes very hard at the article because of its mathematical ignorance. This is the first year I've decided to blog about it, though.

The article linked to above lists the most popular passwords found in databases of stolen passwords. At the top of the list are groaners such as "password", "123456", and “qwerty”. How could those be the most popular, when everyone knows China is hacking its way into our country and we're using passwords to protect our finances, identities, and porn habits? How could everyone be so stupid?

Well, the truth is, very few people have to be stupid for those passwords to be the most popular. In fact, there's an easy to imagine scenario in which no one is so cyber-challenged. Let's see how.

The most popular password is the one that gets used more than any other individual password. This doesn't mean it's used by a majority of people, obviously, just as Donald Trump isn't supported by a majority of Republicans. Additionally, when we're ranking password popularity, we're doing so by login rather than by person, because that's how the data comes to us. So password popularity is measured as logins/password.

And what's being railed against in the above article is that the passwords with the highest login/password are bad ones. But what makes a password bad? Ease of guessing--those that take the least time to crack are the least secure.

This quality is quantified in a password's information entropy, which is a measure of the number of bits needed to specify the password. In other contexts, a piece of data's information entropy tells you how much that data can be compressed. The higher the entropy, the more bits needed to specify the data, the fewer bits you can get rid of and still preserve it.

When I think entropy, I think physics. Most people probably do, too, knowing it has something to do with thermodynamics and disorder. You probably know the second law of thermodynamics, which is usually stated as something like, "The entropy (disorder) of a system tends to increase."

The "tends to" there indicates that this is a probabilistic law. That is, if you have a box with octillions of gas molecules all bouncing around at different speeds and directions, it's hard to say exactly what they're going to do, but you can say what they're likely to do. And it turns out that a box of gas is more likely to go to a high entropy state than a low one. The reason is that there are many more high entropy states than low ones available.

This is where the connection to disorder comes in. The canonical example is probably an egg. An intact egg is a very ordered thing. It has a specific shape, and you can't change the shape of the egg without changing the fact that it's an intact egg. Thus order means low entropy, because there are only a small number of ways for an egg to be an egg.

Scrambled eggs, on the other hand, are disordered and high entropy. The high entropy results from the fact that you can rearrange your egg particles (eggs are made of egg particles, right?) in many, many different ways but still end up with the same basic breakfast: scrambled eggs.

How does this connect back to information and passwords? Because as the entropy of a system increases, it takes longer and longer to accurately describe the system in detail. With low entropy, high order systems, there might be one law of nature telling you why the system is shaped the way it is, which means it's easy to specify it in detail. But with a high entropy system, there are many microstates that are approximately the same, so you need to be more and more detailed if you want to specify a particular one. "No, the one with particle 1,036,782,561 going this way, not that way."

So high entropy data doesn't compress as easily because there are many high entropy systems, which means it takes a lot of bits to differentiate between two chunks of data. And this is also why high entropy passwords are more secure: because if you're randomly guessing a password, it takes you much, much longer to get through all the available high entropy passwords than it does the low entropy passwords.

But that's also why the least secure passwords will always be the most popular ones. Compared to the secure passwords, there just aren't that many bad passwords out there, because bad passwords are low entropy. The login/password for bad passwords is going to be high essentially by definition. Here's a toy model to demonstrate.

Mathematically, the entropy of a system (s) is proportional to the log of the number of microstates (n) that correspond to a single macrostate. Computer people like to do things in binary, so they use a log base of two: S = log2(n). Now let’s take some real data and see what we find. Using this website, I have found the entropy of each of the 25 most popular passwords. Their average entropy is 20.12. Using my password manager, I've found the average entropy of 10 randomly generated strong passwords (I got lazy, but the variation in entropy was low): 80.84.

So the average good password is ~4 times as strong as the average bad password. If we assume there are only 25 bad passwords (there are many more, but more makes the point even stronger), and that the population of logins (p) uses either good passwords or bad passwords, we can write an expression comparing password popularity (logins/password). For our model, let’s see what it would take for good passwords to be just as popular as bad passwords:

pbad/nbad = pgood/ngood

How do the number of good passwords compare to the number of bad ones? Well, from the log formula up there, if we multiply the strength of a bad password by 4, we get 4S = 4log2(n). From the rules of logs, we can take that 4 on the outside of the log and bring it in: 4S = log2(n4). So if you have n bad passwords, then you have n4 good passwords.

pbad/nbad = pgood/nbad4

Solving for the ratio of logins using bad passwords to good, we get:

pbad/pgood = 1/nbad3

Now let’s plug in nbad = 25.

pbad/pgood = 1/15625 = 0.000064

This means that as long as more than 0.0064% of all logins use bad passwords, they will be the most popular. Stating the converse, 99.9935% of all logins can use strong passwords, and the bad ones will still be more common.

Of course, in the real world, there are more than 25 bad passwords (and waaaay more than 254 good passwords), and people aren't divided up into binary good and bad password users. But I think this demonstrates that very few people need actually be stupid for the above article to be true.

And as I said, it's possible that no one is stupid because this is based on logins rather than users. All it takes is that more than 0.0064% of the time you need to pick a username and password for a site, it's a site for rating cat videos and you rightly don't care about security.

No comments:

Post a Comment