2600 Article: Password (In)Security
A long while back I wrote an article on the state of passwords online, and submitted it to 2600. It was accepted, and was published on page 6 of the 28-1 edition. It's been long enough now (like, a year!) that I've decided to archive it on my site in full (and with all the spelling/grammar issues intact), just so it doesn't get lost in the shuffle. So, without any further ado....
Recently I've heard a lot of talk, both on the internet and around the water cooler, regarding password security and how bad it is. Not to say that using a username and password is a bad method of securing resources, but most folks are claiming that users are choosing poor passwords. This got me thinking: how bad are passwords out there in the wild, REALLY? Is there actually a pandemic of stupidity among users that needs to be addressed?
Before we jump into making value-based judgments about passwords, we'd better lay down some ground rules about what makes a password good, and what makes it worthless. You may agree or disagree with these criteria, but the things that come to my mind right away are: sufficient length, mixed upper and lower case, and special characters. On the other hand, things that make a password bad include using dictionary words, dates, or a password that is the same as the username or a slight variant of it.
So we're on this journey to find out how bad passwords actually are in the wild, and we have laid down specific rules about what makes a password good or bad, so now let's talk about the data set I use and the methods I use to gather information. The data set is relatively large and contains credentials from multiple websites, none of which have much, if any, user-overlap (meaning each site caters to a different crowd; the credentials aren't all from, say, music sites). That's one of the biggest things going for this experiment, in my opinion. A while back there was a data set leaked containing millions of passwords from users of a single site, and a lot of conclusions about password (in)security were drawn from it. If my undergrad statistics course taught me anything, it's that the results are only as good as the data, so it was very important that I ensure my data set be as diverse as possible.
Also, as a quick note, I won't say how I got my hands on all this beautiful data, but please feel free to use your imaginations...
The tools I use to analyze the data are homegrown Windows apps written in C#, and are largely used for CSV manipulation and basic statistical analysis. The process to get all the data together was an arduous one, and required spending a LOT of time parsing different data formats and pulling only the information I wanted from the records (username and password). In the end, though, I was left with a HUGE .csv file ready for tearing apart and inspecting. And what a wealth of information it turned out to be!
For the most part, the results are about what I was expecting, though there were a few strange statistics that made me think a bit. The first thing I looked at was the distribution of password lengths. While it's the simplest statistic, it's probably one of the most important factors in determining if a password is good or bad since passwords that aren't long enough have the potential to be brute-forced in a trivial amount of time.
|Passwords By Length|
This is about how I expected the passwords to be distributed, actually. One thing I do wonder is if password rules on some of the sites this data is from are skewing the results a bit, or if users are picking passwords that are 6-8 characters on their own. While a password that is only 6 characters long won't stand up very long to a brute force attack, 8 characters will do pretty well.
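To put rough numbers on the difference between 6 and 8 characters, here is a minimal Python sketch. The 62-symbol alphabet (upper case, lower case, digits) and the guess rate are assumptions chosen for illustration, not figures from the data set.

```python
# Assumed alphabet: 26 lower case + 26 upper case + 10 digits.
ALPHABET_SIZE = 62
# Assumed attacker speed; purely illustrative.
GUESSES_PER_SECOND = 1_000_000


def search_space(length, alphabet_size=ALPHABET_SIZE):
    """Number of possible passwords of exactly `length` characters."""
    return alphabet_size ** length


def worst_case_seconds(length, alphabet_size=ALPHABET_SIZE,
                       guesses_per_second=GUESSES_PER_SECOND):
    """Time to try every password of this length at the assumed rate."""
    return search_space(length, alphabet_size) / guesses_per_second


if __name__ == "__main__":
    for length in (6, 8):
        hours = worst_case_seconds(length) / 3600
        print(f"{length} characters: {search_space(length):,} candidates, "
              f"about {hours:,.1f} hours at the assumed rate")
```

Whatever rate you plug in, the ratio stays the same: with this alphabet, two extra characters multiply the work by 62 x 62 = 3,844.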
The next thing I looked at was how many passwords were using dictionary words. I used a standard English dictionary, but stripped it of any word under 4 characters long to get a better idea of what actually IS a match and what was just coincidence. In addition to checking for exact dictionary matches, I also checked for passwords that contain a dictionary word plus a modifier of at most two characters. So the password "bicycle54" would count as a partial dictionary match, but "1$bicycle54" would not count. So, how did these passwords stand up to the mighty dictionary?
|Exact Dictionary Matches|
|Total exact matches||13.74%|
|5 letter words||13.52%|
|6 letter words||43.87%|
|7 letter words||24.40%|
|8 letter words||18.21%|
This statistic is higher than I thought it would be. Regardless of length, using a word found in a dictionary is a huge password faux pas, so to see more than 1 in 8 passwords fall into that category was surprising.
|Top 5 Dictionary Matches|
I can't believe that out of all the words in the dictionary, "password" is the most used for passwords STILL to this day. Actually, considering some of my users, it's not surprising in the least. One thing worth noting is that there is a great diffusion of passwords all across the dictionary, with "password" being the only word that accounted for more than 1% of the entries. On a similar note, passwords containing close matches to dictionary words met my expectations.
|Close Dictionary Matches (+/- 2 characters)|
|Total close matches||12.53%|
|6 letter passwords||22.56%|
|7 letter passwords||30.92%|
|8 letter passwords||27.71%|
|9 letter passwords||14.54%|
|10 letter passwords||4.27%|
This isn't surprising in the least. I know many non-technical people who will take a word, slap a few numbers at the end, and use it for their password. What really blows me away is that, when you combine these last two statistics, 26.27% of passwords are represented. I've seen dictionaries out there that cover MANY more words than mine, so this number can only get larger. That means that roughly 1/4 of the time, you can crack someone's password using a simple dictionary attack that only requires a couple million attempts. This is by no means fast, but the effort pales in comparison to what it takes to crack a password that doesn't contain a dictionary word/variant.
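The exact and close dictionary checks described earlier could be sketched roughly like this in Python. This is not the original C# tooling; the function names, the case-insensitive comparison, and the reading of "modifier" as up to two extra characters split between the front and back of the word are all assumptions.

```python
def build_wordlist(raw_words, min_length=4):
    """Lower-case the dictionary and drop words shorter than min_length."""
    return {w.lower() for w in raw_words if len(w) >= min_length}


def classify(password, words, max_extra=2):
    """Return 'exact', 'close', or 'none' for a single password.

    Assumption: a close match is a dictionary word with at most
    `max_extra` extra characters in total, before and/or after it.
    """
    candidate = password.lower()  # assumed: matching ignores case
    if candidate in words:
        return "exact"
    for prefix_len in range(max_extra + 1):
        for suffix_len in range(max_extra + 1 - prefix_len):
            if prefix_len == 0 and suffix_len == 0:
                continue  # that is the exact case, already handled
            core = candidate[prefix_len:len(candidate) - suffix_len]
            if core in words:
                return "close"
    return "none"


if __name__ == "__main__":
    words = build_wordlist(["bicycle", "password", "cat"])
    for pw in ("password", "bicycle54", "1$bicycle54"):
        print(pw, "->", classify(pw, words))
```

With the two examples from the text, "bicycle54" lands in the close bucket and "1$bicycle54" does not, because it carries four extra characters.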
Another common thing I saw while I was parsing all these files into a common format was dates. This got me wondering how many people actually used a date as their password. It turns out that only 6.21% of these passwords were dates or years. This is by no means a huge amount, but the space that you'd have to search for past dates is just over 700,000, which again is a small space when compared to passwords using more characters.
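A date check along these lines might look like the Python sketch below. The list of formats is an assumption, since the article doesn't say which date layouts were counted. The article also doesn't say how the 700,000 figure was derived; one reading that lands in the same range is every calendar day across roughly 2,000 years, which is what `days_through_year` computes.

```python
import re
from datetime import date, datetime

# Assumed date layouts; the original analysis may have used others.
DATE_FORMATS = ("%m%d%Y", "%m%d%y", "%m/%d/%Y", "%Y-%m-%d")
# Assumed: a bare four-digit year in the 1900s or 2000s also counts.
YEAR_ONLY = re.compile(r"(19|20)\d\d")


def looks_like_date(password):
    """True if the whole password parses as a year or one of DATE_FORMATS.

    strptime is lenient about zero padding, so this over-counts some
    all-digit passwords; a real analysis would want stricter rules.
    """
    if YEAR_ONLY.fullmatch(password):
        return True
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(password, fmt)
            return True
        except ValueError:
            continue
    return False


def days_through_year(year):
    """Number of calendar days from 1 January of year 1 through `year`."""
    return date(year, 12, 31).toordinal()


if __name__ == "__main__":
    for pw in ("01152004", "1987", "bicycle54"):
        print(pw, "->", looks_like_date(pw))
    print("days through year 2000:", f"{days_through_year(2000):,}")
```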
The last statistic, and the one that makes good passwords great, is a mix of characters. If a password contains a broader range of characters (letters, numbers, special characters) then the search space grows significantly. So, do people make good use of this?
The "Mixed Case" statistic caught my eye because it was much lower than I expected. I went back and started tracing statements in my code to see if I was doing something wrong. It turns out the number is correct and there are a few things that can account for it. Users might be creating passwords that are mixed case, but the places storing this information may not be storing them in mixed-case format. Using mixed case automatically adds another 26 potential characters to the set each position in the password can draw from, and should be done often.
The fact that nearly 1/2 of users are using special characters is good, since it's another way to further expand the space a potential attacker has to search. The same goes for numbers. I suspect there is a lot of overlap in the "Special Character" and "Numbers" statistics, and even some with the mixed case number as well. People who follow good password practices will have at least one of each in their passwords.
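To check the overlap between these statistics, each password can be tagged with the character classes it uses. The Python sketch below is an illustration under assumptions: "special character" is taken to mean anything in `string.punctuation`, and the helper names are hypothetical.

```python
import string

# Size of each character class; "special" is assumed to be ASCII punctuation.
CLASS_SIZES = {
    "lower": 26,
    "upper": 26,
    "digit": 10,
    "special": len(string.punctuation),
}


def character_classes(password):
    """Report which character classes appear in the password."""
    return {
        "lower": any(c in string.ascii_lowercase for c in password),
        "upper": any(c in string.ascii_uppercase for c in password),
        "digit": any(c in string.digits for c in password),
        "special": any(c in string.punctuation for c in password),
    }


def alphabet_size(password):
    """Size of the character set an attacker has to cover for this password."""
    classes = character_classes(password)
    return sum(size for name, size in CLASS_SIZES.items() if classes[name])


def summarize(passwords):
    """Percentage of passwords with mixed case, digits, and specials."""
    total = len(passwords)
    tags = [character_classes(p) for p in passwords]
    return {
        "mixed_case": 100.0 * sum(t["lower"] and t["upper"] for t in tags) / total,
        "digit": 100.0 * sum(t["digit"] for t in tags) / total,
        "special": 100.0 * sum(t["special"] for t in tags) / total,
    }


if __name__ == "__main__":
    for pw in ("password", "Password", "Passw0rd!"):
        print(pw, "->", alphabet_size(pw), "possible characters per position")
```

Counting the passwords where more than one flag is set at once would answer the overlap question directly.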
There are many more statistics we can pull from this data, but I think I've covered all the big ones. So, how bad is the state of online password security these days? That'll still depend on who you ask, but I'd say it could be worse. The thing to keep in mind here is that all these passwords are for online systems, where every guess has to go through the site itself, which increases the time needed to brute force a password by many orders of magnitude. So, online password standards are less important than in other systems (don't get me wrong, using "password" as your password is just plain idiotic). But keep in mind that all the big hacks in the past few months that have compromised high profile accounts (like Sarah Palin's email, for example) involved insecurities elsewhere in the system, not poor passwords.
Considering this, how can people make their passwords more secure? Well, a good start is to use passwords that are of sufficient length (I'd say nothing under 8 characters long) and use at least one number, special character, and upper/lower case character in the password. Nothing adds time to a brute force job faster than expanding the set of characters the password can contain! None of what I just said is new or exciting, but users are still showing either a lack of knowledge or complete disregard for basic password policy.
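As a sketch of how a site could enforce these rules at sign-up, the Python function below returns the list of rules a candidate password breaks. The function name and messages are hypothetical, and "special character" is again assumed to mean ASCII punctuation.

```python
import string

MIN_LENGTH = 8  # the minimum suggested in the text


def policy_violations(password, min_length=MIN_LENGTH):
    """Return the rules the password breaks; an empty list means it passes."""
    problems = []
    if len(password) < min_length:
        problems.append("too short")
    if not any(c in string.digits for c in password):
        problems.append("no number")
    if not any(c in string.punctuation for c in password):
        problems.append("no special character")
    if not any(c in string.ascii_uppercase for c in password):
        problems.append("no upper case letter")
    if not any(c in string.ascii_lowercase for c in password):
        problems.append("no lower case letter")
    return problems


if __name__ == "__main__":
    for pw in ("password", "Tr1cky!pass"):
        print(pw, "->", policy_violations(pw) or "ok")
```

Returning every broken rule at once, rather than stopping at the first, lets the sign-up form tell the user everything that needs fixing in one go.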
If things are to change, developers are going to bear the brunt of the responsibility. It's up to us to create the security policy and enforce it as a standard; do that, and even though you might have to drag your users kicking and screaming all the way, passwords in general will become better. Developers also need to be more aware of the security risks facing their systems, and have appropriate policies in place for dealing with passwords (password recovery, too many bad password attempts, etc.). And I'm not trying to pass the blame or anything: I'm a code monkey myself, and as painful as it is to admit, the burden falls mostly to us.
If anyone has any input regarding the article, drop me an email at firstname.lastname@example.org; I'd love to talk more about it. And the information in the article can only be as good as the data behind it, so if some of you folks out there happen to send me more information to work with, we'll have an even better idea about the state of password affairs online.
Thanks to all the folks that make 2600 happen, you guys/gals rock! And a very special "big ups" to C.M.F. and colonelxc!