Formerly a mathematician, I started programming on the back-end, specializing in algorithm design and implementation, in C if I needed the
performance or in Haskell otherwise. After a couple of years I started branching out, doing front-end development on Android and the web, and tackling a broader range of applied problems.
For a while I was a full-stack engineer working in Python/Django and JavaScript at drchrono, an electronic health records startup.
I've now moved all the way to Frontend Engineer, working at Google in Mountain View, CA.
You can see many of my projects here, as well as on my GitHub.
Password Strength
Python
Security
Mark Burnett's
release of 10M passwords got me wondering just how strong people's passwords are on average.
SPOILER: NOT VERY.
So I designed a Python tool
to estimate how many bits of entropy a password has against
a strong attacker. It considers several plausible strategies an attacker might use,
and returns the lowest number of bits the password has against any of them.
The strategies used are:
- Random passwords using various charsets. This includes guessing that small subsets of charsets are being used; e.g. "aaaaaaaaaaaaaaaaaaaaa" is not a good password.
- Dictionary attacks using a list of common formats (including capitalization, l33t and symbols), in order of increasing complexity, and several possible sets of dictionaries:
  - Names
  - Nouns, both short and long lists
  - Words, both short and long lists
- Recursive strategies where passwords are assumed to consist of multiple plausible passwords concatenated.
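The "take the minimum over strategies" idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual tool: the charset list and the distinct-character heuristic below are assumptions for exposition, and the real tool also runs the dictionary and recursive strategies.

```python
import math
import string

# Standard charsets an attacker might guess the password was drawn from,
# smallest first (an illustrative list, not the tool's actual one).
CHARSETS = [
    string.ascii_lowercase,
    string.ascii_letters,
    string.ascii_letters + string.digits,
    string.ascii_letters + string.digits + string.punctuation,
]

def charset_bits(password: str) -> float:
    """Entropy if the password were uniform over the smallest charset containing it."""
    candidates = [c for c in CHARSETS if all(ch in c for ch in password)]
    if not candidates:
        return len(password) * 8.0  # fall back to arbitrary bytes
    return len(password) * math.log2(len(min(candidates, key=len)))

def subset_bits(password: str) -> float:
    """Entropy if the attacker guesses the exact set of characters used.

    This is what makes "aaaaaaaaaaaaaaaaaaaaa" weak: only one distinct character.
    """
    return len(password) * math.log2(max(len(set(password)), 2))

def strength_bits(password: str) -> float:
    """Lowest estimate across the strategies sketched here."""
    return min(charset_bits(password), subset_bits(password))
```

A password only counts as strong if it survives every strategy, which is why the minimum (not the average) is what matters.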
These strategies cover most common methods of password creation. To use
XKCD's example, "Tr0ub4dor&3" is a bad password while
"correcthorsebatterystaple" is good. (The comic's spelling is slightly off: "Troubador" is properly spelled "Troubadour".) With the corrected spelling, we have:
Password                    XKCD's estimate (bits)   My estimate (bits)
Tr0ub4dour&3                28                       30
correcthorsebatterystaple   44                       50
Applied to Mark's password list, the results are alarming:
Threat Model             Appropriate Services             Bits of Entropy   # Users   % Users
Your little sibling      Piggy bank combination           0-19              3178432   31.8%
Mischievous friends      Forum accounts                   20-29             3280626   32.8%
Jealous ex               Photo storage, social media      30-39             1976416   19.8%
Organized crime          Financial accounts               40-49             1009985   10.1%
Governments              Email, keyrings                  50-59             308812    3.1%
Determined governments   PGP keys, underground networks   60-999            245702    2.5%
What caused this situation, and what can we do about it?
The primary problem is that we've trained anti-patterns into users.
The usual process of creating a password goes something like this, with the user stopping as soon as a password is accepted:
> Enter password: password
Error: password must contain a number
> Enter password: passw0rd
Error: password must use mixed case
> Enter password: Passw0rd
Error: password must contain a symbol
> Enter password: Passw0rd!
Error: password must not contain a word
> Enter password: P4ssw0rd!
Combine this with bizarre length limits (USAA, for example, only allows passwords up to 12 characters,
which is what originally prompted my password generator),
and it is easy to see why most passwords are simple patterns applied to single words.
Compare instead a password like "correcthorsebatterystaple", which fails even the first test.
To get something like real security and overcome people's training, I think we need intelligent
password strength tests on websites, combined with correct instructions on making strong passwords.
Password Generator
Bash
JavaScript
Security
This tool combines a master password with the name of a service and takes the SHA-256 hash, so a single memorized secret yields a distinct password for every service.
An (old) bash version is available on my GitHub,
and an android version is available on the Google Play store as
SHA-PASS, with code also on
GitHub.
This is a Javascript version, the source of which is here.
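The core idea fits in a few lines of Python. This is a minimal sketch: the separator, output encoding, and truncation length below are my own illustrative choices, not necessarily what SHA-PASS actually uses.

```python
import hashlib
import base64

def derive_password(master: str, service: str, length: int = 16) -> str:
    """Derive a per-service password from a master password.

    Note: the ":" separator, urlsafe base64 encoding, and 16-character
    truncation are assumptions for illustration only.
    """
    digest = hashlib.sha256(f"{master}:{service}".encode("utf-8")).digest()
    # Base64 keeps the result typeable; truncate to the desired length.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]
```

The same master password and service name always reproduce the same derived password, while different services yield unrelated ones, so nothing needs to be stored anywhere.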
Budget Supercomputer
C
Mathematics
Like most hackers, I have far more uses for computing power than I have computing power.
Unfortunately, even Cray's new budget supercomputer, the XC30-AC, is a tad out of my price range at $250k.
Fortunately you can buy that much processing power on AWS for $10/hr and hack together a Beowulf cluster.
Who could resist?
The Problem
I was doing some research with a professor on the representation theory of the symmetric group S_n,
for which we needed to compute some very large and expensive matrices called character tables.
The algorithmic and computational difficulty of the problem is nicely summed up by the complexity:
T(n) ∈ Θ(n^(5/2 + √2/2) · p(n)^(1 + √2))
Here p(n) is the number of partitions of n, which grows roughly like exp(√n).
In my case n=46, hence the necessity of a supercomputer.
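To get a feel for how fast p(n) grows, here is the classic dynamic-programming computation of the partition numbers; this is an illustration only, unrelated to the actual character-table code.

```python
def partition_count(n: int) -> int:
    """Count the partitions of n by the standard coin-counting DP:
    p[total] accumulates the number of ways to build `total` using
    parts of size at most `part`."""
    p = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]
```

Even at n = 46 there are already over a hundred thousand partitions, and each one indexes a row and column of the character table, which is why the problem explodes so quickly.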
The paper describing the algorithm I was implementing claims a lower complexity by a factor of √n,
but this analysis rests on the assumption that sparse polynomials of length n but arbitrary degree and
arity can be added to dense polynomials in O(n) time. If there's a machine architecture that can do that,
I would dearly like to know.
The code for all this is on GitHub,
and the math is discussed in more depth on my
mathematics page.
Don't worry if the math is Greek to you.
The key facts are that these computations are 1) very hard and 2) embarrassingly parallel.
Pricing
AWS offers three different types of computation instances: Reserved, On-Demand and Spot.
Reserved instances are for long-term applications, so we have to decide between on-demand and spot.
On-demand instances give you a guaranteed instance for as long as you want it.
You bid on a spot instance; as soon as the price falls below your bid, you'll get the instance,
and as soon as the price rises above, you'll get kicked off. No saving, no warning, just poof!
You might expect (as I did) that the spot price of computing instances would vary about like the spot prices of commodities. It doesn't. The price is more or less constant most of the time, but occasionally you get a gigantic spike. Still, the average spot price is far below the on-demand price, so spot instances are what we'll use.
Instance Type
It is tempting to just go with c3.8xlarge instances and be done with it.
But since we know systems don't always perform as we expect, it is worth testing.
This turns out to be wise: I got almost 100% CPU usage on c3.4xlarge instances and smaller,
but only 65% usage on c3.8xlarge instances.
(I have since performed the same test on AWS's new c4.8xlarge instances, which seem to have
fixed the parallelization issues with the c3.8xlarge and give me almost 100% usage.)
I decided to launch 16 c3.4xlarge instances as slaves, for a total of 256 cores.
For a master I used an on-demand t2.medium instance, which offers decent performance when only used
intermittently, as was the case here.
Beowulf Cluster
AWS offers a Hadoop service which makes distributing computations trivial,
but it costs almost as much as the spot instances themselves. We're too cheap for that.
Since this problem is embarrassingly parallel, all the nodes need to do is agree on
indices from 0-15; then each node need only write its results every few minutes to a shared directory,
which the master can later put together. Enter the hack: I used NFS to create the shared directory,
then had each node write its IP address as a filename. Once the node detected that there were 16 such
files, it selected the index of its own IP address among the 16 and started computing.
# connect to master
sudo mkdir /mnt/master
sudo mount -t nfs %ipaddr%:%dir% /mnt/master
# perform handshake to get unique id
cd /mnt/master/handshake
touch "$HOSTNAME"
while [ $(ls -1 | wc -l) -lt %num_nodes% ]; do
:
done
id=$(ls -1 | grep -n "$HOSTNAME" | sed "s/:.*//")
let id=id-1
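Once each node has its id, splitting the work is trivial: node i takes every task whose index is congruent to i modulo the node count. This is a hypothetical sketch of that split, not the actual job script.

```python
def tasks_for_node(node_id: int, num_nodes: int, num_tasks: int) -> list[int]:
    """Indices this node is responsible for: those matching its id mod the node count."""
    return [t for t in range(num_tasks) if t % num_nodes == node_id]
```

Striding by node count needs no further coordination: every task is claimed by exactly one node, so after the handshake the nodes never have to talk to each other again.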
Complications
Working with distributed, virtualized systems is not without its pains.
The NFS client on one of my machines died a few hours in (at the start of my classes, of course)
and had to be restarted. Thankfully my program saved its progress every few minutes,
so this was relatively costless. All but one instance finished at roughly the same time, after about 16 hours.
For some reason--which must be related to virtualization somehow--one of my instances took
twice as long as all the others. It almost certainly wasn't poor balancing; it would be a very strange
world indeed where the complexity of the characters of S_46 depends heavily on their index modulo 16.
Storing Data in GIFs
C
Did you know that Flickr offers 2TB of free storage, but only for images?
If only there were a way around that...
These programs encode and decode arbitrary files as uncompressed GIFs.
The files increase in size by roughly 1/7, because uncompressed GIFs only allow
bytes in the range 0x00-0x7F in the data block.
The encoding maps every block of 7 bytes into a block of 8 bytes where the first
byte stores the first bit of each other byte.
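The 7-to-8-byte transform can be sketched in Python; this is an illustration of the scheme described above, not the actual C code.

```python
def encode_block(block: bytes) -> bytes:
    """Pack up to 7 bytes into 8 bytes, all in the range 0x00-0x7F.

    The first output byte collects the high bit of each input byte;
    the remaining bytes are the inputs with their high bits cleared.
    """
    assert len(block) <= 7
    header = 0
    body = bytearray()
    for i, b in enumerate(block):
        header |= ((b >> 7) & 1) << i
        body.append(b & 0x7F)
    return bytes([header]) + bytes(body)

def decode_block(block: bytes) -> bytes:
    """Invert encode_block: restore each byte's high bit from the header byte."""
    header, body = block[0], block[1:]
    return bytes(b | (((header >> i) & 1) << 7) for i, b in enumerate(body))
```

Seven high bits fit comfortably in the header byte (at most 0x7F), which is exactly why the block size is 7 and the overhead works out to 1/7.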
For example, the encoding program and its source:
You can see the 340-byte hard-coded header clearly in both images.
In the executable on the left it's the middle pinkish strip;
in the source on the right it's the big section with diagonal green lines going through it.
Formal Morality
Haskell
Morality
My project to mathematize everything could never be complete without mathematizing morality.
I model a moral theory as a groupoid of preorders of probability distributions of worldstates.
So it's pretty simple really. A lot of mathematical setup finally gives us the function we need:
-- judges which state is Just (punnyness >9000), or Nothing
judge :: MoralTheory -> PossibleState -> PossibleState -> Maybe PossibleState
judge theory state1 state2 =
    case ( comp (veil theory state1) (veil theory state2)
         , comp (veilMinusSelf theory state1) (veilMinusSelf theory state2)
         ) of
        (PGT, PGT) -> Just state1
        (PLT, PLT) -> Just state2
        _ -> Nothing
  where
    comp = pCompare $ fst $ head theory
If you're philosophically inclined you might notice the function "veil", which is indeed a reference
to Rawls. A more in-depth discussion and the code itself can be found on
GitHub.
Running a toy moral theory with 2 actors and 3 worldstates, we get:
Worldstates: ["Alice falling in love","Bob falling in love","Both getting $10,000"]
Alice's values: Lexicographical (Alice in love, Bob in love, money)
Bob's values: Bob in love with probability x is valued the same as money with probability 3x, Alice in love is valued least.
Alice -> Bob Correspondence:
Alice falling in love -> Bob falling in love
Bob falling in love -> Alice falling in love
Both getting $10,000 -> Both getting $10,000
Alice's duties:
(ProbDist [("Alice falling in love",0.0),("Bob falling in love",1.0),("Both getting $10,000",0.0)],ProbDist [("Alice falling in love",0.0),("Bob falling in love",0.0),("Both getting $10,000",1.0)])
Judgement: Just (ProbDist [("Alice falling in love",0.0),("Bob falling in love",1.0),("Both getting $10,000",0.0)])
Alice must choose Bob falling in love over the money, because both agree that the veiled situation of falling in love half the time is more valuable than 100% chance of getting the money. The unveiled situation happens to agree with Alice's personal values as well.
(ProbDist [("Alice falling in love",1.0),("Bob falling in love",0.0),("Both getting $10,000",0.0)],ProbDist [("Alice falling in love",0.0),("Bob falling in love",0.0),("Both getting $10,000",1.0)])
Judgement: Nothing
Alice values falling in love over getting the money, and Bob is indifferent for his equivalent situation. This would result in a duty to Alice, except it would be a duty to herself, which is impossible.
(ProbDist [("Alice falling in love",0.75),("Bob falling in love",0.25),("Both getting $10,000",0.0)],ProbDist [("Alice falling in love",0.0),("Bob falling in love",0.0),("Both getting $10,000",1.0)])
Judgement: Nothing
Alice, were she in Bob's shoes, would value Bob falling in love over the money at any probability, but when her values are not imposed on Bob he does not value the options this way. Thus no duty is created.
Bob's duties:
(ProbDist [("Alice falling in love",1.0),("Bob falling in love",0.0),("Both getting $10,000",0.0)],ProbDist [("Alice falling in love",0.0),("Bob falling in love",0.0),("Both getting $10,000",1.0)])
Judgement: Just (ProbDist [("Alice falling in love",1.0),("Bob falling in love",0.0),("Both getting $10,000",0.0)])
The second case above, from Bob's perspective. While Alice cannot have a duty to herself, Bob can have a duty to her. Note that the unveiled situation disagrees with Bob's own values, so in this case the moral theory has nontrivial implications.
Mouse Events Demo
HTML
JavaScript
I've noticed a lot of confusion around how mouse events work, such as the difference between mouseover and mouseenter,
what order events are fired in, and what happens when the user moves their mouse around while holding down a mouse button.
I've made a demo with an explanation of the various events; it's relegated to its own page to avoid cluttering this page with event listeners.