Monthly Archives: July 2014

Walking through a simple substitution cipher

While reading The Security Dialogue, I noticed the code contest and decided to give it a shot.  Here I present one way of many to solve it.  I enjoy solving newspaper cryptograms, but I don’t claim any real cryptanalytic experience, so take everything with a big grain of salt.

Given the following ciphertext, and assuming a simple substitution cipher:

jdc9)c9)4ds)9sz21x)z2xs)z214s94!))ud25vx)-25)es4)4dc9)8ced4)q1x)zq1)stqcv)ts)q4)9z8c6s1vfc1e[etqcv!z2t)wc894@)-25)7cvv)es4)q)}+&)ecw4)zq8x)42)Gtq=21!z2t!))jdq1f9)w28)3vq-c1e!

How to crack it?

First make some assumptions. At some point if you don’t get the unencrypted cleartext you may need to revisit these assumptions, but you have to start somewhere.  Knowing your target makes breaking codes much, much easier.  Sometimes you will gain more by spending a few hours researching rather than staring at the cipher.

I made the following assumptions:

  • Scriven truthfully relayed that he used a substitution cipher.
  • The message consists of one or more grammatically proper sentences in English.

Start by counting the frequency of each symbol in the ciphertext. You can do that manually with a message this short, but I wrote some basic Perl code to do it.  Run the code, paste in the ciphertext, press Ctrl-D to end the input, and it prints the character frequencies:

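#!/usr/bin/perl
use strict;
use warnings;

my %count;
# Read STDIN one character at a time, tallying everything except newlines.
while (defined(my $ch = getc(STDIN))) {
    $count{$ch}++ unless $ch eq "\n";
}

# Print the symbols from most to least frequent.
print "$_\t$count{$_}\n" foreach sort { $count{$b} <=> $count{$a} } keys %count;
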
)       31
c       12
4       11
2       11
q       10
s       9
1       9
z       8
9       8
v       7
e       7
t       6
d       6
8       5
!       5
x       5
5       3
-       3
w       3
f       2
j       2
}       1
G       1
&       1
+       1
3       1
6       1
[       1
@       1
u       1
7       1
=       1

The ‘)’ character appears 19 more times than any other symbol in the message and is distributed throughout it in a way that suggests it represents the blank space between words. I will assume for now that ‘)’ = ‘ ’. Having the ciphertext broken up into words makes the rest of the work much easier, so rewrite the message with this change.

jdc9 c9 4ds 9sz21x z2xs z214s94!  ud25vx -25 es4 4dc9 8ced4 q1x zq1 stqcv ts q4 9z8c6s1vfc1e[etqcv!z2t wc894@ -25 7cvv es4 q }+& ecw4 zq8x 42 Gtq=21!z2t!  jdq1f9 w28 3vq-c1e!

Of interest when you do this, ‘))’ appears twice, both times preceded by ‘!’. Going from the assumption that ‘)’ = ‘ ‘, this could indicate what we in the US currently call “French spacing”, or using two spaces after the end of a sentence instead of just one. Though considered deprecated in American English style guides, many people still use it (including me), and autocorrect on mobile devices even takes advantage of that to turn a double tap on the space bar into a period followed by a space and then a capital letter. This adds strength to the assumption and indicates we likely have three sentences. I don’t yet have a reason for why the assumed sentence-terminator ‘!’ sometimes appears in a word, but I will go with it for now.

With the ciphertext letter frequencies in hand, now you need English text letter and word frequencies. You can use ETAOIN SHRDLU as a mnemonic for the most frequently used letters in descending order if you want to keep things simple.
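One way to put the mnemonic to work is to rank the ciphertext symbols by frequency and line them up against it. The following is a minimal sketch of that idea, not a solver: the helper name `ranked_symbols`, the tie-break by character code, and the choice to skip ‘)’ and ‘!’ (on the assumption that they stand for a space and a full stop) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ciphertext = 'jdc9)c9)4ds)9sz21x)z2xs)z214s94!))ud25vx)-25)es4)4dc9)8ced4)q1x)zq1)stqcv)ts)q4)9z8c6s1vfc1e[etqcv!z2t)wc894@)-25)7cvv)es4)q)}+&)ecw4)zq8x)42)Gtq=21!z2t!))jdq1f9)w28)3vq-c1e!';

# Most frequent English letters in descending order (the mnemonic above).
my @english = split //, 'etaoinshrdlu';

# Rank symbols by descending count. Ties are broken by character code so the
# result is repeatable. Symbols listed after the text are skipped.
sub ranked_symbols {
    my ($text, @skip) = @_;
    my %skip = map { $_ => 1 } @skip;
    my %count;
    $count{$_}++ for grep { !$skip{$_} } split //, $text;
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return @ranked;
}

# Assumption: ')' is a space and '!' a full stop, so neither is a letter.
my @ranked = ranked_symbols($ciphertext, ')', '!');
for my $i (0 .. $#english) {
    last if $i > $#ranked;
    printf "%s -> %s (first guess only)\n", $ranked[$i], $english[$i];
}
```

With a message this short, several symbols tie and the ranking is only a starting hypothesis; the word patterns carry more weight than the raw counts.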

Look at the (assumed) words in the ciphertext. Make lists of all the words with only one letter, only two letters, only three letters, only four letters.  Note any that appear twice or more, and any repeated strings. I made this list by hand but you can write code to do it.

1 letter words: q
2 letter words: c9 ts q4 42
3 letter words: 4ds -25 es4 q1x zq1 -25 es4 w28
4 letter words: jdc9 z2xs 4dc9 7cvv ecw4 zq8x

No repeated 2 letter words
Repeated 3 letter words: -25 es4
No repeated 4 letter words

Repeated digrams (2 letters): c9 jd 4d z2 25 1x zq q1 21     
Repeated trigrams (3 letters): -25 es4 dc9 c1e z2t
Repeated fourgrams (4 letters): !z2t

Repeated letters: vv
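The lists above can also be produced by code. The sketch below is one possible way to do it, assuming ‘)’ separates words and a trailing ‘!’ ends a sentence; the helper names `cipher_words`, `repeated` and `ngrams` are made up for this example. It reports every repeat it finds, so its output is longer than the hand-made lists (for instance it also counts ‘}+&’ as a three letter word).

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ciphertext = 'jdc9)c9)4ds)9sz21x)z2xs)z214s94!))ud25vx)-25)es4)4dc9)8ced4)q1x)zq1)stqcv)ts)q4)9z8c6s1vfc1e[etqcv!z2t)wc894@)-25)7cvv)es4)q)}+&)ecw4)zq8x)42)Gtq=21!z2t!))jdq1f9)w28)3vq-c1e!';

# Split on the assumed space symbol ')' and drop the assumed full stop '!'
# from the end of each word.
sub cipher_words {
    my ($text) = @_;
    my @words = grep { length($_) } split /\)+/, $text;
    s/!+$// for @words;
    return @words;
}

# Items that occur more than once, sorted.
sub repeated {
    my %seen;
    $seen{$_}++ for @_;
    my @dups = sort { $a cmp $b } grep { $seen{$_} > 1 } keys %seen;
    return @dups;
}

# Every run of $n symbols inside each word (never spanning a space).
sub ngrams {
    my ($n, @words) = @_;
    my @grams;
    for my $w (@words) {
        push @grams, substr($w, $_, $n) for 0 .. length($w) - $n;
    }
    return @grams;
}

my @words = cipher_words($ciphertext);
for my $len (1 .. 4) {
    my @same = grep { length($_) == $len } @words;
    print "$len letter words: @same\n";
    print "  repeated: @{[ repeated(@same) ]}\n";
}
print "Repeated ${_}-grams: @{[ repeated(ngrams($_, @words)) ]}\n" for 2 .. 4;
```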

Notice the single one letter word: q. In English this can only mean one of the words “I” or “a”. The frequency of ‘q’ in the ciphertext also indicates a possible vowel.

Look for repeated digraphs, pairs or triplets of symbols that appear next to each other frequently. I already noticed ‘!))’ which may mean ‘.  ‘, but I also see ‘c9’ three times. Twice it ends a four letter word, once it stands alone as a two letter word. The ciphertext starts with “jdc9 c9 4ds”, or a four letter word followed by a two letter word made up from the last two letters of the preceding word. In English, “This is” or “What at” or “That at” or even “Shit it” all fit that pattern and can fit grammatically at the start of a sentence. The ‘d’ in the third (three letter) word yields the cleartext ‘h’ in each case, as the second letter of “this”, “what” and “shit”.  Many three letter words have ‘h’ as their second letter and can fit in the sentence I have so far: “This is why”, “This is the”, “What at the”.  I will throw out “That at” for now because I don’t like to see both ‘j’ and ‘9’ meaning ‘t’, unless he decided to sneakily use different symbols for the upper and lowercase versions of the same letter.
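The “four letter word followed by its own last two letters” pattern can be searched for mechanically. The sketch below assumes a tiny, hypothetical word list (a real run would use a full dictionary) and a made-up helper named `suffix_pairs`. It also applies the simple-substitution rule that four different ciphertext symbols need four different letters, which is what rules out “That at”.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mini word list for illustration; use a real dictionary in practice.
my @wordlist = qw(a i at is it this that what shit thus they when);

# Find "ABCD CD" pairs: a four letter word followed by the two letter word
# formed by its last two letters. The four ciphertext symbols j, d, c, 9 are
# all different, so the four letters must be different too.
sub suffix_pairs {
    my @list = @_;
    my %two_letter = map { $_ => 1 } grep { length($_) == 2 } @list;
    my @pairs;
    for my $word (grep { length($_) == 4 } @list) {
        my %letters = map { $_ => 1 } split //, $word;
        next unless keys(%letters) == 4;          # rejects "that"
        my $tail = substr($word, 2);
        push @pairs, "$word $tail" if $two_letter{$tail};
    }
    return @pairs;
}

print "$_\n" for suffix_pairs(@wordlist);
# prints:
#   this is
#   what at
#   shit it
```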

So assume for now with some confidence:

')' = ' '
'!' = '.'
'd' = 'h'

For clarity, when I rewrite the text with my substitutions, I will use capital letters for cleartext and lowercase letters for ciphertext (though the ciphertext contains a single capital ‘G’ that I will ignore for the moment).  Rewrite the text with the three substitutions so far:

jHc9 c9 4Hs 9sz21x z2xs z214s94.  uH25vx -25 es4 4Hc9 8ceH4 q1x zq1 
stqcv ts q4 9z8c6s1vfc1e[etqcv.z2t wc894@ -25 7cvv es4 q }+& ecw4 
zq8x 42 Gtq=21.z2t.  jHq1f9 w28 3vq-c1e.

I wrote some simple Perl code to handle rewriting the ciphertext pasted into it, configurable by adding new substitutions to the code.  I will use this going forward instead of substituting manually.

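#!/usr/bin/perl
use strict;
use warnings;

# Ciphertext symbol => cleartext character (capitals mark decoded text).
my %subst = (
    ')' => ' ',
    '!' => '.',
    'd' => 'H',
    # add more substitutions here following the same pattern
);

# Copy STDIN to STDOUT, replacing every symbol that has a known substitution.
while (defined(my $ch = getc(STDIN))) {
    print exists $subst{$ch} ? $subst{$ch} : $ch;
}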

Time now to make some guesses.  Earlier I suspected the first two words may encode “What at” or “This is”, and I also know that ‘q’ must represent ‘a’ or ‘I’, so let’s have a look at the ciphertext with those changes. In a simple substitution cipher, no cleartext character can come from two different ciphertext characters, so assume ‘q’ means ‘I’ if ‘c’ means ‘a’, and vice versa (‘c’ and ‘q’ cannot both map to the same letter).
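The one-to-one rule is easy to check in code before spending time on a candidate. This is a small sketch with a hypothetical helper named `conflicts`; it takes a substitution table and lists every cleartext letter that more than one ciphertext symbol claims.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one entry per cleartext letter claimed by more than one ciphertext symbol.
sub conflicts {
    my ($subst) = @_;
    my %sources;
    for my $symbol (sort keys %$subst) {
        push @{ $sources{ uc($subst->{$symbol}) } }, $symbol;
    }
    my @report;
    for my $letter (sort keys %sources) {
        my @symbols = @{ $sources{$letter} };
        push @report, "$letter <- @symbols" if @symbols > 1;
    }
    return @report;
}

# The "What at" guess with both readings of the one letter word 'q'.
my %what_at = (')' => ' ', '!' => '.', 'd' => 'H', 'j' => 'W', 'c' => 'A', '9' => 'T');
my %bad     = (%what_at, 'q' => 'A');
my %good    = (%what_at, 'q' => 'I');
print "q=A: ", join('; ', conflicts(\%bad))  || 'no conflicts', "\n";
print "q=I: ", join('; ', conflicts(\%good)) || 'no conflicts', "\n";
```

The check compares letters without regard to case, so a cipher that encodes a capital letter and its lowercase form with different symbols will be flagged and has to be judged by hand.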

All three candidates use the previous substitutions:

')' = ' '
'!' = '.'
'd' = 'h'

"What at":

'j' = 'W'
'c' = 'A'
'9' = 'T'
'q' = 'I'

WHAT AT 4Hs Tsz21x z2xs z214sT4.  uH25vx -25 es4 4HAT 8AeH4 I1x zI1 
stIAv ts I4 Tz8A6s1vfA1e[etIAv.z2t wA8T4@ -25 7Avv es4 I }+& eAw4 
zI8x 42 GtI=21.z2t.  WHI1fT w28 3vI-A1e.

"Shit it":

'j' = 'S'
'c' = 'I'
'9' = 'T'
'q' = 'A'

SHIT IT 4Hs Tsz21x z2xs z214sT4.  uH25vx -25 es4 4HIT 8IeH4 A1x zA1 
stAIv ts A4 Tz8I6s1vfI1e[etAIv.z2t wI8T4@ -25 7Ivv es4 A }+& eIw4 
zA8x 42 GtA=21.z2t.  SHA1fT w28 3vA-I1e.

"This is":

'j' = 'T'
'c' = 'I'
'9' = 'S'
'q' = 'A'

THIS IS 4Hs Ssz21x z2xs z214sS4.  uH25vx -25 es4 4HIS 8IeH4 A1x zA1 
stAIv ts A4 Sz8I6s1vfI1e[etAIv.z2t wI8S4@ -25 7Ivv es4 A }+& eIw4 
zA8x 42 GtA=21.z2t.  THA1fS w28 3vA-I1e.

Each of these seems like a start on a solution.  Where to go from here to give some weight to one choice or the other?  All three could produce a grammatical sentence given the first two words, though I’ve lost faith in “Shit it” at this point if I ever had any.

Take a look at the words where you almost have all of the letters translated, but not quite.  I see the original word “4dc9” which we have translated as either “-hat” or “-his”, and we have the original word “jdq1f9” which we have translated as either “tha--s” or “whi--t” (each ‘-’ stands for one letter not yet known).  That second one seems like a good candidate.  Now I need a word list. I will use a classic English word list from Donald E. Knuth.  You must use a word list appropriate for the cleartext you expect to find.  This would not help me for French text, nor would it help for government or corporate information which might contain many acronyms.

I have two possible six letter words identified: “tha--s” and “whi--t”.  Check the word list for words that match each pattern.  The following Perl command lines will do it, assuming you have a word list file named ‘wordlist.txt’.

$ perl -ne 'print if m/^tha[a-z]{2}s$/' wordlist.txt
thanks
$ perl -ne 'print if m/^whi[a-z]{2}t$/' wordlist.txt
whilst
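The two patterns above were typed by hand. As a sketch of how they could be generated instead, the hypothetical helper `pattern_for` below builds the regular expression from a ciphertext word and a substitution table: known symbols become their letters and each unknown symbol becomes a one-letter wildcard. The short word list is invented for the example and stands in for ‘wordlist.txt’.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a ciphertext word into a regex: known symbols become their letters,
# unknown symbols become a single-letter wildcard.
sub pattern_for {
    my ($cipher_word, $subst) = @_;
    my $body = join '', map {
        exists $subst->{$_} ? quotemeta(lc($subst->{$_})) : '[a-z]'
    } split //, $cipher_word;
    return qr/^$body$/;
}

# Hypothetical mini word list; the article uses a full list in wordlist.txt.
my @wordlist = qw(thanks thinks thumbs whilst whites);

my %this_is = ('d' => 'H', 'j' => 'T', 'c' => 'I', '9' => 'S', 'q' => 'A');
my %what_at = ('d' => 'H', 'j' => 'W', 'c' => 'A', '9' => 'T', 'q' => 'I');

for my $subst (\%this_is, \%what_at) {
    my $re = pattern_for('jdq1f9', $subst);
    print join(' ', grep { $_ =~ $re } @wordlist), "\n";
}
# prints:
#   thanks
#   whilst
```

A possible refinement, not shown here, is to exclude letters that are already assigned from the wildcard, which narrows the matches further as the table grows.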

So only one word fits for each possibility.  I will go out on a limb and assume he used the word “Thanks” rather than “Whilst”.  I follow him on Twitter and I’ve seen him say “thanks”, but never “whilst”. Speakers of American English simply don’t use “whilst” very often.  Let’s take a look at the text if we assume the word “jdq1f9” means “Thanks”.  We get two more letters, ‘1’ = ‘n’ and ‘f’ = ‘k’.

')' = ' '
'!' = '.'
'd' = 'h'
'j' = 'T'
'c' = 'I'
'9' = 'S'
'q' = 'A'
'1' = 'N'
'f' = 'K'

THIS IS 4Hs Ssz2Nx z2xs z2N4sS4.  uH25vx -25 es4 4HIS 8IeH4 ANx zAN 
stAIv ts A4 Sz8I6sNvKINe[etAIv.z2t wI8S4@ -25 7Ivv es4 A }+& eIw4 
zA8x 42 GtA=2N.z2t.  THANKS w28 3vA-INe.

Looking better here.  Three words possibly done and nothing else looks too wrong.  I want to get that third word, after “This is”.  So what three letter words match the pattern “-h-”?

$ perl -ne 'print if m/^[a-z]h[a-z]$/i' wordlist.txt
aha
chi
ohm
oho
phi
rho
she
shh
shy
the
tho
thy
who
why

Which of those words make sense in a sentence following “This is”?  Only “the”, “who” and “why”.  I lean towards “who” and “why”, but if a capital ‘T’ at the beginning of a sentence has a different symbol from a lowercase ‘t’ in the middle of the sentence, “the” may do it.  This gives me a few more combinations to test:

Using the previous substitutions:

')' = ' '
'!' = '.'
'd' = 'h'
'j' = 'T'
'c' = 'I'
'9' = 'S'
'q' = 'A'
'1' = 'N'
'f' = 'K'

"This is who":

'4' = 'W'
's' = 'O'

THIS IS WHO SOz2Nx z2xO z2NWOSW.  uH25vx -25 eOW WHIS 8IeHW ANx zAN 
OtAIv tO AW Sz8I6ONvKINe[etAIv.z2t wI8SW@ -25 7Ivv eOW A }+& eIwW 
zA8x W2 GtA=2N.z2t.  THANKS w28 3vA-INe.

"This is why":

'4' = 'W'
's' = 'Y'

THIS IS WHY SYz2Nx z2xY z2NWYSW.  uH25vx -25 eYW WHIS 8IeHW ANx zAN 
YtAIv tY AW Sz8I6YNvKINe[etAIv.z2t wI8SW@ -25 7Ivv eYW A }+& eIwW 
zA8x W2 GtA=2N.z2t.  THANKS w28 3vA-INe.

"This is the":

'4' = 'T' (lowercase t!)
's' = 'E'

THIS IS THE SEz2Nx z2xE z2NTEST.  uH25vx -25 eET THIS 8IeHT ANx zAN 
EtAIv tE AT Sz8I6ENvKINe[etAIv.z2t wI8ST@ -25 7Ivv eET A }+& eIwT 
zA8x T2 GtA=2N.z2t.  THANKS w28 3vA-INe.

The last one gives me words 3 (“the”), 10 (“this”) and 16 (“at”).  My word list does not contain “whis” so I will throw out the two previous tries and continue from here.
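The same dictionary test can be run over all three candidates at once. The sketch below is only an illustration: `unknown_words` is a made-up helper, and the nine-word dictionary is hypothetical and stands in for the full word list. It decodes only the words whose symbols are all known and reports any that the dictionary lacks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ciphertext = 'jdc9)c9)4ds)9sz21x)z2xs)z214s94!))ud25vx)-25)es4)4dc9)8ced4)q1x)zq1)stqcv)ts)q4)9z8c6s1vfc1e[etqcv!z2t)wc894@)-25)7cvv)es4)q)}+&)ecw4)zq8x)42)Gtq=21!z2t!))jdq1f9)w28)3vq-c1e!';

# Hypothetical mini dictionary; a real run would load the full word list.
my %dict = map { $_ => 1 } qw(a at aw is the this thanks who why);

# Decode every word whose symbols are all known and return the decoded
# words that are missing from the dictionary.
sub unknown_words {
    my ($text, $subst, $dict) = @_;
    my @missing;
    for my $raw (split /\)+/, $text) {
        (my $word = $raw) =~ s/!+$//;                     # drop the assumed full stop
        next unless length($word);
        my @symbols = split //, $word;
        next if grep { !exists $subst->{$_} } @symbols;   # still has gaps
        my $plain = lc(join('', map { $subst->{$_} } @symbols));
        push @missing, $plain unless $dict->{$plain};
    }
    return @missing;
}

my %base = ('d' => 'H', 'j' => 'T', 'c' => 'I', '9' => 'S',
            'q' => 'A', '1' => 'N', 'f' => 'K');
my %candidates = (
    who => { %base, '4' => 'W', 's' => 'O' },
    why => { %base, '4' => 'W', 's' => 'Y' },
    the => { %base, '4' => 'T', 's' => 'E' },
);

for my $name (sort keys %candidates) {
    my @missing = unknown_words($ciphertext, $candidates{$name}, \%dict);
    print "This is $name: ",
        (@missing ? "not in list: @missing" : 'all decoded words found'), "\n";
}
# prints:
#   This is the: all decoded words found
#   This is who: not in list: whis
#   This is why: not in list: whis
```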

Word 6 (“z214s94”) looks interesting with the pattern “--ntest”.  Only one word fits that pattern: “contest”.  It doesn’t surprise me one bit to find the word “contest” in the cleartext.  Assign ‘z’ = ‘C’ and ‘2’ = ‘O’.

Using the previous substitutions:

')' = ' '
'!' = '.'
'd' = 'h'
'j' = 'T'
'c' = 'I'
'9' = 'S'
'q' = 'A'
'1' = 'N'
'f' = 'K'
'4' = 'T'
's' = 'E'

Add in "contest":

'z' = 'C'
'2' = 'O'

THIS IS THE SECONx COxE CONTEST.  uHO5vx -O5 eET THIS 8IeHT ANx CAN 
EtAIv tE AT SC8I6ENvKINe[etAIv.COt wI8ST@ -O5 7Ivv eET A }+& eIwT 
CA8x TO GtA=ON.COt.  THANKS wO8 3vA-INe.

That gave me words 6 (“contest”), 13 (“can”) and 26 (“to”).  Now I want to take a look at the last word, “3vq-c1e”, which so far matches the pattern “--a-in-”. The final encrypted ‘e’ also serves as the first letter of the three letter word “-et” (words 9 and 21).

53 words match the “--a-in-” pattern.  Of those 53, 44 of them (83%) end with “ing”, and would yield “get” for word 21.  I’ll take a leap here and assign ‘e’ = ‘G’.
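Counting how many matches share an ending is a quick way to weigh a guess like this. The sketch below uses a hypothetical helper `suffix_share` and an invented seven-word list, so its counts differ from the 53 and 44 obtained from the full word list; the last line only rechecks that 44 of 53 rounds to 83%.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the words matching a pattern and how many of those end with a suffix.
sub suffix_share {
    my ($pattern, $suffix, @words) = @_;
    my @matches = grep { $_ =~ $pattern } @words;
    my $with_suffix = grep { /\Q$suffix\E$/ } @matches;
    return (scalar(@matches), $with_suffix);
}

# Hypothetical mini word list; the figures in the text came from the full list.
my @wordlist = qw(playing staying blazing trading imagine examine contest);

my ($total, $ing) = suffix_share(qr/^[a-z]{2}a[a-z]in[a-z]$/, 'ing', @wordlist);
printf "%d of %d matches end with ing (%.0f%%)\n", $ing, $total, 100 * $ing / $total;
# prints: 4 of 6 matches end with ing (67%)

printf "44 of 53 is %.0f%%\n", 100 * 44 / 53;
# prints: 44 of 53 is 83%
```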

Looking at that first sentence, if ‘x’ = ‘D’ then it reads “This is the second code contest.” That makes perfect sense.

Using the previous substitutions:

')' = ' '
'!' = '.'
'd' = 'h'
'j' = 'T'
'c' = 'I'
'9' = 'S'
'q' = 'A'
'1' = 'N'
'f' = 'K'
'4' = 'T'
's' = 'E'
'z' = 'C'
'2' = 'O'

Add our new letters:

'e' = 'G'
'x' = 'D'

THIS IS THE SECOND CODE CONTEST.  uHO5vD -O5 GET THIS 8IGHT AND CAN 
EtAIv tE AT SC8I6ENvKING[GtAIv.COt wI8ST@ -O5 7Ivv GET A }+& GIwT 
CA8D TO GtA=ON.COt.  THANKS wO8 3vA-ING.

That gives me words 4 (“second”), 5 (“code”), 9 (“get”), 12 (“and”), 21 (“get”).

Word 17 (“9z8c6s1vfc1e[etqcv!z2t”), or “sc-i-en-king-g-ai-.co-” looks suspiciously like our host’s email address that he provided in the contest description.  Let’s substitute the letters to complete that.

Using the previous substitutions:

')' = ' '
'!' = '.'
'd' = 'h'
'j' = 'T'
'c' = 'I'
'9' = 'S'
'q' = 'A'
'1' = 'N'
'f' = 'K'
'4' = 'T'
's' = 'E'
'z' = 'C'
'2' = 'O'
'e' = 'G'
'x' = 'D'

Completing his email address:

'8' = 'R'
'6' = 'V'
'v' = 'L'
'[' = '@'
't' = 'M'

THIS IS THE SECOND CODE CONTEST.  uHO5LD -O5 GET THIS RIGHT AND CAN 
EMAIL ME AT SCRIVENLKING@GMAIL.COM wIRST@ -O5 7ILL GET A }+& GIwT 
CARD TO GMA=ON.COM.  THANKS wOR 3LA-ING.

Definitely on the right track here.  I can feel that Amazon.com gift card for $25.  The message even seems to mention it: “-ill get a --- gi-t card to -ma-on.com”.  From here, one only needs to plug in the letters and symbols that make sense and finish stepping through the process.
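To see what is left to plug in, it helps to list the symbols that still have no substitution. The sketch below applies the table built up so far and counts the leftovers; the helper names `decode` and `remaining` are made up for this example, and the final letters are deliberately not filled in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ciphertext = 'jdc9)c9)4ds)9sz21x)z2xs)z214s94!))ud25vx)-25)es4)4dc9)8ced4)q1x)zq1)stqcv)ts)q4)9z8c6s1vfc1e[etqcv!z2t)wc894@)-25)7cvv)es4)q)}+&)ecw4)zq8x)42)Gtq=21!z2t!))jdq1f9)w28)3vq-c1e!';

# Substitutions established so far (capitals mark decoded cleartext).
my %subst = (
    ')' => ' ', '!' => '.', 'd' => 'H', 'j' => 'T', 'c' => 'I',
    '9' => 'S', 'q' => 'A', '1' => 'N', 'f' => 'K', '4' => 'T',
    's' => 'E', 'z' => 'C', '2' => 'O', 'e' => 'G', 'x' => 'D',
    '8' => 'R', '6' => 'V', 'v' => 'L', '[' => '@', 't' => 'M',
);

# Replace every symbol that has a substitution; leave the rest untouched.
sub decode {
    my ($text, $table) = @_;
    return join '', map { exists $table->{$_} ? $table->{$_} : $_ } split //, $text;
}

# Count the ciphertext symbols that have no substitution yet.
sub remaining {
    my ($text, $table) = @_;
    my %left;
    $left{$_}++ for grep { !exists $table->{$_} } split //, $text;
    return %left;
}

print decode($ciphertext, \%subst), "\n";
my %left = remaining($ciphertext, \%subst);
print "$_\t$left{$_}\n"
    for sort { $left{$b} <=> $left{$a} || $a cmp $b } keys %left;
```

Twelve symbols remain, and nine of them occur only once, so each remaining guess has to come from the surrounding words rather than from frequency.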