Recently, a daily word game called Wordle has gone viral on Twitter. The rules of the game are simple: guess a five-letter word in six tries. The hints given after each guess are shown in the figure below.

I find the game interesting and challenging, since there are hundreds of thousands of English words and English is not my native language. Luckily, I know Python! So I calculated the statistics of letter occurrences in English words and used them to make an informed initial guess.
Statistics!
First things first, I load all the words from a .txt file provided on GitHub. Since the quiz only considers five-letter words, we eliminate all words of any other length. Some of the remaining words might not be in the quiz dictionary, but I just let it be.
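For reference, a minimal sketch of that loading step is shown below. The file name words_alpha.txt is just an assumption (use whatever name your downloaded list has); words_5_letters is the variable used in the rest of the post.
# Load the raw word list; the file name here is an assumption.
with open('words_alpha.txt') as f:
    words = f.read().split()

# Keep only the five-letter words, since Wordle only uses those.
words_5_letters = [word.lower() for word in words if len(word) == 5]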
Next, we calculate how often each letter of the alphabet occurs in the five-letter words. This can be done easily with a Python dictionary: iterate through all the words and their letters, count them, and normalize the data. The code can be seen below.
import string

# Count how many times each letter appears across all five-letter words.
letter_count = dict.fromkeys(string.ascii_lowercase, 0)
for word in words_5_letters:
    for letter in word:
        letter_count[letter] = letter_count[letter] + 1

# Normalize the counts into probabilities and sort them in descending order.
total_count = sum(letter_count.values())
letter_count_normalized = {key: value / total_count for key, value in letter_count.items()}
sorted_letter_prob = {k: v for k, v in sorted(letter_count_normalized.items(), key=lambda item: item[1], reverse=True)}
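To sanity-check the result, we can peek at the top of the sorted dictionary, for example:
# Print the five most frequent letters and their probabilities (as percentages).
for letter, prob in list(sorted_letter_prob.items())[:5]:
    print(letter, round(prob * 100, 2), '%')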

The result shows that more than 10% of the letters are ‘a’, with ‘e’ trailing close behind at almost 10% and ‘s’ at slightly above 8%. This is slightly different from the distribution over words of any length, shown in the image below. With these statistics, we can be confident that we should include those letters in our initial guess.

For a second statistic, let us count the probability of each letter occurring at each position in a word. With the Pandas library, the code is shown below.
import pandas as pd

# One row per letter, one column per position (1-5) in the word.
data = dict.fromkeys(string.ascii_lowercase, [0, 0, 0, 0, 0])
df = pd.DataFrame.from_dict(data, orient='index')
df.columns = [1, 2, 3, 4, 5]

# Count how often each letter appears in each position.
for word in words_5_letters:
    for count, letter in enumerate(word):
        df.loc[letter, count + 1] += 1

# Transpose so rows are positions, then normalize each row into probabilities.
df_transposed = df.transpose()
df_normalized = df_transposed.div(df_transposed.sum(axis=1), axis=0)
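As a quick check, we can ask Pandas for the most probable letter in each position (the exact values depend on the word list used):
# Most probable letter for each position; rows of df_normalized are positions 1-5.
print(df_normalized.idxmax(axis=1))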
The result is shown in the figure below. One can see that the most common letters for the first through fifth positions in a word are ‘s’, ‘a’, ‘r’, ‘e’, and ‘s’. This hints at what makes a good first guess. However, since using ‘s’ twice is not efficient, we can substitute the first letter with the next most frequent letter for that position, ‘c’.

Hold up! What about Statistic 1? Yes! We should also consider it, so let us do the calculation. Suppose we consider Statistics 1 and 2 to be equally important; then we can score how probable a word is by averaging the two criteria. The function to calculate the score is written below.
def count_score(word):
    # Criterion 1: overall letter frequency; criterion 2: positional letter frequency.
    count_crit_1 = 0
    count_crit_2 = 0
    for count, letter in enumerate(word):
        count_crit_1 = count_crit_1 + sorted_letter_prob[letter]
        count_crit_2 = count_crit_2 + df_normalized.iloc[count][letter]
    # Average the two criteria (each is a sum of five probabilities) and rescale for readability.
    return (count_crit_1 + count_crit_2) / 2 * 100
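For example, we can now score any candidate word (the exact numbers depend on the word list):
# Score a couple of candidate words; a higher score means more frequent letters in more likely positions.
print(count_score('tares'), count_score('fuzzy'))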
Making an Informed Guess
To make an initial guess, let us iterate through all the five-letter words and see which one has the highest score. Note that we should not include words with repeated letters, since they are not efficient guesses. The code is shown below.
def letter_is_not_doubled(check_string):
    # Return True only if no letter appears more than once in the word.
    count = {}
    condition = True
    for s in check_string:
        if s in count:
            count[s] += 1
        else:
            count[s] = 1
    for key in count.keys():
        condition = condition and (count[key] == 1)
    return condition

# Score every five-letter word without repeated letters and sort by score.
words_score = {}
for word in words_5_letters:
    if letter_is_not_doubled(word):
        words_score[word] = count_score(word)
sorted_words_score = {k: v for k, v in sorted(words_score.items(), key=lambda item: item[1], reverse=True)}
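The top entries of the sorted dictionary are then our candidate first guesses:
# Inspect the five highest-scoring candidate first guesses.
print(list(sorted_words_score.items())[:5])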
From this calculation, we find that the highest-scoring word is ‘tares’. Thus, we can use this word as our first guess, an informed guess!
Once you play the quiz, you will notice that one guess is not enough, so we need another. For our next guess, we do not want to include letters that already appear in the first guess. Let us define a function that filters out words containing the letters we want to exclude, and calculate the scores again.
def not_contain_this_letter(word, not_contain):
    # Return True only if the word contains none of the letters in not_contain.
    condition = True
    for letter in not_contain:
        condition = condition and (letter not in word)
    return condition

# Filter out words that reuse letters from the first guess, then re-score.
not_contain = 'tares'
word_guess = [word for word in words_5_letters if not_contain_this_letter(word, not_contain)]
words_score_2 = {}
for word in word_guess:
    if letter_is_not_doubled(word):
        words_score_2[word] = count_score(word)
sorted_words_score_2 = {k: v for k, v in sorted(words_score_2.items(), key=lambda item: item[1], reverse=True)}
sorted_words_score_2
From this filtering, we find that the best word for the second guess is ‘colin’! Using the same technique, excluding the letters of both ‘tares’ and ‘colin’, we find that the best third guess, in case two are not enough, is ‘bumpy’.
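For completeness, the third guess follows the same pattern; a sketch (reusing the functions above, with new variable names only for illustration) looks like this:
# Exclude the letters of both previous guesses, then re-score the remaining words.
not_contain = 'tares' + 'colin'
word_guess_3 = [word for word in words_5_letters if not_contain_this_letter(word, not_contain)]
words_score_3 = {word: count_score(word) for word in word_guess_3 if letter_is_not_doubled(word)}
sorted_words_score_3 = {k: v for k, v in sorted(words_score_3.items(), key=lambda item: item[1], reverse=True)}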
There you go! Make ‘tares’ your initial guess, follow it with ‘colin’ for the second, and keep ‘bumpy’ in case you think you need a third one.
Now you can play Wordle with a statistically informed initial guess. Good luck!