Robber language was designed for Swedish and doesn't translate cleanly to English. Swedish has a clean consonant/vowel split — the letter "y" is a vowel, and "x" is always pronounced "ks" — which is what makes the rule "consonant + o + consonant" work consistently. English is messier: "y" flips between consonant and vowel ("yes" vs "happy"), "x" can sound like "ks" ("ax") or "z" ("xylophone"), and digraphs like "sh", "ch" and "th" represent single sounds that the letter-by-letter rule splits apart.
Spoken English actually works fine if you treat a consonant as a consonant sound rather than a letter: "ship" then becomes one consonant sound + vowel + consonant sound, not s-h-i-p. By sound, "ship" translates to "shoshipop"; a letter-by-letter generator would instead spit out "soshohipop", which doesn't match how the word is pronounced. Doing it by sound puts the burden on the speaker to do the phonetic analysis on the fly, and writing a generator for it would mean shipping a pronunciation dictionary or grapheme-to-phoneme model rather than the simple letter-by-letter substitution this post is about. So the Swedish examples are the ones that actually sound right when spoken. Even Swedish isn't perfectly clean, though: multi-letter spellings like "sj", "sch" and "tj" are single sounds that the letter-by-letter rule splits apart, and speakers simply accept that by convention rather than working around it.
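To make the "ship" contrast concrete, here is a minimal Python sketch of both approaches. Everything in it is an assumption made for illustration: the function names `encode_by_letter` and `encode_by_digraph` are hypothetical, the vowel set is the Swedish one (so "y" counts as a vowel), and the digraph list is a tiny hand-picked stand-in for real phonetic analysis, not a substitute for a pronunciation dictionary.

```python
# Assumption: Swedish vowel letters; every other letter counts as a consonant.
VOWELS = set("aeiouyåäö")
# Assumption: a hand-picked list of English digraphs, for illustration only.
DIGRAPHS = ("sh", "ch", "th")


def encode_by_letter(word: str) -> str:
    """Apply 'consonant + o + consonant' to each letter on its own."""
    out = []
    for ch in word:
        if ch.isalpha() and ch.lower() not in VOWELS:
            out.append(f"{ch}o{ch.lower()}")
        else:
            # Vowels, digits, and punctuation pass through unchanged.
            out.append(ch)
    return "".join(out)


def encode_by_digraph(word: str) -> str:
    """Same rule, but treat each listed digraph as one consonant unit."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            out.append(f"{pair}o{pair.lower()}")
            i += 2
        else:
            out.append(encode_by_letter(word[i]))
            i += 1
    return "".join(out)


print(encode_by_letter("ship"))   # soshohipop
print(encode_by_digraph("ship"))  # shoshipop
```

The digraph version only approximates working by sound. A fixed list still misfires on a word like "hothouse", where "t" and "h" are two separate sounds, which is why a proper sound-based generator would need the dictionary or grapheme-to-phoneme model mentioned above.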