Soundex

Soundex is a coding system that identifies "sound alike" surnames. The Soundex Code of a surname is always one letter followed by three numbers, such as E235 or W-262 (the hyphen is optional and can be disregarded).

The Soundex Code was "high tech" in 1918 when it was invented by Robert Russell. In a nutshell, Soundex Codes provide a means of identifying words, especially names, by the way they sound. They were used extensively by the WPA crews working in the 1930s to organize Federal Census data from 1880 to 1920. Soundex has also been used for many state and local census records and is very popular in genealogy software and databases.

In the days when nearly all of the data for the Census of Population was collected by enumerators who walked from door to door, it was discovered that many of these people spelled surnames phonetically. Thus, one might spell Smith as "Smith" while another might spell it as "Smyth" and still another "Smythe." The census records were to be indexed by the sound of each name rather than by its spelling, and Soundex was the code system used to organize this index.

If you search many records of interest to genealogists, sooner or later you will need to use Soundex Codes. Why? Well, you can often find a person's entry by his or her Soundex Code even when the name has been misspelled. This becomes important when you realize that many census takers did not speak the language of the people being enumerated. In fact, during the first 150 years of U.S. census records, many Americans were illiterate and did not know how to write their own last names. The spelling of many family names also has changed over the years, but often the Soundex Code remains the same.

The Soundex Code is not difficult to learn. Every Soundex Code consists of a letter and three numbers, such as W-252. The letter is always the first letter of the surname, and the hyphen is optional. The numbers are assigned to the remaining letters of the surname according to the Soundex guide shown below. If necessary, zeroes are added at the end to produce a four-character code. Additional letters are disregarded.

Soundex Coding Guide

Each number represents letters:

1 = B, F, P and V
2 = C, G, J, K, Q, S, X and Z
3 = D and T
4 = L
5 = M and N
6 = R

Disregard the letters A, E, I, O, U, H, W, and Y.

Here are some of the simpler examples:

Washington is coded W252 (W, 2 for the S, 5 for the N, 2 for the G, remaining letters disregarded).

Lee is coded L000 (L, there is no Soundex Code for E so the numbers 000 are added).
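The coding guide and the two simple examples above can be sketched in a few lines of Python. This is only an illustration, not an official implementation: the names SOUNDEX_GUIDE, LETTER_TO_DIGIT, and basic_soundex are made up for this sketch. It assumes a non-empty surname and covers just the basic steps (keep the first letter, code the remaining letters, pad with zeroes, drop extra digits). It does not apply the more complex rules described next, so it will give wrong codes for names that need them.

```python
# Coding guide: each digit stands for a group of letters.
SOUNDEX_GUIDE = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the guide so each letter maps to its digit.
# A, E, I, O, U, H, W, and Y are absent on purpose: they are disregarded.
LETTER_TO_DIGIT = {
    letter: digit
    for digit, letters in SOUNDEX_GUIDE.items()
    for letter in letters
}


def basic_soundex(surname):
    """Basic coding only: first letter plus up to three digits, zero-padded.

    Hypothetical helper for illustration; it ignores the double-letter,
    side-by-side, and separator rules.
    """
    name = surname.upper()
    digits = [LETTER_TO_DIGIT[ch] for ch in name[1:] if ch in LETTER_TO_DIGIT]
    # Keep at most three digits, then pad with zeroes to four characters.
    return name[0] + "".join(digits)[:3].ljust(3, "0")


print(basic_soundex("Washington"))  # W252
print(basic_soundex("Lee"))         # L000
```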

Now let's move on to some of the more complex rules:

Any double letters in a name are treated as one letter. For example:

Gutierrez is coded G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

If the surname has different letters side-by-side that have the same number in the Soundex coding guide, they are treated as one letter. Examples:

Pfister is coded as P-236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J-250 (J, 2 for the C, K ignored, S ignored, 5 for the N, 0 added).

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored, 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

Names with Prefixes

If a surname has a prefix, such as Van, Con, De, Di, La, or Le, the code should ignore these prefixes. However, coders sometimes miss this rule, so they might assign the Soundex code either with or without the prefix. Because the surname might be listed under either code, a thorough search of the Soundex index should include both forms. Note, however, that Mc and Mac are not considered prefixes, according to the National Archives and Records Administration. Once again, however, not everyone knows this particular rule, so you might want to search both with and without the Mc or Mac coded.

VanDeusen might be coded two ways:

With the prefix included, V-532 (V, 5 for the N, 3 for the D, 2 for the S)
or
With the prefix excluded, D-250 (D, 2 for the S, 5 for the N, 0 added).
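Because a surname may be filed under either code, it can help to generate both before searching. The Python sketch below is one possible way to do that, not an established tool: PREFIXES, soundex, and candidate_codes are hypothetical names, the encoder is a compact version of the rules in this article, and treating Y like a vowel is an assumption. The prefix check is deliberately crude. It cannot tell a true prefix from a name that merely begins with the same letters, so it may suggest extra codes that are not needed.

```python
# Prefixes named in this article; Mc and Mac are intentionally left out.
PREFIXES = ("VAN", "CON", "DE", "DI", "LA", "LE")

GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(name):
    """Compact Soundex encoder (illustrative sketch). Returns '' for no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")  # the first letter also suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code                   # a vowel resets prev to ''
    return letters[0] + "".join(digits)[:3].ljust(3, "0")


def candidate_codes(surname):
    """Return the code with the prefix and, where one matches, without it."""
    codes = [soundex(surname)]
    for prefix in PREFIXES:
        if surname.upper().startswith(prefix):
            remainder = surname[len(prefix):].lstrip(" -'")
            code = soundex(remainder)
            if code and code not in codes:
                codes.append(code)
    return codes


print(candidate_codes("VanDeusen"))  # ['V532', 'D250']
print(candidate_codes("McDonald"))   # ['M235']
```

Searching every code in the returned list covers both ways a coder might have filed the name.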

Consonant Separators

If a vowel (A, E, I, O, U) separates two consonants that have the same Soundex Code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T-522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

If "H" or "W" separates two consonants that have the same Soundex Code, the consonant to the right of the "H" or "W" is not coded. Example:

Ashcraft is coded A-261 (A, 2 for the S, C ignored, 6 for the R, 1 for the F). It is not coded A-226.
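Taken together, the rules above (double letters, side-by-side letters with the same number, vowel separators, and the H and W rule) can be expressed as one short Python function. This is a minimal sketch under stated assumptions rather than a definitive implementation: the names GROUPS, CODES, and soundex are invented for the example, Y is assumed to behave like a vowel, and any character outside A to Z is simply dropped.

```python
GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(surname):
    """Return a four-character Soundex code, or '' if there are no letters."""
    letters = [c for c in surname.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # Code of the most recent letter; the first letter counts too, which is
    # why the F in Pfister is ignored.
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped without resetting prev, so equal codes on
            # either side of them still count as one letter (Ashcraft).
            continue
        code = CODES.get(ch, "")
        if code == "":
            # Vowel (Y assumed to act like one): the next consonant is coded
            # even if it repeats the previous number (Tymczak).
            prev = ""
            continue
        if code != prev:
            digits.append(code)
        prev = code
    # Keep three digits at most and pad with zeroes if there are fewer.
    return first + "".join(digits)[:3].ljust(3, "0")


for name in ("Gutierrez", "Pfister", "Jackson", "Tymczak", "Ashcraft"):
    print(name, soundex(name))
```

Run as written, the loop should print G362, P236, J250, T522, and A261, matching the worked examples in this article (with the optional hyphen left out).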

American Indian and Asian Names

A phonetically spelled American Indian or Asian name was sometimes coded as if it were one continuous name. If a distinguishable surname was given, the name may have been coded in the normal manner. For example, Dances with Wolves might have been coded as Dances (D-522) or as Wolves (W-412), or the name Shinka-Wa-Sa may have been coded as Shinka (S-520) or Sa (S-000).

Other Resources

While the rules sound a bit complex, they do become easier with a bit of practice. For those of us who are too lazy to go through the coding exercise, the computer age has brought many new tools. Most modern genealogy programs will tell you the Soundex Code of any name that you enter. In addition, a number of online Soundex Machines are available, including those at: http://resources.rootsweb.com/cgi-bin/soundexconverter, http://www.searchforancestors.com/soundex.html, http://www.geocities.com/Heartland/Hills/3916/soundex.html, http://www.pa-roots.com/soundex.html and http://www.genealogy.org/soundex.shtml. On any of these sites, you type in a last name, and then the site will display the correct Soundex Code. Yet Another Soundex Converter (YASC) at http://www.bradandkathy.com/genealogy/yasc.html will even convert a long list of names to their Soundex equivalents; you do not have to enter them one at a time.

The National Archives and Records Administration (NARA) publishes a free brochure, entitled Using the Census Soundex. To obtain a copy, send an e-mail to inquire@nara.gov and ask for General Information Leaflet 55, usually referred to as GIL 55. Make sure that you include your name, postal address, and "GIL 55 please".

Anyone hosting genealogy pages on a UNIX or Linux Web server might want to know about a bash script called soundex.ss that is available at http://www.unixreview.com/documents/s=7458/uni1026336632258/0207e.htm. If you are familiar with bash, you can add a Soundex machine to your Web site. A similar program, written in C, is available at http://www.unixreview.com/documents/s=7458/uni1026336632258/0207e_C.htm.

Soundex Shortcomings and Variations

While Soundex is a great tool and in widespread use, it certainly is not perfect. For example, it fails when similar-sounding names begin with different letters, because the first letter is always kept as written. For instance, Knowles is coded as K542 while both Noles and Nolles are N420. Likewise, Cantor is C536 while the similar-sounding Kantor is K536.

Soundex also has a number of shortcomings when dealing with Eastern European Jewish names. Two Jewish genealogists, Randy Daitch and Gary Mokotoff, developed a more sophisticated system that is more suitable for Jewish genealogy. The Daitch-Mokotoff Soundex is becoming the de facto standard for on-line lookups on Jewish-related web sites. You can read more about the Daitch-Mokotoff Soundex in an article written by Gary Mokotoff at http://www.avotaynu.com/soundex.html.

Several other improved Soundex methods have been developed in recent years and are widely used in computer databases. However, these newer "improved Soundex" methods have never seen much use in genealogy databases.

Also see: Census Record, Wikipedia Soundex Entry