To simplify searching for a name, STP uses a concept that we call "normalized names." A normalized name can be thought of as a 21st century, localized representation of a name; its purpose is to present names in a standardized format that makes them easier to search. A normalized name is NOT an indexed name, but rather is how an indexed name is presented to our users. To find records associated with a name, STP users search for a normalized name, rather than the indexed name(s).
The general rule for indexing is "index what you see, even if you suspect an error." Names in our indexes are transcribed as they appeared on the original documents, even when a document is known to contain abbreviations or mistakes. The name Richard Williamson, for example, may appear in records as Richard, Rich, Rich'd., R, Williamson, Williams'n, W'ms'n, and so forth. For these reasons, and because name spellings change over time, it's often useful to search for someone using variations of their name. Normalized names are transformations of indexed names to the representation that is in everyday use today in Southampton County. Thus, the indexed name "Rich'd. L, Wm'son" and most variations of the indexed name are presented to an STP user as "Richard L Williamson."
What are some of the naming problems a genealogist may encounter?
Records often use abbreviations for names that were in common use at the time of record creation. Many of the names in these records were recorded by court officials, and there is a large variation in the abbreviations used. For example, William=Wm or W'm, Thomas=Tho's or Thom, and Richard=Rich'd or Ric'd. STP converts all name abbreviations to a 21st century representation of the unabbreviated name. STP search tools use the unabbreviated 'normalized' name.
There isn't a name that can't be spelled at least two different ways. It is common to find different spellings for the same person, or the same spelling for different persons, in many genealogical records. Many misspellings in record keeping are due to words and names that sound alike. A name can be pronounced the same but spelled differently, and many different letters sound similar when said aloud. These phonetic similarities easily lead to misspellings.
Illiteracy was much more common in early America than today, so names in official documents were often spelled phonetically. Census takers were especially notorious for the large variation in recorded names, but there is significant variation in names recorded by the relatively well-educated court officials as well. STP presents a "normalized" phonetically spelled name as the name in common use today. Still, there may be alternate spellings for the same name. For example, Benjamin Beale may also be recorded as Benjamin Beal. The usual recommendation for this example would be to conduct a search for both names, but the default option for the STP search engine uses grep pattern matching. Continuing our example, a search for "Ben.* Beal[e]*" would return matches for all names containing "Ben" and "Beal." In addition, STP offers phonetically-based search options (see Soundex and Metaphone below).
Erroneous information on transcriptions of genealogical records and their indexes is epidemic. Errors can be due to many problems including
Fortunately, the indexers of the records found here are unusually skilled, some with decades of genealogical experience. In addition, a majority of the indexers are intimately familiar with the names and places found in Southampton County. Indexers with a Southampton County background offer a significant advantage and can mitigate some of the issues associated with problems 1-3.
Perhaps the biggest source of indexing errors is poor readability of source documents. Most of the images of documents on the STP site have been enhanced and converted to grey-scale. Even so, some source document images are of such poor quality that even with the most sophisticated processing they remain difficult to read. In most cases our processed records are noticeably improved.
Even so, in handwritten records letters may be hard to distinguish. In most cases, the records found on the STP site are handwritten by county officials whose penmanship varies widely. An I may look like a J. It is easy to confuse a W with an M or a T with an F. We have corrected some of these errors by comparing names and places in multiple indexes. For example, births from 1853 to 1870 are found in two indexes created by different indexers. A name that is uncertain in one index may be much clearer in another. For male births in the late 19th century, comparing an index entry with draft records can reduce uncertainty.
Normalized names do not contain punctuation marks. The indexed name "A. L. Marks" is normalized to "A L Marks".
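The abbreviation and punctuation handling described above can be illustrated with a minimal Python sketch. It is not STP's actual code: the `ABBREVIATIONS` table and the `normalize_name` function are hypothetical, and the table holds only abbreviations quoted on this page.

```python
import re

# Hypothetical abbreviation table built from the examples on this page.
# The table STP actually uses is not shown here.
ABBREVIATIONS = {
    "rich'd": "Richard",
    "ric'd": "Richard",
    "wm": "William",
    "w'm": "William",
    "tho's": "Thomas",
    "wm'son": "Williamson",
}

def normalize_name(indexed_name: str) -> str:
    """Expand known abbreviations and drop punctuation from an indexed name."""
    words = []
    for token in indexed_name.split():
        # Trailing periods and commas are not part of the lookup key.
        key = token.rstrip(".,").lower()
        if key in ABBREVIATIONS:
            words.append(ABBREVIATIONS[key])
        else:
            # Strip any remaining punctuation, keeping letters only.
            cleaned = re.sub(r"[^A-Za-z]", "", token)
            if cleaned:
                words.append(cleaned)
    return " ".join(words)

print(normalize_name("Rich'd. L, Wm'son"))  # Richard L Williamson
print(normalize_name("A. L. Marks"))        # A L Marks
```

A lookup table is used here because abbreviations such as Wm'son cannot be expanded by rule alone; any abbreviation missing from the table simply loses its punctuation.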
Nicknames can present a difficult problem for researchers. Nicknames are often non-intuitive. For example, "Polly" and "Peggy" are common nicknames for "Mary" and "Margaret," respectively. STP currently does not have an approach to solving the "nickname" problem. Our current recommended approach is to search using GREP with a pattern such as "(Polly|Mary)".
Sometimes a "Spelling Variation" can only be described as a "Spelling Error." For example, Assamoosick Swamp is a well-known landmark in Southampton County; nevertheless, the Southampton County records contain at least a dozen spellings for this landmark. Even the name "Nottoway" is spelled four different ways in the Southampton records. STP corrects all known spelling errors that can be unambiguously corrected.
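As an illustration of the alternation pattern, the following Python sketch builds "(Polly|Mary)" from a small nickname table and applies it to a few names. The `NICKNAMES` table, the `nickname_pattern` function, and the sample names are hypothetical; Python's `re` module stands in for GREP, on the assumption that both treat the alternation the same way.

```python
import re

# Hypothetical nickname table holding only the two pairs named above.
NICKNAMES = {"Mary": ["Polly"], "Margaret": ["Peggy"]}

def nickname_pattern(given_name: str) -> str:
    """Build an alternation such as "(Polly|Mary)" for a given name."""
    variants = NICKNAMES.get(given_name, []) + [given_name]
    return "(" + "|".join(re.escape(v) for v in variants) + ")"

pattern = nickname_pattern("Mary") + " Williamson"
names = ["Polly Williamson", "Mary Williamson", "Peggy Williamson"]
matches = [n for n in names if re.search(pattern, n)]
print(pattern)   # (Polly|Mary) Williamson
print(matches)   # ['Polly Williamson', 'Mary Williamson']
```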
Search options for STP users
In 2020, STP debuted a web-based search of the Southampton County records. With the initial search capability, every record containing an indexed entity (name, place or thing) or site could be located and displayed in a table containing 1) a link to the line in the index containing the entity, 2) the target of the search in a link to supplemental information, and 3) links to either images of the records containing the entity or to the table of contents containing the search term.
To handle the numerous possible spellings of an indexed name, the initial STP search engine used GREP pattern matching to locate index entries. Thus in our initial search capability, a successful search was the result of defining a pattern that matched all the desired variant spellings of an indexed name without returning names that are not wanted. For example, to locate the records associated with Benjamin Beale, we might use the pattern "Ben.* Beal[e]*". This pattern will match "Ben Beale" or "Benj. Beal" or "Benja Beall" as well as others that are phonetically close to "Ben Beal."
Now in 2021, STP has modified search to find the normalized representation of indexed names (as described above). The STP search engine still supports GREP pattern matching to locate index entries, so the pattern "Ben.* Beal[e]*" will still return the same entries as before. In addition, we have expanded search to include additional options including 1) Literal, 2) Soundex, 3) Metaphone, and 4) MetaSoundex. Soundex, Metaphone, and MetaSoundex are phonetic algorithms and enable the indexing of entities by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
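The behavior of this pattern can be tried out with a short Python sketch. Python's `re` module is used in place of GREP, and the sketch assumes the pattern may match anywhere within an index entry; the list of entries is made up for illustration.

```python
import re

# The pattern from the text: "Ben", any run of characters, a space, "Beal",
# then zero or more "e" characters.
pattern = re.compile(r"Ben.* Beal[e]*")

# Made-up index entries for illustration only.
index_entries = [
    "Ben Beale",
    "Benj. Beal",
    "Benja Beall",
    "Benjamin Beale",
    "Thomas Beale",
    "Benjamin Bell",
]

# search() matches anywhere in the entry, which is why "Benja Beall" is a hit:
# it contains "Benja Beal".
hits = [entry for entry in index_entries if pattern.search(entry)]
print(hits)  # ['Ben Beale', 'Benj. Beal', 'Benja Beall', 'Benjamin Beale']
```

Because ".*" matches any run of characters, a loose pattern like this can also return unwanted names (for example, an entry such as "Bennett Bealer" would match), so results are worth reviewing.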
Literal Search - A literal search finds index entries that match the search term exactly. The result is similar to using "Exact" or "Match All Terms Exactly" on other web sites. This option is most useful for narrowing the output of a search to a specific name.
Soundex (or Sound Index) Search - Soundex is an algorithm that codes surnames phonetically by reducing each one to a four-character code: the first letter followed by three digits (padded with zeros when the name yields fewer), where each digit stands for one of six groups of consonant sounds. Its goal is to reduce matching problems caused by different spellings of names that sound similar. The algorithm was devised to code names in US census records, and the standard algorithm works best on European names.
Soundex produces a phonetic code, not a strictly alphabetical one: surnames are coded by the way they sound rather than by how they are spelled. For example, surnames that sound alike but are spelled differently, like Smith and Smyth, have the same code. The intent was to help researchers find a surname quickly even though it may have been recorded differently by census takers. Soundex was first applied to the 1880 census and reduced all the surnames in the US census to 1 of 26665 possible codes. If a name like Cook is spelled Koch, though, or Faust is spelled Phaust, a separate search under the Soundex code for the variant first letter is necessary.
The Soundex algorithm, developed by Robert C. Russell and Margaret King Odell and patented in 1918 and 1922, is a very simple code well suited to hand encoding. The usual description of the algorithm is:
Phonetic Description | Letters to encode | Encoding Digit |
---|---|---|
Oral Resonant | A, E, I, O, H, U, W, Y | 0 |
Labials and labio-dentals | B, F, P, V | 1 |
Gutturals and sibilants | C, G, J, K, Q, S, X, Z | 2 |
Dental-mutes | D, T | 3 |
Palatal-fricative | L | 4 |
Labio-nasal and Lingua-nasal | M, N | 5 |
Dental fricative | R | 6 |
Special Soundex Considerations
Double Letters - If the surname has any double letters, they are treated as one letter. Thus, in the surname Lloyd, the second L is ignored. In the surname Gutierrez, the second R is disregarded.
Side-by-Side Letters - A surname may have different side-by-side letters that receive the same number on the Soundex coding guide. For example, the c, k, and s in Jackson all receive the same code. These letters with the same code should be treated as only one letter. In the name Jackson, the k and s should be disregarded. This rule also applies to the first letter of a surname, even though it is not coded. For example, Pf in Pfister would receive a number 1 code for both the P and f. Thus in this name the letter f should be crossed out, and the code is P-236.
H and W Rule - The letters H and W do not act as separators between letters having the same code value. As a result, such letters are treated as adjacent and are condensed into a single code. For example, the letter sequence “CHS” would be coded as 2, whereas without this rule, it would be coded as 22. Note that this rule has often been omitted in descriptions of Soundex.
Mononymous names - People with a single given name were once the norm throughout much of the world, but are rare in modern times, particularly in the West. Before 1865 many individuals, especially in the South, with its enslaved Americans, or in areas with many Native Americans, may have used only a single-term name. Here, a phonetically spelled mononymous name for an American Indian or enslaved American is coded as if it were one continuous surname. If a distinguishable surname was given, the name may have been coded in the normal manner. For example, Dances with Wolves might have been coded as Dances (D-522) or as Wolves (W-412), or the name Shinka-Wa-Sa may have been coded as Shinka (S-520) or Sa (S-000). In other cases, first names were recorded as surnames.
Religious Figures - Nuns or other female religious figures with names such as Sister Veronica may have been members of households or heads of households or institutions where a child or children age 10 or under resided. Because many of these religious figures do not use a surname, the Soundexes for the post-1880 censuses frequently use the code S-236, for Sister. Similarly, some priests or monks may have been coded as B-636 (for “Brother”) or F-360 (for “Father”).
Prefixes - If the surname has a prefix, such as D’, De, dela, Di, du, Le, van, or Von, code it both with and without the prefix because it might be listed under either code. The surname vanDevanter, for example, could be V-531 or D-153. Mc and Mac are not considered to be prefixes and should be coded like other surnames.
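The coding table and the Double Letters, Side-by-Side Letters, and H and W rules can be put together in a short Python sketch. This is one reading of the rules as described on this page, not necessarily the code STP runs; the function name `soundex` and the "X-000" output format are choices made for this example. Letters in the table's 0 group are treated as carrying no digit, and prefixes, mononymous names, and religious titles are left to the person doing the coding.

```python
# Digit for each coded consonant, taken from the table above.
# Vowels, H, W and Y carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(surname: str) -> str:
    """Return a code such as "P-236" for a surname."""
    letters = [c for c in surname.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    # The first letter is not coded, but it takes part in the side-by-side rule.
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        digit = CODES.get(letter, "")
        if digit and digit != previous:
            digits.append(digit)
        previous = digit  # a vowel resets this, so it acts as a separator
    # Keep three digits, padding with zeros when the name yields fewer.
    return letters[0] + "-" + "".join(digits)[:3].ljust(3, "0")

for name in ("Pfister", "Jackson", "Lloyd", "Smith", "Smyth"):
    print(name, soundex(name))
```

Run on the examples from this section, the sketch gives P-236 for Pfister, S-236 for Sister, D-522 for Dances, and the same code for Smith and Smyth.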
Metaphone Search -
Metaphone is a phonetic algorithm that improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding which does a better job of matching words and names which sound similar.
Metaphone was published by Lawrence Philips in 1990. As with Soundex, similar sounding words should share the same code; but, as a general rule, Metaphone produces longer, variable-length codes, considers more pronunciation rules, and has fewer ambiguities than Soundex.
The algorithm phonetically codes words by reducing them to 16 consonant sounds: B, F, H, J, K, L, M, N, P, R, S, T, W, X, Y, and 0 where 0 represents the "th" sound and X stands for the "sh" sound.
There are numerous implementations of the Metaphone algorithm that produce encodings that differ from the original Philips code. The PHP implementation is used by STP. We have implemented a version of Metaphone in AppleScript and compared nearly 1,000,000 encodings with PHP obtaining 100% accuracy.
The Metaphone algorithm is very simple, but the rules for encoding the letters of a name are fairly complex and depend on the position of the letter in the name and on the letters that precede and follow it.
The Metaphone algorithm
Encoding | Digit |
---|---|
C, G, Q | K |
D | T |
V | F |
X, Z | S |
H, W, Y | empty "" |
otherwise | the letter |
Branch | SYMB = | and | then METAPH = | Example | Comment |
---|---|---|---|---|---|
IF | not "C" | SYMB = pSYMB | METAPH:"" | Goggle | Drop doubled letter unless "C" |
IF | "A" | N = 1 | METAPH:"A" | Aaron | Drop vowels unless first character |
IF | "B" | pSYMB = "M" | METAPH:"" | Holcomb | PHP drops "B" preceded by "M" everywhere, not just at end. |
ELSE | "B" | | METAPH:"B" | Wombwell | PHP drops "B" preceded by "M" everywhere, not just at end. |
IF | "C" | nSYMB = "H" | METAPH:"X" | Church | |
ELSEIF | "C" | nSYMB = "I" and nnSYMB = "A" | METAPH:"X" | Marcia | |
ELSEIF | "C" | nSYMB is FRONTV and pSYMB = "S" | METAPH:"" | Cascill | |
ELSEIF | "C" | nSYMB is FRONTV and pSYMB ≠ "S" | METAPH:"S" | Francis | |
ELSE | "C" | | METAPH:"K" | Backham | |
IF | "D" | nSYMB = "G" and nnSYMB is FRONTV | METAPH:"J" | Dodge | |
ELSE | "D" | | METAPH:"T" | Adams | |
IF | "E" | N = 1 | METAPH:"E" | Elisha | Drop vowels unless first character |
IF | "F" | | METAPH:"F" | Francis | |
IF | "G" | pSYMB = "G" | METAPH:"K" | Goggle | Hard G |
ELSEIF | "G" | nSYMB = "H" | METAPH:"F" | Buckingham | |
ELSEIF | "G" | nSYMB = "H" and name contains "[BDH]..GH" | METAPH:"" | Bright | |
ELSEIF | "G" | nSYMB = "H" and name contains "H...GH" | METAPH:"" | Throughtgood | |
ELSEIF | "G" | nSYMB = "N" | METAPH:"K" | signs | |
ELSEIF | "G" | nSYMB ... nnnSYMB = "NED" | METAPH:"" | signed | PHP drops "G" followed by "NED" everywhere, not just at end. |
ELSEIF | "G" | N = LEN - 1 | METAPH:"" | sign | |
ELSEIF | "G" | nSYMB is FRONTV and pSYMB ≠ "D" | METAPH:"J" | Burger | DG handled in D rules |
ELSEIF | "G" | nSYMB is FRONTV and pSYMB = "D" | METAPH:"" | Bridgeman | DG handled in D rules |
ELSE | "G" | | METAPH:"K" | Lodge | |
IF | "H" | N = 2 and pSYMB = "W" | METAPH:"" | White | |
ELSEIF | "H" | nSYMB is VOWEL and not (N = LEN or (pSYMB is (VARSON or VOWEL))) | METAPH:"H" | Barham | |
ELSEIF | "H" | nSYMB is VOWEL and pSYMB is VOWEL | METAPH:"H" | Abraham | |
ELSE | "H" | | METAPH:"" | Abraham | |
IF | "I" | N = 1 | METAPH:"I" | Ishmael | Drop vowels unless first character |
IF | "J" | | METAPH:"J" | Major | |
IF | "K" | N = 1 or (N > 1 and pSYMB ≠ "C") | METAPH:"K" | Mark | |
ELSE | "K" | | METAPH:"" | Mark | |
IF | "L" | | METAPH:"L" | Aldride | |
IF | "M" | | METAPH:"M" | Abraham | |
IF | "N" | | METAPH:"N" | Adkins | |
IF | "O" | N = 1 | METAPH:"O" | Omega | Drop vowels unless first character |
IF | "P" | nSYMB = "H" | METAPH:"F" | Phillips | |
ELSE | "P" | | METAPH:"P" | Pond | |
IF | "Q" | | METAPH:"K" | Squire | |
IF | "R" | | METAPH:"R" | Aaron | |
IF | "S" | nSYMB = "I" and (nnSYMB = "A" or nnSYMB = "O") | METAPH:"X" | Session | |
ELSEIF | "S" | nSYMB = "H" | METAPH:"X" | Walsh | |
ELSE | "S" | | METAPH:"S" | Allison | |
IF | "T" | nSYMB = "C" and nnSYMB = "H" | METAPH:"" | Hitchcock | |
ELSEIF | "T" | nSYMB = "I" and (nnSYMB = "A" or nnSYMB = "O") | METAPH:"X" | Martian | |
ELSEIF | "T" | nSYMB = "H" and pSYMB ≠ "T" | METAPH:"0" | Anthony | |
ELSE | "T" | | METAPH:"T" | Abbott | |
IF | "U" | N = 1 | METAPH:"U" | Uriah | Drop vowels unless first character |
IF | "V" | | METAPH:"F" | Avery | |
IF | "W" | N = 1 and nSYMB is "H" | METAPH:"W" | White | |
ELSEIF | "W" | N = 1 and nSYMB is CONSONANT | METAPH:"" | Wren | |
ELSEIF | "W" | N = 1 | METAPH:"W" | Waters | |
ELSEIF | "W" | nSYMB is VOWEL | METAPH:"W" | Woodward | |
ELSE | "W" | | METAPH:"" | Bowden | |
IF | "X" | N = 1 | METAPH:"S" | Xaiver | |
ELSE | "X" | | METAPH:"KS" | Alex | |
IF | "Y" | nSYMB is VOWEL | METAPH:"Y" | Bayer | |
ELSE | "Y" | | METAPH:"" | Ayres | |
IF | "Z" | | METAPH:"S" | Cozier | |
Transformation | Encoding | Digit |
---|---|---|
1 | A, E, I, O, U | 0 |
2 | J, Y | 1 |
3 | D, T | 3 |
4 | C, S, Z | 4 |
5 | G, H, K, Q, X | 5 |
6 | N, M | 6 |
7 | B, F, P, V, W | 7 |
8 | L | 8 |
9 | R | 9 |