Ababa's Genealogy website

Of all the difficulties encountered in genealogical research, names are among the most challenging, especially for the novice genealogist.

To simplify searching for a name, STP uses a concept that we call "normalized names." A normalized name can be thought of as a 21st century, localized representation of a name; its objective is to present names in a standardized format that facilitates searching. A normalized name is NOT an indexed name, but rather is how an indexed name is presented to our users. To find records associated with a name, STP users search for a normalized name, rather than the indexed name(s).

The general rule for indexing is "index what you see, even if you suspect an error." Names in our indexes are transcribed as they appeared on original documents, even when a document is known to contain abbreviations or mistakes. The name Richard Williamson, for example, may appear in records as Richard, Rich, Rich'd., R, Williamson, Williams'n, W'ms'n, and so forth. For these reasons, and because name spellings change over time, it's often useful to search for someone using variations of their name. Normalized names are transformations of indexed names to a representation that is in everyday use today in Southampton County. Thus, the indexed name "Rich'd. L, Wm'son" and most variations of the indexed name are presented to an STP user as "Richard L Williamson."

What are some of the naming problems a genealogist may encounter?

Abbreviations

Records often use abbreviations for names that were in common use at the time of record creation. Many of the names in these records were recorded by court officials, and there is a large variation in the abbreviations used. For example, William may appear as Wm or W'm, Thomas as Tho's or Thom, and Richard as Rich'd or Ric'd. STP converts all name abbreviations to a 21st century representation of the unabbreviated name. STP search tools use the unabbreviated 'normalized' name.

Spelling Variations

There isn't a name that can't be spelled at least two different ways. It is common to find different spellings for the same person, or the same spelling for different persons, in genealogical records. Many misspellings in record keeping arise because words and names sound alike: a name may be pronounced the same but spelled differently, and many different letters sound similar when said aloud. These phonetic similarities easily lead to misspellings.

Illiteracy was much more common in early America than today; thus names in official documents were often spelled phonetically. Census takers were especially notorious for the large variation in recorded names, but there is significant variation in names recorded by the relatively well-educated court officials as well. STP presents a "normalized" phonetically spelled name as the name in common use today. Still, there may be alternate spellings for the same name. For example, Benjamin Beale may also be recorded as Benjamin Beal. The usual recommendation for this example would be to conduct a search for both names, but the default option for the STP search engine uses GREP pattern matching, so a single pattern can cover both. Continuing our example, a search for "Ben.* Beal[e]*" would return matches for all names containing "Ben" and "Beal." In addition, STP offers phonetically-based search options (see Soundex and Metaphone below).

Indexing and Transcription Errors

Erroneous information on transcriptions of genealogical records and their indexes is epidemic. Errors can be due to many problems, including:

1. Bad or difficult to interpret handwriting
2. Poor copying or image of the record
3. Damaged or difficult to read documents
4. Poor skill of transcriber/indexer

Fortunately, the indexers of the records found here are unusually skilled, some with decades of genealogical experience. In addition, a majority of the indexers are intimately familiar with the names and places found in Southampton County. Indexers with a Southampton County background offer a significant advantage and can mitigate some of the issues associated with problems 1-3.

Perhaps the biggest source of indexing errors is poor readability of source documents. Most of the images of documents on the STP site have been enhanced and converted to grey-scale. Even so, some source document images are of such poor quality that even with the most sophisticated processing they remain difficult to read. In most cases our processed records are noticeably improved.

Even so, in handwritten records letters may be hard to distinguish. In most cases, the records found on the STP site were handwritten by county officials whose penmanship varies widely. An I may look like a J. It is easy to confuse a W with an M or a T with an F. We have corrected some of these errors by comparing names and places in multiple indexes. For example, births from 1853 to 1870 are found in two indexes created by different indexers. A name that is uncertain in one index may be significantly less uncertain in another. For male births in the late 19th century, comparing an index entry with draft records can reduce uncertainty.

Punctuation

Normalized names do not contain punctuation marks. The indexed name "A. L. Marks" is normalized to "A L Marks".
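The punctuation rule can be sketched in a few lines of Python. This is only an illustration, not STP's code: the function name strip_punctuation is hypothetical, and it assumes that each punctuation mark is replaced by a space (so that initials written without spaces stay separate) before whitespace is collapsed. Abbreviation marks such as the apostrophe in "Rich'd" would need the separate abbreviation handling described earlier.

```python
import re

def strip_punctuation(indexed_name: str) -> str:
    """Hypothetical helper: drop punctuation marks from an indexed name."""
    # Assumption: every character that is not a letter, digit, underscore,
    # or whitespace counts as punctuation and becomes a space.
    no_punct = re.sub(r"[^\w\s]", " ", indexed_name)
    # Collapse the runs of whitespace left behind and trim both ends.
    return " ".join(no_punct.split())

print(strip_punctuation("A. L. Marks"))  # A L Marks
```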

Nicknames

Nicknames can present a difficult problem for researchers. Nicknames are often non-intuitive. For example, "Polly" and "Peggy" are common nicknames for "Mary" and "Margaret," respectively. STP currently does not have an approach to solving the "nickname" problem. Our current recommended approach is to search using GREP with a pattern such as "(Polly|Mary)".
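To see how such an alternation pattern behaves, here is a small sketch using Python's re module. It assumes the STP search treats the pattern like an extended regular expression; the list of names is invented for illustration and does not come from the STP indexes.

```python
import re

# Hypothetical normalized names, for illustration only.
names = ["Polly Beale", "Mary Beale", "Molly Beal", "Margaret Beale"]

# The alternation matches any entry containing either "Polly" or "Mary".
pattern = re.compile(r"(Polly|Mary)")
matches = [name for name in names if pattern.search(name)]

print(matches)  # ['Polly Beale', 'Mary Beale']
```

Note that the pattern only finds the nicknames listed in it, so the researcher still has to know which nicknames to include.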

Spelling Errors

Sometimes a "Spelling Variation" can only be described as a "Spelling Error." For example, Assamoosick Swamp is a well known landmark in Southampton County; nevertheless, the Southampton County records contain at least a dozen spellings for this well known landmark. Even the name "Nottoway" is spelled four different ways in the Southampton records. STP corrects all known spelling errors that can be unambiguously corrected.

Search options for STP users

In 2020, STP debuted a web-based search of the Southampton County records. With the initial search capability, every record containing an indexed entity (name, place or thing) or site could be located and displayed in a table containing 1) a link to the line in the index containing the entity, 2) the target of the search in a link to supplemental information, and 3) links to either images of the records containing the entity or to the table of contents containing the search term.

To handle the numerous possible spellings of an indexed name, the initial STP search engine used GREP pattern matching to locate index entries. Thus in our initial search capability, a successful search was the result of defining a pattern that matched all the desired variant spellings of an indexed name without returning names that are not wanted. For example, to locate the records associated with Benjamin Beale, we might use the pattern "Ben.* Beal[e]*". This pattern will match "Ben Beale" or "Benj. Beal" or "Benja Beall" as well as others that are phonetically close to "Ben Beal."
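The behavior of this pattern can be checked with a short Python sketch. It assumes the search applies the pattern as an unanchored regular expression (a match anywhere in the entry counts), which is how grep works by default; the list of index entries is made up for the example.

```python
import re

# Hypothetical index entries, for illustration only.
index_entries = [
    "Ben Beale",
    "Benj. Beal",
    "Benja Beall",
    "Benjamin Bell",
    "Robert Beale",
]

# "Ben", then any characters, then a space and "Beal", then optional "e"s.
pattern = re.compile(r"Ben.* Beal[e]*")
hits = [entry for entry in index_entries if pattern.search(entry)]

print(hits)  # ['Ben Beale', 'Benj. Beal', 'Benja Beall']
```

Because the pattern is unanchored, "Benja Beall" is returned even though the pattern itself stops at "Beal"; "Benjamin Bell" is rejected because it never contains "Beal".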

Now in 2021, STP has modified search to find the normalized representation of indexed names (as described above). The STP search engine still supports GREP pattern matching to locate index entries, so the pattern "Ben.* Beal[e]*" will still return the same entries as before. In addition, we have expanded search to include additional options including 1) Literal, 2) Soundex, 3) Metaphone, and 4) MetaSoundex. Soundex, Metaphone, and MetaSoundex are phonetic algorithms and enable the indexing of entities by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

Literal Search - A literal search finds index entries that match the search term exactly. The result is similar to using "Exact" or "Match All Terms Exactly" on other web sites. This option is most useful for narrowing the output of a search to a specific name.

Soundex (or Sound Index) Search - Soundex is an algorithm that codes surnames phonetically by reducing each one to a four-character code consisting of the first letter followed by three digits, where each digit represents one of six groups of consonant sounds. Its goal is to reduce matching problems caused by different spellings of names that sound similar. The algorithm was devised to code names in US census records, and the standard algorithm works best on European names.

Soundex produces a phonetic code, not a strictly alphabetical one, where surnames are coded based on the way a name sounds rather than on how it is spelled. For example, surnames that sound alike but are spelled differently, like Smith and Smyth, have the same code. The intent was to help researchers find a surname quickly even though it may have been recorded differently by census takers. Soundex was first applied to the 1880 census and reduced all the surnames in the US census to 1 of 26665 possible codes. If a name like Cook, though, is spelled Koch, or Faust is spelled Phaust, a search for a different set of Soundex codes based on the variation of the surname's first letter is necessary.

The Soundex algorithm, developed by Robert C. Russell and Margaret King Odell and patented in 1918 and 1922, is a very simple code well suited to hand encoding. The usual description of the algorithm is:

1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as in Table 1 (after the first letter):

Table 1: Soundex Transformations
Phonetic Description	Letters to encode	Encoding Digit
Oral Resonant	A, E, I, O, H, U, W, Y	0
Labials and labio-dentals	B, F, P, V	1
Gutturals and sibilants	C, G, J, K, Q, S, X, Z	2
Dental-mutes	D, T	3
Palatal-fricative	L	4
Labio-nasal and Lingua-nasal	M, N	5
Dental fricative	R	6

3. If two or more letters with the same number are adjacent in the original name (before step 1), retain only the first letter; also, two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If the name has too few letters to assign three digits, append zeros until there are three digits. If there are four or more digits, retain only the first three.
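The steps above can be sketched in Python as follows. This is a minimal illustration of the standard algorithm as described here, including the rule that 'h' and 'w' do not separate letters with the same number; it is not STP's code (STP uses PHP's built-in function), and the names CODES and soundex are chosen for this example. The function returns the code without the hyphen that is sometimes written after the first letter.

```python
# Map each coded consonant to its digit, following Table 1.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name: str) -> str:
    """Return a four-character Soundex code such as 'P236'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # The first letter is kept as a letter, but its digit still blocks
    # an immediately following letter with the same digit.
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W are skipped and do not act as separators
        code = CODES.get(letter, "")  # vowels and Y have no digit
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeat is coded again
    # Keep at most three digits and pad with zeros if there are fewer.
    return first + "".join(digits)[:3].ljust(3, "0")

print(soundex("Smith"), soundex("Smyth"))  # S530 S530
print(soundex("Pfister"))                  # P236
```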

Special Soundex Considerations

Double Letters - If the surname has any double letters, they are treated as one letter. Thus, in the surname Lloyd, the second L is ignored. In the surname Gutierrez, the second R is disregarded.

Side-by-Side Letters - A surname may have different side-by-side letters that receive the same number on the Soundex coding guide. For example, the c, k, and s in Jackson all receive the same code. These letters with the same code should be treated as only one letter. In the name Jackson, the k and s should be disregarded. This rule also applies to the first letter of a surname, even though it is not coded. For example, Pf in Pfister would receive a number 1 code for both the P and f. Thus in this name the letter f should be crossed out, and the code is P-236.

H and W Rule - The letters H and W do not act as separators between letters having the same code value. As a result, such letters are treated as adjacent and are condensed into a single code. For example, the letter sequence “CHS” would be coded as 2, whereas without this rule, it would be coded as 22. Note that this rule has often been omitted in descriptions of Soundex.

Mononymous names - People with a single given name were once the norm throughout much of the world, but are rare in modern times, particularly in the West. Before 1865 many individuals, especially in the South with its enslaved Americans or in areas with many Native Americans, may have used only a single-term name. Here, a phonetically spelled mononymous name for an American Indian or enslaved American is coded as if it were one continuous surname. If a distinguishable surname was given, the name may have been coded in the normal manner. For example, Dances with Wolves might have been coded as Dances (D-522) or as Wolves (W-412), or the name Shinka-Wa-Sa may have been coded as Shinka (S-520) or Sa (S-000). In other cases, first names were recorded as surnames.

Religious Figures - Nuns or other female religious figures with names such as Sister Veronica may have been members of households or heads of households or institutions where a child or children age 10 or under resided. Because many of these religious figures do not use a surname, the Soundexes for the post-1880 censuses frequently use the code S-236, for Sister. Similarly, some priests or monks may have been coded as B-636 (for “Brother”) or F-360 (for “Father”).

Prefixes - If the surname has a prefix, such as D’, De, dela, Di, du, Le, van, or Von, code it both with and without the prefix because it might be listed under either code. The surname vanDevanter, for example, could be V-531 or D-153. Mc and Mac are not considered to be prefixes and should be coded like other surnames.

We use the soundex function of PHP to generate our codes. Note: PHP does not implement the H and W Rule above; instead, H and W are treated as vowels.

References

https://bradandkathy.com/genealogy/overviewofsoundex.php - Excellent overview of Soundex
https://www.litscape.com/word_tools/soundex_match.php - Experiment with Soundex codes
https://www.functions-online.com/soundex.html - Another experiment with Soundex codes
https://www.avotaynu.com/soundex.htm - Good background on history of Soundex

Metaphone Search -

Metaphone is a phonetic algorithm that improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding which does a better job of matching words and names which sound similar.

Metaphone was published by Lawrence Philips in 1990. As with Soundex, similar sounding words should share the same code; but, as a general rule, Metaphone produces longer, variable-length codes, considers more pronunciation rules, and has fewer ambiguities than Soundex.

The algorithm phonetically codes words by reducing them to 16 consonant sounds: B, F, H, J, K, L, M, N, P, R, S, T, W, X, Y, and 0 where 0 represents the "th" sound and X stands for the "sh" sound.

There are numerous implementations of the Metaphone algorithm that produce encodings that differ from the original Philips code. STP uses the PHP implementation. We have implemented a version of Metaphone in AppleScript and compared nearly 1,000,000 encodings with PHP, obtaining 100% agreement.

The Metaphone algorithm is very simple, but the rules for encoding the letters of a name are fairly complex and depend on the position of the letter in the name and on the letters that precede and follow it.

The Metaphone algorithm

VOWELS = characters of "AEIOU"
FRONTV = characters of "EIY"
CONSONANTS = characters of "BCDFGHJKLMNPQRSTVWXYZ"
VARSON = characters of "CGPST"
LEN = length of the name
IF LEN = 0 then return METAPH = "".
ELSEIF LEN = 1 then return METAPH = the transformation in Table 2.1.
ELSEIF LEN > 1 then
    IF the name starts with "KN", "GN", "PN", "AE" or "WR" then drop the first character.
    METAPH = "".
    Looping with N from 1 to LEN -- that is, looping through the characters of the name
        SYMB = character N of name; pSYMB = character N-1 of name; nSYMB = character N+1 of name; etc.
        IF SYMB is not "C" and SYMB is equal to pSYMB, METAPH = METAPH:""
        ELSE METAPH = METAPH:transformation from Table 2.2.
    Return METAPH.

Table 2.1 Metaphone Transformations for single letters
Letter	Encoding
C, G, Q	K
D	T
V	F
X, Z	S
H, W, Y	empty ""
otherwise	the letter
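As a small illustration, Table 2.1 can be written as a lookup in Python. This only restates the table for one-letter names; the names SINGLE_LETTER and metaphone_single are hypothetical, and the sketch makes no claim about how PHP handles this case internally.

```python
# Table 2.1 as a dictionary; letters not listed encode as themselves.
SINGLE_LETTER = {
    "C": "K", "G": "K", "Q": "K",
    "D": "T",
    "V": "F",
    "X": "S", "Z": "S",
    "H": "", "W": "", "Y": "",
}

def metaphone_single(letter: str) -> str:
    """Encode a one-letter name according to Table 2.1."""
    letter = letter.upper()
    return SINGLE_LETTER.get(letter, letter)

print(metaphone_single("q"))  # K
```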

Table 2.2 Metaphone Transformations as implemented in PHP
	SYMB =	and	then METAPH =	Example	Comment
IF	not "C"	SYMB = pSYMB	METAPH:""	Goggle	Drop doubled letter unless "C"
IF	"A"	N = 1	METAPH:"A"	Aaron	Drop vowels unless first character
IF	"B"	pSYMB = "M"	METAPH:""	Holcomb	PHP drops "B" preceded by "M" everywhere, not just at end.
ELSE	"B"		METAPH:"B"	Wombwell	PHP drops "B" preceded by "M" everywhere, not just at end.
IF	"C"	nSYMB = "H"	METAPH:"X"	Church
ELSEIF	"C"	nSYMB = "I" and nnSYMB = "A"	METAPH:"X"	Marcia
ELSEIF	"C"	nSYMB is FRONTV and pSYMB = "S"	METAPH:""	Cascill
ELSEIF	"C"	nSYMB is FRONTV and pSYMB ≠ "S"	METAPH:"S"	Francis
ELSE	"C"		METAPH:"K"	Backham
IF	"D"	nSYMB = "G" and nnSYMB is FRONTV	METAPH:"J"	Dodge
ELSE	"D"		METAPH:"T"	Adams
IF	"E"	N = 1	METAPH:"E"	Elisha	Drop vowels unless first character
IF	"F"		METAPH:"F"	Francis
IF	"G"	pSYMB = "G"	METAPH:"K"	Goggle	Hard G
ELSEIF	"G"	nSYMB = "H"	METAPH:"F"	Buckingham
ELSEIF	"G"	nSYMB = "H" and name contains "[BDH]..GH"	METAPH:""	Bright
ELSEIF	"G"	nSYMB = "H" and name contains "H...GH"	METAPH:""	Throughtgood
ELSEIF	"G"	nSYMB = "N"	METAPH:"K"	signs
ELSEIF	"G"	nSYMB ... nnnSYMB = "NED"	METAPH:""	signed	PHP drops "G" followed by "NED" everywhere, not just at end.
ELSEIF	"G"	N = LEN - 1	METAPH:""	sign
ELSEIF	"G"	nSYMB is FRONTV and pSYMB ≠ "D"	METAPH:"J"	Burger	DG handled in D rules
ELSEIF	"G"	nSYMB is FRONTV and pSYMB = "D"	METAPH:""	Bridgeman	DG handled in D rules
ELSE	"G"		METAPH:"K"	Lodge
IF	"H"	N=2 and pSYMB = "W"	METAPH:""	White
ELSEIF	"H"	nSYMB is VOWEL and not (N = LEN or (pSYMB is (VARSON or VOWEL)))	METAPH:"H"	Barham
ELSEIF	"H"	nSYMB is VOWEL and pSYMB is VOWEL	METAPH:"H"	Abraham
ELSE	"H"		METAPH:""	Abraham
IF	"I"	N = 1	METAPH:"I"	Ishmael	Drop vowels unless first character
IF	"J"		METAPH:"J"	Major
IF	"K"	N = 1 or (N > 1 and pSYMB ≠ "C")	METAPH:"K"	Mark
ELSE	"K"		METAPH:""	Mark
IF	"L"		METAPH:"L"	Aldride
IF	"M"		METAPH:"M"	Abraham
IF	"N"		METAPH:"N"	Adkins
IF	"O"	N = 1	METAPH:"O"	Omega	Drop vowels unless first character
IF	"P"	nSYMB = "H"	METAPH:"F"	Phillips
ELSE	"P"		METAPH:"P"	Pond
IF	"Q"		METAPH:"K"	Squire
IF	"R"		METAPH:"R"	Aaron
IF	"S"	nSYMB = "I" and (nnSYMB = "A" or nnSYMB = "O")	METAPH:"X"	Session
ELSEIF	"S"	nSYMB = "H"	METAPH:"X"	Walsh
ELSE	"S"		METAPH:"S"	Allison
IF	"T"	nSYMB = "C" and nnSYMB = "H"	METAPH:""	Hitchcock
ELSEIF	"T"	nSYMB = "I" and (nnSYMB = "A" or nnSYMB = "O")	METAPH:"X"	Martian
ELSEIF	"T"	nSYMB = "H" and pSYMB ≠ "T"	METAPH:"0"	Anthony
ELSE	"T"		METAPH:"T"	Abbott
IF	"U"	N = 1	METAPH:"U"	Uriah	Drop vowels unless first character
IF	"V"		METAPH:"F"	Avery
IF	"W"	N=1 and nSYMB is "H"	METAPH:"W"	White
ELSEIF	"W"	N=1 and nSYMB is CONSONANT	METAPH:""	Wren
ELSEIF	"W"	N=1	METAPH:"W"	Waters
ELSEIF	"W"	nSYMB is VOWEL	METAPH:"W"	Woodward
ELSE	"W"		METAPH:""	Bowden
IF	"X"	N = 1	METAPH:"S"	Xaiver
ELSE	"X"		METAPH:"KS"	Alex
IF	"Y"	nSYMB is VOWEL	METAPH:"Y"	Bayer
ELSE	"Y"		METAPH:""	Ayres
IF	"Z"		METAPH:"S"	Cozier

Examination of the encoding in Table 2.2 confirms that Metaphone retains much more information about pronunciation than Soundex. Metaphone is available as a built-in operator in a number of systems, including the version of PHP that STP uses.

MetaSoundex Search -

MetaSoundex belongs to a class of algorithms called hybrid models, where two or more methods are combined to achieve accurate recall with high precision. The algorithm used by STP is similar to a method presented by Keerthi Koneru in a Master's thesis. As the name suggests, the Koneru MetaSoundex algorithm uses Soundex to encode the output of Metaphone, with the objective of retaining the strengths of both algorithms while minimizing their weaknesses. Soundex is known to have high accuracy but very low precision. The Metaphone phonetic matching algorithm encodes information about vowels and sounds of diphthongs but has less accuracy than Soundex.

The algorithm for MetaSoundex encoding of a name is very simple.

1. Convert all letters of the name to uppercase.
2. Encode the uppercase name using the Metaphone algorithm to retain vowel sounds and diphthong combinations.
3. Encode the Metaphone output using the Soundex algorithm.
4. Encode the first letter of the Soundex output using a digit from Table 3.

Table 3 Meta-Soundex Transformations
Transformation	Letters to encode	Encoding Digit
1	A, E, I, O, U	0
2	J, Y	1
3	D, T	3
4	C, S, Z	4
5	G, H, K, Q, X	5
6	N, M	6
7	B, F, P, V, W	7
8	L	8
9	R	9

Note that the Koneru MetaSoundex algorithm encodes all vowels as 0; thus the algorithm is not able to discriminate between names such as Addy and Eddy (see transformation 1 in Table 3). STP uses a variation of the Koneru algorithm where vowels are not transformed, thus retaining additional information about the name pronunciation and providing additional accuracy in search results.
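The final step and the vowel variation can be sketched in Python. This covers only the first-letter encoding from Table 3 and takes an already computed Soundex-style code as input; the function name encode_first_letter and the keep_vowels flag are hypothetical, and reading "vowels are not transformed" as "a leading vowel stays a letter" is an assumption. The inputs "A300" and "E300" are assumed to be what the earlier steps would produce for Addy and Eddy.

```python
# Table 3 as a lookup from first letter to digit.
FIRST_LETTER_DIGITS = {}
for letters, digit in (("AEIOU", "0"), ("JY", "1"), ("DT", "3"),
                       ("CSZ", "4"), ("GHKQX", "5"), ("NM", "6"),
                       ("BFPVW", "7"), ("L", "8"), ("R", "9")):
    for letter in letters:
        FIRST_LETTER_DIGITS[letter] = digit

def encode_first_letter(soundex_code: str, keep_vowels: bool = True) -> str:
    """Replace the leading letter of a Soundex-style code using Table 3.

    keep_vowels=True models the variation described above (assumption):
    a leading vowel is left as a letter instead of becoming 0.
    """
    if not soundex_code:
        return ""
    first = soundex_code[0].upper()
    rest = soundex_code[1:]
    if keep_vowels and first in "AEIOU":
        return first + rest
    return FIRST_LETTER_DIGITS.get(first, first) + rest

# With every vowel mapped to 0, Addy and Eddy collide.
print(encode_first_letter("A300", keep_vowels=False))  # 0300
print(encode_first_letter("E300", keep_vowels=False))  # 0300
# Keeping the vowel separates them.
print(encode_first_letter("A300"), encode_first_letter("E300"))  # A300 E300
```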

We have not found an implementation of MetaSoundex that is publicly available.

References

PHONETIC MATCHING TOOLKIT WITH STATE-OF-THE-ART META-SOUNDEX: Master Thesis by Keerthi Koneru

Current and Future Work -

Our initial analysis of Phonetic Algorithms for retrieval of names in genealogical research shows much promise. The STP search engine is currently being augmented to include options for Soundex, Metaphone, and MetaSoundex. More study is needed to determine the optimal approach for name retrieval. For example, an open issue is an approach for handling names with initials.

Our experience to date with hybrid algorithms is limited to MetaSoundex, but we believe that hybrid algorithms can significantly improve the accuracy and utility of search results. Of particular interest is a method to rank search results such that the most relevant results are presented to the user first.

We will soon begin experimenting with ranking methods by first evaluating a Levenshtein distance metric, a string metric for measuring the difference between two sequences of characters. In 1965 mathematician Vladimir Levenshtein proposed a distance metric that measures how different two character strings are. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
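The definition above translates into a short dynamic-programming routine. The following Python sketch is a generic textbook version written for illustration; it is not the code STP uses, and the function name levenshtein is simply chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in b so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("Beale", "Beal"))              # 1
print(levenshtein("Williamson", "Williams'n"))   # 1
```

A smaller distance means a closer match, so results could be sorted in ascending order of distance from the search term.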

We are very interested in record linkage. The Southampton files contain information often compiled by court officials; the name of an individual may be spelled differently (or misspelled) across the records, the records were indexed by different researchers with multiple styles of indexing, and an entry may contain only initials or an abbreviated name. Normalized names help with record linkage, but they are only a tiny first step. Ranking of search results is a second small step, but record linkage is still mainly a manual process. Can we systematically link these records? Recent research (see reference 2) suggests the answer may be "Yes." Do you have ideas? Please share at abbubba@me.com.

Update: We have begun to rank search results with the closest matches listed first. Four methods have been implemented (Levenshtein, Similar_text, Jaro, JaroWinkler). Initial evaluations show that JaroWinkler gives the best results.
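As a rough illustration of ranking by Jaro-Winkler similarity, here is a Python sketch of the commonly published form of the metric. It is not STP's implementation: the function names, the prefix scaling factor of 0.1, the four-character prefix limit, and the case-insensitive comparison are all assumptions made for this example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)

def rank(query: str, candidates: list) -> list:
    """Order candidates so that the closest match to query comes first."""
    return sorted(candidates,
                  key=lambda name: jaro_winkler(query.lower(), name.lower()),
                  reverse=True)

print(rank("Benjamin Beale",
           ["Robert Beale", "Benjamin Beal", "Benjamin Beale"]))
```

Unlike Levenshtein distance, a higher score means a closer match here, which is why the sort is in descending order.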

References

1. Levenshtein distance metric

2. The State of Record Linkage and Current Research Problems