One approach: if you run entity resolution (ER) in Senzing on a collection of datasets, each resolved entity will carry a list of the values seen for each of its features, such as names.
In other words, one byproduct of running ER is a domain-specific thesaurus, plus data quality metrics for the features.
From the results you can then see which name variants are related to each other, and which of them are used most often.
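As a rough sketch of that idea, the snippet below groups name variants per resolved entity and ranks them by how often they occur. The input shape (`resolved_entities`, a dict of entity id to the name values from its source records) and both function names are hypothetical, made up for illustration. It is not Senzing's export format, so you would need to map the real ER output into something like this first.

```python
from collections import Counter

# Hypothetical, simplified input: entity id -> name values seen across the
# source records that resolved to that entity. NOT Senzing's actual format.
resolved_entities = {
    1: ["Robert Smith", "Bob Smith", "Robert Smith", "Rob Smith"],
    2: ["Elizabeth Jones", "Liz Jones", "Liz Jones"],
}


def build_name_thesaurus(entities):
    """Map each entity id to its name variants, ranked by frequency."""
    thesaurus = {}
    for entity_id, names in entities.items():
        # Ignore empty values and surrounding whitespace before counting.
        counts = Counter(name.strip() for name in names if name.strip())
        thesaurus[entity_id] = counts.most_common()
    return thesaurus


def matching_pairs(thesaurus):
    """Build (variant, most_common_name) pairs usable as positive test cases."""
    pairs = []
    for ranked in thesaurus.values():
        if not ranked:
            continue
        preferred = ranked[0][0]  # the most frequently used variant
        for variant, _count in ranked[1:]:
            pairs.append((variant, preferred))
    return pairs


thesaurus = build_name_thesaurus(resolved_entities)
pairs = matching_pairs(thesaurus)

for entity_id, ranked in thesaurus.items():
    print(entity_id, ranked)
print(pairs)
```

Pairs drawn from the same entity give positive examples for a name matcher; pairing names across different entities would be one way to get negative examples, assuming the ER results are trusted.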
Might this help with developing data and tests for name matching?