Patient Duplicate Identification Process

The purpose of the process is to identify possible duplicate patient records within a patient database by finding records with matching identifying data elements and scoring them based on the number and quality of matching elements. Data elements that are more prone to unique identification such as social security numbers would carry higher scores while elements that could be shared such as names and dates of birth would score lower. The composite scores should dictate the likelihood that an actual duplicate record has been found. The process will produce a temporary table which will be replaced with each run. This table will contain a row for each matching data element with identifiers linking to each original patient record. Any given match candidate could contain several rows as there would be a number of elements that match for records that were duplicates.

The process begins by building supplemental indexes on the patient table. The indexes are vital to optimizing execution time of the process given the size of the patient database. Other implementations may not require supplemental indexes.

The next step is to rebuild the temp table. This is done by clearing the table and running a series of matching queries to identify matching elements and appending them to the temp table along with the identifying patient ids and match scores. The matches are identified by utilizing a view which self joins the patient data producing a “square” of the patient data exposing the fields which will be checked for matches. The view also contains fields which output the value of each possible match. So any change to the scoring is made by altering the view. Each append query is run against this view filtering by the fields being matched. In the append query the score value for the field match is obtained from the aforementioned field.

To improve performance we start with high quality matching field which produce fewer records but higher scoring fields. As we progress to less reliable matching candidates, we target the duplicate search to only check candidates that have already compiled a score above a threshold that would indicate a possible legitimate match. A view that aggregates the current match scores is used to accomplish this.

For example, matching zip codes would produce billions of records but we filter by matching the candidates with candidates that have already been identified by other criteria and have obtained a minimal score. This improves the quality of match identification without writing extraneous records to our table.

After the temp table is completed, another view can be used to identify and rank the quality of potential patient duplicates. A report can be produced from this view by linking to both patient records filtering the list to scores above a determined threshold. The cutoff can be tweaked if a more reliable cutoff is determined.

The optimization techniques applied here significantly reduced run time for the process. A process that could have taken over an hour will run in under a minute.