Can open data sources predict an election?

davidjstier
5 min read · Sep 18, 2020

I’m seeking your help to find out. It’s a huge project, and initial results are promising. If you’re interested, please get in touch; contact details are at the end of this post.

Quick background: For 20 years I’ve used publicly available datasets to identify companies and create business contact lists in niche industries. Recently, government agencies have created “Open Data” portals that make it significantly easier to locate and download thousands of pre-assembled, easy-to-compile datasets.

Finding and stitching these datasets together can provide us with a highly granular view of how society behaves and may improve our ability to predict future elections.

King County, WA open data website. A search for ‘parcels’ yields datasets on foreclosed properties, GIS data on property size, and a file with 759,000 records of parcel number and location (to match with tax assessor info).

What can open data tell us about the 2020 election?

Hypothesis: Knowing who your neighbors are and which precinct you live in is essential for improving predictions of voter behavior. I have compiled a dataset that includes all of the following:

  • where a voter lives
  • how they and their neighbors have voted (or not)
  • which candidates their neighbors support and how strongly they do so
  • how each voter compares to others in their voting precinct in terms of their type of residence, rent vs. own, property value and the range of variability of each of these both within and across precincts
  • how many people in each precinct are registered, are ‘active’ voters, have actually voted, and which candidate/party they voted for during the runoffs and general elections of 2016 and 2020

What kind of information is in this model:

Individual voters: Many assume that personal voting information is ‘confidential’ — and this is true for many states. Election officials make it nearly impossible to compile data on individual voters — typically, you must enter several fields such as full address and/or full name to see just one person’s voter registration information. However, the state of Florida provides free public access to their list of all registered voters and one large county in Texas provides immediate access to their complete voter registration dataset.

Voters’ history: Both of these sources include date of birth, home address, party affiliation, assigned precinct and participation in prior elections (not who you voted for, just whether you voted in that election).

Election results by precinct: The precinct is the lowest level of aggregation for tracking election results. This data is often available for elections back to 2016 and includes the total number of registered voters per precinct and the number of votes cast for each candidate or proposal.
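As a rough illustration of what precinct-level results make possible, the sketch below computes turnout and each candidate’s vote share for a single precinct. The function name, the input layout and the numbers are hypothetical; they are not taken from any real results file.

```python
def precinct_summary(registered, votes):
    """Return turnout and per-candidate vote share for one precinct.

    registered: number of registered voters in the precinct.
    votes: dict mapping candidate name -> votes received.
    Assumption: turnout is approximated by the sum of candidate votes,
    which undercounts ballots that skipped this contest.
    """
    total = sum(votes.values())
    turnout = total / registered if registered else None
    shares = {name: count / total for name, count in votes.items()} if total else {}
    return {"votes_cast": total, "turnout": turnout, "shares": shares}


# Made-up numbers for a single precinct.
summary = precinct_summary(1200, {"Candidate A": 450, "Candidate B": 300})
print(summary)
```

Comparing these figures for the same precinct across elections is one simple way to see where turnout or support has shifted.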

Candidate preference: Beyond the precinct-level totals for each candidate, there’s another source of data that can provide insight into how unique individuals are demonstrating their support. This dataset includes information on specific individuals over the past 4 years and can track monthly voter support during both the 2016 and 2020 election cycles. Unfortunately, this dataset doesn’t make it easy to identify unique individuals. For one major city that I analyzed, there were on average 2.9 different name-and-address combinations for each unique individual.

For example, what you and I see as the same person looks like 3 different people to a computer:

Rebecca Smith 117 E Harborview Drive

Rebecca L Smith 117 Harborview Drive

Rebecca Smith 117 Harborview Dr E

A citizen data scientist can use pattern matching algorithms to determine that there’s just one Rebecca Smith on Harborview Drive. However, to do so requires time and significant computing resources. There are nearly 1 million records in just that one city; they were eventually boiled down to 45,000 unique people who on average have 20 different records.
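To give a feel for the matching step, here is a minimal sketch that collapses the three Rebecca Smith records into one. It is not the method I actually used: it simply normalizes each name and address and groups on the exact result, whereas real data usually needs fuzzy matching on top. The lookup tables and function names are hypothetical, and dropping directionals such as “E” is an assumption that can merge genuinely different addresses.

```python
import re
from collections import defaultdict

# Hypothetical lookup tables; real address data needs many more entries.
SUFFIXES = {"drive": "dr", "street": "st", "avenue": "ave", "road": "rd"}
DIRECTIONS = {"n", "s", "e", "w", "north", "south", "east", "west"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop single-letter middle initials."""
    tokens = re.sub(r"[^a-z ]", "", name.lower()).split()
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0], *middle, tokens[-1]]
    return " ".join(tokens)


def normalize_address(address):
    """Lowercase, abbreviate street suffixes, and drop directionals.

    Assumption: dropping directionals is acceptable. It is lossy, since
    "117 E Main St" and "117 W Main St" can be different buildings.
    """
    tokens = re.sub(r"[^a-z0-9 ]", "", address.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens if t not in DIRECTIONS)


def group_records(records):
    """Group (name, address) pairs that share a normalized key."""
    groups = defaultdict(list)
    for name, address in records:
        key = (normalize_name(name), normalize_address(address))
        groups[key].append((name, address))
    return dict(groups)


records = [
    ("Rebecca Smith", "117 E Harborview Drive"),
    ("Rebecca L Smith", "117 Harborview Drive"),
    ("Rebecca Smith", "117 Harborview Dr E"),
]
groups = group_records(records)
print(len(groups))  # number of distinct people found
```

Exact grouping on a normalized key is cheap, which matters at close to a million records. Misspellings and nicknames would still slip through and need a slower similarity comparison.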

Once this distillation is done, you can get a much clearer view of what’s happening. For example:

This new dataset shows how many voters switched to Biden after he earned the nomination and which candidate they initially supported:

By uniquely identifying individuals, we can see the # of new supporters for each candidate by month:

Number of unique people who started to support a candidate since January 2019 (each person counted only once)

Home address: It may disturb people to know that their name, home address, which candidates they support and who they vote for can all be found online (note: the extent of information online does vary by state).

An address provides a wealth of untapped information. With an address you can often find out what type of residence a person lives in (its size, value, number of bedrooms, age, etc.) and whether they own or rent. Geographic information system (GIS) software makes it possible to compare each person with those who live on the same block or within a 1/4- or 1/2-mile area around them. Generating summary statistics for each of these groupings can show whether there’s something unique about that person compared to others who live nearby.
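The neighbor comparison can be sketched in a few lines. The example below assumes each home already has a latitude, a longitude and a property value (the field names and figures are made up), finds the homes within a given radius, and reports how many standard deviations the target sits from its neighbors. A circular radius and a z-score are just one reasonable choice here, not necessarily what a GIS package would do.

```python
import math
from statistics import mean, pstdev

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def neighbor_z_score(target, homes, radius_miles, field="value"):
    """How many standard deviations `target` sits from its nearby homes.

    Returns None when fewer than two neighbors fall inside the radius or
    when the neighbors all share the same value.
    """
    nearby = [
        h[field]
        for h in homes
        if h is not target
        and haversine_miles(target["lat"], target["lon"], h["lat"], h["lon"]) <= radius_miles
    ]
    if len(nearby) < 2:
        return None
    spread = pstdev(nearby)
    if spread == 0:
        return None
    return (target[field] - mean(nearby)) / spread


# Made-up homes: three close together and one several miles away.
homes = [
    {"lat": 42.360, "lon": -71.060, "value": 900_000},
    {"lat": 42.361, "lon": -71.060, "value": 400_000},
    {"lat": 42.362, "lon": -71.060, "value": 500_000},
    {"lat": 42.460, "lon": -71.060, "value": 5_000_000},
]
score = neighbor_z_score(homes[0], homes, radius_miles=0.25)
print(score)
```

A large positive or negative score flags a home that stands out from its immediate surroundings, and the same calculation can be repeated at 1/2 mile to see whether the difference persists.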

For example, I created an algorithm to score every homeowner in Boston on the likelihood that they would install a solar panel system.

An example of Boston homes that have installed solar power. Often these installations cluster together within a 1–2 block area and then another cluster would be a mile away.

I compared each homeowner to all homes within ever-increasing areas, starting with the same side of a block (105 Main St and 119 Main St), then both sides of the block (104, 105, 112 and 119 Main St), and then those within a 1/4-mile and 1/2-mile square grid. I did this to see how the predictive power of any particular feature degraded as each house was compared to an ever-widening geographic group of homes. Here’s how the predictive power of these comparisons lessened as the geographic cluster increased in size:

37.9% same side of the block

18.6% both sides of the block

11.9% 1/4 mile grid

8.4% 1/2 mile grid

5.4% Census tract

4.1% Neighborhood (17 in Boston)
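For readers who want to try something similar, the sketch below shows one way to assign a home to the first four of these nested comparison groups. It is an illustration under stated assumptions, not the code behind the numbers above: it assumes odd and even house numbers sit on opposite sides of the street, treats a block as a run of 100 house numbers, and expects hypothetical planar coordinates measured in miles.

```python
import math


def comparison_groups(house_number, street, x_miles, y_miles):
    """Return keys for progressively wider comparison groups.

    Assumptions (not from the original analysis):
    - odd and even house numbers sit on opposite sides of the street;
    - a block is the run of numbers with the same house_number // 100;
    - x_miles / y_miles are planar coordinates measured in miles.
    """
    block = house_number // 100
    side = "odd" if house_number % 2 else "even"

    def cell(size):
        # Index of the square grid cell of the given size that holds the home.
        return (math.floor(x_miles / size), math.floor(y_miles / size))

    return {
        "same_side": (street, block, side),
        "both_sides": (street, block),
        "quarter_mile_grid": cell(0.25),
        "half_mile_grid": cell(0.5),
    }


a = comparison_groups(105, "Main St", 1.10, 2.05)
b = comparison_groups(119, "Main St", 1.12, 2.05)
c = comparison_groups(104, "Main St", 1.10, 2.06)
print(a["same_side"] == b["same_side"], a["same_side"] == c["same_side"])
```

Homes that share a key belong to the same group at that level, so summary statistics can be computed per key and compared across levels.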

I plan to publish another article with more details on the information gained from comparing voters to their neighbors.

Thank you for reading.

If you’re interested in learning more, please contact me via DM on Twitter (@davidjstier) or email: davidjstier at gmail dot com

