| --- | |||||
| title: "Candidate Listing Resolution" | |||||
| author: Garrick Aden-Buie | |||||
| date: 2024-01-05 | |||||
| format: | |||||
|   html: | |||||
|     embed-resources: true | |||||
|     mermaid: | |||||
|       theme: neutral | |||||
| editor: | |||||
|   render-on-save: true | |||||
| --- | |||||
| ```{=html} | |||||
| <style> | |||||
| .knitsql-table { | |||||
| margin-bottom: 1rem; | |||||
| } | |||||
| </style> | |||||
| ``` | |||||
| ## Overview | |||||
| ```{r setup} | |||||
| #| include: false | |||||
| library(tidyverse) | |||||
| library(fs) | |||||
| pkgload::load_all(here::here("process")) | |||||
| ``` | |||||
| The complete dataset involves a number of tables from four data sources: | |||||
| 1. The Campaign Finance Reports from the NC State Board of Elections (SBOE); | |||||
| 2. The Candidate Listing from the NC SBOE; | |||||
| 3. The Voter Registration from the NC SBOE; and | |||||
| 4. Resolved addresses from the U.S. Census Bureau. | |||||
| The diagram below outlines the tables in the uploaded dataset | |||||
| and their general relationship to each other. | |||||
| ```{mermaid} | |||||
| flowchart LR | |||||
| subgraph "Campaign Finance reports" | |||||
| reports | |||||
| cover | |||||
| officers | |||||
| receipts | |||||
| expenses | |||||
| committees | |||||
| receipts_payer | |||||
| expenses_payee | |||||
| end | |||||
| subgraph "Candidate Listing" | |||||
| cl_candidates | |||||
| cl_elections | |||||
| cl_name_on_ballot | |||||
| cl_party | |||||
| cl_contact | |||||
| end | |||||
| committees --> reports | |||||
| reports --> cover | |||||
| reports --> officers | |||||
| reports --> receipts | |||||
| reports --> expenses | |||||
| receipts --> receipts_payer | |||||
| expenses --> expenses_payee | |||||
| cl_candidates --> cl_elections | |||||
| cl_elections --> cl_name_on_ballot | |||||
| cl_elections --> cl_party | |||||
| cl_elections --> cl_contact | |||||
| committee_candidate <--> committees | |||||
| committee_candidate <--> cl_candidates | |||||
| ``` | |||||
| ### Campaign finance reports | |||||
| The primary goal of this project was to collect and organize the campaign finance reports. | |||||
| The core tables of this portion of the dataset are: | |||||
| 1. `reports`: This table provides a master list of reports filed with the SBOE. | |||||
| 2. `committees`: This table extracts the most recent committee information | |||||
| from the filed reports. If you're interested in a particular committee, this | |||||
| is likely the place you'll want to start. | |||||
| 3. `receipts`, `expenses`: These tables list the contributions received | |||||
| and the expenses paid by each committee. | |||||
| 4. `receipts_payer`, `expenses_payee`: These tables provide the payer/payee | |||||
| information for the receipts and expenses, extracted from the `receipts` and | |||||
| `expenses` tables. I haven't de-duplicated the records in these tables (yet). | |||||
| 5. `cover`: Each report has a cover "page" where key information about the | |||||
| committee or the period being reported is provided. | |||||
| 6. `officers`: Each committee has a list of officers. This table provides a | |||||
| master list of officers for all committees. | |||||
| All of the above tables have both `sboe_id` and `report_id` columns. | |||||
| `sboe_id` uniquely identifies a committee by its SBOE-assigned ID, | |||||
| and `report_id` uniquely identifies an individual campaign finance report. | |||||
| In the `committees` table, | |||||
| `report_id` refers to the latest report | |||||
| from which the committee contact information was extracted. | |||||
| The `sboe_id` is always the same for a given committee | |||||
| and is the best way to identify a committee. | |||||
| The `report_id` refers to a specific filing of a report, | |||||
| and I've included only the most recently filed reports in this database. | |||||
| Note that because amended filings receive a new report ID, | |||||
| the `report_id` may change in the future | |||||
| if or when the committee files an amendment. | |||||
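As a sketch of how the two IDs work together, the query below looks up the report that each committee's contact information was taken from. It assumes `committees.report_id` matches `reports.report_id` as described above, and it only uses columns that appear in the queries later in this document (`committee_name`, `year`, `doc_name`).

```sql
-- For each committee, show the report its contact information came from.
SELECT
  committees.sboe_id,
  committees.committee_name,
  reports.report_id,
  reports.year,
  reports.doc_name
FROM committees
LEFT JOIN reports
  ON committees.report_id = reports.report_id
ORDER BY committees.sboe_id
LIMIT 10
```

Because amended filings get a new `report_id`, a saved `report_id` may stop matching after an update, while `sboe_id` stays stable.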
| ### Candidate listing | |||||
| To augment the campaign finance data, | |||||
| I've included several tables extracted from each year's candidate listing. | |||||
| The primary table of interest is `cl_candidates`. | |||||
| It contains individual candidates from the candidate listing | |||||
| and their contact information and party affiliation. | |||||
| If you're looking for a specific candidate, | |||||
| but don't know their SBOE ID or their committee's name, | |||||
| this is the place to start. | |||||
| Candidates from the candidate listing are linked to a specific SBOE committee | |||||
| via the `committee_candidate` table, | |||||
| which matches a `candidate_id` with an `sboe_id`. | |||||
| Note that the `candidate_id` is an ID I've created | |||||
| to help organize the candidate listing -- | |||||
| although I'm sure the SBOE has its own internal ID for each candidate, | |||||
| they don't include them in the data they publish. | |||||
| This means that the `candidate_id` may change | |||||
| when the candidate listing is updated. | |||||
| ### Additional data sources | |||||
| Many tables have an `address_lookup` column. | |||||
| When this column is found in a table, | |||||
| it serves as the key for matching the address of that row | |||||
| with the resolved addresses in the `addresses` table. | |||||
| These addresses have been passed through the geocoding services | |||||
| provided by the U.S. Census Bureau, | |||||
| so the `addresses` table is also a useful way | |||||
| to get the latitude and longitude of an address for mapping purposes. | |||||
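A minimal sketch of that lookup, using `receipts_payer` as the example table. It assumes the `addresses` table is keyed by a column that is also named `address_lookup` and that it is available on the same connection as the other tables, so treat it as an illustration rather than a tested query.

```sql
-- Attach resolved address fields to payer records.
-- Assumption: `addresses` has an `address_lookup` key column.
SELECT
  receipts_payer.payer_id,
  receipts_payer.org_name,
  addresses.*
FROM receipts_payer
LEFT JOIN addresses
  ON receipts_payer.address_lookup = addresses.address_lookup
LIMIT 10
```

The `LEFT JOIN` keeps payer records whose address could not be resolved; their address columns come back as `NULL`.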
| Additionally, I've included the voter registration data. | |||||
| It's not currently linked to any other tables, | |||||
| but it could be useful in observing trends in voter registration across counties, | |||||
| or for exploring demographic trends in the voting population. | |||||
| ```{r load-database} | |||||
| #| echo: false | |||||
| reports <- out_open_dataset_db(here::here("process/data-out/reports")) | |||||
| officers <- out_open_dataset_db(here::here("process/data-out/officers")) | |||||
| committees <- out_open_dataset_db(here::here("process/data-out/committees")) | |||||
| cl_candidates <- out_open_dataset_db(here::here("process/data-out/cl_candidates")) | |||||
| cl_elections <- out_open_dataset_db(here::here("process/data-out/cl_elections")) | |||||
| cl_party <- out_open_dataset_db(here::here("process/data-out/cl_party")) | |||||
| cl_name_on_ballot <- out_open_dataset_db(here::here("process/data-out/cl_name_on_ballot")) | |||||
| cl_contact <- out_open_dataset_db(here::here("process/data-out/cl_contact")) | |||||
| committee_candidate <- out_open_dataset_db(here::here("process/data-out/committee_candidate")) | |||||
| receipts <- out_open_dataset_db(here::here("process/data-out/receipts")) | |||||
| receipts_payer <- out_open_dataset_db(here::here("process/data-out/receipts_payer")) | |||||
| con <- duckdb_global_con() | |||||
| ``` | |||||
| ## Finding a candidate | |||||
| You can often start by searching for a candidate in the `cl_candidates` table. | |||||
| Here's an example looking for **John Bell**. | |||||
| ```{sql john-bell, connection=con} | |||||
| SELECT * | |||||
| FROM cl_candidates | |||||
| WHERE ( | |||||
| first_name = 'JOHN' AND | |||||
| middle_name = 'RICHARD' AND | |||||
| last_name = 'BELL' | |||||
| ) | |||||
| ``` | |||||
| Then, join the result with the `committee_candidate` table | |||||
| to find the candidate's `sboe_id`, | |||||
| if a match has been identified. | |||||
| The candidate-committee linking uses probabilistic matching, | |||||
| which allows for some flexibility in the matching process. | |||||
| Note that not every candidate is linked to a committee | |||||
| and there are many more committees than candidates. | |||||
| ```{sql john-bell-committee-mapping, connection=con} | |||||
| WITH john_bell AS ( | |||||
| SELECT * | |||||
| FROM cl_candidates | |||||
| WHERE ( | |||||
| first_name = 'JOHN' AND | |||||
| middle_name = 'RICHARD' AND | |||||
| last_name = 'BELL' | |||||
| ) | |||||
| ) | |||||
| SELECT * | |||||
| FROM committee_candidate | |||||
| WHERE candidate_id IN ( | |||||
| SELECT candidate_id | |||||
| FROM john_bell | |||||
| ) | |||||
| ``` | |||||
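The two steps can also be folded into a single query. The sketch below goes from `cl_candidates` through `committee_candidate` to `committees`, using only columns that appear elsewhere in this document; the looser filter on `last_name` alone is just for illustration.

```sql
-- Candidate -> committee_candidate -> committees in one query.
-- LEFT JOINs keep candidates with no linked committee (NULL sboe_id).
SELECT
  cl_candidates.candidate_id,
  cl_candidates.first_name,
  cl_candidates.last_name,
  committee_candidate.sboe_id,
  committees.committee_name
FROM cl_candidates
LEFT JOIN committee_candidate
  ON cl_candidates.candidate_id = committee_candidate.candidate_id
LEFT JOIN committees
  ON committee_candidate.sboe_id = committees.sboe_id
WHERE cl_candidates.last_name = 'BELL'
ORDER BY cl_candidates.first_name
```

Rows with a `NULL` `sboe_id` are candidates for whom no committee match was identified.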
| The candidate listing also gives you | |||||
| the candidate's complete election history. | |||||
| You might find some historically interesting information | |||||
| by joining `cl_candidates` with | |||||
| * `cl_elections` for specific election contests, | |||||
| * `cl_name_on_ballot` for the candidate's name on the ballot, | |||||
| * `cl_party` for the party affiliation of a candidate in an election, and | |||||
| * `cl_contact` for contact information for a candidate. | |||||
| Here's an example combining the above tables | |||||
| to show **the last three elections** for John Bell. | |||||
| ```{sql john-bell-elections, connection=con} | |||||
| WITH john_bell AS ( | |||||
| SELECT * | |||||
| FROM cl_candidates | |||||
| WHERE ( | |||||
| first_name = 'JOHN' AND | |||||
| middle_name = 'RICHARD' AND | |||||
| last_name = 'BELL' | |||||
| ) | |||||
| ) | |||||
| SELECT * | |||||
| FROM cl_elections | |||||
| LEFT JOIN cl_name_on_ballot USING (candidate_id, election_dt) | |||||
| LEFT JOIN cl_party USING (candidate_id, election_dt) | |||||
| LEFT JOIN cl_contact USING (candidate_id, election_dt) | |||||
| WHERE candidate_id IN ( | |||||
| SELECT candidate_id | |||||
| FROM john_bell | |||||
| ) | |||||
| ORDER BY election_dt DESC | |||||
| LIMIT 3 | |||||
| ``` | |||||
| ## Campaign Finance Reports | |||||
| Within the campaign finance report data, | |||||
| the best place to get started is with the `reports` or `committees` table. | |||||
| The `reports` table provides a master list of reports filed with the SBOE. | |||||
| These include only the most up-to-date reports, | |||||
| taking into account amended filings, | |||||
| so you do not need to worry about filtering out outdated reports. | |||||
| Here are **five of the reports filed by John Bell's committee** | |||||
| (`sboe_id='STA-8S285O-C-001'` taken from the previous query). | |||||
| ```{sql john-bell-reports, connection=con} | |||||
| SELECT * | |||||
| FROM reports | |||||
| WHERE sboe_id='STA-8S285O-C-001' | |||||
| LIMIT 5 | |||||
| ``` | |||||
| Each report has a `cover` page, which is included in the `cover` table, | |||||
| but I've extracted the most recent name and contact information for the committee | |||||
| into the `committees` table. | |||||
| Here, you're guaranteed to get a single row per committee, | |||||
| for example with John Bell's committee: | |||||
| ```{sql john-bell-committee, connection=con} | |||||
| SELECT * | |||||
| FROM committees | |||||
| WHERE sboe_id='STA-8S285O-C-001' | |||||
| ``` | |||||
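If you want to confirm the one-row-per-committee property yourself, a quick check along these lines should return no rows.

```sql
-- Any rows returned here would mean a committee appears more than once.
SELECT sboe_id, COUNT(*) AS n_rows
FROM committees
GROUP BY sboe_id
HAVING COUNT(*) > 1
```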
| Next, we have the `receipts` and `expenses` tables. | |||||
| Both are similarly structured, so I'll just demonstrate how to use `receipts`. | |||||
| First, we'll filter the `reports` table to get the 2016 Q1 report record for John Bell's committee. | |||||
| Then, we can `INNER JOIN` this table with the `receipts` table, | |||||
| which returns all of the receipts | |||||
| for the reports in the `filtered_reports` common table expression | |||||
| (CTE, i.e. the temporary tables created using `WITH ___ AS (__query__)`). | |||||
| ```{sql john-bell-receipts, connection=con} | |||||
| WITH filtered_reports AS ( | |||||
| SELECT * | |||||
| FROM reports | |||||
| WHERE sboe_id='STA-8S285O-C-001' | |||||
| AND year=2016 | |||||
| AND doc_name='First Quarter' | |||||
| ) | |||||
| SELECT * | |||||
| FROM receipts | |||||
| INNER JOIN filtered_reports USING (report_id) | |||||
| LIMIT 5 | |||||
| ``` | |||||
| Once you have a list of receipts, | |||||
| you can start to do some analysis on these, | |||||
| like finding the total money received by contribution type. | |||||
| This time I'm filtering to John Bell's 2020 Q3 report. | |||||
| ```{sql john-bell-donations, connection=con} | |||||
| WITH filtered_reports AS ( | |||||
| SELECT * | |||||
| FROM reports | |||||
| WHERE sboe_id='STA-8S285O-C-001' | |||||
| AND year=2020 | |||||
| AND doc_name='Third Quarter' | |||||
| ) | |||||
| SELECT | |||||
| receipt_type_desc, | |||||
| receipt_type_code, | |||||
| is_donation, | |||||
| SUM(amount) as total | |||||
| FROM receipts | |||||
| INNER JOIN filtered_reports USING (report_id) | |||||
| GROUP BY | |||||
| receipt_type_desc, | |||||
| receipt_type_code, | |||||
| is_donation | |||||
| ORDER BY total DESC | |||||
| ``` | |||||
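The same pattern should carry over to `expenses`. The sketch below assumes `expenses` has `report_id` and `amount` columns like `receipts` does; the `expenses` schema isn't shown in this document, so check the column names before relying on it.

```sql
WITH filtered_reports AS (
  SELECT report_id
  FROM reports
  WHERE sboe_id = 'STA-8S285O-C-001'
    AND year = 2020
    AND doc_name = 'Third Quarter'
)
-- Assumption: `expenses` has `report_id` and `amount`, mirroring `receipts`.
SELECT
  COUNT(*) AS n_expenses,
  SUM(amount) AS total_spent
FROM expenses
INNER JOIN filtered_reports USING (report_id)
```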
| ## Receipts and Payer Information | |||||
| As you may have noticed above, | |||||
| each receipt is linked to a payer in the `receipts_payer` table | |||||
| via the `payer_id` column. | |||||
| At this time, I have not de-duplicated the `receipts_payer` table, | |||||
| so there may be multiple records for a single payer. | |||||
| Separating the `receipts` and `receipts_payer` tables | |||||
| will allow us to de-duplicate payer information in the future. | |||||
| Compared with the committee and candidate information, | |||||
| the payer records are much noisier | |||||
| and don't have any highly reliable fields | |||||
| that we can use to de-duplicate the records. | |||||
| The probabilistic matching I used for linking candidates and committees | |||||
| would work, but it's a relatively large engineering lift. | |||||
| The challenge is that there are about 750,000 unique payer records, | |||||
| which certainly isn't _big data_, | |||||
| but deduplicating the records requires roughly 281 billion comparisons | |||||
| (750,000 × 749,999 / 2, to compare every record with every other record). | |||||
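Short of full probabilistic matching, a rough first pass is to group payer records on a normalized name. This is not the matching used for candidates and committees, only a crude sketch: it treats records whose upper-cased, trimmed `org_name` is identical as one payer, so it will miss spelling variants and will merge distinct people who share a name. It also assumes `payer_id` is unique within `receipts_payer`.

```sql
-- Crude de-duplication: group payers by a normalized name.
-- Assumption: one row per payer_id in receipts_payer.
WITH payer_names AS (
  SELECT payer_id, UPPER(TRIM(org_name)) AS payer_name
  FROM receipts_payer
  WHERE org_name IS NOT NULL
)
SELECT
  payer_name,
  COUNT(DISTINCT payer_id) AS n_payer_records,
  SUM(amount) AS total
FROM receipts
INNER JOIN payer_names USING (payer_id)
GROUP BY payer_name
ORDER BY total DESC
LIMIT 20
```

A value of `n_payer_records` above 1 flags names that are spread across several payer records and are worth inspecting by hand.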
| For now, I'll show you how to use the `receipts_payer` table to look for specific donors. | |||||
| I recommend starting with `receipts_payer`, | |||||
| finding all `payer_id` values that match your person(s) of interest, | |||||
| and then working from those records back to the `receipts` table. | |||||
| As an example, let's find the donors who have contributed the most to NC candidates since 2016. | |||||
| The first CTE (`donations_total`) uses `receipts` to sum the total amount donated by each `payer_id`. | |||||
| The second CTE (`top_donors`) keeps only the top 20 donors. | |||||
| Finally, we join `top_donors` with `receipts_payer` to get the payer information for each donor. | |||||
| This gives us the "top 20" donors, | |||||
| but it's clear that some of them are the same person with slightly different payer information. | |||||
| ```{sql top-donors, connection=con} | |||||
| WITH | |||||
| donations_total AS ( | |||||
| SELECT payer_id, SUM(amount) AS total | |||||
| FROM receipts | |||||
| WHERE payer_id IS NOT NULL | |||||
| GROUP BY payer_id | |||||
| ), | |||||
| top_donors AS ( | |||||
| SELECT payer_id, total | |||||
| FROM ( | |||||
| SELECT *, RANK() OVER (ORDER BY total DESC) AS donor_rank | |||||
| FROM donations_total | |||||
| ) | |||||
| WHERE (donor_rank <= 20) | |||||
| ) | |||||
| SELECT | |||||
| top_donors.*, | |||||
| org_name, | |||||
| profession, | |||||
| employers_name, | |||||
| address_lookup | |||||
| FROM top_donors | |||||
| LEFT JOIN receipts_payer | |||||
| ON (top_donors.payer_id = receipts_payer.payer_id) | |||||
| ORDER BY UPPER(org_name) | |||||
| ``` | |||||
| Let's pick out **Greg Lindberg** from the list above. | |||||
| (I'm sure you're familiar with [Greg Lindberg](https://www.justice.gov/opa/pr/founder-and-chairman-multinational-investment-company-and-company-consultant-convicted) but I wasn't, and wow!) | |||||
| ```{sql greg-lindberg, connection=con, sql.max.print = 15} | |||||
| SELECT | |||||
| payer_id, | |||||
| org_name, | |||||
| profession, | |||||
| employers_name, | |||||
| address_lookup | |||||
| FROM receipts_payer | |||||
| WHERE ( | |||||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||||
| ) | |||||
| ``` | |||||
| All of the above records are definitely the same person, | |||||
| so, because I'm curious, | |||||
| let's find out **how much money Greg Lindberg has donated**. | |||||
| ```{sql greg-lindberg-total, connection=con, output.var="gl_total"} | |||||
| WITH greg_lindberg AS ( | |||||
| SELECT | |||||
| payer_id, | |||||
| org_name, | |||||
| profession, | |||||
| employers_name, | |||||
| address_lookup | |||||
| FROM receipts_payer | |||||
| WHERE ( | |||||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||||
| ) | |||||
| ) | |||||
| SELECT SUM(amount) AS total | |||||
| FROM receipts | |||||
| INNER JOIN greg_lindberg USING (payer_id) | |||||
| ``` | |||||
| The result of that query tells us that Greg Lindberg has donated | |||||
| **`r scales::dollar(gl_total$total, 1)`** | |||||
| to NC candidates since 2016. | |||||
| This number tracks with | |||||
| [an article from The News & Observer](https://www.newsobserver.com/news/politics-government/article228779794.html) | |||||
| that reports Greg Lindberg donated around $7.5 million from 2016 to 2018, | |||||
| a figure that includes donations to federal PACs. | |||||
| Now we can find out where Greg Lindberg has donated his money. | |||||
| The first CTE picks out the payer records for Greg Lindberg | |||||
| and the second finds his total donations to each committee. | |||||
| Finally, we join the total donations results with the `committees` table | |||||
| to get the committee names (and pick out the top 10). | |||||
| ```{sql greg-lindberg-donations, connection=con, output.var="gl_donations"} | |||||
| WITH | |||||
| greg_lindberg AS ( | |||||
| SELECT | |||||
| payer_id, | |||||
| org_name, | |||||
| profession, | |||||
| employers_name, | |||||
| address_lookup | |||||
| FROM receipts_payer | |||||
| WHERE ( | |||||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||||
| ) | |||||
| ), | |||||
| greg_lindberg_donations AS ( | |||||
| SELECT sboe_id, SUM(amount) AS total | |||||
| FROM receipts | |||||
| INNER JOIN greg_lindberg USING (payer_id) | |||||
| GROUP BY sboe_id | |||||
| ) | |||||
| SELECT | |||||
| greg_lindberg_donations.sboe_id, | |||||
| committees.committee_name, | |||||
| greg_lindberg_donations.total | |||||
| FROM greg_lindberg_donations | |||||
| LEFT JOIN committees USING (sboe_id) | |||||
| ORDER BY total DESC | |||||
| LIMIT 10 | |||||
| ``` | |||||
| ```{r table-gl-donations, echo = FALSE} | |||||
| library(gt) | |||||
| gt(gl_donations) |> | |||||
| cols_label( | |||||
| sboe_id = "SBOE ID", | |||||
| committee_name = "Committee Name", | |||||
| total = "Total Donations" | |||||
| ) |> | |||||
| fmt_currency(columns = total, decimals = 0) | |||||
| ``` | |||||
| And now, also because I'm curious, | |||||
| here is a plot showing Lindberg's donations over time. | |||||
| [Things went south for Lindberg](https://www.newsobserver.com/news/politics-government/article228776004.html) | |||||
| early in 2019, | |||||
| so it's unsurprising that his donations have dropped off after 2018. | |||||
| ```{sql greg-lindberg-donations-over-time, connection=con, output.var="gl_donations_over_time"} | |||||
| WITH | |||||
| greg_lindberg AS ( | |||||
| SELECT | |||||
| payer_id, | |||||
| org_name, | |||||
| profession, | |||||
| employers_name, | |||||
| address_lookup | |||||
| FROM receipts_payer | |||||
| WHERE ( | |||||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||||
| ) | |||||
| ) | |||||
| SELECT YEAR(occur_date) AS year, SUM(amount) AS total | |||||
| FROM receipts | |||||
| INNER JOIN greg_lindberg USING (payer_id) | |||||
| GROUP BY year | |||||
| ``` | |||||
| ```{r plot-gl-donations-over-time, echo = FALSE} | |||||
| #| fig.width: 9 | |||||
| library(ggplot2) | |||||
| ggplot(gl_donations_over_time, aes(x = factor(year), y = total)) + | |||||
| geom_col(fill = "#517898") + | |||||
| labs( | |||||
| title = "Greg Lindberg's Donations", | |||||
| x = NULL, | |||||
| y = "Total Donations" | |||||
| ) + | |||||
| scale_y_continuous(labels = scales::dollar) + | |||||
| theme_minimal(14, "Helvetica") + | |||||
| theme( | |||||
| panel.grid.minor = element_blank(), | |||||
| panel.grid.major.x = element_blank(), | |||||
| plot.title.position="plot" | |||||
| ) | |||||
| ``` | |||||
| ```{r cleanup} | |||||
| #| include: false | |||||
| DBI::dbDisconnect(con, shutdown=TRUE) | |||||
| ``` | |||||