| @@ -0,0 +1,548 @@ | |||
| --- | |||
| title: "Candidate Listing Resolution" | |||
| author: Garrick Aden-Buie | |||
| date: 2024-01-05 | |||
| format: | |||
|   html: | |||
|     embed-resources: true | |||
|     mermaid: | |||
|       theme: neutral | |||
| editor: | |||
|   render-on-save: true | |||
| --- | |||
| ```{=html} | |||
| <style> | |||
| .knitsql-table { | |||
| margin-bottom: 1rem; | |||
| } | |||
| </style> | |||
| ``` | |||
| ## Overview | |||
| ```{r setup} | |||
| #| include: false | |||
| library(tidyverse) | |||
| library(fs) | |||
| pkgload::load_all(here::here("process")) | |||
| ``` | |||
| The complete dataset involves a number of tables from four data sources: | |||
| 1. The Campaign Finance Reports from the NC State Board of Elections (SBOE); | |||
| 2. The Candidate Listing from the NC SBOE; | |||
| 3. The Voter Registration from the NC SBOE; and | |||
| 4. Resolved addresses from the U.S. Census Bureau. | |||
| The diagram below outlines the tables in the uploaded dataset | |||
| and their general relationship to each other. | |||
| ```{mermaid} | |||
| flowchart LR | |||
| subgraph "Campaign Finance reports" | |||
| reports | |||
| cover | |||
| officers | |||
| receipts | |||
| expenses | |||
| committees | |||
| receipts_payer | |||
| expenses_payee | |||
| end | |||
| subgraph "Candidate Listing" | |||
| cl_candidates | |||
| cl_elections | |||
| cl_name_on_ballot | |||
| cl_party | |||
| cl_contact | |||
| end | |||
| committees --> reports | |||
| reports --> cover | |||
| reports --> officers | |||
| reports --> receipts | |||
| reports --> expenses | |||
| receipts --> receipts_payer | |||
| expenses --> expenses_payee | |||
| cl_candidates --> cl_elections | |||
| cl_elections --> cl_name_on_ballot | |||
| cl_elections --> cl_party | |||
| cl_elections --> cl_contact | |||
| committee_candidate <--> committees | |||
| committee_candidate <--> cl_candidates | |||
| ``` | |||
| ### Campaign finance reports | |||
| The primary goal of this project was to collect and organize the campaign finance reports. | |||
| The core tables of this portion of the dataset are: | |||
| 1. `reports`: This table provides a master list of reports filed with the SBOE. | |||
| 2. `committees`: This table extracts the most recent committee information | |||
| from the filed reports. If you're interested in a particular committee, this | |||
| is likely the place you'll want to start. | |||
| 3. `receipts`, `expenses`: These tables provide lists of contributions received | |||
| and expenses paid by the committee. | |||
| 4. `receipts_payer`, `expenses_payee`: These tables provide the payer/payee | |||
| information for the receipts and expenses, extracted from the `receipts` and | |||
| `expenses` tables. I haven't de-duplicated the records in these tables (yet). | |||
| 5. `cover`: Each report has a cover "page" where key information about the | |||
| committee or the period being reported is provided. | |||
| 6. `officers`: Each committee has a list of officers. This table provides a | |||
| master list of officers for all committees. | |||
| All of the above tables have both `sboe_id` and `report_id` columns. | |||
| `sboe_id` uniquely identifies a committee by its SBOE-assigned ID, | |||
| and `report_id` uniquely identifies an individual campaign finance report. | |||
| In the `committees` table, | |||
| `report_id` refers to the latest report | |||
| from which the committee contact information was extracted. | |||
| The `sboe_id` is always the same for a given committee | |||
| and is the best way to identify a committee. | |||
| The `report_id` refers to a specific filing of a report, | |||
| and I've included only the most recently filed reports in this database. | |||
| Note that amended filings receive a new report ID, | |||
| so the `report_id` may change in the future | |||
| if or when the committee files an amendment. | |||
| ### Candidate listing | |||
| To augment the campaign finance data, | |||
| I've included several tables extracted from each year's candidate listing. | |||
| The primary table of interest is `cl_candidates`. | |||
| It contains individual candidates from the candidate listing | |||
| and their contact information and party affiliation. | |||
| If you're looking for a specific candidate, | |||
| but don't know their SBOE ID or their committee's name, | |||
| this is the place to start. | |||
| Candidates from the candidate listing are linked to a specific SBOE committee | |||
| via the `committee_candidate` table, | |||
| which matches a `candidate_id` with an `sboe_id`. | |||
| Note that the `candidate_id` is an ID I've created | |||
| to help organize the candidate listing -- | |||
| although I'm sure the SBOE has its own internal ID for each candidate, | |||
| they don't include them in the data they publish. | |||
| This means that the `candidate_id` may change | |||
| when the candidate listing is updated. | |||
| ### Additional data sources | |||
| Many tables have an `address_lookup` column. | |||
| When this column is found in a table, | |||
| it serves as the key for matching the address of that row | |||
| with the resolved addresses in the `addresses` table. | |||
| These addresses have been passed through the geocoding services | |||
| provided by the U.S. Census Bureau, | |||
| so the `addresses` table is also a useful way | |||
| to get the latitude and longitude of an address for mapping purposes. | |||
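| As a rough sketch of that lookup, the query below joins `receipts_payer` | |||
| (one of the tables with an `address_lookup` column) to `addresses`. | |||
| It assumes only that both tables share the `address_lookup` key; | |||
| the names of the latitude and longitude columns aren't listed here, | |||
| so the sketch selects every column from `addresses` rather than guessing at them. | |||
| ```sql | |||
| -- Sketch: attach resolved addresses to payer records. | |||
| -- Assumes `addresses` is keyed by `address_lookup`; its other | |||
| -- column names are not documented here, so select them all. | |||
| SELECT | |||
| receipts_payer.payer_id, | |||
| receipts_payer.org_name, | |||
| addresses.* | |||
| FROM receipts_payer | |||
| LEFT JOIN addresses USING (address_lookup) | |||
| LIMIT 5 | |||
| ``` | |||
| A `LEFT JOIN` keeps payer records whose address could not be resolved; | |||
| switch to an `INNER JOIN` if you only want rows with a geocoded address. | |||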
| Additionally, I've included the voter registration data. | |||
| It's not currently linked to any other tables, | |||
| but it could be useful in observing trends in voter registration across counties, | |||
| or for exploring demographic trends in the voting population. | |||
| ```{r load-database} | |||
| #| echo: false | |||
| reports <- out_open_dataset_db(here::here("process/data-out/reports")) | |||
| officers <- out_open_dataset_db(here::here("process/data-out/officers")) | |||
| committees <- out_open_dataset_db(here::here("process/data-out/committees")) | |||
| cl_candidates <- out_open_dataset_db(here::here("process/data-out/cl_candidates")) | |||
| cl_elections <- out_open_dataset_db(here::here("process/data-out/cl_elections")) | |||
| cl_party <- out_open_dataset_db(here::here("process/data-out/cl_party")) | |||
| cl_name_on_ballot <- out_open_dataset_db(here::here("process/data-out/cl_name_on_ballot")) | |||
| cl_contact <- out_open_dataset_db(here::here("process/data-out/cl_contact")) | |||
| committee_candidate <- out_open_dataset_db(here::here("process/data-out/committee_candidate")) | |||
| receipts <- out_open_dataset_db(here::here("process/data-out/receipts")) | |||
| receipts_payer <- out_open_dataset_db(here::here("process/data-out/receipts_payer")) | |||
| con <- duckdb_global_con() | |||
| ``` | |||
| ## Finding a candidate | |||
| You can often start by searching for a candidate in the `cl_candidates` table. | |||
| Here's an example looking for **John Bell**. | |||
| ```{sql john-bell, connection=con} | |||
| SELECT * | |||
| FROM cl_candidates | |||
| WHERE ( | |||
| first_name = 'JOHN' AND | |||
| middle_name = 'RICHARD' AND | |||
| last_name = 'BELL' | |||
| ) | |||
| ``` | |||
| Then, join the result with the `committee_candidate` table | |||
| to find the candidate's `sboe_id`, | |||
| if a match has been identified. | |||
| The candidate-committee linking uses probabilistic matching, | |||
| which allows for some flexibility in the matching process. | |||
| Note that not every candidate is linked to a committee | |||
| and there are many more committees than candidates. | |||
| ```{sql john-bell-committee-mapping, connection=con} | |||
| WITH john_bell AS ( | |||
| SELECT * | |||
| FROM cl_candidates | |||
| WHERE ( | |||
| first_name = 'JOHN' AND | |||
| middle_name = 'RICHARD' AND | |||
| last_name = 'BELL' | |||
| ) | |||
| ) | |||
| SELECT * | |||
| FROM committee_candidate | |||
| WHERE candidate_id IN ( | |||
| SELECT candidate_id | |||
| FROM john_bell | |||
| ) | |||
| ``` | |||
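| To get a feel for how many candidates have no linked committee, | |||
| a query along these lines could compare the two counts. | |||
| This is a minimal sketch that relies only on the `candidate_id` column | |||
| shared by `cl_candidates` and `committee_candidate`. | |||
| ```sql | |||
| -- Sketch: how many candidates have at least one committee match? | |||
| -- COUNT(DISTINCT ...) skips NULLs, so the second count only | |||
| -- includes candidates that appear in committee_candidate. | |||
| SELECT | |||
| COUNT(DISTINCT cl_candidates.candidate_id) AS candidates, | |||
| COUNT(DISTINCT committee_candidate.candidate_id) AS candidates_with_committee | |||
| FROM cl_candidates | |||
| LEFT JOIN committee_candidate USING (candidate_id) | |||
| ``` | |||
| The difference between the two columns is the number of candidates without a match. | |||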
| The candidate listing also gives you | |||
| a complete record of the candidate's election history. | |||
| You might find some historically interesting information | |||
| by joining `cl_candidates` with | |||
| * `cl_elections` for specific election contests, | |||
| * `cl_name_on_ballot` for the candidate's name on the ballot, | |||
| * `cl_party` for the party affiliation of a candidate in an election, and | |||
| * `cl_contact` for contact information for a candidate. | |||
| Here's an example combining the above tables | |||
| to show **the last three elections** for John Bell. | |||
| ```{sql john-bell-elections, connection=con} | |||
| WITH john_bell AS ( | |||
| SELECT * | |||
| FROM cl_candidates | |||
| WHERE ( | |||
| first_name = 'JOHN' AND | |||
| middle_name = 'RICHARD' AND | |||
| last_name = 'BELL' | |||
| ) | |||
| ) | |||
| SELECT * | |||
| FROM cl_elections | |||
| LEFT JOIN cl_name_on_ballot USING (candidate_id, election_dt) | |||
| LEFT JOIN cl_party USING (candidate_id, election_dt) | |||
| LEFT JOIN cl_contact USING (candidate_id, election_dt) | |||
| WHERE candidate_id IN ( | |||
| SELECT candidate_id | |||
| FROM john_bell | |||
| ) | |||
| ORDER BY election_dt DESC | |||
| LIMIT 3 | |||
| ``` | |||
| ## Campaign Finance Reports | |||
| Within the campaign finance report data, | |||
| the best place to get started is with the `reports` or `committees` table. | |||
| The `reports` table provides a master list of reports filed with the SBOE. | |||
| These include only the most up-to-date reports, | |||
| taking into account amended filings, | |||
| so you do not need to worry about filtering out outdated reports. | |||
| Here are the **first 5 reports filed by John Bell's committee** | |||
| (`sboe_id='STA-8S285O-C-001'` taken from the previous query). | |||
| ```{sql john-bell-reports, connection=con} | |||
| SELECT * | |||
| FROM reports | |||
| WHERE sboe_id='STA-8S285O-C-001' | |||
| LIMIT 5 | |||
| ``` | |||
| Each report has a `cover` page, which is included in the `cover` table, | |||
| but I've extracted the most recent name and contact information for the committee | |||
| into the `committees` table. | |||
| Here, you're guaranteed to get a single row per committee, | |||
| for example with John Bell's committee: | |||
| ```{sql john-bell-committee, connection=con} | |||
| SELECT * | |||
| FROM committees | |||
| WHERE sboe_id='STA-8S285O-C-001' | |||
| ``` | |||
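| One way to check the one-row-per-committee guarantee yourself | |||
| is to look for any `sboe_id` that appears more than once. | |||
| The sketch below should return zero rows if the guarantee holds. | |||
| ```sql | |||
| -- Sketch: list any committee IDs that occur more than once. | |||
| -- An empty result means `committees` has one row per sboe_id. | |||
| SELECT sboe_id, COUNT(*) AS n_rows | |||
| FROM committees | |||
| GROUP BY sboe_id | |||
| HAVING COUNT(*) > 1 | |||
| ``` | |||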
| Next, we have the `receipts` and `expenses` tables. | |||
| Both are similarly structured, so I'll just demonstrate how to use `receipts`. | |||
| First, we'll filter the `reports` table to get the 2016 Q1 report record for John Bell's committee. | |||
| Then, we can `INNER JOIN` this table with the `receipts` table, | |||
| which returns all of the receipts | |||
| for the reports in the `filtered_reports` common table expression | |||
| (CTE, i.e. the temporary tables created using `WITH ___ AS (__query__)`). | |||
| ```{sql john-bell-receipts, connection=con} | |||
| WITH filtered_reports AS ( | |||
| SELECT * | |||
| FROM reports | |||
| WHERE sboe_id='STA-8S285O-C-001' | |||
| AND year=2016 | |||
| AND doc_name='First Quarter' | |||
| ) | |||
| SELECT * | |||
| FROM receipts | |||
| INNER JOIN filtered_reports USING (report_id) | |||
| LIMIT 5 | |||
| ``` | |||
| Once you have a list of receipts, | |||
| you can start to do some analysis on these, | |||
| like finding the total money received by contribution type. | |||
| This time I'm filtering to John Bell's 2020 Q3 report. | |||
| ```{sql john-bell-donations, connection=con} | |||
| WITH filtered_reports AS ( | |||
| SELECT * | |||
| FROM reports | |||
| WHERE sboe_id='STA-8S285O-C-001' | |||
| AND year=2020 | |||
| AND doc_name='Third Quarter' | |||
| ) | |||
| SELECT | |||
| receipt_type_desc, | |||
| receipt_type_code, | |||
| is_donation, | |||
| SUM(amount) as total | |||
| FROM receipts | |||
| INNER JOIN filtered_reports USING (report_id) | |||
| GROUP BY | |||
| receipt_type_desc, | |||
| receipt_type_code, | |||
| is_donation | |||
| ORDER BY total DESC | |||
| ``` | |||
| ## Receipts and Payer Information | |||
| As you may have noticed above, | |||
| each receipt is linked to a payer in the `receipts_payer` table | |||
| via the `payer_id` column. | |||
| At this time, I have not de-duplicated the `receipts_payer` table, | |||
| so there may be multiple records for a single payer. | |||
| Separating the `receipts` and `receipts_payer` tables | |||
| will allow us to de-duplicate payer information in the future. | |||
| Compared with the committee and candidate information, | |||
| the payer records are much noisier | |||
| and don't have any highly reliable fields | |||
| that we can use to de-duplicate the records. | |||
| The probabilistic matching I used for linking candidates and committees | |||
| would work, but it's a relatively large engineering lift. | |||
| The challenge is that there are about 750,000 unique payer records, | |||
| which certainly isn't _big data_, | |||
| but de-duplicating the records requires more than 280 billion comparisons | |||
| (to compare every record with every other record). | |||
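| Comparing every record with every other record means checking each unordered pair once, | |||
| which for `n` records is `n * (n - 1) / 2` comparisons. | |||
| As a quick sketch, the same arithmetic can be run against the table itself: | |||
| ```sql | |||
| -- Sketch: number of unordered pairs among the payer records. | |||
| -- n * (n - 1) is always even, so dividing by 2 loses nothing. | |||
| SELECT | |||
| COUNT(*) AS n_payers, | |||
| COUNT(*) * (COUNT(*) - 1) / 2 AS pairwise_comparisons | |||
| FROM receipts_payer | |||
| ``` | |||
| The count grows quadratically, which is why record-linkage tools usually restrict | |||
| comparisons to candidate pairs that already agree on some field. | |||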
| For now, I'll show you how to use the `receipts_payer` table to look for specific donors. | |||
| I recommend starting with `receipts_payer`, | |||
| finding all `payer_id` values that match your person(s) of interest, | |||
| and then working from those records back to the `receipts` table. | |||
| As an example, let's find the donors who have contributed the most to NC candidates since 2016. | |||
| The first CTE (`donations_total`) uses `receipts` to sum the total amount donated by each `payer_id`. | |||
| The second CTE (`top_donors`) keeps only the 20 highest-ranked donors. | |||
| Finally, we join `top_donors` with `receipts_payer` to get the payer information for each donor. | |||
| This gives us the "top 20" donors, | |||
| but it's clear that some of them are the same person with slightly different payer information. | |||
| ```{sql top-donors, connection=con} | |||
| WITH | |||
| donations_total AS ( | |||
| SELECT payer_id, SUM(amount) AS total | |||
| FROM receipts | |||
| WHERE payer_id IS NOT NULL | |||
| GROUP BY payer_id | |||
| ), | |||
| top_donors AS ( | |||
| SELECT payer_id, total | |||
| FROM ( | |||
| SELECT *, RANK() OVER (ORDER BY total DESC) AS donor_rank | |||
| FROM donations_total | |||
| ) | |||
| WHERE (donor_rank <= 20) | |||
| ) | |||
| SELECT | |||
| top_donors.*, | |||
| org_name, | |||
| profession, | |||
| employers_name, | |||
| address_lookup | |||
| FROM top_donors | |||
| LEFT JOIN receipts_payer | |||
| ON (top_donors.payer_id = receipts_payer.payer_id) | |||
| ORDER BY UPPER(org_name) | |||
| ``` | |||
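| Until the payer records are properly de-duplicated, | |||
| one crude workaround is to group donations by a normalized name instead of by `payer_id`. | |||
| The sketch below treats payers whose trimmed, upper-cased `org_name` matches exactly as one donor. | |||
| That is an assumption made for illustration, not the matching used elsewhere in this dataset: | |||
| it will merge different people who share a name and miss spelling variants. | |||
| ```sql | |||
| -- Sketch: rank donors by normalized name rather than payer_id. | |||
| -- Exact-match grouping on UPPER(TRIM(org_name)) is a rough stand-in | |||
| -- for real de-duplication. | |||
| WITH payer_names AS ( | |||
| SELECT payer_id, UPPER(TRIM(org_name)) AS donor_name | |||
| FROM receipts_payer | |||
| WHERE org_name IS NOT NULL | |||
| ) | |||
| SELECT | |||
| donor_name, | |||
| COUNT(DISTINCT payer_id) AS payer_records, | |||
| SUM(amount) AS total | |||
| FROM receipts | |||
| INNER JOIN payer_names USING (payer_id) | |||
| GROUP BY donor_name | |||
| ORDER BY total DESC | |||
| LIMIT 20 | |||
| ``` | |||
| The `payer_records` column shows how many separate payer rows were folded into each name. | |||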
| Let's pick out **Greg Lindberg** from the list above. | |||
| (I'm sure you're familiar with [Greg Lindberg](https://www.justice.gov/opa/pr/founder-and-chairman-multinational-investment-company-and-company-consultant-convicted) but I wasn't, and wow!) | |||
| ```{sql greg-lindberg, connection=con, sql.max.print = 15} | |||
| SELECT | |||
| payer_id, | |||
| org_name, | |||
| profession, | |||
| employers_name, | |||
| address_lookup | |||
| FROM receipts_payer | |||
| WHERE ( | |||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||
| ) | |||
| ``` | |||
| All of the above records are definitely the same person, | |||
| so, because I'm curious, | |||
| let's find out **how much money Greg Lindberg has donated**. | |||
| ```{sql greg-lindberg-total, connection=con, output.var="gl_total"} | |||
| WITH greg_lindberg AS ( | |||
| SELECT | |||
| payer_id, | |||
| org_name, | |||
| profession, | |||
| employers_name, | |||
| address_lookup | |||
| FROM receipts_payer | |||
| WHERE ( | |||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||
| ) | |||
| ) | |||
| SELECT SUM(amount) AS total | |||
| FROM receipts | |||
| INNER JOIN greg_lindberg USING (payer_id) | |||
| ``` | |||
| The result of that query tells us that Greg Lindberg has donated | |||
| **`r scales::dollar(gl_total$total, 1)`** | |||
| to NC candidates since 2016. | |||
| This number tracks with | |||
| [an article from The News & Observer](https://www.newsobserver.com/news/politics-government/article228779794.html) | |||
| that reports Greg Lindberg donated around $7.5 million from 2016 to 2018, | |||
| a figure that includes donations to federal PACs. | |||
| Now we can find out where Greg Lindberg has donated his money. | |||
| The first CTE picks out the payer records for Greg Lindberg | |||
| and the second finds his total donations to each committee. | |||
| Finally, we join the total donations results with the `committees` table | |||
| to get the committee names (and pick out the top 10). | |||
| ```{sql greg-lindberg-donations, connection=con, output.var="gl_donations"} | |||
| WITH | |||
| greg_lindberg AS ( | |||
| SELECT | |||
| payer_id, | |||
| org_name, | |||
| profession, | |||
| employers_name, | |||
| address_lookup | |||
| FROM receipts_payer | |||
| WHERE ( | |||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||
| ) | |||
| ), | |||
| greg_lindberg_donations AS ( | |||
| SELECT sboe_id, SUM(amount) AS total | |||
| FROM receipts | |||
| INNER JOIN greg_lindberg USING (payer_id) | |||
| GROUP BY sboe_id | |||
| ) | |||
| SELECT | |||
| greg_lindberg_donations.sboe_id, | |||
| committees.committee_name, | |||
| greg_lindberg_donations.total | |||
| FROM greg_lindberg_donations | |||
| LEFT JOIN committees USING (sboe_id) | |||
| ORDER BY total DESC | |||
| LIMIT 10 | |||
| ``` | |||
| ```{r table-gl-donations, echo = FALSE} | |||
| library(gt) | |||
| gt(gl_donations) |> | |||
| cols_label( | |||
| sboe_id = "SBOE ID", | |||
| committee_name = "Committee Name", | |||
| total = "Total Donations" | |||
| ) |> | |||
| fmt_currency(columns = total, decimals = 0) | |||
| ``` | |||
| And now, also because I'm curious, | |||
| here is a plot showing Lindberg's donations over time. | |||
| [Things went south for Lindberg](https://www.newsobserver.com/news/politics-government/article228776004.html) | |||
| early in 2019, | |||
| so it's unsurprising that his donations dropped off after 2018. | |||
| ```{sql greg-lindberg-donations-over-time, connection=con, output.var="gl_donations_over_time"} | |||
| WITH | |||
| greg_lindberg AS ( | |||
| SELECT | |||
| payer_id, | |||
| org_name, | |||
| profession, | |||
| employers_name, | |||
| address_lookup | |||
| FROM receipts_payer | |||
| WHERE ( | |||
| UPPER(org_name) LIKE 'GREG%LINDBERG%' | |||
| ) | |||
| ) | |||
| SELECT YEAR(occur_date) AS year, SUM(amount) AS total | |||
| FROM receipts | |||
| INNER JOIN greg_lindberg USING (payer_id) | |||
| GROUP BY year | |||
| ``` | |||
| ```{r plot-gl-donations-over-time, echo = FALSE} | |||
| #| fig.width: 9 | |||
| library(ggplot2) | |||
| ggplot(gl_donations_over_time, aes(x = factor(year), y = total)) + | |||
| geom_col(fill = "#517898") + | |||
| labs( | |||
| title = "Greg Lindberg's Donations", | |||
| x = NULL, | |||
| y = "Total Donations" | |||
| ) + | |||
| scale_y_continuous(labels = scales::dollar) + | |||
| theme_minimal(14, "Helvetica") + | |||
| theme( | |||
| panel.grid.minor = element_blank(), | |||
| panel.grid.major.x = element_blank(), | |||
| plot.title.position="plot" | |||
| ) | |||
| ``` | |||
| ```{r cleanup} | |||
| #| include: false | |||
| DBI::dbDisconnect(con, shutdown=TRUE) | |||
| ``` | |||