|
- ---
- title: "Candidate Listing Resolution"
- author: Garrick Aden-Buie
- date: 2024-01-05
-
- format:
-   html:
-     embed-resources: true
-     mermaid:
-       theme: neutral
-
- editor:
-   render-on-save: true
- ---
-
- ```{=html}
- <style>
- .knitsql-table {
- margin-bottom: 1rem;
- }
- </style>
- ```
-
- ## Overview
-
- ```{r setup}
- #| include: false
- library(tidyverse)
- library(fs)
- pkgload::load_all(here::here("process"))
- ```
-
- The complete dataset involves a number of tables from four data sources:
-
- 1. The Campaign Finance Reports from the NC State Board of Elections (SBOE);
- 2. The Candidate Listing from the NC SBOE;
- 3. The Voter Registration from the NC SBOE; and
- 4. Resolved addresses from the U.S. Census Bureau.
-
- The diagram below outlines the tables in the uploaded dataset
- and their general relationship to each other.
-
- ```{mermaid}
- flowchart LR
- subgraph "Campaign Finance reports"
- reports
- cover
- officers
- receipts
- expenses
- committees
- receipts_payer
- expenses_payee
- end
-
- subgraph "Candidate Listing"
- cl_candidates
- cl_elections
- cl_name_on_ballot
- cl_party
- cl_contact
- end
-
- committees --> reports
-
- reports --> cover
- reports --> officers
- reports --> receipts
- reports --> expenses
-
- receipts --> receipts_payer
- expenses --> expenses_payee
-
- cl_candidates --> cl_elections
- cl_elections --> cl_name_on_ballot
- cl_elections --> cl_party
- cl_elections --> cl_contact
-
- committee_candidate <--> committees
- committee_candidate <--> cl_candidates
- ```
-
- ### Campaign finance reports
-
- The primary goal of this project was to collect and organize the campaign finance reports.
- The core tables of this portion of the dataset are:
-
- 1. `reports`: This table provides a master list of reports filed with the SBOE.
- 2. `committees`: This table extracts the most recent committee information
- from the filed reports. If you're interested in a particular committee, this
- is likely the place you'll want to start.
- 3. `receipts`, `expenses`: These tables list the contributions received
- and the expenses paid by the committee.
- 4. `receipts_payer`, `expenses_payee`: These tables provide the payer/payee
- information for the receipts and expenses, extracted from the `receipts` and
- `expenses` tables. I haven't de-duplicated the records in these tables (yet).
- 5. `cover`: Each report has a cover "page" where key information about the
- committee or the period being reported is provided.
- 6. `officers`: Each committee has a list of officers. This table provides a
- master list of officers for all committees.
-
- All of the above tables have both `sboe_id` and `report_id` columns.
- `sboe_id` uniquely identifies a committee by its SBOE-assigned ID,
- and `report_id` uniquely identifies an individual campaign finance report.
- In the `committees` table,
- `report_id` refers to the latest report
- from which the committee contact information was extracted.
-
- The `sboe_id` is always the same for a given committee
- and is the best way to identify a committee.
- The `report_id` refers to a specific filing of a report,
- and I've included only the most recently filed reports in this database.
- Note that amended filings receive a new report ID,
- so the `report_id` may change in the future
- if or when the committee files an amendment.
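-
- As a minimal sketch of how the two keys fit together,
- the query below joins each committee to the report named by its `report_id`.
- It uses only columns that appear elsewhere in this document
- (`committee_name`, `year` and `doc_name`),
- and the chunk is marked `eval=FALSE` because it's an illustration rather than a tested result.
-
- ```{sql committee-latest-report, connection=con, eval=FALSE}
- -- For each committee, show the report its contact information was taken from
- SELECT
-   committees.sboe_id,
-   committees.committee_name,
-   reports.report_id,
-   reports.year,
-   reports.doc_name
- FROM committees
- LEFT JOIN reports
-   ON (committees.report_id = reports.report_id)
- LIMIT 5
- ```
-
- The join is on `report_id` rather than `sboe_id`
- because a committee has many reports but only one "latest" report.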
-
- ### Candidate listing
-
- To augment the campaign finance data,
- I've included several tables extracted from each year's candidate listing.
- The primary table of interest is `cl_candidates`.
- It contains individual candidates from the candidate listing
- and their contact information and party affiliation.
- If you're looking for a specific candidate,
- but don't know their SBOE ID or their committee's name,
- this is the place to start.
-
- Candidates from the candidate listing are linked to a specific SBOE committee
- via the `committee_candidate` table,
- which matches a `candidate_id` with an `sboe_id`.
- Note that the `candidate_id` is an ID I've created
- to help organize the candidate listing --
- although I'm sure the SBOE has its own internal ID for each candidate,
- they don't include them in the data they publish.
- This means that the `candidate_id` may change
- when the candidate listing is updated.
-
- ### Additional data sources
-
- Many tables have an `address_lookup` column.
- When this column is found in a table,
- it serves as the key for matching the address of that row
- with the resolved addresses in the `addresses` table.
- These addresses have been passed through the geocoding services
- provided by the U.S. Census Bureau,
- so the `addresses` table is also a useful way
- to get the latitude and longitude of an address for mapping purposes.
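-
- Here's a hedged sketch of that lookup using `receipts_payer`,
- one of the tables with an `address_lookup` column.
- It assumes the `addresses` table is available on the same connection as the other tables,
- and it selects `addresses.*` rather than guessing at the names of the latitude and longitude columns.
- The chunk is marked `eval=FALSE` for that reason.
-
- ```{sql payer-addresses, connection=con, eval=FALSE}
- -- Attach the resolved (geocoded) address to a few payer records
- SELECT
-   receipts_payer.payer_id,
-   receipts_payer.org_name,
-   addresses.*
- FROM receipts_payer
- LEFT JOIN addresses
-   ON (receipts_payer.address_lookup = addresses.address_lookup)
- LIMIT 5
- ```
-
- A `LEFT JOIN` keeps payer records whose address could not be resolved;
- their `addresses` columns will simply be `NULL`.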
-
- Additionally, I've included the voter registration data.
- It's not currently linked to any other tables,
- but it could be useful in observing trends in voter registration across counties,
- or for exploring demographic trends in the voting population.
-
-
- ```{r load-database}
- #| echo: false
- reports <- out_open_dataset_db(here::here("process/data-out/reports"))
- officers <- out_open_dataset_db(here::here("process/data-out/officers"))
- committees <- out_open_dataset_db(here::here("process/data-out/committees"))
- cl_candidates <- out_open_dataset_db(here::here("process/data-out/cl_candidates"))
- cl_elections <- out_open_dataset_db(here::here("process/data-out/cl_elections"))
- cl_party <- out_open_dataset_db(here::here("process/data-out/cl_party"))
- cl_name_on_ballot <- out_open_dataset_db(here::here("process/data-out/cl_name_on_ballot"))
- cl_contact <- out_open_dataset_db(here::here("process/data-out/cl_contact"))
- committee_candidate <- out_open_dataset_db(here::here("process/data-out/committee_candidate"))
-
- receipts <- out_open_dataset_db(here::here("process/data-out/receipts"))
- receipts_payer <- out_open_dataset_db(here::here("process/data-out/receipts_payer"))
-
- con <- duckdb_global_con()
- ```
-
- ## Finding a candidate
-
- You can often start by searching for a candidate in the `cl_candidates` table.
- Here's an example looking for **John Bell**.
-
- ```{sql john-bell, connection=con}
- SELECT *
- FROM cl_candidates
- WHERE (
- first_name = 'JOHN' AND
- middle_name = 'RICHARD' AND
- last_name = 'BELL'
- )
- ```
-
- Then, join the result with the `committee_candidate` table
- to find the candidate's `sboe_id`,
- if a match has been identified.
- The candidate-committee linking uses probabilistic matching,
- which allows for some flexibility in the matching process.
- Note that not every candidate is linked to a committee
- and there are many more committees than candidates.
-
- ```{sql john-bell-committee-mapping, connection=con}
- WITH john_bell AS (
- SELECT *
- FROM cl_candidates
- WHERE (
- first_name = 'JOHN' AND
- middle_name = 'RICHARD' AND
- last_name = 'BELL'
- )
- )
-
- SELECT *
- FROM committee_candidate
- WHERE candidate_id IN (
- SELECT candidate_id
- FROM john_bell
- )
- ```
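-
- To go one step further and see the committee's name,
- the same lookup can be written as a chain of joins.
- This is a sketch using only columns shown elsewhere in this document
- (`candidate_id`, `sboe_id` and `committee_name`);
- it's marked `eval=FALSE` because I haven't included its output here.
-
- ```{sql john-bell-committee-name, connection=con, eval=FALSE}
- -- cl_candidates -> committee_candidate -> committees
- SELECT
-   cl_candidates.candidate_id,
-   committee_candidate.sboe_id,
-   committees.committee_name
- FROM cl_candidates
- INNER JOIN committee_candidate
-   ON (cl_candidates.candidate_id = committee_candidate.candidate_id)
- LEFT JOIN committees
-   ON (committee_candidate.sboe_id = committees.sboe_id)
- WHERE (
-   cl_candidates.first_name = 'JOHN' AND
-   cl_candidates.middle_name = 'RICHARD' AND
-   cl_candidates.last_name = 'BELL'
- )
- ```
-
- The `INNER JOIN` drops candidates that have no linked committee;
- switch it to a `LEFT JOIN` if you'd rather see those candidates with a `NULL` `sboe_id`.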
-
- The candidate listing also gives you
- the candidate's complete election history.
- You might find some historically interesting information
- by joining `cl_candidates` with
-
- * `cl_elections` for specific election contests,
- * `cl_name_on_ballot` for the candidate's name on the ballot,
- * `cl_party` for the party affiliation of a candidate in an election, and
- * `cl_contact` for contact information for a candidate.
-
- Here's an example combining the above tables
- to show **the last three elections** for John Bell.
-
- ```{sql john-bell-elections, connection=con}
- WITH john_bell AS (
- SELECT *
- FROM cl_candidates
- WHERE (
- first_name = 'JOHN' AND
- middle_name = 'RICHARD' AND
- last_name = 'BELL'
- )
- )
-
- SELECT *
- FROM cl_elections
- LEFT JOIN cl_name_on_ballot USING (candidate_id, election_dt)
- LEFT JOIN cl_party USING (candidate_id, election_dt)
- LEFT JOIN cl_contact USING (candidate_id, election_dt)
- WHERE candidate_id IN (
- SELECT candidate_id
- FROM john_bell
- )
- ORDER BY election_dt DESC
- LIMIT 3
- ```
-
- ## Campaign Finance Reports
-
- Within the campaign finance report data,
- the best place to get started is with the `reports` or `committees` table.
-
- The `reports` table provides a master list of reports filed with the SBOE.
- These include only the most up-to-date reports,
- taking into account amended filings,
- so you do not need to worry about filtering out outdated reports.
-
- Here are the **first 5 reports filed by John Bell's committee**
- (`sboe_id='STA-8S285O-C-001'` taken from the previous query).
-
- ```{sql john-bell-reports, connection=con}
- SELECT *
- FROM reports
- WHERE sboe_id='STA-8S285O-C-001'
- LIMIT 5
- ```
-
- Each report has a `cover` page, which is included in the `cover` table,
- but I've extracted the most recent name and contact information for the committee
- into the `committees` table.
- Here, you're guaranteed to get a single row per committee,
- for example with John Bell's committee:
-
- ```{sql john-bell-committee, connection=con}
- SELECT *
- FROM committees
- WHERE sboe_id='STA-8S285O-C-001'
- ```
-
- Next, we have the `receipts` and `expenses` tables.
- Both are similarly structured, so I'll just demonstrate how to use `receipts`.
- First, we'll filter the `reports` table to get the 2016 Q1 report record for John Bell's committee.
- Then, we can `INNER JOIN` this table with the `receipts` table,
- which returns all of the receipts
- for the reports in the `filtered_reports` common table expression
- (CTE, i.e. the temporary table created using `WITH ___ AS (__query__)`).
-
- ```{sql john-bell-receipts, connection=con}
- WITH filtered_reports AS (
- SELECT *
- FROM reports
- WHERE sboe_id='STA-8S285O-C-001'
- AND year=2016
- AND doc_name='First Quarter'
- )
-
- SELECT *
- FROM receipts
- INNER JOIN filtered_reports USING (report_id)
- LIMIT 5
- ```
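-
- Since `expenses` is structured similarly,
- the same pattern should carry over with only the table name changed.
- The sketch below assumes the `expenses` table is available on the connection
- and relies only on its `report_id` column;
- it's marked `eval=FALSE` because I haven't run it for this document.
-
- ```{sql john-bell-expenses, connection=con, eval=FALSE}
- WITH filtered_reports AS (
-   -- Keep only the key so the join doesn't duplicate report columns
-   SELECT report_id
-   FROM reports
-   WHERE sboe_id='STA-8S285O-C-001'
-     AND year=2016
-     AND doc_name='First Quarter'
- )
-
- SELECT expenses.*
- FROM expenses
- INNER JOIN filtered_reports USING (report_id)
- LIMIT 5
- ```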
-
- Once you have a list of receipts,
- you can start to do some analysis on these,
- like finding the total money received by contribution type.
- This time I'm filtering to John Bell's 2020 Q3 report.
-
- ```{sql john-bell-donations, connection=con}
- WITH filtered_reports AS (
- SELECT *
- FROM reports
- WHERE sboe_id='STA-8S285O-C-001'
- AND year=2020
- AND doc_name='Third Quarter'
- )
-
- SELECT
- receipt_type_desc,
- receipt_type_code,
- is_donation,
- SUM(amount) as total
- FROM receipts
- INNER JOIN filtered_reports USING (report_id)
- GROUP BY
- receipt_type_desc,
- receipt_type_code,
- is_donation
- ORDER BY total DESC
- ```
-
- ## Receipts and Payer Information
-
- As you may have noticed above,
- each receipt is linked to a payer in the `receipts_payer` table
- via the `payer_id` column.
-
- At this time, I have not de-duplicated the `receipts_payer` table,
- so there may be multiple records for a single payer.
- Separating the `receipts` and `receipts_payer` tables
- will allow us to de-duplicate payer information in the future.
-
- Compared with the committee and candidate information,
- the payer records are much noisier
- and don't have any highly reliable fields
- that we can use to de-duplicate the records.
- The probabilistic matching I used for linking candidates and committees
- would work, but it's a relatively large engineering lift.
- The challenge is that there are about 750,000 unique payer records,
- which certainly isn't _big data_,
- but deduplicating the records requires roughly 281 billion comparisons
- (750,000 × 749,999 / 2, to compare every record with every other record).
-
- For now, I'll show you how to use the `receipts_payer` table to look for specific donors.
- I recommend starting with `receipts_payer`,
- finding all `payer_id` values that match your person(s) of interest,
- and then working from those records back to the `receipts` table.
-
- As an example, let's find the donors who have contributed the most to NC candidates since 2016.
- The first CTE (`donations_total`) uses `receipts` to sum the total amount donated by each `payer_id`.
- The second CTE (`top_donors`) keeps only the top 20 donors.
- Finally, we join `top_donors` with `receipts_payer` to get the payer information for each donor.
-
- This gives us the "top 20" donors,
- but it's clear that some of them are the same person with slightly different payer information.
-
- ```{sql top-donors, connection=con}
- WITH
- donations_total AS (
- SELECT payer_id, SUM(amount) AS total
- FROM receipts
- WHERE payer_id IS NOT NULL
- GROUP BY payer_id
- ),
- top_donors AS (
- SELECT payer_id, total
- FROM (
- SELECT *, RANK() OVER (ORDER BY total DESC) AS donor_rank
- FROM donations_total
- )
- WHERE (donor_rank <= 20)
- )
-
- SELECT
- top_donors.*,
- org_name,
- profession,
- employers_name,
- address_lookup
- FROM top_donors
- LEFT JOIN receipts_payer
- ON (top_donors.payer_id = receipts_payer.payer_id)
- ORDER BY UPPER(org_name)
- ```
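-
- One crude way to see past some of the duplicate payer records
- is to total donations by a normalized name instead of by `payer_id`.
- This is only a rough sketch, not the probabilistic matching mentioned earlier:
- exact name matching will merge different people who share a name
- and will miss spelling variants.
- It also assumes each `payer_id` appears only once in `receipts_payer`,
- so the chunk is marked `eval=FALSE`.
-
- ```{sql top-donors-by-name, connection=con, eval=FALSE}
- WITH donations_by_name AS (
-   SELECT
-     -- Normalize case and surrounding whitespace before grouping
-     UPPER(TRIM(receipts_payer.org_name)) AS payer_name,
-     COUNT(DISTINCT receipts_payer.payer_id) AS n_payer_records,
-     SUM(receipts.amount) AS total
-   FROM receipts
-   INNER JOIN receipts_payer
-     ON (receipts.payer_id = receipts_payer.payer_id)
-   WHERE receipts_payer.org_name IS NOT NULL
-   GROUP BY UPPER(TRIM(receipts_payer.org_name))
- )
-
- SELECT *
- FROM donations_by_name
- ORDER BY total DESC
- LIMIT 20
- ```
-
- The `n_payer_records` column shows how many distinct payer records were rolled up under each name,
- which is a quick hint at where duplicates are hiding.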
-
- Let's pick out **Greg Lindberg** from the list above.
- (I'm sure you're familiar with [Greg Lindberg](https://www.justice.gov/opa/pr/founder-and-chairman-multinational-investment-company-and-company-consultant-convicted) but I wasn't, and wow!)
-
-
- ```{sql greg-lindberg, connection=con, sql.max.print = 15}
- SELECT
- payer_id,
- org_name,
- profession,
- employers_name,
- address_lookup
- FROM receipts_payer
- WHERE (
- UPPER(org_name) LIKE 'GREG%LINDBERG%'
- )
- ```
-
- All of the above records are definitely the same person,
- so, because I'm curious,
- let's find out **how much money Greg Lindberg has donated**.
-
- ```{sql greg-lindberg-total, connection=con, output.var="gl_total"}
- WITH greg_lindberg AS (
- SELECT
- payer_id,
- org_name,
- profession,
- employers_name,
- address_lookup
- FROM receipts_payer
- WHERE (
- UPPER(org_name) LIKE 'GREG%LINDBERG%'
- )
- )
-
- SELECT SUM(amount) AS total
- FROM receipts
- INNER JOIN greg_lindberg USING (payer_id)
- ```
-
- The result of that query tells us that Greg Lindberg has donated
- **`r scales::dollar(gl_total$total, 1)`**
- to NC candidates since 2016.
- This number tracks with
- [an article from The News & Observer](https://www.newsobserver.com/news/politics-government/article228779794.html)
- that reports Greg Lindberg donated around $7.5 million from 2016 to 2018,
- a figure that includes donations to federal PACs.
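-
- To compare against the article's 2016 to 2018 window more directly,
- the total can be restricted by the year of each receipt.
- This sketch reuses the `occur_date` column from `receipts`
- and is marked `eval=FALSE`, so I'm not reporting a number for it here.
- Because the article's figure includes donations to federal PACs,
- I wouldn't expect the two totals to match exactly.
-
- ```{sql greg-lindberg-2016-2018, connection=con, eval=FALSE}
- WITH greg_lindberg AS (
-   SELECT payer_id
-   FROM receipts_payer
-   WHERE UPPER(org_name) LIKE 'GREG%LINDBERG%'
- )
-
- SELECT SUM(amount) AS total
- FROM receipts
- INNER JOIN greg_lindberg USING (payer_id)
- -- Keep only receipts dated 2016 through 2018, inclusive
- WHERE YEAR(occur_date) BETWEEN 2016 AND 2018
- ```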
-
- Now we can find out where Greg Lindberg has donated his money.
- The first CTE picks out the payer records for Greg Lindberg
- and the second finds his total donations to each committee.
- Finally, we join the total donations results with the `committees` table
- to get the committee names (and pick out the top 10).
-
- ```{sql greg-lindberg-donations, connection=con, output.var="gl_donations"}
- WITH
- greg_lindberg AS (
- SELECT
- payer_id,
- org_name,
- profession,
- employers_name,
- address_lookup
- FROM receipts_payer
- WHERE (
- UPPER(org_name) LIKE 'GREG%LINDBERG%'
- )
- ),
- greg_lindberg_donations AS (
- SELECT sboe_id, SUM(amount) AS total
- FROM receipts
- INNER JOIN greg_lindberg USING (payer_id)
- GROUP BY sboe_id
- )
-
- SELECT
- greg_lindberg_donations.sboe_id,
- committees.committee_name,
- greg_lindberg_donations.total
- FROM greg_lindberg_donations
- LEFT JOIN committees USING (sboe_id)
- ORDER BY total DESC
- LIMIT 10
- ```
-
- ```{r table-gl-donations, echo = FALSE}
- library(gt)
-
- gt(gl_donations) |>
- cols_label(
- sboe_id = "SBOE ID",
- committee_name = "Committee Name",
- total = "Total Donations"
- ) |>
- fmt_currency(columns = total, decimals = 0)
- ```
-
- And now, also because I'm curious,
- here is a plot showing Lindberg's donations over time.
- [Things went south for Lindberg](https://www.newsobserver.com/news/politics-government/article228776004.html)
- early in 2019,
- so it's unsurprising that his donations have dropped off after 2018.
-
- ```{sql greg-lindberg-donations-over-time, connection=con, output.var="gl_donations_over_time"}
- WITH
- greg_lindberg AS (
- SELECT
- payer_id,
- org_name,
- profession,
- employers_name,
- address_lookup
- FROM receipts_payer
- WHERE (
- UPPER(org_name) LIKE 'GREG%LINDBERG%'
- )
- )
-
- SELECT YEAR(occur_date) AS year, SUM(amount) AS total
- FROM receipts
- INNER JOIN greg_lindberg USING (payer_id)
- GROUP BY year
- ```
-
- ```{r plot-gl-donations-over-time, echo = FALSE}
- #| fig.width: 9
- library(ggplot2)
-
- ggplot(gl_donations_over_time, aes(x = factor(year), y = total)) +
- geom_col(fill = "#517898") +
- labs(
- title = "Greg Lindberg's Donations",
- x = NULL,
- y = "Total Donations"
- ) +
- scale_y_continuous(labels = scales::dollar) +
- theme_minimal(14, "Helvetica") +
- theme(
- panel.grid.minor = element_blank(),
- panel.grid.major.x = element_blank(),
- plot.title.position="plot"
- )
- ```
-
-
- ```{r cleanup}
- #| include: false
- DBI::dbDisconnect(con, shutdown=TRUE)
- ```
|