2 years ago · d22124428c
--- a/reports/2023-12-17_about-tables/about-tables.html
+++ b/reports/2023-12-17_about-tables/about-tables.html
--- a/reports/2023-12-17_about-tables/about-tables.qmd
+++ b/reports/2023-12-17_about-tables/about-tables.qmd
@@ -0,0 +1,548 @@
 ---
 title: "Candidate Listing Resolution"
 author: Garrick Aden-Buie
 date: 2024-01-05

 format:
  html:
    embed-resources: true
    mermaid:
      theme: neutral

 editor:
  render-on-save: true
 ---

 ```{=html}
 <style>
 .knitsql-table {
  margin-bottom: 1rem;
 }
 </style>
 ```

 ## Overview

 ```{r setup}
 #| include: false
 library(tidyverse)
 library(fs)
 pkgload::load_all(here::here("process"))
 ```

 The complete dataset involves a number of tables from four data sources:

 1. The Campaign Finance Reports from the NC State Board of Elections (SBOE);
 2. The Candidate Listing from the NC SBOE;
 3. The Voter Registration from the NC SBOE; and
 4. Resolved addresses from the U.S. Census Bureau.

 The diagram below outlines the tables in the uploaded dataset
 and their general relationship to each other.

 ```{mermaid}
 flowchart LR
  subgraph "Campaign Finance reports"
      reports
      cover
      officers
      receipts
      expenses
      committees
      receipts_payer
      expenses_payee
  end

  subgraph "Candidate Listing"
    cl_candidates
    cl_elections
    cl_name_on_ballot
    cl_party
    cl_contact
  end

  committees --> reports

  reports --> cover
  reports --> officers
  reports --> receipts
  reports --> expenses

  receipts --> receipts_payer
  expenses --> expenses_payee

  cl_candidates --> cl_elections
  cl_elections --> cl_name_on_ballot
  cl_elections --> cl_party
  cl_elections --> cl_contact

  committee_candidate <--> committees
  committee_candidate <--> cl_candidates
 ```

 ### Campaign finance reports

 The primary goal of this project was to collect and organize the campaign finance reports.
 The core tables of this portion of the dataset are:

 1. `reports`: This table provides a master list of reports filed with the SBOE.
 2. `committees`: This table extracts the most recent committee information
   from the filed reports. If you're interested in a particular committee, this
   is likely the place you'll want to start.
 3. `receipts`, `expenses`: These tables list the contributions received
   and the expenses paid by the committee.
 4. `receipts_payer`, `expenses_payee`: These tables provide the payer/payee
   information for the receipts and expenses, extracted from the `receipts` and
   `expenses` tables. I haven't de-duplicated the records in these tables (yet).
 5. `cover`: Each report has a cover "page" where key information about the
   committee or the period being reported is provided.
 6. `officers`: Each committee has a list of officers. This table provides a
   master list of officers for all committees.

 All of the above tables have both `sboe_id` and `report_id` columns.
 `sboe_id` uniquely identifies a committee by its SBOE-assigned ID,
 and `report_id` uniquely identifies an individual campaign finance report.
 In the `committees` table,
 `report_id` refers to the latest report
 from which the committee contact information was extracted.

 The `sboe_id` is always the same for a given committee
 and is the best way to identify a committee.
 The `report_id` refers to a specific filing of a report,
 and I've included only the most recently filed reports in this database.
 Note that because amended filings receive a new report ID,
 the `report_id` may change in the future
 if or when the committee files an amendment.
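
To make the role of the two IDs concrete, here is a minimal sketch using Python's built-in `sqlite3` module and toy tables. The rows and the `STA-EXAMPLE-...` IDs are invented for illustration, and the real dataset is queried through DuckDB rather than SQLite. It looks up, for each committee, the report that supplied its contact information.

```python
import sqlite3

# Toy stand-ins for the real tables; every row below is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE reports (report_id INTEGER, sboe_id TEXT, year INTEGER, doc_name TEXT);
CREATE TABLE committees (sboe_id TEXT, report_id INTEGER, committee_name TEXT);
INSERT INTO reports VALUES
  (101, 'STA-EXAMPLE-C-001', 2019, 'Third Quarter'),
  (102, 'STA-EXAMPLE-C-001', 2020, 'First Quarter'),
  (201, 'STA-EXAMPLE-C-002', 2020, 'First Quarter');
INSERT INTO committees VALUES
  ('STA-EXAMPLE-C-001', 102, 'EXAMPLE COMMITTEE ONE'),
  ('STA-EXAMPLE-C-002', 201, 'EXAMPLE COMMITTEE TWO');
""")


def contact_source_reports(con):
    """For each committee, return the report its contact info was taken from."""
    return con.execute("""
        SELECT c.sboe_id, c.committee_name, r.report_id, r.year, r.doc_name
        FROM committees AS c
        INNER JOIN reports AS r ON c.report_id = r.report_id
        ORDER BY c.sboe_id
    """).fetchall()


for row in contact_source_reports(con):
    print(row)
```

The `sboe_id` stays fixed across all of a committee's reports, while the `report_id` stored in `committees` points at just one of them.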

 ### Candidate listing

 To augment the campaign finance data,
 I've included several tables extracted from each year's candidate listing.
 The primary table of interest is `cl_candidates`.
 It contains individual candidates from the candidate listing
 and their contact information and party affiliation.
 If you're looking for a specific candidate,
 but don't know their SBOE ID or their committee's name,
 this is the place to start.

 Candidates from the candidate listing are linked to a specific SBOE committee
 via the `committee_candidate` table,
 which matches a `candidate_id` with an `sboe_id`.
 Note that the `candidate_id` is an ID I've created
 to help organize the candidate listing --
 although I'm sure the SBOE has its own internal ID for each candidate,
 they don't include them in the data they publish.
 This means that the `candidate_id` may change
 when the candidate listing is updated.

 ### Additional data sources

 Many tables have an `address_lookup` column.
 When this column is found in a table,
 it serves as the key for matching the address of that row
 with the resolved addresses in the `addresses` table.
 These addresses have been passed through the geocoding services
 provided by the U.S. Census Bureau,
 so the `addresses` table is also a useful way
 to get the latitude and longitude of an address for mapping purposes.
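
As a rough sketch of how that lookup might work, the example below joins a toy `receipts_payer` table to a toy `addresses` table on `address_lookup`, again with Python's `sqlite3` in place of DuckDB. The `latitude` and `longitude` column names, the format of the lookup key, and all rows are assumptions made for illustration. Check the actual `addresses` schema before adapting it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE receipts_payer (payer_id INTEGER, org_name TEXT, address_lookup TEXT);
-- Assumed column names: the real `addresses` table may name these differently.
CREATE TABLE addresses (address_lookup TEXT, latitude REAL, longitude REAL);
INSERT INTO receipts_payer VALUES
  (1, 'EXAMPLE DONOR A', '123 MAIN ST, RALEIGH, NC'),
  (2, 'EXAMPLE DONOR B', 'PO BOX 1, NOWHERE, NC');
INSERT INTO addresses VALUES
  ('123 MAIN ST, RALEIGH, NC', 35.78, -78.64);
""")


def payer_coordinates(con):
    """Attach coordinates to each payer; unresolved addresses come back as NULL."""
    return con.execute("""
        SELECT payer_id, org_name, latitude, longitude
        FROM receipts_payer
        LEFT JOIN addresses USING (address_lookup)
        ORDER BY payer_id
    """).fetchall()


for row in payer_coordinates(con):
    print(row)
```

A `LEFT JOIN` keeps rows whose address was not resolved, so a missing match shows up as empty coordinates rather than a dropped row.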

 Additionally, I've included the voter registration data.
 It's not currently linked to any other tables,
 but it could be useful in observing trends in voter registration across counties,
 or for exploring demographic trends in the voting population.


 ```{r load-database}
 #| echo: false
 reports <- out_open_dataset_db(here::here("process/data-out/reports"))
 officers <- out_open_dataset_db(here::here("process/data-out/officers"))
 committees <- out_open_dataset_db(here::here("process/data-out/committees"))
 cl_candidates <- out_open_dataset_db(here::here("process/data-out/cl_candidates"))
 cl_elections <- out_open_dataset_db(here::here("process/data-out/cl_elections"))
 cl_party <- out_open_dataset_db(here::here("process/data-out/cl_party"))
 cl_name_on_ballot <- out_open_dataset_db(here::here("process/data-out/cl_name_on_ballot"))
 cl_contact <- out_open_dataset_db(here::here("process/data-out/cl_contact"))
 committee_candidate <- out_open_dataset_db(here::here("process/data-out/committee_candidate"))

 receipts <- out_open_dataset_db(here::here("process/data-out/receipts"))
 receipts_payer <- out_open_dataset_db(here::here("process/data-out/receipts_payer"))

 con <- duckdb_global_con()
 ```

 ## Finding a candidate

 You can often start by searching for a candidate in the `cl_candidates` table.
 Here's an example looking for **John Bell**.

 ```{sql john-bell, connection=con}
 SELECT *
 FROM cl_candidates
 WHERE (
  first_name = 'JOHN' AND
  middle_name = 'RICHARD' AND
  last_name = 'BELL'
 )
 ```

 Then, join the result with the `committee_candidate` table
 to find the candidate's `sboe_id`,
 if a match has been identified.
 The candidate-committee linking uses probabilistic matching,
 which allows for some flexibility in the matching process.
 Note that not every candidate is linked to a committee
 and there are many more committees than candidates.

 ```{sql john-bell-committee-mapping, connection=con}
 WITH john_bell AS (
  SELECT *
  FROM cl_candidates
  WHERE (
    first_name = 'JOHN' AND
    middle_name = 'RICHARD' AND
    last_name = 'BELL'
  )
 )

 SELECT *
 FROM committee_candidate
 WHERE candidate_id IN (
  SELECT candidate_id
  FROM john_bell
 )
 ```
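
Because not every candidate is linked to a committee, it can help to see which candidates have no match at all. The sketch below uses Python's `sqlite3` with invented rows (the real data lives in DuckDB). It keeps every candidate and leaves `sboe_id` empty where `committee_candidate` has no entry.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cl_candidates (candidate_id INTEGER, first_name TEXT, last_name TEXT);
CREATE TABLE committee_candidate (candidate_id INTEGER, sboe_id TEXT);
-- Invented rows: candidate 2 has no linked committee.
INSERT INTO cl_candidates VALUES (1, 'ALEX', 'EXAMPLE'), (2, 'SAM', 'SAMPLE');
INSERT INTO committee_candidate VALUES (1, 'STA-EXAMPLE-C-001');
""")


def candidates_with_committee(con):
    """Return every candidate, with sboe_id set to NULL when no link exists."""
    return con.execute("""
        SELECT candidate_id, first_name, last_name, sboe_id
        FROM cl_candidates
        LEFT JOIN committee_candidate USING (candidate_id)
        ORDER BY candidate_id
    """).fetchall()


for row in candidates_with_committee(con):
    print(row)
```

Filtering the result on `sboe_id IS NULL` would list only the unlinked candidates.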

 The candidate listing also gives you
 the candidate's complete election history.
 You might find some historically interesting information
 by joining `cl_candidates` with

 * `cl_elections` for specific election contests,
 * `cl_name_on_ballot` for the candidate's name on the ballot,
 * `cl_party` for the party affiliation of a candidate in an election, and
 * `cl_contact` for contact information for a candidate.

 Here's an example combining the above tables
 to show **the last three elections** for John Bell.

 ```{sql john-bell-elections, connection=con}
 WITH john_bell AS (
  SELECT *
  FROM cl_candidates
  WHERE (
    first_name = 'JOHN' AND
    middle_name = 'RICHARD' AND
    last_name = 'BELL'
  )
 )

 SELECT *
 FROM cl_elections
 LEFT JOIN cl_name_on_ballot USING (candidate_id, election_dt)
 LEFT JOIN cl_party USING (candidate_id, election_dt)
 LEFT JOIN cl_contact USING (candidate_id, election_dt)
 WHERE candidate_id IN (
  SELECT candidate_id
  FROM john_bell
 )
 ORDER BY election_dt DESC
 LIMIT 3
 ```

 ## Campaign Finance Reports

 Within the campaign finance report data,
 the best place to get started is the `reports` or `committees` table.

 The `reports` table provides a master list of reports filed with the SBOE.
 These include only the most up-to-date reports,
 taking into account amended filings,
 so you do not need to worry about filtering out outdated reports.

 Here are **five reports filed by John Bell's committee**
 (`sboe_id='STA-8S285O-C-001'`, taken from the committee mapping query above).

 ```{sql john-bell-reports, connection=con}
 SELECT *
 FROM reports
 WHERE sboe_id='STA-8S285O-C-001'
 LIMIT 5
 ```

 Each report has a `cover` page, which is included in the `cover` table,
 but I've extracted the most recent name and contact information for the committee
 into the `committees` table.
 Here, you're guaranteed to get a single row per committee,
 for example with John Bell's committee:

 ```{sql john-bell-committee, connection=con}
 SELECT *
 FROM committees
 WHERE sboe_id='STA-8S285O-C-001'
 ```

 Next, we have the `receipts` and `expenses` tables.
 Both are similarly structured, so I'll just demonstrate how to use `receipts`.
 First, we'll filter the `reports` table to get the 2016 Q1 report record for John Bell's committee.
 Then, we can `INNER JOIN` this table with the `receipts` table,
 which returns all of the receipts
 for the reports in the `filtered_reports` common table expression
 (CTE, i.e. the temporary table created using `WITH ___ AS (__query__)`).

 ```{sql john-bell-receipts, connection=con}
 WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2016
    AND doc_name='First Quarter'
 )

 SELECT *
 FROM receipts
 INNER JOIN filtered_reports USING (report_id)
 LIMIT 5
 ```

 Once you have a list of receipts,
 you can start to do some analysis on these,
 like finding the total money received by contribution type.
 This time I'm filtering to John Bell's 2020 Q3 report.

 ```{sql john-bell-donations, connection=con}
 WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2020
    AND doc_name='Third Quarter'
 )

 SELECT
  receipt_type_desc,
  receipt_type_code,
  is_donation,
  SUM(amount) as total
 FROM receipts
 INNER JOIN filtered_reports USING (report_id)
 GROUP BY
  receipt_type_desc,
  receipt_type_code,
  is_donation
 ORDER BY total DESC
 ```

 ## Receipts and Payer Information

 As you may have noticed above,
 each receipt is linked to a payer in the `receipts_payer` table
 via the `payer_id` column.

 At this time, I have not de-duplicated the `receipts_payer` table,
 so there may be multiple records for a single payer.
 Separating the `receipts` and `receipts_payer` tables
 will allow us to de-duplicate payer information in the future.

 Compared with the committee and candidate information,
 the payer records are much noisier
 and don't have any highly reliable fields
 that we can use to de-duplicate the records.
 The probabilistic matching I used for linking candidates and committees
 will work, but it's a relatively large engineering lift.
 The challenge is that there are about 750,000 unique payer records,
 which certainly isn't _big data_,
 but deduplicating the records requires about 281 billion comparisons
 (to compare every record with every other record, n(n - 1)/2 pairs).
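
As a quick check on that scale, the number of unordered pairs among n records is n(n - 1)/2. The short Python sketch below works this out for the approximate record count quoted above. The function name is just illustrative.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_payers = 750_000  # approximate number of unique payer records
print(f"{pairwise_comparisons(n_payers):,}")  # prints 281,249,625,000
```

The count grows quadratically, which is why record-linkage workflows usually avoid comparing every pair directly.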

 For now, I'll show you how to use the `receipts_payer` table to look for specific donors.
 I recommend starting with `receipts_payer`,
 finding all `payer_id` values that match your person(s) of interest,
 and then working from those records back to the `receipts` table.

 As an example, let's find the donors who have contributed the most to NC candidates since 2016.
 The first CTE (`donations_total`) uses `receipts` to sum the total amount donated by each `payer_id`.
 The second CTE (`top_donors`) ranks the donors by total and keeps the top 20.
 Finally, we join `top_donors` with `receipts_payer` to get the payer information for each donor.

 This gives us the "top 20" donors,
 but it's clear that some of them are the same person with slightly different payer information.

 ```{sql top-donors, connection=con}
 WITH
  donations_total AS (
    SELECT payer_id, SUM(amount) AS total
    FROM receipts
     WHERE payer_id IS NOT NULL
    GROUP BY payer_id
  ),
  top_donors AS (
    SELECT payer_id, total
    FROM (
      SELECT *, RANK() OVER (ORDER BY total DESC) AS donor_rank
      FROM donations_total
    )
    WHERE (donor_rank <= 20)
  )

 SELECT
  top_donors.*,
  org_name,
  profession,
  employers_name,
  address_lookup
 FROM top_donors
 LEFT JOIN receipts_payer
  ON (top_donors.payer_id = receipts_payer.payer_id)
 ORDER BY UPPER(org_name)
 ```

 Let's pick out **Greg Lindberg** from the list above.
 (I'm sure you're familiar with [Greg Lindberg](https://www.justice.gov/opa/pr/founder-and-chairman-multinational-investment-company-and-company-consultant-convicted) but I wasn't, and wow!)


 ```{sql greg-lindberg, connection=con, sql.max.print = 15}
 SELECT
  payer_id,
  org_name,
  profession,
  employers_name,
  address_lookup
 FROM receipts_payer
 WHERE (
  UPPER(org_name) LIKE 'GREG%LINDBERG%'
 )
 ```

 All of the above records are definitely the same person,
 so, because I'm curious,
 let's find out **how much money Greg Lindberg has donated**.

 ```{sql greg-lindberg-total, connection=con, output.var="gl_total"}
 WITH greg_lindberg AS (
  SELECT
    payer_id,
    org_name,
    profession,
    employers_name,
    address_lookup
  FROM receipts_payer
  WHERE (
    UPPER(org_name) LIKE 'GREG%LINDBERG%'
  )
 )

 SELECT SUM(amount) AS total
 FROM receipts
 INNER JOIN greg_lindberg USING (payer_id)
 ```

 The result of that query tells us that Greg Lindberg has donated
 **`r scales::dollar(gl_total$total, 1)`**
 to NC candidates since 2016.
 This number tracks with
 [an article from The News & Observer](https://www.newsobserver.com/news/politics-government/article228779794.html)
 that reports Greg Lindberg donated around $7.5 million from 2016 to 2018,
 a figure that includes donations to federal PACs.

 Now we can find out where Greg Lindberg has donated his money.
 The first CTE picks out the payer records for Greg Lindberg
 and the second finds his total donations to each committee.
 Finally, we join the total donations results with the `committees` table
 to get the committee names (and pick out the top 10).

 ```{sql greg-lindberg-donations, connection=con, output.var="gl_donations"}
 WITH
  greg_lindberg AS (
    SELECT
      payer_id,
      org_name,
      profession,
      employers_name,
      address_lookup
    FROM receipts_payer
    WHERE (
      UPPER(org_name) LIKE 'GREG%LINDBERG%'
    )
  ),
  greg_lindberg_donations AS (
    SELECT sboe_id, SUM(amount) AS total
    FROM receipts
    INNER JOIN greg_lindberg USING (payer_id)
    GROUP BY sboe_id
  )

 SELECT
  greg_lindberg_donations.sboe_id,
  committees.committee_name,
  greg_lindberg_donations.total
 FROM greg_lindberg_donations
 LEFT JOIN committees USING (sboe_id)
 ORDER BY total DESC
 LIMIT 10
 ```

 ```{r table-gl-donations, echo = FALSE}
 library(gt)

 gt(gl_donations) |>
  cols_label(
    sboe_id = "SBOE ID",
    committee_name = "Committee Name",
    total = "Total Donations"
  ) |>
  fmt_currency(columns = total, decimals = 0)
 ```

 And now, also because I'm curious,
 here is a plot showing Lindberg's donations over time.
 [Things went south for Lindberg](https://www.newsobserver.com/news/politics-government/article228776004.html)
 early in 2019,
 so it's unsurprising that his donations have dropped off after 2018.

 ```{sql greg-lindberg-donations-over-time, connection=con, output.var="gl_donations_over_time"}
 WITH
  greg_lindberg AS (
    SELECT
      payer_id,
      org_name,
      profession,
      employers_name,
      address_lookup
    FROM receipts_payer
    WHERE (
      UPPER(org_name) LIKE 'GREG%LINDBERG%'
    )
  )

 SELECT YEAR(occur_date) AS year, SUM(amount) AS total
 FROM receipts
 INNER JOIN greg_lindberg USING (payer_id)
 GROUP BY year
 ```

 ```{r plot-gl-donations-over-time, echo = FALSE}
 #| fig.width: 9
 library(ggplot2)

 ggplot(gl_donations_over_time, aes(x = factor(year), y = total)) +
  geom_col(fill = "#517898") +
  labs(
    title = "Greg Lindberg's Donations",
    x = NULL,
    y = "Total Donations"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(14, "Helvetica") +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title.position="plot"
  )
 ```


 ```{r cleanup}
 #| include: false
 DBI::dbDisconnect(con, shutdown=TRUE)
 ```