---
title: "Candidate Listing Resolution"
author: Garrick Aden-Buie
date: 2024-01-05
format:
  html:
    embed-resources: true
    mermaid:
      theme: neutral
editor:
  render-on-save: true
---

```{=html}
<style>
.knitsql-table {
  margin-bottom: 1rem;
}
</style>
```

## Overview

```{r setup}
#| include: false
library(tidyverse)
library(fs)
pkgload::load_all(here::here("process"))
```

The complete dataset involves a number of tables from four data sources:

1. The Campaign Finance Reports from the NC State Board of Elections (SBOE);
2. The Candidate Listing from the NC SBOE;
3. The Voter Registration from the NC SBOE; and
4. Resolved addresses from the U.S. Census Bureau.

The diagram below outlines the tables in the uploaded dataset
and their general relationship to each other.

```{mermaid}
flowchart LR
  subgraph "Campaign Finance reports"
    reports
    cover
    officers
    receipts
    expenses
    committees
    receipts_payer
    expenses_payee
  end

  subgraph "Candidate Listing"
    cl_candidates
    cl_elections
    cl_name_on_ballot
    cl_party
    cl_contact
  end

  committees --> reports
  reports --> cover
  reports --> officers
  reports --> receipts
  reports --> expenses
  receipts --> receipts_payer
  expenses --> expenses_payee

  cl_candidates --> cl_elections
  cl_elections --> cl_name_on_ballot
  cl_elections --> cl_party
  cl_elections --> cl_contact

  committee_candidate <--> committees
  committee_candidate <--> cl_candidates
```

### Campaign finance reports

The primary goal of this project was to collect and organize the campaign finance reports.
The core tables of this portion of the dataset are:

1. `reports`: This table provides a master list of reports filed with the SBOE.
2. `committees`: This table extracts the most recent committee information
   from the filed reports. If you're interested in a particular committee, this
   is likely the place you'll want to start.
3. `receipts`, `expenses`: These tables provide a list of received contributions
   and expenses paid by the committee.
4. `receipts_payer`, `expenses_payee`: These tables provide the payer/payee
   information for the receipts and expenses, extracted from the `receipts` and
   `expenses` tables. I haven't de-duplicated the records in these tables (yet).
5. `cover`: Each report has a cover "page" where key information about the
   committee or the period being reported is provided.
6. `officers`: Each committee has a list of officers. This table provides a
   master list of officers for all committees.

All of the above tables have both `sboe_id` and `report_id` columns.
`sboe_id` uniquely identifies a committee by its SBOE-assigned ID,
and `report_id` uniquely identifies an individual campaign finance report.
In the `committees` table,
`report_id` refers to the latest report
from which the committee contact information was extracted.

The `sboe_id` is always the same for a given committee
and is the best way to identify a committee.
The `report_id` refers to a specific filing of a report,
and I've included only the most recently filed reports in this database.
Note that amended filings receive a new report ID,
so the `report_id` may change in the future
if or when the committee files an amendment.
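
As a quick illustration of how the two IDs work together,
here is a minimal sketch that counts the reports on file for each committee.
It uses only the `sboe_id`, `report_id` and `committee_name` columns that appear elsewhere in this document;
the `n_reports` alias is my own,
and the block is shown as plain, unevaluated SQL.

```sql
-- Count the reports currently on file for each committee.
-- sboe_id identifies the committee; report_id identifies one filing.
SELECT
  committees.sboe_id,
  committees.committee_name,
  COUNT(reports.report_id) AS n_reports
FROM committees
LEFT JOIN reports
  ON (committees.sboe_id = reports.sboe_id)
GROUP BY
  committees.sboe_id,
  committees.committee_name
ORDER BY n_reports DESC
LIMIT 10
```
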

### Candidate listing

To augment the campaign finance data,
I've included several tables extracted from each year's candidate listing.
The primary table of interest is `cl_candidates`.
It contains individual candidates from the candidate listing
and their contact information and party affiliation.
If you're looking for a specific candidate,
but don't know their SBOE ID or their committee's name,
this is the place to start.

Candidates from the candidate listing are linked to a specific SBOE committee
via the `committee_candidate` table,
which matches a `candidate_id` with an `sboe_id`.
Note that the `candidate_id` is an ID I've created
to help organize the candidate listing --
although I'm sure the SBOE has its own internal ID for each candidate,
they don't include them in the data they publish.
This means that the `candidate_id` may change
when the candidate listing is updated.

### Additional data sources

Many tables have an `address_lookup` column.
When this column is found in a table,
it serves as the key for matching the address of that row
with the resolved addresses in the `addresses` table.
These addresses have been passed through the geocoding services
provided by the U.S. Census Bureau,
so the `addresses` table is also a useful way
to get the latitude and longitude of an address for mapping purposes.
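
To make the address lookup concrete,
here is a hedged sketch that joins payer records to their resolved addresses.
It assumes the `addresses` table is registered on the same connection
(it is not opened in this document's setup chunk),
and it selects `addresses.*` rather than guessing at the names of the latitude and longitude columns.

```sql
-- Sketch only: assumes an `addresses` table is available on the connection.
-- address_lookup is the shared key described above.
SELECT
  receipts_payer.payer_id,
  receipts_payer.org_name,
  addresses.*
FROM receipts_payer
LEFT JOIN addresses
  ON (receipts_payer.address_lookup = addresses.address_lookup)
LIMIT 5
```
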

Additionally, I've included the voter registration data.
It's not currently linked to any other tables,
but it could be useful in observing trends in voter registration across counties,
or for exploring demographic trends in the voting population.

```{r load-database}
#| echo: false
reports <- out_open_dataset_db(here::here("process/data-out/reports"))
officers <- out_open_dataset_db(here::here("process/data-out/officers"))
committees <- out_open_dataset_db(here::here("process/data-out/committees"))
cl_candidates <- out_open_dataset_db(here::here("process/data-out/cl_candidates"))
cl_elections <- out_open_dataset_db(here::here("process/data-out/cl_elections"))
cl_party <- out_open_dataset_db(here::here("process/data-out/cl_party"))
cl_name_on_ballot <- out_open_dataset_db(here::here("process/data-out/cl_name_on_ballot"))
cl_contact <- out_open_dataset_db(here::here("process/data-out/cl_contact"))
committee_candidate <- out_open_dataset_db(here::here("process/data-out/committee_candidate"))
receipts <- out_open_dataset_db("../../process/data-out/receipts")
receipts_payer <- out_open_dataset_db("../../process/data-out/receipts_payer")
con <- duckdb_global_con()
```

## Finding a candidate

You can often start by searching for a candidate in the `cl_candidates` table.
Here's an example looking for **John Bell**.

```{sql john-bell, connection=con}
SELECT *
FROM cl_candidates
WHERE (
  first_name = 'JOHN' AND
  middle_name = 'RICHARD' AND
  last_name = 'BELL'
)
```

Then, join the result with the `committee_candidate` table
to find the candidate's `sboe_id`,
if a match has been identified.
The candidate-committee linking uses probabilistic matching,
which allows for some flexibility in the matching process.
Note that not every candidate is linked to a committee
and there are many more committees than candidates.

```{sql john-bell-committee-mapping, connection=con}
WITH john_bell AS (
  SELECT *
  FROM cl_candidates
  WHERE (
    first_name = 'JOHN' AND
    middle_name = 'RICHARD' AND
    last_name = 'BELL'
  )
)
SELECT *
FROM committee_candidate
WHERE candidate_id IN (
  SELECT candidate_id
  FROM john_bell
)
```
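
If you also want the committee's name, one more join gets you there.
This sketch extends the query above using only columns that appear elsewhere in this document;
candidates without an identified match drop out of the inner joins.

```sql
-- Candidate -> committee_candidate -> committees, in one pass.
SELECT
  cl_candidates.candidate_id,
  cl_candidates.first_name,
  cl_candidates.last_name,
  committees.sboe_id,
  committees.committee_name
FROM cl_candidates
INNER JOIN committee_candidate
  ON (cl_candidates.candidate_id = committee_candidate.candidate_id)
INNER JOIN committees
  ON (committee_candidate.sboe_id = committees.sboe_id)
WHERE (
  cl_candidates.first_name = 'JOHN' AND
  cl_candidates.middle_name = 'RICHARD' AND
  cl_candidates.last_name = 'BELL'
)
```
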

The candidate listing also gives you
a complete record of the candidate's election history.
You might find some historically interesting information
by joining `cl_candidates` with

* `cl_elections` for specific election contests,
* `cl_name_on_ballot` for the candidate's name on the ballot,
* `cl_party` for the party affiliation of a candidate in an election, and
* `cl_contact` for contact information for a candidate.

Here's an example combining the above tables
to show **the last three elections** for John Bell.

```{sql john-bell-elections, connection=con}
WITH john_bell AS (
  SELECT *
  FROM cl_candidates
  WHERE (
    first_name = 'JOHN' AND
    middle_name = 'RICHARD' AND
    last_name = 'BELL'
  )
)
SELECT *
FROM cl_elections
LEFT JOIN cl_name_on_ballot USING (candidate_id, election_dt)
LEFT JOIN cl_party USING (candidate_id, election_dt)
LEFT JOIN cl_contact USING (candidate_id, election_dt)
WHERE candidate_id IN (
  SELECT candidate_id
  FROM john_bell
)
ORDER BY election_dt DESC
LIMIT 3
```

## Campaign Finance Reports

Within the campaign finance report data,
the best place to get started is with the `reports` or `committees` table.
The `reports` table provides a master list of reports filed with the SBOE.
These include only the most up-to-date reports,
taking into account amended filings,
so you do not need to worry about filtering out outdated reports.

Here are the **first 5 reports filed by John Bell's committee**
(`sboe_id='STA-8S285O-C-001'` taken from the previous query).

```{sql john-bell-reports, connection=con}
SELECT *
FROM reports
WHERE sboe_id='STA-8S285O-C-001'
LIMIT 5
```

Each report has a `cover` page, which is included in the `cover` table,
but I've extracted the most recent name and contact information for the committee
into the `committees` table.
Here, you're guaranteed to get a single row per committee,
for example with John Bell's committee:

```{sql john-bell-committee, connection=con}
SELECT *
FROM committees
WHERE sboe_id='STA-8S285O-C-001'
```
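
If you'd like to check the one-row-per-committee property for yourself,
a query along these lines should come back empty.
Treat it as a sketch of a sanity check rather than something the pipeline itself runs;
the `n_rows` alias is my own.

```sql
-- Any sboe_id listed here would appear more than once in `committees`.
SELECT sboe_id, COUNT(*) AS n_rows
FROM committees
GROUP BY sboe_id
HAVING COUNT(*) > 1
```
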

Next, we have the `receipts` and `expenses` tables.
Both are similarly structured, so I'll just demonstrate how to use `receipts`.
First, we'll filter the `reports` table to get the 2016 Q1 report record for John Bell's committee.
Then, we can `INNER JOIN` this table with the `receipts` table,
which returns all of the receipts
for the reports in the `filtered_reports` common table expression
(CTE, i.e. the temporary tables created using `WITH ___ AS (__query__)`).

```{sql john-bell-receipts, connection=con}
WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2016
    AND doc_name='First Quarter'
)
SELECT *
FROM receipts
INNER JOIN filtered_reports USING (report_id)
LIMIT 5
```
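
The same pattern should carry over to `expenses`,
since it shares the `report_id` column.
Here is a sketch under that assumption;
the `expenses` table isn't opened in the setup chunk above,
so it would need to be registered on the connection before this could run.

```sql
-- Sketch only: assumes `expenses` is available on the connection.
WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2016
    AND doc_name='First Quarter'
)
SELECT *
FROM expenses
INNER JOIN filtered_reports USING (report_id)
LIMIT 5
```
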

Once you have a list of receipts,
you can start to do some analysis on these,
like finding the total money received by contribution type.
This time I'm filtering to John Bell's 2020 Q3 report.

```{sql john-bell-donations, connection=con}
WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2020
    AND doc_name='Third Quarter'
)
SELECT
  receipt_type_desc,
  receipt_type_code,
  is_donation,
  SUM(amount) as total
FROM receipts
INNER JOIN filtered_reports USING (report_id)
GROUP BY
  receipt_type_desc,
  receipt_type_code,
  is_donation
ORDER BY total DESC
```

## Receipts and Payer Information

As you may have noticed above,
each receipt is linked to a payer in the `receipts_payer` table
via the `payer_id` column.
At this time, I have not de-duplicated the `receipts_payer` table,
so there may be multiple records for a single payer.
Separating the `receipts` and `receipts_payer` tables
will allow us to de-duplicate payer information in the future.

Compared with the committee and candidate information,
the payer records are much noisier
and don't have any highly reliable fields
that we can use to de-duplicate the records.
The probabilistic matching I used for linking candidates and committees
will work but it's a relatively large engineering lift.
The challenge is that there are about 750,000 unique payer records,
which certainly isn't _big data_,
but comparing every record with every other record
takes n(n - 1)/2 comparisons,
which for 750,000 records comes to roughly 281 billion.

For now, I'll show you how to use the `receipts_payer` table to look for specific donors.
I recommend starting with `receipts_payer`,
finding all `payer_id` values that match your person(s) of interest,
and then working from those records back to the `receipts` table.

As an example, let's find the donors who have contributed the most to NC candidates since 2016.
The first CTE (`donations_total`) uses `receipts` to sum the total amount donated by each `payer_id`.
The second CTE (`top_donors`) filters the donors down to the top 20 donors.
Finally, we join `top_donors` with `receipts_payer` to get the payer information for each donor.
This gives us the "top 20" donors,
but it's clear that some of them are the same person with slightly different payer information.

```{sql top-donors, connection=con}
WITH
donations_total AS (
  SELECT payer_id, SUM(amount) AS total
  FROM receipts
  WHERE payer_id IS NOT NULL
  GROUP BY payer_id
),
top_donors AS (
  SELECT payer_id, total
  FROM (
    SELECT *, RANK() OVER (ORDER BY total DESC) AS donor_rank
    FROM donations_total
  )
  WHERE (donor_rank <= 20)
)
SELECT
  top_donors.*,
  org_name,
  profession,
  employers_name,
  address_lookup
FROM top_donors
LEFT JOIN receipts_payer
  ON (top_donors.payer_id = receipts_payer.payer_id)
ORDER BY UPPER(org_name)
```
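
One rough way to see past some of that duplication,
short of real de-duplication,
is to total donations by a normalized name instead of by `payer_id`.
The sketch below upper-cases and trims `org_name`
(the `name_key` and `n_payer_records` aliases are my own)
and assumes each `payer_id` appears once in `receipts_payer`.
It only merges records whose names match exactly after normalization,
so misspellings and name variants still count separately,
and distinct donors who happen to share a name are lumped together.

```sql
WITH
payer_names AS (
  -- Crude normalization: ignore case and surrounding whitespace.
  SELECT payer_id, UPPER(TRIM(org_name)) AS name_key
  FROM receipts_payer
  WHERE org_name IS NOT NULL
)
SELECT
  name_key,
  COUNT(DISTINCT payer_id) AS n_payer_records,
  SUM(amount) AS total
FROM receipts
INNER JOIN payer_names USING (payer_id)
GROUP BY name_key
ORDER BY total DESC
LIMIT 20
```
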

Let's pick out **Greg Lindberg** from the list above.
(I'm sure you're familiar with [Greg Lindberg](https://www.justice.gov/opa/pr/founder-and-chairman-multinational-investment-company-and-company-consultant-convicted) but I wasn't, and wow!)

```{sql greg-lindberg, connection=con, sql.max.print = 15}
SELECT
  payer_id,
  org_name,
  profession,
  employers_name,
  address_lookup
FROM receipts_payer
WHERE (
  UPPER(org_name) LIKE 'GREG%LINDBERG%'
)
```

All of the above records are definitely the same person,
so, because I'm curious,
let's find out **how much money Greg Lindberg has donated**.

```{sql greg-lindberg-total, connection=con, output.var="gl_total"}
WITH greg_lindberg AS (
  SELECT
    payer_id,
    org_name,
    profession,
    employers_name,
    address_lookup
  FROM receipts_payer
  WHERE (
    UPPER(org_name) LIKE 'GREG%LINDBERG%'
  )
)
SELECT SUM(amount) AS total
FROM receipts
INNER JOIN greg_lindberg USING (payer_id)
```

The result of that query tells us that Greg Lindberg has donated
**`r scales::dollar(gl_total$total, 1)`**
to NC candidates since 2016.
This number tracks with
[an article from The News & Observer](https://www.newsobserver.com/news/politics-government/article228779794.html)
that reports Greg Lindberg donated around $7.5 million from 2016 to 2018,
a figure that includes donations to federal PACs.

Now we can find out where Greg Lindberg has donated his money.
The first CTE picks out the payer records for Greg Lindberg
and the second finds his total donations to each committee.
Finally, we join the total donations results with the `committees` table
to get the committee names (and pick out the top 10).

```{sql greg-lindberg-donations, connection=con, output.var="gl_donations"}
WITH
greg_lindberg AS (
  SELECT
    payer_id,
    org_name,
    profession,
    employers_name,
    address_lookup
  FROM receipts_payer
  WHERE (
    UPPER(org_name) LIKE 'GREG%LINDBERG%'
  )
),
greg_lindberg_donations AS (
  SELECT sboe_id, SUM(amount) AS total
  FROM receipts
  INNER JOIN greg_lindberg USING (payer_id)
  GROUP BY sboe_id
)
SELECT
  greg_lindberg_donations.sboe_id,
  committees.committee_name,
  greg_lindberg_donations.total
FROM greg_lindberg_donations
LEFT JOIN committees USING (sboe_id)
ORDER BY total DESC
LIMIT 10
```

```{r table-gl-donations, echo = FALSE}
library(gt)
gt(gl_donations) |>
  cols_label(
    sboe_id = "SBOE ID",
    committee_name = "Committee Name",
    total = "Total Donations"
  ) |>
  fmt_currency(columns = total, decimals = 0)
```

And now, also because I'm curious,
here is a plot showing Lindberg's donations over time.
[Things went south for Lindberg](https://www.newsobserver.com/news/politics-government/article228776004.html)
early in 2019,
so it's unsurprising that his donations dropped off after 2018.

```{sql greg-lindberg-donations-over-time, connection=con, output.var="gl_donations_over_time"}
WITH
greg_lindberg AS (
  SELECT
    payer_id,
    org_name,
    profession,
    employers_name,
    address_lookup
  FROM receipts_payer
  WHERE (
    UPPER(org_name) LIKE 'GREG%LINDBERG%'
  )
)
SELECT YEAR(occur_date) AS year, SUM(amount) AS total
FROM receipts
INNER JOIN greg_lindberg USING (payer_id)
GROUP BY year
```

```{r plot-gl-donations-over-time, echo = FALSE}
#| fig.width: 9
library(ggplot2)
ggplot(gl_donations_over_time, aes(x = factor(year), y = total)) +
  geom_col(fill = "#517898") +
  labs(
    title = "Greg Lindberg's Donations",
    x = NULL,
    y = "Total Donations"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(14, "Helvetica") +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title.position = "plot"
  )
```

```{r cleanup}
#| include: false
DBI::dbDisconnect(con, shutdown=TRUE)
```