add final report doc

main
Garrick Aden-Buie, 2 years ago
parent
commit d22124428c
2 changed files with 7623 additions and 0 deletions
  1. +7075
    -0
      reports/2023-12-17_about-tables/about-tables.html
  2. +548
    -0
      reports/2023-12-17_about-tables/about-tables.qmd

reports/2023-12-17_about-tables/about-tables.html
File diff suppressed because it is too large.

reports/2023-12-17_about-tables/about-tables.qmd

@@ -0,0 +1,548 @@
---
title: "Candidate Listing Resolution"
author: Garrick Aden-Buie
date: 2024-01-05

format:
  html:
    embed-resources: true
    mermaid:
      theme: neutral

editor:
  render-on-save: true
---

```{=html}
<style>
.knitsql-table {
  margin-bottom: 1rem;
}
</style>
```

## Overview

```{r setup}
#| include: false
library(tidyverse)
library(fs)
pkgload::load_all(here::here("process"))
```

The complete dataset involves a number of tables from four data sources:

1. The Campaign Finance Reports from the NC State Board of Elections (SBOE);
2. The Candidate Listing from the NC SBOE;
3. The Voter Registration from the NC SBOE; and
4. Resolved addresses from the U.S. Census Bureau.

The diagram below outlines the tables in the uploaded dataset
and their general relationship to each other.

```{mermaid}
flowchart LR
  subgraph "Campaign Finance reports"
    reports
    cover
    officers
    receipts
    expenses
    committees
    receipts_payer
    expenses_payee
  end

  subgraph "Candidate Listing"
    cl_candidates
    cl_elections
    cl_name_on_ballot
    cl_party
    cl_contact
  end

  committees --> reports

  reports --> cover
  reports --> officers
  reports --> receipts
  reports --> expenses

  receipts --> receipts_payer
  expenses --> expenses_payee

  cl_candidates --> cl_elections
  cl_elections --> cl_name_on_ballot
  cl_elections --> cl_party
  cl_elections --> cl_contact

  committee_candidate <--> committees
  committee_candidate <--> cl_candidates
```

### Campaign finance reports

The primary goal of this project was to collect and organize the campaign finance reports.
The core tables of this portion of the dataset are:

1. `reports`: This table provides a master list of reports filed with the SBOE.
2. `committees`: This table extracts the most recent committee information
from the filed reports. If you're interested in a particular committee, this
is likely the place you'll want to start.
3. `receipts`, `expenses`: These tables provide a list of contributions received
and expenses paid by the committee.
4. `receipts_payer`, `expenses_payee`: These tables provide the payer/payee
information for the receipts and expenses, extracted from the `receipts` and
`expenses` tables. I haven't de-duplicated the records in these tables (yet).
5. `cover`: Each report has a cover "page" where key information about the
committee or the period being reported is provided.
6. `officers`: Each committee has a list of officers. This table provides a
master list of officers for all committees.

All of the above tables have both `sboe_id` and `report_id` columns.
`sboe_id` uniquely identifies a committee by its SBOE-assigned ID,
and `report_id` uniquely identifies an individual campaign finance report.
In the `committees` table,
`report_id` refers to the latest report
from which the committee contact information was extracted.

The `sboe_id` is always the same for a given committee
and is the best way to identify a committee.
The `report_id` refers to a specific filing of a report,
and I've included only the most recently filed reports in this database.
Note that because amended filings receive a new report ID,
the `report_id` may change in the future
if or when the committee files an amendment.
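
As a minimal sketch of how the two keys work together, the query below counts the reports on file for each committee by matching `committees` to `reports` on `sboe_id`. It uses only column names that appear elsewhere in this document (`sboe_id`, `report_id`, `committee_name`) and assumes both tables are registered on the DuckDB connection used in the later sections.

```sql
-- Sketch: number of reports on file per committee.
-- `committees.report_id` only points at the latest report, so the
-- count is taken from the `reports` table instead.
SELECT
  committees.sboe_id,
  committees.committee_name,
  COUNT(reports.report_id) AS n_reports
FROM committees
LEFT JOIN reports
  ON committees.sboe_id = reports.sboe_id
GROUP BY
  committees.sboe_id,
  committees.committee_name
ORDER BY n_reports DESC
LIMIT 10
```

Because only the most recently filed version of each report is kept, the count reflects distinct reports rather than every amendment ever filed.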

### Candidate listing

To augment the campaign finance data,
I've included several tables extracted from each year's candidate listing.
The primary table of interest is `cl_candidates`.
It contains individual candidates from the candidate listing
and their contact information and party affiliation.
If you're looking for a specific candidate,
but don't know their SBOE ID or their committee's name,
this is the place to start.

Candidates from the candidate listing are linked to a specific SBOE committee
via the `committee_candidate` table,
which matches a `candidate_id` with an `sboe_id`.
Note that the `candidate_id` is an ID I've created
to help organize the candidate listing --
although I'm sure the SBOE has its own internal ID for each candidate,
they don't include them in the data they publish.
This means that the `candidate_id` may change
when the candidate listing is updated.
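
A rough way to see how much of the candidate listing is linked to a committee is sketched below. It assumes `cl_candidates` and `committee_candidate` are registered on the same connection and relies only on the `candidate_id` column described above.

```sql
-- Sketch: how many candidates have at least one linked committee?
-- COUNT(DISTINCT ...) ignores NULLs, so unmatched candidates are
-- counted in n_candidates but not in n_linked.
SELECT
  COUNT(DISTINCT cl_candidates.candidate_id) AS n_candidates,
  COUNT(DISTINCT committee_candidate.candidate_id) AS n_linked
FROM cl_candidates
LEFT JOIN committee_candidate
  ON cl_candidates.candidate_id = committee_candidate.candidate_id
```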

### Additional data sources

Many tables have an `address_lookup` column.
When this column is found in a table,
it serves as the key for matching the address of that row
with the resolved addresses in the `addresses` table.
These addresses have been passed through the geocoding services
provided by the U.S. Census Bureau,
so the `addresses` table is also a useful way
to get the latitude and longitude of an address for mapping purposes.
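
The lookup described above might be sketched as follows for payer records. Two things here are assumptions: that the `addresses` table is registered on the connection, and that its key column is also named `address_lookup`. The column names for the resolved fields are not listed in this report, so the sketch selects all of them rather than guessing.

```sql
-- Sketch: attach resolved address details to payer records.
-- Assumption: `addresses` has a key column named `address_lookup`.
SELECT
  receipts_payer.payer_id,
  receipts_payer.org_name,
  addresses.*
FROM receipts_payer
LEFT JOIN addresses
  ON receipts_payer.address_lookup = addresses.address_lookup
LIMIT 5
```

A `LEFT JOIN` keeps payer records whose address could not be resolved, which is useful for spotting gaps in the geocoding.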

Additionally, I've included the voter registration data.
It's not currently linked to any other tables,
but it could be useful in observing trends in voter registration across counties,
or for exploring demographic trends in the voting population.


```{r load-database}
#| echo: false
reports <- out_open_dataset_db(here::here("process/data-out/reports"))
officers <- out_open_dataset_db(here::here("process/data-out/officers"))
committees <- out_open_dataset_db(here::here("process/data-out/committees"))
cl_candidates <- out_open_dataset_db(here::here("process/data-out/cl_candidates"))
cl_elections <- out_open_dataset_db(here::here("process/data-out/cl_elections"))
cl_party <- out_open_dataset_db(here::here("process/data-out/cl_party"))
cl_name_on_ballot <- out_open_dataset_db(here::here("process/data-out/cl_name_on_ballot"))
cl_contact <- out_open_dataset_db(here::here("process/data-out/cl_contact"))
committee_candidate <- out_open_dataset_db(here::here("process/data-out/committee_candidate"))

receipts <- out_open_dataset_db("../../process/data-out/receipts")
receipts_payer <- out_open_dataset_db("../../process/data-out/receipts_payer")

con <- duckdb_global_con()
```

## Finding a candidate

You can often start by searching for a candidate in the `cl_candidates` table.
Here's an example looking for **John Bell**.

```{sql john-bell, connection=con}
SELECT *
FROM cl_candidates
WHERE (
  first_name = 'JOHN' AND
  middle_name = 'RICHARD' AND
  last_name = 'BELL'
)
```

Then, join the result with the `committee_candidate` table
to find the candidate's `sboe_id`,
if a match has been identified.
The candidate-committee linking uses probabilistic matching,
which allows for some flexibility in the matching process.
Note that not every candidate is linked to a committee
and there are many more committees than candidates.

```{sql john-bell-committee-mapping, connection=con}
WITH john_bell AS (
  SELECT *
  FROM cl_candidates
  WHERE (
    first_name = 'JOHN' AND
    middle_name = 'RICHARD' AND
    last_name = 'BELL'
  )
)

SELECT *
FROM committee_candidate
WHERE candidate_id IN (
  SELECT candidate_id
  FROM john_bell
)
```
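
To go one step further and land on the committee's name in a single query, the lookup can be chained through `committee_candidate` into `committees`. This is a sketch using only columns shown elsewhere in this document (`candidate_id`, `sboe_id`, `committee_name`).

```sql
-- Sketch: candidate -> committee_candidate -> committees.
-- The INNER JOIN drops candidates with no committee match;
-- the LEFT JOIN keeps a match even if the committee row is missing.
SELECT
  cl_candidates.candidate_id,
  committee_candidate.sboe_id,
  committees.committee_name
FROM cl_candidates
INNER JOIN committee_candidate
  ON cl_candidates.candidate_id = committee_candidate.candidate_id
LEFT JOIN committees
  ON committee_candidate.sboe_id = committees.sboe_id
WHERE
  cl_candidates.first_name = 'JOHN' AND
  cl_candidates.middle_name = 'RICHARD' AND
  cl_candidates.last_name = 'BELL'
```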

The candidate listing also gives you
the candidate's complete election history.
You might find some historically interesting information
by joining `cl_candidates` with

* `cl_elections` for specific election contests,
* `cl_name_on_ballot` for the candidate's name on the ballot,
* `cl_party` for the party affiliation of a candidate in an election, and
* `cl_contact` for contact information for a candidate.

Here's an example combining the above tables
to show **the last three elections** for John Bell.

```{sql john-bell-elections, connection=con}
WITH john_bell AS (
  SELECT *
  FROM cl_candidates
  WHERE (
    first_name = 'JOHN' AND
    middle_name = 'RICHARD' AND
    last_name = 'BELL'
  )
)

SELECT *
FROM cl_elections
LEFT JOIN cl_name_on_ballot USING (candidate_id, election_dt)
LEFT JOIN cl_party USING (candidate_id, election_dt)
LEFT JOIN cl_contact USING (candidate_id, election_dt)
WHERE candidate_id IN (
  SELECT candidate_id
  FROM john_bell
)
ORDER BY election_dt DESC
LIMIT 3
```

## Campaign Finance Reports

Within the campaign finance report data,
the best place to get started is with the `reports` or `committees` table.

The `reports` table provides a master list of reports filed with the SBOE.
These include only the most up-to-date reports,
taking into account amended filings,
so you do not need to worry about filtering out outdated reports.

Here are the **first 5 reports filed by John Bell's committee**
(`sboe_id='STA-8S285O-C-001'` taken from the previous query).

```{sql john-bell-reports, connection=con}
SELECT *
FROM reports
WHERE sboe_id='STA-8S285O-C-001'
LIMIT 5
```
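
For a quick overview of a committee's filing history, a small aggregate over the same table can help. The sketch below counts reports per year for the same committee, using the `year` column that the later queries in this document also filter on.

```sql
-- Sketch: reports on file per year for one committee.
SELECT
  year,
  COUNT(*) AS n_reports
FROM reports
WHERE sboe_id = 'STA-8S285O-C-001'
GROUP BY year
ORDER BY year
```

The explicit `ORDER BY` matters here: without one, the row order of a query result is not guaranteed.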

Each report has a `cover` page, which is included in the `cover` table,
but I've extracted the most recent name and contact information for the committee
into the `committees` table.
Here, you're guaranteed to get a single row per committee,
for example with John Bell's committee:

```{sql john-bell-committee, connection=con}
SELECT *
FROM committees
WHERE sboe_id='STA-8S285O-C-001'
```

Next, we have the `receipts` and `expenses` tables.
Both are similarly structured, so I'll just demonstrate how to use `receipts`.
First, we'll filter the `reports` table to get the 2016 Q1 report record for John Bell's committee.
Then, we can `INNER JOIN` this table with the `receipts` table,
which returns all of the receipts
for the reports in the `filtered_reports` common table expression
(CTE, i.e. the temporary tables created using `WITH ___ AS (__query__)`).

```{sql john-bell-receipts, connection=con}
WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2016
    AND doc_name='First Quarter'
)

SELECT *
FROM receipts
INNER JOIN filtered_reports USING (report_id)
LIMIT 5
```
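
The same pattern should carry over to `expenses`. The sketch below assumes the `expenses` table is registered on the connection (it is not opened in the setup chunk of this report) and relies only on its `report_id` column, which the overview says every campaign finance table has.

```sql
-- Sketch: expenses for the same 2016 Q1 report.
-- Assumption: `expenses` is available on the connection.
WITH filtered_reports AS (
  SELECT report_id
  FROM reports
  WHERE sboe_id = 'STA-8S285O-C-001'
    AND year = 2016
    AND doc_name = 'First Quarter'
)

SELECT expenses.*
FROM expenses
INNER JOIN filtered_reports
  ON expenses.report_id = filtered_reports.report_id
LIMIT 5
```

Selecting only `report_id` in the CTE keeps the report columns out of the result, so the output contains just the expense records.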

Once you have a list of receipts,
you can start to do some analysis on these,
like finding the total money received by contribution type.
This time I'm filtering to John Bell's 2020 Q3 report.

```{sql john-bell-donations, connection=con}
WITH filtered_reports AS (
  SELECT *
  FROM reports
  WHERE sboe_id='STA-8S285O-C-001'
    AND year=2020
    AND doc_name='Third Quarter'
)

SELECT
  receipt_type_desc,
  receipt_type_code,
  is_donation,
  SUM(amount) as total
FROM receipts
INNER JOIN filtered_reports USING (report_id)
GROUP BY
  receipt_type_desc,
  receipt_type_code,
  is_donation
ORDER BY total DESC
```

## Receipts and Payer Information

As you may have noticed above,
each receipt is linked to a payer in the `receipts_payer` table
via the `payer_id` column.

At this time, I have not de-duplicated the `receipts_payer` table,
so there may be multiple records for a single payer.
Separating the `receipts` and `receipts_payer` tables
will allow us to de-duplicate payer information in the future.

Compared with the committee and candidate information,
the payer records are much noisier
and don't have any highly reliable fields
that we can use to de-duplicate the records.
The probabilistic matching I used for linking candidates and committees
will work, but it's a relatively large engineering lift.
The challenge is that there are about 750,000 unique payer records,
which certainly isn't _big data_,
but deduplicating the records requires more than 2.8 billion comparisons
(to compare every record with every other record).
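
For scale, the number of pairwise comparisons among $n$ records follows the standard formula below, evaluated here at two sizes as a sanity check. Roughly 2.8 billion comparisons corresponds to about 75,000 records, while 750,000 records would imply roughly 281 billion, so the two figures quoted above may not describe the same set of records. Either way, the conclusion holds: an all-pairs comparison is expensive.

$$
\binom{n}{2} = \frac{n(n-1)}{2}, \qquad
\binom{75{,}000}{2} \approx 2.81 \times 10^{9}, \qquad
\binom{750{,}000}{2} \approx 2.81 \times 10^{11}
$$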

For now, I'll show you how to use the `receipts_payer` table to look for specific donors.
I recommend starting with `receipts_payer`,
finding all `payer_id` values that match your person(s) of interest,
and then working from those records back to the `receipts` table.

As an example, let's find the top donors who have contributed the most to NC candidates since 2016.
The first CTE (`donations_total`) uses `receipts` to count the total amount donated by each `payer_id`.
The second CTE (`top_donors`) filters the donors down to the top 20 donors.
Finally, we join `top_donors` with `receipts_payer` to get the payer information for each donor.

This gives us the "top 20" donors,
but it's clear that some of them are the same person with slightly different payer information.

```{sql top-donors, connection=con}
WITH
  donations_total AS (
    SELECT payer_id, SUM(amount) AS total
    FROM receipts
    WHERE payer_id IS NOT NULL
    GROUP BY payer_id
  ),
  top_donors AS (
    SELECT payer_id, total
    FROM (
      SELECT *, RANK() OVER (ORDER BY total DESC) AS donor_rank
      FROM donations_total
    )
    WHERE (donor_rank <= 20)
  )

SELECT
  top_donors.*,
  org_name,
  profession,
  employers_name,
  address_lookup
FROM top_donors
LEFT JOIN receipts_payer
  ON (top_donors.payer_id = receipts_payer.payer_id)
ORDER BY UPPER(org_name)
```
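
One rough way to see how much the duplicate payer records affect the ranking is to total donations by a normalized name instead of by `payer_id`. This is only a sketch, not the probabilistic matching discussed above: it merges records whose upper-cased, trimmed `org_name` is identical, so it misses spelling variants and can wrongly merge different people who share a name.

```sql
-- Sketch: crude name-based grouping of payer records.
-- Only exact matches on the normalized name are merged.
SELECT
  UPPER(TRIM(receipts_payer.org_name)) AS payer_name,
  COUNT(DISTINCT receipts.payer_id) AS n_payer_ids,
  SUM(receipts.amount) AS total
FROM receipts
INNER JOIN receipts_payer
  ON receipts.payer_id = receipts_payer.payer_id
GROUP BY UPPER(TRIM(receipts_payer.org_name))
ORDER BY total DESC
LIMIT 20
```

The `n_payer_ids` column shows how many separate payer records were folded into each name, which is a quick indicator of where duplication is heaviest.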

Let's pick out **Greg Lindberg** from the list above.
(I'm sure you're familiar with [Greg Lindberg](https://www.justice.gov/opa/pr/founder-and-chairman-multinational-investment-company-and-company-consultant-convicted) but I wasn't, and wow!)


```{sql greg-lindberg, connection=con, sql.max.print = 15}
SELECT
  payer_id,
  org_name,
  profession,
  employers_name,
  address_lookup
FROM receipts_payer
WHERE (
  UPPER(org_name) LIKE 'GREG%LINDBERG%'
)
```

All of the above records are definitely the same person,
so, because I'm curious,
let's find out **how much money Greg Lindberg has donated**.

```{sql greg-lindberg-total, connection=con, output.var="gl_total"}
WITH greg_lindberg AS (
  SELECT
    payer_id,
    org_name,
    profession,
    employers_name,
    address_lookup
  FROM receipts_payer
  WHERE (
    UPPER(org_name) LIKE 'GREG%LINDBERG%'
  )
)

SELECT SUM(amount) AS total
FROM receipts
INNER JOIN greg_lindberg USING (payer_id)
```

The result of that query tells us that Greg Lindberg has donated
**`r scales::dollar(gl_total$total, 1)`**
to NC candidates since 2016.
This number tracks with
[an article from The News & Observer](https://www.newsobserver.com/news/politics-government/article228779794.html)
that reports Greg Lindberg donated around $7.5 million from 2016 to 2018,
a figure that includes donations to federal PACs.

Now we can find out where Greg Lindberg has donated his money.
The first CTE picks out the payer records for Greg Lindberg
and the second finds his total donations to each committee.
Finally, we join the total donations results with the `committees` table
to get the committee names (and pick out the top 10).

```{sql greg-lindberg-donations, connection=con, output.var="gl_donations"}
WITH
  greg_lindberg AS (
    SELECT
      payer_id,
      org_name,
      profession,
      employers_name,
      address_lookup
    FROM receipts_payer
    WHERE (
      UPPER(org_name) LIKE 'GREG%LINDBERG%'
    )
  ),
  greg_lindberg_donations AS (
    SELECT sboe_id, SUM(amount) AS total
    FROM receipts
    INNER JOIN greg_lindberg USING (payer_id)
    GROUP BY sboe_id
  )

SELECT
  greg_lindberg_donations.sboe_id,
  committees.committee_name,
  greg_lindberg_donations.total
FROM greg_lindberg_donations
LEFT JOIN committees USING (sboe_id)
ORDER BY total DESC
LIMIT 10
```

```{r table-gl-donations, echo = FALSE}
library(gt)

gt(gl_donations) |>
  cols_label(
    sboe_id = "SBOE ID",
    committee_name = "Committee Name",
    total = "Total Donations"
  ) |>
  fmt_currency(columns = total, decimals = 0)
```

And now, also because I'm curious,
here is a plot showing Lindberg's donations over time.
[Things went south for Lindberg](https://www.newsobserver.com/news/politics-government/article228776004.html)
early in 2019,
so it's unsurprising that his donations have dropped off after 2018.

```{sql greg-lindberg-donations-over-time, connection=con, output.var="gl_donations_over_time"}
WITH
  greg_lindberg AS (
    SELECT
      payer_id,
      org_name,
      profession,
      employers_name,
      address_lookup
    FROM receipts_payer
    WHERE (
      UPPER(org_name) LIKE 'GREG%LINDBERG%'
    )
  )

SELECT YEAR(occur_date) AS year, SUM(amount) AS total
FROM receipts
INNER JOIN greg_lindberg USING (payer_id)
GROUP BY year
```

```{r plot-gl-donations-over-time, echo = FALSE}
#| fig.width: 9
library(ggplot2)

ggplot(gl_donations_over_time, aes(x = factor(year), y = total)) +
  geom_col(fill = "#517898") +
  labs(
    title = "Greg Lindberg's Donations",
    x = NULL,
    y = "Total Donations"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal(14, "Helvetica") +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    plot.title.position = "plot"
  )
```


```{r cleanup}
#| include: false
DBI::dbDisconnect(con, shutdown=TRUE)
```

