status and collection process update reports

main
Garrick Aden-Buie, 2 years ago
parent
commit
fac27f746b
No GPG key found matching this signature
9 changed files with 3983 additions and 0 deletions
  1. BIN reports/2023-09-28_example-dan-forest.xlsx
  2. BIN reports/2023-09-28_example-john-bell.xlsx
  3. BIN reports/2023-09-28_example-mitch-setzer.xlsx
  4. +74 -0 reports/2023-09-28_status-update.qmd
  5. +3645 -0 reports/2023-10-05_collect-process/2023-10-05_process.html
  6. +262 -0 reports/2023-10-05_collect-process/2023-10-05_process.qmd
  7. +2 -0 reports/2023-10-05_collect-process/_targets.yaml
  8. BIN reports/2023-10-05_collect-process/nc-sboe-cf-report-all.png
  9. BIN reports/2023-10-05_collect-process/nc-sboe-cf-search-by-year-report.png

BIN
reports/2023-09-28_example-dan-forest.xlsx


BIN
reports/2023-09-28_example-john-bell.xlsx


BIN
reports/2023-09-28_example-mitch-setzer.xlsx


+ 74
- 0
reports/2023-09-28_status-update.qmd

@@ -0,0 +1,74 @@
---
title: Status Update
author: Garrick Aden-Buie
format: pdf
---

```{r setup}
library(tidyverse)
library(fs)
```

```{r load-data}
tables <- dir_ls(here::here("data"))
names(tables) <- path_file(tables)

idx_report_list <- which(names(tables) == "report_list")

tbls_arrow <- tables[-idx_report_list] |>
  map(arrow::open_dataset, partitioning = "sboe_id")

tbls_arrow$report_list <- arrow::open_dataset(tables["report_list"])

con <- DBI::dbConnect(duckdb::duckdb())
tbls_db <- imap(tbls_arrow, \(t,n) duckdb::duckdb_register_arrow(con, n, t))
```

```{r}
tbl(con, "cover") |>
  filter(
    date_from >= "2016-01-01",
    date_from < date_to
  ) |>
  mutate(
    report = case_when(
      # Reporting periods longer than three months are labeled as semi-annual
      date_to > (date_from + months(3)) ~ paste0(year(date_from), "-SA-", ifelse(month(date_from) < 6, "1", "2")),
      TRUE ~ paste0(year(date_from), "-Q", quarter(date_from))
    )
  ) |>
  count(committee_type, report) |>
  arrange(report) |>
  ggplot() +
  aes(x = report, y = n) +
  geom_col() +
  facet_wrap(vars(committee_type), scales = "free_y")
```


```{r}
all_data_by_sboe_id <- function(sboe_id) {
  names(tables) |>
    set_names() |>
    map(function(name) {
      tbl(con, name) |>
        filter(sboe_id == !!sboe_id) |>
        collect()
    })
}


john_bell <- "STA-8S285O-C-001"
dan_forest <- "STA-M4HR0Y-C-001"
mitch_setzer <- "STA-C1877N-C-001"

jb_data <- john_bell |> all_data_by_sboe_id()
jb_data |> writexl::write_xlsx(here::here("reports/2023-09-28_example-john-bell.xlsx"))

df_data <- dan_forest |> all_data_by_sboe_id()
df_data |> writexl::write_xlsx(here::here("reports/2023-09-28_example-dan-forest.xlsx"))

ms_data <- mitch_setzer |> all_data_by_sboe_id()
ms_data |> writexl::write_xlsx(here::here("reports/2023-09-28_example-mitch-setzer.xlsx"))
```

```{r}

+ 3645
- 0
reports/2023-10-05_collect-process/2023-10-05_process.html
File diff suppressed because it is too large


+ 262
- 0
reports/2023-10-05_collect-process/2023-10-05_process.qmd

@@ -0,0 +1,262 @@
---
title: Data Collection Process
subtitle: Differentiators Data | NC Campaign Finance Data Project
author: Garrick Aden-Buie
format:
  html:
    embed-resources: true

editor:
  render-on-save: true

execute:
  echo: false
---

```{r setup}
#| include: false

pkgload::load_all(here::here("collect"))

library(ggplot2)
theme_set(theme_minimal(14, base_family = "Source Sans Pro"))

targets::tar_config_set(store = here::here("collect/_targets"))
```

## Reports by Year

The campaign finance data gathering begins by collecting a table with the list of all quarterly or semi-annual reports filed by committees in the NC SBOE system by reporting year.
The NC SBOE system allows you to search for reports by year and report type,
which I use to construct a URL that lists all reports of a given type for a given year.
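
As a rough sketch of that URL construction (not necessarily the code used in the `collect` package), a small helper can fill the year and report type into the URL pattern shown in this report. The name `report_list_url()` is hypothetical; the report type codes are the ones listed in this document.

```r
# Hypothetical helper: fills year and report type into the search URL pattern.
# `%%27` produces a literal `%27` (a URL-encoded single quote) in the output.
report_list_url <- function(year, report_type) {
  sprintf(
    "https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year=%d&reports=%%27%s%%27",
    as.integer(year), report_type
  )
}

# One URL per combination of year and report type
grid <- expand.grid(
  year = 2016:2023,
  report_type = c("RPMYSA", "RPYESA", "RPQTR1", "RPQTR2", "RPQTR3", "RPQTR4"),
  stringsAsFactors = FALSE
)
urls <- report_list_url(grid$year, grid$report_type)

report_list_url(2022, "RPQTR2")
```

Because `sprintf()` is vectorized, the same helper covers every year and report type combination in a single call.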

::: {layout="[ [1,1] ]"}

```{r}
shiny::selectInput("year", "Year", choices = 2016:2023, selected = 2022, selectize = FALSE)
```

```{r}
report_choices <- c(
  "Mid-Year Semi-Annual Report" = "RPMYSA",
  "Year-End Semi-Annual Report" = "RPYESA",
  "1st Quarter Report" = "RPQTR1",
  "2nd Quarter Report" = "RPQTR2",
  "3rd Quarter Report" = "RPQTR3",
  "4th Quarter Report" = "RPQTR4"
)
shiny::selectInput("report_type", "Report", choices = report_choices, selected = "RPQTR1", selectize = FALSE)
```

:::

```{r}
epoxy::ui_epoxy_whisker("reports_by_year", "<p><a href=\"https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year={{year}}&reports=%27{{report_type}}%27\">https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year={{year}}&reports=%27{{report_type}}%27</a></p>")
```

```{js}
function initApp () {
  const initData = {
    reports_by_year: {
      year: () => document.getElementById('year').value,
      report_type: () => document.getElementById('report_type').value
    }
  }

  const sendUpdatesToEpoxy = (outputId) => {
    const data = {...initData[outputId]}
    Object.keys(data).forEach(inputId => {
      data[inputId] = data[inputId]()
    })
    EpoxyMustache.update_all({ [outputId]: data })
  }

  Object.keys(initData).forEach(outputId => {
    sendUpdatesToEpoxy(outputId)

    Object.keys(initData[outputId]).forEach(inputId => {
      document
        .getElementById(inputId)
        .addEventListener('input', () => sendUpdatesToEpoxy(outputId))
    })
  })
}

initApp()
```

An individual link leads us to a master list of every report of a given type,
say _2nd Quarter Report_,
filed in a given year, e.g. 2022.
This table includes links to the actual report in the **Data** column on the right,
as well as a link to download the table as a `.csv` file via
the _Export list to .csv_ link at the top of the page.

![Master list of [2nd Quarter Reports filed in 2022](https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year=2022&reports=%27RPQTR2%27)](nc-sboe-cf-search-by-year-report.png)

Unfortunately,
the table that you get in the CSV does not include the link to the actual report;
instead, that field contains only the word `DATA` in place of the link.
Furthermore, neither the CSV table nor the table displayed on the web page
includes the actual report ID, although it's used to construct the URLs in the **DATA** links.

To get around these limitations,
I took advantage of the fact that the source data for the displayed table
is included in the page's HTML, which you can see when you view the page source
(right click anywhere on the page and choose "View Page Source").
Somewhere in the middle of the file is a bit of JavaScript that starts with
`var data = [ ...`.
This is essentially a JSON representation of the table data,
which I could extract and read into a table format.
This table is actually very well formatted and, importantly, includes the bare report ID
as well as the committee's SBOE ID.
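
The extraction step might be sketched as follows. This is a minimal illustration, not the project's actual code: the `html` string is a made-up stand-in for the page source, the field names and values in it are placeholders, and `extract_table_data()` is a hypothetical helper. It assumes the array ends at the first `];`, which could fail if that sequence ever appears inside a string value.

```r
library(jsonlite)

# Made-up stand-in for the page source; the real page embeds a much larger array,
# and its field names may differ from these placeholders.
html <- paste(
  "<script>",
  'var data = [{"sboe_id": "STA-XXXXXX-C-001", "report_id": "000000"}];',
  "</script>",
  sep = "\n"
)

extract_table_data <- function(html) {
  # Lazily match from `var data = [` to the first `];`
  # ((?s) lets `.` match across newlines)
  m <- regmatches(html, regexpr("(?s)var data = \\[.*?\\];", html, perl = TRUE))
  if (length(m) == 0) stop("No `var data = [...]` block found")
  # Strip the JavaScript assignment and trailing semicolon, leaving bare JSON
  json <- sub(";$", "", sub("^var data = ", "", m))
  fromJSON(json)
}

reports <- extract_table_data(html)
```

`jsonlite::fromJSON()` simplifies an array of flat objects into a data frame, so each report becomes one row.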

```{r}
# targets::tar_config_get("store")
doc_list <- tar_read(doc_list) |> select(-tar_group)
```

By scraping the table source data from every report list for years from 2016 through 2023,
and for every available report type,
I was able to put together a master list of every report of interest filed in the NC SBOE system.
We found `r epoxy::epoxy("{.comma nrow(doc_list)}")` reports in total;
here's a small preview of the values we have in this table:

```{r}
glimpse(doc_list)
```

Looking at the total reports by report type and year shows a predictable pattern:
on the whole, quarterly reports are filed in election (even) years,
and semi-annual reports are filed in non-election (odd) years.
There are a few exceptions to this pattern:
a small number of semi-annual reports are filed in even years,
and a larger, though still relatively small, number of quarterly reports are filed in odd years.

```{r}
doc_names_levels <- c(
  paste(c("First", "Second", "Third", "Fourth"), "Quarter"),
  "Mid Year Semi-Annual",
  "Year End Semi-Annual"
)

doc_list |>
  mutate(doc_name = factor(doc_name, doc_names_levels)) |>
  ggplot() +
  aes(x = year) +
  geom_bar() +
  labs(x = NULL, y = NULL, title = "Number of Reports by Year and Report Type") +
  facet_wrap(~ doc_name, ncol = 2)
```

## Reports by Committee

The next step is to collect the actual report data for each of the reports in our master list.
Fortunately, we can use the `sboe_id` and `report_id` from the previous step
to jump straight to the report page in the NC SBOE system.
Here's an example taken at random from John Bell's reports:

```{r}
john_bell <- "STA-8S285O-C-001"

jb_report_detail_url <-
  doc_list |>
  filter(sboe_id == john_bell, report_id == "164085") |>
  mutate(
    export_url = req_report_detail(report_id, "all")$url,
    url = glue::glue("https://cf.ncsbe.gov/CFOrgLkup/ReportDetail/?RID={report_id}&TP=ALL")
  )
```

![<`r sub("ExportDetailResults", "ReportDetail", jb_report_detail_url$url)`>](nc-sboe-cf-report-all.png)

This single page _appears_ to have all of the data we want in one place,
and it even has an _Export data to .csv_ link at the top of the page.
The format of that exported data is a little unusual --
it's not really a CSV file but instead a text file with several CSV files concatenated together.
Here's what you get when you download the export linked to in the above screenshot:

```{r}
htmltools::withTags(
  .noWS = c("outside", "inside"),
  div(
    class = "card mb-2",
    div(class = "card-header", paste0(jb_report_detail_url$report_id, ".txt")),
    div(
      class = "card-body",
      style = htmltools::css(maxHeight = "300px", overflowY = "scroll"),
      pre(code(paste(readLines(jb_report_detail_url$export_url), collapse = "\n")))
    )
  )
)
```
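
To give a feel for working with several CSV files concatenated into one text file, here is a small sketch that splits such a file into separate tables. It rests on a loud assumption: that sections are separated by blank lines, which may not match the real export. The sample text, its column names, and `split_csv_sections()` are all hypothetical.

```r
# Hypothetical export text: two small CSV sections separated by a blank line.
# The real export's layout and section delimiters may differ.
export_text <- c(
  "name,amount",
  "A,10",
  "B,20",
  "",
  "category,total",
  "X,30"
)

split_csv_sections <- function(lines) {
  is_blank <- trimws(lines) == ""
  # Every blank line bumps the section counter, so each run of
  # non-blank lines shares one section id
  section_id <- cumsum(is_blank)
  sections <- split(lines[!is_blank], section_id[!is_blank])
  # Parse each section as its own CSV table
  lapply(unname(sections), function(x) read.csv(text = x, stringsAsFactors = FALSE))
}

sections <- split_csv_sections(export_text)
length(sections)
```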

## Receipts and Expenditures

The full report text (a.k.a. 7 CSV files in a trench coat)
is _so so close_ to what we want,
but at scale it doesn't work out that well.
It appears that the NC SBOE hand-rolled their CSV exports,
and unfortunately the exports aren't properly formatted CSV in some edge cases.
These edge cases are mostly places where the SBOE would have reasonably expected users to enter a short text description,
and instead users have entered longer text with commas and newlines,
which break the CSV format.
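
To illustrate the kind of breakage described here, the sketch below counts the fields on each line of two tiny, made-up CSV snippets. The rows and column names are invented for illustration, and `field_counts()` is a hypothetical helper; only the failure mode (an unquoted comma inside a text field) comes from the description above.

```r
# Count comma-separated fields per line, honoring double-quoted fields
field_counts <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  count.fields(con, sep = ",", quote = "\"")
}

# Invented example rows: the free-text field contains a comma
unquoted <- c("payee,purpose,amount", "Acme,Signs, banners and flyers,100")
quoted <- c("payee,purpose,amount", "Acme,\"Signs, banners and flyers\",100")

field_counts(unquoted) # 3 4: the data row has one field too many
field_counts(quoted)   # 3 3: quoting keeps the text in a single field
```

A mismatch between the header's field count and a row's field count is a cheap way to flag rows that need special handling.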

These "all report sections" text file exports are still worth downloading
because they include several tables that we'd want to collect anyway,
and we want to reduce the number of files and HTML pages we have to request from the NC SBOE system.

For the two most important tables,
`receipts` and `expenditures`,
it turns out that the "all reports page" listed above
actually calls out to two unlisted, internal API endpoints
to create the Receipts and Expenditures tables.
These endpoints return JSON data which is guaranteed to be correctly formatted
and easier to process into a table format.

They say _there's no such thing as a free lunch_,
and apparently this rule also applies to the NC SBOE campaign finance website.
These API endpoints return at most 300 records at a time,
meaning that for most reports we have to make multiple requests
to gather all of the report data.
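
The paging logic can be sketched like this. It is a simplified illustration under stated assumptions, not the project's code: `fetch_page()` and `fake_records` are hypothetical stand-ins for a real request to the endpoint, and the page argument is assumed to be a zero-based page index.

```r
page_size <- 300

# Hypothetical stand-in for one API request: returns the rows for a
# zero-based page index from a local data frame instead of the network.
fake_records <- data.frame(id = seq_len(725))
fetch_page <- function(page) {
  start <- page * page_size + 1
  if (start > nrow(fake_records)) return(fake_records[0, , drop = FALSE])
  end <- min(start + page_size - 1, nrow(fake_records))
  fake_records[start:end, , drop = FALSE]
}

fetch_all_pages <- function(fetch_page, page_size = 300) {
  pages <- list()
  page <- 0
  repeat {
    rows <- fetch_page(page)
    pages[[length(pages) + 1]] <- rows
    # A short (or empty) page means there is nothing left to request
    if (nrow(rows) < page_size) break
    page <- page + 1
  }
  do.call(rbind, pages)
}

all_records <- fetch_all_pages(fetch_page)
nrow(all_records)
```

Stopping on the first short page costs one extra request only when the record count is an exact multiple of the page size.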

Here are two examples of the URLs I used to get the receipts and expenditures data for John Bell's report:

```{r}
url_receipts_json <- req_report_receipts(jb_report_detail_url$report_id, 0)
url_expenditures_json <- req_report_expenditures(jb_report_detail_url$report_id, 0)
```

* [Receipts for report `r jb_report_detail_url$report_id`, page 1](`r url_receipts_json$url`)
* [Expenditures for report `r jb_report_detail_url$report_id`, page 1](`r url_expenditures_json$url`)

## Final Set of Files

In the end, there are three files that we download directly from the NC SBOE system
for each report:

1. The "all reports" text file export
2. The receipts JSON data, saved as a (correctly-formatted) CSV file
3. The expenditures JSON data, saved as a CSV file

Locally, I've organized these into a directory structure that looks like this,
where `` `r john_bell` `` is John Bell's committee's SBOE ID:

```
data-raw/reports/STA-8S285O-C-001
├── all
│   ├── ...
│   ├── 164085_2019-01-25.txt
│   └── ...
├── expenditures
│   ├── ...
│   ├── 164085_2019-01-25_expenditures.csv
│   └── ...
└── receipts
    ├── ...
    ├── 164085_2019-01-25_receipts.csv
    └── ...
```
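
The layout above could be generated with a helper along these lines. `report_paths()` is a hypothetical function written for illustration; the directory and file name patterns are read off the tree above, and the meaning of the date in the file name is not specified here, so it is passed through as an opaque string.

```r
# Hypothetical helper: builds the three file paths for one report,
# following the directory layout shown above.
report_paths <- function(sboe_id, report_id, date, root = "data-raw/reports") {
  stem <- paste0(report_id, "_", date)
  base <- file.path(root, sboe_id)
  c(
    all = file.path(base, "all", paste0(stem, ".txt")),
    expenditures = file.path(base, "expenditures", paste0(stem, "_expenditures.csv")),
    receipts = file.path(base, "receipts", paste0(stem, "_receipts.csv"))
  )
}

paths <- report_paths("STA-8S285O-C-001", "164085", "2019-01-25")
paths[["receipts"]]
```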

The end result is a complete set of files for every quarterly or semi-annual campaign finance report
filed with the NC SBOE.

+ 2
- 0
reports/2023-10-05_collect-process/_targets.yaml

@@ -0,0 +1,2 @@
main:
  store: /Users/garrick/Dropbox/Projects/differentiators-data/nc-campaign-finance/collect/_targets

BIN
reports/2023-10-05_collect-process/nc-sboe-cf-report-all.png
Width: 3496  |  Height: 1772  |  Size: 373KB

BIN
reports/2023-10-05_collect-process/nc-sboe-cf-search-by-year-report.png
Width: 1837  |  Height: 1011  |  Size: 411KB
