|
- ---
- title: Data Collection Process
- subtitle: Differentiators Data | NC Campaign Finance Data Project
- author: Garrick Aden-Buie
- format:
- html:
- embed-resources: true
-
- editor:
- render-on-save: true
-
- execute:
- echo: false
- ---
-
- ```{r setup}
- #| include: false
-
- pkgload::load_all(here::here("collect"))
-
- library(ggplot2)
- theme_set(theme_minimal(14, base_family = "Source Sans Pro"))
-
- targets::tar_config_set(store = here::here("collect/_targets"))
- ```
-
- ## Reports by Year
-
- Data collection begins with a table listing every quarterly or semi-annual report filed by committees in the NC SBOE system, organized by reporting year.
- The NC SBOE system allows you to search for reports by year and report type,
- which I use to construct a URL that lists all reports of a given type for a given year.
-
- ::: {layout="[ [1,1] ]"}
-
- ```{r}
- shiny::selectInput("year", "Year", choices = 2016:2023, selected = 2022, selectize = FALSE)
- ```
-
- ```{r}
- report_choices <- c(
-   "Mid-Year Semi-Annual Report" = "RPMYSA",
-   "Year-End Semi-Annual Report" = "RPYESA",
-   "1st Quarter Report" = "RPQTR1",
-   "2nd Quarter Report" = "RPQTR2",
-   "3rd Quarter Report" = "RPQTR3",
-   "4th Quarter Report" = "RPQTR4"
- )
- shiny::selectInput("report_type", "Report", choices = report_choices, selected = "RPQTR1", selectize = FALSE)
- ```
-
- :::
-
- ```{r}
- epoxy::ui_epoxy_whisker("reports_by_year", "<p><a href=\"https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year={{year}}&reports=%27{{report_type}}%27\">https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year={{year}}&reports=%27{{report_type}}%27</a></p>")
- ```
-
- ```{js}
- function initApp () {
-   const initData = {
-     reports_by_year: {
-       year: () => document.getElementById('year').value,
-       report_type: () => document.getElementById('report_type').value
-     }
-   }
-
-   const sendUpdatesToEpoxy = (outputId) => {
-     const data = {...initData[outputId]}
-     Object.keys(data).forEach(inputId => {
-       data[inputId] = data[inputId]()
-     })
-     EpoxyMustache.update_all({ [outputId]: data })
-   }
-
-   Object.keys(initData).forEach(outputId => {
-     sendUpdatesToEpoxy(outputId)
-
-     Object.keys(initData[outputId]).forEach(inputId => {
-       document
-         .getElementById(inputId)
-         .addEventListener('input', () => sendUpdatesToEpoxy(outputId))
-     })
-   })
- }
-
- initApp()
- ```
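-
- The same URL can also be built directly in R.
- This is a minimal sketch that follows the URL pattern shown above;
- the helper name `reports_by_year_url()` is hypothetical and is not part of the project's `collect` package.
-
- ```r
- # Hypothetical helper: build the report-list URL for a year and report type.
- # "%%27" is a literal "%27" (a URL-encoded single quote) in sprintf().
- reports_by_year_url <- function(year, report_type) {
-   sprintf(
-     "https://cf.ncsbe.gov/CFDocLkup/DocumentResult/?year=%d&reports=%%27%s%%27",
-     as.integer(year),
-     report_type
-   )
- }
-
- url_2022_q2 <- reports_by_year_url(2022, "RPQTR2")
- url_2022_q2
- ```
-
- Because `sprintf()` is vectorized, the same helper accepts vectors of years or report types.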
-
- Each link leads to a master list of every report of a given type,
- say _2nd Quarter Report_,
- filed in a given year, e.g. 2022.
- This table includes a link to the actual report in the **Data** column on the right,
- as well as a link to download the table as a `.csv` file via
- the _Export list to .csv_ link at the top of the page.
-
- ![](nc-sboe-cf-search-by-year-report.png)
-
- Unfortunately,
- the table that you get in the CSV does not include the link to the actual report;
- that field contains only the word `DATA` in place of the link.
- Furthermore, neither the CSV table nor the table displayed on the web page
- includes the actual report ID, although it's used to construct the URLs in the **DATA** links.
-
- To get around these limitations,
- I discovered that the source data for the displayed table is included in the HTML file
- when you view the page source
- (right click anywhere on the page and choose "View Page Source").
- Somewhere in the middle of the file is a bit of JavaScript that starts with
- `var data = [ ...`.
- This is essentially a JSON representation of the table data,
- which I could extract and read into a table.
- This table is actually very well formatted and, importantly, includes the bare report ID
- as well as the committee's SBOE ID.
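-
- A rough sketch of that extraction step, under some loud assumptions:
- the page source below is a made-up stand-in, the field names `report_id` and `sboe_id` are hypothetical,
- and the pattern assumes the array ends at the first `];`.
- The function name `extract_embedded_table()` is also hypothetical; only `jsonlite::fromJSON()` is a real library call.
-
- ```r
- library(jsonlite)
-
- # Made-up page source for illustration; real pages and field names will differ.
- html <- paste(
-   "<script>",
-   'var data = [{"report_id": "000001", "sboe_id": "STA-AAAAAA-C-001"},',
-   '            {"report_id": "000002", "sboe_id": "STA-BBBBBB-C-001"}];',
-   "</script>",
-   sep = "\n"
- )
-
- extract_embedded_table <- function(html) {
-   # (?s) lets "." match newlines; the lazy match stops at the first "];"
-   found <- regmatches(html, regexpr("(?s)var data = \\[.*?\\];", html, perl = TRUE))
-   if (length(found) == 0) stop("No embedded `var data = [...]` array found")
-   # Strip the JavaScript assignment and trailing semicolon, leaving plain JSON
-   json <- sub("^var data = ", "", sub(";$", "", found))
-   fromJSON(json)
- }
-
- reports <- extract_embedded_table(html)
- ```
-
- An array of flat JSON objects is simplified by `fromJSON()` into a data frame with one row per object.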
-
- ```{r}
- # targets::tar_config_get("store")
- doc_list <- tar_read(doc_list) |> select(-tar_group)
- ```
-
- By scraping the table source data from every report list for years from 2016 through 2023,
- and for every available report type,
- I was able to put together a master list of every report of interest filed in the NC SBOE system.
- We found `r epoxy::epoxy("{.comma nrow(doc_list)}")` reports in total;
- here's a small preview of the values we have in this table:
-
- ```{r}
- glimpse(doc_list)
- ```
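-
- To give a sense of the scraping scope, here is a small sketch that enumerates every year and report type combination covered,
- using the report codes from the selector above.
- The object name `listing_pages` is hypothetical; each row corresponds to one report-list page to scrape.
-
- ```r
- # Report type codes as used in the NC SBOE report search
- report_types <- c("RPMYSA", "RPYESA", "RPQTR1", "RPQTR2", "RPQTR3", "RPQTR4")
-
- # One row per (year, report type) pair
- listing_pages <- expand.grid(
-   year = 2016:2023,
-   report_type = report_types,
-   stringsAsFactors = FALSE
- )
-
- nrow(listing_pages)  # 8 years x 6 report types = 48
- ```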
-
- Looking at the total reports by report type and year shows a predictable pattern:
- on the whole, quarterly reports are filed in election (even) years,
- and semi-annual reports are filed in non-election (odd) years.
- There are a few exceptions to this pattern:
- a small number of semi-annual reports are filed in even years,
- and a larger but still relatively small number of quarterly reports are filed in odd years.
-
- ```{r}
- doc_names_levels <- c(
-   paste(c("First", "Second", "Third", "Fourth"), "Quarter"),
-   "Mid Year Semi-Annual",
-   "Year End Semi-Annual"
- )
-
- doc_list |>
-   mutate(doc_name = factor(doc_name, doc_names_levels)) |>
-   ggplot() +
-   aes(x = year) +
-   geom_bar() +
-   labs(x = NULL, y = NULL, title = "Number of Reports by Year and Report Type") +
-   facet_wrap(~ doc_name, ncol = 2)
- ```
-
- ## Reports by Committee
-
- The next step is to collect the actual report data for each of the reports in our master list.
- Fortunately, we can use the `sboe_id` and `report_id` from the previous step
- to jump straight to the report page in the NC SBOE system.
- Here's an example taken at random from John Bell's reports:
-
- ```{r}
- john_bell <- "STA-8S285O-C-001"
-
- jb_report_detail_url <-
-   doc_list |>
-   filter(sboe_id == john_bell, report_id == "164085") |>
-   mutate(
-     export_url = req_report_detail(report_id, "all")$url,
-     url = glue::glue("https://cf.ncsbe.gov/CFOrgLkup/ReportDetail/?RID={report_id}&TP=ALL")
-   )
- ```
-
- 
-
- This single page _appears_ to have all of the data we want in one place,
- and it even has an _Export data to .csv_ link at the top of the page.
- The format of that exported data is a little unusual --
- it's not really a CSV file but instead a text file with several CSV files concatenated together.
- Here's what you get when you download the export linked to in the above screenshot:
-
- ```{r}
- htmltools::withTags(
-   .noWS = c("outside", "inside"),
-   div(
-     class = "card mb-2",
-     div(class = "card-header", paste0(jb_report_detail_url$report_id, ".txt")),
-     div(
-       class = "card-body",
-       style = htmltools::css(maxHeight = "300px", overflowY = "scroll"),
-       pre(code(paste(readLines(jb_report_detail_url$export_url), collapse = "\n")))
-     )
-   )
- )
- ```
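-
- As an illustration of the "several CSV files in one text file" idea, here is a sketch that splits such a file into separate tables.
- It assumes, purely for illustration, that sections are separated by blank lines and that each section starts with its own header row;
- the actual export layout may differ.
- The sample text and the function name `split_concatenated_csv()` are made up.
-
- ```r
- # Toy stand-in for an export; the real section layout is an assumption here.
- export_lines <- c(
-   "name,amount",
-   "Alice,100",
-   "Bob,250",
-   "",
-   "payee,purpose,amount",
-   "Print Shop,Flyers,75"
- )
-
- split_concatenated_csv <- function(lines) {
-   # Every blank line bumps the group counter, so each section gets its own id
-   group <- cumsum(lines == "")
-   keep <- lines != ""
-   sections <- split(lines[keep], group[keep])
-   # Parse each section as its own CSV table
-   lapply(unname(sections), function(x) {
-     read.csv(text = paste(x, collapse = "\n"))
-   })
- }
-
- tables <- split_concatenated_csv(export_lines)
- length(tables)
- ```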
-
- ## Receipts and Expenditures
-
- The full report text (a.k.a. 7 CSV files in a trench coat)
- is _so so close_ to what we want,
- but at scale it doesn't work out that well.
- It appears that the NC SBOE hand-rolled its CSV exports,
- and unfortunately the exports aren't properly formatted CSV files in some edge cases.
- These edge cases are mostly places where the SBOE would have reasonably expected users to enter a short text description,
- and instead users have entered longer text with commas and newlines,
- which break the CSV format.
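-
- To make the failure mode concrete, here is a small sketch with made-up rows
- showing how an unquoted comma in a free-text field changes the number of fields on a line.
- Counting fields per line with `count.fields()` is one simple way to flag such rows; it is not necessarily how this project detects them.
-
- ```r
- # Made-up rows: the last "purpose" value contains unquoted commas
- rows <- c(
-   "payee,purpose,amount",
-   "Print Shop,Flyers,75",
-   "Caterer,Food, drinks, and setup,300"
- )
-
- con <- textConnection(rows)
- n_fields <- count.fields(con, sep = ",")
- close(con)
-
- # Lines whose field count differs from the header row are suspect
- bad_rows <- which(n_fields != n_fields[1])
- bad_rows
- ```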
-
- These "all report sections" text file exports are worth downloading
- because they include several tables that we'd want to collect anyway,
- and we want to reduce the number of files and HTML pages we have to request from the NC SBOE system.
-
- For the two most important tables,
- `receipts` and `expenditures`,
- it turns out that the report detail page shown above
- actually calls out to two unlisted, internal API endpoints
- to create the Receipts and Expenditures tables.
- These endpoints return JSON data which is guaranteed to be correctly formatted
- and easier to process into a table format.
-
- They say _there's no such thing as a free lunch_,
- and apparently this rule also applies to the NC SBOE campaign finance website.
- These API endpoints return at most 300 records at a time,
- meaning that for most reports we have to make multiple requests
- to gather all of the report data.
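-
- The paging logic can be sketched roughly as follows.
- This is not the project's implementation: `fetch_all_pages()` and `fake_fetch()` are hypothetical,
- the zero-based page index is an assumption, and the "request" is simulated with a local data frame so the example runs offline.
-
- ```r
- # Keep requesting pages until one comes back with fewer than `page_size` rows.
- fetch_all_pages <- function(fetch_page, page_size = 300) {
-   pages <- list()
-   page <- 0
-   repeat {
-     records <- fetch_page(page)
-     pages[[length(pages) + 1]] <- records
-     # A short (or empty) page means there is nothing left to request
-     if (nrow(records) < page_size) break
-     page <- page + 1
-   }
-   do.call(rbind, pages)
- }
-
- # Stand-in for an HTTP request: serves slices of a fake 650-row table
- fake_records <- data.frame(id = seq_len(650))
- fake_fetch <- function(page, page_size = 300) {
-   start <- page * page_size + 1
-   if (start > nrow(fake_records)) return(fake_records[0, , drop = FALSE])
-   end <- min(start + page_size - 1, nrow(fake_records))
-   fake_records[start:end, , drop = FALSE]
- }
-
- all_records <- fetch_all_pages(fake_fetch)
- nrow(all_records)
- ```
-
- Stopping on a short page costs one extra request when the total is an exact multiple of the page size,
- but it avoids needing to know the record count in advance.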
-
- Here are two examples of the URLs I used to get the receipts and expenditures data for John Bell's report:
-
- ```{r}
- url_receipts_json <- req_report_receipts(jb_report_detail_url$report_id, 0)
- url_expenditures_json <- req_report_expenditures(jb_report_detail_url$report_id, 0)
- ```
-
- * [Receipts for report `r jb_report_detail_url$report_id`, page 1](`r url_receipts_json$url`)
- * [Expenditures for report `r jb_report_detail_url$report_id`, page 1](`r url_expenditures_json$url`)
-
- ## Final Set of Files
-
- In the end, there are three files that we download directly from the NC SBOE system
- for each report:
-
- 1. The "all report sections" text file export
- 2. The receipts JSON data, saved as a (correctly-formatted) CSV file
- 3. The expenditures JSON data, saved as a CSV file
-
- Locally, I've organized these into a directory structure that looks like this,
- where `` `r john_bell` `` is John Bell's committee's SBOE ID:
-
- ```
- data-raw/reports/STA-8S285O-C-001
- ├── all
- │   ├── ...
- │   ├── 164085_2019-01-25.txt
- │   └── ...
- ├── expenditures
- │   ├── ...
- │   ├── 164085_2019-01-25_expenditures.csv
- │   └── ...
- └── receipts
-     ├── ...
-     ├── 164085_2019-01-25_receipts.csv
-     └── ...
- ```
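-
- The naming scheme in the tree above can be reproduced with a small helper.
- This is a sketch inferred from the file names shown; `report_paths()` is a hypothetical name,
- and the `date` argument simply stands for the date component that appears in each file name.
-
- ```r
- # Hypothetical helper: build the three local file paths for one report
- report_paths <- function(sboe_id, report_id, date, root = "data-raw/reports") {
-   stem <- paste(report_id, date, sep = "_")
-   c(
-     all = file.path(root, sboe_id, "all", paste0(stem, ".txt")),
-     expenditures = file.path(root, sboe_id, "expenditures", paste0(stem, "_expenditures.csv")),
-     receipts = file.path(root, sboe_id, "receipts", paste0(stem, "_receipts.csv"))
-   )
- }
-
- paths <- report_paths("STA-8S285O-C-001", "164085", "2019-01-25")
- paths[["receipts"]]
- ```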
-
- The end result is a complete set of files for every quarterly or semi-annual campaign finance report
- filed with the NC SBOE.
|