Analytical Tools & Software Used
Methodology: Data Preparation & Preprocessing
1) Dataset grain: entity-year (t)
FAC files are naturally organized around submission/report records, so the first step was building a clean panel at the right grain: one row per entity per audit year.
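As a sketch (with illustrative column names, not the actual FAC schema), collapsing raw submission records to this grain in pandas might look like:

```python
import pandas as pd

# Hypothetical raw submission records: the same entity-year can appear more than once.
raw = pd.DataFrame({
    "ein": ["11-111", "11-111", "22-222", "22-222"],
    "audit_year": [2019, 2019, 2019, 2020],
    "total_expended": [1_000_000, 1_000_000, 250_000, 300_000],
})

# One row per entity per audit year.
panel = (
    raw.drop_duplicates(subset=["ein", "audit_year"])
       .sort_values(["ein", "audit_year"])
       .reset_index(drop=True)
)
```

Here the duplicate (11-111, 2019) record collapses away, leaving three clean panel rows.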
2) Entity identifier strategy (EIN vs UEI)
Older years (2019-2021) contain many placeholder UEIs (e.g., GSA_MIGRATION), making UEI unreliable for identifying unique entities in those years. To keep continuity across 2019-2022:
- We used EIN as the stable entity key for modeling across years.
- UEI was treated as supplemental metadata (and crosswalked where clean in 2022).
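A minimal sketch of this keying strategy, treating the GSA_MIGRATION placeholder as missing (column names here are hypothetical):

```python
import pandas as pd

# Hypothetical entity records; GSA_MIGRATION is the placeholder seen in older years.
entities = pd.DataFrame({
    "ein": ["11-111", "22-222", "33-333"],
    "uei": ["GSA_MIGRATION", "ABC123DEF456", "GSA_MIGRATION"],
})

# EIN stays the modeling key; placeholder UEIs become missing supplemental metadata.
entities["entity_key"] = entities["ein"]
entities["uei_clean"] = entities["uei"].where(entities["uei"] != "GSA_MIGRATION")
```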
3) Outcome creation: findings at t+1
We created:
- current-year outcomes: has_finding_t, finding_count_t
- next-year labels: y_has_finding_t1, built by shifting outcomes forward one year within each entity
Only years with observable next-year outcomes were kept for supervised training:
- Predictors from 2019-2021
- Labels from 2020-2022
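The shift-forward step above can be sketched with a pandas groupby (a hypothetical panel with illustrative column names; note this simple version assumes consecutive audit years per entity):

```python
import pandas as pd

# Hypothetical entity-year panel with current-year outcomes.
panel = pd.DataFrame({
    "ein": ["A", "A", "A", "B", "B"],
    "audit_year": [2019, 2020, 2021, 2020, 2021],
    "has_finding_t": [0, 1, 1, 0, 1],
})

panel = panel.sort_values(["ein", "audit_year"])
# Next-year label: pull each entity's following-year outcome onto the current row.
# shift(-1) assumes consecutive years; gaps would need explicit handling.
panel["y_has_finding_t1"] = panel.groupby("ein")["has_finding_t"].shift(-1)

# Keep only rows whose next-year outcome is observable (drops each entity's last year).
train = panel.dropna(subset=["y_has_finding_t1"])
```

Each entity's final observed year falls out of the training set, which is exactly why predictors stop at 2021 while labels run through 2022.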
4) Feature engineering (award-based predictors)
From the federal_awards data, we aggregated program activity into entity-year features such as:
- total amount expended (total_expended)
- number of award lines (award_lines)
- breadth across agencies and programs (distinct_agencies, distinct_programs)
- structural indicators (direct awards, major programs, pass-through, loans)
- concentration signals (e.g., max_program_total)
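A minimal aggregation sketch in pandas (column names such as agency_prefix and is_direct are hypothetical stand-ins for the actual award-line fields):

```python
import pandas as pd

# Hypothetical award-line records: one row per award line.
awards = pd.DataFrame({
    "ein": ["A", "A", "A", "B"],
    "audit_year": [2020, 2020, 2020, 2020],
    "agency_prefix": ["10", "10", "93", "93"],
    "program_number": ["10.001", "10.002", "93.778", "93.778"],
    "amount_expended": [100_000, 50_000, 200_000, 75_000],
    "is_direct": [1, 0, 1, 1],
})

# Roll award lines up to the entity-year grain.
features = awards.groupby(["ein", "audit_year"]).agg(
    total_expended=("amount_expended", "sum"),
    award_lines=("amount_expended", "size"),
    distinct_agencies=("agency_prefix", "nunique"),
    distinct_programs=("program_number", "nunique"),
    direct_award_lines=("is_direct", "sum"),
).reset_index()

# Concentration signal: largest single program total per entity-year.
program_totals = awards.groupby(["ein", "audit_year", "program_number"])["amount_expended"].sum()
max_prog = program_totals.groupby(["ein", "audit_year"]).max().rename("max_program_total")
features = features.merge(max_prog.reset_index(), on=["ein", "audit_year"])
```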
Model Choice and Performance
Chosen model: HistGradientBoosting (HGB)
We selected a HistGradientBoostingClassifier, an effective supervised learning model from the scikit-learn Python library, because it excels at detecting complex patterns in large tabular datasets (for example, how award complexity and prior findings combine to elevate risk).
Performance metrics
We evaluated using two standard classification ranking metrics: ROC-AUC and PR-AUC.
ROC-AUC interpretation: if you randomly choose one entity-year that will have findings next year and one that won't, the model ranks the "will have findings" case higher about 77% of the time.
PR-AUC is especially useful when the positive outcome is relatively uncommon (audit findings do not occur for every entity). A PR-AUC of 0.5439 indicates the model does a solid job of concentrating true positives toward the top of the ranking, which is exactly what prioritization needs.
For comparison, here's a logistic regression model run on the same data as a baseline:
- ROC-AUC = 0.7575
- PR-AUC = 0.5044
The HGB model provided a meaningful lift, especially on PR-AUC.
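Both metrics come straight from scikit-learn; a sketch on synthetic labels and scores (average_precision_score is the usual PR-AUC implementation):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Noisy scores correlated with the label, standing in for model output.
scores = y_true * 0.6 + rng.random(500) * 0.8

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)  # "PR-AUC" here = average precision
```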
Interpreting What the Model Learned
We also want the model to be explainable rather than a black box, so we use permutation feature importance.
Permutation importance measures how much performance drops when a feature is shuffled. The top signals included:
- prior findings indicators (whether findings occurred, and how many)
- award structure/complexity (e.g., direct award lines, major award lines)
- breadth across programs/agencies (distinct programs, agencies)
- program concentration (max_program_total)
How This Output Can Be Used
This model supports more targeted oversight while keeping final decisions with program staff, auditors, and policymakers.
Example applications:
- Funding review triggers: When risk is elevated, require additional documentation, consistency checks, or program review before moving forward.
- Repeat findings workflow step: For entities with recurring issues, make a risk-model check a standard part of reviewing new awards.
- Targeted outreach/support: Prioritize technical assistance or compliance support for high-risk entities.
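Operationally, any of these applications reduces to a triage rule over model scores; a minimal sketch (the quartile threshold is an illustrative choice, not a recommendation):

```python
import pandas as pd

# Hypothetical model scores for four entities.
scored = pd.DataFrame({
    "ein": ["A", "B", "C", "D"],
    "risk_score": [0.82, 0.15, 0.64, 0.40],
})

# Simple triage rule: flag the top quartile of scores for enhanced review.
threshold = scored["risk_score"].quantile(0.75)
scored["flag_for_review"] = scored["risk_score"] >= threshold
priority = scored.sort_values("risk_score", ascending=False)
```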
Recommendations for Improvement
This short case study is just a taste of how powerful analytical tools can transform data into valuable insights. Some ways this model could be made even more effective include:
- Expanding the time window: Incorporate more audit years to increase training data, improve stability, and enable stronger "train on past → test on future" validation across multiple years
- Refining the outcome: Move beyond "any findings" by predicting more specific outcomes where possible - such as repeat findings, higher-volume findings, or finding categories (e.g., going concern issues, material weaknesses) - to make the risk score more actionable for oversight workflows