Igor's Techno Club

Using the Autoresearch Project to Build the Fastest Java Decompiler

Most people first see Andrej's Autoresearch project as an ML autotuning setup: an agent edits one file (train.py), runs short experiments, and keeps only measurable improvements. Under the hood, though, the real value is not "LLM training." The real value is the architecture: a closed-loop research system with explicit goals, constrained change scope, objective evaluation, and hard keep/revert rules.

That pattern transfers cleanly to systems work, including decompiler optimization, the pivotal part of Jar.Tools. I called my decompilation engine IPND, and I wanted it to be the fastest way to decompile a Java class into readable Java source code.

The Core Architecture (Domain-Agnostic)

At a high level, this project separates policy from execution:

flowchart LR
    A[Human defines objective + constraints] --> B[Program spec / playbook]
    B --> C[Agent proposes code change]
    C --> D[Run harness]
    D --> E[Collect metrics]
    E --> F{Beats baseline?}
    F -- yes --> G[Keep change]
    F -- no --> H[Discard/Revert]
    G --> I[Update baseline + log]
    H --> I
    I --> C

Three design choices make this robust across domains:

  1. Fixed evaluation protocol: same benchmark shape each iteration, so comparisons stay valid.
  2. Explicit baseline: every candidate is judged relative to a known reference, not gut feeling.
  3. Tight loop latency: faster iteration means more hypotheses tested per hour.

In ML, the metric is validation bits-per-byte. In decompiler work, the metric can be latency, memory, correctness parity, or all three.
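A multi-objective keep/revert gate is small enough to sketch directly. The struct shape, tolerance value, and function names below are illustrative, not taken from the actual harness:

```rust
/// Metrics collected from one candidate run (illustrative shape).
#[derive(Clone, Copy)]
struct RunMetrics {
    latency_ms: f64,
    peak_mem_mb: f64,
    correctness_ok: bool,
}

/// Keep a candidate only if correctness holds, neither latency nor memory
/// regresses beyond a small tolerance (here 1%), and at least one of the
/// two actually improves.
fn keep_candidate(baseline: RunMetrics, candidate: RunMetrics) -> bool {
    const TOLERANCE: f64 = 1.01;
    candidate.correctness_ok
        && candidate.latency_ms <= baseline.latency_ms * TOLERANCE
        && candidate.peak_mem_mb <= baseline.peak_mem_mb * TOLERANCE
        && (candidate.latency_ms < baseline.latency_ms
            || candidate.peak_mem_mb < baseline.peak_mem_mb)
}
```

Requiring a strict improvement on at least one axis is what turns the loop into a ratchet: a candidate that merely matches baseline gets discarded, so churn without progress never accumulates.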

Mapping the Architecture to the Decompiler Project

For the decompiler, I used the same loop but swapped in system-level targets: latency, memory, and correctness parity instead of validation loss.

The implementation already has natural seams for this:

flowchart TD
    C1[Client: class/jar upload] --> A1[API routes]
    A1 --> A2[Auth + input normalization]
    A2 --> A3{Class sync path or Jar async job}
    A3 -->|Class| C2[Parse classfile + emit source]
    A3 -->|Jar| J1[Create job + persist upload]
    J1 --> J2[Worker decompile loop]
    J2 --> K1[Core parser/decompiler/emitter]
    K1 --> Z1[Artifact ZIP + SUMMARY.md]
    Z1 --> R1[Status + download endpoints]

This is exactly what makes the architecture reusable: once a system has deterministic entry points and measurable outputs, it can be optimized by the same research loop regardless of domain.

How I Used It for Decompiler Logic Improvements

The practical cycle looked like this:

  1. Establish baseline with fixed corpus and repeat count.
  2. Profile CPU and memory hotspots (perf, heaptrack, runtime summaries).
  3. Hypothesize a change (for example: zip writer mode, decompile path behavior, branch coverage for edge cases).
  4. Patch and validate with tests and coverage gates.
  5. Re-benchmark and compare against baseline.
  6. Keep only measurable wins.
flowchart LR
    B[Baseline run] --> P[CPU + memory profiling]
    P --> H[Hotspot hypothesis]
    H --> X[Code change]
    X --> T[Tests + coverage]
    T --> R[Benchmark rerun]
    R --> D{Latency/memory better and correctness intact?}
    D -- yes --> K[Keep + document delta]
    D -- no --> N[Drop/iterate]
    K --> B
    N --> H
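Steps 1 and 5 only work if the measurement protocol stays fixed. A minimal sketch of a repeat-count benchmark that separates the cold first pass from warm passes (the function name and signature are mine, not from the harness):

```rust
use std::time::Instant;

/// Run a workload `passes` times and return mean latencies in milliseconds
/// as (cold first pass, mean of warm passes). Keeping `passes` identical
/// across iterations is what keeps baseline comparisons valid.
fn bench_cold_warm<F: FnMut()>(mut run: F, passes: usize) -> (f64, f64) {
    assert!(passes >= 2, "need at least one cold and one warm pass");
    let mut times_ms = Vec::with_capacity(passes);
    for _ in 0..passes {
        let start = Instant::now();
        run();
        times_ms.push(start.elapsed().as_secs_f64() * 1e3);
    }
    let warm_mean = times_ms[1..].iter().sum::<f64>() / (passes - 1) as f64;
    (times_ms[0], warm_mean)
}
```

Splitting cold from warm matters later in the CFR comparison: it shows whether a speedup survives once caches and (for JVM tools) JIT warm-up have settled.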

This gave us concrete, decision-ready metrics instead of anecdotal "feels faster" claims. The sections below walk through the outcomes from the class decompilation track.

Notable Speed Changes With Code Examples

Below are concrete code-level changes that helped performance in the decompiler path.

1) Parallelize method decompilation only when class size justifies it

In crates/core/src/emit/mod.rs, method bodies are decompiled in parallel only for sufficiently large classes. Small classes stay serial to avoid scheduler overhead.

fn should_parallelize_method_decompile(coded_method_count: usize, total_code_bytes: usize) -> bool {
    coded_method_count >= 24 && total_code_bytes >= 12_000 && method_decompile_parallelism() > 1
}

if should_parallelize_method_decompile(coded_methods.len(), total_code_bytes) {
    let results = coded_methods
        .par_iter() // rayon parallel iterator over (index, method) pairs
        .map(|(method_index, method)| {
            (*method_index, crate::decompile::decompile_method_v1(class, method, *method_index, decompile_opts))
        })
        .collect::<Vec<_>>();
    // write back results...
} else {
    // small classes: decompile serially to avoid scheduler overhead
}

Why it matters: the work-stealing scheduler has a fixed coordination cost per batch, so parallelism only pays off once there is enough method bytecode to amortize it; the method-count and code-size thresholds keep small classes on the cheap serial path while large classes get multi-core speedups.
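The `method_decompile_parallelism()` helper is not shown above. A plausible minimal version, assuming it reports available cores with an env-var override for benchmarking (the `IPND_METHOD_PAR` variable name is hypothetical):

```rust
use std::thread;

/// Worker count for parallel method decompilation; 1 disables parallelism.
/// IPND_METHOD_PAR is a hypothetical override used here for illustration.
fn method_decompile_parallelism() -> usize {
    std::env::var("IPND_METHOD_PAR")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .filter(|&n| n >= 1)
        .unwrap_or_else(|| thread::available_parallelism().map(|n| n.get()).unwrap_or(1))
}
```

An override like this makes the `> 1` guard in `should_parallelize_method_decompile` testable: pinning the value to 1 forces the serial path so both branches can be benchmarked on the same machine.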

2) Replace map-heavy method body storage with indexed slots

The emitter path uses vector-indexed storage for method bodies and moves values out with take(), reducing lookup and clone overhead.

let mut method_bodies: Vec<Option<crate::decompile::MethodBody>> = vec![None; class.methods.len()];
// fill method_bodies[method_index] = Some(body)

let body = method_bodies
    .get_mut(method_index)
    .and_then(|slot| slot.take());

Why it matters: take() moves each body out of its slot exactly once, so the emitter avoids both map lookups and clones on the hot path; indexing by method position is a plain bounds-checked array access instead of hashing.
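The take() semantics are easy to see in a toy, self-contained form (String stands in for the real MethodBody type):

```rust
fn main() {
    // Indexed slots: one Option per method, filled as bodies are decompiled.
    let mut bodies: Vec<Option<String>> = vec![None; 3];
    bodies[1] = Some(String::from("body of method #1"));

    // take() moves the value out and leaves None behind: no clone, no hashing.
    let body = bodies.get_mut(1).and_then(|slot| slot.take());
    assert_eq!(body.as_deref(), Some("body of method #1"));
    assert!(bodies[1].is_none()); // the slot is now empty
}
```

The Vec<Option<T>> layout also pairs naturally with the parallel path above: each worker writes to its own index, so write-back needs no synchronization on a shared map.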

3) Add no-allocation fast paths in identifier rewriting

String-rewrite utilities now bail out immediately when there is nothing to replace, instead of always allocating an output string.

fn replace_identifier_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to { return None; }
    if !source.contains(from) { return None; }
    // rewrite only if needed...
    Some(out)
}

Why it matters: most identifiers in a class never need rewriting, so the common case collapses to a substring scan that returns None and allocates nothing; the output string is built only when a replacement will actually happen.
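A self-contained toy of the same fast-path pattern, simplified to plain substring replacement (the real rewriter also respects identifier boundaries, which is elided here):

```rust
/// Returns Some(rewritten) only when a rewrite actually occurs;
/// None means the caller can keep borrowing `source` with zero allocation.
fn replace_all_if_needed(source: &str, from: &str, to: &str) -> Option<String> {
    if from.is_empty() || from == to || !source.contains(from) {
        return None; // fast path: nothing to do
    }
    Some(source.replace(from, to))
}
```

Returning Option instead of an always-fresh String pushes the allocation decision to the call site, which is where the emitter knows whether the original slice can simply be reused.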

4) Optimize artifact ZIP write path for throughput

For output packaging, I moved to low-cost compression by default and made ā€œstoredā€ mode configurable for memory-sensitive runs.

let file_options = if use_stored_artifact_entries() {
    SimpleFileOptions::default().compression_method(CompressionMethod::Stored)
} else {
    SimpleFileOptions::default()
        .compression_method(CompressionMethod::Deflated)
        .compression_level(Some(1))
};

Why it matters: deflate at level 1 captures most of the size reduction at a fraction of the CPU cost of higher levels, while stored mode removes compression work entirely for runs where memory and CPU matter more than artifact size.
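The `use_stored_artifact_entries()` call is just a configuration probe; a minimal sketch, assuming an env-var toggle (the `IPND_ZIP_STORED` variable name is hypothetical):

```rust
/// When true, artifact entries are written uncompressed ("stored"),
/// trading larger ZIPs for lower CPU during packaging.
/// IPND_ZIP_STORED is a hypothetical toggle name for this sketch.
fn use_stored_artifact_entries() -> bool {
    std::env::var("IPND_ZIP_STORED").map(|v| v == "1").unwrap_or(false)
}
```

Keeping the switch in the environment rather than in code means the benchmark harness can flip packaging modes between passes without a rebuild.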

What was measured

On our class benchmark track (largest-class/top-N comparisons), current vs baseline showed sustained improvements.

These changes were only kept when they held against baseline under the same harness and passed the regression tests.

Full-Jar Decompiler vs CFR Numbers (Current Checkout)

To compare full jar decompilation (not per-class microbenchmarks), I ran both tools against the same input jar, commons-lang3-3.14.0:

Commands used:

# IPND full-jar decompile (API worker path)
IPND_PERF_PASSES=3 IPND_PERF_BUDGET_MS=50 \
cargo test -p ipnd perf_run_decompile_job_common_jar -- --ignored --nocapture

# CFR full-jar decompile (whole-jar invocation, 3 passes)
java -jar dist/tools/cfr-0.152.jar external_jars/commons-lang3-3.14.0.jar --outputdir <tmp> --silent true

Aggregate latency results (full jar)

| Slice                       | IPND mean (ms) | CFR mean (ms) | CFR/IPND ratio |
|-----------------------------|----------------|---------------|----------------|
| Overall (all passes)        | 369.440        | 5998.479      | 16.237x        |
| Cold pass only (pass 1)     | 409.150        | 6397.316      | 15.636x        |
| Warm passes only (pass 2-3) | 349.585        | 5799.060      | 16.588x        |

Supporting percentiles from the same run set:

Output artifact context:

Interpretation: IPND completes the full-jar decompile roughly 16x faster than CFR on this corpus, and the ratio holds for cold and warm passes alike, so the gap is not just JVM warm-up cost.

Why This Architecture Scales Beyond ML and Decompilers

The pattern works anywhere you can define a measurable objective, a constrained change surface, and a repeatable evaluation harness.

That includes compilers, API backends, data pipelines, search ranking services, and frontend rendering performance.

The transferable blueprint is:

  1. Define objective as a metric, not a story.
  2. Lock evaluation protocol.
  3. Automate measurement and diffing.
  4. Require objective keep/revert decisions.
  5. Track baseline drift explicitly.

If you do just these five things, "autonomous research" stops being an ML novelty and becomes a general engineering operating model.

Under-the-Hood Components That Matter Most

A lot of teams underestimate this part. The architecture only works when each component is explicit and stable: the objective spec, the change proposer, the run harness, the metrics collector, and the keep/revert gate with its baseline log.

Without this separation, optimization efforts drift into ad-hoc debugging. With it, every iteration contributes to a cumulative research trajectory.

A Practical Template for Other Projects

If you want to adapt this architecture to a new project, start with a minimal contract:

  1. Pick one benchmark corpus that matches production pressure.
  2. Record one immutable baseline run.
  3. Define pass/fail thresholds for regressions.
  4. Automate one command that prints current vs baseline.
  5. Gate merges on those numbers.
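Step 4 of the template can start trivially small. This sketch takes the two numbers as plain values and hard-codes a 2% regression budget; the file/CI wiring and the example inputs are illustrative:

```rust
/// Compare a current metric against baseline and decide merge eligibility.
/// Returns (percent delta, pass); negative delta means improvement.
fn compare_to_baseline(baseline_ms: f64, current_ms: f64, budget_pct: f64) -> (f64, bool) {
    let delta_pct = (current_ms - baseline_ms) / baseline_ms * 100.0;
    (delta_pct, delta_pct <= budget_pct)
}

fn main() {
    // Hypothetical candidate run against the recorded baseline mean.
    let (delta, pass) = compare_to_baseline(369.44, 360.10, 2.0);
    println!("current vs baseline: {delta:+.2}% -> {}", if pass { "PASS" } else { "FAIL" });
}
```

One command printing one delta and one PASS/FAIL line is the whole merge gate; everything else in the template exists to keep those two numbers comparable.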

Once that is in place, you can scale out to multi-objective optimization (speed, memory, reliability, quality) without losing control of experiment integrity.

Closing

What started as an ML experiment loop is really a system for disciplined optimization under uncertainty. In our decompiler work, that architecture let us improve speed and memory without sacrificing correctness or API stability. The key was not domain-specific tricks; it was the loop design itself: baseline, profile, patch, verify, compare, repeat.