monkeSearch — Benchmark Results

LLM Query Parsing Accuracy by Model Size

Measures how well each model converts natural language file search queries (e.g. “python scripts from 3 days ago”) into structured JSON — extracting the correct file type, time range, and specificity.
Tests across 80 queries covering file type, temporal awareness, combined, and noise-resistance categories.
Each model runs the identical test suite against a llama-server endpoint, aimed at testing smol models (<3B) to support potato devices.

Overall averages all categories into a single score. File Type tests whether the model correctly maps descriptions (e.g., “python scripts”) to the right extension (.py). Temporal Awareness evaluates parsing of time expressions (“3 days ago”, “last week”) into structured date ranges — combining value (range magnitude) and direction (before/after) scores. Specificity measures the model’s ability to determine whether a query refers to a specific file or a broad category.

Structured Extraction Benchmark

LLM Query Parsing Accuracy by Model Size

Model Results