Overall averages all categories into a single score.
File Type tests whether the model correctly maps descriptions
(e.g., “python scripts”) to the right extension (.py).
Temporal Awareness evaluates parsing of time expressions
(“3 days ago”, “last week”) into structured date ranges —
combining value (range magnitude) and direction (before/after) scores.
Specificity measures the model’s ability to determine
whether a query refers to a specific file or a broad category.