Multiple-choice question benchmark

To evaluate AstroSage-Llama-3.1-8B’s performance, we employed the multiple-choice question benchmark from the first paper in this series, AstroMLab 1 [11]. This benchmark consists of diverse astronomy-related questions generated from selected Annual Review of Astronomy and Astrophysics (ARAA) papers and remains, to our knowledge, the only comprehensive astronomy-specific benchmarking effort available. We refer interested readers to the original paper for detailed benchmark specifications.
Importantly, we deliberately excluded the ARAA papers from AstroSage-Llama-3.1-8B’s training dataset. This strategic exclusion enables us to evaluate the model’s broader understanding of astronomical concepts rather than its ability to recall specific information from the source materials. This approach helps ensure that the benchmark scores reflect AstroSage-Llama-3.1-8B’s genuine comprehension of astronomy rather than mere memorization of the content used to create the questions.
Our choice to primarily evaluate AstroSage-Llama-3.1-8B with a knowledge-based benchmark was motivated by two key factors. First, this benchmark represents the only extensively tested and human-vetted dataset available for astronomical knowledge assessment. Second, while astronomical knowledge recall represents just one aspect of LLM capabilities, it serves as a critical foundation for more advanced applications such as scientific agents. The primary goal is to show that proper fine-tuning of a relatively small model can significantly improve performance on a specific task, an achievement not previously demonstrated in astronomy.
The performance score is calculated as the fraction of correctly answered multiple-choice questions in the benchmarking dataset. The resulting scores are shown in Fig. 3, where round symbols represent scores for cutting-edge proprietary and open-weight models. The open-weight models are also marked with an outer circle. The x-axis displays the cost per \(10^5\) tokens, a metric chosen based on practical applications: in the first (and to our knowledge, only) implementation of astronomical agents9, analyzing a celestial source’s spectral energy distribution from James Webb Space Telescope data requires approximately \(10^5\) tokens. The top x-axis shows costs scaled to 3B (\(3\times 10^9\)) tokens, roughly equivalent to the entire astro-ph section of the arXiv. For proprietary models, we use current token costs (averaging input and output costs where they differ), while open-weight model costs are estimated based on typical pricing of commercial GPU platforms.
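For concreteness, the following minimal sketch shows how the plotted quantities can be computed; all inputs (the answer lists and per-token prices) are hypothetical placeholders, not values from our study.

```python
# Minimal sketch of the quantities plotted in Fig. 3 (all inputs hypothetical).

def benchmark_score(predicted, correct):
    """Fraction of multiple-choice questions answered correctly."""
    assert len(predicted) == len(correct)
    return sum(p == c for p, c in zip(predicted, correct)) / len(correct)

def avg_token_price(price_in_usd, price_out_usd):
    """Average of input and output prices (USD per token), as used in Fig. 3."""
    return 0.5 * (price_in_usd + price_out_usd)

def cost_per_1e5_tokens(price_in_usd, price_out_usd):
    """Bottom x-axis: cost of ~1e5 tokens (one agentic SED-analysis run)."""
    return avg_token_price(price_in_usd, price_out_usd) * 1e5

def cost_for_astro_ph(price_in_usd, price_out_usd):
    """Top x-axis: cost scaled to ~3e9 tokens (roughly all of astro-ph)."""
    return avg_token_price(price_in_usd, price_out_usd) * 3e9

# Example with made-up answers and made-up prices of $2.5/$10 per 1e6 tokens:
print(benchmark_score(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 0.75
print(cost_per_1e5_tokens(2.5e-6, 10e-6))   # ~$0.63 for 1e5 tokens
print(cost_for_astro_ph(2.5e-6, 10e-6))     # ~$18,750 for 3e9 tokens
```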
Specialized astronomical LLMs are denoted by star symbols, except for the first AstroLLaMA model [5], whose score falls below the plot’s lower limit. The bottom-right panel shows the typical uncertainty (calculated using the Wilson score interval), demonstrating that our dataset of 4425 multiple-choice questions provides sufficiently small sampling noise to establish robust performance differences. We have updated all scores using the latest model versions, following the methodology of AstroMLab 1 [11].
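For reference, the Wilson score interval behind the quoted typical uncertainty can be computed as below; this is a generic implementation evaluated at an illustrative accuracy of 0.8 with n = 4425, not our analysis code.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

lo, hi = wilson_interval(0.80, 4425)
print(f"95% interval: [{lo:.3f}, {hi:.3f}]")  # roughly +/- 1.2 percentage points
```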
The diagonal dashed lines represent a universal cost-efficiency trade-off observed across major model series (e.g., Llama, GPT, GLM) that simultaneously released models at multiple sizes. Across these model families, we consistently observe a 3.5-point improvement in performance for every 10-fold increase in cost. Adjacent dashed lines are offset by 3.5 percentage points, so each offset corresponds to a 10-fold difference in cost-effectiveness. Despite similar performance on general benchmarks, cutting-edge models can differ by up to 1000-fold in cost-effectiveness on astronomical tasks, highlighting the importance of specialized astronomical benchmarks for evaluating performance on niche technical domains.
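This empirical scaling also lets a score difference at fixed cost be translated into an equivalent cost-effectiveness factor. The helper below illustrates that conversion under the 3.5-points-per-decade relation stated above; the example inputs are illustrative only.

```python
def cost_effectiveness_gain(delta_score_points, points_per_decade=3.5):
    """Cost-effectiveness factor implied by a score difference at equal cost,
    assuming ~3.5 benchmark points per 10-fold increase in cost."""
    return 10 ** (delta_score_points / points_per_decade)

# An 8-point improvement at fixed model size corresponds to roughly
# 10**(8/3.5) ~ 200-fold (i.e. more than 100-fold) gain in cost-effectiveness.
print(cost_effectiveness_gain(8.0))   # ~193
print(cost_effectiveness_gain(3.5))   # 10 (one dashed-line offset in Fig. 3)
```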
To establish a human performance baseline, two domain experts from our team independently completed a random subset of benchmark questions under controlled conditions. The experts answered on the order of one hundred questions each, taking around 30 seconds per question. No external references, web searches, or language model assistance were used. Both experts achieved remarkably consistent scores of approximately 68%, which we designate as the “Indicative Human Domain Expert” score. The fact that most evaluated LLMs significantly surpassed this baseline demonstrates both the benchmark’s comprehensive scope and difficulty, while highlighting the remarkable capabilities of current LLMs in capturing and applying complex astronomical knowledge.
As previously noted in AstroMLab 2 [7], existing specialized astronomical LLMs (shown as open stars in Fig. 3) fail to outperform baseline models of comparable parameter size. In many cases, suboptimal specialization techniques actually led to performance degradation. In contrast, AstroSage-Llama-3.1-8B, despite its modest size of 8 billion parameters, achieved an accuracy of 80.9% on this benchmark, comparable to OpenAI’s latest flagship models (GPT-4o: 80.4%) and the best 90B-parameter open-weight Meta-Llama models (80.6%). This performance is particularly notable because AstroSage-Llama-3.1-8B achieves these results at approximately one-thousandth the inference cost of the proprietary models and one-hundredth the cost of the open-weight models. Furthermore, it demonstrates an 8-point improvement over its baseline model, Meta-Llama-3.1-8B (72.9%). To our knowledge, this represents the first demonstration of a specialized astronomical LLM achieving objectively verified improvements through model fine-tuning.
General-purpose benchmarks
To ensure our domain specialization did not compromise general capabilities, we evaluated AstroSage-Llama-3.1-8B across a comprehensive suite of standard language model benchmarks. These include IF-EVAL [25] (instruction following), BBH [26] (BIG-Bench Hard, a suite of challenging reasoning tasks), MATH [27] (mathematical reasoning), GPQA [28] (graduate-level science questions), MUSR [29] (real-world decision-making scenarios), and MMLU-PRO [30] (an expanded version of MMLU with more challenging reasoning questions). As shown in Fig. 4, our CPT+SFT model (green, initialized from the Llama-3.1 base model) initially performed below the Llama-3.1 instruct model (purple) on five out of the six non-astronomy benchmarks. This was expected, given that Meta’s proprietary SFT dataset for their instruct model likely far exceeds what is feasible for an academic research group to reproduce. The merging procedure, pulling in only 25% of its weight from Meta-Llama-3.1-8B-Instruct, allowed us to recover much of this performance deficit.
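A minimal sketch of such a merge, assuming a simple linear interpolation of parameters with 75% weight on the specialized CPT+SFT checkpoint and 25% on Meta-Llama-3.1-8B-Instruct, is shown below; the checkpoint name "our-cpt-sft-model" is a placeholder, and the actual merging pipeline may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM

# "our-cpt-sft-model" is a placeholder for the domain-specialized CPT+SFT checkpoint.
base = AutoModelForCausalLM.from_pretrained("our-cpt-sft-model",
                                            torch_dtype=torch.bfloat16)
inst = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct",
                                            torch_dtype=torch.bfloat16)

alpha = 0.25  # fraction of each parameter taken from the instruct model
with torch.no_grad():
    inst_params = dict(inst.named_parameters())
    for name, param in base.named_parameters():
        # Simplified linear merge: 75% specialized model, 25% instruct model.
        param.mul_(1.0 - alpha).add_(inst_params[name], alpha=alpha)

base.save_pretrained("merged-model")
```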
Crucially, this performance recovery through model merging did not compromise AstroSage-Llama-3.1-8B’s astronomical expertise: it maintained its 8-point improvement (representing a more than 100-fold increase in cost-effectiveness) on astronomical Q&A tasks while largely preserving capabilities across most general benchmarks. The only notable performance decrease occurred in IF-EVAL, which tests instruction following. This limited decline is unsurprising, as instruction following remains one of the more brittle capabilities in language models and likely depends heavily on the proprietary training data used in Meta’s instruct model. In fact, when comparing AstroSage-Llama-3.1-8B to BAAI/Infinity-Instruct-7M-Gen-Llama3_1-8B, the latter shows an even more severe performance deficit, highlighting how our refined training strategy and expanded SFT dataset represent crucial improvements. Ultimately, our model merging approach successfully preserved most general capabilities without sacrificing the gained astronomical expertise. This balance is essential, as it enables AstroSage-Llama-3.1-8B to engage in natural conversations and assist with broader tasks while excelling in astronomy-specific applications.
Human blind rankings
Simultaneously with [8], we performed a blinded preference ranking of AstroSage-Llama-3.1-8B against Meta-Llama-3.1-8B-Instruct. Fifteen questions were written covering diverse areas of cosmology, spanning layperson- to professional-level difficulty. Independent evaluators compared responses from AstroSage-Llama-3.1-8B and Meta-Llama-3.1-8B-Instruct, with both models receiving identical system prompts tailored to the cosmology context.
Responses were presented to the evaluators in randomized order, and their quality was rated without knowledge of the source. Three evaluators took part; they preferred the AstroSage-Llama-3.1-8B answers in 73% of cases, a statistically significant preference for AstroSage-Llama-3.1-8B over Meta-Llama-3.1-8B-Instruct.
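As a sanity check on the significance of this preference rate, a two-sided binomial test against the 50% no-preference null can be run as follows; the count of 33 preferred responses out of 45 judgments (15 questions × 3 evaluators) is inferred for illustration and may not match the exact tally or test used in the evaluation.

```python
from scipy.stats import binomtest

# Assumed counts: 3 evaluators x 15 questions = 45 pairwise judgments,
# of which ~73% (33) preferred AstroSage-Llama-3.1-8B (illustrative numbers).
n_judgments = 45
n_prefer_astrosage = 33

result = binomtest(n_prefer_astrosage, n_judgments, p=0.5, alternative="two-sided")
print(f"preference rate: {n_prefer_astrosage / n_judgments:.2f}")
print(f"two-sided p-value vs. no-preference null: {result.pvalue:.4f}")  # roughly 0.003
```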