Choosing a model
Science & research
The models that hold up when the question is graduate-level and the source is a dense technical paper.
Science work asks for two skills prose fluency does not train. A model has to carry a chain of technical reasoning without dropping a step, and it has to say it does not know rather than paper over the gap with a confident invention. The spread between models on both widens fast as the questions get harder.
The yardstick is GPQA Diamond: graduate-level questions in biology, chemistry, and physics, written so a non-expert with a search engine still mostly fails them. A GPQA score tells you how far a model pushes a domain-specific argument before it starts guessing. It is one of the evals inside the Intelligence Index on the models ledger, and the per-model numbers below come from there.
What the work needs
Split a research session into the jobs a model actually performs and the “smartest” single pick dissolves. Each job rewards a different trait.
- Technical reading
- Summarizing a paper, extracting a method, comparing two results. This leans on the context window more than on raw reasoning: the answer sits in the text, and the job is to read it carefully. See Long context & RAG when the document runs long.
- Graduate-level questions
- Closed-book domain reasoning, the GPQA shape. Parameter count and tuning dominate here. A small model will sound plausible and be wrong.
- Derivations and proofs
- Multi-step math and formal argument. This is a reasoning job, and it belongs to Reasoning & math, where the reasoning-tuned line lives.
The models that hold up
On closed-book technical questions, capability tracks size closely, and the dense Qwen 3 line leads the locally runnable field on GPQA. Below 8B the score falls off a cliff. A 4B model is genuinely useful for reading and drafting, but it will answer a graduate physics question it has no business answering, and sound sure doing it.
| Model | Size | GPQA | Notes |
|---|---|---|---|
| Qwen 3 32B | 32.8B | 68.4 | Strongest dense local pick |
| Qwen 3 14B | 14.8B | 64.0 | The everyday sweet spot |
| Qwen 3 8B | 8.2B | 62.0 | Holds up at a smaller footprint |
| R1 Distill 32B | 32.8B | 62.1 | Reasoner, not a tool-caller |
| Phi-4 | 14.7B | 56.1 | Strong logic, narrower knowledge |
| Gemma 3 12B | 12B | 40.9 | Better for reading than for GPQA |
GPQA Diamond, % correct, published reasoning-mode figures. Full specs and speeds live on the models ledger.
With memory to spare, Qwen 3 32B is the most dependable local answer for hard domain questions. On a 32GB-or-better machine, Qwen 3 30B A3B reaches comparable depth at roughly small-model speed: a sparse mixture-of-experts model that activates only a few billion parameters per token. The pick most people should keep resident is Qwen 3 14B. It scores within a few points of the 32B on GPQA, fits comfortably, and calls tools reliably, the trait that matters the moment you point it at a folder of papers.
Reasoning depth versus domain knowledge
Two model types crowd the science use case, and they fail in opposite directions. A broadly trained instruct model like Qwen 3 carries more domain facts. A reasoning distill, the DeepSeek R1 distills or Phi-4, thinks more carefully through a derivation but knows less. Phi-4 trains heavily on curated and synthetic data, so it runs strong on logic while carrying thinner coverage of any one field. For a closed-book chemistry fact, reach for the broader model. For working a problem step by step, reach for the reasoner. The reasoning distills are not built for tool use, so keep them out of a paper-reading agent.
Research assistance with grounding
Stop asking a local model to recall and start asking it to read. Hand it the source. Retrieval over your own PDFs and notes turns a graduate question into a reading-comprehension question, which even an 8B model handles well, and it leaves you a passage to check the answer against. That pattern, plus the models and context settings that keep it fast, belongs to Long context & RAG.
For a paper-reading agent, give a 14B-or-larger instruct model a read-only grant over the folder and let it cite what it finds. Conifer treats everything a tool returns as untrusted output, so a hostile instruction buried in a downloaded PDF cannot quietly redirect the run. The grant model is covered in The grant model.
Why this matters for research
Unpublished results, a grant draft, patient data, an idea you are not ready to disclose: research runs on text you have real reasons never to paste into a cloud endpoint. Conifer runs the weights on your own hardware and nothing leaves the machine. That is the default, not a setting you have to go find. The full statement is on The local-first guarantee.
To stand up a model behind a tool that already speaks the OpenAI API, point it at the local server:
conifer serve --model qwen3-14bFor the broader decision and the other tasks in this section, start from How to choose.