Science & research

The models that hold up when the question is graduate-level and the source is a dense technical paper.

Science work asks for two skills prose fluency does not train. A model has to carry a chain of technical reasoning without dropping a step, and it has to say it does not know rather than paper over the gap with a confident invention. The spread between models on both widens fast as the questions get harder.

The yardstick is GPQA Diamond: graduate-level questions in biology, chemistry, and physics, written so a non-expert with a search engine still mostly fails them. A GPQA score tells you how far a model pushes a domain-specific argument before it starts guessing. It is one of the evals inside the Intelligence Index on the models ledger, and the per-model numbers below come from there.

What the work needs

Split a research session into the jobs a model actually performs and the “smartest” single pick dissolves. Each job rewards a different trait.

Technical reading: Summarizing a paper, extracting a method, comparing two results. This leans on the context window more than on raw reasoning: the answer sits in the text, and the job is to read it carefully. See Long context & RAG when the document runs long.
Graduate-level questions: Closed-book domain reasoning, the GPQA shape. Parameter count and tuning dominate here. A small model will sound plausible and be wrong.
Derivations and proofs: Multi-step math and formal argument. This is a reasoning job, and it belongs to Reasoning & math, where the reasoning-tuned line lives.

The models that hold up

On closed-book technical questions, capability tracks size closely, and the dense Qwen 3 line leads the locally runnable field on GPQA. Below 8B the score falls off a cliff. A 4B model is genuinely useful for reading and drafting, but it will answer a graduate physics question it has no business answering, and sound sure doing it.

GPQA Diamond on models Conifer runs locally
Model	Size	GPQA	Notes
Qwen 3 32B	32.8B	68.4	Strongest dense local pick
Qwen 3 14B	14.8B	64.0	The everyday sweet spot
Qwen 3 8B	8.2B	62.0	Holds up at a smaller footprint
R1 Distill 32B	32.8B	62.1	Reasoner, not a tool-caller
Phi-4	14.7B	56.1	Strong logic, narrower knowledge
Gemma 3 12B	12B	40.9	Better for reading than for GPQA

GPQA Diamond, % correct, published reasoning-mode figures. Full specs and speeds live on the models ledger.

With memory to spare, Qwen 3 32B is the most dependable local answer for hard domain questions. On a 32GB-or-better machine, Qwen 3 30B A3B reaches comparable depth at roughly small-model speed: a sparse mixture-of-experts model that activates only a few billion parameters per token. The pick most people should keep resident is Qwen 3 14B. It scores within a few points of the 32B on GPQA, fits comfortably, and calls tools reliably, the trait that matters the moment you point it at a folder of papers.

Reasoning depth versus domain knowledge

Two model types crowd the science use case, and they fail in opposite directions. A broadly trained instruct model like Qwen 3 carries more domain facts. A reasoning distill, the DeepSeek R1 distills or Phi-4, thinks more carefully through a derivation but knows less. Phi-4 trains heavily on curated and synthetic data, so it runs strong on logic while carrying thinner coverage of any one field. For a closed-book chemistry fact, reach for the broader model. For working a problem step by step, reach for the reasoner. The reasoning distills are not built for tool use, so keep them out of a paper-reading agent.

Research assistance with grounding

Stop asking a local model to recall and start asking it to read. Hand it the source. Retrieval over your own PDFs and notes turns a graduate question into a reading-comprehension question, which even an 8B model handles well, and it leaves you a passage to check the answer against. That pattern, plus the models and context settings that keep it fast, belongs to Long context & RAG.

For a paper-reading agent, give a 14B-or-larger instruct model a read-only grant over the folder and let it cite what it finds. Conifer treats everything a tool returns as untrusted output, so a hostile instruction buried in a downloaded PDF cannot quietly redirect the run. The grant model is covered in The grant model.

Why this matters for research

Unpublished results, a grant draft, patient data, an idea you are not ready to disclose: research runs on text you have real reasons never to paste into a cloud endpoint. Conifer runs the weights on your own hardware and nothing leaves the machine. That is the default, not a setting you have to go find. The full statement is on The local-first guarantee.

To stand up a model behind a tool that already speaks the OpenAI API, point it at the local server:

terminal

conifer serve --model qwen3-14b

For the broader decision and the other tasks in this section, start from How to choose.

What the work needs#

The models that hold up#

Reasoning depth versus domain knowledge#

Research assistance with grounding#

Why this matters for research#

What the work needs

The models that hold up

Reasoning depth versus domain knowledge

Research assistance with grounding

Why this matters for research