Function Calling vs JSON Mode vs Plain Text

A practical comparison of plain text, JSON mode, and function calling for developers building reliable LLM applications.

If you build with LLMs long enough, you eventually face the same implementation question: should the model answer in plain text, return JSON, or call a function? This comparison is for developers who want a durable way to choose the right output method based on reliability, UX, validation needs, and system design rather than on temporary API fashion. By the end, you should be able to map each method to the right class of task, avoid common failure modes, and know when to revisit your choice as models and tools change.

Overview

There is no universal winner in the function calling vs JSON mode debate, and plain text prompting is still the right choice more often than many teams expect. The best option depends on what your application needs after generation.

A useful way to frame the decision is this:

Plain text prompting is best when the output is meant for humans first.
JSON mode is best when the model should return structured data that your application will parse.
Function calling is best when the model must choose an action or route data into a defined tool interface.

That sounds simple, but the real differences show up in production. A support assistant that drafts replies has different requirements from a ticket triage system, and both differ from an agent that looks up order status or updates a CRM. For developers, prompt engineering is not just about wording prompts well. It is also about choosing the output contract that reduces ambiguity and operational risk.

In practice, teams often make one of two mistakes. The first is overusing plain text prompting for machine-read workflows, then patching the result with fragile regex or post-processing. The second is over-structuring everything, forcing JSON or tool calls even when the user simply wants a readable answer. Both mistakes create avoidable complexity.

As a durable rule: choose the least rigid format that still meets your downstream requirements. If humans are the primary reader, start with plain text. If software must consume the response, start with structured output. If your system needs the model to select an operation from a known set of tools, start with function calling.

For a deeper look at schema design and fallback handling, see the Structured Output Prompting Guide: JSON Schemas, Validation Rules, and Failure Recovery.

How to compare options

The easiest way to compare LLM API output methods is to evaluate them against the job they need to do, not against abstract prompt engineering examples. Before choosing a method, answer five implementation questions.

1. Who is the immediate consumer of the output?

If a person is going to read the result directly, plain text prompting usually gives the best experience. It supports tone, explanation, and nuance with minimal formatting friction. If your application code is the consumer, JSON mode or function calling is usually safer.

2. Does the output need to trigger an action?

If the model is not just generating content but deciding whether to search, send, classify, route, or update a system, function calling is often the cleanest fit. It creates a clearer boundary between “decide what to do” and “do it.”

3. How strict is the schema?

Some tasks need light structure, such as a label, summary, and confidence score. Others need a strict contract with required fields, data types, and bounded values. JSON mode works well when the output is data. Function calling works well when the output is an instruction to invoke a specific interface with named arguments.

4. What happens when the model is wrong?

This question turns the choice from a style preference into a risk decision. A malformed blog outline is annoying. A malformed refund action is risky. If failure has operational consequences, favor stronger constraints, validation, and retry logic. If failure is low-risk and visible to a human, plain text may still be the best tradeoff.

5. How much explanation do you need alongside the result?

Plain text is naturally better for rich explanation. JSON can include explanation fields, but forcing every thought into key-value structure can make outputs brittle or verbose. Function calling is even more constrained: it is excellent for action selection, but often needs a second pass if you also want a polished user-facing explanation.

A practical evaluation matrix for prompt testing often includes:

Schema adherence
Task completion rate
Human readability
Error recovery effort
Latency and token overhead
Ease of logging and debugging
Security and guardrail fit
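
One way to make this matrix concrete is a weighted score per output method. The sketch below is illustrative only: the criterion identifiers are the list above turned into names, and the weights are made-up placeholders that you would replace with your own priorities. Scores are assumed to be numbers between 0 and 1 taken from your own test runs.

```python
# Hypothetical weights: how much each criterion matters for one particular task.
WEIGHTS = {
    "schema_adherence": 3,
    "task_completion": 3,
    "human_readability": 1,
    "error_recovery": 2,
    "latency_tokens": 1,
    "debuggability": 2,
    "guardrail_fit": 2,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of per-criterion scores (each assumed in 0..1)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight
```

Scoring each method on the same criteria keeps the comparison tied to the task rather than to a general preference for one format.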

If you are building a team process around prompt engineering, versioning, and evaluation, the ideas in How to Build a Prompt Playground for Your Team: Versioning, Testing, and Approval Flows are a useful companion.

Feature-by-feature breakdown

This section compares plain text prompting, JSON mode, and function calling on the issues that matter most in real LLM app development.

Plain text prompting

What it is: You ask the model for a natural language answer, optionally with formatting instructions such as bullets, Markdown, or sections.

Where it shines:

User-facing writing
Explanations, summaries, and brainstorming
Drafts that a person will review
Rapid prototyping

Strengths:

Most flexible and expressive output style
Usually simplest to prompt and debug
Works well for open-ended tasks where exact structure is not critical
Can produce more natural user experiences

Weaknesses:

Harder to parse reliably in code
Formatting can drift even with clear instructions
Downstream automation often becomes fragile
Mixed content and data are harder to validate

Best prompt engineering use: Treat plain text as a presentation format, not a data transport layer. If your app is scraping values from prose, that is usually a sign to revisit the design.
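
To see why scraping prose is fragile, consider a hypothetical pattern that pulls a priority value out of a model reply. The pattern and both reply strings below are invented for illustration; the point is only that a small change in wording silently breaks the extraction.

```python
import re
from typing import Optional

# Hypothetical pattern written against the first reply format the team saw.
PRIORITY_PATTERN = re.compile(r"Priority: (\w+)")


def scrape_priority(reply: str) -> Optional[str]:
    """Return the priority scraped from prose, or None if the pattern misses."""
    match = PRIORITY_PATTERN.search(reply)
    return match.group(1) if match else None


expected_format = "Priority: high. The customer cannot log in."
drifted_format = "I'd rate this as high priority because the customer cannot log in."

print(scrape_priority(expected_format))  # high
print(scrape_priority(drifted_format))   # None
```

Both replies carry the same information, but only one survives the regex. A structured field avoids this class of failure.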

Examples include email drafting, summarization, knowledge explanation, and report writing. For adjacent prompt patterns, see Prompt Engineering for Email Writing: Sales Outreach, Follow-Ups, and Support Replies and AI Summarizer Prompt Guide: Best Prompts for Notes, Meetings, PDFs, and Long Articles.

JSON mode

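What it is: You instruct the model to return valid JSON, often shaped by a schema or explicit field list. Depending on the API, this may guarantee only that the output is syntactically valid JSON, not that it matches your schema.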

Where it shines:

Classification outputs
Extraction tasks
Pipeline inputs
UI rendering from structured fields

Strengths:

Easier for application code to parse
Supports validation, retries, and deterministic post-processing
Good middle ground between flexibility and reliability
Works well for prompt chaining

Weaknesses:

Can still fail schema expectations unless validated
May encourage bloated designs with too many fields
Less natural for direct user-facing output
Different vendors expose structured generation differently

Best prompt engineering use: Use JSON when you want the model to produce data, not when you want it to directly act. Think extraction, scoring, categorization, and slot filling.

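A sentiment classifier is a good example. Instead of asking for a paragraph explanation first, you might request an object like the following, where the values are placeholders showing the allowed labels and expected types: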

{
  "label": "positive|negative|neutral",
  "confidence": 0.0,
  "evidence": ["..."]
}

That structure is easier to test, compare, and store. For more on this style, see Sentiment Analysis Prompt Guide: Accurate Labels, Confidence Scores, and Edge Cases.
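
A minimal validation sketch for that object might look like the following. It uses only the standard library. The function name `parse_sentiment` is hypothetical, and the rule that confidence lies between 0 and 1 is an assumption, since the template above only shows a float.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_sentiment(raw: str) -> dict:
    """Parse and validate a sentiment object; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:  # assumed range, see note above
        raise ValueError("confidence must be between 0 and 1")

    evidence = data.get("evidence")
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        raise ValueError("evidence must be a list of strings")
    return data
```

In a larger project a schema library would likely replace these hand-written checks, but the contract being enforced is the same.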

Function calling

What it is: You provide one or more tool or function definitions, and the model chooses whether to call one and with which arguments.

Where it shines:

Tool-using agents
Search and retrieval workflows
Transactional interfaces
Routing and orchestration layers

Strengths:

Clear separation between model reasoning and system action
Useful for controlled interfaces with known parameters
Can reduce ambiguity about what the model should do next
Fits multi-step workflows and agent prompts

Weaknesses:

More engineering overhead than plain text or simple JSON
Tool design quality strongly affects performance
May require a second generation step for polished user messaging
Poorly scoped tools can create unnecessary complexity

Best prompt engineering use: Use function calling when the model is making a decision about system behavior. Good examples include “search the knowledge base,” “create support ticket,” “get order status,” or “route this request to billing.”

Function calling is especially useful when you want a constrained action vocabulary. It is less useful when the real need is simply “return a clean object I can parse.” In those cases, JSON mode is often simpler.
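
As a vendor-neutral sketch, a constrained action vocabulary can be modelled as a registry of named tools. Everything here is an assumption for illustration: the two stub tools, their parameters, and the `{"name": ..., "arguments": {...}}` shape of a proposed call. Real APIs each define their own tool-definition and tool-call formats.

```python
import inspect
from typing import Any, Callable


def get_order_status(order_id: str) -> dict[str, Any]:
    # Stub: a real implementation would query an order system.
    return {"order_id": order_id, "status": "shipped"}


def create_support_ticket(subject: str, body: str) -> dict[str, Any]:
    # Stub: a real implementation would call a ticketing system.
    return {"ticket_id": "T-1", "subject": subject}


# The registry is the complete action vocabulary the model may choose from.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "get_order_status": get_order_status,
    "create_support_ticket": create_support_ticket,
}


def dispatch(call: dict[str, Any]) -> dict[str, Any]:
    """Run a model-proposed call after checking the tool name and arguments."""
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be an object")
    tool = TOOLS[name]
    try:
        # bind() raises TypeError for missing or unexpected parameters.
        inspect.signature(tool).bind(**arguments)
    except TypeError as exc:
        raise ValueError(f"bad arguments for {name}: {exc}") from exc
    return tool(**arguments)
```

With these stubs, `dispatch({"name": "get_order_status", "arguments": {"order_id": "A1"}})` returns the stubbed status record, while an unknown tool name or a missing argument is rejected before anything runs.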

Reliability and validation

If reliability is your top priority, none of these methods should be trusted without validation. Plain text needs output checks if it feeds another system. JSON needs schema validation and retry handling. Function calling needs argument validation, tool permission checks, and failure fallbacks.
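
A validate-and-retry loop is one common shape for this. In the sketch below, `generate` stands in for whatever model call you use and `validate` for your schema check. Both are hypothetical parameters rather than part of any vendor SDK, and the convention that `validate` raises `ValueError` on bad output is an assumption of this example.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_validated(
    generate: Callable[[str], str],
    validate: Callable[[str], T],
    prompt: str,
    max_attempts: int = 3,
) -> T:
    """Call the model until validate() accepts the output or attempts run out."""
    last_error: Optional[Exception] = None
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            current_prompt = f"{prompt}\n\nYour previous output was rejected: {exc}"
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Capping attempts matters: an unbounded retry loop turns a validation problem into a cost and latency problem.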

This is where prompt testing matters more than prompt cleverness. Run representative cases, edge cases, adversarial inputs, and malformed contexts. If you need a debugging workflow, start with Prompt Debugging Guide: Why Your LLM Output Fails and How to Fix It.

Security and prompt injection exposure

Any output mode can be affected by malicious instructions in retrieved or user-supplied content, but the risks differ. Plain text mainly risks misleading or unsafe responses. JSON mode risks polluted fields or invalid structure. Function calling carries the highest stakes because the model may attempt an unintended action if your guardrails are weak.

That does not mean function calling is unsafe by default. It means the surrounding system must enforce authorization, tool access rules, argument validation, and action confirmation where needed. For more, review Prompt Injection Prevention Checklist for LLM Apps.
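
One way to express those rules is a default-deny policy check that runs before any tool executes. The roles, tool names, and policy table below are hypothetical. The sketch shows only the decision logic, not a complete authorization system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    allowed_roles: frozenset
    needs_confirmation: bool = False


# Hypothetical policy table: which roles may use which tools.
POLICIES = {
    "get_order_status": ToolPolicy(frozenset({"customer", "agent"})),
    "issue_refund": ToolPolicy(frozenset({"agent"}), needs_confirmation=True),
}


def authorize(tool_name: str, role: str, confirmed: bool = False) -> str:
    """Return "allow", "confirm", or "deny" for a proposed tool call."""
    policy = POLICIES.get(tool_name)
    # Default-deny: unknown tools and unauthorized roles never run.
    if policy is None or role not in policy.allowed_roles:
        return "deny"
    if policy.needs_confirmation and not confirmed:
        return "confirm"
    return "allow"
```

The key design choice is that the check lives in application code. The model can propose `issue_refund`, but it cannot grant itself permission to run it.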

Developer ergonomics

Plain text is easiest to start with. JSON is often easiest to productionize for data tasks. Function calling is usually the most maintainable once an application genuinely needs tools, but it also demands more design discipline.

One overlooked factor is debugging visibility. Plain text failures are usually obvious to a human reader but hard to detect in code. JSON failures are easier to detect automatically. Function-calling failures often require inspecting tool selection, arguments, tool output, and follow-up response behavior. If your team lacks good observability, the “most advanced” option can become the hardest to operate.
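
A small trace record can capture the four things worth inspecting in a tool-calling failure. The field names and the one-JSON-line-per-call log format below are assumptions for illustration, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ToolCallTrace:
    request_id: str
    tool_name: str                        # which tool the model selected
    arguments: dict                       # arguments the model supplied
    tool_output: Any = None               # what the tool returned, if it ran
    final_response: Optional[str] = None  # follow-up text shown to the user
    error: Optional[str] = None           # validation or execution error, if any


def to_log_line(trace: ToolCallTrace) -> str:
    """Serialize a trace as one JSON line; non-JSON values fall back to str()."""
    return json.dumps(asdict(trace), sort_keys=True, default=str)
```

Logging one such line per call makes it possible to tell a wrong tool choice from bad arguments or from a tool-side failure when reviewing incidents.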

Best fit by scenario

If you want a practical shortcut, start with the scenario and work backward.

Use plain text prompting when:

You are generating emails, summaries, explanations, or drafts
A human reviews the result before action
The exact field structure is not important
You want maximum fluency and readability

Example: A support copilot drafts a reply to a customer. The agent will read and edit it before sending. Plain text is the natural fit.

Use JSON mode when:

You need labels, fields, scores, entities, or normalized records
Your UI or backend expects structured data
You are building prompt chaining steps
You need repeatable evaluation and parsing

Example: A ticket triage service extracts urgency, issue type, account ID, and next-best queue from inbound messages. JSON is the right default.

Use function calling when:

The model must decide whether to use a tool
You have defined operations with known parameters
You are building an assistant that retrieves, routes, or updates
You want the model to produce an action request rather than a prose answer

Example: A customer assistant checks order status, looks up return policy, and opens tickets. Function calling is usually the right architecture.

Use a hybrid approach when:

You need both machine structure and human explanation
You want the model to call a tool, then summarize the result
You want structured evaluation data plus a readable response

A common production pattern looks like this:

1. The model uses function calling to choose a tool and provide arguments.
2. Your system executes the tool and validates the result.
3. The model returns a final plain text answer for the user.
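
The three steps above can be sketched as a single function. The callables `propose_call`, `run_tool`, and `write_answer` are hypothetical stand-ins for your model calls and tool layer, and the convention that a failed tool returns a dictionary with an `"error"` key is an assumption of this example.

```python
from typing import Any, Callable


def answer_with_tool(
    user_message: str,
    propose_call: Callable[[str], dict[str, Any]],
    run_tool: Callable[[dict[str, Any]], dict[str, Any]],
    write_answer: Callable[[str, dict[str, Any]], str],
) -> str:
    # Step 1: the model chooses a tool and arguments.
    call = propose_call(user_message)
    # Step 2: the system, not the model, executes the tool and checks the result.
    result = run_tool(call)
    if "error" in result:
        return "Sorry, I could not complete that request."
    # Step 3: the model turns the validated result into prose for the user.
    return write_answer(user_message, result)


# Usage with stubs in place of real model and tool calls.
reply = answer_with_tool(
    "Where is order A1?",
    propose_call=lambda msg: {"name": "get_order_status", "arguments": {"order_id": "A1"}},
    run_tool=lambda call: {"status": "shipped"},
    write_answer=lambda msg, result: f"Your order has {result['status']}.",
)
print(reply)  # Your order has shipped.
```

Passing the three stages in as parameters keeps each one testable on its own, which helps when a failure could come from any of them.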

Another pattern is JSON-first classification followed by a plain text explanation only if needed. This reduces tokens and keeps the pipeline easier to test.

If you are comparing developer-facing utilities that support these workflows, such as JSON validation or debugging tools, this related article may help: JSON Formatter vs SQL Formatter vs Regex Tester: Which Developer Utilities Deserve a Place in AI Toolchains?

A simple decision rule

The choice can be reduced to two questions. First: do I want the model to select and parameterize an operation? If yes, function calling is likely appropriate. If not, ask whether the output is data or prose. If it is data, choose JSON mode. If it is prose, choose plain text.
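
Written as code, the rule is a short cascade. The function and parameter names are illustrative only.

```python
def choose_output_method(selects_operation: bool, output_is_data: bool) -> str:
    """Apply the decision rule: action first, then data versus prose."""
    if selects_operation:
        return "function calling"
    if output_is_data:
        return "JSON mode"
    return "plain text"
```

The order of the checks matters: a task that both selects an operation and produces data is treated as function calling, because the tool arguments already carry the structured data.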

When to revisit

Your first choice does not need to be permanent. Revisit this decision when the surrounding constraints change, not only when a model vendor announces a new feature.

In practice, review your output method when:

Pricing changes make extra retries, long schemas, or multi-step tool flows more or less acceptable
Model behavior improves enough that a previously fragile JSON workflow becomes stable
New API features appear that strengthen schema enforcement or tool orchestration
Your app scope expands from content generation into retrieval, routing, or transactions
Failure costs rise because the system moves closer to real actions or customer-facing automation
Evaluation results drift and you see more malformed outputs, incorrect tool choices, or user-visible confusion

A practical quarterly review is enough for many teams. Pull a sample of recent tasks and ask:

Is the current format still the simplest thing that works?
Are we forcing prose into data extraction, or data extraction into prose?
How often are we retrying, repairing, or manually correcting outputs?
Would a different method reduce code complexity or support burden?
Have new vendors or model updates changed the tradeoff?

If costs are part of the decision, pair this review with a broader model selection pass using LLM Pricing Comparison for API Users: Token Costs, Context Windows, and Hidden Tradeoffs.

The most useful action you can take after reading this article is to test one task in all three formats. Choose a real workflow, such as lead classification, support routing, or retrieval-assisted answering. Run the same dataset through plain text prompting, JSON mode, and function calling. Compare not just answer quality, but parse rate, validation failures, engineering complexity, and ease of debugging. That small benchmark will teach you more than generic best practices.
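
Parse rate is the easiest of those metrics to compute. The helper below is a minimal sketch: it counts how many raw outputs a parser accepts, assuming the parser signals failure by raising `ValueError`. The sample outputs are placeholders, not real model results.

```python
import json
from typing import Callable, Iterable


def parse_rate(outputs: Iterable[str], parser: Callable[[str], object]) -> float:
    """Fraction of outputs the parser accepts without raising ValueError."""
    collected = list(outputs)
    if not collected:
        return 0.0
    accepted = 0
    for raw in collected:
        try:
            parser(raw)
            accepted += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    return accepted / len(collected)


# Placeholder outputs: the third has prose wrapped around the JSON.
json_outputs = ['{"label": "lead"}', '{"label": "spam"}', 'Sure! {"label": "lead"}']
print(parse_rate(json_outputs, json.loads))  # 2 of the 3 outputs parse
```

Running the same helper over each format's outputs, with a parser suited to that format, gives one comparable number per method before you look at the softer criteria.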

For teams hiring around these skills, it is also worth assessing whether candidates can choose the right output contract rather than merely write long prompts. This article is a good companion: Prompt Engineering Interview Questions and Practical Tests for Hiring Teams.

The durable takeaway is straightforward. Plain text is for communication. JSON is for structured data. Function calling is for controlled action selection. Most production systems improve when they stop treating these as competing trends and start treating them as separate tools in a prompt engineering toolkit.