The Philosophy of Saving Money in the AI Era: How to Spend Every Token Wisely

Author: Sleepy.md

In the telegraph era, when you were charged by the word, every stroke of the pen was money. People were used to compressing countless words down to the limit: “returning soon” was worth more than a long letter, and “safe” was the weightiest reassurance of all.

Later, the telephone came into the home, but long-distance calls were billed by the second. Your parents’ long-distance calls were always brief and to the point: once the matter was settled, the receiver went down in a hurry. And if the conversation stretched on even slightly, the worry about the phone bill would cut off the small talk before it even started.

Later still, broadband came into the home, and going online was charged by the hour. People stared at the timer on the screen: webpages were opened and closed right away, videos could only be downloaded, and streaming was a luxurious verb back then. At the end of every download progress bar, there was both a longing to “connect to the world” and a guarded fear of “insufficient balance.”

The unit you’re charged for kept changing, but the instinct to save money never did.

Nowadays, Token has become the currency of the AI era. However, most people still haven’t learned how to budget carefully in this age, because we still don’t know how to account for gains and losses within invisible algorithms.

When ChatGPT first came out in 2022, almost nobody cared what a Token even was. That was AI’s “all-you-can-eat” era: pay $20 a month and you could chat as much as you wanted.

But ever since AI Agents took off, Token spending has become something everyone who uses an Agent has to care about.

Unlike simple question-and-answer conversations, behind a task flow are hundreds or thousands of API calls. An agent’s independent thinking comes at a cost—every self-correction and every tool call corresponds to numbers that jump on the bill. Then you’ll find the money you topped up suddenly isn’t enough, and you still have no idea what the Agent actually did.

In real life, everyone knows how to save money. When you buy vegetables at the market, you pick off the muddy, rotten leaves before they go on the scale. When you take a taxi to the airport, the driver knows to avoid the elevated roads at rush hour.

The money-saving logic of the digital world is basically the same; only the billing unit has changed from the jin (a Chinese half-kilogram) and the kilometer to the Token.

In the past, saving was because of scarcity; but in the AI era, saving is for precision.

We hope this article will help you lay out a methodology for saving money in the AI era, so you can spend every cent where it matters.

Before weighing on the scale, pick out the bad leaves

In the AI era, the value of information is no longer determined by breadth, but by purity.

AI’s billing logic charges by the number of words it reads. No matter whether you feed it true insights or meaningless format filler, as long as it reads it, you pay.

So the first way to save Token is to bake “signal-to-noise ratio” into your subconscious.

Every word, every image, and every line of code you feed to the AI costs money. So before handing anything to AI, remember to ask yourself: how much of this does AI truly need? How much is muddy, rotten leaf?

For example, long-winded openers like “Hello, please help me…,” repeated background introductions, and code comments that you didn’t clean up—these are all muddy, rotten leaves.

Beyond that, the most common waste is simply throwing a PDF or webpage screenshots straight at the AI. Sure, you save yourself effort, but “saving effort” in the AI era often means “expensive.”

A properly formatted PDF includes not only the main text, but also headers and footers, chart labels, hidden watermarks, and a large amount of formatting code used for layout. None of this helps the AI understand your question, but you’re billed for all of it.

Next time, remember to convert the PDF into clean Markdown text first, then feed it to the AI. When you turn a 10MB PDF into 10KB of clean text, you save not only 99% of the money—you also make the AI “brain” run much faster than before.
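If you extract the text yourself, a few lines of cleanup go a long way. The sketch below assumes the third-party `pypdf` library for extraction (shown only in a comment); the cleaning function itself is pure Python and simply drops lines that repeat across most pages, which is where headers, footers, and watermarks usually live.

```python
from collections import Counter

def strip_repeated_lines(pages: list[str], threshold: float = 0.6) -> str:
    """Drop lines that appear on most pages (headers, footers, watermarks)."""
    counts = Counter()
    for page in pages:
        # count each distinct line at most once per page
        for line in set(page.splitlines()):
            counts[line.strip()] += 1
    cutoff = max(2, int(len(pages) * threshold))
    kept = []
    for page in pages:
        for line in page.splitlines():
            if line.strip() and counts[line.strip()] < cutoff:
                kept.append(line)
    return "\n".join(kept)

# Hypothetical usage with pypdf (file name is illustrative):
# from pypdf import PdfReader
# pages = [p.extract_text() for p in PdfReader("report.pdf").pages]
# clean_text = strip_repeated_lines(pages)
```

Even this crude filter strips the most expensive kind of noise: text the AI reads on every single page without ever needing it.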

Images are another money-sink.

In the logic of vision models, the AI doesn’t care whether your photo is beautiful; it only cares how much pixel area it has to read.

For example, using Claude’s official calculation logic: Image Token consumption = width in pixels × height in pixels ÷ 750.

A 1000×1000 pixel image consumes about 1334 Tokens. At Claude Sonnet 4.6’s pricing, that’s about $0.004 per image;

But if you compress the same image to 200×200 pixels, it only consumes 54 Tokens, bringing the cost down to $0.00016—25 times less.
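The arithmetic above is easy to reproduce. Here is a small sketch using the per-pixel formula quoted above and the article’s $3-per-million-Token input rate (both taken from the text, not independently verified):

```python
import math

INPUT_PRICE_PER_MTOK = 3.00  # Sonnet input price quoted in the article, $/MTok

def image_tokens(width_px: int, height_px: int) -> int:
    """Approximate vision Tokens per the formula cited above: w × h / 750."""
    return math.ceil(width_px * height_px / 750)

def image_cost_usd(width_px: int, height_px: int) -> float:
    return image_tokens(width_px, height_px) * INPUT_PRICE_PER_MTOK / 1_000_000

# image_tokens(1000, 1000) → 1334 Tokens (≈ $0.004)
# image_tokens(200, 200)   → 54 Tokens   (≈ $0.00016)
```

Downscaling before upload, for example with Pillow’s `Image.thumbnail`, is usually a one-line change on your side.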

Many people just feed high-definition photos taken with their phones or 4K screenshots to the AI, not realizing that the Tokens consumed by those images might be enough for the AI to read most of a mid-length novel. If the task is only to recognize text in the image or make simple visual judgments—like having the AI read the amount on an invoice, read text in an instruction manual, or determine whether there are traffic lights in the image—then 4K resolution is pure waste. Compressing the image to the smallest usable resolution is enough.

But the real reason Tokens are so easy to waste on the input side isn’t file formats. It’s an inefficient way of speaking.

Many people treat AI like a real neighbor and get used to using social-style rambling chatter to communicate. They throw out a line like “Help me write a webpage,” then once the AI spits out a half-finished draft, they add details, and then tug back and forth repeatedly. This toothpaste-squeezing style of conversation makes the AI generate content again and again, and each round of edits stacks additional Token consumption.

In practice, engineers at Tencent Cloud found that for the same requirement, the number of Tokens consumed by a toothpaste-squeezing multi-round conversation is often 3 to 5 times that of clearly stating it all at once.

The real way to save money is to give up this low-efficiency social probing and state your requirements, boundary conditions, and reference examples clearly in one go. Spend less effort explaining what “not to do,” because negations usually cost the model more effort to interpret than affirmations. Tell it directly “how to do it,” and provide a clear, correct example.

Also, if you know where the target is, just tell the AI plainly—don’t let it act like a detective.

When you command the AI to “find the code related to users,” it has to do large-scale scanning, analysis, and guessing in the background. But when you tell it directly to “look at the src/services/user.ts file,” the Token consumption is on a completely different scale. In the digital world, information symmetry is the biggest saving of all.

Don’t pay for AI’s “politeness”

There’s an unspoken rule in how large models are billed that many people don’t realize: output Tokens are typically 3 to 5 times more expensive than input Tokens.

That means the words the AI produces cost far more than what you say to it. For instance, using Claude Sonnet 4.6 pricing: input costs $3 per million Tokens, while output jumps to $15—an entire 5x price difference.

Those polite opening lines like “Okay, I fully understand your needs, and I’ll answer you now…” and those courteous closing remarks like “I hope the above helps you…” are social pleasantries in real human communication. But on an API bill, all that small talk with no incremental information also costs you money.

The most effective way to curb output-side waste is to set rules for the AI. Use system instructions to make it clear: no pleasantries, no explanations, no restating the request—just give the answer.

These rules only need to be set once, and they take effect in every conversation. It’s a true “invest once, benefit forever” financial strategy. But when people establish rules, many fall into another misconception: piling up instructions with long natural language.

Test data from engineers shows that an instruction’s effectiveness isn’t about how many words it has; it’s about its density. Compress a 500-word system prompt to 180 words by deleting meaningless politeness, merging repeated instructions, and restructuring paragraphs into concise, itemized checklists: the AI’s output quality barely changes, but Token consumption per call drops by as much as 64%.

Another more proactive control method is limiting output length. Many people never set an upper limit on output. They let the AI run wild. This kind of permissiveness with expressive freedom often leads to completely uncontrolled costs. You might only need a short line that stops at “just enough,” but the AI, trying to show some kind of “intellectual sincerity,” ends up generating an 800-word little essay for you.
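Both levers, the “no pleasantries” system instruction and a hard output cap, can live in the request itself. Below is a sketch of the payload shape, modeled on Anthropic’s Messages API; the model name is a placeholder, and other providers use similar fields (for example, OpenAI’s `max_output_tokens`):

```python
def build_request(prompt: str, cap: int = 300) -> dict:
    """Request payload with a standing style rule and an output ceiling."""
    return {
        "model": "claude-sonnet-latest",  # placeholder model name
        "max_tokens": cap,  # hard ceiling on billable output Tokens
        "system": "Answer directly. No pleasantries, no restating the request.",
        "messages": [{"role": "user", "content": prompt}],
    }
```

Set once, the cap and the system rule apply to every call, which is exactly the “invest once, benefit forever” strategy described above.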

If what you’re after is pure data, you should force the AI to return a structured format, not a long natural-language description. With the same amount of information, JSON Token consumption is far lower than prose paragraphs. That’s because structured data removes all redundant connectors, filler words, and explanatory modifiers—keeping only the high-concentration logical core.
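A quick illustration of the gap, using the common rough heuristic of about four characters per Token. A real tokenizer will give different absolute numbers but the same ordering, and the invoice fields here are made up:

```python
import json

prose = ("The invoice was issued on March 3rd to Acme Ltd. for a total "
         "amount of 1,280 dollars, and it has not been paid yet.")

structured = json.dumps(
    {"date": "2024-03-03", "payer": "Acme Ltd", "total_usd": 1280, "paid": False}
)

def rough_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per Token."""
    return len(text) // 4

# Same facts, but the JSON form drops connectors and filler words,
# so its rough Token estimate comes out lower than the prose version.
```

The structured form also parses reliably downstream, which saves you a second round of “please reformat that” calls.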

In the AI era, you should be clear that what’s worth paying for isn’t the AI’s meaningless self-explanation, but the value of the results.

Besides that, the AI’s “overthinking” is also wildly eating into your account balance.

Some advanced models have an “extended thinking” mode. They do a massive amount of internal reasoning before answering. That reasoning process also gets billed—and it’s priced at output rates, making it very expensive.

This mode is essentially designed for “complex tasks that need deep logical support.” But most people choose it even for simple questions. For tasks that don’t require deep reasoning, explicitly tell the AI “no need to explain the thought process—just give the answer,” or manually turn off extended thinking, and you can save a lot of money.

Don’t let AI rehash old accounts

Large models don’t truly have memory—they just frantically rehash old accounts.

This is a core underlying mechanism many people don’t know about. Each time you send a new message in a conversation window, the AI doesn’t start from your latest message alone. Instead, it re-reads everything you’ve discussed before: every round of dialogue, every code snippet, and every cited document, and only then answers.

On the Token bill, this kind of “reviewing the past to learn the new” is anything but free. As dialogue rounds accumulate, even a one-word question forces the AI to re-read the entire old book behind the scenes, and the total cost of a session climbs faster and faster. This mechanism means one thing: the heavier the conversation history, the more expensive each of your questions becomes.

Someone tracked 496 real conversations containing more than 20 messages and found that at the 1st message, the AI read an average of 14,000 Tokens, costing about 3.6 cents per message. By the 50th message, it read an average of 79,000 Tokens, costing about 4.5 cents per message, a 25% increase. And since the context keeps getting longer, by the 50th message the context the AI has to re-process is already 5.6 times that of the 1st.
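The growth pattern is easy to model. In the toy function below, each turn adds a fixed number of Tokens of history that every later message must re-send; the default constants are fitted to the survey numbers above (about 14,000 Tokens at message 1 and 79,000 at message 50) and are illustrative, not measured:

```python
def input_tokens_at_turn(turn: int,
                         base_tokens: int = 12_700,
                         tokens_per_turn: int = 1_325) -> int:
    """Input Tokens re-read when you send message number `turn`:
    a fixed prefix (system prompt, documents) plus all history so far."""
    return base_tokens + turn * tokens_per_turn

# Per-message input grows linearly with history, so the *cumulative*
# spend over a long session grows roughly quadratically.
```

That quadratic cumulative curve is why one long never-ending conversation costs far more than several short ones covering the same ground.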

The simplest habit to fix this is: one task, one conversation box.

When a topic is finished, decisively start a new conversation. Don’t treat AI like a never-off chat window. This sounds simple, but many people still can’t do it. They always feel like “what if we need the earlier information later.” In fact, that “what if” almost never happens—but to cover that “what if,” you’ve already paid several times more money on every new message.

When a conversation really needs to continue but the context has become very long, we can use some compression tools. Claude Code has a /compact command that can condense long conversation history into a short summary, helping you do a bit of cyber spring cleaning.

There’s also another money-saving mechanism: Prompt Caching. If you repeatedly use the same system prompt, or every conversation quotes the same reference document, the provider can cache that content. On the next call, you pay only a small cache-read fee instead of being billed at full price every time.

Anthropic’s official pricing shows that cached-hit Tokens cost 1/10 of the normal price. OpenAI’s Prompt Caching can similarly reduce input costs by about 50%. A paper published on arXiv in January 2026 tested long tasks across multiple AI platforms and found that prompt caching can reduce API costs by 45% to 80%.

In other words, for the same content, you pay full price the first time you feed it to the AI, and afterward each call costs only 1/10 for that cached portion. For users who run the same set of specification documents or system prompts every day, this feature can save a huge number of Tokens.

But Prompt Caching has a prerequisite: the content and order of your system prompt and reference documents must remain exactly the same, and they must be placed at the very beginning of the conversation. Once any content changes, the cache becomes invalid and you’ll be billed at the full rate again. So if you have a set of fixed work standards, lock them in and don’t modify them casually.
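The caching math above can be sketched in a few lines. This assumes the 1/10 cache-read ratio quoted from Anthropic’s pricing and ignores the one-time cache-write surcharge some providers add, so treat the numbers as a rough lower bound on a real bill:

```python
INPUT_PRICE_PER_MTOK = 3.00   # input price quoted in the article, $/MTok
CACHE_READ_MULTIPLIER = 0.1   # cached Tokens billed at 1/10 of input price

def session_cost_usd(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Cost of re-sending a fixed prefix across `calls` API calls."""
    if not cached:
        return prefix_tokens * calls * INPUT_PRICE_PER_MTOK / 1e6
    first = prefix_tokens * INPUT_PRICE_PER_MTOK / 1e6  # full price once
    rest = (prefix_tokens * (calls - 1)
            * INPUT_PRICE_PER_MTOK * CACHE_READ_MULTIPLIER / 1e6)
    return first + rest

# 20 calls over a 50,000-Token fixed prefix:
# uncached ≈ $3.00, cached ≈ $0.44
```

The saving scales with how often you reuse the prefix, which is why locking your standards down and not touching them pays off.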

The last context-management trick is loading on demand. Many people like to shove all the rules, documents, and notes into the system prompt at once, again for the “just in case” reason.

The cost, though, is that you’re clearly doing a very simple task, yet you’re forced to load thousands of words of rules, wasting piles of Tokens for nothing. Claude Code’s official documentation recommends keeping CLAUDE.md within 200 lines. Split specialized rules for different scenarios into independent skill files, and load only the rules for the scenario at hand. Keeping your context pure is the highest form of respect for compute.
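“Load on demand” can be as simple as a lookup. The scenario names and rules below are hypothetical; the point is that only one rule set ever reaches the prompt:

```python
# Hypothetical per-scenario rule sets, kept out of the base prompt.
RULES = {
    "frontend": "Use TypeScript. Prefer functional components.",
    "sql": "Always parameterize queries. Never SELECT *.",
    "docs": "Write plain English. Keep sentences short.",
}

BASE_PROMPT = "You are a concise assistant. No pleasantries."

def build_system_prompt(scenario: str) -> str:
    """Splice in only the rules for the current scenario."""
    rules = RULES.get(scenario, "")
    return BASE_PROMPT + ("\n\n" + rules if rules else "")
```

Calling `build_system_prompt("sql")` ships one short rule set instead of all three, and an unknown scenario falls back to the bare base prompt.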

Don’t take a Porsche to buy groceries

Different AI models have huge price gaps.

Claude Opus 4.6 costs $5 per million Tokens for input and $25 for output. Claude Haiku 3.5 costs only $0.8 for input and $4 for output, roughly a 6x difference. If you use the top-tier model for the grunt work of gathering materials and fixing formatting, it’s not only slower but also more expensive.

The smart approach is to bring the familiar “division of labor” from human society into the AI world: tasks of different difficulty go to models at different price points.

Just like in the real world, when hiring people to do work, you wouldn’t specifically hire a million-dollar-a-year expert to move bricks at a construction site.

AI is the same. Claude Code’s official documentation also explicitly recommends that Sonnet handles most programming tasks, Opus is reserved for complex architecture decisions and multi-step reasoning, and simple subtasks should be assigned to Haiku.

A more concrete, hands-on plan is to build a “two-stage workflow.” In the first stage, use a free or cheap baseline model for the dirty work—things like information gathering, format cleaning, draft generation, and simple categorization and summarization. Then in the second stage, feed the distilled, high-purity core to a top model for core decision-making and deep polishing.

For example, if you need to analyze a 100-page industry report, you can first use Gemini Flash to extract the report’s key data and conclusions, and organize them into a 10-page summary. Then hand that summary to Claude Opus for deep analysis and judgment. This two-stage workflow can dramatically compress costs while maintaining quality.
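Here is a back-of-envelope version of that example. The Opus input price is the article’s figure; the cheap-tier price and the Token counts for a 100-page report and a 10-page summary are assumptions for illustration:

```python
REPORT_TOKENS = 150_000      # ~100-page report, assumed
SUMMARY_TOKENS = 15_000      # ~10-page distilled summary, assumed

CHEAP_INPUT_PER_MTOK = 0.10  # assumed cheap-tier price, $/MTok
OPUS_INPUT_PER_MTOK = 5.00   # article's quoted Opus input price, $/MTok

def one_stage_cost() -> float:
    """Feed the whole report straight to the top model."""
    return REPORT_TOKENS * OPUS_INPUT_PER_MTOK / 1e6

def two_stage_cost() -> float:
    """Cheap model distills first; top model reads only the summary."""
    stage1 = REPORT_TOKENS * CHEAP_INPUT_PER_MTOK / 1e6
    stage2 = SUMMARY_TOKENS * OPUS_INPUT_PER_MTOK / 1e6
    return stage1 + stage2

# one_stage_cost() → $0.75, two_stage_cost() → ≈ $0.09
```

Under these assumptions the two-stage route costs roughly an eighth as much, and the gap widens as the source document grows.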

More advanced than just splitting into segments is deep division of labor based on task decomposition. A complex engineering task can be broken down into several independent subtasks and matched with the most appropriate model.

For example, for a task that involves writing code, let a cheap model first write the scaffolding and boilerplate code, then only hand the core logic to the expensive model to implement. Each subtask has a clean, focused context; the results are more accurate and the cost is lower too.

You didn’t actually need to spend Tokens

All the discussions above are, in essence, solving tactical problems of “how to save money.” But a more fundamental question is often overlooked: does this action actually need to spend Tokens?

The most extreme saving isn’t algorithmic optimization; it’s cutting out needless calls altogether. We’re used to asking AI for answers to everything, but in many scenarios, calling an expensive large model is like using an anti-aircraft gun to swat a mosquito.

For example, if you let AI automatically handle emails, it will treat every email as an independent task to understand, categorize, and reply—Token consumption is huge. But if you first spend 30 seconds scanning your inbox, manually filtering out emails that clearly don’t need AI handling, then send only the remaining ones to AI, the cost drops immediately to a small fraction of the original. Human judgment here isn’t a barrier—it’s the best filter.

People in the telegraph era knew that one more word cost more money, so they weighed each one: that is intuitive sensitivity to resources. The AI era is the same. When you truly know what it costs to have the AI say one more thing, you naturally weigh whether it’s worth having the AI do it at all, whether the task needs a top-tier model or a cheap one, and whether that chunk of context is still useful.

This weighing itself is the most money-saving ability. In an era where compute power keeps getting more expensive, the smartest way isn’t to replace humans with AI, but to have AI and humans do what each is best at. When this sensitivity to Tokens becomes ingrained as a reflex, you truly move from being a servant of compute power back to being its master.
