
Every time you ask an AI to summarise something, write an email, generate an image, or make a recommendation, the model is doing inference. This is the value generative AI brings to everyday use cases — especially in business. AI systems perform billions of inferences a day, powering customer support, sales workflows, internal tools, and more.
But inferences aren’t free. For most enterprises, the bulk of the AI bill today comes from inference, not training. And companies care about only three things:
- Speed: how fast the model replies
- Cost: what each inference costs, typically billed per token processed across a workload (a rough sketch follows this list)
- Scalability: whether the system can handle millions of queries reliably
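To make the cost point concrete, here is a back-of-the-envelope sketch of how per-token billing adds up at scale. The prices are illustrative placeholders, not Groq's or any other provider's actual rates.

```python
# Back-of-the-envelope inference cost: tokens in, tokens out, price per million.
# The prices below are illustrative placeholders, not any provider's real rates.

PRICE_PER_M_INPUT = 0.50    # USD per million input tokens (hypothetical)
PRICE_PER_M_OUTPUT = 1.50   # USD per million output tokens (hypothetical)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single inference call under per-token billing."""
    return (input_tokens * PRICE_PER_M_INPUT +
            output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

per_request = request_cost(1_200, 300)   # a typical support-bot turn
monthly = per_request * 2_000_000        # at 2 million requests a month
print(f"per request: ${per_request:.5f}, per month: ${monthly:,.0f}")
```

Even a fraction of a cent per request turns into a meaningful monthly line item once traffic reaches millions of queries.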
As generative AI becomes central to business operations, many companies are waking up to rising AI bills and slower response times from large LLMs. US-based Groq claims it is solving exactly this problem.
“We have two core principles: bring down the cost of inference, and make inference faster with lower latency,” said Amar Singh, Sales Executive GTM APAC at Groq. “If you already have a product customers love and you’re using inference today, that’s great — but the cost will creep up over time. Running inference isn’t like spinning up regular cloud hardware. It’s a completely different beast. Our goal is to make it both faster and cheaper.”
Singh was speaking at a masterclass during TechSparks 2025.
Groq is building strong traction in India, where it is working with Paytm, the country's leading digital payments and financial services distribution company.
Building and scaling AI requires a shift in mindset: from “it works” to “it scales profitably.” Many startups stumble here, burning runway on inefficient models. Deploying AI brings in business and grows revenue, but compute costs often rise exponentially when companies rely on large models like GPT-5 or Claude 4 Opus.
Debjyoti Biswas, AI Solutions Architect at Groq, illustrated this with an analogy: it’s like delivering chai to your neighbourhood using an adventure bike such as a Royal Enfield Himalayan — when simply walking would be more efficient.
Latency is another problem. Large LLMs can take several seconds to generate a response, and in consumer-facing interactions this delay directly impacts usage and engagement.
“During a phone conversation, if the other person is using GPT-5 as an AI agent, they’re not going to wait 10 seconds for you to respond,” Singh said. “They’ll drop off. You need something that balances speed and performance.”
Biswas said the best strategy for companies deploying AI is to follow a "Bento Box" approach. "Think of your LLM infrastructure as a diverse toolkit: a bento box where each specialised model matches a specific need," his presentation noted.
“You could use a very large, GPT-5–class model — the equivalent of a surgeon. It’s powerful, but also expensive and slow, so you’d only bring it in for truly complex use cases,” he explained. “You wouldn’t go to a surgeon for a common cold. In most situations, you’d rely on a nurse. And a nurse can handle 60–80% of your needs. It’s the same with AI. You can offload the majority of tasks to a smaller, cheaper model that does the job perfectly well. Save the ‘surgeon’ models for when you actually need them.”
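In practice, the bento box idea boils down to a router sitting in front of the models. The sketch below is illustrative only: the model names, the call_model helper, and the complexity heuristic are placeholders for whatever rules or classifier a team would actually use, but it shows the shape of the approach, with most traffic landing on the small, cheap model.

```python
# Sketch of a "bento box" router: send most requests to a small, cheap model
# and reserve the large, expensive one for genuinely complex work.
# Model names and call_model() are hypothetical placeholders.

SMALL_MODEL = "small-fast-model"      # the "nurse": cheap, low latency
LARGE_MODEL = "large-frontier-model"  # the "surgeon": expensive, slower

COMPLEX_HINTS = ("analyse", "multi-step", "legal", "write code", "reason")

def looks_complex(prompt: str) -> bool:
    """Toy heuristic; in practice this could be a small classifier model."""
    lowered = prompt.lower()
    return len(prompt) > 2_000 or any(hint in lowered for hint in COMPLEX_HINTS)

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real inference call (e.g. a chat-completions API)."""
    return f"[{model}] response to: {prompt[:40]}..."

def route(prompt: str) -> str:
    model = LARGE_MODEL if looks_complex(prompt) else SMALL_MODEL
    return call_model(model, prompt)

print(route("What are your store hours on Sunday?"))   # small model
print(route("Analyse this contract and reason through the risks step by step."))  # large model
```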
Singh added that there are four major factors businesses need to account for when deploying GenAI: the scaling challenge, a model-selection framework, benchmarking, and optimising time to first token.
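Time to first token, the delay before a user sees the first word of a response, is usually measured by timing a streaming call. The sketch below uses a stand-in token generator rather than a real streaming client; the measurement pattern is what matters.

```python
import time
from typing import Iterator

def stream_completion(prompt: str) -> Iterator[str]:
    """Stand-in for a real streaming inference call that yields tokens."""
    for token in ["Hello", ",", " how", " can", " I", " help", "?"]:
        time.sleep(0.05)  # simulated network and generation delay
        yield token

def measure_ttft(prompt: str) -> tuple[float, float]:
    """Return (time to first token, total generation time) in seconds."""
    start = time.perf_counter()
    first_token_at = None
    for _token in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    return first_token_at - start, end - start

ttft, total = measure_ttft("Summarise my last three support tickets.")
print(f"time to first token: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```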
“It is a fast-changing landscape. Models are going to be phased out. It is very important to build your own LLM evals at the very start of your AI journey, so there’s no trade-off you have to make down the line,” Singh said, concluding the session.
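That advice can start small: a fixed set of prompts with expected behaviour, scored the same way every time a candidate model is tried. The test cases, the call_model stub, and the pass criterion below are illustrative only.

```python
# Minimal sketch of an LLM eval harness: a fixed set of cases run the same way
# against every candidate model. call_model() and the cases are placeholders.

EVAL_CASES = [
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
    {"prompt": "Name the capital of France.", "must_contain": "Paris"},
]

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real inference call."""
    canned = {"What is 2 + 2?": "The answer is 4.",
              "Name the capital of France.": "Paris."}
    return canned.get(prompt, "")

def run_evals(model: str) -> float:
    passed = 0
    for case in EVAL_CASES:
        output = call_model(model, case["prompt"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(EVAL_CASES)

print(f"pass rate: {run_evals('candidate-model'):.0%}")
```

Running the same cases against each candidate makes phasing out a model a measured swap rather than a leap of faith.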

