Better tax answers: Blue J now runs on OpenAI's latest GPT-4.1 model

Our mission at Blue J has always been to generate the best possible tax answers. Regularly incorporating the latest innovations in generative AI technology is a core part of how we deliver on this promise. That is why we’re thrilled to announce the integration of OpenAI’s new GPT-4.1 model into our AI-powered tax research solution. This integration means even better tax answers for the thousands of firms that trust us to answer their toughest tax questions.

Working side by side with OpenAI to build the future of tax research

As part of our ongoing collaboration, we have been working with OpenAI over the past several months with early access to the new GPT-4.1 model. Our deep expertise in tax research meant that Blue J could offer OpenAI a unique lens for testing: we put the new model through a wide-ranging series of challenging tax scenarios, which gave OpenAI valuable insight for refining it further.

Our extensive testing demonstrates that the new GPT-4.1 significantly surpasses other leading models, including Anthropic’s Claude 3.7 and Google’s Gemini 2.5 Pro, in practical tax research applications. The improvement is particularly substantial when compared with GPT-4o, the previous model Blue J had been using.

Why this upgrade matters

Blue J has now been in the market for 2 years, during which time we’ve been constantly refining our solution based on user feedback. Due to this continual improvement, finding specific failure points has become increasingly challenging. However, Blue J’s rapid growth in users and questions asked has opened new opportunities for improvement only available at a larger scale. It is largely due to the thousands of firms using Blue J to ask tax research questions at a rate of several million per year that we’ve been able to achieve our current <0.2% disagree rate from users.

With that volume of usage, even 0.2% is incredibly helpful in allowing us to identify the exact issues our users are experiencing. We use this information to refine the product in a deliberate and targeted manner, leading to even further reductions in the disagree rate. This tight feedback loop ensures Blue J remains at the forefront of generative AI-powered tax research, continually widening the gap with competing products. It’s all part of how we’re continuously working to deliver better tax answers. 
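To make the scale argument concrete, here is a rough back-of-the-envelope sketch in Python. The annual volume of 3 million questions is a hypothetical stand-in for "several million", and 0.2% is treated as an upper bound because the reported rate is below that figure. Neither number is an exact Blue J statistic.

```python
# Rough estimate of how many "disagree" signals a given usage volume yields.
# ASSUMPTION: 3,000,000 questions per year is a hypothetical stand-in for
# "several million"; 0.002 is the upper bound implied by "<0.2%".

def max_disagrees(questions_per_year: int, disagree_rate_upper_bound: float) -> int:
    """Upper bound on the number of disagree events per year."""
    return round(questions_per_year * disagree_rate_upper_bound)

questions_per_year = 3_000_000        # hypothetical volume
disagree_rate_upper_bound = 0.002     # 0.2% expressed as a fraction

upper = max_disagrees(questions_per_year, disagree_rate_upper_bound)
print(f"Up to {upper:,} disagree signals per year")
```

Under these assumptions, even a very small disagree rate yields thousands of specific, user-flagged answers to examine each year, which is what makes the feedback loop workable.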

To illustrate the impact of these targeted disagrees, which point to particular capabilities of the model, we’ve collected some specific metrics from our testing of the new GPT-4.1. Blue J uses a variety of testing techniques, ranging from automated test suites that grade simpler questions at scale to deeply complex tax questions whose answers our tax research team manually inspects for correctness and comprehensiveness. As part of our evaluation of GPT-4.1, we ran the latest state-of-the-art models across this test suite, which consists of roughly 300 questions in total.

Table 1: Reduction in error rate

The percentages represent reductions in error rates, so a 50% reduction means that half of the previously encountered issues in a given category have been resolved. Typically, such substantial improvements require extensive algorithmic and prompt engineering changes; rarely are they achieved by simply updating the model we use to run Blue J. GPT-4.1 stands out due to three key advancements:

  1. Enhanced Instruction Following: This is crucial for Blue J, which relies exclusively on an extensive database of proprietary, authoritative tax information, rather than the model’s internal knowledge.
  2. Superior Comprehension: Tax research involves complex rules that must be read, understood, and applied cohesively, demanding high-level comprehension.
  3. Expanded Token Window: Increasing from 128,000 tokens to 1 million tokens enables more comprehensive context handling and eliminates complicated parsing steps typically needed for lengthier materials, such as IRS publications.
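The effect of the larger token window on document handling can be sketched as follows. This is a minimal Python illustration, not Blue J's actual pipeline: the function name `chunks_needed`, the example document size, and the reserved-token budget are all hypothetical. Only the 128,000 and 1,000,000 token limits come from the list above.

```python
import math

OLD_WINDOW = 128_000    # token limit cited for the previous window
NEW_WINDOW = 1_000_000  # token limit cited for GPT-4.1

def chunks_needed(doc_tokens: int, window: int, reserved: int = 8_000) -> int:
    """Number of pieces a document must be split into to fit the window.

    `reserved` is a hypothetical budget set aside for instructions and the
    model's answer; the remainder is available for the document itself.
    """
    usable = window - reserved
    if usable <= 0:
        raise ValueError("reserved budget must be smaller than the window")
    return max(1, math.ceil(doc_tokens / usable))

# Hypothetical long publication of about 300,000 tokens.
doc_tokens = 300_000
print(chunks_needed(doc_tokens, OLD_WINDOW))  # 3 pieces
print(chunks_needed(doc_tokens, NEW_WINDOW))  # 1 piece
```

When the count drops to one, the splitting and reassembly logic can be skipped altogether, which is the kind of simplification the third item describes.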

Not only will the tax experts using Blue J immediately benefit from this upgrade, but the potential for future enhancements via algorithmic improvements and prompt engineering is massive. In the coming weeks, we’ll continue rolling out additional advancements, in order to realize the full potential of this robust new model.

Pushing tax research forward

The integration of GPT-4.1 marks one of the largest performance increases we’ve seen in the past 12 months, further cementing Blue J as the leader in AI-powered tax research. But this is only the beginning. Our uniquely collaborative relationship with OpenAI ensures tax professionals who choose Blue J will always have access to the latest innovations in generative AI. As OpenAI continues to push its technology forward, Blue J will be right there alongside them to bring the latest in generative AI to tax research. 

To discover what the future of tax research can look like for your firm, book a demo to speak with one of our experts. 
