Treasury has estimated that a Microsoft 365 Copilot licence for a “mid-level” government staffer could pay for itself if it freed up 13 minutes of their time per week for “higher-value tasks”.
The surprisingly detailed review – by a part of the federal Treasury called the Australian Centre for Evaluation (ACE) – offers a different perspective on last year’s government-wide Copilot trial than an aggregate report by the Digital Transformation Agency (DTA).
ACE said that it “conducted an internal evaluation of Treasury’s Copilot trial to capture data regarding the Copilot trial within Treasury, which otherwise would have fallen outside the scope of the DTA’s investigation.”
The 13-minute-per-week figure, based on the time of an APS6-level staffer, is a break-even threshold rather than a measured result: the evaluation did not quantify actual time savings, but ACE deemed them “likely” to accrue.
APS6 is the highest non-executive officer level in the Australian Public Service, with a base salary of between $94,300 and $114,243 at Treasury.
The metric appears to set a fairly low bar to justify the extra cost of licensing Copilot; however, it does not necessarily pave the way for a broader deployment.
The review suggests any future deployment – of Copilot, or indeed any GenAI tool – should be targeted to specific staff and use cases.
It also appears to have left Treasury somewhat at a crossroads on how to progress with GenAI.
“The current trend of adoption of generative AI indicates it is likely to become the norm to use generative AI for many basic tasks in the future,” ACE wrote.
“The critical question in this context is whether Treasury should consider adopting generative AI in its current state or wait until the product is more advanced.
“If adopted in its current state, staff can build their capability and experience the technology as it continues to be improved and iterated. Alternatively, Treasury could wait for a more advanced product, and staff will receive access to generative AI products as late adopters.
“There are immediate costs to adopting this technology associated with the licences required alongside onboarding and training costs.”
13-minute efficiency saving
Treasury had 218 people trial Copilot for 14 weeks, and it appears most, if not all, “retained their licences”.
The trial cohort was made up largely of senior officers and executives, with 86 percent of participants at APS5 level or above.
Staff went into the trial with high expectations – 92 percent thought Copilot would reduce the time they spent on “low-value or low-priority tasks”.
By trial end, 62 percent said “they experienced time savings in their work when using Copilot for basic administrative work and processes.”
“ACE estimates that a staff member on the current APS6 salary would need to redirect approximately 13 minutes of time per week to higher value tasks for the current licence cost to be offset,” it wrote.
“Although the data collected during this trial did not quantify time savings, the results of this trial suggest that the productivity benefits and time savings associated with Copilot are likely to offset the licence costs.”
The DTA, in its broader evaluation, suggested a far more optimistic “average” time saving of around an hour a day. In agencies where that figure holds, Copilot would pay for itself many times over.
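As a rough sanity check on both figures, a back-of-envelope calculation using the article’s APS6 salary band lands close to ACE’s 13-minute threshold. The licence cost and working-hours inputs below are assumptions, not figures from either evaluation: roughly Microsoft’s public US$30-per-user-per-month list price converted to Australian dollars, and a 38-hour week.

```python
# Back-of-envelope check of ACE's break-even logic -- illustrative only.
# Assumed inputs (not published in the evaluation):
#   - licence cost of ~A$600 per user per year, roughly Microsoft's
#     public US$30/user/month list price converted to AUD
#   - 38 paid hours per week across 52 weeks

APS6_SALARY_MIDPOINT = (94_300 + 114_243) / 2   # A$104,271.50, from the article
HOURS_PER_YEAR = 38 * 52                        # assumption: 1,976 paid hours
LICENCE_COST_PER_YEAR = 600                     # assumption, in A$

hourly_rate = APS6_SALARY_MIDPOINT / HOURS_PER_YEAR       # ~A$52.77/hour
break_even_hours = LICENCE_COST_PER_YEAR / hourly_rate    # ~11.4 hours/year
break_even_minutes = break_even_hours * 60 / 52           # ~13.1 minutes/week

print(f"Break-even: {break_even_minutes:.1f} minutes per week")
```

On the same assumptions, the DTA’s hour-a-day average (around 300 minutes a week) would clear the threshold more than 20 times over.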
More junior staff, while not a significant part of the Treasury trial, also appeared to be a cohort that could substantially benefit from a tool like Copilot.
Time savings could be redirected into skill development, the reviewers found.
“Copilot provided opportunities for junior staff to free up time for more strategic or complex work … [and] supported junior staff to undertake basic administrative duties more efficiently,” ACE wrote.
“This allowed more time to engage in professional development opportunities and undertake work of more substance such as drafting policy briefs or data analysis.
“This highlights that Copilot has the potential to create opportunities for junior staff to learn and develop at a faster pace than has otherwise been available to them.”
Hard to measure
One key challenge in quantifying the efficiency metric is that, in most cases, Copilot did not produce a finished product, making it hard to delineate machine effort from human effort.
“The evaluation did not find clear evidence that Copilot helped improve work outcomes during the short trial period, but there were promising indicators,” ACE wrote.
“The effects of Copilot are more difficult to trace because work typically undergoes further revisions prior to finalisation.”
Another key challenge is that while individual users reported efficiencies, their direct managers often didn’t see any difference.
“Despite Copilot contributing to an individual’s personal experiences of improved efficiencies on basic administrative tasks, the product has not gone as far as improving noticeable work efficiencies across entire teams,” ACE wrote.
Individual trialists reported a mixed experience; consistent with the earlier DTA evaluation, some staff appeared to give up on Copilot when it did not meet their expectations.
For example, going into the trial, 75 percent of people thought Copilot could support up to half of their tasks; by the end of the trial, 59 percent said it provided “little to no” help.
“For some tasks, using Copilot was not reliably more efficient compared to completing the task manually,” ACE found.
Costs, controls curtailed trial
The federal Copilot trial was somewhat doomed from the outset because much government data is not held in the Microsoft ecosystem, and was therefore beyond Copilot’s reach.
The ACE review of Treasury’s trial found the department’s security and privacy controls, together with a lack of resourcing, did not work in Copilot’s or users’ favour.
Within Treasury, the system restrictions led some staff to conclude that “Copilot did not perform as well as generative AI products [they] had used elsewhere.”
“Some participants identified that the unrestricted version of Copilot performed better than Copilot limited to Treasury’s internal systems, indicating that Treasury’s necessary privacy and security restrictions limited the product’s quality,” ACE wrote.
Costs also curtailed the trial: staff were given access to Copilot with little training or specific guidance on good use cases.
That’s a curly problem: the DTA trial would have been intended to flush out use cases across government, but not everyone was equipped to find them.
“Due to limited trial resources, Treasury trial onboarding was limited to providing an onboarding session to participants and the basic Digital Transformation Agency generative AI training via Treasury’s online learning platform,” ACE wrote.
“The onboarding session included information about Treasury’s trial of Copilot, trial requirements, and an indication and explanation of the trial’s specified use cases.
“However, providing further training or individualised support to trial participants was outside of scope and available resourcing.”
DTA done with whole-of-gov AI trials, for now
For its part, the DTA said that as of last month, it has “no plans to conduct further whole-of-government trials of generative AI products”.
It said that agencies “may conduct their own trials or evaluations” individually.
Given the challenges of applying a commercial model in a federal government context, the DTA also addressed whether the government might instead train its own model.
“As of January 2025, the Australian government is not exploring a bespoke, whole-of-government generative AI model,” it said.
“Agencies may choose to procure, develop or collaborate on bespoke models to meet their specific needs.”