Browsing the internet last night long after I should have been asleep (as one does), I was served an article that Google thought would be of interest to me: “Government Test Finds That AI Wildly Underperforms Compared to Human Employees” [The Byte]. This article was based on another article that appeared on the hilariously-named Australian outlet Crikey: “AI worse than humans in every way at summarising information, government trial finds.” Do we even need to dive any deeper into this matter? As a human writer, I often get annoyed when consumers of internet articles don’t read past the headline because there’s usually much more nuance to be found within that couldn’t fit in a headline (I’m looking at you, Reddit). In this case though, it’s exactly what you think it is.
But we’ve got minimum word counts to hit here so dive we shall.
Here’s what Crikey said:
Artificial intelligence is worse than humans in every way at summarising documents and might actually create additional work for people, a government trial of the technology has found.
Amazon conducted the test earlier this year for Australia’s corporate regulator the Securities and Investments Commission (ASIC) using submissions made to an inquiry. The outcome of the trial was revealed in an answer to a question on notice at the Senate select committee on adopting artificial intelligence.
As some of you are aware, lawmakers over in their corner of the world have been quite busy investigating Big 4 firms — often in an official capacity, grilling the CEOs of Deloitte, EY, KPMG, and PwC in parliamentary hot seats — ever since it was discovered that PwC was double-dipping on confidential government information it then tried to “sell” to clients to assist in tax avoidance. Big scandal. Yuge. And quite annoying to Big 4 leadership who have better things to do than answer intimate questions asked by angry senators on the parliament floor.
At issue for this article is this inquiry: Ethics and Professional Accountability: Structural Challenges in the Audit, Assurance and Consultancy Industry:
On June 22, 2023, the Parliamentary Joint Committee on Corporations and Financial Services resolved to commence an inquiry into recent allegations of and responses to misconduct in the Australian operations of the major accounting, audit, and consultancy firms (including but not exclusive to the ‘Big Four’).
See what you did, PwC? You’re even fucking things up for Grant Thornton. Way to go.
The committee accepted public comment through August 2023.
For their part, the Australian Securities and Investments Commission (ASIC) wanted to summarize a sample of these public submissions using generative AI. To do so, they procured Amazon Web Services (AWS) Professional Services to run a Proof of Concept (PoC) between January 15 and February 16, 2024. The PoC was meant to assess the capability of gen AI: to explore and trial these technologies, to focus on measuring the quality of the generated output rather than the performance of the models, and to understand the future potential for business use of generative AI.
The PoC was not used for any of ASIC’s regulatory work or business activities; this was only an experiment. If successful, it could save humans a ton of time they used to spend consuming and summarizing mounds of text.
The PoC consisted of multiple phases, with a preparation/set-up stage occurring before the PoC:
- Phase 1: Selection of the Large Language Model (LLM) to be used in Phase 2
- Phase 2: Experimentation and optimisation with the selected LLM (Llama2-70B)
- Phase 3: Final assessment
Llama2-70B is the largest of Meta’s (aka Facebook’s) Llama 2 family of pretrained models, which range in scale from 7 billion to 70 billion parameters.
The project team consisted of ASIC’s Chief Data and Analytics Office (CDAO) team, ASIC’s Regulatory Reform and Implementation team (who acted as subject matter experts) and AWS.
And here’s how Llama 2 performed on the task:
The final assessment results of the PoC showed that out of a maximum of 75 points, the aggregated human summaries scored 61 (81%) and the aggregated Gen AI summaries scored 35 (47%). Whilst the Gen AI summaries scored lower on all criteria, it is important to note the PoC tested the performance of one particular AI model (Llama2-70B) at one point in time. The PoC was also specific to one use case with prompts selected for this kind of inquiry.
In the final assessment ASIC assessors generally agreed that AI outputs could potentially create more work if used (in current state), due to the need to fact check outputs, or because the original source material actually presented information better. The assessments showed that one of the most significant issues with the model was its limited ability to pick-up the nuance or context required to analyse submissions.
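For anyone who wants to check the math on those percentages, here’s a quick sketch. The scores and the 75-point maximum come straight from the report as quoted above; the variable and function names are mine and purely illustrative.

```python
# Scores as quoted from the ASIC report; all names here are illustrative only.
MAX_POINTS = 75
scores = {"human": 61, "gen_ai": 35}


def as_percent(score, maximum=MAX_POINTS):
    """Return the score as a whole-number percentage of the maximum."""
    return round(100 * score / maximum)


# Percentage of the maximum for each group of summaries.
percentages = {who: as_percent(points) for who, points in scores.items()}

# Raw point gap between the human and Gen AI summaries.
gap = scores["human"] - scores["gen_ai"]

print(percentages)
print(f"Humans led by {gap} points out of {MAX_POINTS}")
```

The rounded figures match the 81% and 47% in the report, with a 26-point gap between the two.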
In other words, this particular LLM sucked at this particular task. The ASIC report goes out of its way not to hurt Llama 2’s feelings (smart, that mf’s grandchildren will rule over us one day) but ultimately the conclusion is that, for now, humans outperform the LLM.
Excuse the weird Brit spelling in these key observations; we all know in our hearts that the American ‘z’ is superior:
- To a human, the request to summarise a document appears straightforward. However, the task could consist of several different actions depending on the specifics of the summarisation request. For example: answer questions, find references, impose a word limit. In the PoC the summarisation task was achieved by a series of discrete tasks. The selected LLM was found to perform strongly with some actions and less capably with others.
- Prompting (prompt engineering) was key. ‘Generic’ prompting without specific directions or considerations resulted in lower quality output compared to specific or targeted prompting.
- An environment for rapid experimentation and iteration is necessary, as well as monitoring outcomes.
- Collaboration and active feedback loops between data scientists and subject matter experts were essential.
- The duration of the PoC was relatively short and allowed limited time for optimisation of the LLM.
- Technology is advancing rapidly in this area. More powerful and accurate models and GenAI solutions are being continually released, with several promising models released during the period of the PoC. It is highly likely that future models will improve performance and accuracy of the results.
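To make the first two observations a bit more concrete, here’s a rough sketch of what “generic” versus targeted prompting might look like when a summary is split into discrete actions. Everything in it is an assumption for illustration: the report as quoted doesn’t publish the prompts ASIC and AWS actually used, so the wording, the function names, and the example question are all made up. It only builds prompt strings and doesn’t call any model.

```python
# Hypothetical illustration only: these are NOT the prompts ASIC or AWS used.


def generic_prompt(document: str) -> str:
    """A 'generic' prompt with no specific directions or considerations."""
    return f"Summarise the following document.\n\n{document}"


def targeted_prompts(document: str, question: str, word_limit: int) -> list[str]:
    """Split one summary request into the discrete actions the report names:
    answer questions, find references, impose a word limit."""
    return [
        f"Answer this question using only the document: {question}\n\n{document}",
        f"List every reference the document makes to other sources.\n\n{document}",
        f"Summarise the document in no more than {word_limit} words.\n\n{document}",
    ]


# Made-up submission text and question, purely for demonstration.
submission = "Example submission text goes here."
prompts = targeted_prompts(
    submission,
    question="What does the submission say about conflicts of interest?",
    word_limit=200,
)

# Show just the instruction line of each targeted prompt.
for prompt in prompts:
    print(prompt.splitlines()[0])
```

Splitting the request this way also makes it easier to see which actions a model handles well and which it fumbles, which is roughly what the observations above describe.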
For the time being, regulators are going to be stuck pushing their own paper.