Allens AI Competence Review for Australian Legal Advice
In the past two years, generative artificial intelligence (AI) tools have made significant strides, driven by advances in large language models (LLMs). These tools are already transforming how firms operate, particularly in the legal sector, yet their effectiveness in providing legal advice remains unproven. While AI tools show promise in identifying patterns in data and generating fluent text, they struggle with the nuanced judgement essential to legal practice. The distinct nature of legal language further complicates their application.
To systematically evaluate generative AI's capability in answering legal questions, Allens, in collaboration with Linklaters LLP, developed the Allens AI Australian Law Benchmark. This benchmark tests the ability of leading LLMs (as of February 2024) to answer legal questions under Australian law, simulating how a layperson might use AI instead of consulting a human lawyer.
The key findings from the Allens AI Benchmark highlight several critical points:
Need for Expert Supervision: The tested models should not be used for Australian legal advice without expert human oversight. Relying on their output without already knowing the correct answer poses significant risks.
Performance Ranking: GPT-4 was the strongest overall performer, followed by Perplexity. LLaMa 2, Claude 2, and Gemini 1 performed at similar levels.
Reliability Issues: Even the best-performing LLMs in 2024 were not consistently reliable for answering legal questions. While these tools can assist in summarising well-understood areas of law, their outputs require thorough review by legal experts to ensure accuracy.
Critical Reasoning Challenges: For tasks involving critical reasoning, none of the tested tools (including GPT-4, Gemini 1, Claude 2, Perplexity, and LLaMa 2) could be trusted to provide correct legal advice without expert supervision. They often produced incorrect answers or missed the point of the question, while displaying unwarranted confidence in their responses.
Citation Problems: Poor citation practices were prevalent among the models. Issues included:
- Failure to prefer authoritative legal sources over non-authoritative ones.
- Fabrication of case names.
- Attribution of fabricated extracts to genuine sources, or selection of incorrect pinpoint citations.
- Citation of entire pieces of legislation without specifying the relevant sections.
Jurisdictional 'Infection': Legal analysis drawn from larger jurisdictions (the UK and EU) posed a significant problem for smaller jurisdictions like Australia. Although the models were asked to answer from an Australian law perspective, many responses incorporated citations and analysis from UK and EU law.
Need for Safeguards: Businesses considering generative AI technologies must implement safeguards governing how AI outputs are used. In the legal context, outputs need careful expert review to verify their accuracy and relevance and to ensure they contain no fictitious citations.
Enduring Role of Human Lawyers: Even if LLMs come to meet or surpass the benchmark, human lawyers will remain crucial. Answering legal questions accurately is only a small part of an Australian lawyer's duties, which increasingly resemble those of a strategic advisor.
In conclusion, while generative AI tools hold potential for the legal sector, they currently require substantial human supervision and verification. Their limitations, particularly in critical reasoning and citation accuracy, underscore the ongoing necessity of human expertise in legal practice.
This article summarises a research report published on Allens’ website. For more detail on the testing methodology and the full review, obtain the Allens AI Australian Law Benchmark report.