The Evaluation of GenAI Capabilities to Implement Professional Tasks

Keywords

professionalism
generative artificial intelligence
professional use of language models
knowledge graphs
orchestration
Bloom’s taxonomy

How to Cite

Kouzminov Y., & Kruchinskaia E. (2024). The Evaluation of GenAI Capabilities to Implement Professional Tasks. Foresight and STI Governance, 18(4), 67-76. https://doi.org/10.17323/2500-2597.2024.4.67.76

Abstract

Generative AI (GenAI), or large language models (LLMs), has been taking the world by storm since 2022, yet despite all the trends surrounding the use of generative models, they cannot yet be used professionally. While they are most valued for ‘knowing everything’, GenAI models still cannot explain and prove. We conceptualize the key current problem of LLMs as a general tendency to make mistakes even within the core of a field’s knowledge, combined with the absence of any causal link between question complexity and error: a mistake can appear, as if by accident, anywhere, and this is the principal limitation on professional use. At their current stage of development, LLMs are not widely used in a professional context, nor have they replaced human workers; they do not even extend workers’ professional abilities. These limitations of GenAI share one general consequence: the investment does not pay off. This article seeks to analyze GenAI’s professional viability by examining two models (GigaChatPro, GPT-4) in three fields of knowledge (economics, law, education) based on our unique Bloom’s taxonomy benchmark. To substantiate our assumption concerning the low feasibility of professional usage, we test three hypotheses: 1) the number of model parameters has low elasticity with respect to difficulty and taxonomy level, even when the answer is correct; 2) difficulty and taxonomy level jointly have no effect on the correctness of an answer; 3) the multiple-choice format decreases the number of a model’s correct answers. We also present the results of GPT-4 and GigaChat MAX on our benchmark. Finally, we suggest what can be done about the limitations of GenAI’s architecture to achieve at least quasi-professional use.
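The hypotheses above are statistical claims about how answer correctness varies with item difficulty and Bloom's-taxonomy level. As an illustration only (the records, field names, and tags below are hypothetical and are not the authors' benchmark data), a first step in such an analysis is tabulating a model's accuracy separately by taxonomy level and by difficulty band:

```python
from collections import defaultdict

# Hypothetical benchmark records: each item carries a Bloom's-taxonomy tag,
# a difficulty rating, and whether the evaluated model answered correctly.
# All values here are made up for illustration.
records = [
    {"taxonomy": "remember",   "difficulty": "easy", "correct": True},
    {"taxonomy": "remember",   "difficulty": "hard", "correct": False},
    {"taxonomy": "understand", "difficulty": "easy", "correct": True},
    {"taxonomy": "understand", "difficulty": "hard", "correct": True},
    {"taxonomy": "apply",      "difficulty": "easy", "correct": False},
    {"taxonomy": "apply",      "difficulty": "hard", "correct": False},
]

def accuracy_by(records, key):
    """Share of correct answers for each value of `key`."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += r["correct"]
    return {k: hits[k] / totals[k] for k in totals}

print(accuracy_by(records, "taxonomy"))
print(accuracy_by(records, "difficulty"))
```

If the resulting accuracy cells show no monotone relationship with difficulty or taxonomy level (errors scattered "accidentally" across cells), that is the pattern the article's hypotheses describe; a full analysis would then test the joint effect formally, e.g. with a regression of correctness on both factors.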

https://doi.org/10.17323/2500-2597.2024.4.67.76

References

NRU HSE (2024) Training Highly Qualified Personnel in the Field of Artificial Intelligence (ed. L.M. Gokhberg), Moscow: NRU HSE.

Alimardani A. (2024) Generative artificial intelligence vs. law students: An empirical study on criminal law exam performance. Law, Innovation and Technology, 2392932, 1-43. DOI: https://doi.org/10.1080/17579961.2024.2392932

Al-Zahrani A., Alasmari T. (2024) Exploring the impact of artificial intelligence on higher education: The dynamics of ethical, social, and educational implications. Humanities and Social Sciences Communications, 11(1), 912. DOI: https://doi.org/10.1057/s41599-024-03432-4

Al-Zahrani A.M. (2024) From Traditionalism to Algorithms: Embracing Artificial Intelligence for Effective University Teaching and Learning. IgMin Research, 2(2), 102-112. DOI: https://doi.org/10.61927/igmin151

Anthis J., Lum K., Ekstrand M., Feller A., D'Amour A., Tan C. (2024) The impossibility of fair LLMs (ArXiv paper 2406.03198). DOI: https://doi.org/10.48550/arXiv.2406.03198

Antoniak S., Krutul M., Pióro M., Krajewski J., Ludziejewski J., Ciebiera K., Król K., Odrzygóźdź T., Cygan M., Jaszczur S. (2023) Mixture of Tokens: Continuous MoE through Cross-Example Aggregation (ArXiv paper 2310.15961). DOI: https://doi.org/10.48550/arXiv.2310.15961

Bloom B.S., Engelhart M.D., Furst E.J., Hill W.H., Krathwohl D.R. (1956) Taxonomy of Educational Objectives: The Classification of Educational Goals (Handbook 1: Cognitive Domain), Ann Arbor, MI: Edwards Bros.

Borji A. (2023) A categorical archive of ChatGPT failures (ArXiv paper 2302.03494). DOI: https://doi.org/10.48550/arXiv.2302.03494

Cai W., Jiang J., Wang F., Tang J., Kim S., Huang J. (2024) A Survey on Mixture of Experts (ArXiv paper 2407.06204). DOI: https://doi.org/10.48550/arXiv.2407.06204

Chen Y., Esmaeilzadeh P. (2024) Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges. Journal of Medical Internet Research, 26, e53008. DOI: https://doi.org/10.2196/53008

Cheung M. (2024) A Reality check of the benefits of LLM in business (ArXiv paper 2406.10249). DOI: https://doi.org/10.48550/arXiv.2406.10249

Choi J., Palumbo N., Chalasani P., Engelhard M.M., Jha S., Kumar A., Page D. (2024) MALADE: Orchestration of LLM-powered Agents with Retrieval Augmented Generation for Pharmacovigilance (ArXiv paper 2408.01869). DOI: https://doi.org/10.48550/arXiv.2408.01869

Chu H.C., Hwang G.H., Tu Y.F., Yang K.H. (2022) Roles and research trends of artificial intelligence in higher education: A systematic review of the top 50 most-cited articles. Australasian Journal of Educational Technology, 38(3), 22-42.

Dai C.-P., Ke F. (2022) Educational applications of artificial intelligence in simulation-based learning: A systematic mapping review. Computers and Education: Artificial Intelligence, 3, 100087. DOI: https://doi.org/10.1016/j.caeai.2022.100087

Gill S.S., Xu M., Patros P., Wu H., Kaur R., Kaur K., Fuller S., Singh M., Arora P., Kumar A.P., Stankovski V., Abraham A., Ghosh S.K., Lutfiyya H., Kanhere S.S., Bahsoon R., Rana O., Dustdar S., Sakellariou R., Uhlig S., Buyya R. (2023) Transformative Effects of ChatGPT on Modern Education: Emerging Era of AI Chatbots. Internet of Things and Cyber-Physical Systems, 4, 19-23. DOI: https://doi.org/10.1016/j.iotcps.2023.06.002

Han S.J., Ransom K.J., Perfors A., Kemp C. (2023) Inductive reasoning in humans and large language models. Cognitive Systems Research, 83, 1-28. DOI: https://doi.org/10.1016/j.cogsys.2023.101155

Hassan R., Ali A., Howe C.W., Zin A.M. (2022) Constructive alignment by implementing design thinking approach in artificial intelligence course: Learners' experience. AIP Conference Proceedings, 2433(1), 0072986. DOI: https://doi.org/10.1063/5.0072986

Hendrycks D., Burns C., Basart S., Zou A., Mazeika M., Song D., Steinhardt J. (2020) Measuring Massive Multitask Language Understanding (ArXiv paper 2009.03300). DOI: https://doi.org/10.48550/arXiv.2009.03300

IDC (2024) The Global Impact of Artificial Intelligence on the Economy and Jobs, Needham, MA: IDC Corporate.

Jin B., Liu G., Han C., Jiang M., Ji H., Han J. (2023) Large Language Models on Graphs: A Comprehensive Survey (ArXiv paper 2312.02783). DOI: https://doi.org/10.48550/arXiv.2312.02783

Kardanova E., Ivanova A., Tarasova K., Pashchenko T., Tikhoniuk A., Yusupova E., Kasprzhak A.G., Kuzminov Y., Kruchinskaia E., Brun I. (2024) A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models (arXiv paper 2411.00045). DOI: https://doi.org/10.48550/arXiv.2411.00045

Kuhn T.S. (1977) The Essential Tension, Chicago: University of Chicago Press.

Lai J., Gan W., Wu J., Qi Z., Yu P.S. (2023) Large Language Models in Law: A Survey (ArXiv paper 2312.03718). DOI: https://doi.org/10.48550/arXiv.2312.03718

Lakatos I. (1963) Proofs and Refutations (I). British Journal for the Philosophy of Science, 14(53), 1-25.

Lakatos I. (1970a) Falsification and the Methodology of Scientific Research Programmes. In: Criticism and the Growth of Knowledge (eds. I. Lakatos, A. Musgrave), Cambridge: Cambridge University Press, pp. 91-195.

Lakatos I. (1970b) History of Science and Its Rational Reconstructions. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association, pp. 91-136.

Liang L., Sun M., Gui Z. et al. (2024) KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation (ArXiv paper 2409.13731). DOI: https://doi.org/10.48550/arXiv.2409.13731

Liu N.F., Lin K., Hewitt J., Paranjape A., Bevilacqua M., Petroni F., Liang P. (2023) Lost in the Middle: How language models use long contexts (ArXiv paper 2307.03172). DOI: https://doi.org/10.48550/arXiv.2307.03172

Luo L., Li Y.F., Haffari G., Pan S. (2023) Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning (ArXiv paper 2310.01061). DOI: https://doi.org/10.48550/arXiv.2310.01061

McKnight M.A., Gilstrap C.M., Gilstrap C.A., Bacic D., Shemroske K., Srivastava S. (2024) Generative Artificial Intelligence in Applied Business Contexts: A systematic review, lexical analysis, and research framework. Journal of Applied Business and Economics, 26(2), 7040. DOI: https://doi.org/10.33423/jabe.v26i2.7040

Mirzadeh I., Alizadeh K., Shahrokhi H., Tuzel O., Bengio S., Farajtabar M. (2024) GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models (ArXiv paper 2410.05229). DOI: https://doi.org/10.48550/arXiv.2410.05229

Mortlock R., Lucas C. (2024) Generative artificial intelligence (Gen-AI) in pharmacy education: Utilization and implications for academic integrity: A scoping review. Exploratory Research in Clinical and Social Pharmacy, 15, 100481. DOI: https://doi.org/10.1016/j.rcsop.2024.100481

Naveed H., Khan A.U., Qiu S., Saqib M., Anwar S., Usman M., Akhtar N., Barnes N., Mian A. (2023) A comprehensive overview of large language models (ArXiv paper 2307.06435). DOI: https://doi.org/10.48550/arXiv.2307.06435

Nguyen H., Fungwacharakorn W., Satoh K. (2023) Enhancing logical reasoning in large language models to facilitate legal applications (ArXiv paper 2311.13095). DOI: https://doi.org/10.48550/arXiv.2311.13095

Noever D., Ciolino M. (2023) Professional Certification Benchmark Dataset: The first 500 jobs for large language models (ArXiv paper 2305.05377). DOI: https://doi.org/10.48550/arXiv.2305.05377

OECD (2024) OECD Economic Outlook (Interim Report, September 2024), Paris: OECD.

Ogunleye B., Zakariyyah K.I., Ajao O., Olayinka O., Sharma H. (2024) A Systematic Review of Generative AI for Teaching and Learning practice. Education Sciences, 14(6), 14060636. DOI: https://doi.org/10.3390/educsci14060636

ORR (2023) Rail industry finance (UK): April 2022 to March 2023, London: Office of Rail and Road.

Rasal S., Hauer E.J. (2024) Navigating Complexity: Orchestrated Problem Solving with Multi-Agent LLMs (ArXiv paper 2402.16713). DOI: https://doi.org/10.48550/arXiv.2402.16713

Sanmartin D. (2024) KG-RAG: Bridging the gap between knowledge and creativity (ArXiv paper 2405.12035). DOI: https://doi.org/10.48550/arXiv.2405.12035

Shapira E., Madmon O., Reichart R., Tennenholtz M. (2024) Can LLMs replace economic choice prediction labs? The case of language-based persuasion games (ArXiv paper 2401.17435). DOI: https://doi.org/10.48550/arXiv.2401.17435

Sohail S.S., Farhat F., Himeur Y., Nadeem M., Madsen D.O., Singh Y., Atalla S., Mansoor W. (2023) Decoding ChatGPT: A taxonomy of existing research, current challenges, and possible future directions. Journal of King Saud University - Computer and Information Sciences, 35(8). DOI: https://doi.org/10.1016/j.jksuci.2023.101675

Strachan J., Albergo D., Borghini G., Pansardi O., Scaliti E., Gupta S., Saxena K., Rufo A., Panzeri S., Manzi G., Graziano M.S.A., Becchio C. (2024) Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7), 1285-1295. DOI: https://doi.org/10.1038/s41562-024-01882-z

Sun J., Xu C., Tang L., Wang S., Lin C., Gong Y., Ni L.M., Shum H.Y., Guo J. (2023) Think-on-Graph: Deep and responsible reasoning of large language model on knowledge graph (ArXiv paper 2307.07697). DOI: https://doi.org/10.48550/arXiv.2307.07697

Thomson Reuters (2024) 2024 Generative AI in Professional Services, Toronto: Thomson Reuters Institute.

Turnock D. (1998) An Historical Geography of Railways in Great Britain and Ireland (1st ed), New York: Routledge.

Wan Y., Wang W., Yang Y., Yuan Y., Huang J., He P., Jiao W., Lyu M.R. (2024) A ∧ B ⇔ B ∧ A: Triggering logical reasoning failures in large language models (ArXiv paper 2401.00757). DOI: https://doi.org/10.48550/arXiv.2401.00757

Wang Y., Ma X., Zhang G., Ni Y., Chandra A., Guo S., Ren W., Arulraj A., He X., Jiang Z., Li T., Ku M., Wang K., Zhuang A., Fan R., Yue X., Chen W. (2024) MMLU-Pro: A more robust and challenging Multi-Task Language Understanding benchmark (ArXiv paper 2406.01574). DOI: https://doi.org/10.48550/arXiv.2406.01574

Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., Chi E., Le Q., Zhou D. (2022) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (ArXiv paper 2201.11903). DOI: https://doi.org/10.48550/arXiv.2201.11903

Wen Y., Wang Z., Sun J. (2023) MindMap: Knowledge Graph prompting sparks graph of thoughts in large language models (ArXiv paper 2308.09729). DOI: https://doi.org/10.48550/arXiv.2308.09729

Xu Z., Cruz M.J., Guevara M., Wang T., Deshpande M., Wang X., Li Z. (2024) Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering (ArXiv paper 2404.17723). DOI: https://doi.org/10.48550/arXiv.2404.17723

Yang L., Chen H., Li Z., Ding X., Wu X. (2023) Give Us the Facts: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling (ArXiv paper 2306.11489). DOI: https://doi.org/10.48550/arXiv.2306.11489

Zhang Y., Ding H., Shui Z., Ma Y., Zou J., Deoras A., Wang H. (2021) Language models as recommender systems: Evaluations and limitations. Paper presented at the NeurIPS 2021 Workshop on I (Still) Can't Believe It's Not Better.

Zhang Y., Sun R., Chen Y., Pfister T., Zhang R., Arik S.O. (2024) Chain of Agents: Large language models collaborating on Long-Context Tasks (ArXiv paper 2406.02818). DOI: https://doi.org/10.48550/arXiv.2406.02818

Zhong Z., Xia M., Chen S., Lewis M. (2024) Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training (ArXiv paper 2405.03133). DOI: https://doi.org/10.48550/arXiv.2405.03133

Zhou J.P., Luo K.Z., Gu J., Yuan J., Weinberger K.Q., Sun W. (2024) Orchestrating LLMs with Different Personalizations (ArXiv paper 2407.04181). DOI: https://doi.org/10.48550/arXiv.2407.04181

Zhu Y., Wang X., Chen J., Qiao S., Ou Y., Yao Y., Deng S., Chen H., Zhang N. (2023) LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities (ArXiv paper 2305.13168). DOI: https://doi.org/10.48550/arXiv.2305.13168

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
