Thursday, November 7, 2024

Generative AI in My Teaching - The Final Shoe Drops

I recently published a paper in the journal npj Digital Medicine looking at how well generative artificial intelligence (AI) systems perform in my well-known introductory biomedical and health informatics course. This online course is taught to three audiences: graduate students (required in our graduate program and taken as an elective by students in several others, including public health, basic sciences, and nursing), continuing education students (the well-known 10x10 ["ten by ten"] course), and medical students at Oregon Health & Science University (OHSU). Student assessment varies across the three audiences and consists of up to three activities: multiple-choice quizzes (MCQs) of ten questions for each of the ten units of the course, a short-answer final exam, and a 10-15 page term paper on an appropriate topic of the student's choice. Graduate students must complete all three forms of assessment, while continuing education and medical students write a shorter 2-3 page paper. The final exam is optional for continuing education students who want to obtain graduate credit, usually with the purpose of pursuing further study in the field.

As also described in a summary of the paper by OHSU, our research found that each of six well-known large language model (LLM) systems scored better than up to 75% of all students and easily achieved a passing grade on the two-thirds of the course assessment that includes the MCQs and final exam. The LLMs were assessed and compared with the 139 students who took last year's (2023) version of the course. The results of the study bring into question how I will assess students in the course going forward.

I thought I could take some solace in the fact that LLMs still could not write the final third of the assessment for graduate students, namely the term paper. Alas, this was before someone pointed me to a new generative AI system from researchers at Stanford University, led by Professor Monica Lam, called Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking (STORM). Although STORM is designed to guide users through an iterative pre-writing process, its Web interface also allows a two-part prompt to define the topic of the paper and its format, length, and other attributes. When I ask for what I want for that last part of the course assessment, namely the 10-15 page paper on a focused topic, STORM serves up a paper that, while perhaps superficial in some of its coverage, for the most part satisfies the requirements of my paper-grading rubric. My co-author on the npj Digital Medicine paper, Kate Fultz-Hollis, noted that the first papers I generated did not have many peer-reviewed citations in the reference list, but that was easily fixed by asking for them explicitly in the prompt.

Now I must ask, has the final shoe dropped, i.e., can generative AI now pretty much do everything needed to pass my course? I hope that students will still want to learn informatics, but clearly that will not be a requirement for passing. Those of us who are educators face new challenges from generative AI, stemming from its ability to perform as well as students on a variety of learning assessments. One researcher, business professor Ethan Mollick from the University of Pennsylvania, has called this the "homework apocalypse."

Some have argued that new approaches to assessment are required, and Professor Mollick has a wealth of ideas. Many of these ideas are challenging to implement, especially in large online courses and when there is a true knowledge base that we aim for students to learn. I do agree with those who advocate that we should not merely assess students on their ability to regurgitate facts, especially in an era when finding facts online is easier than ever. But I do try in my teaching (maybe not always succeeding) to have students apply the knowledge they are learning. I find MCQs are actually pretty good at assessing that.

Nonetheless, the implication of these results is that generative AI systems challenge our ability to assess student learning. This will require us to modify how we evaluate students. This does not mean we should ban LLMs, but rather that we need to find ways to ensure enough learning that students can think critically based on a core of fundamental knowledge.

We also need to answer other questions: Is there a core of knowledge about which students should be able to answer questions without digital assistance? Does this core of knowledge facilitate higher-order thinking about a discipline? Does that core enable thoughtful searching, via classic search or LLMs, for information beyond the human's memory store? Should we have explicit policies around the use of generative AI in specific courses (here is mine)? Is it appropriate to try to maintain rigor in academic teaching, and if so, how?

I have talked in a number of forums about these issues, and find that many other educators are struggling to address these challenges as I am. Clearly we will need solutions to these problems in ways that optimize student learning and critical thinking while making the best of these tools that can enhance our performance on the tasks we are learning.