LM101-037: How to Build a Smart Computerized Adaptive Testing Machine using Item Response Theory

October 12, 2015

Episode Summary:

In this episode, we discuss the problem of how to build a smart computerized adaptive testing machine using Item Response Theory (IRT). Suppose that you are teaching a student a particular target set of knowledge. Examples of such situations obviously occur in nursery school, elementary school, junior high school, high school, and college. However, such situations also occur in industry when top professionals in a particular field attend an advanced training seminar. All of these situations would benefit from a smart adaptive assessment machine which attempts to estimate a student’s knowledge in real-time. Such a machine could then use that information to optimize the choice and order of questions to be presented to the student in order to develop a customized exam for efficiently assessing the student’s knowledge level and possibly guiding instructional strategies. Both tutorial notes and advanced implementational notes can be found in the show notes at: www.learningmachines101.com .

Show Notes:

Hello everyone! Welcome to the thirty-seventh podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in a hopefully entertaining and educational manner.

In this episode, we discuss the problem of how to build a smart computerized adaptive testing machine using Item Response Theory (IRT). Suppose you are teaching a student a particular target set of knowledge. Examples of such situations obviously occur in nursery school, elementary school, junior high school, high school, and college. However, such situations also occur in industry when top professionals in a particular field attend an advanced seminar for training. All of these situations would benefit from a smart adaptive assessment machine which attempts to estimate a student’s knowledge in real-time. Such a machine could then use that information to optimize the choice and order of questions to be presented to the student in order to develop a customized exam for efficiently assessing the student’s knowledge level and possibly guiding instructional strategies. Such a system could also be incorporated into an adaptive tutoring system as a component which provides real-time assessments of the student’s current knowledge state for the purpose of selecting and revising the current instructional strategy in real-time as well. This is how expert human tutors instruct students on an individual basis.

A critical component of the teaching process is assessment or testing. Assessment is important for at least two reasons. First, an assessment procedure is essential for evaluating the success of the teaching experience. For example, a student who receives the grade of A in a Calculus class is presumed to have acquired more knowledge of Calculus than a student who receives the grade of D in a Calculus class, although this assumption may not necessarily hold in practice. The assignment of a grade to a student is some complicated function of the student’s unique background and abilities, the assessment procedure, and other factors which combine with the subjective assessment of the instructor.

Second, assessment procedures do not have to be fundamentally evaluative in nature. Assessment procedures can be diagnostic. That is, the purpose of an assessment procedure might be to identify the student’s current level of mastery and try to understand what types of intervention strategies would be appropriate to bring the student up to the next level. For example, one student might be very strong in calculus but may have limited computer programming skills, while another student might have excellent computer programming skills and a limited knowledge of calculus. It would not be efficient to place these two students in a single course which devotes 50% of the time to calculus and 50% of the time to computer programming skills. A much more efficient approach would be to develop two courses which are customized respectively for the two students. The first course might devote 90% of the course to learning calculus skills and only 10% to computer programming skills, while the second course might devote 90% of the course to learning computer programming skills and only 10% to learning calculus skills. Of course, this requires either two instructors or a single instructor who has the energy of two instructors!!! In addition, the above strategy assumes that somehow we have identified what knowledge and skills the students possess before the teaching process begins.

So now that we have talked about two fundamental but distinct objectives of educational assessment, this leads us to the fundamental question of how we can exploit computer technology to adaptively assess knowledge and understanding. The basic idea is that the student interacts with a computer. The computer is an active participant in the assessment and tutoring process. As the student interacts with the computer, the computer acquires an improved understanding of the student’s strengths and weaknesses. Based upon this understanding, the computer not only selects and revises the optimal training curriculum in real-time but also might select and revise the optimal assessment procedure in real-time as well. In other words, the computer might start off by asking the student some moderately difficult questions. If the student has difficulty with these questions, then the computer presents easier questions to the student. As a consequence of this procedure, each student is actually tested with a different exam which is customized to that student’s strengths and weaknesses.

Of course, such a procedure surfaces a variety of interesting and challenging issues. For example, if each student is tested with a different exam, then how can exam scores be compared? A low-knowledge student who receives a series of easy questions might correctly answer many more questions than a high-knowledge student who correctly answers a relatively small number of very difficult questions. Simply counting the number of questions which are correctly answered would lead to misleading conclusions.

Item Response Theory is based upon the assumption that one has a collection of questions which are called “items”. In addition, it is assumed that there is a correct answer for each “item” and the student either gets the “item” correct or gets the “item” wrong. Although such assumptions are not entirely unreasonable, it is important to appreciate that the mathematics of Item Response Theory was based upon concepts of education which were prevalent in the 1950s, specifically the idea that the answer to a given question can be classified as either “correct” or “wrong”. In the real world, a given question may have many answers which are all partially wrong and all partially correct.

So, under the assumption that we have a collection of questions about a particular knowledge domain which can be answered either correctly or incorrectly, the assessment problem is concerned with using these questions to assess a student’s knowledge of that domain.  So how do we do this?

The first step is to identify the major components of a particular knowledge domain. For example, the knowledge domain of “arithmetic” might be assumed to consist of the following four major components: (1) addition and subtraction, (2) multiplication and division, (3) negative numbers, and (4) fractions and decimals. Then, we might devise a set of questions for each of these four domains. Suppose we develop 100 questions for each of these four domains. Our knowledge assessment procedure is intended to evaluate to what degree a student has mastered each of these four domains. So, for example, the assessment of one student might show that they have a good knowledge of arithmetic except that their knowledge of fractions and decimals is very poor. Thus, each student receives four numbers indicating their performance in each of these four major domains.

To assess performance in the “addition and subtraction” domain, one could simply report the percentage of questions correctly answered by the student, but such a measure is problematic because some of the questions might be very easy while other questions might be very difficult. If one student correctly answers 40% of the easy questions while another student correctly answers 40% of the difficult questions, it does not seem reasonable to assign the same performance score to both students. The problem then becomes how to weight the relative difficulty of each question in an appropriate manner. Notice that this is also relevant to dealing with issues of adaptive testing. In adaptive testing, each student receives a customized test which will consist of a different set of questions. If each student receives different questions and possibly even a different number of questions, somehow this needs to be taken into account in order to compare the performance of two different students who are each taking different exams.

There are various versions of Item Response Theory which deal with these issues but today we will consider the simplest version. Recall from Episode 14 that a logistic sigmoidal function is a function whose output value increases as the input value increases. In addition, the maximum output value can never exceed one and the minimum output value can never be less than zero. In Item Response Theory, it is assumed that the probability that the student answers a question correctly is equal to a logistic sigmoidal function of a parameter called the “ability parameter” minus another parameter called the “item difficulty parameter”. When a student takes an exam consisting of M questions, it is assumed that each question has its own difficulty parameter and all questions on the exam have a common value for the ability parameter. Thus, the probability a question is correctly answered is a logistic sigmoidal function of the difference between the student-specific ability parameter and the item-specific difficulty parameter. We will refer to this probability as the “predicted probability”. If the Item Response Theory probability model is correct and the item-specific difficulty parameters are correctly estimated, then one can estimate the ability parameter of a student. So…this is the key idea. Instead of just having parameters in the model which specify individual differences in students, we also include parameters in the model which specify individual differences in items.
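
To make this concrete, here is a minimal sketch in Python of this one-parameter (“Rasch”) model. The function name and the numeric values below are simply illustrative assumptions, not part of the episode.

```python
import math

def probability_correct(ability, difficulty):
    """One-parameter (Rasch) IRT model: the probability of a correct
    response is a logistic sigmoid of (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A student whose ability matches the item difficulty answers correctly
# about half the time; easier items (lower difficulty) are answered
# correctly more often, harder items less often.
print(probability_correct(0.0, 0.0))   # 0.5
print(probability_correct(0.0, -2.0))  # ~0.88 (easy item)
print(probability_correct(0.0, 2.0))   # ~0.12 (hard item)
```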

The procedure for estimating the student-specific parameters and the item-specific parameters is based upon the method of Maximum Likelihood Estimation which was described in Episode 10. The concept of Maximum Likelihood Estimation means that the parameter values are chosen such that the probability of the observed data is maximized. To keep things simple, suppose that we have one student who is taking a 1 question test. Suppose that Student 1 takes the 1 question test and gets the question correct. Then the probability of observing a correct answer to the 1 question test in this case is equal to the predicted probability the question is answered correctly. The goal of the learning process is to adjust the ability parameter and the item-specific difficulty parameter so that the probability of observing this correct answer is maximized. Since there is only 1 student answering one question, the percentage of times the student answers that question correctly (imagining the student answering the question repeatedly) can be used to directly estimate the difference between the student’s ability parameter and the item difficulty parameter. Although this difference can be uniquely estimated in this simple example, there are not enough constraints on this problem to figure out a specific numerical value for the student’s ability parameter and a specific numerical value for the item difficulty parameter. In this case, we refer to the parameter estimation problem as non-identifiable.

The total number of parameters is equal to the number of students plus the number of unique exam questions, since each student has his or her unique ability parameter which needs to be estimated and each exam question has its unique item-difficulty parameter which needs to be estimated. The number of data points is the number of students multiplied by the number of unique exam questions. If the total number of parameters is greater than or equal to the number of data points, then the parameter estimation problem is non-identifiable. In other words, having more data points than parameters is a necessary but not sufficient condition for the parameters to be identifiable.

In order to estimate the item-specific parameters and student-specific parameters for the case where there is more than 1 student and more than 1 unique exam question using maximum likelihood estimation, we use the following procedure. Suppose a student gets the first two questions correct on the exam and gets the third question wrong. Then the observed probability of that pattern of responses under our probability model would be the predicted probability question 1 is correctly answered computed from the student’s ability parameter and the question 1 item difficulty parameter multiplied by the predicted probability question 2 is correctly answered computed from the student’s ability parameter and the question 2 item difficulty parameter multiplied by the predicted probability question 3 is incorrectly answered computed from the student’s ability parameter and the question 3 item difficulty parameter. The product of these three probabilities is the probability of the student’s observed response pattern or the Likelihood of the Student Response pattern. If there are N subjects, then the probabilities of the N response patterns from the N subjects are multiplied together to obtain what is called the Likelihood of the Data. Notice that we are assuming that the response to one exam question is conditionally independent of the response to another exam question given that the subject’s ability is known and the item difficulty parameter values of the two questions are known. We are also assuming that the responses of one subject are independent of the responses of another subject.
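
Here is a small sketch in Python of these likelihood calculations. The item difficulty values and the response pattern are hypothetical; the point is simply that the Likelihood of the Data is a product of predicted probabilities, first across the questions within a student and then across students.

```python
import math

def p_correct(ability, difficulty):
    """One-parameter IRT model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def pattern_likelihood(ability, difficulties, responses):
    """Probability of one student's response pattern (1 = correct, 0 = incorrect),
    assuming responses are conditionally independent given the ability and
    item difficulty parameters."""
    likelihood = 1.0
    for difficulty, response in zip(difficulties, responses):
        p = p_correct(ability, difficulty)
        likelihood *= p if response == 1 else (1.0 - p)
    return likelihood

def data_likelihood(abilities, difficulties, response_matrix):
    """Likelihood of the Data: the product of the N students'
    response-pattern likelihoods."""
    likelihood = 1.0
    for ability, responses in zip(abilities, response_matrix):
        likelihood *= pattern_likelihood(ability, difficulties, responses)
    return likelihood

# Hypothetical example: a student with ability 1.0 who answers the first
# two questions correctly and the third question incorrectly.
print(pattern_likelihood(1.0, [-1.0, 0.0, 2.0], [1, 1, 0]))
```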

This resulting probability, which is the product of the probabilities of the response patterns of the N students, is functionally dependent upon the N ability-specific parameters for the N students and the M item-difficulty parameters associated with the M unique exam questions. The total number of parameters is N+M and the total number of data points is MN. So a necessary but not sufficient condition for the parameter values to be identifiable is that M multiplied by N is greater than M plus N.

Keeping in mind these issues of having enough students and questions for the analysis to make sense, we now have an optimization problem. We simply maximize the Likelihood of the Data, which is the product of the likelihoods of all student response patterns. This illustrates the case of simultaneous estimation of the ability and item difficulty parameters.
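
As a rough illustration of what simultaneous (joint) maximum likelihood estimation might look like, here is a gradient-ascent sketch in Python. This is not the specific algorithm used in standard IRT software; a practical implementation would add a convergence test and special handling for students (or items) whose responses are all correct or all incorrect, and the simulated data in the usage example are hypothetical.

```python
import numpy as np

def joint_mle(responses, steps=500, lr=1.0):
    """Simultaneous maximum likelihood estimation of the N student ability
    parameters and the M item difficulty parameters by gradient ascent on
    the log-likelihood of an N x M matrix of 0/1 responses."""
    n_students, n_items = responses.shape
    abilities = np.zeros(n_students)
    difficulties = np.zeros(n_items)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
        residual = responses - p                     # observed minus predicted
        abilities += lr * residual.mean(axis=1)      # gradient step for each student
        difficulties -= lr * residual.mean(axis=0)   # gradient step for each item
        difficulties -= difficulties.mean()          # anchor the scale (identifiability)
    return abilities, difficulties

# Hypothetical usage: simulate 200 students answering 20 items from the
# model and then (approximately) recover the parameters.
rng = np.random.default_rng(0)
true_abilities = rng.normal(size=200)
true_difficulties = rng.normal(size=20)
p_true = 1.0 / (1.0 + np.exp(-(true_abilities[:, None] - true_difficulties[None, :])))
data = (rng.random(p_true.shape) < p_true).astype(float)
estimated_abilities, estimated_difficulties = joint_mle(data)
```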

In practice, it is usually more convenient and effective to use a “calibrated exam”. That is, an exam in which the item difficulty parameters have already been estimated and are treated as constants. An advantage of such a procedure is that the student’s ability parameter can be reliably estimated using fewer data points. Another advantage of such a procedure is that since the item difficulty parameters are known in advance, one knows in advance which items are going to be difficult and which are going to be easy. This is also helpful in comparing items between exams, so an item on one exam can be swapped for an item on another exam if the two items have the same difficulty. All of these ideas will be important when we talk about Adaptive Testing in Item Response Theory. But before talking about this issue, it is important to discuss what type of methodology should be used to estimate the item difficulty parameters.

One could, of course, simply use the simultaneous estimation method to estimate student ability parameters and item difficulty parameters. This is not necessarily a bad approach but it may not be the best approach. An alternative methodology is to integrate out the ability parameters and optimize the resulting function. This will give better quality results but the estimation procedure is much more complicated. In future episodes of Learning Machines 101 we will discuss how to do this type of multidimensional integration using Monte Carlo Approximation Methods and also how to do it using “Laplace Approximation” methods which approximate the integrand of the multidimensional integral with a simpler function which is easy to integrate in high-dimensions.
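
To give a flavor of what “integrating out the ability parameters” means, here is a sketch that approximates the integral for a single student’s response pattern with a simple grid quadrature against an assumed standard normal ability distribution. Real marginal maximum likelihood implementations typically use Gauss-Hermite quadrature, Monte Carlo methods, or the Laplace approximation mentioned above; the difficulties and responses below are hypothetical.

```python
import numpy as np

def marginal_likelihood(difficulties, responses, n_points=41):
    """Approximate the likelihood of one student's response pattern with the
    ability parameter integrated out against a standard normal density,
    using a simple grid quadrature over the ability axis."""
    grid = np.linspace(-4.0, 4.0, n_points)
    weights = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
    weights /= weights.sum()                      # discrete approximation of the prior
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - np.asarray(difficulties)[None, :])))
    responses = np.asarray(responses)
    pattern = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    return float(np.sum(weights * pattern))

# Hypothetical example: three items, response pattern correct-correct-incorrect.
print(marginal_likelihood([-1.0, 0.0, 2.0], [1, 1, 0]))
```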

There is one more concept that needs to be introduced before we can talk about the concept of Adaptive Testing. This is the concept of “sampling error”.  Suppose we assume that the item difficulty parameters have been estimated on some other very very large data set so that we are now going to treat the item difficulty parameter values we have estimated as constants.

Since the item difficulty parameter values are constants, we can design an exam which has some easy questions (questions with small item difficulty parameter values) and some hard questions (questions with large item difficulty parameter values). Also, to keep the discussion simple, let’s fantasize and assume that not only have we calculated the item difficulty parameter values in advance but also that all of these item difficulty parameter values are correctly calculated and our probability model of the student’s responses is perfect. We can and must relax these assumptions, but just assume they hold for now.

Under these assumptions, suppose a student takes an exam with 10 questions. The model now has only one free parameter, which is the student’s ability parameter. This parameter is common to all 10 questions. Again, we use maximum likelihood estimation to find the student’s ability parameter value that makes the likelihood of the observed student response pattern for this exam as large as possible. Suppose as a result of this parameter estimation procedure we obtain an ability parameter value equal to TWO. Now we add another question to the exam so that the exam has 11 questions, we re-estimate the student’s ability parameter value, and we find that it is equal to TWO and one-tenth. So what is going on? Why did the student’s ability parameter value change from TWO to TWO and one-tenth? This is a simple consequence of the fundamental concept of maximum likelihood estimation which states that as the number of data points becomes large, the maximum likelihood estimates get closer and closer to a parameter value which specifies a best approximation to the distribution of data in the environment. In other words, we need to ask a sufficient number of exam questions in order to reliably estimate the student’s ability parameter value.
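
Here is a minimal sketch of estimating the single ability parameter on a calibrated exam by gradient ascent on the log-likelihood. The ten item difficulty values and the response pattern are made up for illustration; note that a student who answers every item correctly (or incorrectly) has no finite maximum likelihood estimate, which real systems handle separately.

```python
import numpy as np

def estimate_ability(difficulties, responses, steps=500, lr=0.1):
    """Maximum likelihood estimate of a single student's ability parameter
    when the item difficulty parameters are treated as known constants
    (a 'calibrated exam')."""
    difficulties = np.asarray(difficulties, dtype=float)
    responses = np.asarray(responses, dtype=float)
    ability = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(ability - difficulties)))
        ability += lr * np.sum(responses - p)   # gradient of the log-likelihood
    return ability

# Hypothetical 10-item calibrated exam.  Adding an 11th item and
# re-estimating will generally shift the estimate slightly, which is the
# sampling variation discussed above.
difficulties = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5]
responses    = [ 1,    1,  1,    1, 1,   1, 0,   1, 0,   0]
print(estimate_ability(difficulties, responses))
```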

Let’s take this idea one step further. Suppose that we test the student with 3 different exams. Each exam is assumed to have M items. As before, assume that all item difficulty parameter values have already been estimated from a previous data analysis. Because we know the item difficulty parameter values, we can “equate” the exams so that the first item of all 3 exams has the same difficulty parameter value, the second item of all 3 exams has the same difficulty parameter value, and so on. We will call these 3 exams…equated exams.

From these three different equated exams when M=2, we obtain three different estimates of ability: 1, 2, and 3. The average of these estimated ability parameter values is equal to 2, and their standard deviation is equal to 1. This standard deviation is called the “sampling error” for an exam of length M=2. Now suppose that the length of the exam is M=200, so that we ask the student 200 questions rather than 2. We construct three different equated exams where M=200 and obtain three different estimates of ability: 2.1, 2.2, and 2.3. By asking more questions on each of the three exams and using equated exams, the different estimates of the student’s ability parameter value begin to converge.

Notice that we estimated the “sampling error of ability” for an exam of length M by presenting several exams of length M to the student. In future episodes of Learning Machines 101 we will show how we can use simulation methods to estimate the sampling error of ability for an exam of length M by only presenting a single exam of length M to the student. Such methods are sometimes called nonparametric bootstrap simulation methods. In other future episodes of Learning Machines 101 we will show how we can derive a mathematical formula to estimate the sampling error of ability for an exam of length M by only presenting a single exam of length M to the student and without doing any simulations using the “sandwich covariance matrix estimator”. However, each of these methods requires its own podcast!! So, for today, simply assume that if we have a student taking an exam of length M we can estimate the sampling error of their estimated ability parameter as well as the ability parameter value itself. We can do this in one of three different ways as previously discussed. We can administer multiple exams of length M to the student (which is not desirable in practice), we can administer a single exam of length M to the student and do simulation studies, or we can administer a single exam of length M to the student and calculate the sampling error with a special formula. To summarize, the “sampling error of ability” refers to the variation in the observed estimated ability of a student which is due to a combination of variations in student performance as well as variations in exam content.
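
As one illustration of the simulation approach, here is a parametric bootstrap sketch, a close cousin of the nonparametric bootstrap mentioned above: simulate response patterns from the fitted model, re-estimate the ability parameter on each simulated exam, and take the standard deviation of those re-estimates as the sampling error. All of the numbers are hypothetical, and degenerate all-correct or all-incorrect simulated patterns would need special handling in practice.

```python
import numpy as np

def estimate_ability(difficulties, responses, steps=500, lr=0.1):
    """Maximum likelihood estimate of ability with known item difficulties."""
    ability = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(ability - difficulties)))
        ability += lr * np.sum(responses - p)
    return ability

def bootstrap_sampling_error(difficulties, ability_estimate, n_sims=1000, seed=0):
    """Parametric bootstrap sketch of the sampling error of the ability
    estimate for a single exam of length M: simulate response patterns
    from the fitted model, re-estimate ability on each simulated pattern,
    and report the standard deviation of those re-estimates."""
    rng = np.random.default_rng(seed)
    difficulties = np.asarray(difficulties, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(ability_estimate - difficulties)))
    estimates = [
        estimate_ability(difficulties, (rng.random(len(difficulties)) < p).astype(float))
        for _ in range(n_sims)
    ]
    return float(np.std(estimates))

# Hypothetical 10-item calibrated exam with the ability currently estimated at 0.0.
print(bootstrap_sampling_error(np.linspace(-2.0, 2.5, 10), 0.0))
```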

Given the above background, we can now discuss one approach for implementing adaptive assessment using Item Response Theory. First, we obtain an extremely large data sample so that item-specific parameters and student-specific parameters can be reliably estimated using simultaneous maximum likelihood estimation. Second, a short exam consisting of equal numbers of low-difficulty, moderate-difficulty, and high-difficulty items is presented to a student. The student’s ability parameter and its sampling error are then estimated using either simulation studies or appropriate mathematical formulas. If the sampling error is too large, then the computer selects an additional question to be presented to the student. This question, however, is not selected randomly. Suppose there are 10,000 questions in the database. The computer estimates how much the sampling error of the ability parameter estimate would decrease if question 1 were presented to the student. The computer then estimates how much the sampling error of the ability parameter estimate would decrease if question 2 were presented to the student. This process is continued until the computer has identified which question will reduce the sampling error of the ability parameter estimate most rapidly. If the sampling error of the estimator of the student’s ability parameter is still not sufficiently small, then this procedure is repeated and the computer chooses another question designed to reduce the sampling error of the student’s ability parameter estimate most rapidly.
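
The procedure above chooses the next question by directly estimating how much each candidate question would shrink the sampling error. A commonly used and closely related shortcut, sketched below, is to pick the item with the largest Fisher information at the current ability estimate, since that is the item expected to shrink the sampling error fastest; for the one-parameter model this is simply the item whose difficulty is closest to the current ability estimate. The function and variable names are hypothetical.

```python
import math

def select_next_item(ability_estimate, candidate_difficulties, administered):
    """Pick the unadministered item expected to shrink the sampling error of
    the ability estimate the fastest.  For the one-parameter model the
    Fisher information of an item at ability theta is p * (1 - p), which is
    largest when the item difficulty is close to theta."""
    best_item, best_information = None, -1.0
    for item, difficulty in enumerate(candidate_difficulties):
        if item in administered:
            continue
        p = 1.0 / (1.0 + math.exp(-(ability_estimate - difficulty)))
        information = p * (1.0 - p)
        if information > best_information:
            best_item, best_information = item, information
    return best_item

# Hypothetical usage: the student's current ability estimate is 0.4 and the
# first two items in the pool have already been administered.
print(select_next_item(0.4, [-2.0, -1.0, 0.0, 0.5, 2.0], administered={0, 1}))
```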

Computerized Adaptive Testing (CAT) using Item Response Theory (IRT) is a powerful tool which can be used not only to efficiently guide assessment processes in both children and adults but also to guide tutoring processes by identifying which items are “too easy” or “too challenging” and adjusting the choice of items in real-time.

The website for the International Association for Computerized and Adaptive Testing is a great resource for work in the area of Item Response Theory. On this website you will find information about current research and conferences in the field of Item Response Theory, information about the Journal of Computerized Adaptive Testing, and Computerized Adaptive Testing Software which you can download for free as well as Computerized Adaptive Testing software intended for commercial applications. The specific hyperlinks for these websites can be found in the show notes for this episode at: www.learningmachines101.com .

There are lots of variations of Item Response Theory which we have not discussed here. For example, a two-parameter Item Response Theory model includes not only a difficulty level parameter but also a discrimination parameter. The discrimination parameter is an additional item-specific parameter which multiplies the difference between the ability and difficulty parameters, controlling how rapidly the probability of a correct answer changes as that difference changes. A third item-specific parameter is also sometimes introduced which adjusts the guessing rate. Specifically, this third parameter is a number between zero and one which can be interpreted as the probability that the student will guess the correct answer.
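
Here is a small sketch of how these extra item-specific parameters enter the model. Setting the discrimination to one and the guessing rate to zero recovers the one-parameter model discussed throughout this episode; the function name is, again, just an illustrative choice.

```python
import math

def p_correct_3pl(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Two- and three-parameter IRT models.  The discrimination parameter
    multiplies the (ability - difficulty) difference, and the guessing
    parameter is the probability of answering correctly by guessing."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# A highly discriminating item with a 25% guessing rate (e.g., a
# four-alternative multiple choice question).
print(p_correct_3pl(ability=0.0, difficulty=1.0, discrimination=2.0, guessing=0.25))
```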

The Item Response Theory model is an extremely simple but powerful model. It embodies a large number of important principles which could be naturally extended to more sophisticated models. A fundamental limitation of the Item Response Theory model is that it is based upon the idea that a student’s ability can be represented as a single number. Such an assumption implies that the representation of the mastery of a skill is one-dimensional. In reality, this assumption is clearly problematic. Suppose that we are interested in assessing knowledge in a particular domain. Some individuals may be experts in some areas of that domain and novices in other areas of that domain. It does not make sense to attempt to impose some sort of ordering on those individuals which classifies each individual as having a skill level which is less than that of one individual yet greater than that of another. For example, suppose we have a dozen students who are doctoral students specializing in Philosophy. All of the students are experts but some will have specialized knowledge in particular areas and less knowledge in other areas. This principle holds for professors, PhD students, and first graders. A first grader may have difficulty answering a reading comprehension question because of difficulty with word decoding strategies, working memory deficits, or a lack of prior knowledge. These comments emphasize that a better representation of a student’s understanding of a knowledge domain is not a single number but rather a list of numbers which specifies a knowledge or skill mastery profile.

Another fundamental limitation of the classical Item Response Theory approach is that the answers to multiple choice or multiple response questions may not provide important insights into how someone’s understanding and knowledge of a particular domain is organized. Constructed response questions such as essay questions or questions asking students to recall or summarize key ideas can provide important clues to the organization of mental structures.

In future episodes of Learning Machines 101, we will talk about state-of-the-art approaches to knowledge assessment such as: Automated Essay Grading Technology, Knowledge Space Theory, and Topical Hidden Markov Models. These more recent developments in educational technology represent important directions in Educational Technology Assessment which address fundamental limitations of Item Response Theory.

So thanks for your participation in today’s show! I greatly appreciate your support and interest in the show!! If you liked this show, then I would really appreciate it if you would share this show with your friends and colleagues via email, twitter, facebook, Linked-In, or any other way you typically communicate with your friends and colleagues.

If you are a member of the Learning Machines 101 community, please update your user profile. If you look carefully you can provide specific information about your interests on the user profile when you register for learning machines 101 or when you receive the bi-monthly Learning Machines 101 email update!

You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!

Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.

Also check out the Statistical Machine Learning Forum on LinkedIn and Twitter at “lm101talk”.

From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group!

Further Reading:
Wikipedia Item Response Theory Article
International Association for Computerized and Adaptive Testing
Automated Essay Grading Technology
Knowledge Space Theory
Topical Hidden Markov Models
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems.

Related Episodes of Learning Machines 101:
Episode 10 (Maximum Likelihood Estimation)
Episode 14 (Function Approximation)

 
