By Amanda A. Wolkowitz, PhD, Alpine Testing Solutions Inc., Brett P. Foley, PhD, Alpine Testing Solutions Inc., Jared Zurn, AIA, CAE, NCARB, Corina Owens, PhD, Alpine Testing Solutions Inc., and Jim Mendes, Adobe
The effect of the number of options listed for a single-answer multiple-choice question is a topic that has been discussed repeatedly for more than 70 years. Multiple researchers have come to a common conclusion: 3-option multiple-choice (MC3) items tend to perform just as well psychometrically as, if not better than, 4-option multiple-choice (MC4) items.1-5, 7-11 A few reasons to consider MC3 items include:
- MC3 items may improve overall content validity of a test6 because with shorter items, more content can be covered in the same amount of time.
- While the theoretical probability of guessing the correct answer does increase when switching from an MC4 to an MC3 item, studies have found no practical difference in item performance when comparing MC3 items to MC4 or 5-option multiple-choice (MC5) items.1, 6, 8
- MC3 items can be developed more quickly, resulting in less development expense.
In 2020, the National Council of Architectural Registration Boards (NCARB), which develops the Architect Registration Examination® (ARE®), worked with Alpine Testing Solutions Inc. (Alpine) to convert select MC4 items that had at least one poorly performing distractor to MC3 items. The goal was to convert 10-20% of the existing operational (i.e., scored) multiple-choice items and include them as operational items on new forms. (The exam includes other item types that were not changed, such as check-all-that-apply, quantitative fill-in-the-blank, drag-and-drop and hotspots.) The intent was to avoid pretesting the converted items and allow the test publisher to maintain the size of their operational item bank. Including MC3 items that were operational as well as additional MC3 items that were in the pretest blocks of the test helped ensure candidates put forth equal effort on all items.
The high-level steps that were implemented to convert MC4 to MC3 items, estimate the MC3 item parameters, estimate the initial cut scores for the new forms and confirm the findings were as follows:
1. Identify MC4 items with at least one non-functioning distractor (NFD). The criteria* used for the ARE items were:
- a. Distractor selected by <5% of candidates who answered the item incorrectly OR
- b. Distractor with a positive item-total score correlation.**
If an item had more than one NFD, then the distractor with lower endorsement was selected as the option to remove.*** If the NFDs had equal endorsement, then the distractor with the higher item-total score correlation was selected as the option to remove.
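As a rough illustration of the Step 1 flagging rule, the sketch below applies criteria (a) and (b) to precomputed distractor statistics and then picks the option to remove. The `Distractor` class, the function names and the sample counts are hypothetical. The sketch also assumes that the number of candidates who answered incorrectly equals the sum of the distractor counts (omitted responses are ignored).

```python
from dataclasses import dataclass


@dataclass
class Distractor:
    label: str
    count: int           # candidates who selected this distractor
    item_total_r: float  # correlation of selecting it with total score


def flag_nfds(distractors, min_share=0.05):
    """Return the distractors that meet criterion (a) or criterion (b)."""
    # Assumption: every incorrect response chose one of these distractors.
    total_incorrect = sum(d.count for d in distractors)
    flagged = []
    for d in distractors:
        share = d.count / total_incorrect if total_incorrect else 0.0
        if share < min_share or d.item_total_r > 0:
            flagged.append(d)
    return flagged


def option_to_remove(distractors, min_share=0.05):
    """Pick one NFD: lowest endorsement, ties broken by higher correlation."""
    flagged = flag_nfds(distractors, min_share)
    if not flagged:
        return None
    return min(flagged, key=lambda d: (d.count, -d.item_total_r))


# Hypothetical MC4 item: key is A, distractors are B, C and D.
item = [
    Distractor("B", count=120, item_total_r=-0.20),
    Distractor("C", count=4, item_total_r=-0.05),   # 2% of incorrect responses
    Distractor("D", count=76, item_total_r=0.03),   # positive correlation
]
flagged_labels = [d.label for d in flag_nfds(item)]
removed = option_to_remove(item)
print(flagged_labels, removed.label)
```

Here C is flagged under criterion (a) and D under criterion (b). C has the lower endorsement, so it is the option selected for removal.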
2. Review and approve the selected items for conversion from MC4 to MC3.
Review and approval of the MC3 items used on the operational forms was completed by subject matter experts.
3. Estimate the Rasch item measures for the newly converted MC3 items in two ways:
- a. Assuming candidates who selected an NFD for an item would answer the item correctly if it were an MC3 item; and
- b. Assuming candidates who selected an NFD for an item would not answer the item correctly if it were an MC3 item.
It was expected that the actual Rasch item measure would likely fall between these two extremes based on the additional assumption that the option selected by candidates who did not select an NFD would be unchanged. This expectation was compared to the actual Rasch item measures estimated using real data after the exam was administered (see Step 5).
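The two scoring assumptions in Step 3 can be sketched as a simple recoding of the original MC4 responses. This is only an illustration: the function names and the response data are hypothetical, and the log-odds transform at the end is a crude stand-in for difficulty, not a Rasch calibration (which would estimate all items and candidates jointly in dedicated software).

```python
import math


def recode_bounds(responses, key, nfd):
    """Score one item twice, as if the NFD option had been removed.

    optimistic  (Step 3a): candidates who chose the NFD are scored correct.
    pessimistic (Step 3b): candidates who chose the NFD are scored incorrect.
    All other responses are assumed to be unchanged.
    """
    optimistic = [1 if r in (key, nfd) else 0 for r in responses]
    pessimistic = [1 if r == key else 0 for r in responses]
    return optimistic, pessimistic


def rough_logit_difficulty(scores):
    """Log-odds of an incorrect response; NOT a Rasch item measure."""
    p = sum(scores) / len(scores)
    if not 0.0 < p < 1.0:
        raise ValueError("proportion correct must be strictly between 0 and 1")
    return math.log((1.0 - p) / p)


# Hypothetical responses to one MC4 item: key is A, the NFD is D.
responses = ["A"] * 60 + ["B"] * 25 + ["C"] * 12 + ["D"] * 3
optimistic, pessimistic = recode_bounds(responses, key="A", nfd="D")

easy_bound = rough_logit_difficulty(optimistic)   # lower (easier) estimate
hard_bound = rough_logit_difficulty(pessimistic)  # higher (harder) estimate
print(sum(optimistic), sum(pessimistic), easy_bound < hard_bound)
```

The optimistic scoring always yields the easier estimate, so the pair brackets the range in which the live MC3 difficulty was expected to fall. The fewer candidates who chose the NFD, the narrower that range.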
4. Assemble the new forms using a pre-equating model with the following constraints:
- a. Only use the items that had small differences in the Rasch measures found in Steps 3a/3b.
- b. Estimate the cut score in two ways:
- i. Based on the Rasch measure in Step 3a
- ii. Based on the Rasch measure in Step 3b
- c. Use the Rasch measures in Step 4b to estimate the corresponding raw cut scores for the forms twice (once for the low estimates and once for the high estimates). Ensure forms were built so that the two estimated cut scores were:
- i. Within 0.15 raw score points of each other, and
- ii. Within 0.25 of the same integer cut score.
5. NCARB released two of the four forms in each division of the ARE in December 2020. Scores were delayed due to other changes taking place with the ARE that required a psychometric evaluation before release. This delay allowed the operational MC3 item parameters previously estimated from the data and the initial cut scores estimated in Step 4 to be compared to newly calculated Rasch item measures based on live MC3 data. Using these results, the initial cut scores were also verified.
6. Following the success of Step 5, the final two forms for each division were released with the estimated cut score in Step 4 without delayed scoring in February 2021. The cut scores were confirmed as soon as sufficient data was available.
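The form-assembly check in Step 4c can be sketched as follows. The article does not say how the Rasch measures were translated into raw cut scores. This sketch assumes the common approach of summing the Rasch probabilities of a correct response at the passing standard (the test characteristic curve). The function names, the `theta_cut` value and the item difficulties are all hypothetical, and "the same integer cut score" is read here as the integer nearest the midpoint of the two estimates.

```python
import math


def expected_raw_score(theta, difficulties):
    """Expected raw score at ability theta under the Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def cut_scores_agree(theta_cut, low_measures, high_measures,
                     max_gap=0.15, max_from_integer=0.25):
    """Apply the two Step 4c constraints to one candidate form.

    low_measures / high_measures: difficulties of every item on the form,
    using the Step 3a or Step 3b estimate for each converted MC3 item.
    """
    raw_a = expected_raw_score(theta_cut, low_measures)
    raw_b = expected_raw_score(theta_cut, high_measures)
    target = round((raw_a + raw_b) / 2)  # assumed integer cut score
    ok = (
        abs(raw_a - raw_b) <= max_gap
        and abs(raw_a - target) <= max_from_integer
        and abs(raw_b - target) <= max_from_integer
    )
    return ok, raw_a, raw_b, target


# Hypothetical 4-item form; only the third item was converted to MC3.
low = [-1.0, -0.5, 0.5, 1.0]
high = [-1.0, -0.5, 0.6, 1.0]
ok, raw_a, raw_b, target = cut_scores_agree(0.0, low, high)
print(ok, round(raw_a, 3), round(raw_b, 3), target)

# Adding an item of average difficulty moves both estimates close to 2.5,
# so no single integer is within 0.25 of them and the form fails the check.
ok_5, *_ = cut_scores_agree(0.0, low + [0.0], high + [0.0])
print(ok_5)
```

A form that fails either constraint would be reassembled with a different mix of items until both estimates point to the same integer cut score.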
The conversion/estimation method was successful:
- For all 24 exam forms, the effective (rounded) initial cut score estimate equaled the cut score estimate obtained after calibrating the MC3 items with live data.
- Ninety-three percent of the final cut scores based on live data were within 0.15 raw score points of the initially estimated cut scores. The greatest difference between the initially estimated and confirmed cut scores was 0.37 raw score points.
- Of the 176 converted MC3 items, 88% were within 0.50 logits of the initially estimated Rasch measure ranges. While 0.50 logits is an arbitrary threshold for comparison, it does indicate that the Rasch measures estimated for most items through the method described above were near the observed Rasch measures.****
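The item-level comparison in the last bullet can be illustrated with a small sketch. It assumes one reading of "within 0.50 logits of the estimated range": the distance from the observed measure to the nearest end of the interval bounded by the Step 3a and Step 3b estimates, with zero distance for any measure inside the interval. The function names and the data are hypothetical and are not the ARE results.

```python
def distance_to_range(observed, low, high):
    """Distance in logits from an observed measure to the range [low, high]."""
    if observed < low:
        return low - observed
    if observed > high:
        return observed - high
    return 0.0


def share_within(items, tolerance=0.50):
    """Proportion of items whose observed measure is near the estimated range.

    items: iterable of (observed, low_estimate, high_estimate) tuples.
    """
    items = list(items)
    hits = sum(
        1 for obs, low, high in items
        if distance_to_range(obs, low, high) <= tolerance
    )
    return hits / len(items)


# Hypothetical (observed, Step 3a estimate, Step 3b estimate) triples.
items = [
    (0.10, -0.05, 0.20),   # inside the range
    (0.90, 0.00, 0.30),    # 0.60 logits above the range
    (-0.40, -0.20, 0.10),  # 0.20 logits below the range
    (1.20, 0.10, 0.40),    # 0.80 logits above the range
]
share = share_within(items)
print(share)
```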
Candidate reaction to the MC3 item type was positive:
- Because candidates were informed that new MC3 items could be operational, they completed these items with as much intentionality as other items on the forms.
- The test publisher collects candidate feedback on each exam delivery with a post exam survey as well as via a moderated online community. Candidates have expressed no concerns through either channel over experiencing both MC4 and MC3 items during the same exam administration.
We recommend that other programs planning a similar conversion consider the following:
- When converting items, choose a stringent NFD definition so that only a small percentage of the total items is converted.
- Only use converted items with small differences between the MC3 Rasch values (as estimated in Steps 3a/3b).
- Only use this process for the initial set of forms involved in the MC3 conversion. On future pre-equated forms, newly written or additional converted MC3 items should follow a traditional pretesting plan.
* The criteria established for the ARE exams reflected a practical decision based on NCARB’s goal to convert 10-20% of the existing operational MC4 items and an analysis of the item and option level data. For this program, these criteria did a reasonable job of identifying the worst performing distractors and resulted in flagging a reasonable number of items. Other programs should review their program’s goals and data to determine appropriate flagging criteria.
** The item-total score correlation (sometimes referred to as the point-biserial correlation for dichotomously scored items) is the correlation of the correct/incorrect responses for an item with the total scores earned on the exam.
*** In the cases in which there were two NFDs for an item, the item often had one distractor selected by fewer than 5% of candidates who answered the item incorrectly and another distractor that had a positive item-total score correlation. If an item had two distractors meeting the same 5% criterion or the same correlation criterion, then the item could still be converted; however, a program may choose to retire such an item.
**** The MC3 items selected for the forms were those in which the estimated Rasch measures were close to the original MC4 Rasch measures. This allowed the criteria stated in Step 4c to be more easily met. Other than this rule of thumb, we did not specify an operational definition of a “small” difficulty shift due to the MC3 conversion.
1. Baghaei, P. & Amrahi, N. (2011). The effects of the number of options on the psychometric characteristics of multiple choice items. Psychological Test and Assessment Modeling, 53(2), 197-211.
2. Bruno, J. E., & Dirkzwager, A. (1995). Determining the optimal number of alternatives to a multiple-choice test item: An information theoretic perspective. Educational and Psychological Measurement, 55(6), 959–66.
3. Cizek, G. J., Robinson, K. L., & O’Day, D. M. (1998). Nonfunctioning options: A closer look. Educational and Psychological Measurement, 58(4), 605–11.
4. Dehnad, A., Nasser, H., & Hosseini, A. F. (2014). A comparison between three- and four-option multiple choice questions. Procedia – Social and Behavioral Sciences, 98, 398-403. Retrieved from www.sciencedirect.com
5. Delgado, A. R., & Prieto, G. (1998). Further evidence favoring three-option items in multiple-choice tests. European Journal of Psychological Assessment, 14(3), 197-201.
6. Haladyna, T. M. & Rodriguez, M. C. (2013). Developing and Validating Test Items. Routledge: New York, NY.
7. Mackey, P. & Konold, T. R. (2015). What is the optimal number of distractors in exam items? Case Study. Institute for Credentialing Excellence.
8. Rodriguez, M. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.
9. Rogausch, A., Hofer, R., & Krebs, R. (2010). Rarely selected distractors in high stakes medical multiple-choice examinations and their recognition by item authors: a simulation and survey. BMC Medical Education, 10(85), 1-9. https://doi.org/10.1186/1472-6920-10-85
10. Tarrant, M. & Ware, J. (2010). A comparison of the psychometric properties of three- and four-option multiple-choice questions in nursing assessments. Nurse Education Today, 30(6), 539-543.
11. Vegada, B., Shukla, A., Khilnani, A., Charan, J., & Desai, C. (2016). Comparison between three option, four option and five option multiple choice question tests for quality parameters: A randomized study. Indian Journal of Pharmacology, 48, 571-5. Retrieved from http://www.ijp-online.com/text.asp?2016/48/5/571/190757
Amanda A. Wolkowitz, PhD
Amanda is a senior psychometrician at Alpine Testing Solutions Inc., where she works with a wide range of organizations on all aspects of the test development cycle. Amanda also serves as a psychometric assessor for the ANSI National Accreditation Board (ANAB), is a member of the board of directors for the National Center for Employee Ownership (NCEO) and is a reviewer for Practical Assessment, Research, & Evaluation (PARE). Amanda has presented and published research on a wide range of psychometric topics. Some of her most recent publications include the topics of alternative item types and equating with small sample sizes. She also was a contributing author to the scoring and equating chapter in the third edition of the Institute for Credentialing Excellence (I.C.E.) Handbook.
Brett P. Foley, PhD
Brett is the director of professional credentialing and a senior psychometrician at Alpine Testing Solutions. He has worked with many types of testing programs in licensure, certification and education. Dr. Foley received his doctorate in quantitative, qualitative and psychometric methods from the Department of Educational Psychology at the University of Nebraska-Lincoln. He is a past-president of the Northern Rocky Mountain Educational Research Association and currently serves as chair of the Nebraska Board of Engineers and Architects as a public member. His research interests include standard setting, policy considerations in testing and using visual displays to inform the test development process.
Jared Zurn, AIA, CAE
Jared is the vice president of examination and serves as a member of the senior leadership team at the National Council of Architectural Registration Boards. Zurn is responsible for the strategic direction of examination related initiatives and professional ethics initiatives, as well as oversight of and participation in research regarding the current and future states of the architectural profession. Zurn is a licensed architect and certified association executive.
Before joining NCARB, Zurn operated a sole proprietorship architectural design firm in northern Minnesota. He also served as faculty of the architectural technology program at Minnesota State Community and Technical College where he served as a division chair and led the architectural technology program in the areas of curriculum development, course assessment and program outcome assessment.
Corina Owens, PhD
Corina is the director of IT credentialing and senior psychometrician at Alpine Testing Solutions. She has a broad background in education, with specific training and expertise in educational measurement/psychometrics and advanced statistical analysis. Prior to joining Alpine Testing Solutions, she worked for RTI International as a research psychometrician where her work focused on providing analytical evaluations of data collection tools measuring educational attainment, adolescent substance abuse and experience of care. In addition, she worked at Professional Testing Inc. as a psychometrician where she provided advanced statistical and psychometric analyses to several certification and licensure organizations. She has published multiple peer reviewed articles and presented over 30 presentations and proceedings at both regional and national conferences. Corina received her doctorate in educational measurement and research from the University of South Florida.
Jim Mendes
Jim is a certification development manager on Adobe’s curriculum and credential team. In his 20-year career at Adobe, Jim has managed the development of over 200 exams supporting Adobe’s Experience Cloud, Advertising Cloud and Creative Cloud products and solutions. Prior to joining Adobe, Jim spent eight years at Microsoft in a variety of roles including technical trainer, business operations manager and exam development program manager. Jim holds degrees in mathematics (numerical analysis) and computer science and a Master of Business Administration (finance).