Test Scoring - Explanation of Item Statistics

The following description of the test item statistics produced by this program is general in nature. For more specific information concerning any particular statistic, consult a good statistics or tests-and-measurements textbook.

The column headed Item gives the question number. The columns headed A, B, C, D, E, and Omitted give the percent and number of students who chose each alternative or omitted the item. These figures can be used to examine the pattern of responses for each multiple-choice item.
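As an illustration only (this is not the program's own code), the following Python sketch tallies one item's response distribution; the response list is hypothetical:

    # Tally percent and number of students per response; "" marks an omit.
    from collections import Counter

    responses = ["A", "C", "C", "B", "C", "", "D", "C"]  # hypothetical data

    counts = Counter(responses)
    n = len(responses)
    for choice in ["A", "B", "C", "D", "E", ""]:
        label = choice if choice else "omitted"
        count = counts.get(choice, 0)
        print(f"{label:>7}: {count:3d} ({100.0 * count / n:5.1f}%)")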

The difficulty index of a test item indicates the proportion of students who respond correctly to the item. For example, a difficulty of 65 indicates that 65 percent of the students answered the item correctly. The higher the difficulty index, the easier the item. A classroom test covering related subject matter should contain items with a fairly wide range of difficulty values. However, items with indices at or below the chance level (25 or lower for an item with 4 alternatives, 20 or lower for an item with 5 alternatives) are undesirable. Equally undesirable are extremely easy items with difficulties approaching 100, as they merely add a constant to the scores. Test reliability and validity will be maximized if most item difficulties are somewhat easier than halfway between the chance level and 100. Under ordinary circumstances, then, a test consisting of items with 4 alternatives should contain many items with difficulties in the 60-85 range, with the remainder scattered between 25 and 100. Tests consisting of items with 2 alternatives (true-false items) should have difficulties between 50 and 100, with a concentration in the 75-90 range.
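A minimal sketch of the difficulty computation, assuming items are scored 1 for correct and 0 for incorrect (the data below are hypothetical, not output of this program):

    # Difficulty index: percent of students answering the item correctly.
    item_scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical item column

    difficulty = 100.0 * sum(item_scores) / len(item_scores)
    print(f"Difficulty index: {difficulty:.0f}")  # 70, i.e., 70% correct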

The point-biserial correlation coefficient measures the relationship between the score on an item and the score on the test as a whole. The value of this statistic ranges between -100 and +100. A high positive value indicates that students who answered the item correctly also received higher test scores than those who answered it incorrectly. A high negative value indicates the reverse: students who answered the item correctly received lower test scores than those who answered it incorrectly. A near-zero value indicates that there is little relationship between the item score and the test score. It is desirable to retain items with a high positive correlation and to eliminate those with near-zero or negative values. As a rough guide, items with negative or near-zero (10 or less) correlations should be eliminated or substantially revised, and items with low positive (10-30) correlations should be studied to determine how they might be improved.
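The following sketch computes the uncorrected point-biserial coefficient from hypothetical data; the program may instead use a corrected variant that excludes the item from the total score:

    import statistics

    item = [1, 0, 1, 1, 0, 1, 0, 1]            # 1 = correct, 0 = incorrect
    totals = [42, 30, 45, 40, 28, 44, 33, 39]  # total test scores

    p = sum(item) / len(item)                  # proportion answering correctly
    m1 = statistics.mean(t for t, i in zip(totals, item) if i == 1)
    m0 = statistics.mean(t for t, i in zip(totals, item) if i == 0)
    s = statistics.pstdev(totals)              # standard deviation of totals

    r_pb = (m1 - m0) / s * (p * (1 - p)) ** 0.5
    print(f"Point-biserial: {100 * r_pb:+.0f}")  # scaled by 100, as reported here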

The Kuder-Richardson internal consistency formula number 20 (KR-20) has been used to compute the reliability estimate provided in this analysis. A reliability coefficient of this type gives an indication of the extent to which individuals taking the test again would receive the same scores. Values of the Kuder-Richardson reliability estimate range between 0.000 and 1.000; a value close to 1.000 indicates that the test exhibits a high degree of reliability. Estimates should be interpreted cautiously if large numbers of students are unable to complete the test within the allotted time. For a typical 50-minute classroom examination covering related subject matter, a reliability coefficient of at least .75 is desirable. Reliability can be improved through item revision based upon the item analysis data computed by this program. Lengthening the test (when this is practical) will also increase reliability, particularly in the case of short examinations.
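A minimal sketch of the standard KR-20 formula, KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where k is the number of items, p is each item's proportion correct, and q = 1 - p. The score matrix below is hypothetical (rows = students, columns = items), and so small that the resulting estimate is only illustrative:

    import statistics

    scores = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 0],
    ]

    k = len(scores[0])                        # number of items
    totals = [sum(row) for row in scores]
    var_total = statistics.pvariance(totals)  # variance of total scores

    pq = 0.0
    for col in zip(*scores):                  # one column per item
        p = sum(col) / len(col)               # item difficulty (proportion)
        pq += p * (1 - p)

    kr20 = (k / (k - 1)) * (1 - pq / var_total)
    print(f"KR-20 reliability: {kr20:.3f}")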

The standard error of measurement is an estimate of the probable extent of error in test scores. It is interpreted in the same manner as a standard deviation. A standard error of measurement of 3.500, for example, indicates that for any particular test score, the odds are 2 to 1 that the student's true score (the average score he would earn on several similar tests) will not deviate from the obtained score by more than 3.500 points. The more reliable and error-free the test, the smaller the standard error of measurement. This direct application to scores makes the standard error of measurement especially useful when evaluating differences among students or assigning grades.
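Assuming the usual relation SEM = SD * sqrt(1 - reliability), a brief sketch with hypothetical values:

    # Standard error of measurement from score SD and KR-20 reliability.
    sd_total = 6.2        # hypothetical standard deviation of test scores
    reliability = 0.82    # hypothetical KR-20 estimate

    sem = sd_total * (1 - reliability) ** 0.5
    print(f"Standard error of measurement: {sem:.3f}")  # about 2.630
    # Odds are roughly 2 to 1 that a true score lies within one SEM
    # of the obtained score (about 68% of a normal distribution).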