Development and Validation of the Mathematics Test for Tenth Grade Jordanian Students, According to the Partial Credit Model

The study aimed to develop and validate a mathematics test for 10th-grade students according to the Rasch partial credit model (PCM), using the descriptive approach as it is appropriate for the study aims. To achieve the study's objective, an essay-type test consisting of 25 items was constructed based on item response theory (IRT) according to the Rasch PCM. The first administration of the test was conducted to verify its validity and reliability. To verify the face validity of the test's objectives, the test was presented to a group of 12 arbitrators working as teachers and educational supervisors, who found that the contents are representative of the level of the goals pursued in theory. The empirical reliability was calculated for the test: the person reliability reached 0.91 and the item reliability reached 0.93. The study population consisted of all 10th-grade students at the schools belonging to the "Directorate of Education of Irbid District," numbering 7,365 students: 3,612 male and 3,753 female. A cluster sample stratified by gender was drawn, with the class section as the sampling unit; the sample size was 250 male and female students. According to the PCM, this study's findings address several issues concerning mathematics achievement by verifying the test's validity and reliability and its satisfaction of the IRT assumptions.


Introduction
Construction and validation of tests, especially academic achievement measures, involve complicated steps and procedures and the interrelationship of various ideas and latent variables. Consequently, guidelines must be followed to develop a firmly identified test with the expected outcomes. The two most essential steps in test development, as spelled out by Haladyna and Downing (2011), are: (i) item development, which includes content definition, preparation of test specifications, preparation of the item pool, content validation/expert judgment, pilot testing of the items, data analysis, and revision of test items; and (ii) item validation through item analysis. All these processes are closely linked with one another. Additionally, they are carefully carried out to ensure the validity and reliability of the instrument developed and used to estimate item parameters and a person's ability. Validity concerns how assessment systems are built. Whether the assessment tool (test) is standardized or locally designed, the aim is to use an instrument that produces a true estimate of the examinee's ability that can support valid inferences. The purposes of assessing students' learning include licensing, certification, diagnosis, and placement.
The field of educational and psychological assessment and evaluation has received increased research attention from psychologists and educators. This field's primary objective is to reveal individual differences of all kinds, whether inter-individual differences between groups or intra-individual differences. Measuring methods and instruments have varied in pursuit of this goal. In Classical Test Theory (CTT), assessment quality depends on the quality of performance and the quality of the measurement process. These efforts have led to the transition from CTT, which was used in test design for a long time in educational and psychological evaluation, to the modern approach, IRT, also known as Latent Trait Theory (LTT) (Newton & Shaw, 2014).
CTT, being a traditional theory, still attracts the measurement community in test development and analysis due to its theoretical and practical simplicity. The continuing application of CTT in item analysis stems from its "weak assumptions," which can easily be met by test data (De Champlain, 2010; Hambleton & Jones, 1993). Nonetheless, as a result of its continuous use, researchers have questioned its place in the present-day measurement community (Zaman, Kashmiri, Mubarak, & Ali, 2008). The PCM is a one-dimensional model for the analysis of responses recorded in two or more ordered categories, as is Samejima's graded response model (GRM) (Samejima, 1969). The PCM differs from the GRM, however, in that it belongs to the Rasch family of models and thus shares the identifying characteristics of that family: separable person and item parameters, sufficient statistics, and parameters that combine additively. These characteristics enable "specifically objective" comparisons of persons and items (Rasch, 1977) and permit each set of model parameters to be conditioned out of the estimation procedures for the others.
In education, assessment is an important means of identifying educational success. The results of educational evaluation serve significant functions in other academic processes: two primary tasks are measuring students' achievements and motivating and directing students' learning (Frisbie & Becker, 1991). One significant function in mathematics education is identifying how far students have mastered specific subjects such as mathematics. Besides determining students' knowledge or understanding, assessment results also identify concepts, such as mathematics concepts, that students have not mastered. Teachers or schools might improve the learning process through assessment results, and students can change their study strategies. The results of educational assessment in Jordan, especially in mathematics, have not satisfied many parties over the years. This can be seen both in students' average scores in mathematics in research and studies that have investigated this subject and in the results of international studies. Based on the results of these studies, students' achievements have not been satisfactory (Khasawneh, 2009; Rabab'h & Veloo, 2015).
Researchers and mathematics teachers have noticed students' fear of essay questions and their preference for multiple-choice questions over essay/structured questions. They have also noticed the lack of studies addressing the mathematics curriculum in particular according to the PCM, which aims to determine the difficulty coefficient for each step in answering polytomously scored items and is considered a generalization of the Rasch model for dichotomously scored items (Tuckman, 1993). Consequently, this paper addresses the problem of selecting items for an achievement test in mathematics, specifically for students in the 10th grade, because this stage is the transition from the primary stage to the secondary stage, and of demonstrating the importance and effectiveness of the PCM in achievement and other tests. The test has psychometric characteristics that allow it to be applied and used in public and private schools. According to the PCM, this study was designed to provide a mathematics test for students in the 10th grade.
The primary purpose of this study was to develop and validate the mathematics test (MT) for 10th-grade Jordanian students according to the PCM, a member of the Rasch family of IRT models. The models were used to obtain valid and reliable test items relevant to measuring students' true ability from traditional and modern measurement perspectives. In addition, the analysis determined the appropriate items that satisfied specific criteria for item quality. In light of these and other concerns, this study investigated the nature of the IRT item parameters for the Jordanian mathematics test. Overall, IRT models can be divided into two categories: uni-dimensional and multidimensional. Uni-dimensional models require a single trait (ability) dimension θ. Multidimensional IRT models describe response data hypothesized to emerge from multiple traits. However, because of the significantly increased complexity, most IRT research and applications utilize a uni-dimensional model. IRT models can also be categorized by the number of scored responses. The typical multiple-choice item is dichotomous; although there may be four or five options, it is still scored only as correct/incorrect (right/wrong).

Number of IRT parameters
Dichotomous IRT models are described by the number of item parameters they employ (Thissen & Orlando, 2001). The three-parameter model (3PL) is so named because it employs three item parameters. The two-parameter model (2PL) assumes that the data contain no guessing but that items can vary in location bi and discrimination ai. The one-parameter model (1PL) additionally treats guessing as absorbed into ability and assumes that all items fitting the model have equivalent discriminations, so that each item is described by a single parameter bi.
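The nesting of these dichotomous models can be sketched in code. This is a minimal illustration, assuming the standard logistic form of the item response function (the function names are ours, not from the study); each simpler model is the 3PL with parameters fixed.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response: discrimination a,
    difficulty (location) b, pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_2pl(theta, a, b):
    """2PL: the 3PL with the guessing parameter fixed at zero."""
    return p_3pl(theta, a, b, 0.0)

def p_1pl(theta, b):
    """1PL: no guessing and equal discrimination across items,
    so each item is described by b alone."""
    return p_3pl(theta, 1.0, b, 0.0)
```

For example, when ability equals item difficulty, the 1PL gives a probability of exactly 0.5, while a 3PL item with c = 0.2 gives 0.6 at the same point, reflecting the guessing floor.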

The Rasch model
The Rasch model is often considered to be the 1PL IRT model. However, proponents of Rasch modeling prefer to view it as a completely different approach to conceptualizing the relationship between data and theory (Andrich, 1989). Like other statistical modeling approaches, IRT emphasizes the importance of a model's fit to observed data (Steinberg, 2000). In contrast, the Rasch model emphasizes the priority of fundamental measurement requirements, with good data-model fit being an essential but secondary requirement to be met before a test or research instrument can be claimed to measure a trait (Andrich, 2004). Several IRT models for polytomous responses have appeared (Nering & Ostini, 2011; Ostini & Nering, 2006), each with a specific purpose. These models are described below, with some supporting studies for each:

Normal Ogive Model (NOM):
The NOM was the first IRT model for measuring psychological and educational latent traits (Ferguson, 1942; Lawley, 1943; Mosier, 1940; Richardson, 1936) and was later refined by Lord and Novick (1968). In this model, the item characteristic curve (ICC) is derived from the cumulative distribution function (CDF) of a normal distribution. Some studies have applied this model (e.g., Tran & Formann, 2009).

Partial Credit Model (PCM):
The PCM is an extension of the dichotomous Rasch model to items scored in ordered steps (Masters, 1982). Choi and Swartz (2009) applied this model.
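The PCM's category probabilities can be written down directly from Masters' (1982) formulation: category k gets a numerator exp(Σ_{j≤k}(θ − δ_j)), with the empty sum for k = 0 defined as zero. Below is a minimal sketch (the function name and example values are ours) computing these probabilities for one item.

```python
import math

def pcm_probs(theta, deltas):
    """PCM category probabilities for one item.
    theta: person ability; deltas: step difficulties delta_1..delta_m.
    Returns probabilities for categories 0..m, which sum to 1."""
    # Cumulative logits: 0 for category 0, then running sum of (theta - delta_j)
    logits = [0.0]
    running = 0.0
    for d in deltas:
        running += theta - d
        logits.append(running)
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With a single step (m = 1) the model reduces to the dichotomous Rasch model: when θ equals the step difficulty, the two categories are equally likely (0.5 each), which is why the PCM is described as a generalization of the Rasch model.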

Generalized Partial Credit Model (GPCM):
The GPCM (Muraki, 1992) is a generalization of the PCM with a parameter for item discrimination added to the model. Chen (2010) used this model.

Rating Scale Model (RSM):
There are two different approaches to the RSM. Andersen (1997) proposed a response function in which the values of the category scores are directly used as part of the function. Another form of the RSM was proposed by Andrich (1978), which can be seen as a modification of the PCM. Recent studies that used this model include Dehqan, Yadegari, Asgari, Scherer, and Dabirmoghadam (2017) and Gómez, Arias, Verdugo, and Navas (2012).

Graded Response Model (GRM):
The GRM was introduced by Samejima (1969) to handle ordered polytomous categories, such as letter grades (A, B, C, D, and F) and polytomous responses to attitudinal statements such as Likert-scale items. LaHuis, Clark, and O'Brien (2011) adopted this model.

Nominal Response Model (NRM):
The NRM, also called the Nominal Categories Model (NCM), was introduced by Bock (1972). Unlike the other polytomous IRT models introduced above, the NRM treats polytomous responses as unordered, or at least not assumed to be ordered. Even though responses are often coded numerically (for example, 0, 1, 2, …, m), the values do not represent scores on items but are merely nominal indications of response categories. Applications of the NRM are found with multiple-choice items. In general, polytomous-response models are used when the response consists of many score levels, each with a difficulty coefficient (DC) according to the model used. One of these models is the PCM, which identifies the DC of each step (k) in answering item (i) of polytomous responses, as well as estimating a person's latent ability and performance. There is also the GRM, in which each item has a discrimination index and each section of the response has a DC (Mislevy & Verhelst, 1990). A recent study that adopted this model is Huggins-Manley and Algina (2015).

Research Questions
The following research questions were raised to guide the study:
• Do the mathematics test data for tenth-grade students achieve the assumptions of Item Response Theory (IRT)?
• To what degree do the mathematics test data for tenth-grade students conform to the Partial Credit Model (PCM)?
• What are the estimated values of the item parameters according to the Partial Credit Model?
• What are the estimated values of the persons' abilities according to the model used?

Research Design
The study aimed at the development and validation of the mathematics test for 10th-grade students according to the Rasch PCM, using a survey design as it is appropriate for the study aims.

Population of the Study
The study population consisted of all 10th-grade students at the schools belonging to the "Directorate of Education of Irbid District," numbering 7,365 students: 3,612 male and 3,753 female.

Sample and Sampling Techniques
The sample in this study was drawn using a stratified random sampling technique: students were stratified by gender and then sampled by cluster, with the classroom section as the unit of choice. Fourteen schools were selected, 7 male schools and 7 female schools, and the study sample size reached 250 male and female students. The instrument is a mathematics test constructed for 10th-grade students, with difficulty coefficients (DC) estimated following the Rasch PCM. The test consisted of 25 items of essay type, each with as many scored answer steps as it requires; the items covered the whole mathematics subject. The 25 essay-type items, constructed based on IRT according to the Rasch PCM, were administered to the sample of 10th-grade students, who are mainly under the control of their respective schools. Each item had four answer steps following the structure of the achievement test. Specific instructions for the test were delivered by teachers under the supervision of the monitoring and evaluation unit responsible for the regulation of primary education in the ministries, in the northern area "Irbid district," in March 2021. After coordinating with the schools, a date was set for applying the study tool. Participants were informed that the information obtained would be used for scientific research only. The study tool was then applied in its final form to the targeted sample, and the response sheets were collected, audited, and analyzed statistically to answer the questions of the study and to derive appropriate recommendations in light of the results. The data collected were analyzed using SPSS V23 for the factor analysis, the estimated abilities of the 10th-grade students, means, standard deviations (SD), standard errors (SE), and point-biserial correlation coefficients. Moreover, Winsteps V3.72.3 was used to assess students' conformity to the test.

Results
After analyzing the data obtained from the instrument, the results were presented in tables based on the research questions.

First Research Question:
Do the mathematics test data for 10th-grade students achieve the assumptions of IRT? To answer this question, a factor analysis was conducted using SPSS V23.0 to verify the uni-dimensionality assumption of the test items, as shown in Table 1.

Research Article, Vol.12 No.6 (2021), 1527-1536

Table 1 presents the results of the factor analysis of the mathematics test items, indicating uni-dimensionality by three indicators: the ratio of the first factor's eigenvalue to the second eigenvalue is greater than 2; the ratio of the difference between the first and second eigenvalues to the difference between the second and third eigenvalues has a high value; and the variance explained by the first component is higher than 20.0% (Hattie, 1985). Figure 1 shows the eigenvalues of the factors that make up the test, supporting the uni-dimensionality assumption.
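The three eigenvalue-based indicators can be computed directly once the factor analysis has produced the eigenvalues. The sketch below is illustrative only: the function name and the example eigenvalues are hypothetical, not the study's actual SPSS output.

```python
def unidimensionality_indicators(eigenvalues, n_items):
    """Three indicators of uni-dimensionality (cf. Hattie, 1985),
    given eigenvalues sorted in descending order."""
    e1, e2, e3 = eigenvalues[:3]
    ratio_first_second = e1 / e2               # criterion: greater than 2
    ratio_of_gaps = (e1 - e2) / (e2 - e3)      # criterion: a high value
    pct_variance_first = 100.0 * e1 / n_items  # criterion: above 20.0%
    return ratio_first_second, ratio_of_gaps, pct_variance_first

# Hypothetical eigenvalues for a 25-item test (not the study's values)
r12, gaps, pct = unidimensionality_indicators([6.0, 2.0, 1.5], 25)
```

With these hypothetical values, the first ratio is 3 (> 2), the gap ratio is 8 (high), and the first component explains 24% of the variance (> 20%), so all three criteria would be met.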

Table 2: The frequencies and percentages of local independence of the test items.

Table 2 highlights the assumption of local independence (LI) for the test items, which was verified by calculating the standardized LD χ² statistic for each pair of test items; the 300 correlational pairs are obtained by multiplying 25 items by 24 and dividing by 2, using the IRTPRO V3.1.21505.4001 software. The frequencies and percentages of both LI cases were then monitored: a standardized LD χ² value greater than 10 indicates that LI has not been achieved for that pair, whereas a value lower than 10 indicates that it has. Table 2 shows that LI is achieved in 295 of the 300 correlational pairs of test items, a percentage of 98.33%.
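The pair counting and threshold screening described above can be sketched in a few lines. This is an illustration of the arithmetic only (function name and the fake χ² values are ours); the actual standardized LD χ² statistics come from IRTPRO.

```python
from itertools import combinations

# All unordered item pairs for a 25-item test: 25 * 24 / 2 = 300 pairs
n_items = 25
pairs = list(combinations(range(1, n_items + 1), 2))

def li_summary(ld_chi2_by_pair, threshold=10.0):
    """Count pairs whose standardized LD chi-square is below the threshold
    (local independence treated as achieved) and report the percentage."""
    achieved = sum(1 for v in ld_chi2_by_pair.values() if v < threshold)
    total = len(ld_chi2_by_pair)
    return achieved, total, round(100.0 * achieved / total, 2)

# Fake values mimicking the study's outcome: 5 flagged pairs out of 300
fake_chi2 = {p: 3.0 for p in pairs}
for p in pairs[:5]:
    fake_chi2[p] = 12.0
achieved, total, pct = li_summary(fake_chi2)
```

Under these fabricated inputs the summary reproduces the reported proportions: 295 of 300 pairs achieved, i.e., 98.33%.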

Second Research Question:
To what degree do the mathematics test data for 10th-grade students conform to the PCM?

Third Research Question
What are the estimated values of the item parameters according to the Partial Credit Model (PCM)? To answer this question, descriptive statistics were computed for each raw score and for the ability of the 10th-grade students matching the test, along with the SE of ability according to the Rasch model and the PCM.

Fourth Research Question:
What are the estimated values of the persons' abilities according to the model used? To answer this question, the difficulty parameter values estimated for the mathematics test, the SE, and the point-biserial correlation coefficients for 10th-grade students were calculated according to the PCM. Table 5 highlights that the INFIT statistics, based on mean-square residuals (MSR) of observed frequencies against expected frequencies, range from -1.99 to 1.74. Moreover, the OUTFIT statistics, based on standardized mean-square residuals (SMSR) of observed frequencies against expected frequencies, range from -1.99 to 1.69.
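The INFIT and OUTFIT statistics reported in Table 5 are Rasch fit indices computed from residuals between observed and model-expected responses. As a minimal sketch (the function name is ours, and note that the ranges quoted above appear to be standardized values, whereas the formulas below give the underlying mean squares, whose expected value is 1):

```python
def fit_mean_squares(observed, expected, variances):
    """Rasch fit statistics in mean-square form.
    OUTFIT: unweighted mean of squared standardized residuals.
    INFIT: information-weighted mean square (sum of squared residuals
    divided by the sum of the model variances)."""
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    outfit = sum(r / w for r, w in zip(sq_resid, variances)) / len(sq_resid)
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit
```

For perfectly model-conforming dichotomous data (e.g., one success and one failure on items where the model expects 0.5), both statistics equal 1.0, their theoretical expectation; values far from 1 flag misfitting items or persons.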

Discussions of Findings
The results for the first study question showed that the 10th-grade students' results on the mathematics test achieved the uni-dimensionality assumption of IRT under the PCM, by three indicators. This means that the performance of the examined students on the test can be attributed to a single dominant trait or ability, as some latent-trait models assume the existence of a single trait underlying the interpretation of the examined students' performance on the test. Likewise, it means that the test items were homogeneous among themselves and measure the same trait, and that the items, despite their different difficulties, did not differ among themselves in terms of the trait measured.
The results for the first study question also showed that the second assumption of IRT, local independence (LI), was achieved, meaning that the examined students' responses to the various items were statistically independent: an examined student's performance on any item of the test affects neither positively nor negatively his or her response to any other item. This implies reliability in assessing students' abilities and item difficulties despite differences in the sample of individuals used in the measurement, as long as the sample is appropriate; likewise, there is reliability in estimating both individuals' abilities and item difficulties despite differences in the set of items used in the measurement, as long as the items are appropriate. Moreover, the results for the first question reveal that the monotonicity assumption was achieved, meaning that the probability of responding correctly to the items increases with increasing ability. This also means that the speed factor does not play a role in the examined students' responses to the test items: an examined student's failure to answer the test items correctly is due to limited ability, not to an inability to reach all the test items because of the speed factor. The findings for this question are consistent with the findings of Becker and Forsyth (1992), Craig and Kaiser (2003), and Kimball (1989).

The results for the second study question showed that the 10th-grade students' ability parameters on the mathematics test matched the assumptions of IRT and the PCM; only 3 students were deleted because their answers did not match the expectations of the PCM.
A non-matching student is one whose observed responses deviate from the model's expectations; for example, he or she may answer items incorrectly despite their difficulty level being below his or her ability level, or answer items correctly despite their difficulty level being above his or her ability. Moreover, the results for the same question also showed that the 10th-grade students' results matched in terms of the difficulty parameter of the items on the mathematics test, where none of the test items failed to match the expectations of the PCM. A non-matching item is one for which the probability of a correct answer is high for students with low abilities and low for students with high abilities. The findings for this question regarding the match of the items to the PCM agree with the findings of some studies and disagree with the findings of others (Afrassa & Keeves, 1999; Bielinski & Davison, 2001; Hashway, 1977; Muraki, 1992; Muthen, 1988).
The results for the third study question showed that the ability values of the 10th-grade students matching the mathematics test ranged from -0.02 to 2.34, with AM = 49.57 and SD = 23.62. This means that the students' abilities on the test are not normally distributed (ND), as they are supposed to be. Moreover, the arithmetic mean (AM) of the students' abilities makes clear that their abilities are higher than the test level. The results for this question agree with the results of Abedalaziz (2010) and Bohlin (1994) and do not agree with the results of Stone (1992).
The results for the fourth study question showed that the difficulty parameter values for the mathematics test ranged from -0.32 to 1.00, with AM = 107.08 and SD = 29.34, which means that the difficulty of the mathematics test items is moderate, not extreme. The results for this question agree with the results of Gershon (1994).

Conclusion
This study is distinctive in choosing the primary stage in public schools in the Irbid governorate and in its approach to constructing an achievement test in mathematics in particular. Nevertheless, it shares with previous studies its general field of achievement tests and its attempt to identify the degree of effectiveness of applying the PCM in achievement tests. The findings of this study have raised several issues concerning mathematics achievement by verifying the test's validity and reliability and its satisfaction of the IRT assumptions according to the PCM. (i) Foreign studies dealing with the current topic are many and varied; in other words, there is great interest in the topic in foreign research, whereas its use and employment are limited in Arab societies and Arab studies. (ii) The samples in those studies were all university students or secondary-stage students; as a result, we applied the model to primary-stage students to demonstrate its effectiveness at this stage compared with the older age stages. (iii) Most previous research focused on using statistical methods in the light of Classical Theory to verify psychometric characteristics, and few used modern measurement models; therefore, the Rasch PCM was applied here and its effectiveness with achievement tests demonstrated.

Recommendations
Based on the findings of this study, and considering the significant place of mathematics in our educational system, the study makes the following recommendations: (1) Teachers and other stakeholders should pay special attention to encouraging and motivating students to develop good study habits to improve their academic achievement in mathematics.
(2) Further studies should adopt the PCM to see the contribution of this model in measuring students' achievements, especially in mathematics. (3) The current mathematics test should be adopted by 10th-grade teachers. (4) Teachers and other stakeholders should endeavor to encourage and motivate students to learn mathematics. (5) Teachers may need to be more sensitive to the different needs of male and female students; hence, care has to be taken when teaching both genders. (6) Curriculum developers should develop instruction that would improve students' knowledge by emphasizing the perceived areas of difficulty in mathematics.