Sampling and Data Cleansing for the 2018 Survey

In the Working Conditions Survey of Chinese Urban Residents (2018), we established the sampling frame of primary sampling units (PSUs) based on data from China's “sixth population census” provided by the National Bureau of Statistics of China (NBSC). Using the NBSC data on total population, population aged 16 and over, and number of households in the selected PSUs, together with all secondary sampling units (SSUs) and tertiary sampling units (TSUs), we took as our survey subjects employed persons aged 16 and over living in communities (residents’ committees) across county-level cities, prefectural-level cities and direct-administered municipalities in China. Data were collected at the individual, household, organizational and community levels through a door-to-door questionnaire survey, with a view to measuring, evaluating and analyzing the working conditions of China's employed urban population. The sampling scheme provides good representativeness of the urban working and employed population in China.

1. Sampling design

(1) Target population

The target population of the Working Conditions Survey of Chinese Urban Residents (2018) is the employed population aged 16 and over living in urban areas of mainland China. Because only one person was selected from each sampled household, the survey data can be appropriately weighted to be representative of employed Chinese urban households. The term “employed urban population” is operationally defined as the employed population aged 16 and over living in communities (residents’ committees) across county-level cities, prefectural-level cities and direct-administered municipalities in mainland China between October 2018 and February 2019.

(2) Sampling design

This survey adopts a complex sampling design. Specifically, county-level administrative units (districts and county-level cities) were treated as PSUs. The PSU sampling frame was built from the data of China's “sixth population census” conducted in 2010, combined with the latest information on administrative divisions published by the Ministry of Civil Affairs of the People’s Republic of China (for PSUs containing fewer than 9 SSUs, we merged two geographically adjacent PSUs into one so that each PSU contained at least 9 SSUs). A total of 60 county-level urban areas (PSUs) were sampled using the probability proportional to size (PPS) method.

For each PSU, we applied the PPS method to select 9 communities (residents’ committees) as SSUs. In cases of inaccessibility, relocation or changed administrative division, we would supplement new SSUs that were selected from the sampling frame using the same method.

For each selected SSU, our survey implementation institution dispatched field interviewers to draw plot maps and compile the list of household addresses as the third-stage sampling frame, from which we selected 15 households as TSUs using simple random sampling without replacement (SRSWOR). Given the presence of refusals and non-target households, we asked interviewers to provide an estimated refusal rate for each TSU during sampling and used the following equation:

 

$$\text{number of selected addresses} = \operatorname{int}\!\left(\frac{15}{1-\text{refusal rate}} + 0.5\right)$$

 

to release all sampled addresses in one batch. If the number of successfully surveyed households reached 10, the survey in that community was stopped; otherwise, the interviewers recalculated the number of newly released addresses from the number of households remaining to be surveyed and the updated refusal rate, using the same equation.
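As an illustration of this release rule, the following minimal sketch (in Python, with hypothetical inputs) computes the number of addresses to release:

```python
def addresses_to_release(target: int, refusal_rate: float) -> int:
    """Number of addresses released in one batch, following the rounding rule
    int(target / (1 - refusal_rate) + 0.5) described above."""
    return int(target / (1.0 - refusal_rate) + 0.5)

# Example: a target of 15 households with an estimated 40% refusal rate gives
# int(15 / 0.6 + 0.5) = 25 addresses released in one batch.
print(addresses_to_release(15, 0.40))   # 25
# Follow-up release: 5 households still needed, refusal rate re-estimated at 50%.
print(addresses_to_release(5, 0.50))    # 10
```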

With respect to the selected TSUs (households), we further selected one individual per address as the ultimate sampling unit (USU) using the random number table we had established (see the Questionnaire). If the selected individual refused to be interviewed and did not allow another household member to be interviewed in his or her place, the next address on the selected address list was visited.

(3) Sample size

Under simple random sampling of the population without replacement, the required sample size can be estimated using the following equation:

$$n = \frac{u_{\alpha}^{2}\, p(1-p)}{d^{2}}$$

where p is the proportion of a given class in the population; $u_{\alpha}$ is the critical value of the standard normal distribution at significance level α; and d is the allowable difference between the sample estimate and the population parameter. According to this equation, if we set α = 0.05 (a 95% confidence level) and the absolute error d = 3%, a sample of about 1,000 is sufficient for estimating the majority of distributions.
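A quick check of this calculation, assuming the worst case p = 0.5 (a minimal sketch; the function and variable names are ours):

```python
from scipy.stats import norm

def srs_sample_size(p: float, d: float, alpha: float = 0.05) -> float:
    """Required SRS sample size n = u_alpha^2 * p * (1 - p) / d^2."""
    u = norm.ppf(1 - alpha / 2)              # critical value, about 1.96 for alpha = 0.05
    return u ** 2 * p * (1 - p) / d ** 2

# Worst case p = 0.5 with a 3-percentage-point absolute error:
print(round(srs_sample_size(p=0.5, d=0.03)))  # about 1,067, i.e. "about 1,000"
```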

However, given that this survey used a multistage complex sample instead of a simple random sample, we must also take into account the design effect (deff). The design effect is the ratio of the sampling variance under the complex design to the sampling variance under simple random sampling with the same sample size. The design effect can be estimated as follows:

$$\text{deff} = 1 + (b-1)\,\text{roh}$$

where b is the number of cases selected from a single sampling unit and roh is the homogeneity (intraclass correlation) within the sampling unit. The equation indicates that deff grows with the number of cases taken from each sampling unit and with the homogeneity within the unit. This survey therefore uses a scheme that increases the number of sampling units and keeps the number of cases per sampling unit as small as practicable. Based on the design scheme and our prior sampling experience, we set the deff of this survey at 6, so the sample size accounting for deff is 1,000 × 6 = 6,000.
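For illustration, a minimal sketch of this inflation step; the roh value of 0.36 below is a hypothetical figure chosen only so that deff comes out near the value of 6 assumed in the design:

```python
def design_effect(b: float, roh: float) -> float:
    """deff = 1 + (b - 1) * roh."""
    return 1.0 + (b - 1.0) * roh

# Hypothetical roh: with b = 15 households per community and roh = 0.36,
# deff is about 6, the value assumed in this survey's design.
deff = design_effect(b=15, roh=0.36)
print(deff)                  # 6.04
print(round(1000 * deff))    # about 6,000 cases after inflating the SRS requirement
```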

To obtain an unbiased parameter estimate, a certain level of response rate r must be ensured for a social survey:

 

$$r = \frac{\text{number of completed interviews}}{\text{number of eligible sampled units}}$$

 

Methodologically, the target population can be divided into two potential populations according to whether a response is given: the population available for the survey and the population unavailable for it. The size of the former equals the response rate multiplied by the target population size; the size of the latter equals (1 − response rate) multiplied by the target population size. The lower the response rate, the smaller the population to which the sample estimates can be generalized. Only by assuming that there is no statistically significant difference between the parameters of the available and unavailable populations can we generalize the survey results to all population members in the presence of nonresponse. A rule of thumb is to ensure a response rate of at least 50% in sampling surveys (that is, the available population accounts for at least half of the target population). Considering the presence of nonresponse, we need to increase the number of selected cases accordingly. Assuming a response rate of 75%, the sample size for this survey should be 6,000 / 0.75 = 8,000. Taking the allocation of cases across stages into account, we set the final sample size at 8,100 (= 60 × 9 × 15), made up of 60 PSUs, each comprising 9 SSUs, each SSU comprising 15 TSUs, and each TSU yielding 1 USU.
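The arithmetic behind the final allocation, as a one-line check:

```python
n_deff = 1000 * 6                      # SRS requirement inflated by the design effect
n_selected = n_deff / 0.75             # adjusted for an assumed 75% response rate
n_final = 60 * 9 * 15 * 1              # 60 PSUs x 9 SSUs x 15 TSUs x 1 USU
print(n_selected, n_final)             # 8000.0 8100
```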

(4) Sampling frame and sampling procedure

① The first-stage sampling: selection of PSUs (cities and districts)

The PSU sampling frame of this survey is derived from the Sixth National Population Census data (by county) published by the NBSC in 2010. Because 8 years had passed since 2010, we adjusted the urban population aged 8 and over (who would be aged 16 and over by 2018) for mortality using the sex- and age-specific crude death rates from the census data, to correct for population change. The adjusted data were used as the PSU sampling frame (comprising 1,226 PSUs), with the urban population aged 8 and over as the size measure. Following the sampling design, we selected 60 of the 1,226 PSUs (excluding those in Xinjiang and Tibet) using the PPS method. The 60 PSUs are distributed across 24 provinces, municipalities and autonomous regions, with Shandong, Hubei and Guangxi containing the most sampled PSUs (5 each) and Shanghai, Yunnan, Jilin, Jiangxi, Fujian and Hainan the fewest (1 each).
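A minimal sketch of one common way to carry out such a PPS selection (systematic PPS with the urban population as the size measure); the frame below is simulated, and the survey's actual selection routine is not documented here:

```python
import numpy as np

def pps_systematic(sizes: np.ndarray, m: int, rng: np.random.Generator) -> np.ndarray:
    """Systematic PPS: select m units with probability proportional to `sizes`."""
    cum = np.cumsum(sizes)
    step = cum[-1] / m                         # sampling interval on the cumulative size scale
    points = rng.uniform(0, step) + step * np.arange(m)
    return np.searchsorted(cum, points)        # indices of the selected units

# Simulated frame: 1,226 county-level PSUs with urban population as the size measure.
rng = np.random.default_rng(2018)
frame_sizes = rng.integers(50_000, 2_000_000, size=1226)
selected = pps_systematic(frame_sizes, m=60, rng=rng)
print(len(selected))   # 60 PSUs (very large units could be selected more than once)
```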

② The second-stage sampling: selection of SSUs (residential committees of communities)

The SSU sampling frame of this survey is derived from the raw data of the Sixth National Population Census published by the NBSC in 2010, together with the 2010 counts of households and of urban population aged 8 and over provided by the relevant sub-departments of the bureau. According to the sampling scheme, 9 residential communities were to be selected from each PSU as SSUs using the PPS method, giving 540 selected residential communities in principle.

③ The third-stage sampling: selection of TSUs (households)

For the purposes of this survey, households include regular households, collective households and various collective residential units covered by household registration. The TSU sampling frame was derived from the land plot drawings and address lists made by the survey implementation agency, which developed the “household sampling frame” on site plot by plot. Once the “household sampling frame” was established, the research team randomly selected a list of addresses for door-to-door visits using computer software. Interviewers were not allowed to conduct interviews at addresses not included in the list. For sampled households that still could not be reached after three contact attempts, the interviewer needed to specify the reason in the relevant section of the Registration Form of Home Visits before moving on to the next household.

To incorporate the migrant population into the survey, we followed the principle of “determining survey subjects by household” in TSU sampling: taking residential addresses as tertiary sampling units (TSUs), we regarded a household as a potential survey target as long as at least one of its members was employed, regardless of whether the members were registered residents, long-term residents or migrants.

④ The fourth-stage sampling: selection of USUs (respondents)

The fourth-stage sampling frame comprises all employed members aged 16 and over in the selected households. After successfully entering a selected home, the interviewer chose the respondent from among household members using the Kish grid on the first page of the questionnaire. It is worth noting that the Kish grid was used to select respondents in households with no more than 10 eligible members; if a household had more than 10 eligible members, the interviewer followed the “median age” principle, selecting the individual whose age fell in the middle among all eligible respondents.
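A simplified stand-in for this within-household selection step is sketched below; the real survey uses a printed Kish grid keyed to the questionnaire, which a seeded random draw only approximates here, and the function and field names are ours:

```python
import random

def select_respondent(members, questionnaire_seed):
    """Pick one respondent from (name, age) tuples of employed members aged 16+.

    Households with at most 10 eligible members: a seeded random draw stands in
    for the printed Kish grid.  Larger households: the "median age" rule.
    """
    eligible = sorted(members, key=lambda m: m[1])       # a fixed ordering, as a grid requires
    if len(eligible) <= 10:
        return random.Random(questionnaire_seed).choice(eligible)
    return eligible[len(eligible) // 2]                  # median-age member

household = [("member A", 24), ("member B", 48), ("member C", 52)]
print(select_respondent(household, questionnaire_seed=137))
```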

A questionnaire-based interview began if the selected respondent agreed to be interviewed. If the respondent refused, the interviewer had to record the respondent's sex and the “cause of failure” in the corresponding sections under “Failed Interviews” - “Reasons Given by the Respondent” on the Registration Form of Home Visits. If the selected respondent could not be interviewed for reasons such as being away from home, being abroad or being seriously ill, the interviewer considered whether to make an appointment for a later interview; if not, the interviewer likewise recorded the respondent's sex and the “cause of failure” in the corresponding sections of the Registration Form of Home Visits.

Regardless of the cause of a failed interview, the interviewer was prohibited from switching the selected respondent to another member of the household; instead, the interviewer had to record the cause in the Registration Form of Home Visits before proceeding to the next household.

2. Survey quality control

The objective of survey quality control is to reduce the systematic errors (biases) of survey data under the guidance of the “overall research design”. Based on the research design of the Working Conditions Survey of Chinese Urban Residents (2018), systematic errors may arise in the following three stages:

 

1) First, the household selection stage. For example, factors like incomplete plot drawings for sampling, inaccurate entries on sampling forms and arbitrary replacements of household addresses by interviewers could all cause errors.

2) Second, the respondent selection stage. For example, biases in sample sex and age could be caused if interviewers fail to perform in-home selection based on the Kish grid procedures or if the entries of the Kish grid are non-compliant.

3) Third, the field interview stage. For example, biases could occur if interviewers systematically skip questions, deliberately avoid parts of the questionnaire by misusing skip rules, merge questions that should be asked item by item, or lead or prompt respondents to give certain kinds of answers.

 

Focusing on the three stages described above, data quality control in this survey was performed through the following procedures.

(1) The household selection stage

A sampler was dispatched to each selected residential community for a field visit, where he or she examined all buildings in the area and drew or updated the Residential Land Plot Drawing for Sampling. On that basis, the sampler filled out a Land Plot Sampling Form recording the number of floors and entrances in each building and the number of households for each entrance on each floor. All residential households in these buildings constitute the sampling frame for this round. The sampler had to ensure that the identifier numbers on the drawings, residential buildings and rooms were consistent. If the number of households indicated by the Land Plot Sampling Form was evidently lower than the size of a typical residential community in the locality, the sampler was to promptly check whether the Land Plot Sampling Drawing and the Land Plot Sampling Form were complete.

Upon receiving the Land Plot Sampling Form data, the contractor's project team provided a randomly generated list of addresses to be visited for each community. A visit to a household was not regarded as failed until there had been 3 nonresponses or 2 refusals. In case of a failed visit, the interviewer had to truthfully record the cause on the Registration Form of Home Visits. For communities where the specified number of valid interviews still could not be achieved after 3 rounds of home visits, the contractor's project team provided a second set of addresses.

During home visits, interviewers had to carefully fill out the Registration Form of Home Visits and were prohibited from arbitrarily changing the visiting addresses. If the validation process found that more than 100 households' entries were missing from the Land Plot Sampling Form, the questionnaires of that residential community were regarded as invalid.

The interviewer must take a set of photos pertaining to an interviewed household (photos of the full name of the residents’ committee, the identifier number of the residential building/bungalow, and the household’s doorplate number), where the indicated address should be the same as the sample address. Questionnaires with photos missing or inconsistent addresses will be regarded as invalid.

Based on the above principles and the records of the Registration Form of Home Visits, a total of 27,496 household addresses were contacted in this survey: 7,218 households were successfully visited, accounting for 26.3%; 12,531 households refused to take part in the interviews for various reasons, accounting for 45.6%; and 7,747 households involved other circumstances (including no one at home, the selected interviewee not at home, entrance-control systems blocking access, or erroneous addresses), accounting for 28.2%. On average, about 1 out of 4 household visits was successful.

(2) The respondent selection stage

According to survey procedures, after entering a household the interviewer was to select the respondent from household members using the Kish grid on the first page of the questionnaire. Respondent selection is crucial for ensuring sample randomness and thus had to be properly implemented. The survey institution conducted two rounds of review within 2 days of questionnaire completion to check whether the Kish grid-based sampling had been implemented correctly. If any flaws in the sampling process were found, the in-home interviews had to be re-performed. The Kish grid tests of the collected samples found only 4 sets of data that did not conform to the randomness principle, indicating that, at least at the level of the recorded data, respondent selection in this survey was done according to the principle.

The survey implementing agency was also required to rapidly aggregate the sex- and age-specific survey data by city. In cases of sex-ratio imbalance or biased age structure, the implementer must verify the situation and submit an explanation to the research team.

Interviewers had to record audio of the entire interview process, and the proper “respondent selection” procedure had to be reflected in the audio. All questionnaires administered by interviewers found to have engaged in falsification were regarded as invalid, as were questionnaires in which the “respondent selection” procedure was missing from the audio.

(3) The home interview stage

The survey implementing agency provided effective training for interviewers using the Interviewer Manual and related video materials. For the field interview stage, the survey institution conducted two rounds of review within 2 days of questionnaire completion to catch missed questions or misused skip rules; immediate remedies were required when such mistakes occurred. After completing their first interviews, interviewers had to promptly transmit the electronic questionnaires and audio to the Chinese Academy of Social Sciences (CASS) for scrutiny, so that potential problems could be identified and corrected.

After the questionnaires were completed, an audio review covering all interviewers and residents' committees was performed. If an audio recording indicated that an interview lasted no longer than 15 minutes, the reviewer re-examined the case and provided timely guidance. If any mistakes were found during review, the reviewer promptly notified the survey implementing agency and demanded immediate rectification.

In instances of evident falsification of audio recordings, the survey implementing agency was required to redo the interviews for all questionnaires administered by the interviewer concerned. Interviewers also had to write down respondents' telephone numbers on the first page of the questionnaire for the purpose of telephone checks.

3. Data entry and cleansing

(1) Data entry

Two survey modes - PAD-based (tablet) and paper-based questionnaires - were employed in this survey. In the final data, 88.13% of cases came from PADs; these were entered as the field interviews progressed and transmitted to the survey implementing agency over wireless networks. In addition to the quality control carried out by the survey implementing agency, we proofread all of the recorded data against the audio files, as described under data cleansing below.

Paper-based questionnaire data also had to be accompanied by audio and photographic records. These data were entered using EpiData Entry 3.1 and validated through double-entry comparison. We also used the quality control functions that EpiData Entry offers for data entry: with a pre-established program, the system automatically detected out-of-range values for individual variables and logical errors between variables. When an outlier or an inconsistent logical relation was found, the original questionnaire was consulted to confirm whether the error had occurred during entry; if not, the respondent was contacted for confirmation and further remedial measures.

(2) Data cleansing

Following data entry, we further carried out data cleansing for the 8,218 entries. The data cleansing process mainly involved the following aspects:

① Correction of entry errors

We proofread 100% of the record data by re-listening to the audio files, and corrected any inconsistencies between answers provided by respondents in the audio and the corresponding entries made by interviewers. A total of 1,094 errors were corrected in this stage.

 

② Validation of respondents’ identities

We re-checked respondents' occupations and identities based on the jobs they reported (variables c2a and c2b) and removed non-conforming cases such as homemakers and retired persons. In this process, 17 invalid samples were removed, leaving 8,201 valid samples. Additionally, we coded the recorded occupations uniformly following the work of Zhe Xiaoye and Chen Yingying and converted them into occupational esteem scores, which are included in the published data (Zhe Xiaoye & Chen Yingying, 1995).

 

③ Logic validation

We examined more than 50 potential logical relations in the questionnaire and checked possible logic errors one by one, applying appropriate treatment (invalidating entries or reconciling them) to each instance. Examples include: part-time income reported by respondents with no part-time job; the number of dependent family members exceeding the household size; benefits not provided by an organization yet enjoyed by its employees; communication channels not offered by an organization yet used by its employees; an employee at the lowest hierarchical level of an organization reporting subordinates; household income lower than individual income; time spent on a single activity exceeding 20 hours; the number of direct reports exceeding the total workforce of the organization; the starting year of the current job earlier than that of the first job; and overtime pay reported by respondents who do not work overtime.
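Such rule-based checks can be expressed as boolean conditions over the data; the sketch below uses hypothetical column names rather than the survey's actual variable codes:

```python
import pandas as pd

# Hypothetical variable names for illustration only.
df = pd.DataFrame({
    "household_income":  [80_000, 30_000],
    "personal_income":   [50_000, 45_000],
    "has_part_time_job": [0, 0],
    "part_time_income":  [0, 6_000],
})

rules = {
    "household income lower than individual income":
        df["household_income"] < df["personal_income"],
    "part-time income without a part-time job":
        (df["has_part_time_job"] == 0) & (df["part_time_income"] > 0),
}

for name, mask in rules.items():
    print(name, "->", df.index[mask].tolist())   # row indices to review one by one
```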

④ Check on extremes and distributions

We manually checked the distribution of every variable and treated evidently deviant values as missing. For example, values such as the number of workers in a household exceeding 20, a starting year of employment of 1900, or an internship period of 885 months were treated as missing.

⑤ Validation of the Kish grid

Whether the Kish grid of the questionnaire was properly used was validated in accordance with the review procedures, and noncompliant questionnaires were considered invalid. Among the 8,201 cases reviewed for proper use of the Kish grid, only 4 sets of data were found to violate the randomness principle, indicating that, at least at the level of the recorded data, respondent selection was carried out according to that principle. The 4 noncompliant questionnaires were treated as invalid and removed, leaving 8,197 valid samples.

⑥ Interviewer falsification

In this survey, a total of 881 interviewers were engaged, including 681 female interviewers (77.3%). The interviewers' average age is 30.8, the median age is 29, the minimum is 18 and the maximum is 68. Their average length of work experience is 3.5 years, with a median of 3 years. Specialized interviewers account for 36.0%, and students for 17.5%. Interviewers are generally well educated: 0.2% hold junior middle school diplomas, 29.4% senior middle school diplomas, 49.3% technical college diplomas and 21.1% undergraduate or higher degrees. A relatively large number of interviewers were engaged because, on the one hand, in-home surveys have become increasingly difficult over the past decade and, on the other hand, the questionnaire content is sensitive and the sampling procedures are complicated and demanding.

In particular, we implemented a three-tier process and data quality control system: checks by the 31 frontline survey agencies, review by one general agency based in Beijing, and a third round of review by the research team. After review, all samples produced by interviewers with a large proportion (over 90%) of low-quality data were removed, resulting in the deletion of 1,038 samples. We also carried out telephone reviews for all samples; a sample was deleted if the review result indicated “untrue”. In this process, a total of 340 samples were deleted.

Apart from process control and telephone reviews, we also conducted further examinations of the quality of the valid data received. Random answer selection is a typical falsification method used by interviewers, so we developed test programs for the randomness of responses within a questionnaire and for the correlations between items on a single questionnaire. Using these tests, we deleted 82 data sets.
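The survey's actual test programs are not published; the sketch below illustrates one screen of this kind under our own assumptions, flagging interviewers with many zero-variance (straight-lined) Likert batteries, using simulated data and hypothetical variable names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
likert_cols = ["q1", "q2", "q3", "q4", "q5"]          # hypothetical 1-5 Likert battery
df = pd.DataFrame(rng.integers(1, 6, size=(200, 5)), columns=likert_cols)
df["interviewer_id"] = np.repeat(np.arange(20), 10)   # 20 interviewers x 10 interviews each

# Zero spread across the battery suggests straight-lining; purely random filling
# instead shows high spread and no correlation structure between items.
df["straight_lined"] = df[likert_cols].std(axis=1) == 0
share = df.groupby("interviewer_id")["straight_lined"].mean()
print(share[share > 0.2])   # interviewers with >20% straight-lined batteries (arbitrary threshold)
```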

In general, a total of 1,460 samples were deleted in the falsification review process, leaving 6,737 valid samples.

4. Weights and target population calibration

In social surveys, weights are crucial for ensuring the correspondence between the sample and the target population. Weights fall into two main types: sampling weights and calibration weights. Sampling weights are the inverse of the probability that an individual case is sampled, as determined by the sampling scheme. Methodologically, there are four main approaches to calculating a sampling weight: design-based (randomization), model-based, model-assisted and Bayesian (Valliant et al., 2018). In this survey, we adopted a design-based calculation of the sampling weight to accurately reflect the particulars of our sampling design; we also performed calibration based on statistics published by the NBSC to compensate for potential biases (including compositional biases) arising from the outdated sampling frame, respondent refusals and interviewer falsification, and to reduce the variance and error of the sample estimates. The survey data therefore contain two weight variables: a sampling weight and a calibration weight. The former adjusts for the unequal selection probabilities of the multistage design; the latter further adjusts the sample structure to reduce potential discrepancies between the sample and the population, especially demographic ones. The following briefly describes how these weights were generated.

(1) Sampling design weight

This survey uses a multistage complex sampling design. In the first stage, 60 urban districts were selected using the PPS method; in the second stage, residents' committees within these urban districts were selected using the PPS method; in the third stage, households within the selected residential communities were selected using the SRSWOR method; in the fourth stage, 1 respondent was randomly selected within each household using the Kish grid.

Thus, when PSUs are selected with the PPS method in the first stage, the sampling probability $p_{i.psu}$ of the i-th PSU is:

 

$$p_{i.psu} = m_{psu} \times \frac{n_{i.psu}}{N_{total}}$$

 

where m denotes the number of units selected at a given stage and n denotes the employed population (or its proxy) of a sampling unit. $N_{total}$ is the population aged 16 and over living in the residential communities of urban districts, calculated from the Sixth Population Census; from the sampling frame its value is 616,432,389. $m_{psu}$ is the number of selected PSUs, namely 60. $n_{i.psu}$ is the employed population of the i-th PSU. In other words, the sampling probability of the i-th PSU equals the ratio of its employed population to the total, multiplied by 60.

In the second stage, we selected 9 SSUs from each selected PSU using the PPS method. Given that the i-th PSU has been selected, the conditional sampling probability of the j-th SSU, denoted $p_{j.ssu|i}$, is:

$$p_{j.ssu|i} = m_{ssu} \times \frac{n_{j.ssu}}{n_{i.psu}}$$

where $m_{ssu}$ is the number of SSUs selected from each PSU, set at 9; $n_{j.ssu}$ is the working population of the j-th SSU; and $n_{i.psu}$ is the working population of the i-th PSU. That is, the sampling probability of each SSU equals the ratio of its working population to that of its PSU, multiplied by 9.

In the third stage, we selected household TSUs from the community SSUs. Because the sampling frame contained neither the number of households in each community nor the number of households with at least one employed member, we employed the SRSWOR method. Given that the j-th SSU has been selected, the conditional sampling probability of the k-th household, denoted $p_{k.tsu|j}$, is:

$$p_{k.tsu|j} = \frac{m_{tsu}}{n_{k.tsu}}$$

where $m_{tsu}$ is the number of households to be selected from each community (SSU), set at 15, and $n_{k.tsu}$ is the total number of households in the community with at least one employed member. In this survey we used a Land Plot Registry Form, requiring interviewers to register all accessible addresses of a community before conducting in-home interviews; the number of these addresses served as the proxy for $n_{k.tsu}$.

In the fourth stage, interviewers visited the addresses randomly selected from the Land Plot Registry Form and, upon contacting household members, first ascertained the number of household members at the address who were employed and aged 16 and over. If this number was not 0, the interviewer used the Kish grid on the first page of the questionnaire to select the respondent and asked about his or her willingness to participate. If the selected respondent was unwilling to participate, the household was recorded as a refusal, and the interviewer continued to the next household on the address list until a respondent consented to be interviewed. Given that a household address has been selected, the sampling probability of each individual case, $p_{l.usu|k}$, is:

$$p_{l.usu|k} = \frac{1}{n_{l.usu}}$$

where $n_{l.usu}$ is the number of employed members at the l-th (selected) address of the k-th community, obtained from item B1 of the questionnaire.

According to the above steps, the sampling probability of each respondent, $p_{case}$, is the product of the above four probabilities:

 

$$p_{case} = p_{i.psu} \times p_{j.ssu|i} \times p_{k.tsu|j} \times p_{l.usu|k} = \frac{m_{psu}\, m_{ssu}\, m_{tsu}}{N_{total}} \times \frac{n_{j.ssu}}{n_{k.tsu}\, n_{l.usu}}$$

 

The first term, $\frac{m_{psu}\, m_{ssu}\, m_{tsu}}{N_{total}}$, denotes the constant sampling probability shared by all individual cases: as the PPS method was used in both the first and second stages, these stages produce a self-weighting sample (equal probability of selection method, epsem) (Kish, 1965). The second term, $\frac{n_{j.ssu}}{n_{k.tsu}\, n_{l.usu}}$, captures how the sampling probability changes across individual cases: in the third stage we could not use the PPS method to select households for lack of the necessary information and used SRSWOR instead, which makes the sampling probability differ across respondents.

In addition, we used working population aged 16 and over as the population size for the PPS-based sampling. As working population is highly correlated with, but different from, employed population, we used employment rate as a moderating factor. In other words, we assumed in the sampling stage that the employment rate across different sampling units is constant.

Substituting relevant values into the above equation, we get:

 

$$p_{case} = \frac{60 \times 9 \times 15}{616{,}432{,}389} \times \frac{n_{j.ssu}}{n_{k.tsu}\, n_{l.usu}} \approx \frac{1.3}{100{,}000} \times \frac{n_{j.ssu}}{n_{k.tsu}\, n_{l.usu}}$$

 

Therefore, the survey data are generally not a self-weighting (epsem) sample, and the sampling probabilities differ across individual cases (Kish, 1965). The average sampling probability of this survey is about 1.3 per 100,000.

The sampling weight $wgt_{case}$ is the inverse of the sampling probability $p_{case}$ above, that is:

 

$$wgt_{case} = \frac{1}{p_{case}} = \frac{N_{total}}{m_{psu}\, m_{ssu}\, m_{tsu}} \times \frac{n_{k.tsu}\, n_{l.usu}}{n_{j.ssu}} = \frac{616{,}432{,}389}{8100} \times \frac{n_{k.tsu}\, n_{l.usu}}{n_{j.ssu}}$$

 

where $n_{l.usu}$ is the number of employed members in the selected household, $n_{k.tsu}$ is the total number of households with at least one employed member in the k-th community (SSU), and $n_{j.ssu}$ is the working population of the j-th SSU. On average, each respondent in the survey represents approximately 76,000 employed urban residents.

However, due to the existence of various errors (especially refusals and falsification) in the actual survey, 6,702 individual cases instead of 8,100 were finally retained in this survey. If changes in weights owing to refusals are to be corrected (Saerndal & Lundstrom, 2005), the value 8,100 incorporated into the above equation should be replaced by 6,702. That is, the weight wgtrsp after correcting for the refusal effect is:

 

$$wgt_{rsp} = \frac{616{,}432{,}389}{6702} \times \frac{n_{k.tsu}\, n_{l.usu}}{n_{j.ssu}}$$

 

We recommend using this weight when the survey data are employed for descriptive analysis or for answering empirical questions. Under this weight, each respondent represents approximately 92,000 employed urban residents. The change in representative scope between the two weights reflects the potential effect of refusals: refusals reduce the coverage of the sample over the population. The refusal correction weight restores the sample's coverage of the full population under the assumption that the refusing and responding populations are homogeneous in the parameters to be estimated.
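A minimal sketch of these weight formulas, with the community- and household-level counts below chosen purely for illustration:

```python
def design_weight(n_j_ssu, n_k_tsu, n_l_usu, retained=8_100, N_total=616_432_389):
    """wgt = (N_total / retained) * (n_k_tsu * n_l_usu / n_j_ssu).

    retained = 8,100 gives the design weight wgt_case; replacing it with the
    6,702 finally retained cases gives the refusal-corrected weight wgt_rsp."""
    return (N_total / retained) * (n_k_tsu * n_l_usu / n_j_ssu)

# Illustrative community: 30,000 working residents, 9,000 households with at
# least one employed member, and 2 employed members in the selected household.
print(round(design_weight(30_000, 9_000, 2)))                  # design weight
print(round(design_weight(30_000, 9_000, 2, retained=6_702)))  # refusal-corrected weight
```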

(2) Raking weight

In the PSU and SSU sampling, we used estimates derived from the Sixth Population Census as the sampling frame. However, there is an 8-year gap between the survey and the Sixth Population Census, during which significant changes took place not only in the size and structure of the population but also in social structures.

First, the pace of urbanization has accelerated. As Table 1 shows, the two urban categories of village-level areas, main urban areas and urban-rural fringe areas, rose to 10.23% and 4.28% from 7.87% and 3.77%, respectively, a combined increase of 2.87 percentage points. In other words, 2.87% of village-level areas were statistically reclassified from non-urban to urban, partly reflecting the pace of urbanization in China over those 8 years.

Second, the age and sex composition of the employed population has also changed. In the urbanization process, surplus labour migrates from rural to urban areas, and the migration process is clearly associated with sex and age, an academic consensus supported by a large body of empirical research.

Third, changes in demographic structures, such as aging of labor population, improvement in education levels and exit of women from the labor market, have also happened, causing a significant impact on the compositional biases of our sampling frame.

Fourth, there have also been changes in the nature of ownership and industrial structures in the Chinese economy, which may affect the position structure in the labor market and the occupational structure of the population, thereby potentially influencing our sampling frame.

[Table 1: Composition of village-level residence areas (main urban areas and urban-rural fringe areas); image not reproduced]

To correct for compositional biases between the survey data and the known population attributes, we used the latest statistics published by the NBSC and adjusted the data weights using the raking method. At the individual level, our target population was the 434.19 million employed urban population in 2018. Drawing on relevant statistical communiqués published by the NBSC, we calibrated on 5 variables: ownership type, sex, educational level, age group and province.
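Raking (iterative proportional fitting) repeatedly rescales the weights so that each margin matches its population total. The sketch below is a minimal implementation with a toy two-margin example; the survey itself rakes on the five margins listed above using NBSC totals:

```python
import numpy as np

def rake(weights, categories, targets, n_iter=100, tol=1e-8):
    """Minimal raking: `categories` maps variable -> array of category codes per
    case; `targets` maps variable -> {code: target population total}."""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(n_iter):
        max_change = 0.0
        for var, codes in categories.items():
            for code, target in targets[var].items():
                mask = codes == code
                total = w[mask].sum()
                if total > 0:
                    w[mask] *= target / total
                    max_change = max(max_change, abs(target / total - 1.0))
        if max_change < tol:
            break
    return w

# Toy example: six cases, two margins (sex and age group), totals in persons.
sex = np.array([1, 1, 1, 2, 2, 2])
age = np.array([1, 2, 2, 1, 1, 2])
w = rake(np.full(6, 70_000_000.0),
         {"sex": sex, "age": age},
         {"sex": {1: 250_000_000, 2: 184_190_000},
          "age": {1: 200_000_000, 2: 234_190_000}})
print(w.round(0), round(w.sum()))   # margins reproduced; total about 434.19 million
```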

As Table 2 shows, the “self-employed”, “privately-owned” and “state-owned” ownership types, which account for considerably large proportions, are roughly consistent with the calibrated composition. In the survey, the proportions of “publicly-owned” and “foreign-invested” organizations are evidently lower than the calibrated figures, while those of “collectively-owned” organizations and “public institutions” are evidently higher. The refusal correction weight moves these proportions in the right direction, but with limited effect.

[Table 2: Ownership composition of the sample compared with the calibrated composition; image not reproduced]

 

The sex composition of the survey data (Table 3) is basically consistent with the calibrated data; even before calibration, the sample's sex ratio was already close to the calibration ratio.

[Table 3: Sex composition of the sample compared with the calibrated composition; image not reproduced]

The age composition of the survey data shows relatively small bias. As Table 4 shows, the proportion of respondents aged 25 to 29 is the most biased, 6.8 percentage points higher than the calibrated proportion; the proportion aged over 65 also deviates from the calibrated data, but this has little effect because of its small share of the population. The refusal correction weight generally shifts the age distribution towards the calibrated population.

[Table 4: Age composition of the sample compared with the calibrated composition; image not reproduced]

With respect to education levels, the groups with junior middle school and technical college education show relatively large biases. The proportion of respondents with junior middle school education is only 13.5%, far lower than the 33.7% released by the National Bureau of Statistics (see Table 5); conversely, the proportion with technical college degrees is twice that of the calibrated data, and the percentage with master's degrees is also relatively high. Judging from the education structure, the respondents in this survey are evidently more highly educated, most of them having at least a junior middle school education.

[Table 5: Educational composition of the sample compared with the calibrated composition; image not reproduced]

The last calibrated item is province. It should be noted that the sampling design of this survey supports inference only for the nation-wide employed urban population and should not be used for inference at the provincial level or below. The survey data cover respondents from urban areas of 23 provinces, direct-administered municipalities and autonomous regions, but, restricted by the sampling frame, the sample cannot reflect the provincial distribution of urban jobs across the country; moreover, the geographical distributions of residence and work are not identical. It is therefore necessary to calibrate the composition of the employed population by province. As Table 6 shows, compared with the data released by the NBSC, the proportions of employed persons surveyed in Guangxi, Hubei and Hebei are relatively high, while the proportions from Jiangsu and Guangdong, the two largest provinces by GDP, are markedly low. The direction in which the refusal correction weight shifts the province composition is uncertain, and in some cases it moves it the wrong way. For example, the proportion for Guangdong dropped from 6.2% in the sample to 5.6% after refusal correction, whereas in the calibrated data Guangdong accounts for 12.6% of the nation's employed urban population. Shanghai illustrates a positive effect: its proportion rose from 1.9% in the sample to 2.8% after refusal correction, against 4% in the calibrated data. Because the survey was carried out by different implementing agencies in different regions, the effect of the refusal correction weight depends on the quality of implementation maintained by each agency, which explains the uncertainty in the direction of its effect on the province composition.

[Table 6: Provincial composition of the sample compared with the calibrated composition; image not reproduced]

Through the operations described above, the composition of the employed urban population in the survey data is made consistent with the data released by the NBSC for 2018 on the above 5 dimensions. We also expect calibration on these 5 dimensions to reduce the compositional bias of the survey data in other respects. Next, we assess the calibrated survey data from the perspective of data quality.

5. Overall assessment of data quality

From the general perspective of survey error, every survey dataset contains a variety of errors and biases (Biemer et al., 2017). Although the research team invested substantial human, financial and material resources in monitoring, assessing and correcting the known biases and errors in this survey, the released data may inevitably still contain unrecognized errors and biases. Below, we assess the quality of the survey data in terms of internal reliability, external validity and possible biases, that is, the degree to which the survey data reflect social reality (Biemer & Lyberg, 2003).

(1) Internal reliability

The internal reliability of data refers to the stability of its various measures. In demography and the social sciences, the Myers' composite indicator (Myers' blended index) is often used to measure the reliability of a continuous variable across units of observation (Shryock et al., 1976). Here we examine two variables from the survey data, respondents' ages and the telephone numbers they provided, as indicators for testing internal reliability; the former is non-sensitive information, while the latter is relatively sensitive.

In this survey, we asked respondents to leave their telephone or cellphone numbers. In total, 62.22% of respondents left format-compliant cellphone or (in extremely rare cases) landline numbers; the distribution of the last digits is as follows (see Table 7):


[Table 7: Distribution of the last digits of respondents' telephone numbers; image not reproduced]

As can be seen, only one last digit, “4”, has a proportion (7.68%) evidently lower than the others. The Myers' composite indicator for telephone numbers is 3.57, indicating that only 3.57% of telephone numbers would need their last digit shifted for the last digits to follow a statistically uniform distribution. This is consistent both with statistical expectations and with everyday experience (the digit 4 is commonly avoided when choosing phone numbers). In this respect, the survey data are basically reliable.

With respect to the last digits of ages, Table 8 shows some concentration on the digits 0, 5 and 8. The estimated Myers' composite indicator is 34.31, indicating that at least 34.31% of the last digits of respondents' ages would need to be adjusted for them to follow a statistically uniform distribution. In earlier surveys conducted by the research team, the Myers' composite indicators for age were largely around 5%. Compared against the benchmark of 20 as the upper limit of the Myers' indicator, the quality of this survey data still falls within the acceptable range.

[Table 8: Distribution of the last digits of respondents' ages; image not reproduced]
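As an illustration of this kind of digit-preference check, the sketch below computes the index of dissimilarity between the observed last-digit distribution and a uniform 10% per digit; this simplified measure matches the interpretation given above, whereas the full Myers' blended procedure also reweights digits across the age range:

```python
import numpy as np

def last_digit_dissimilarity(values) -> float:
    """Index of dissimilarity (in %) between observed last digits and a uniform
    distribution: the share of cases whose last digit would have to change."""
    digits = np.asarray(values, dtype=np.int64) % 10
    props = np.bincount(digits, minlength=10) / len(digits)
    return 50.0 * np.abs(props - 0.1).sum()      # 0.5 * sum|p_d - 0.1|, in percent

# Toy ages with heaping on 0 and 5; real checks would run on the survey variable.
ages = np.array([25, 30, 35, 35, 40, 40, 40, 48, 50, 55])
print(round(last_digit_dissimilarity(ages), 2))
```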

In general, the survey data are of upper-middle quality in terms of internal reliability and can thus broadly reflect the general picture of the domain studied, although further improvement in measurement accuracy is needed.

(2) External validity

The external validity of survey data refers to the degree to which its estimates are consistent with other authoritative statistics. Here we selected the number of members of the Communist Party of China (CPC) as an external reference indicator. The CPC is the largest political party in the world and the governing party of China; the number and composition of party members released by its Organization Department can be considered a highly reliable external indicator, as this information is essentially unaffected by economic interests or administrative factors such as official performance evaluation. More importantly, the age criterion for joining the CPC largely overlaps with that for employment: after removing retired persons and elderly persons who have never been employed, it covers essentially the entire employed adult population.

After calibration weighting, the survey data show that the proportion of CPC members among the employed urban population is 8.7%, with a 95% confidence interval of [6.94%, 10.46%]. According to the Statistics Bulletin of the Communist Party of China: 2018, the number of urban employed CPC members, excluding those engaged in agriculture, animal husbandry and fishing, students and retired persons, is 45.199 million, or 10.4% of the 434.19 million urban employed persons, which falls within the confidence interval estimated from this survey. As mentioned earlier, CPC membership was not used as a calibration variable, so it can serve as an external validity indicator.

Apart from the number of CPC members, we also estimated “the number of persons covered by the basic endowment insurance for urban workers” and “the year-end number of urban employees covered by the basic medical insurance”. Based on the survey data, the 95% confidence interval for the number of persons covered by the basic endowment insurance for urban workers is [29,308.53, 34,551.99] in units of 10,000 persons (roughly 293 million to 346 million). According to data released by the National Bureau of Statistics, the year-end number of urban employees covered by the basic endowment insurance for urban workers in 2018 was 301.04 million, which falls within this confidence interval.

Based on the survey data, the 95% confidence interval for the number of urban employees covered by the basic medical insurance is [26,374.40, 31,617.86] in units of 10,000 persons (roughly 264 million to 316 million). According to data released by the National Bureau of Statistics, the year-end number of urban employees covered by the basic medical insurance in 2018 was 316.808 million, slightly above the upper limit of this interval. Considering that the figures released by the Ministry of Human Resources and Social Security include retired workers, who fall outside the inference range of this survey, the estimates derived from this survey are basically reliable.
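A rough sketch of how such a weighted proportion and its confidence interval could be computed; the data here are simulated, and a design-based variance estimator that uses the PSU/SSU structure (rather than this simple deff-adjusted normal approximation) would be preferable for the released data:

```python
import numpy as np

def weighted_proportion_ci(y, w, deff=6.0, z=1.96):
    """Weighted proportion with an approximate 95% CI based on the effective
    sample size n / deff (a crude stand-in for design-based variance estimation)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    p = np.sum(w * y) / np.sum(w)
    se = np.sqrt(p * (1 - p) / (len(y) / deff))
    return p, (p - z * se, p + z * se)

# Simulated 0/1 indicator (e.g. CPC membership) with simulated calibration weights.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.09, size=6_702)
w = rng.uniform(50_000, 150_000, size=6_702)
print(weighted_proportion_ci(y, w))
```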

The above indicators show that the survey data have considerably high external validity. As the first sociological survey dataset dedicated to the urban work environment in China, it can serve as a solid foundation for theoretical research, hypothesis testing and policy analysis despite its imperfections.

(3) Possible biases

As a process of collecting information and generating knowledge in public spaces, a social survey is reliant on respondents being candid and open and survey implementers being professional and diligent. This survey data does not cover high-end or confidential workplaces that are inaccessible for social surveys, or workers that have no residential addresses or whose addresses are inaccessible. In a low-trust society, any information disclosure entails high risks. Our survey data is not just a product of such a low-trust society, but also a manifestation of the conditions in this society. Inevitably, it carries with it the marks of this era and society.

 

References:

Biemer, Paul P., & Lars Lyberg. 2003. Introduction to Survey Quality. Hoboken, NJ: Wiley.

Biemer, Paul P., et al. 2017. Total Survey Error in Practice. Hoboken, NJ: Wiley.

Kish, Leslie. 1965. Survey Sampling. New York: John Wiley & Sons.

Saerndal, Carl-Erik, & Sixten Lundstrom. 2005. Estimation in Surveys with Nonresponse. Hoboken, NJ: Wiley.

Shryock, Henry S., Jacob S. Siegel, & Edward G. Stockwell. 1976. The Methods and Materials of Demography, Condensed Edition. New York: Academic Press.

Valliant, Richard, Jill A. Dever, & Frauke Kreuter. 2018. Practical Tools for Designing and Weighting Survey Samples, 2nd ed. New York: Springer.

Zhe Xiaoye & Chen Yingying. 1995. China's First Research on “Occupation-Identity” Esteem in Rural Areas. Social Sciences in China, No. 6.