2017 China Work Environment Research

METHODOLOGY

Date: 2021-06-21


Sampling and Data Cleansing for the 2017 Survey


In the Work Conditions Survey of Chinese Urban Residents (2017), we built the sampling frame of primary sampling units (PSUs) from the data of China’s sixth population census provided by the National Bureau of Statistics of China (NBSC). Then, using NBSC data on the total population, the population aged 16 and over, and the number of households for the chosen PSUs and for all secondary sampling units (SSUs) and tertiary sampling units (TSUs), we selected employed persons aged 16 and over living in communities (residents’ committees) across county-level cities, prefecture-level cities and directly administered municipalities in China as the subjects of our survey. We collected data at the individual, household, organizational and community levels through a door-to-door questionnaire survey, with a view to measuring, evaluating and analyzing the work conditions of the employed urban population in China. The sampling scheme is designed to make the sample representative of the employed urban population in China.

Ⅰ. Sampling design

(Ⅰ) Target population

The target population of the Work Conditions Survey of Chinese Urban Residents (2017) is the employed population aged 16 and over living in urban areas of mainland China. Because we selected only one person per sampled household, the survey data can also be weighted to be representative of employed urban households in China. The term “employed urban population” is operationally defined as the employed population aged 16 and over living in communities (residents’ committees) across county-level cities, prefecture-level cities and directly administered municipalities in China.

(Ⅱ) Sampling frame

This survey adopts a complex sampling design. County-level administrative units (districts and county-level cities) served as PSUs. The PSU sampling frame was built from the data of China’s sixth population census, conducted in 2010, combined with the latest information on administrative divisions released by the Ministry of Civil Affairs of the People’s Republic of China. Where a PSU contained fewer than 9 SSUs, we merged it with a geographically adjacent PSU so that the merged unit contained more than 9 SSUs. A total of 60 county-level urban areas (PSUs) were sampled using the probability proportional to size (PPS) method.

For each PSU, we applied the PPS method to select 9 communities (residents’ committees) as SSUs. Where an SSU was inaccessible, had been relocated or had undergone a change of administrative division, we drew a replacement SSU from the sampling frame using the same method.

For each selected SSU, the survey implementing agency dispatched field interviewers to draw community maps and compile a list of household addresses, which served as the third-stage sampling frame. From this frame we selected 15 households as TSUs by simple random sampling without replacement (SRSWOR). To allow for refusals and non-target households, we asked interviewers to provide an estimated refusal rate for each sampling unit and issued all sampled addresses in a single batch, with the number of addresses given by the following equation:

number of selected addresses = int(15 / (1 − refusal rate) + 0.5)

Once the number of successfully surveyed households reached 10, the survey was stopped; otherwise, the interviewers recalculated the number of additional addresses from the number of households still to be surveyed and the updated refusal rate, using the same equation.
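The address-count rule can be sketched as a small function. This is a minimal illustration: the function name, the `needed` parameter (15 for the first batch, the number of outstanding households for a top-up) and the example refusal rates are assumptions, not part of the survey documentation.

```python
def addresses_to_issue(needed: int, refusal_rate: float) -> int:
    """Number of addresses to draw: int(needed / (1 - refusal_rate) + 0.5)."""
    if not 0 <= refusal_rate < 1:
        raise ValueError("refusal_rate must be in [0, 1)")
    # Adding 0.5 before truncating rounds to the nearest whole address.
    return int(needed / (1 - refusal_rate) + 0.5)


# First batch for one community, assuming an estimated refusal rate of 50%.
first_batch = addresses_to_issue(15, 0.50)

# Top-up batch: 4 households still outstanding, refusal rate revised to 60%.
top_up = addresses_to_issue(4, 0.60)
```

With a higher estimated refusal rate, more addresses are issued up front, which reduces the number of top-up rounds.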

For each selected TSU (household), we then selected one individual per address as the ultimate sampling unit (USU), using the random number table that we established (see the Questionnaire). If the selected individual declined the interview and did not permit another household member to be interviewed instead, the interviewer moved on to the next address in the selected address list.

(Ⅲ) Sample size

With simple random sampling of the population (without replacement), we could obtain the estimated sample size using the following equation:

[Equation image: 图片1.png]

Where p is the proportion of the population that falls in a given category; uα is the critical value of the distribution at significance level α; and d is the permissible absolute difference between the sample estimate and the population parameter. According to the above equation, with α = 0.05 (a 95% confidence level) and an absolute error of d = 3%, a sample of about 1,000 is sufficient for estimating most distributions.
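The equation itself survives only as an image reference, so the sketch below assumes the standard simple-random-sampling formula n = uα² · p(1 − p) / d², which is consistent with the definitions of p, uα and d given in the text. The worst-case value p = 0.5 and the function name are illustrative assumptions.

```python
from statistics import NormalDist


def srs_sample_size(p: float, d: float, alpha: float = 0.05) -> float:
    """Sample size for estimating a proportion under simple random sampling.

    Assumed formula: n = u^2 * p * (1 - p) / d^2, where u is the two-sided
    standard normal critical value at significance level alpha.
    """
    u = NormalDist().inv_cdf(1 - alpha / 2)
    return u ** 2 * p * (1 - p) / d ** 2


# p = 0.5 maximizes p * (1 - p), so it gives the largest required sample.
n_worst_case = srs_sample_size(p=0.5, d=0.03)
```

Under these assumptions the result is a little over 1,000, in line with the "about 1,000" quoted in the text.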

However, because this survey used a multistage complex sample rather than a simple random sample, we must also take the design effect (deff) into account. The design effect is the ratio of the sampling variance of an estimate under the complex design to its sampling variance under simple random sampling with the same sample size. The design effect is estimated as follows:

[Equation image: 图片2.png]

Where b is the number of sample cases selected from a single sampling unit, and roh is the rate of homogeneity within the sampling unit. The equation indicates that deff grows with both the number of cases selected per sampling unit and the homogeneity within the unit. The sampling scheme of this survey therefore increases the number of sampling units and reduces the number of cases within each unit as far as possible. Based on this design and our previous sampling experience, we set the deff of this survey at 6. The sample size that takes deff into account is thus 1,000 × 6 = 6,000.
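Because the deff equation is only an image reference here, the following sketch assumes the common approximation deff = 1 + (b − 1) · roh, which matches the two properties described above. Taking b = 15 (households per community) is an illustrative assumption; the text does not say which cluster size underlies the chosen deff of 6.

```python
def design_effect(b: float, roh: float) -> float:
    """Assumed approximation: deff = 1 + (b - 1) * roh."""
    return 1 + (b - 1) * roh


def implied_roh(deff: float, b: float) -> float:
    """Homogeneity that would produce a given deff for cluster take b."""
    return (deff - 1) / (b - 1)


# Homogeneity implied by deff = 6 if 15 cases are taken per unit (assumption).
roh_for_deff_6 = implied_roh(deff=6, b=15)

# Smaller takes per unit give a smaller deff at the same homogeneity.
deff_with_smaller_take = design_effect(b=5, roh=roh_for_deff_6)
```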

To obtain an unbiased parameter estimate, a certain level of response rate (r) must be ensured for a social survey:

[Equation image: 微信图片_20210711151415.png]

Methodologically, the target population can be divided into two potential populations according to whether a response is given: the population available for survey and the population unavailable for survey. The size of the former is response rate × target population size; the size of the latter is (1 − response rate) × target population size. The lower the response rate, the smaller the population to which the sample estimates can be generalized. Only by assuming that the parameters of the available and unavailable populations do not differ significantly can we generalize the survey results to the whole population in the presence of nonresponse. As a rule of thumb, a sampling survey should achieve a response rate of at least 50% (at which point the available and unavailable populations each account for half of the target population). To allow for nonresponse, the number of selected cases must be increased accordingly. We assumed a response rate of 75%, so the sample size for this survey should be 6,000 / 0.75 = 8,000. Taking the allocation of the sample across stages into account, we set the final sample size at 8,100 (= 60 × 9 × 15): 60 PSUs, each comprising 9 SSUs, with 15 TSUs per SSU and 1 USU per TSU.
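The chain of adjustments above (simple random sample, then design effect, then nonresponse) can be checked with a few lines of arithmetic. The function and variable names are illustrative; the figures are those quoted in the text.

```python
import math


def required_sample(n_srs: int, deff: float, response_rate: float) -> int:
    """Inflate an SRS sample size for design effect and expected nonresponse."""
    return math.ceil(n_srs * deff / response_rate)


n_required = required_sample(n_srs=1000, deff=6, response_rate=0.75)

# Allocation actually adopted: 60 PSUs x 9 SSUs x 15 households.
n_design = 60 * 9 * 15
```

The adopted allocation slightly exceeds the computed requirement, which is why the final figure is 8,100 rather than 8,000.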

(Ⅳ) Sampling frame and sampling procedure

1. The first-stage sampling: selection of PSUs (cities and districts)

The PSU sampling frame of this survey is derived from the Sixth National Population Census Data (County-specific) published by the NBSC in 2010. Because eight years had passed since 2010, we adjusted the urban population aged 8 and over (who would be aged 16 and over eight years later) for mortality, using the gender- and age-specific crude death rates from the sixth census, to account for population change. The adjusted data were used as the PSU sampling frame (1,226 PSUs), with the urban population aged 8 and over as the measure of size. According to the sampling design, we selected 60 PSUs from the 1,226 PSUs (which exclude cities in Xinjiang and Tibet) using the PPS method. The 60 PSUs are distributed across 24 provinces, municipalities and autonomous regions, with Hubei Province containing the most (5 PSUs) and Yunnan Province the fewest (1 PSU).
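The text does not say which PPS variant was used. As one common possibility, the sketch below implements systematic PPS selection on a toy frame; the size measures, the seed and the function name are hypothetical and serve only to show the mechanics.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, m, rng):
    """Pick m units with probability proportional to size (systematic PPS).

    Returns the indices of the selected units; a unit larger than the
    sampling interval can be hit more than once.
    """
    total = sum(sizes)
    step = total / m                      # sampling interval
    start = rng.random() * step           # random start in [0, step)
    cumulative = list(accumulate(sizes))  # running totals of the size measure
    chosen, idx = [], 0
    for k in range(m):
        point = start + k * step
        # Advance to the first unit whose cumulative size reaches the point.
        while idx < len(sizes) - 1 and cumulative[idx] < point:
            idx += 1
        chosen.append(idx)
    return chosen


# Toy frame: hypothetical size measures for six PSUs.
toy_sizes = [120_000, 80_000, 300_000, 50_000, 250_000, 200_000]
toy_sample = pps_systematic(toy_sizes, m=3, rng=random.Random(2017))
```

Under this scheme a unit's selection probability is m times its share of the total size measure, as long as no unit is larger than the sampling interval.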

2. The second-stage sampling: selection of SSUs (residential committees of communities)

The SSU sampling frame of this survey is derived from the raw data of the Sixth National Population Census published by the NBSC in 2010; the relevant departments of the bureau provided the number of households and the urban population aged 8 and over in 2010. According to the sampling scheme, 9 residential communities were to be selected from each PSU as SSUs using the PPS method, giving 540 residential communities in principle. In the actual sampling process, we segmented some residential communities with large populations, so the same residential community may have been selected more than once. The final set of SSUs consists of 529 residential communities.

3. The third-stage sampling: selection of TSUs (households)

For the purpose of this survey, households include regular households, collective households and the various collective residential units covered by household registration. The TSU sampling frame was derived from the land plot drawings and address lists made by the survey implementing agency, which compiled the “household sampling frame” on site based on the land plots. Once the “household sampling frame” was established, the research team used computer software to randomly select a list of addresses for the door-to-door survey. Interviewers were not allowed to conduct interviews at addresses not included in the list. For a sampled household that still could not be reached after three contact attempts, the interviewer was required to record the reason in the relevant section of the Registration Form of Home Visits before moving on to the next household.

To incorporate the migrant population into the survey, we followed the principle of “determining survey subjects by households” in TSU sampling. That is, taking residential addresses as tertiary sampling units (TSUs), we regarded a household as a potential survey target as long as at least one of its members was employed, regardless of whether its members were household-registered residents, long-term residents or migrants.

4. The fourth-stage sampling: selection of USUs (respondents)

The fourth-stage sampling frame comprises all employed members aged 16 and over in the selected households. After gaining access to a selected home, the interviewer chose the respondent from among the household members using the Kish grid on the first page of the questionnaire. In centralized residential households, the Kish grid was used only when there were no more than 10 members; if there were more than 10, the interviewer followed the “median age” principle, that is, selected the individual whose age fell in the middle of all eligible respondents.

A questionnaire-based interview began if the selected respondent agreed to be interviewed. If the respondent refused, the interviewer recorded the respondent’s gender and the “cause of failure” in the corresponding sections under “Failed Interviews” - “Reasons Given by the Respondent” on the Registration Form of Home Visits. If the selected respondent could not be interviewed for reasons such as being away from home, being abroad or being seriously ill, the interviewer considered whether to make an appointment for a later interview. If not, the interviewer was likewise required to record the respondent’s gender and the “cause of failure” in the corresponding sections of the Registration Form of Home Visits.

Regardless of the causes for a failed interview, an interviewer was prohibited from switching a selected respondent to another member of the household; rather, the interviewer should specify the cause in the Registration Form of Home Visits before proceeding to the interview with the next household.

Ⅱ. Quality control of the survey

The objective of survey quality control is to reduce the systematic errors (biases) of survey data under the guidance of the “overall research design”. Based on the research design of Work Conditions Survey of Chinese Urban Residents (2017), the survey may entail systematic errors in the following three stages:

1) First, the household selection stage. For example, factors like incomplete plot drawings for sampling, inaccurate entries on sampling forms and arbitrary replacements of household addresses by interviewers could all cause errors.

2) Second, the respondent selection stage. For example, biases in sample gender and age could be caused if interviewers fail to perform in-home selection based on the Kish grid procedures or if the entries of the Kish grid are non-compliant.

3) Third, the field interview stage. For example, biases could occur if interviewers systematically miss questions, intentionally avoid parts of question sets by exploiting skip rules, merge questions that should be asked item by item, or lead or prompt respondents toward particular answers.

Revolving around the three stages stated above, data quality control was performed through the following procedures in this survey.

(Ⅰ) The household selection stage

A sampler was dispatched to each selected residential community for a field visit, where he or she examined all buildings in the area and drew or updated the Residential Land Plot Drawing for Sampling. On that basis, the sampler filled out a Land Plot Sampling Form recording the number of floors and entrances in each building and the number of households per entrance per floor. All residential households in these buildings constitute the sampling frame for this sampling round. The sampler had to ensure that the identifier numbers of the drawings, residential buildings and rooms were consistent. If the number of households indicated by the Land Plot Sampling Form was evidently lower than the size of a regular residential community in the locality, the sampler promptly checked whether the Land Plot Sampling Drawing and the Land Plot Sampling Form were complete.

Upon receiving the data of Land Plot Sampling Form, the contractor’s project team should provide a list of addresses to be visited for each community developed in a random selection process. A visit to a household cannot be regarded as failed unless there have been 3 nonresponses or 2 refusals. In case of any failed visits, the interviewer must truthfully specify the causes on the Registration Form of Home Visits. For communities where a specified number of valid sample interviews can still not be realized after 3 home visits, the contractor’s project team should provide the second set of visiting addresses.

During home visits, an interviewer must carefully fill out the Registration Form of Home Visits and is prohibited from arbitrarily changing the visiting address. If it is found in the validation process that more than 100 households’ entries are missing in the Land Plot Sampling Form, the questionnaires of the residential community will be regarded as invalid.

The interviewer must take a set of photos pertaining to an interviewed household (photos of the full name of the residents’ committee, the identifier number of the residential building/bungalow, and the household’s doorplate number), where the indicated address should be the same as the sample address. Questionnaires with photos missing or inconsistent addresses will be regarded as invalid.

Based on the above principles and the records of the Registration Form of Home Visits, a total of 37,115 household addresses were contacted in this survey. Of these, 8,271 households were successfully visited (22.3%); 4,109 households could not be reached (11.1%), because the interviewers were stopped by entrance control systems or security guards, nobody answered, or the suitable respondents were not home; 23,044 households refused to take part for various reasons (62.0%); 2.1% of households had no suitable respondent; and 941 households involved other circumstances, including relocation or questionnaires deemed invalid (2.5%). On average, about 1 in 5 contacted households was successfully visited.
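The disposition shares can be recomputed from the counts reported above. This is a sketch with illustrative variable names; the count for households without a suitable respondent is not given in the text, so the remainder is computed on the assumption that the categories are exhaustive and mutually exclusive. Recomputed shares may differ from the reported ones in the last digit.

```python
contacted = 37_115

# Counts reported in the text, by outcome of the contact attempt.
dispositions = {
    "completed": 8_271,
    "not reached": 4_109,
    "refused": 23_044,
    "other": 941,
}

# Percentage of all contacted addresses, rounded to one decimal place.
shares = {k: round(100 * v / contacted, 1) for k, v in dispositions.items()}

# Addresses not covered by the four reported counts (assumed to be the
# households without a suitable respondent).
remainder = contacted - sum(dispositions.values())
```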

The interviewers completed 58,635 household visits in this survey. Across the week, these visits show a “one-peak, one-valley” pattern: a peak around the weekend and a valley in the middle of the working week. The specific percentages are as follows:

[Image not reproduced: 微信图片_20210711151824.png]

The in-home visits in this survey mostly took place during the daytime (see Figure 1), with a peak around 10 a.m. and another peak between 2 and 3 p.m. The average time interval for moving between every two household addresses is 27 minutes, and the mode is 11 minutes.

Taken together, the daily and weekly timing of the visits reflects the flexible role and working behaviour of specialized interviewers, who worked without particular regard to weekends.

The average length of in-home interviews is 40.7 minutes; the median length is 37 minutes; the shortest interview lasted 10 minutes, while the longest one lasted 472 minutes.

In addition, we examined the relationship between the probability of refusal and the timing of home visits and found no statistically significant relationship.

Figure 1. Timing characteristics and frequency of in-home visits

[Figure image: 图片27.png]

(Ⅱ) The respondent selection stage

According to survey procedures, the interviewer should select the respondent from household members using the Kish grid provided on the first page of the questionnaire after entering a household. Respondent selection is crucial for ensuring sample randomness and thus should be properly implemented. The survey institution conducted two rounds of reviews within 2 days after the completion of questionnaires to check whether the Kish grid-based sampling was implemented correctly. If any flaws with the sampling process were found, the in-home interviews must be re-performed.

The survey implementing agency was also required to rapidly aggregate the gender- and age-specific survey data by city. In cases of gender-ratio imbalance or biased age structure, the implementer must verify the situation and submit a situation report to the research team.

Interviewers had to record the audio of each entire interview they conducted, and the proper “respondent selection” procedure had to be audible in the recording. All questionnaires administered by interviewers who were found to have falsified data were regarded as invalid. Questionnaires for which the “respondent selection” procedure was missing were also regarded as invalid.

(Ⅲ) The in-home interview stage

The survey implementing agency provided effective training for interviewers using the Interviewer Manual and relevant video data. With respect to the field interview stage, the survey institution needed to conduct two rounds of reviews within 2 days after the completion of questionnaires in order to avoid missed questions or misuse of question skipping. Immediate remedies must be implemented if the above mistakes occurred. After completing the interviews with the first samples, an interviewer must promptly transmit the electronic version of the questionnaires and audio to the Chinese Academy of Social Sciences (CASS) for scrutiny, so that potential problems could be evaluated and corrected.

After questionnaires were completed, an audio record review covering all interviewers and residents’ committees was performed. If an audio record indicated that an interview lasted no longer than 15 minutes, the reviewer re-examined the case and provided timely guidance to the interviewer. If any mistakes were found in the review process, the reviewer was required to promptly notify the survey implementing agency and require it to make immediate rectifications.

In cases of evident falsification of audio records, the survey implementing agency was required to redo the interviews for all questionnaires administered by the interviewer concerned. Interviewers had to write down respondents’ telephone numbers on the first page of the questionnaire for the purpose of telephone checks.

Ⅲ. Data entry and data cleansing

(Ⅰ) Data entry

Two survey modes, PAD-based and paper-based questionnaires, were employed in this survey. PADs account for 76.7% of the final data; these data were entered as the field interviews progressed and transmitted to the survey implementing agency via wireless networks. In addition to the quality control implemented by the survey implementing agency, we proofread 100% of the recorded data by re-listening to the audio files. The geographic information of each interview was also validated by comparing the PAD-generated latitude/longitude coordinates with Baidu Map. Evidently inconsistent cases were removed.

Audio and photo records were also required for paper-based questionnaires. Paper-based questionnaire data were entered using EpiData Entry 3.1 and validated through double-entry comparison. The data entry quality control functions of EpiData Entry were also used: with a pre-established check program, the software automatically detected and controlled out-of-range values of variables and logic errors between variables. When a single variable showed an outlier or several variables showed an inconsistent logical relation, the original questionnaire was consulted to confirm whether the error arose during entry; if not, the respondent was contacted for confirmation and further remedial measures.

(Ⅱ) Data cleansing

Following data entry, we further carried out data cleansing, a process that mainly involved the following aspects:

1. Data structure review

This review comprised double-entry comparison (the data of the same questionnaire were entered separately by two data entry clerks and the two entries were compared; where cases or variables were inconsistent, the original questionnaire was consulted to correct the mistakes, so that errors arising from data entry were minimized); a review of the uniqueness and completeness of cases; a review of questionnaire numbers, zip codes of residents’ committees and addresses through textual and code comparisons; and a review of interviewing times. We also uniformly coded occupation names and converted them into occupational prestige scores based on the research findings of Zhe & Chen (1995). The occupational prestige data are included in the published data.

2. GPS-based validation of interview addresses

Among the 8,795 sets of valid data that we received, 5,612 included GPS data. We retrieved the address corresponding to each pair of longitude and latitude coordinates using the Baidu Map API and compared it with the visited address recorded in the data. The result shows that the large majority of addresses are consistent at the provincial level; only 15 data sets are inconsistent, mainly involving Hebei, Shanxi, Chongqing, Henan, Tianjin and Sichuan. The inconsistent data sets were removed.
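The province-level comparison can be sketched as follows. The reverse-geocoding step is passed in as a callable because the actual Baidu Map request is not described in the text; the stub geocoder, field names and sample records below are all hypothetical.

```python
from typing import Callable, Iterable


def province_mismatches(
    records: Iterable[dict],
    reverse_geocode: Callable[[float, float], str],
) -> list:
    """Return IDs of records whose geocoded province differs from the recorded one."""
    mismatched = []
    for rec in records:
        if rec.get("lat") is None or rec.get("lon") is None:
            continue  # no GPS data: this check cannot be applied
        if reverse_geocode(rec["lat"], rec["lon"]) != rec["province"]:
            mismatched.append(rec["id"])
    return mismatched


def stub_geocoder(lat: float, lon: float) -> str:
    """Hypothetical stand-in for a real reverse-geocoding service."""
    return "Sichuan" if lat < 32 else "Hebei"


sample_records = [
    {"id": 1, "lat": 30.6, "lon": 104.1, "province": "Sichuan"},
    {"id": 2, "lat": 38.0, "lon": 114.5, "province": "Henan"},
    {"id": 3, "lat": None, "lon": None, "province": "Tianjin"},
]
flagged_ids = province_mismatches(sample_records, stub_geocoder)
```

Keeping the geocoder as a parameter lets the same check run against any service, and makes it easy to test without network access.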

Although the remaining cases were geographically consistent at the city and district level, many inconsistencies emerged at the level of specific communities. Because this was the first time the research team had used this approach to data validation, the specific causes of these inconsistencies are unclear. Potential sources of error include: (1) errors in the GPS information generated at the location of the in-home interview; (2) wording differences between the original address and the address matched from GPS information, for instance, the result returned by Baidu Map for Dazhou, Sichuan was Da County, Sichuan; (3) in some cases, in-home interviews were conducted with paper-based questionnaires but the data were entered through PADs, causing a disparity at the community level.

3. Validation of the Kish grid

Whether the Kish grid of the questionnaire was properly used was validated in accordance with the review procedures, and noncompliant questionnaires were considered invalid. Among the 8,780 cases reviewed, only 15 were found to be noncompliant with the randomness principle of the Kish grid; these questionnaires were removed. This indicates that, at least as far as the data records show, respondent selection in this survey observed the randomness principle.

4. Check of extreme values and their distributions

We manually checked the distribution of every variable. Using the reversed-order method, we treated extreme values involving a reversed order as missing values. (The frequency of a variable’s values should decline as they move further towards the extremes; a departure from this trend is called a reversed order.) For example, we treated total household incomes of 1 million or higher, 1,000 or more houses, housing prices of 130 million yuan or higher, and 120 or more children as missing values.
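Once the cut-offs have been chosen from the distributions, applying them is mechanical. The sketch below uses the four thresholds quoted above; the field names are hypothetical, and the manual reversed-order inspection that produced the thresholds is not reproduced.

```python
# Cut-offs quoted in the text; values at or above them are treated as missing.
THRESHOLDS = {
    "household_income": 1_000_000,
    "houses_owned": 1_000,
    "housing_price_yuan": 130_000_000,
    "children": 120,
}


def blank_extremes(record: dict) -> dict:
    """Return a copy of the record with extreme values replaced by None."""
    cleaned = dict(record)
    for field, limit in THRESHOLDS.items():
        value = cleaned.get(field)
        if value is not None and value >= limit:
            cleaned[field] = None
    return cleaned


raw_case = {"household_income": 2_500_000, "houses_owned": 2, "children": 1}
cleaned_case = blank_extremes(raw_case)
```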

5. Logic validation

We examined 65 potential logical relations in the questionnaire and checked possible logic errors one by one. Each instance was treated appropriately, for example by invalidating the questionnaire or rationalizing the values. Examples of such errors include: part-time job income reported by respondents without part-time jobs; more dependent family members than family members; household income lower than individual income; daily time use exceeding 24 hours; use of kindergartens or breastfeeding rooms by respondents without children; use of breastfeeding rooms by male respondents; maternity leave taken by male respondents; childbirth-related requirements reported by male respondents when seeking jobs; more direct reports than total workers in the company; a current job starting earlier than the first job; and overtime pay for respondents who do not work overtime.
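A few of the consistency rules listed above can be expressed as simple record-level checks. This is only a sketch covering five of the 65 relations; the field names are hypothetical and do not correspond to the actual variable names in the data set.

```python
def logic_errors(r: dict) -> list:
    """Return descriptions of the logical inconsistencies found in one record."""
    errors = []
    if not r["has_part_time_job"] and r["part_time_income"] > 0:
        errors.append("part-time income without a part-time job")
    if r["dependents"] > r["family_size"]:
        errors.append("more dependents than family members")
    if r["household_income"] < r["individual_income"]:
        errors.append("household income lower than individual income")
    if r["daily_hours_total"] > 24:
        errors.append("daily time use exceeds 24 hours")
    if r["current_job_start_year"] < r["first_job_start_year"]:
        errors.append("current job starts before first job")
    return errors


example_record = {
    "has_part_time_job": False,
    "part_time_income": 500,
    "dependents": 2,
    "family_size": 4,
    "household_income": 90_000,
    "individual_income": 120_000,
    "daily_hours_total": 24,
    "current_job_start_year": 2012,
    "first_job_start_year": 2005,
}
example_errors = logic_errors(example_record)
```

Returning the list of violated rules, rather than a single flag, makes it easier to decide case by case whether to invalidate the questionnaire or correct a value.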

In the meantime, we also deleted 12 cases whose in-home interviews took place earlier than October 2017 (trial survey) or later than March 2018 (supplementary survey) to make the timing of in-home visits more concentrated.

6. Interviewer falsification

In this survey, a total of 859 interviewers were hired, including 681 female interviewers (accounting for 77.3%) whose average age is 30.8, median age is 29, minimum age is 18 and maximum age is 68. The average length of work for these interviewers is 3.5 years, and the median length is 3 years. Specialized interviewers account for 36.0%, and students account for 17.5%. Interviewers are generally well educated: 0.2% hold junior middle school diplomas, 29.4% senior middle school diplomas, 49.3% junior college diplomas and 21.1% undergraduate or higher degrees. A relatively large number of interviewers were engaged because, on the one hand, in-home surveys have become increasingly difficult over the past decades and, on the other hand, the content of the questionnaire is sensitive and its sampling procedures are complicated and demanding. In particular, we implemented a three-tier process and data quality control system: checks by 31 frontline survey agencies, reviews by 2 general agencies based in Beijing and Chongqing, and a third review by the research team, each carried out separately.

Apart from process control and the review of interview behaviours, we also examined the quality of the valid data we had received. Using three scales located in the front, middle and back sections of the questionnaire, we treated each interviewer as a unit of analysis and reviewed the measurement structure of his or her survey data in order to detect any missed quality control steps, as well as data showing signs of commonly known falsification behaviours such as random entry or duplication.

We categorized the data collected by an interviewer as “falsified” and deleted them if any one of three measurement-structure characteristics was found across the three scales: (1) none of the correlation coefficients between the three scales was significant; (2) indicator variables had zero variance; (3) factor loadings were negative. Using this approach, we deleted a total of 1,870 sets of data.

We initially received 8,795 survey data sets from the survey implementing agency (including 761 data sets labeled by the survey implementing agency as invalid), and eventually obtained 6,895 valid samples after validation and cleansing.
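The final count can be reconciled with the removals reported in the preceding steps. The arithmetic below is only a check on the published figures: it assumes the three removals are disjoint, and it does not subtract the 12 timing-based deletions or the 761 agency-flagged sets separately, since the text does not say how those overlap with the other removals.

```python
received = 8_795

# Removals reported in the validation steps described above.
removed = {
    "GPS province mismatch": 15,
    "Kish grid noncompliance": 15,
    "measurement-structure screening": 1_870,
}

remaining = received - sum(removed.values())
```

Under these assumptions the remainder equals the 6,895 valid samples reported.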

Ⅳ. Weights and target population calibration

In social surveys, weights are crucial for ensuring the correspondence between the sample and the target population. Weights are mainly of two types: sampling weights and calibration weights. A sampling weight is the inverse of the probability that an individual case is sampled, which is determined by the sampling scheme. Methodologically, there are four main approaches to calculating sampling weights: design-based (randomization), model-based, model-assisted and Bayesian approaches (Valliant et al., 2018). In this survey, we adopted a design-based calculation in order to reflect the particular features of our sampling design. We also performed calibration against statistics published by the NBSC in order to compensate for potential biases (including compositional biases) arising from the outdated sampling frame, refusals by respondents and falsification by interviewers, and to reduce the variances and errors of the sample estimates. The survey data therefore have two weight variables: the sampling weight and the calibration weight. The former adjusts for the unequal selection probabilities of the multistage design; the latter further adjusts the sample structure to correct potential discrepancies between the sample and the population, especially demographic ones. The following is a brief description of how these weights were generated.

(Ⅰ) Sampling design weight

This survey is designed to use the multistage complex sampling. In the first stage, 60 urban districts were selected using the PPS method; in the second stage, residents’ committees in these urban districts were selected using the PPS method; in the third stage, households in the residential communities were selected using the SRSWOR method; in the fourth stage, 1 respondent was randomly selected using the Kish grid at the respondent's home.

Thus, when selecting PSUs using the PPS method in the first stage, the sampling probability pi for the i-th PSU is:

[Equation image: 图片3.png]

Where m is the sample size of each stage, and n denotes the employed population or its substitution variable in each sampling unit. N.total is the population aged 16 or over living in the residential communities of urban districts, which was calculated using data of the Sixth Population Census. As can be calculated based on the sampling frame, its value is 616432389. M.psu is the number of selected PSUs, that is, 60. Ni.psu is the employed population in the i-th PSU, that is, the sampling probability of the i-th PSU is the sum of the ratios of its employed population to the total employed population added for 60 times.

In the second stage, we selected 9 SSUs from each urban PSU using the PPS method. Given that the i-th PSU has been selected, the sampling probability of the j-th SSU, denoted by $p_{j.ssu|i}$, is:

$$p_{j.ssu|i} = m_{j.ssu} \times \frac{n_{j.ssu}}{n_{i.psu}} = 9 \times \frac{n_{j.ssu}}{n_{i.psu}}$$

Where $m_{j.ssu}$ is the number of SSUs selected from each PSU, which is set at 9; $n_{j.ssu}$ is the working population of the j-th SSU, and $n_{i.psu}$ is the working population of the i-th PSU. That is, the sampling probability of each SSU is the ratio of its working population to the working population of its PSU, multiplied by 9.
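The following is a minimal Python sketch of the first two stages. Only the constants 60, 9 and 616,432,389 come from the text; the PSU and SSU sizes and the function names are hypothetical. It illustrates why the PSU size drops out once the two PPS probabilities are multiplied.

```python
from fractions import Fraction

N_TOTAL = 616_432_389  # population aged 16 and over in the sampling frame (from the text)
M_PSU = 60             # PSUs selected in stage 1 (from the text)
M_SSU = 9              # SSUs selected per PSU in stage 2 (from the text)


def psu_probability(n_psu):
    """Stage-1 PPS probability: 60 times the PSU's share of the frame population."""
    return Fraction(M_PSU * n_psu, N_TOTAL)


def ssu_probability(n_ssu, n_psu):
    """Stage-2 PPS probability, conditional on the PSU having been selected."""
    return Fraction(M_SSU * n_ssu, n_psu)


# Hypothetical sizes: an SSU of the same size placed in a small and a large PSU.
n_ssu = 4_000
joint_small = psu_probability(200_000) * ssu_probability(n_ssu, 200_000)
joint_large = psu_probability(900_000) * ssu_probability(n_ssu, 900_000)

# The PSU size cancels, so both joint probabilities equal 60 * 9 * n_ssu / N_TOTAL.
print(joint_small == joint_large)
print(float(joint_small))
```

Exact fractions are used here only so that the cancellation can be checked with an equality test rather than a floating-point tolerance.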

In the third stage, we randomly selected households (TSUs) from the communities (SSUs). Because the sampling frame contained neither the number of households in each community nor the number of households with at least one employed member, PPS sampling was not possible at this stage, and we employed the SRSWOR method instead. Given that an SSU has been selected, the sampling probability of the k-th household, denoted by $p_{k.tsu|j}$, is:

$$p_{k.tsu|j} = \frac{m_{k.tsu}}{n_{k.tsu}} = \frac{15}{n_{k.tsu}}$$

Where $m_{k.tsu}$ is the number of households to be selected from each community, which is set at 15; $n_{k.tsu}$ is the total number of households with at least one employed member in the community. In this survey, we used a Land Plot Registry Form, which required interviewers to register all accessible addresses in a community before conducting in-home interviews. The number of these addresses was used as a proxy for $n_{k.tsu}$.
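As an illustration of the third stage, the sketch below draws 15 addresses without replacement from a registered address list and reports the resulting selection probability. The address list, the seed and the function name are hypothetical; only the sample size of 15 comes from the text.

```python
import random


def select_households(addresses, m=15, seed=None):
    """Draw m addresses by simple random sampling without replacement.

    Returns the selected addresses and the selection probability m / n,
    where n is the number of registered addresses (the proxy for the
    number of households with at least one employed member).
    """
    rng = random.Random(seed)
    n = len(addresses)
    if n <= m:
        # Every registered address is taken when the list is short.
        return list(addresses), 1.0
    return rng.sample(addresses, m), m / n


# Hypothetical community with 240 registered addresses.
addresses = [f"address-{i:03d}" for i in range(1, 241)]
chosen, prob = select_households(addresses, seed=2017)
print(len(chosen), prob)
```

Because the number of registered addresses differs between communities, this probability differs as well, which is the source of the unequal weights discussed in this section.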

In the fourth stage, interviewers visited addresses randomly selected from the Land Plot Registry Form and, as soon as they made contact with a household member, began the selection by asking how many household members at the address were employed and aged 16 and over. If the number of employed members was not 0, the interviewer used the Kish grid on the first page of the questionnaire to select the respondent and asked whether he or she was willing to participate. If the selected respondent was not willing to participate, the household was recorded as a refusal, and the interviewer moved on to the next household on the list of addresses until a respondent consented to the interview. Given that a home address has been selected, the sampling probability of each individual case, denoted by $p_{l.usu|k}$, is:

$$p_{l.usu|k} = \frac{1}{n_{l.usu}}$$

Where $n_{l.usu}$ is the number of employed members at the selected address, which is obtained from item B1 on the questionnaire.

According to the above steps, the sampling probability of each respondent, $p_{m.case}$, is the product of the above four probabilities:

$$p_{m.case} = p_{i} \times p_{j.ssu|i} \times p_{k.tsu|j} \times p_{l.usu|k} = \frac{m_{psu} \times m_{j.ssu} \times m_{k.tsu}}{N_{total}} \times \frac{n_{j.ssu}}{n_{k.tsu} \times n_{l.usu}}$$

Where the first term, $\frac{m_{psu} \times m_{j.ssu} \times m_{k.tsu}}{N_{total}}$, is the constant component of each case's sampling probability. As we used the PPS method in both the 1st and 2nd stages, the PSU size $n_{i.psu}$ cancels out, and a sample governed by this term alone would be self-weighting (equal probability of selection method sampling, epsem) (Kish, 1965).

The second term, $\frac{n_{j.ssu}}{n_{k.tsu} \times n_{l.usu}}$, is the component that varies from case to case. In the third sampling stage, we could not use the PPS method to select households because the necessary information was lacking. Instead, we used the SRSWOR method, which makes the sampling probability differ across respondents.

In addition, we used the working population aged 16 and over as the measure of size for the PPS-based sampling. As the working population is highly correlated with, but different from, the employed population, the employment rate acts as an adjustment factor between the two. In other words, we assumed at the sampling stage that the employment rate is constant across sampling units.

Substituting the relevant values into the above equation, we get:

$$p_{m.case} = \frac{60 \times 9 \times 15}{616432389} \times \frac{n_{j.ssu}}{n_{k.tsu} \times n_{l.usu}} = \frac{8100}{616432389} \times \frac{n_{j.ssu}}{n_{k.tsu} \times n_{l.usu}}$$

Therefore, the survey data are in general not a self-weighting sample (equal probability of selection method sampling, epsem), and the sampling probabilities differ across individual cases (Kish, 1965). The average sampling probability of this survey is about 1.3/100,000.

The sampling weight $wgt_{case}$ is the inverse of the aforementioned sampling probability $p_{m.case}$, that is:

$$wgt_{case} = \frac{1}{p_{m.case}} = \frac{616432389}{8100} \times \frac{n_{k.tsu} \times n_{l.usu}}{n_{j.ssu}}$$

Where $n_{l.usu}$ is the number of employed members in the selected household, $n_{k.tsu}$ is the total number of households with at least one employed member in the community (SSU), and $n_{j.ssu}$ is the working population of the j-th SSU. On average, each respondent of the survey represents approximately 55,000 members of the employed urban population.
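The calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the research team's actual procedure: the function and argument names and the example unit sizes are hypothetical, while the constants 60, 9, 15 and 616,432,389 come from the text.

```python
N_TOTAL = 616_432_389        # frame population aged 16 and over (from the text)
DESIGNED_CASES = 60 * 9 * 15  # 8,100 cases in the design (from the text)


def selection_probability(n_ssu, n_households, n_employed):
    """Overall selection probability of one respondent.

    n_ssu        -- working population of the respondent's community (SSU)
    n_households -- registered addresses in the community (proxy for households)
    n_employed   -- employed members aged 16+ at the respondent's address
    """
    constant_term = DESIGNED_CASES / N_TOTAL
    varying_term = n_ssu / (n_households * n_employed)
    return constant_term * varying_term


def design_weight(n_ssu, n_households, n_employed):
    """Design weight: the inverse of the selection probability."""
    return 1.0 / selection_probability(n_ssu, n_households, n_employed)


# The constant term alone is about 1.3 per 100,000.
baseline = DESIGNED_CASES / N_TOTAL
print(f"{baseline * 100_000:.2f} per 100,000")

# Hypothetical respondent: community of 3,000 working people,
# 500 registered addresses, 2 employed members at the address.
print(round(design_weight(3_000, 500, 2)))
```

Keeping the constant and varying terms separate makes it easy to see that differences between respondents' weights come only from community size, the number of registered addresses and household composition.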

However, owing to various errors in the actual survey (especially refusals and falsification), 6,873 individual cases rather than the designed 8,100 were finally retained. To correct the weights for refusals (Saerndal & Lundstrom, 2005), the value 8,100 in the above equation is replaced by 6,873. That is, the weight $wgt_{rsp}$ after correcting for the refusal effect is:

$$wgt_{rsp} = \frac{616432389}{6873} \times \frac{n_{k.tsu} \times n_{l.usu}}{n_{j.ssu}}$$

We recommend that this sampling weight be used when the survey data are employed in descriptive analysis or to answer empirical questions. With this weight, each respondent in the survey represents approximately 65,000 members of the employed urban population. The difference between the two weights also reflects the potential effect of refusals: refusals reduce the sample's coverage of the overall population. The refusal correction weight restores the sample's population coverage to 100%, on the assumption that the populations refusing and accepting interviews are homogeneous in the parameters to be estimated.
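The refusal correction amounts to rescaling every design weight by the ratio of designed to retained cases. A short sketch of that arithmetic, with a hypothetical function name and the case counts taken from the text:

```python
DESIGNED_CASES = 8_100  # cases in the sampling design (from the text)
RETAINED_CASES = 6_873  # cases retained after refusals and falsification (from the text)


def refusal_adjusted_weight(design_weight):
    """Scale a design weight so the retained cases cover the full frame."""
    return design_weight * DESIGNED_CASES / RETAINED_CASES


factor = DESIGNED_CASES / RETAINED_CASES
# Applying the factor to the reported average design weight of about 55,000.
adjusted_average = refusal_adjusted_weight(55_000)
print(round(factor, 4), round(adjusted_average))
```

The factor is roughly 1.18, so an average design weight of about 55,000 becomes roughly 65,000 after the correction.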

(Ⅱ) Raking weight

In PSU and SSU sampling, we used estimated data from the Sixth Population Census as the sampling frame of the survey. However, there is a 7-year gap between the survey and the Sixth Population Census, during which significant changes have taken place not only in demographic size and structure but also in social structure.

First, the pace of urbanization has accelerated. As can be seen from Table 2, the shares of village-level residences classified as main urban areas and as urban-rural fringe areas rose from 7.87% and 3.77% to 10.23% and 4.28%, respectively, a combined increase of 2.87 percentage points. In other words, 2.87% of administrative village-level residences have been reclassified, statistically, from non-urban to urban, partly reflecting the pace of urbanization in China over this period.

Second, there have also been changes in the age and gender of employed population. In the urbanization process, the surplus labour migrates from rural to urban areas. However, the labour migration process is evidently associated with gender and age, which is an academic consensus supported by a large body of empirical research.

Third, changes in demographic structures, such as aging of labour population, improvement in education levels and exit of women from the labour market, have also happened, causing a significant impact on the compositional biases of our sampling frame.

Fourth, there have also been changes in the nature of ownership and industrial structures in the Chinese economy, which may affect the position structure in the labour market and the occupational structure of the population, thereby potentially influencing our sampling frame.

[Table 2: urban-rural classification of village-level residences]

To correct the compositional biases between the survey data and the known population attributes, we used the latest statistics published by NBSC and adjusted the weights using the raking method. At the individual level, our target population was the 424.62 million employed urban population in 2017. We drew on the relevant statistical communiqués published by the NBSC to calibrate six variables: ownership, gender, level of education, age group, employment status (employee, employer, self-employed worker and family assistant) and province.
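Raking (iterative proportional fitting) adjusts the weights one variable at a time until every weighted margin matches its target. The following is a minimal sketch under simplifying assumptions, not the procedure actually run on the survey data: it uses only two variables and four hypothetical records, the function names are illustrative, and the only figures taken from the text are the NBSC shares of 56.1% for men and 33.7% for junior-middle-school graduates.

```python
def rake(records, base_weights, targets, max_iter=100, tol=1e-10):
    """Rake weights so that each variable's weighted totals match its targets.

    records      -- list of dicts mapping variable name to category
    base_weights -- starting weights, one per record
    targets      -- {variable: {category: target total}}
    """
    weights = list(base_weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, target in targets.items():
            # Current weighted total of each category of this variable.
            totals = {}
            for rec, w in zip(records, weights):
                totals[rec[var]] = totals.get(rec[var], 0.0) + w
            # Scale every record so the category totals hit the targets.
            for idx, rec in enumerate(records):
                factor = target[rec[var]] / totals[rec[var]]
                weights[idx] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_margin(records, weights, var):
    """Weighted total of each category of one variable."""
    margin = {}
    for rec, w in zip(records, weights):
        margin[rec[var]] = margin.get(rec[var], 0.0) + w
    return margin


# Hypothetical four-record sample; targets are percentages summing to 100.
records = [
    {"gender": "male", "education": "junior_middle"},
    {"gender": "male", "education": "other"},
    {"gender": "female", "education": "junior_middle"},
    {"gender": "female", "education": "other"},
]
targets = {
    "gender": {"male": 56.1, "female": 43.9},
    "education": {"junior_middle": 33.7, "other": 66.3},
}
weights = rake(records, [1.0] * len(records), targets)
print(weighted_margin(records, weights, "gender"))
print(weighted_margin(records, weights, "education"))
```

In practice the starting weights would be the refusal-corrected sampling weights and all six calibration variables would be raked together; every target category must also appear in the sample, otherwise its total cannot be matched.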

As can be seen from Table 3, the share of the “self-employed” ownership type in the survey data is evidently too high, by about 18%, while the shares of other categories such as “company, state-owned enterprise and foreign-owned enterprise” are clearly too low. The refusal correction weight moves these shares in the right direction, but only to a limited extent.

[Table 3: composition of the employed urban population by ownership type]

In the meantime, the gender composition of the survey data is also evidently imbalanced. Prior to calibration, male respondents accounted for 34.8% of the survey sample, and the refusal correction weight reduced this share further. According to data published by the NBSC, men account for 56.1% of employed workers, clearly more than women (see Table 4). The participation rate of men in the survey is thus significantly lower than that of women.

[Table 4: composition of the employed urban population by gender]

Compared with ownership and gender, the age composition of the survey data is less biased. The share of respondents aged 25 to 29 is higher than that in the NBSC data by more than 11%. Relative to the calibrated data, young people who have just started their careers are the most over-represented group in this survey, while middle-aged and senior respondents aged 45 and over were evidently less likely to take part (see Table 5). The refusal correction weight has almost no effect on the age bias, indicating that refusal behaviour is not associated with age.

[Table 5: composition of the employed urban population by age group]

With respect to education level, the data for junior-middle-school graduates show a relatively large bias. Their share of the sample is only 14.9%, less than half of the 33.7% published by NBSC (see Table 6), which means that this group was far less likely to take part in the survey; from senior middle school upwards, the higher the educational attainment, the greater the probability of taking part. This indicates that the public-interest nature of this survey is an important factor influencing people’s willingness to participate. However, the refusal correction weight has little effect on the compositional bias of education level. This indicates that the refusal rate is a composite measure reflecting not only respondents’ behaviours but also those of interviewers.

[Table 6: composition of the employed urban population by education level]

Surprisingly, “employers” were more likely to take part in the survey than expected. They account for 10.0% of the survey data, whereas their proportion in the NBSC data is only 3.9%. Family assistants, a new category added by NBSC this year, are workers who take part in family business operations without pay. This is inconsistent with the definition of work used in our survey (work that generated pay within the two weeks before the survey). Moreover, the “Others” item of employment status provided in earlier years was not listed in this year's table. Thus, the differences in the “Others” item of Table 7 have more to do with definitions than with bias arising from survey implementation. Generally speaking, the compositional bias of employment status is relatively small in this survey.

[Table 7: composition of the employed urban population by employment status]

The last item whose composition is calibrated is province. It should be noted that the sampling design of this survey only supports inference about the nationwide employed urban population and should not be used for inference at the provincial or lower level. The survey data cover respondents from urban areas of only 23 provinces, direct-controlled municipalities and autonomous regions in China. Restricted by the sampling frame, our sample cannot reflect the provincial distribution of urban jobs across the country. In addition, the geographical distributions of residence and work are not identical. Thus, it is necessary to calibrate the composition of the employed population by province. As can be seen from Table 8, compared with the data released by NBSC, the proportions of employed persons in provinces, direct-controlled municipalities and autonomous regions such as Tianjin, Liaoning, Hubei and Guangxi are relatively high in this survey, while those in Jiangsu and Guangdong, two of the largest provinces in terms of employed workers, are far too low. The direction in which the refusal correction weight shifts the provincial composition is uncertain. In some cases, it moves the composition away from the benchmark. For example, the share of Hebei province rose from 5.2% in the sample to 9.3% after refusal correction, whereas in the calibrated data Hebei accounts for only 3.0% of the employed urban population nationwide. Shanghai exemplifies a shift in the right direction: its share rose from 1.5% in the sample to 2.4% after refusal correction, and its proportion in the calibrated data is 3.6%. As the survey was carried out by different implementing agencies in different regions, the effect of the refusal correction weight depends on the quality of survey implementation maintained by these agencies, which explains why the direction of its influence on the provincial composition is uncertain.

[Table 8: composition of the employed urban population by province]

Through the operations stated above, the survey data is consistent across the above 6 aspects with the data released by NBSC in 2017 in terms of the composition of employed urban population. In the meantime, we expected to reduce the compositional bias of the survey data in other areas through calibrating these 6 aspects. Next, we will assess the calibrated survey data from the perspective of data quality.

Ⅴ. Overall assessment of data quality

Looking from the general perspective of error, every set of survey data contains a variety of errors and biases (Biemer et al., 2017). Although the research team has put in a tremendous quantity of manpower, financial and material resources in monitoring, assessing and correcting various known biases and errors in this survey, it is inevitable that the released data may still contain unrecognized errors and biases. Next, we will assess the quality of the survey data from aspects of its internal reliability, external validity and possible biases, that is, to investigate the measure of the degree to which the survey data reflect the social reality (Biemer & Lyberg, 2003).

(Ⅰ) Internal reliability

The internal reliability of data refers to the stability of various measures. In demography and the social sciences, Myers' composite indicator is often used to measure the reliability of a continuous variable across units of observation (Shryock et al., 1976). Here, we examine respondents’ ages and the telephone numbers they provided in the survey. The former is non-sensitive information, while the latter is relatively sensitive; both are used as indicator variables to test the internal reliability of the survey data.

In this survey, we asked the respondents to leave their telephone or cellphone numbers. The results show that 78.9% of respondents left format-compliant cellphone numbers or, in extremely rare cases, landline telephone numbers. The distribution of their last digits is as follows (see Table 9):

[Table 9: distribution of the last digits of respondents' telephone numbers]

As can be seen, only one last digit, 4, has been evidently avoided by respondents. Its share is 7.5%, clearly lower than those of the other last digits. The Myers’ composite indicator of telephone numbers is 3.0, indicating that only 3.0% of telephone numbers would need a different last digit for the last digits to follow a uniform distribution. This is consistent not only with statistical principles but also with everyday common sense. As such, the survey data are basically reliable with respect to this indicator.

With respect to the last digit of ages, as can be seen from Table 10, the digits 0, 5 and 8 are somewhat over-represented. The estimated Myers’ composite indicator is 15.5, indicating that at least 15.5% of the last digits of respondents’ ages would need to be changed for them to follow a uniform distribution. In earlier surveys conducted by the research team, the Myers’ composite indicators of ages were mostly around 5%. By that historical standard, the recording accuracy of this survey is relatively low; however, measured against the benchmark of 20 as the upper limit of Myers' indicator, the quality of the survey data still falls within the acceptable range.
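The interpretation used above (the share of cases whose last digit would have to change to reach a uniform distribution) corresponds to half the sum of absolute deviations of the ten digit shares from 10%. The sketch below implements only this simplified digit-preference index; it omits the blending step of the full Myers procedure for ages, and the function name and example counts are hypothetical.

```python
def digit_preference_index(counts):
    """Half the sum of absolute deviations of last-digit shares from 10%.

    counts -- ten counts, one for each last digit 0..9.
    Returns a value between 0 (perfectly uniform) and 90 (one digit only).
    """
    if len(counts) != 10:
        raise ValueError("expected ten counts, one per last digit")
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    shares = [100.0 * c / total for c in counts]
    return 0.5 * sum(abs(share - 10.0) for share in shares)


# Hypothetical counts with heaping on the digits 0 and 5.
example_counts = [150, 90, 90, 90, 80, 130, 90, 90, 100, 90]
print(round(digit_preference_index(example_counts), 1))
```

A perfectly uniform distribution gives 0, and a distribution concentrated on a single digit gives 90, which is why values such as 3.0 and 15.5 can be read directly as percentages of cases.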

[Table 10: distribution of the last digits of respondents' ages]

In addition, we analyzed the effects of interviewers, survey approaches and survey institutions on scale structure and regression models. The results show that all of them produce statistically significant effects. However, the magnitude of these effects is of little substantive significance, especially given the relatively large sample size of the survey. Owing to length limitations, we do not present the specific results of the analysis. However, we feel obliged to remind potential users of the data that the effects of interviewer, survey approach, institution and region are intertwined owing to the particular implementation methods used in this survey, and are thus difficult to separate in a single cross-sectional data set.

In general, the survey data is in the upper middle quality range in terms of its internal reliability, and thus can basically reflect the general picture of the researched area. However, further improvement in measurement accuracy is needed.

(Ⅱ) External validity

The external validity of survey data refers to the degree to which its various estimates are consistent with other authoritative statistical data. Here, we selected the number of members of the Communist Party of China (CPC) as an external reference indicator. The CPC is the largest party in the world and the governing party in China; the number and composition of party members released by its organizational department can be considered a highly reliable external indicator, as the information is basically unlikely to be influenced by economic interest or administrative factors like official performances. More importantly, the age criterion for joining the CPC basically overlaps with that of employment. After removing retired persons and elderly persons who have never been employed, it basically covers the entire employed adult population.

After calibration weighting, this survey data shows that the ratio of CPC members among employed urban population is 11.0%. In 2017, the total number of CPC members nationwide was 89.447 million; according to the estimates of this survey data, the 95% confidence interval for the number of CPC members among employed urban population is [30.7, 63.0] million, and the point estimate is 46.9 million, both of which fall within a reasonable estimate range. As stated earlier, we did not use the variable of CPC members when calibrating the survey data. Thus, it can be used as an indicator of external validity.

In terms of age, the number of CPC members aged 30 and below is 13.69 million nationwide. We estimated that the 95% confidence interval for the number of CPC members among the employed urban population aged 30 and below is [9.5, 24.1] million, with a point estimate of 16.8 million, which is higher than the population parameter. Considering the rural hollowing effect and the migration of young workers to cities, the fact that the interval covers the population value is one indication that the survey data are basically reliable.

In terms of gender, the total number of women CPC members across the country is 22.982 million. Among the employed urban women population, we estimated that the 95% confidence interval for CPC members is [11.0, 25.3] million, with a point estimate of 18.1 million, which is close to the population parameter. This interval also covers the population value, although the point estimate is lower than the population parameter, which again indicates that the survey data are basically reliable.

In terms of education level, the total number of CPC members with a college education or above across the country is 41.301 million. Among the employed urban population with a college degree or above, we estimated that the 95% confidence interval for CPC members is [18.8, 36.5] million, with a point estimate of 27.7 million. The interval does not cover the population value, but its upper limit is close to the population parameter.
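The three subgroup comparisons above can be checked mechanically. The sketch below uses the figures quoted in the text (in millions); the dictionary layout and function name are illustrative. Note that the benchmarks are national totals for all CPC members, whereas the intervals refer only to the employed urban population, so coverage is a rough plausibility check rather than a formal test.

```python
# (national benchmark, interval lower bound, interval upper bound), in millions.
comparisons = {
    "aged 30 and below": (13.69, 9.5, 24.1),
    "women": (22.982, 11.0, 25.3),
    "college or above": (41.301, 18.8, 36.5),
}


def covers(benchmark, lower, upper):
    """Return True when the benchmark lies inside the closed interval."""
    return lower <= benchmark <= upper


coverage = {name: covers(*values) for name, values in comparisons.items()}
for name, is_covered in coverage.items():
    print(f"{name}: {'covered' if is_covered else 'not covered'}")
```

Only the college-or-above benchmark falls outside its interval, in line with the discussion of education level.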

Apart from the number of CPC members, we also used the statistical reports released by the Ministry of Human Resources and Social Security (MOHRSS) to check related indicators. For example, as of November 2017, the number of persons covered by the “basic endowment insurance for urban workers” across the country was 398,474,544; according to the estimates from this survey, 202 million employed urban persons are covered by the endowment insurance, with a 95% confidence interval of [153, 251] million. Given that the MOHRSS data include retired persons, who fall outside the inference range of this survey, the estimates of this survey are basically reliable. Similar results were also found for indicators such as medical insurance, unemployment insurance, work-related injury insurance and maternity insurance; detailed discussion of these is omitted.

The above indicators reveal that the survey data have considerably high external validity. As the first sociological survey dedicated to urban work conditions in China, it can serve as a solid cornerstone for theoretical research, hypothesis testing and policy analysis despite the imperfections in its quality.

(Ⅲ) Possible biases

As discussed earlier, due to the joint effect of the difficulty of household accessibility, biases in sampling frame development, refusals and falsification, this survey data is evidently skewed towards the bottom layer of the occupational structure. For example, the proportion of women respondents (65.2%) is evidently higher than that of the men respondents (34.8%), and 37.2% of respondents are self-employed workers.

As a process of collecting information and generating knowledge in public spaces, a social survey is reliant on respondents being candid and open and survey implementers being professional and diligent. This survey data does not cover high-end or confidential workplaces that are inaccessible for social surveys, or workers that have no residential addresses or whose addresses are inaccessible. In a low-trust society, any information disclosure entails high risks. Our survey data is not just a product of such a low-trust society, but also a manifestation of the conditions in this society. Inevitably, it carries the marks of this era and society.

References:

Biemer, Paul P., & Lars Lyberg. 2003. Introduction to Survey Quality. Hoboken, NJ.: Wiley.

Biemer, Paul P., et al. 2017. Total Survey Error in Practice. Hoboken, New Jersey: Wiley.

Kish, Leslie. 1965. Survey Sampling. New York: J. Wiley.

Saerndal, Carl-Erik, & Sixten Lundstrom. 2005. Estimation in Surveys with Nonresponse. Hoboken, NJ: Wiley.

Shryock, Henry S., Jacob S. Siegel, & Edward G. Stockwell. 1976. The Methods and Materials of Demography, Condensed. New York: Academic Press.

Valliant, Richard, Jill A. Dever, & Frauke Kreuter. 2018. Practical Tools for Designing and Weighting Survey Samples, 2nd ed. New York: Springer.

Zhe, Xiaoye, & Yingying Chen. 1995. Studies on Occupational Prestige in Rural China. Social Sciences in China, Issue 6.