😜 This page is a work in progress and will be updated frequently
Feel free to contact me with any inquiries or questions: 📧 zxliu at nau.edu.cn
Job Posting:
Data Handling Page
We start with the data handling process:
Data Cleaning Process
- Summary of Data Cleaning:
In this section, we outline the data generation and cleaning process for the job posting data used in our analysis. The primary data consists of job postings collected from 10 major job posting websites in China. The data is cleaned and processed to remove duplicates, unnecessary entries, and inconsistencies to ensure reliability and accuracy in our analysis. A pandas sketch of these steps appears after the list below.
- Data Import and Preliminary Cleaning:
The raw data files are imported as pandas DataFrames using the 'read_dat_data' function. This function reads the '.dat' files and assigns appropriate column names to the DataFrames.
- Data Preprocessing:
- Missing values in the '工作名称' (job title) column are replaced with values from the '职位名称' (position name) column.
- Rows with missing '发布日期' (publish date) are dropped.
- Entries with '兼职' (part-time) in the job title are dropped.
- The dataset is filtered to only include data from 10 major job posting websites in China.
- Duplicate entries within a month, having the same '公司ID' (company ID), '工作名称' (job title), and '城市名称' (city name) are dropped.
- Data Separation:
After the preprocessing, the '工作描述' (job description) column is separated from the rest of the data to make the dataset more manageable.
- Data Aggregation:
The cleaned and preprocessed data is then saved as separate CSV files. All the CSV files are combined into a single DataFrame, containing only the '工作名称' (job title) column. This final DataFrame is saved as 'charac_posting.csv', which is used for further analysis.
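Below is a minimal pandas sketch of the cleaning steps above. The file locations, the column layout assigned by 'read_dat_data', the website list, and the output file names are placeholders, not the original implementation.

```python
import glob
import pandas as pd

MAJOR_SITES = ["site_a", "site_b"]  # placeholder for the 10 major job posting websites

def read_dat_data(path):
    """Read one '.dat' file and assign column names (names here are illustrative)."""
    cols = ["公司ID", "工作名称", "职位名称", "发布日期", "城市名称", "网站名称", "工作描述"]
    return pd.read_csv(path, sep="\t", header=None, names=cols, dtype=str)

frames = []
for path in glob.glob("raw/*.dat"):                    # hypothetical raw-data location
    df = read_dat_data(path)

    # Fill missing job titles with the position name, then drop rows without a publish date
    df["工作名称"] = df["工作名称"].fillna(df["职位名称"])
    df = df.dropna(subset=["发布日期"])

    # Drop part-time postings and keep only the major websites
    df = df[~df["工作名称"].str.contains("兼职", na=False)]
    df = df[df["网站名称"].isin(MAJOR_SITES)]

    # Drop duplicates within a month on company ID, job title, and city name
    df["月份"] = pd.to_datetime(df["发布日期"], errors="coerce").dt.to_period("M")
    df = df.drop_duplicates(subset=["月份", "公司ID", "工作名称", "城市名称"])

    # Separate the job description from the rest of the data and save both parts
    df[["工作名称", "工作描述"]].to_csv(path.replace(".dat", "_desc.csv"), index=False)
    clean = df.drop(columns=["工作描述"])
    clean.to_csv(path.replace(".dat", "_clean.csv"), index=False)
    frames.append(clean)

# Combine the job-title column across all files into 'charac_posting.csv' for further analysis
pd.concat(frames)[["工作名称"]].to_csv("charac_posting.csv", index=False)
```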
- Summary of Data Processing:
In this study, we aim to analyze job posting data and classify postings into Standard Occupational Classification (SOC) categories. The data generation and cleaning process is as follows:
- Read the data:
Load the raw job posting data from a CSV file into a pandas DataFrame, 'df'. Only the '工作名称' (job title) column is needed for this step.
- Filter out rare job titles:
Count the occurrences of each unique job title in the DataFrame and filter out those that occur fewer than 5 times, which leaves us with 883,695 unique job titles across 50,340,840 job postings.
- Parallelize job title classification:
Split the DataFrame containing the filtered job titles into 300 smaller DataFrames and save each sub-DataFrame as a CSV file (see the sketch after this list).
- Classify job titles using ChatGPT:
Define a function, 'classify_job_title', that takes a job title and an API key and returns the most likely Standard Occupational Classification (SOC) code for that job title. The function sends a request to the ChatGPT API using the provided API key and extracts the SOC code from the API response (a sketch of this function, together with the thread-pool driver from the next step, appears after this list).
- Run classification in parallel:
Using Python's ThreadPoolExecutor, create a pool of 30 worker threads to classify the job titles concurrently. Read each sub-DataFrame containing job titles from the CSV files, and submit the job titles to the thread pool for classification. Append the resulting SOC codes to the DataFrame and save it as a new CSV file.
- Merge the classified titles:
Append all the classified titles' CSV files to create a single DataFrame, 'df_title'. Keep only the '工作名称' and 'soc_code' columns, and drop any rows with missing SOC codes.
- Map unmapped job postings using job descriptions:
Merge the DataFrame containing the SOC codes with a DataFrame of job descriptions, joining on the job title. The descriptions will later be used to map the remaining unmapped job postings.
- Load ONET SOC job titles:
Load a DataFrame containing all possible SOC job titles from the ONET dataset. This helps remove incorrect mappings.
- Filter out rare SOC codes:
Filter out broad occupations with fewer than 100 job postings to ensure that there is enough data to train a good model. We end up with 408 broad occupations.
- Randomly sample job postings:
Randomly sample 3,000 job postings within each broad occupation and save them to a CSV file. Feed this data to ChatGPT to map the job postings to SOC categories using job descriptions.
- Verify labeling based on job descriptions:
Define a function, 'classify_job_desp', that takes a job description, job title, and API key, and returns whether the given SOC code is a reasonable classification based on the job description. The function sends a request to the ChatGPT API using the provided API key and returns a yes or no answer (a sketch appears after this list).
- Double-check the sub-sampled dataset:
Parallelize the job description classification using ThreadPoolExecutor, and append the true/false indicator to the DataFrame. Save the resulting DataFrame as a new CSV file.
- Generate the final dataset for model fine-tuning:
The subset of postings that passed the second check (classification using job descriptions) is used for model fine-tuning.
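For reference, a sketch of the rare-title filter and the 300-way split described above; the file names are illustrative.

```python
import numpy as np
import pandas as pd

# Count occurrences of each unique job title and keep titles appearing at least 5 times
df = pd.read_csv("charac_posting.csv", usecols=["工作名称"])
counts = df["工作名称"].value_counts()
titles = counts[counts >= 5].index.to_frame(index=False, name="工作名称")

# Split the filtered titles into 300 smaller DataFrames and save each as a CSV file
for i, chunk in enumerate(np.array_split(titles, 300)):
    chunk.to_csv(f"titles_part_{i:03d}.csv", index=False)
```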
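A hedged sketch of 'classify_job_title' and the 30-thread driver, written against the legacy (pre-1.0) openai Python package. The prompt wording, model choice, SOC-code parsing, and file names are assumptions rather than the original implementation.

```python
import re
from concurrent.futures import ThreadPoolExecutor

import openai
import pandas as pd

API_KEY = "sk-..."  # placeholder; the real key is supplied separately

def classify_job_title(job_title, api_key, model="gpt-3.5-turbo"):
    """Ask the ChatGPT API for the most likely SOC code of a Chinese job title.

    The prompt and response parsing below are illustrative only.
    """
    openai.api_key = api_key
    response = openai.ChatCompletion.create(   # legacy (pre-1.0) openai interface
        model=model,
        messages=[{
            "role": "user",
            "content": ("Return only the most likely SOC code (format XX-XXXX) "
                        f"for this Chinese job title: {job_title}"),
        }],
        temperature=0,
    )
    text = response["choices"][0]["message"]["content"]
    match = re.search(r"\d{2}-\d{4}", text)    # SOC codes look like 15-1252
    return match.group(0) if match else None

# Classify one sub-file with a pool of 30 worker threads
titles = pd.read_csv("titles_part_000.csv")    # illustrative sub-file name
with ThreadPoolExecutor(max_workers=30) as pool:
    soc_codes = list(pool.map(lambda t: classify_job_title(t, API_KEY),
                              titles["工作名称"]))
titles["soc_code"] = soc_codes
titles.to_csv("titles_part_000_soc.csv", index=False)
```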
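Similarly, a sketch of the description-based double check 'classify_job_desp'. The prompt is an assumption, and the SOC code is passed explicitly here for clarity even though the original signature lists only the description, title, and API key.

```python
import openai

def classify_job_desp(job_desp, job_title, api_key, soc_code, model="gpt-3.5-turbo"):
    """Ask the ChatGPT API whether the given SOC code fits the job description.

    Passing the SOC code explicitly and the prompt wording are assumptions.
    """
    openai.api_key = api_key
    response = openai.ChatCompletion.create(   # legacy (pre-1.0) openai interface
        model=model,
        messages=[{
            "role": "user",
            "content": (f"Job title: {job_title}\n"
                        f"Job description: {job_desp}\n"
                        f"Is SOC code {soc_code} a reasonable classification? "
                        "Answer Yes or No only."),
        }],
        temperature=0,
    )
    answer = response["choices"][0]["message"]["content"].strip().lower()
    return answer.startswith("yes")            # becomes the true/false indicator ('true_ind')
```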
Model Training Page
This page records the model training process:
Model Training
- Summary of Fine-tuning Chinese BERT-wwm:
We fine-tune a pre-trained Chinese BERT-wwm model with the transformers library to classify job postings into Standard Occupational Classification (SOC) codes. The code is broken down into the steps below, each with a short explanation; illustrative sketches of the key steps follow the list.
- Import necessary libraries and load data: First, we import necessary libraries such as pandas, torch, and transformers, and read the data from a CSV file. The data contains columns such as 'soc_code', 'true_ind', '工作名称' (job title), and '工作描述' (job description).
- Preprocessing the data: The 'soc_code' column has '-' symbols that need to be removed, and the column is then converted to integers. We also replace 'Yes' with True and NaN with False in the 'true_ind' column, which indicates whether a label is more reliable or not. This information will later be used to assign different weights to samples during model training.
- Splitting the data: We split the data into training, validation, and testing sets in a 60-20-20 ratio using the train_test_split function from the scikit-learn library.
- Creating a new column 'soc_code1': We create a new column 'soc_code1' to map the unique SOC codes to sequential integer labels. This is done to facilitate the classification task and make it easier for the model to learn patterns in the data.
- Tokenization: The job titles and descriptions are tokenized using the BERT tokenizer, which converts text into a format that the BERT model can understand.
- Assigning weights to samples: We use the 'true_ind' column to assign different weights to samples during model training. More reliable samples will have a higher weight, making the model focus more on learning from them.
- Creating a custom PyTorch Dataset: We create a custom PyTorch Dataset class named JobPostingDataset to store the tokenized text, labels, and weights. This class will be used to create data loaders for efficient training, validation, and testing.
- Preparing data loaders: Data loaders are created using the custom JobPostingDataset class. They facilitate efficient loading of data in batches during model training, validation, and testing.
- Computing class weights: To handle class imbalance, we compute class weights using the training data and pass them to the CrossEntropyLoss criterion. This ensures that the model pays more attention to minority classes during training.
- Defining the model, optimizer, and learning rate scheduler: We use the pre-trained BERT model and fine-tune it for our classification task. The number of output labels is set to the number of unique 'soc_code1' values. We use the AdamW optimizer and a learning rate scheduler with a warmup period for training.
- Training the model with early stopping: We train the model using a training loop that incorporates early stopping based on the validation loss. If the validation loss does not improve for a certain number of consecutive epochs (specified by the early_stopping_patience variable), the training will be stopped to prevent overfitting.
- Evaluating the model: The model's performance is evaluated on the validation set using a custom evaluation function that computes the validation loss. The best model is saved based on the lowest validation loss.
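A sketch of the data-preparation steps above (preprocessing, splitting, label mapping, tokenization, sample weights, and the custom 'JobPostingDataset'). The input file name, the 'hfl/chinese-bert-wwm-ext' checkpoint, the maximum sequence length, the batch size, and the 2:1 reliability weighting are assumptions.

```python
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer

df = pd.read_csv("finetune_data.csv")                        # illustrative file name
df["soc_code"] = df["soc_code"].str.replace("-", "", regex=False).astype(int)
df["true_ind"] = df["true_ind"].eq("Yes")                    # 'Yes' -> True, NaN -> False

# 60-20-20 train/validation/test split
train_df, tmp_df = train_test_split(df, test_size=0.4, random_state=42)
val_df, test_df = train_test_split(tmp_df, test_size=0.5, random_state=42)

# Map the unique SOC codes to sequential integer labels ('soc_code1')
label_map = {code: i for i, code in enumerate(sorted(df["soc_code"].unique()))}
for part in (train_df, val_df, test_df):
    part["soc_code1"] = part["soc_code"].map(label_map)

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")  # assumed checkpoint

class JobPostingDataset(Dataset):
    """Stores tokenized title + description, the label, and a per-sample weight."""

    def __init__(self, frame, max_len=256):                  # max_len is an assumption
        # Tokenize everything up front for simplicity
        text = (frame["工作名称"].fillna("") + " " + frame["工作描述"].fillna("")).tolist()
        self.enc = tokenizer(text, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(frame["soc_code1"].tolist())
        # More reliable samples (true_ind == True) get a higher weight; 2:1 is an assumption
        self.weights = torch.tensor([2.0 if t else 1.0 for t in frame["true_ind"]])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.enc.items()}
        item["labels"] = self.labels[idx]
        item["weight"] = self.weights[idx]
        return item

train_loader = DataLoader(JobPostingDataset(train_df), batch_size=32, shuffle=True)
val_loader = DataLoader(JobPostingDataset(val_df), batch_size=32)
test_loader = DataLoader(JobPostingDataset(test_df), batch_size=32)
```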
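Continuing the previous sketch, the class-weighted loss, model, optimizer, and warmup scheduler. The learning rate, warmup fraction, and epoch count are assumptions.

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_labels = len(label_map)

# Class weights from the training labels to counter class imbalance
# (assumes every SOC class appears at least once in the training split)
class_weights = compute_class_weight("balanced", classes=np.arange(num_labels),
                                     y=train_df["soc_code1"].values)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)
# reduction="none" so the per-sample reliability weights can be applied later
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, reduction="none")

model = BertForSequenceClassification.from_pretrained(
    "hfl/chinese-bert-wwm-ext", num_labels=num_labels).to(device)

epochs = 10                                            # assumption
optimizer = AdamW(model.parameters(), lr=2e-5)         # learning rate is an assumption
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),           # 10% warmup is an assumption
    num_training_steps=total_steps)
```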
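Finally, a sketch of the training loop with early stopping and the validation routine, reusing the objects defined in the previous sketches. The patience value and checkpoint path are assumptions.

```python
def evaluate(model, loader):
    """Average weighted cross-entropy loss over a data loader."""
    model.eval()
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for batch in loader:
            labels = batch["labels"].to(device)
            weights = batch["weight"].to(device)
            inputs = {k: v.to(device) for k, v in batch.items()
                      if k in ("input_ids", "attention_mask", "token_type_ids")}
            logits = model(**inputs).logits
            loss = (criterion(logits, labels) * weights).mean()
            total_loss += loss.item() * labels.size(0)
            n += labels.size(0)
    return total_loss / n

early_stopping_patience = 3                            # assumption
best_val_loss, epochs_without_improvement = float("inf"), 0

for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        labels = batch["labels"].to(device)
        weights = batch["weight"].to(device)
        inputs = {k: v.to(device) for k, v in batch.items()
                  if k in ("input_ids", "attention_mask", "token_type_ids")}
        logits = model(**inputs).logits
        # Per-sample loss scaled by the reliability weight, then averaged
        loss = (criterion(logits, labels) * weights).mean()
        loss.backward()
        optimizer.step()
        scheduler.step()

    val_loss = evaluate(model, val_loader)
    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
        model.save_pretrained("best_model")            # keep the lowest-validation-loss checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= early_stopping_patience:
            break                                      # early stopping
```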
Model Evaluation Page
This page evaluates the performance of the fine-tuned model: