Data Cleaning Process

The datasets generated by this code have been uploaded to Hugging Face at https://huggingface.co/datasets/Zexuan/soc_data.

Clone this dataset repository:

git lfs install
git clone https://huggingface.co/datasets/Zexuan/soc_data

<aside> 💡 Data Source

</aside>

The raw data was purchased from RESSET, which can be found at http://www.resset.cn/enebdp.

The data format is:

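Variable	Format	Variable	Format	Variable	Format
招聘主键ID (posting primary key ID)	bigint(20)	工作薪酬 (salary)	varchar(255)	工作名称 (job name)	varchar(255)
公司ID (company ID)	bigint(20)	教育要求 (education requirement)	varchar(255)	招聘数量 (number of openings)	varchar(255)
公司名称 (company name)	varchar(255)	工作经历 (work experience)	varchar(255)	发布日期 (posting date)	datetime
城市名称 (city name)	varchar(255)	工作描述 (job description)	varchar(255)	行业名称 (industry name)	varchar(255)
公司所在区域 (company region)	varchar(255)	职位名称 (position title)	varchar(255)	来源 (source)	varchar(255)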

<aside> 🗣 Data Subset and Initial Screen

</aside>

The job posting data we obtained is a single .dat file of more than 156 GB. To make the data manageable, we first subset and screen it.

The key steps are:
1. Split the large .dat file into 142 smaller files.
2. Use 工作名称 (job name), not 职位名称 (position title), as the job title that is fed to GPT-3.5-turbo for title labeling.
3. Drop postings whose 发布日期 (posting date) is missing.
4. When 工作名称 is missing, replace it with 职位名称.
5. Drop postings whose job title is 兼职 (part-time).
6. Drop postings that share the same '公司ID' (company ID), '工作名称', and '城市名称' (city name) within a month. We treat these as duplicates, that is, the same job posting published multiple times.
7. Keep only postings from the 10 major job posting websites, that is, those whose '来源' (source) is one of '智联招聘', '前程无忧', '拉勾网', 'BOSS直聘', '58同城', '猎聘网', '看准网', '百姓网', '拉勾网', '赶集网'. The code that determines the 10 largest job posting platforms is attached at the end of the code.
8. Because of the file size, move the 工作描述 (job description) column into a separate dataset.
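
Steps 3 to 8 can be sketched in pandas as follows. This is a minimal illustration, not the code used to build the released datasets. The function names `screen_postings` and `top_sources` are hypothetical, and the sketch assumes that "within a month" means the same calendar month, that the 兼职 filter is an exact match on the title, and that the separated 工作描述 table is keyed by 招聘主键ID. It also assumes each split has already been loaded into a DataFrame that uses the column names from the data format table.

```python
import pandas as pd

# Platform names copied from step 7. 拉勾网 appears twice in that list,
# so this set holds nine distinct names.
MAJOR_SOURCES = {
    "智联招聘", "前程无忧", "拉勾网", "BOSS直聘", "58同城",
    "猎聘网", "看准网", "百姓网", "赶集网",
}


def top_sources(df: pd.DataFrame, n: int = 10) -> list[str]:
    """Return the n most frequent values of 来源 (hypothetical helper)."""
    return df["来源"].value_counts().head(n).index.tolist()


def screen_postings(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Apply steps 3-8 to one split and return (postings, descriptions)."""
    out = df.copy()

    # Parse dates; unparseable values become NaT and are dropped in step 3.
    out["发布日期"] = pd.to_datetime(out["发布日期"], errors="coerce")

    # Step 4: fall back to 职位名称 when 工作名称 is missing.
    out["工作名称"] = out["工作名称"].fillna(out["职位名称"])

    # Step 3: drop postings without a posting date.
    out = out.dropna(subset=["发布日期"])

    # Step 5: drop postings whose title is exactly 兼职 (assumed exact match).
    out = out[out["工作名称"] != "兼职"]

    # Step 6: keep the earliest posting per company, title, city and
    # calendar month (assumed meaning of "within a month").
    out = out.sort_values("发布日期")
    dup = out.assign(_month=out["发布日期"].dt.to_period("M")).duplicated(
        subset=["公司ID", "工作名称", "城市名称", "_month"], keep="first"
    )
    out = out[~dup]

    # Step 7: keep only the major job posting websites.
    out = out[out["来源"].isin(MAJOR_SOURCES)]

    # Step 8: move the long 工作描述 text into its own table,
    # keyed by 招聘主键ID (assumed join key).
    descriptions = out[["招聘主键ID", "工作描述"]].reset_index(drop=True)
    postings = out.drop(columns=["工作描述"]).reset_index(drop=True)
    return postings, descriptions
```

When this is applied to each split file separately, deduplication only holds within that split; removing duplicates that straddle two splits would need a second pass over the combined keys.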

<aside> 📢 Classification Data Preparation

</aside>