The datasets generated from this code have been uploaded to Hugging Face:
[Zexuan/soc_data](https://huggingface.co/datasets/Zexuan/soc_data)
Clone this dataset repository:
git lfs install
git clone https://huggingface.co/datasets/Zexuan/soc_data
<aside> 💡 Data Source
</aside>
The raw data was purchased from RESSET, which can be found at http://www.resset.cn/enebdp.
The data format is:
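| Variable | Format | Variable | Format | Variable | Format |
|---|---|---|---|---|---|
| 招聘主键ID (posting primary key ID) | bigint(20) | 工作薪酬 (salary) | varchar(255) | 工作名称 (job name) | varchar(255) |
| 公司ID (company ID) | bigint(20) | 教育要求 (education requirement) | varchar(255) | 招聘数量 (number of openings) | varchar(255) |
| 公司名称 (company name) | varchar(255) | 工作经历 (work experience) | varchar(255) | 发布日期 (posting date) | datetime |
| 城市名称 (city name) | varchar(255) | 工作描述 (job description) | varchar(255) | 行业名称 (industry name) | varchar(255) |
| 公司所在区域 (company region) | varchar(255) | 职位名称 (position name) | varchar(255) | 来源 (source) | varchar(255) |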
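To make the schema above easier to check against a delivered file, here is a minimal sketch that records the fifteen columns and reports which ones are absent from a CSV header. It assumes the data is available as comma-separated text with a header row, which the source does not state; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration only.

```python
import csv
import io

# Column names and SQL types copied from the table above.
EXPECTED_COLUMNS = {
    "招聘主键ID": "bigint(20)",
    "公司ID": "bigint(20)",
    "公司名称": "varchar(255)",
    "城市名称": "varchar(255)",
    "公司所在区域": "varchar(255)",
    "工作薪酬": "varchar(255)",
    "教育要求": "varchar(255)",
    "工作经历": "varchar(255)",
    "工作描述": "varchar(255)",
    "职位名称": "varchar(255)",
    "工作名称": "varchar(255)",
    "招聘数量": "varchar(255)",
    "发布日期": "datetime",
    "行业名称": "varchar(255)",
    "来源": "varchar(255)",
}


def missing_columns(csv_text: str) -> list[str]:
    """Return the expected columns that do not appear in the CSV header.

    Hypothetical helper; assumes the first row of csv_text is the header.
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

Calling `missing_columns` on the first line of a file returns an empty list when every column in the table is present.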
<aside> 🗣 Data Subset and Initial Screen
</aside>
Notebook: data_subset_screen.ipynb in the lzxlll/Job-Posting repository (main branch)
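The screening steps listed in this section can be sketched roughly as follows. This is not the notebook's code: it uses plain Python dicts instead of whatever the notebook uses, and it assumes that postings with a missing 发布日期 are dropped, that part-time postings are identified by 兼职 appearing in the title, and that such postings are dropped as well. `screen`, `is_blank`, and `PART_TIME` are hypothetical names.

```python
from typing import Optional

PART_TIME = "兼职"  # "part-time"


def is_blank(value: Optional[str]) -> bool:
    """True for None or an empty/whitespace-only string."""
    return value is None or value.strip() == ""


def screen(rows):
    """Apply the screening steps to a list of dict rows.

    Returns (kept_rows, descriptions): the kept rows without 工作描述,
    plus a separate mapping from 招聘主键ID to the description text.
    """
    kept, descriptions = [], {}
    for row in rows:
        # Assumption: postings with a missing 发布日期 are dropped.
        if is_blank(row.get("发布日期")):
            continue
        # Fall back to 职位名称 when 工作名称 is missing.
        title = row.get("工作名称")
        if is_blank(title):
            title = row.get("职位名称")
        # Assumption: postings whose title contains 兼职 are dropped.
        if title and PART_TIME in title:
            continue
        # Keep 工作描述 in its own structure because of its size.
        slim = {k: v for k, v in row.items() if k != "工作描述"}
        slim["工作名称"] = title
        descriptions[row["招聘主键ID"]] = row.get("工作描述")
        kept.append(slim)
    return kept, descriptions
```

Keeping the descriptions in a separate mapping keyed by 招聘主键ID means the two pieces can be joined again later if needed.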
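- 工作名称 (job name), not 职位名称 (position name), is used as the title fed to GPT-3.5-turbo for title labeling.
- Postings where 发布日期 (posting date) is missing are screened out.
- Where 工作名称 is missing, we replace it with 职位名称.
- Postings are screened on the part-time keyword 兼职.
- The 工作描述 (job description) column is split off to form a new dataset due to the size of the file.

<aside> 📢 Classification Data Preparation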
</aside>