The datasets generated from this code have been uploaded to Hugging Face:
Zexuan/soc_data · Datasets at Hugging Face
Clone this dataset repository:
git lfs install
git clone https://huggingface.co/datasets/Zexuan/soc_data
<aside> 💡 Data Source
</aside>
The raw data was purchased from RESSET, which can be found at http://www.resset.cn/enebdp.
The data format is:
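Variable | Format | Variable | Format | Variable | Format |
---|---|---|---|---|---|
招聘主键ID (posting primary key ID) | bigint(20) | 工作薪酬 (salary) | varchar(255) | 工作名称 (job name) | varchar(255) |
公司ID (company ID) | bigint(20) | 教育要求 (education requirement) | varchar(255) | 招聘数量 (number of openings) | varchar(255) |
公司名称 (company name) | varchar(255) | 工作经历 (work experience) | varchar(255) | 发布日期 (posting date) | datetime |
城市名称 (city name) | varchar(255) | 工作描述 (job description) | varchar(255) | 行业名称 (industry name) | varchar(255) |
公司所在区域 (company region) | varchar(255) | 职位名称 (position name) | varchar(255) | 来源 (source) | varchar(255) |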
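As a rough illustration of the schema above, the sketch below converts one raw record (all values as text) into typed Python values: the two `bigint(20)` columns become `int`, `发布日期` becomes a `datetime`, and every `varchar(255)` column stays a string. The helper name `parse_record` and the date format `%Y-%m-%d %H:%M:%S` are assumptions made for this example; the actual export format of the RESSET data is not stated here.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Column groups taken from the format table; everything else is varchar(255).
INT_COLUMNS = ("招聘主键ID", "公司ID")   # bigint(20)
DATETIME_COLUMNS = ("发布日期",)         # datetime

# Assumption: dates arrive as text in this layout (hypothetical).
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_record(raw: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Convert a raw text record into typed values; blanks become None."""
    parsed: Dict[str, Any] = {}
    for column, value in raw.items():
        if value is None or value.strip() == "":
            parsed[column] = None
        elif column in INT_COLUMNS:
            parsed[column] = int(value)
        elif column in DATETIME_COLUMNS:
            parsed[column] = datetime.strptime(value, DATETIME_FORMAT)
        else:
            parsed[column] = value
    return parsed


# Made-up record, for illustration only.
example = parse_record({
    "招聘主键ID": "1001",
    "公司ID": "42",
    "工作名称": "数据分析师",
    "发布日期": "2021-03-01 09:30:00",
    "招聘数量": "",
})
```

Note that `招聘数量` is stored as `varchar(255)`, so the sketch leaves it as text rather than guessing a numeric encoding.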
<aside> 🗣 Data Subset and Initial Screen
</aside>
Job-Posting/data_subset_screen.ipynb at main · lzxlll/Job-Posting
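The screening rules in this section can be sketched in plain Python as below. This is an illustrative sketch, not the notebook's code, and it rests on several assumptions: postings with a missing `发布日期` are dropped, part-time postings are detected by `兼职` appearing in the title and are dropped, and `工作描述` is moved to a separate table keyed by `招聘主键ID`. The names `is_blank`, `screen_postings`, and `sample_rows` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def is_blank(value: Optional[str]) -> bool:
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or value.strip() == ""


def screen_postings(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (screened postings without 工作描述, separate description rows)."""
    postings: List[Row] = []
    descriptions: List[Row] = []
    for row in rows:
        # Assumed rule: skip postings that have no posting date.
        if is_blank(row.get("发布日期")):
            continue
        # Use 工作名称 as the title; fall back to 职位名称 when it is missing.
        title = row.get("工作名称")
        if is_blank(title):
            title = row.get("职位名称")
        # Assumed rule: skip part-time postings, detected via the title.
        if title is not None and "兼职" in title:
            continue
        kept = {k: v for k, v in row.items() if k != "工作描述"}
        kept["工作名称"] = title
        postings.append(kept)
        # Assumed layout: descriptions live in their own table, keyed by ID.
        descriptions.append(
            {"招聘主键ID": row.get("招聘主键ID"), "工作描述": row.get("工作描述")}
        )
    return postings, descriptions


# Made-up rows, for illustration only.
sample_rows: List[Row] = [
    {"招聘主键ID": "1", "工作名称": "数据分析师", "职位名称": "分析师",
     "发布日期": "2021-03-01", "工作描述": "负责数据分析"},
    {"招聘主键ID": "2", "工作名称": "", "职位名称": "销售经理",
     "发布日期": "2021-03-02", "工作描述": "负责销售"},
    {"招聘主键ID": "3", "工作名称": "兼职客服", "职位名称": "客服",
     "发布日期": "2021-03-03", "工作描述": "在线客服"},
    {"招聘主键ID": "4", "工作名称": "会计", "职位名称": "会计",
     "发布日期": None, "工作描述": "财务工作"},
]

postings, descriptions = screen_postings(sample_rows)
```

With the sample rows, only postings 1 and 2 survive: posting 3 is part-time and posting 4 has no date. Posting 2 takes its title from `职位名称`.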
- We use `工作名称` (job name) as the title (not `职位名称`, position name) to feed GPT-3.5-turbo for title labeling.
- […] `发布日期` (posting date) is missing.
- If `工作名称` is missing, we replace it with `职位名称`.
- […] `兼职` (part-time).
- […] `工作描述` (job description) column to form a new dataset due to the size of the file.

<aside> 📢 Classification Data Preparation
</aside>