Posts, Users, Votes, Comments, Badges, PostHistory, and PostLinks that has occurred on Stack Overflow.
You can either download pre-prepared Parquet versions of the data, containing every post up to April 2024, or download the latest data in XML format and load this. Stack Overflow provide updates to this data periodically - historically every 3 months.
The following diagram shows the schema for the available tables assuming Parquet format.
A description of the schema of this data can be found here.
Pre-prepared data
We provide a copy of this data in Parquet format, up to date as of April 2024. While small for ClickHouse with respect to the number of rows (60 million posts), this dataset contains significant volumes of text and large String columns.eu-west-2. The dataset is located in eu-west-3.
Posts
Votes
Comments
Users
Badges
PostLinks
PostHistory
Original dataset
The original dataset is available in compressed (7zip) XML format at https://archive.org/download/stackexchange - files with prefixstackoverflow.com*.
Download
Convert to JSON
At the time of writing, ClickHouse doesn’t have native support for XML as an input format. To load the data into ClickHouse we first convert to NDJSON. To convert XML to JSON we recommend thexq linux tool, a simple jq wrapper for XML documents.
Install xq and jq:
stackoverflow.com-Posts.7z file as an example. Modify as required.
Extract the file using p7zip. This will produce a single xml file - in this case Posts.xml.
Files are compressed approximately 4.5x. At 22GB compressed, the posts file requires around 97G uncompressed.
posts.json file.
Load into ClickHouse with the following command. Note the schema is specified for the posts.json file. This will need to be adjusted per data type to align with the target table.
Example queries
A few simple questions to you get started.Most popular tags on Stack Overflow
User with the most answers (active accounts)
Account requires aUserId.
ClickHouse related posts with the most views
Most controversial posts
Attribution
We thank Stack Overflow for providing this data under thecc-by-sa 4.0 license, acknowledging their efforts and the original source of the data at https://archive.org/details/stackexchange.