Data publishing basics in CKAN¶
CKAN structures published data around a handful of core concepts: datasets, resources, organizations, groups, and tags. Understanding these components is the foundation for organizing and sharing data effectively in CKAN.
Learn more in the CKAN documentation.
Dataset¶
In CKAN, data is published in units called “datasets”. A dataset is a parcel of data - for example, it could be the crime statistics for a region, the spending figures for a government department, or temperature readings from various weather stations. When users search for data, the search results they see will be individual datasets.
A dataset contains two things:
- Information or “metadata” about the data. For example, the title and publisher, date, what formats it is available in, what license it is released under, etc.
- Resources: Any number of resources can be attached to this dataset. For example, different resources might contain the data for different years, or they might contain the same data in different formats.
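As a rough illustration of these two parts, a dataset can be pictured as a dictionary that holds metadata fields plus a list of resources. The sketch below follows the general shape of a CKAN package dictionary; all values (names, URLs, the organization) are hypothetical, and `available_formats` is a helper invented for this example.

```python
# Illustrative dataset in the general shape of a CKAN package dictionary.
# All values (names, URLs, organization) are hypothetical.
dataset = {
    # Metadata about the data
    "name": "population-brasil",  # URL path of the dataset
    "title": "Population of Brasil",
    "notes": "Population counts per municipality.",
    "license_id": "cc-by",
    "owner_org": "example-organization",
    "version": "1.0.0",
    # Any number of resources holding the data itself
    "resources": [
        {
            "name": "Population 2022",
            "format": "CSV",
            "url": "https://example.org/data/population-2022.csv",
        },
        {
            "name": "Population 2023",
            "format": "CSV",
            "url": "https://example.org/data/population-2023.csv",
        },
        {
            "name": "Population 2023",
            "format": "XLSX",
            "url": "https://example.org/data/population-2023.xlsx",
        },
    ],
}


def available_formats(dataset: dict) -> list[str]:
    """Return the sorted, de-duplicated formats across all resources."""
    return sorted({resource["format"] for resource in dataset["resources"]})


print(available_formats(dataset))  # ['CSV', 'XLSX']
```

Here the resources cover both cases mentioned above: different years of the same data, and the same year in different formats.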
Dataset Versioning¶
In CKAN every dataset has a metadata field "version".
In TUMI, datasets are versioned as follows:
- Version numbers comply with Semantic Versioning 2.0.0
- A freshly published dataset gets the version number "1.0.0"
- Data corrections get a patch version update, e.g. 1.0.1
- Data updates or additions that do not change data structures get a minor version update, e.g. 1.1.0
- If data structures change, the major version number increases, e.g. 2.0.0
- To keep the dataset URL stable, the name (URL path) of the latest dataset should stay the same when data is added, e.g. /population-brasil and not /population-brasil-2023
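The version rules above can be sketched as a small helper. This is a minimal sketch, not part of CKAN or any TUMI tooling: the function name `bump_version` and the change labels `"correction"`, `"update"` and `"structure"` are assumptions made for this example, and pre-release or build suffixes allowed by Semantic Versioning are not handled.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version for a given kind of change.

    change is one of (hypothetical labels):
      "correction" - data corrections        -> patch update
      "update"     - new data, same structure -> minor update
      "structure"  - data structures changed  -> major update
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "correction":
        return f"{major}.{minor}.{patch + 1}"
    if change == "update":
        return f"{major}.{minor + 1}.0"
    if change == "structure":
        return f"{major + 1}.0.0"
    raise ValueError(f"Unknown change type: {change!r}")


print(bump_version("1.0.0", "correction"))  # 1.0.1
print(bump_version("1.0.1", "update"))      # 1.1.0
print(bump_version("1.1.0", "structure"))   # 2.0.0
```

Only the "version" metadata field changes in each case; the dataset name, and therefore its URL, stays the same.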
Resource¶
A “resource” holds the data itself.
CKAN does not mind what format the data is in. A resource can be:
- a CSV or Excel spreadsheet
- an XML file
- a PDF document
- an image file (jpg, png, ...)
- linked data in RDF format
- many other types of data
CKAN can store the resource internally, or store only a link to a resource that is hosted elsewhere on the web.
Data published as Open Data can be rated according to the 5-star deployment scheme. For data publishers, that means: the more stars, the better the data.
Organization¶
Normally, each dataset is owned by exactly one “organization”.
A CKAN instance can have any number of organizations. For example, if CKAN is being used as a data portal by a national government, the organizations might be different government departments, each of which publishes data. Each organization can have its own workflow and authorizations, allowing it to manage its own publishing process.
Group¶
In CKAN, groups are about curation - collecting datasets together into groups. If a user is an editor or admin of a group, then they can add datasets (that already exist on the site) into their group and can remove datasets from their group, but they cannot necessarily add new datasets to the site, or edit the datasets that are in their group.
Groups are meant to be used by the community of users of a site (the people consuming the data, not the people publishing the datasets) to collect related datasets together into themes such as "climate".
DCAT-AP provides some standardized groups:
- Agriculture, fisheries, forestry and food (AGRI)
- Economy and finance (ECON)
- Education, culture and sport (EDUC)
- Energy (ENER)
- Environment (ENVI)
- Government and public sector (GOVE)
- Health (HEAL)
- International issues (INTR)
- Justice, legal system and public safety (JUST)
- Regions and cities (REGI)
- Population and society (SOCI)
- Science and technology (TECH)
- Transport (TRAN)
TUMI datahubs provide some additional mobility groups:
- Environmental Data (environmental_data)
- Individual Transport (individual_transport)
- Planning Policy (planning_policy)
- Public Transport (public_transport)
- Raw Mobility Data (raw_mobility_data)
- Traffic Generating Parameters (traffic_generating_parameters)
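In a CKAN package dictionary, group membership is expressed as a list of group references under the "groups" key. The sketch below builds such a list for the TUMI mobility groups, under the assumption that the identifiers in parentheses above are the group names used on the site; `group_refs` is a helper invented for this example.

```python
# Assumption: the identifiers listed above are the CKAN group names.
TUMI_MOBILITY_GROUPS = {
    "environmental_data",
    "individual_transport",
    "planning_policy",
    "public_transport",
    "raw_mobility_data",
    "traffic_generating_parameters",
}


def group_refs(group_names: list[str]) -> list[dict]:
    """Build the "groups" list for a dataset, rejecting unknown names."""
    unknown = sorted(set(group_names) - TUMI_MOBILITY_GROUPS)
    if unknown:
        raise ValueError(f"Unknown group(s): {', '.join(unknown)}")
    return [{"name": name} for name in group_names]


print(group_refs(["public_transport", "planning_policy"]))
# [{'name': 'public_transport'}, {'name': 'planning_policy'}]
```

Checking names before submitting a dataset catches typos early, since a group must already exist on the site before datasets can be added to it.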
Tag¶
Tags can be attached to datasets in order to group them thematically. Tags are defined by the data publisher when creating or updating a dataset.
A CKAN tag can be either
- a normal tag
- a tag out of a tag vocabulary
For details about tag vocabularies, see the CKAN documentation.
Tag vocabularies can only be created by a sysadmin.
In TUMI we don't use tag vocabularies right now.
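Since only normal tags are in use, a dataset's tags can be pictured as a plain list of name entries under the "tags" key of the package dictionary (a tag from a vocabulary would additionally carry a "vocabulary_id"). A minimal sketch, with `free_tags` as a hypothetical helper that trims whitespace and drops blanks and exact duplicates:

```python
def free_tags(labels: list[str]) -> list[dict]:
    """Turn labels into normal (free) tag entries, keeping first occurrences."""
    seen = set()
    tags = []
    for label in labels:
        name = label.strip()
        if name and name not in seen:
            seen.add(name)
            tags.append({"name": name})
    return tags


print(free_tags([" climate ", "bus", "climate", ""]))
# [{'name': 'climate'}, {'name': 'bus'}]
```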