Data Sheet

Datasheet inspired by Gebru et al.

1. For what purpose was the dataset created?

Our aim is to develop a resource for gathering information and content about, and provided by, people belonging to under-represented groups. This work is preparatory to the modelling of counter-narratives, namely narratives about the world and society told from the perspective of people who are often marginalized in public debate.

2. Who created the dataset?

The KG was created by the University of Turin, Department of Computer Science. More specifically, the project may be seen as part of the activities of the Hate Speech Monitoring Group, since it is focused on counter-narratives against online discrimination.

3. What do the instances that comprise the dataset represent?

The KG includes writers born from 1808 onwards who have a page on Wikidata, together with all their works recorded on Wikidata, Open Library, Goodreads, and Google Books.
Semantic information about writers (biographical events, emotions, moral values) and about work synopses (emotions, moral values) was extracted through existing NLP tools and methodologies and added to the KG.

4. Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources?

Data were collected from the following sources of knowledge: Wikidata, Wikipedia, Goodreads, Open Library, Google Books.

5. Is the dataset self-contained, or does it link to or otherwise rely on external resources?

Each instance is linked to the source of knowledge from which it was gathered. The PROV Ontology was used to make the provenance of each entity in the KG explicit.
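As an illustration, a provenance link of this kind can be serialized with the PROV vocabulary as follows. This is a minimal sketch: the entity and source IRIs are hypothetical, not the KG's actual identifiers.

```python
# Minimal sketch of attaching provenance to a KG entity with the PROV
# Ontology (prov:wasDerivedFrom). The IRIs below are purely illustrative.

PROV = "http://www.w3.org/ns/prov#"

def provenance_triple(entity_iri: str, source_iri: str) -> str:
    """Serialize one prov:wasDerivedFrom statement in N-Triples syntax."""
    return f"<{entity_iri}> <{PROV}wasDerivedFrom> <{source_iri}> ."

triple = provenance_triple(
    "http://example.org/kg/work/123",        # hypothetical KG entity
    "https://openlibrary.org/works/OL123W",  # hypothetical source record
)
print(triple)
```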

6. Does the dataset contain all possible instances or is it a sample of instances from a larger set?

Instances of type writer were gathered from Wikidata through SPARQL queries. Hence, to be included in the KG, a writer must meet the following requirements:

  • being on Wikidata;
  • having one of the following occupations (P106): writer (Q36180), novelist (Q6625963), or poet (Q49757);
  • being born in 1808 or later, since we consider under-representation a phenomenon linked to decolonization processes.

Only works by such authors were gathered from Open Library, Goodreads, and Google Books.
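The selection criteria above can be sketched as a Wikidata SPARQL query. This is a minimal illustration built from the stated requirements; the actual queries used to build the KG may differ.

```python
# Sketch of a Wikidata SPARQL query implementing the selection criteria
# above (occupation in {writer, novelist, poet}, born in 1808 or later).
# The exact queries used for the KG may differ.

OCCUPATIONS = ["wd:Q36180", "wd:Q6625963", "wd:Q49757"]  # writer, novelist, poet

def build_writer_query(min_birth_year: int = 1808) -> str:
    """Return a SPARQL query selecting writers born from `min_birth_year` on."""
    values = " ".join(OCCUPATIONS)
    return f"""
    SELECT DISTINCT ?writer ?writerLabel ?birth WHERE {{
      VALUES ?occupation {{ {values} }}
      ?writer wdt:P106 ?occupation ;   # occupation (P106)
              wdt:P569 ?birth .        # date of birth (P569)
      FILTER(YEAR(?birth) >= {min_birth_year})
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

query = build_writer_query()
```

The resulting string can then be sent to the public endpoint at https://query.wikidata.org/sparql, e.g. with Python's `urllib.request` or the `requests` library.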

7. Is there a label or target associated with each instance?

Writers are clustered into two groups: Western writers and Transnational writers. The distinction is based on two criteria: the country of birth, which must be a former colony with a medium or lower Human Development Index, and membership of an ethnic minority.
17,368 writers are labeled ‘Transnational’; 176,697 are labeled ‘Western’.

Works are labeled accordingly and have the following distribution:

Source of Knowledge | Western works | Transnational works
Wikidata            |       136,995 |               8,380
Open Library        |       824,378 |              66,050
Goodreads           |       152,468 |              37,680
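Assuming that meeting either criterion is sufficient for the ‘Transnational’ label (an assumption consistent with the false positives discussed under question 9, but not stated explicitly in this datasheet), the rule can be sketched as follows. The country set is a purely illustrative stand-in for the actual HDI-based list.

```python
# Hypothetical sketch of the labeling rule described above. The set of
# former colonies with medium or lower HDI is an illustrative stand-in,
# and the either-criterion-suffices reading is our assumption.

FORMER_COLONY_MED_LOW_HDI = {"Nigeria", "Kenya", "Bangladesh"}  # illustrative

def label_writer(country_of_birth: str, ethnic_minority: bool) -> str:
    """Return 'Transnational' or 'Western' for a writer."""
    if country_of_birth in FORMER_COLONY_MED_LOW_HDI or ethnic_minority:
        return "Transnational"
    return "Western"
```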

8. Are relationships between individual instances made explicit?

Each relation between writers, works, and metadata is made explicit within the KG. For each of them, the following metadata may be present:

  • about the writers: date of birth/death, place of birth/death, country of birth/death, gender, genres, emotions, and moral values extracted from their biographies;
  • about the works: date and place of publication, editions, subjects, setting locations, publishers, genres, awards, ratings (from Goodreads), emotions, and moral values extracted from their synopses.

9. Are there any errors, sources of noise, or redundancies in the dataset?

Our criterion for distinguishing between Western and Transnational writers is prone to some false positives. For instance, there is a subset of writers who were born in former colonies but as part of a dominant minority (e.g., Bernard-Henri Lévy, J.M. Coetzee, Italo Calvino). We chose the label ‘Transnational’ instead of ‘Under-Represented’ to mitigate this issue.

10. Does the dataset identify any sub-populations?

The KG relies on a distinction between sub-populations, since it was developed to investigate the under-representation of non-Western people and people with non-Western origins.

11. Does the dataset contain confidential or sensitive information?

The dataset only includes public personalities with a Wikidata page. All information was gathered from public, accessible resources.

12. How was the data associated with each instance acquired?

Each instance has been collected from authoritative sources of knowledge. Raw text descriptions of instances have been processed with publicly available NLP tools.

13. Over what timeframe was the data collected?

Data were collected between October 2021 and May 2022. An update of all the data is planned within the last quarter of 2023.

14. Was any preprocessing/cleaning/labeling of the data done?

Wikidata, Open Library, and Google Books provide APIs for gathering structured data in JSON format. Data from Goodreads and Wikipedia were scraped and cleaned with Beautiful Soup.
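For the API-based sources, processing amounts to parsing JSON payloads and keeping the relevant fields. The sketch below illustrates this with an abridged record of the shape returned by the Open Library works API (`https://openlibrary.org/works/<id>.json`); the sample data and the selection of fields are ours, and the KG's actual pipeline may differ.

```python
import json

# Abridged, illustrative sample of an Open Library works API payload.
sample_response = json.dumps({
    "title": "Things Fall Apart",
    "subjects": ["Igbo (African people)", "Nigeria", "Fiction"],
    "first_publish_date": "1958",
})

def extract_work_metadata(raw: str) -> dict:
    """Keep only a few illustrative fields from a raw JSON API payload."""
    record = json.loads(raw)
    return {
        "title": record.get("title"),
        "subjects": record.get("subjects", []),
        "published": record.get("first_publish_date"),
    }

work = extract_work_metadata(sample_response)
```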

15. Was the “raw” data saved in addition to the preprocessed/ cleaned/labeled data?

Raw data are not part of the KG, but they are stored in a private SQL table for further experiments. It is worth mentioning that only publicly available data about public personalities are stored.

16. Has the dataset been used for any tasks already?

The dataset has been used for biographical event extraction and annotation tasks, and for emotion and moral value extraction. Each experiment is documented on this website, together with all the bibliographical records about the KG.

17. What (other) tasks could the dataset be used for?

A full list of possible uses of the KG is provided in the Ontology Requirements.

18. Who will be supporting/hosting/maintaining the dataset?

The KG will be hosted and maintained by the University of Turin, Department of Computer Science.

19. How can the owner/curator/manager of the dataset be contacted (for example, email address)?

Any question about the KG may be sent to Marco Antonio Stranisci (marcoantonio.stranisci@unito.it)

20. Will the dataset be updated (for example, to correct labeling errors, add new instances, delete instances)?

Regular updates of the KG are planned. A dump of each version will be available in the section dedicated to the KG, together with a list of errata.

21. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

People interested in contributing to the KG may contact Marco Antonio Stranisci (marcoantonio.stranisci@unito.it) or visit the Github page of the project.