Dataset to find effectiveness of using crowd intelligence for generation of patent clusters

  • Gokula Vijayumar Annamalai Vasantha (Creator)
  • Andrew Wodehouse (Creator)
  • Jonathan Corney (Contributor)
  • Ross MacLachlan (Contributor)
  • Ananda Prasanna Jagadeesan (Contributor)



This dataset contains the data collected and analysed to demonstrate the advantages of using crowd intelligence to generate patent clusters effectively, at lower cost and with greater rationale. The dataset is structured around the two major tasks undertaken: patent clustering, and ranking patent clusters against the given design problem. Please read our journal paper “Crowd-generated patent clusters in relation to the algorithm and expert approaches” before exploring this dataset.

The file names are numbered to indicate the order in which to explore this dataset. The uploaded documents are described below:

- Documents 1 and 2 provide all information collected from crowdsourcing in Crowdflower and mTurk respectively. Please note that these documents contain both accepted and rejected results.

- Documents 3 and 4 provide the results that were accepted for further analysis from Crowdflower and mTurk respectively.

- Document 5 contains the number of clusters and the number of patent pairs for every approach. The second sheet contains a graph of the number of clusters against the number of patent pairs.

- Document 6 contains a matrix of patent pairs against crowd workers. A patent pair is marked ‘1’ if the worker linked it, otherwise ‘0’. The second sheet contains a graph of the number of crowd workers in agreement against the frequency of linked patent pairs. The third sheet contains a graph comparing the different approaches in terms of the number of workers in pair agreement and the percentage of pair agreement.

- Document 7 contains all cluster labels generated by the various approaches. Different colours are used to differentiate between approaches. The number ‘1’ indicates that a label is present in a particular approach. This number may be greater than one, in which case it represents the number of times the label was mentioned by the experts or the crowd.

- Document 8 contains the patent relevance rankings given by each crowd worker. The first and second sheets give the relevance rankings for mTurk and Crowdflower workers respectively. The aggregated rankings are provided on the extreme right of both sheets. The relevance rankings derived from the aggregated results are shown in red. The third sheet provides a summary of these aggregated results. The last sheet provides a graph of patent ranking against the number of patents for the crowd approach and Fu’s algorithm approach.

- Document 9 contains the evaluation of crowd responses by three evaluators. For each patent pair, each evaluator’s score is recorded as ‘1’ if they agree and ‘0’ if they do not. These data were used to illustrate the relationship between the number of evaluators in agreement and the number of crowd workers who agreed on a patent pair.

- Document 10 details all the labels chosen by the three evaluators from the crowd responses. The data show the evaluators’ agreement in four categories: both patent pairs and labels match, only patent pairs match, no patent pairs match, and extra labels generated.
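
The worker-agreement counts described for Document 6 can be sketched as follows. This is a minimal illustration only, assuming a small made-up matrix: the patent identifiers, the worker columns, and the function names are hypothetical and are not taken from the dataset files. Pairs that no worker linked are left out of the frequency count, which is also an assumption.

```python
from collections import Counter

# Hypothetical toy matrix: each key is a patent pair, each list holds one
# entry per crowd worker (1 = the worker linked the pair, 0 = did not).
pair_matrix = {
    ("P1", "P2"): [1, 1, 0, 1],
    ("P1", "P3"): [0, 0, 0, 1],
    ("P2", "P3"): [1, 0, 0, 0],
    ("P3", "P4"): [0, 0, 0, 0],
}


def workers_in_agreement(matrix):
    """Return, for each patent pair, how many workers linked it."""
    return {pair: sum(row) for pair, row in matrix.items()}


def agreement_frequency(matrix):
    """Count how many linked pairs occur at each agreement level.

    Pairs that no worker linked are skipped (an assumption of this sketch).
    """
    counts = workers_in_agreement(matrix)
    return Counter(n for n in counts.values() if n > 0)


agreement = workers_in_agreement(pair_matrix)
frequency = agreement_frequency(pair_matrix)

print(agreement)
print(dict(sorted(frequency.items())))
```

In a real analysis the matrix would be read from the Document 6 spreadsheet rather than typed in by hand.
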
Date made available: 3 Jul 2017
Publisher: University of Strathclyde
