Commit

v0.9.0 (#27)
* Empty json file for testing
* Added notes for large pickles #21
* Delete requirements.txt (#26); setup.py is sufficient; completes #25
* Corrected URL for word cloud & updated FDG section in README.md
IanGrimstead authored Aug 22, 2018
1 parent b8a511a commit d34a122
Showing 10 changed files with 40 additions and 44 deletions.
31 changes: 29 additions & 2 deletions README.md
@@ -27,11 +27,11 @@ The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/T

### Word cloud

Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
Here is a [wordcloud](https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.

### Force directed graph

This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a 10,000 random sample of patents.
This output provides an interactive graph to be viewed in a web browser (you need to open the file ```outputs/fdg/index.html``` locally). The graph shows connections between terms that are generally found in the same patent documents. The example graph in the ```outputs/fdg``` folder was created using the Y02 classification on a 10,000 random sample of patents.

## How to install

@@ -82,6 +82,27 @@ python detect.py -ps=USPTO-random-10000

Will run the tool for a pre-created random dataset of 10,000 patents.

### Additional patent sources

Patent datasets are stored in the sub-folder ```data```. We have supplied the following files:
- ```USPTO-random-100.pkl.bz2```
- ```USPTO-random-1000.pkl.bz2```
- ```USPTO-random-10000.pkl.bz2```
- ```USPTO-random-100000.pkl.bz2```
- ```USPTO-random-500000.pkl.bz2```

The command ```python detect.py -ps=USPTO-random-10000``` instructs the program to load a pickled data frame of patents
from a file located in ```data/USPTO-random-10000.pkl.bz2```. Hence ```-ps=NAME``` looks for ```data/NAME.pkl.bz2```.
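
The naming rule above can be sketched as follows. This is a minimal illustration and not the code in ```detect.py```: the helper names ```patent_source_path``` and ```load_patent_source``` are hypothetical, and it assumes the file is an ordinary bz2-compressed pickle (unpickling a data frame also requires pandas to be installed). Only unpickle files from a source you trust.

```python
import bz2
import pickle
from pathlib import Path


def patent_source_path(name, data_dir="data"):
    """Map a -ps=NAME value to data/NAME.pkl.bz2 (hypothetical helper)."""
    return Path(data_dir) / f"{name}.pkl.bz2"


def load_patent_source(name, data_dir="data"):
    """Load the pickled object stored for the given patent source name."""
    path = patent_source_path(name, data_dir)
    if not path.is_file():
        # Fail early with the path that was tried, so a mistyped name is obvious
        raise FileNotFoundError(f"No patent source found at {path}")
    # bz2.open decompresses transparently while pickle reads the stream
    with bz2.open(path, "rb") as handle:
        return pickle.load(handle)
```

For example, ```load_patent_source("USPTO-random-10000")``` would read ```data/USPTO-random-10000.pkl.bz2``` relative to the current working directory.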

We have hosted larger datasets on Google Drive, as the files are too large for GitHub version control. We have made available:
- All USPTO patents from 2004 (477 MB): [USPTO-all.pkl.bz2](https://drive.google.com/drive/folders/1d47pizWdKqtORS1zoBzsk3tLk6VZZA4N)

To use additional files, follow the link and download the pickle file into the data folder. Access the new data
with ```-ps=NameWithoutFileExtension```; for example, ```USPTO-all.pkl.bz2``` would be loaded with ```-ps=USPTO-all```.

Note that large datasets require a large amount of system memory (such as 64 GB); otherwise processing will be very slow,
as virtual memory (swap) is very likely to be used.

### Choosing CPC classification

This subsets the chosen patents dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". In this case a larger patent dataset is generally required to allow for the reduction in patent numbers after subsetting. An example script is:
@@ -216,3 +237,9 @@ optional arguments:
the desired cpc classification
```

## Acknowledgements

### Patent data

Patent data was obtained from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov) through the [Bulk Data Storage System (BDSS)](https://bulkdata.uspto.gov). In particular we used the `Patent Grant Full Text Data/APS (JAN 1976 - PRESENT)` dataset, using the data from 2004 onwards in XML 4.* format.
2 changes: 1 addition & 1 deletion detect.py
@@ -109,7 +109,7 @@ def get_tfidf(args, filename, cpc):


def main():
    paths = [os.path.join('outputs', 'reports'), os.path.join('outputs', 'json'), os.path.join('outputs', 'wordclouds')]
    paths = [os.path.join('outputs', 'reports'), os.path.join('outputs', 'wordclouds')]
    for path in paths:
        os.makedirs(path, exist_ok=True)

1 change: 1 addition & 0 deletions outputs/fdg/empty.json
@@ -0,0 +1 @@
[]
27 changes: 1 addition & 26 deletions outputs/fdg/f.js
@@ -1,10 +1,7 @@
var dataURL = "http://mysafeinfo.com/api/data?list=englishmonarchs&format=json";

var dataURL = "https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/fdg/empty.json";

var refresh = function(data){



var json_obj = JSON.parse(data);
var svg = d3.select("svg"),
width = +svg.attr("width"),
@@ -25,40 +22,20 @@ d3.select("div#chartId")
//class to make it responsive
.classed("svg-content-responsive", true);

//var container = d3.select('body').append('div')
// .attr('id','container')
//;
//
//// svg#sky
//var sky = container.append('svg')
// //.attr('height', 100)
// //.attr('width', 100)
// .attr('id', 'sky')
//;

var color = d3.scaleOrdinal(d3.schemeCategory20c);
//var nodeRadius = 20;

var padding = 1, // separation between circles
radius=6;



var simulation = d3.forceSimulation()
.force("link", d3.forceLink().id(function(d) {
return d.text;
}).distance(300))
.force("charge", d3.forceManyBody().strength(-100))
.force("center", d3.forceCenter(width / 2, height / 2))
//.force("gravity", 0.05)
//.force("linkDistance", 50)
//.force("size", [9000, 6000])
.force("collide", d3.forceCollide().radius(function(d) {
return 12*radius + padding; }).iterations(40))




d3.json(dataURL, function(error, graph) {
if (error) throw error;

@@ -72,7 +49,6 @@ d3.json(dataURL, function(error, graph) {
.data(graph.links)
.enter().append("line").attr("stroke-width", function(d) {
return (8*d.size);
//Math.sqrt(1.5*d.size);
});

var node = svg.append("g")
@@ -119,7 +95,6 @@ d3.json(dataURL, function(error, graph) {
.text(function(d) {
return d.text
});


simulation
.nodes(graph.nodes)
2 changes: 1 addition & 1 deletion outputs/fdg/index.html
@@ -4,5 +4,5 @@
<link rel="stylesheet" href="fdg_style.css"/>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script src="knockout-3.4.2.js"></script>
<script type="text/javascript" src="key-terms.json"></script>
<script src="key-terms.js"></script>
<script src="f.js"></script>
1 change: 1 addition & 0 deletions outputs/fdg/key-terms.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion outputs/fdg/key-terms.json

This file was deleted.

8 changes: 0 additions & 8 deletions requirements.txt

This file was deleted.
