v0.9.0 (#27)

* Empty json file for testing
* Added notes for large pickles #21
* Delete requirements.txt (#26); setup.py is sufficient; completes #25
* Corrected URL for word cloud & updated FDG section in README.md
datasciencecampus · Aug 22, 2018 · d34a122
1 parent b8a511a
commit d34a122
Showing 10 changed files with 40 additions and 44 deletions.
diff --git a/README.md b/README.md
@@ -27,11 +27,11 @@ The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/T
 
 ### Word cloud
 
-Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
+Here is a [wordcloud](https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
 
 ### Force directed graph
 
-This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a 10,000 random sample of patents.
+This output provides an interactive graph to be viewed in a web browser (you need to open the file ```outputs/fdg/index.html``` locally). The graph shows connections between terms that are generally found in the same patent documents. The example graph in the ```outputs/fdg``` folder was created using the Y02 classification on a random sample of 10,000 patents.
 
 ## How to install
 
@@ -82,6 +82,27 @@ python detect.py -ps=USPTO-random-10000
 
 Will run the tool for a pre-created random dataset of 10,000 patents.
 
+### Additional patent sources
+
+Patent datasets are stored in the sub-folder ```data```. We have supplied the following files:
+- ```USPTO-random-100.pkl.bz2```
+- ```USPTO-random-1000.pkl.bz2```
+- ```USPTO-random-10000.pkl.bz2```
+- ```USPTO-random-100000.pkl.bz2```
+- ```USPTO-random-500000.pkl.bz2```
+
+The command ```python detect.py -ps=USPTO-random-10000``` instructs the program to load a pickled data frame of patents
+from a file located in ```data/USPTO-random-10000.pkl.bz2```. Hence ```-ps=NAME``` looks for ```data/NAME.pkl.bz2```.
+
+We have hosted larger datasets on Google Drive, as the files are too large for GitHub version control. We have made available:
+- All USPTO patents from 2004 (477 MB): [USPTO-all.pkl.bz2](https://drive.google.com/drive/folders/1d47pizWdKqtORS1zoBzsk3tLk6VZZA4N)
+
+To use additional files, follow the link and download the pickle file into the data folder. Access the new data
+with ```-ps=NameWithoutFileExtension```; for example, ```USPTO-all.pkl.bz2``` would be loaded with ```-ps=USPTO-all```.
+
+Note that large datasets require a large amount of system memory (such as 64 GB); otherwise processing will be very slow,
+as virtual memory (swap) is very likely to be used.
+
 ### Choosing CPC classification
 
 This subsets the chosen patents dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". In this case a larger patent dataset is generally required to allow for the reduction in patent numbers after subsetting. An example script is:
@@ -216,3 +237,9 @@ optional arguments:
                         the desired cpc classification
 
 ```
+
+## Acknowledgements
+
+### Patent data
+
+Patent data was obtained from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov) through the [Bulk Data Storage System (BDSS)](https://bulkdata.uspto.gov). In particular we used the `Patent Grant Full Text Data/APS (JAN 1976 - PRESENT)` dataset, using the data from 2004 onwards in XML 4.* format.
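
Review note on the README hunk above: it documents that `-ps=NAME` looks for `data/NAME.pkl.bz2`. The following is a minimal sketch of that naming convention, not the loader that `detect.py` actually uses. The function names `patent_source_path` and `load_patent_source` and the `data_dir` parameter are hypothetical, and the sketch assumes the file is a plain bz2-compressed pickle readable with the standard library (pandas would still need to be installed to unpickle a data frame).

```python
import bz2
import pickle
from pathlib import Path


def patent_source_path(name, data_dir="data"):
    """Map a -ps value such as 'USPTO-all' to data/USPTO-all.pkl.bz2."""
    return Path(data_dir) / f"{name}.pkl.bz2"


def load_patent_source(name, data_dir="data"):
    """Load the bz2-compressed pickle for the given source name (hypothetical helper)."""
    path = patent_source_path(name, data_dir)
    if not path.is_file():
        # Fail early with a hint instead of a bare traceback from bz2.open
        raise FileNotFoundError(
            f"No patent dataset at {path}; download it into {data_dir}/ first"
        )
    with bz2.open(path, "rb") as handle:
        return pickle.load(handle)
```

With this sketch, `load_patent_source("USPTO-all")` would read `data/USPTO-all.pkl.bz2` once the file has been downloaded into the `data` folder. As with any pickle, only files from a trusted source should be loaded.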
diff --git a/detect.py b/detect.py
@@ -109,7 +109,7 @@ def get_tfidf(args, filename, cpc):
 
 
 def main():
-    paths = [os.path.join('outputs', 'reports'), os.path.join('outputs', 'json'), os.path.join('outputs', 'wordclouds')]
+    paths = [os.path.join('outputs', 'reports'), os.path.join('outputs', 'wordclouds')]
     for path in paths:
         os.makedirs(path, exist_ok=True)
 

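Review note on the `main()` hunk above: after this change only the `reports` and `wordclouds` folders are created at start-up. The sketch below isolates that pattern so it can be run in a scratch directory. The wrapper name `ensure_output_dirs` and the `base_dir` parameter are hypothetical and do not appear in `detect.py`.

```python
import os


def ensure_output_dirs(base_dir="."):
    """Create the report and word cloud folders under base_dir if they are missing."""
    paths = [os.path.join(base_dir, 'outputs', 'reports'),
             os.path.join(base_dir, 'outputs', 'wordclouds')]
    for path in paths:
        # exist_ok=True makes repeated runs safe: no error if the folder already exists
        os.makedirs(path, exist_ok=True)
    return paths
```

Calling the function twice in a row has the same effect as calling it once, which is why the tool can run this step unconditionally on every invocation.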
diff --git a/outputs/fdg/empty.json b/outputs/fdg/empty.json
@@ -0,0 +1 @@
+[]
diff --git a/outputs/fdg/f.js b/outputs/fdg/f.js
@@ -1,10 +1,7 @@
-var dataURL = "http://mysafeinfo.com/api/data?list=englishmonarchs&format=json";
-
+var dataURL = "https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/fdg/empty.json";
 
 var refresh = function(data){
 
-
-
 var json_obj = JSON.parse(data);
 var svg = d3.select("svg"),
     width = +svg.attr("width"),
@@ -25,40 +22,20 @@ d3.select("div#chartId")
     //class to make it responsive
     .classed("svg-content-responsive", true);
 
-//var container = d3.select('body').append('div')
-//    .attr('id','container')
-//;
-//
-//// svg#sky
-//var sky = container.append('svg')
-//    //.attr('height', 100)
-//    //.attr('width', 100)
-//    .attr('id', 'sky')
-//;
-
 var color = d3.scaleOrdinal(d3.schemeCategory20c);
-//var nodeRadius = 20;
 
 var padding = 1, // separation between circles
     radius=6;
 
-
-
 var simulation = d3.forceSimulation()
     .force("link", d3.forceLink().id(function(d) {
         return d.text;
     }).distance(300))
     .force("charge", d3.forceManyBody().strength(-100))
     .force("center", d3.forceCenter(width / 2, height / 2))
-    //.force("gravity", 0.05)
-    //.force("linkDistance", 50)
-    //.force("size", [9000, 6000])
     .force("collide", d3.forceCollide().radius(function(d) {
         return 12*radius + padding; }).iterations(40))
 
-
-
-
 d3.json(dataURL, function(error, graph) {
     if (error) throw error;
 
@@ -72,7 +49,6 @@ d3.json(dataURL, function(error, graph) {
         .data(graph.links)
         .enter().append("line").attr("stroke-width", function(d) {
         return (8*d.size);
-            //Math.sqrt(1.5*d.size);
         });
 
     var node = svg.append("g")
@@ -119,7 +95,6 @@ d3.json(dataURL, function(error, graph) {
         .text(function(d) {
             return d.text
         });
-
 
     simulation
         .nodes(graph.nodes)

diff --git a/outputs/fdg/index.html b/outputs/fdg/index.html
@@ -4,5 +4,5 @@
 <link rel="stylesheet" href="fdg_style.css"/>
 <script src="https://d3js.org/d3.v4.min.js"></script>
 <script src="knockout-3.4.2.js"></script>
-<script type="text/javascript" src="key-terms.json"></script>
+<script src="key-terms.js"></script>
 <script src="f.js"></script>
diff --git a/outputs/fdg/key-terms.js b/outputs/fdg/key-terms.js
diff --git a/outputs/fdg/key-terms.json b/outputs/fdg/key-terms.json
diff --git a/requirements.txt b/requirements.txt