Added --citations-only option. It prints all the articles that cite the queried one #83

Open
wants to merge 4 commits into
base: master
from

Conversation

lucabaronti

@lucabaronti lucabaronti commented Feb 25, 2017

Added a handy option to automatically retrieve the list of articles that cite the first article returned by the query.
For instance, if you want a list of articles that cite the first article returned by this query:
$ ./scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"
use the --citations-only option
$ ./scholar.py --citations-only -c 1 --author "albert einstein" --phrase "quantum theory"
and it will print this:

         Title Modern Electrochemistry 2B: Electrodics in Chemistry, Engineering, Biology and Environmental Science
           URL http://books.google.com/books?hl=en&lr=&id=V3tpJrG1H5wC&oi=fnd&pg=PA1539&ots=OUzlJ0YriM&sig=5gwx3WY-wSRLLMe3lRygYwxK1U8
          Year 2000
     Citations 7392
      Versions 10
    Cluster ID 13855735528547899559
Citations list http://scholar.google.com/scholar?cites=13855735528547899559&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=13855735528547899559&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt This long awaited and thoroughly updated version of the classic text (Plenum Press, 1970) explains the subject of electrochemistry in clear, straightforward language for undergraduates and mature scientists who want to understand solutions. Like its 

         Title Spectral analysis and time series
           URL http://www.citeulike.org/group/96/article/745677
          Year 1981
     Citations 6726
      Versions 3
    Cluster ID 16874516227592319711
Citations list http://scholar.google.com/scholar?cites=16874516227592319711&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=16874516227592319711&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt Search all the public and authenticated articles in CiteULike. Include unauthenticated resultstoo (may include "spam") Enter a search phrase. You can also specify a CiteULike article id(123456),. a DOI (doi:10.1234/12345678). or a PubMed ID (pmid:12345678). Click Help for

         Title Introduction
           URL http://link.springer.com/chapter/10.1007/978-1-4614-0511-5_1
          Year 2011
     Citations 5380
      Versions 59
    Cluster ID 3815736992424174150
Citations list http://scholar.google.com/scholar?cites=3815736992424174150&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=3815736992424174150&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt Abstract In recent years, the adopting of some supply chain practice such as outsourcing and lean production helps in smoothing the operations, but it also results in little buffer inventory in a supply chain which may lead to increased vulnerability of the chains. 1 At the 

         Title The random walk's guide to anomalous diffusion: a fractional dynamics approach
           URL http://www.sciencedirect.com/science/article/pii/S0370157300000703
          Year 2000
     Citations 5144
      Versions 18
    Cluster ID 11032747530556470631
Citations list http://scholar.google.com/scholar?cites=11032747530556470631&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=11032747530556470631&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt Fractional kinetic equations of the diffusion, diffusion–advection, and Fokker–Planck type are presented as a useful approach for the description of transport dynamics in complex systems which are governed by anomalous diffusion and non-exponential relaxation 

         Title Diffusion processes
           URL http://onlinelibrary.wiley.com/doi/10.1002/0471667196.ess0495.pub2/full
          Year 1974
     Citations 3118
      Versions 9
    Cluster ID 13465318938558459827
Citations list http://scholar.google.com/scholar?cites=13465318938558459827&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=13465318938558459827&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt Suppose that we are given a differential operator As of the form (6). We want to construct a diffusion process whose generator is As. In 1936, W. Feller proved that the backward equation (8) together with the terminal condition (9) has a unique solution under the 

         Title Metapopulation biology
           URL http://agris.fao.org/agris-search/search.do?recordID=US201300021834
          Year 1997
     Citations 3025
      Versions 2
    Cluster ID 6335487017156325677
Citations list http://scholar.google.com/scholar?cites=6335487017156325677&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=6335487017156325677&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt FAO_logo. home-icon. English; Español; Français; العربية; 中文; Русский.home-icon. Toggle navigation AGRIS. Register. Sign in. My Profile; Change Password;Searching History; Browsing History; Saved Publications; Logout. Search. Register;

         Title Stochastic processes
           URL http://epubs.siam.org/doi/pdf/10.1137/1.9781611971125.bm
          Year 1999
     Citations 2978
      Versions 16
    Cluster ID 9561949148186522176
Citations list http://scholar.google.com/scholar?cites=9561949148186522176&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=9561949148186522176&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt When published in 1962 this book was described by some reviewers as a truly introductory textbook and comprehensive survey of stochastic processes, requiring only a minimal background in introductory probability theory and mathematical analysis. It continues to be 

         Title Cavitation and bubble dynamics
           URL http://books.google.com/books?hl=en&lr=&id=yRhaAQAAQBAJ&oi=fnd&pg=PR11&ots=O6xRuHnbh2&sig=Mk4rT4w-xmW-mpbLGtM4cThmdJo
          Year 2013
     Citations 2868
      Versions 26
    Cluster ID 10903735145678015071
Citations list http://scholar.google.com/scholar?cites=10903735145678015071&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=10903735145678015071&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt Cavitation and Bubble Dynamics deals with the fundamental physical processes of bubble dynamics and the phenomenon of cavitation. It is ideal for graduate students and research engineers and scientists, and a basic knowledge of fluid flow and heat transfer is assumed. 

         Title Introduction to colloid and surface chemistry: Butterworth-Heinemann, Oxford, 1991, ISBN 0 7506 1182 0, 306 pp,£ 14.95
          Year 1993
     Citations 2771
      Versions 6
    Cluster ID 10562832630033572094
Citations list http://scholar.google.com/scholar?cites=10562832630033572094&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=10562832630033572094&hl=en&as_sdt=2005&sciodt=0,5

         Title Irreversibility and generalized noise
           URL http://scholar.google.com/https://journals.aps.org/pr/abstract/10.1103/PhysRev.83.34
          Year 1951
     Citations 2721
      Versions 3
    Cluster ID 13951920364609032371
Citations list http://scholar.google.com/scholar?cites=13951920364609032371&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=13951920364609032371&hl=en&as_sdt=2005&sciodt=0,5
       Excerpt Abstract A relation is obtained between the generalized resistance and the fluctuations of the generalized forces in linear dissipative systems. This relation forms the extension of the Nyquist relation for the voltage fluctuations in electrical impedances. The general formalism 

@dsevero

dsevero commented Mar 2, 2017

Did you notice that it breaks if you change the citation format? It outputs only the first result.

./scholar.py --phrase "Online Clustering of Bandits" --citations-only --citation bt
@inproceedings{kawale2015efficient,
  title={Efficient Thompson Sampling for Online Matrix-Factorization Recommendation},
  author={Kawale, Jaya and Bui, Hung H and Kveton, Branislav and Tran-Thanh, Long and Chawla, Sanjay},
  booktitle={Advances in Neural Information Processing Systems},
  pages={1297--1305},
  year={2015}
}
}

@lucabaronti
Author

@daniel-severo I just included this feature following the main behavior of the tool (with only a minimal change in the code).
Apparently in this case the main behavior, if you specify a citation format, is to return only the first article in that format.
For instance, this command
./scholar.py --phrase "deep learning"
returns the list of (some) papers that contain "deep learning":

     Title Deep learning
       URL http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
      Year 2015
 Citations 1888
  Versions 41
Cluster ID 5362332738201102290
Citations list http://scholar.google.com/scholar?cites=5362332738201102290&as_sdt=2005&sciodt=0,5&hl=en
Versions list http://scholar.google.com/scholar?cluster=5362332738201102290&hl=en&as_sdt=0,5
 Excerpt Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object 

    Title Learning in science: A comparison of deep and surface approaches
       URL http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1098-2736(200002)37:2%3C109::AID-TEA3%3E3.0.CO;2-7/full
      Year 2000
 Citations 434
  Versions 5
Cluster ID 8108748482885444188
Citations list http://scholar.google.com/scholar?cites=8108748482885444188&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=8108748482885444188&hl=en&as_sdt=0,5
   Excerpt ... The findings also suggest that to encourage a deep learning approach, teachers couldprovide prompts and contextualized scaffolding and encourage students to ask questions,predict, and explain during activities. © 2000 John Wiley & Sons, Inc. ...

     Title Deep learning in neural networks: An overview
       URL http://www.sciencedirect.com/science/article/pii/S0893608014002135
      Year 2015
 Citations 1091
  Versions 22
Cluster ID 15932869302045479284
Citations list http://scholar.google.com/scholar?cites=15932869302045479284&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=15932869302045479284&hl=en&as_sdt=0,5
   Excerpt Abstract In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarizes relevant work, much of it from the previous millennium. Shallow and 

     Title Multimodal deep learning
       URL http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Ngiam_399.pdf
      Year 2011
 Citations 621
  Versions 28
Cluster ID 4020282035517476898
  PDF link http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Ngiam_399.pdf
Citations list http://scholar.google.com/scholar?cites=4020282035517476898&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=4020282035517476898&hl=en&as_sdt=0,5
   Excerpt Abstract Deep networks have been successfully applied to unsupervised feature learning for single modalities (eg, text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for 

     Title Why does unsupervised pre-training help deep learning?
       URL http://www.jmlr.org/papers/v11/erhan10a.html
      Year 2010
 Citations 826
  Versions 29
Cluster ID 13018263321881826087
Citations list http://scholar.google.com/scholar?cites=13018263321881826087&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=13018263321881826087&hl=en&as_sdt=0,5
   Excerpt Abstract Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The 

     Title Unsupervised feature learning for audio classification using convolutional deep belief networks
       URL http://papers.nips.cc/paper/3674-unsupervised-feature-learning-for-audio-classification-using-convolutional-deep-belief-networks
      Year 2009
 Citations 514
  Versions 21
Cluster ID 2046036768079393393
Citations list http://scholar.google.com/scholar?cites=2046036768079393393&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=2046036768079393393&hl=en&as_sdt=0,5
   Excerpt ... Abstract In recent years, deep learning approaches have gained significant interest as a wayof building hierarchical representations from unlabeled data. However, to our knowledge, thesedeep learning approaches have not been extensively stud- ied for auditory data. ...

     Title Domain adaptation for large-scale sentiment classification: A deep learning approach
       URL http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Glorot_342.pdf
      Year 2011
 Citations 497
  Versions 20
Cluster ID 18093548304865208974
  PDF link http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Glorot_342.pdf
Citations list http://scholar.google.com/scholar?cites=18093548304865208974&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=18093548304865208974&hl=en&as_sdt=0,5
   Excerpt Abstract The exponential increase in the availability of online reviews and recommendations makes sentiment classification an interesting topic in academic and industrial research. Reviews can span so many different domains that it is difficult to gather annotated training 

     Title Deep Learning for a Digital Age: Technology's Untapped Potential To Enrich Higher Education.
       URL http://eric.ed.gov/?id=ED457787
      Year 2002
 Citations 355
  Versions 2
Cluster ID 11010152299026972441
Citations list http://scholar.google.com/scholar?cites=11010152299026972441&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=11010152299026972441&hl=en&as_sdt=0,5
   Excerpt This book shows how faculty can help students develop skills in research, problem solving, critical thinking, and knowledge management by using Web-based collaboration tools. This innovative approach to teaching and learning emphasizes the use of virtual spaces," 

     Title On the importance of initialization and momentum in deep learning.
       URL http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf
      Year 2013
 Citations 499
  Versions 17
Cluster ID 7449004388220998591
  PDF link http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf
Citations list http://scholar.google.com/scholar?cites=7449004388220998591&as_sdt=2005&sciodt=0,5&hl=en
     Versions list http://scholar.google.com/scholar?cluster=7449004388220998591&hl=en&as_sdt=0,5
   Excerpt Abstract Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with 

     Title Playing atari with deep reinforcement learning
       URL http://scholar.google.com/https://arxiv.org/abs/1312.5602
      Year 2013
 Citations 436
  Versions 26
Cluster ID 10603651548644623407
Citations list http://scholar.google.com/scholar?cites=10603651548644623407&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=10603651548644623407&hl=en&as_sdt=0,5
   Excerpt ... DeepMind Technologies {vlad,koray,david,alex.graves,ioannis,daan,martin.riedmiller} @deepmind.com Abstract We present the first deep learning model to successfully learn controlpolicies di- rectly from high-dimensional sensory input using reinforcement learning. ...

whilst if you specify the citation format
./scholar.py --phrase "deep learning" --citation bt
you get only the bibtex format of the first paper in the previous list

@Article{lecun2015deep,
title={Deep learning},
author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey},
journal={Nature},
volume={521},
number={7553},
pages={436--444},
year={2015},
publisher={Nature Research}
}

So unless I missed something, no, my feature didn't add a bug; it's just preexisting behavior.

Checking the code, I've found that the problem is present only when the settings are specified (like in the case of the citation format).
In the settings data structure there's a field named per_page_results, initially set to None.
Changing it to a proper value (like 10) will solve the problem.
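
A minimal sketch of that idea, not the actual scholar.py code: the `Settings` class and the `effective_per_page_results` helper are hypothetical stand-ins, and only the field name per_page_results and the value 10 come from the description above.

```python
# Hypothetical stand-in for the settings structure; only the field name
# per_page_results and the fallback value 10 come from this thread.
DEFAULT_PER_PAGE_RESULTS = 10


class Settings:
    def __init__(self, citation_format=None, per_page_results=None):
        self.citation_format = citation_format
        # None means "never set explicitly".
        self.per_page_results = per_page_results


def effective_per_page_results(settings):
    """Return how many results to request per page, never None."""
    if settings.per_page_results is None:
        return DEFAULT_PER_PAGE_RESULTS
    return settings.per_page_results


if __name__ == "__main__":
    print(effective_per_page_results(Settings(citation_format="bt")))
```

Resolving the default in one place means an explicit value chosen by the user is still respected, while a citation format on its own no longer leaves the page size undefined.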

Since it's a very easy fix, this may or may not be a bug in the main tool; it could be behavior the author intended.
Anyway, like you, I think it's more useful to return the exact same list converted to the specified format, so I've pushed a fix for that on my branch. I hope that push is reflected in the code of this pull request.

I've only tested it with the biblatex format before hitting the captcha limit.
It should work for other formats too, but please double-check that for me.

tl;dr: it wasn't my fault. It might not be a bug. Fixed anyway.

@Shaam93

Shaam93 commented Apr 17, 2017

Thank you for this great modification; it sounds like it does exactly what I am looking for. Unfortunately, when I run the code after applying the parts you added and deleted, I get the following error:

self.per_page_results = 10
^
IndentationError: unexpected indent

So I have not gotten any output yet, and I would like to have a list of papers cited by an original paper in CSV format. I would be glad if you could help me.
I am using Spyder (Python 3.6), in case that is relevant to solving the problem.

Thanks in advance

@lucabaronti
Author

It must have been a typo; try it now and let me know.

@Shaam93

Shaam93 commented Apr 17, 2017

I am sorry, it is not fixed yet; the error is at line 1035, as shown below:

runfile('C:/Users/NOVEMBER/Documents/src/PaperCrawler/.git/scholar.py', args='--citations-only -c 1 --author "albert einstein" --phrase "quantum theory"', wdir='C:/Users/NOVEMBER/Documents/src/PaperCrawler/.git')
File "C:/Users/NOVEMBER/Documents/src/PaperCrawler/.git/scholar.py", line 1035
self.send_query(query)
^
IndentationError: unindent does not match any outer indentation level

I added exactly the lines you added and deleted what you have deleted, and used the command line: --citations-only -c 1 --author "albert einstein" --phrase "quantum theory"

Did I do something wrong?

@Shaam93

Shaam93 commented Apr 17, 2017

I have read a little bit about changing tabs into 4 spaces in order to fix the previous error, and I fixed the typo you told me about, but then I got another error at line 1035, as previously stated.

Thank you for replying so fast, I appreciate it.

@Shaam93

Shaam93 commented Apr 17, 2017 via email

@lucabaronti
Author

Your problem seems related to the different indentation styles used on different systems (mine is Unix; I assume you are using Windows).
It's hard for me to fix that problem since I'm not able to reproduce it; however, you should be able to replace the spaces with tabs (or vice versa) where needed.
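
As a rough sketch of that replacement (the helper name and the tab width of 4 are assumptions, not something taken from scholar.py), the leading whitespace of each line can be normalized to spaces like this:

```python
import re


def leading_tabs_to_spaces(source, tab_size=4):
    """Expand tabs in each line's leading whitespace, leaving the rest untouched.

    tab_size=4 is an assumption; use whatever width the file's existing
    indentation was written with.
    """
    fixed = []
    for line in source.splitlines(keepends=True):
        # Grab only the run of spaces and tabs at the start of the line.
        indent = re.match(r"[ \t]*", line).group(0)
        fixed.append(indent.expandtabs(tab_size) + line[len(indent):])
    return "".join(fixed)


if __name__ == "__main__":
    print(repr(leading_tabs_to_spaces("\tself.send_query(query)\n")))
```

Only the indentation is rewritten, so tabs that appear inside string literals stay as they are. It is worth working on a copy of the file.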

Everything else should work as intended, let me know otherwise.

On another note, I've just noticed that the current version is unable to download more than the first 10 citations.
The right solution might require more modifications to the code than I intend to make.
Truth be told, I'm not the greatest fan of how the code is structured in this project, so I prefer to keep the modifications to a minimum, relying on the author to properly integrate my parts (should the need arise).

I've now pushed a workaround that is able to fetch all the citations for a given paper.
Since I'm doing a scholar search for every 10 citations, I've put a sleep between them.
That means the command takes 1 second for every 10 citations the paper has.
If you need a faster solution, just lower the number of seconds in the sleep at line 1052
time.sleep(1)
but be sure not to flood the server with requests, or you might be soft-banned.
A better solution may exist; I'll double-check that later.
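
The shape of that workaround can be sketched as follows. This is an illustration, not the code in the branch: `fetch_page` is a hypothetical callable standing in for one scholar search, and only the page size of 10 and the 1-second pause come from the description above.

```python
import time

PAGE_SIZE = 10  # the thread says citations arrive 10 at a time


def fetch_all_citations(fetch_page, total, delay=1.0, sleep=time.sleep):
    """Collect citations page by page, pausing between requests.

    fetch_page(start) is a hypothetical callable returning the list of
    citations beginning at offset `start`.
    """
    results = []
    for start in range(0, total, PAGE_SIZE):
        if start > 0:
            sleep(delay)  # wait before every request after the first
        page = fetch_page(start)
        if not page:
            break  # the server returned nothing, so stop early
        results.extend(page)
    return results


if __name__ == "__main__":
    fake = list(range(25))
    got = fetch_all_citations(lambda s: fake[s:s + PAGE_SIZE], len(fake), delay=0.0)
    print(len(got))
```

Passing `sleep` as a parameter is only a convenience for trying the loop out without actually waiting.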

Also, it's been quite some time since I last touched that code, and I haven't had time to check every possible interaction, so let me know if you find any new issues.

About your specific query, I've checked this command
$ ./scholar.py --phrase "Novel properties of the Fourier decomposition of the sinogram" --citations-only --csv
and it actually prints the CSV of all the 151 papers that cite it (too long to paste here)

@Shaam93

Shaam93 commented Apr 18, 2017

I applied your changes and fixed the tabs-and-spaces problem; now I am getting only this output:

UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
An exception has occurred, use %tb to see the full traceback.

SystemExit: 0

I am using what is called a Python interpreter. I only downloaded WinPython version 3.6 for Windows and opened scholar.py from the Spyder shortcut that WinPython provides. Could you please tell me how you usually run the code on your machine? And what about the things people mention in other questions and comments about BeautifulSoup4 and pip? I have no idea how to run the code other than through Spyder, so if you have some time, please tell me how you run it.

Thanks in advance.

@Shaam93

Shaam93 commented Apr 30, 2017

Hello again,
I just wanted to say that, in order to get rid of the error message I posted earlier, I had to remove the system pause from the code. That is what was stopping the code from running smoothly. To be clearer: at line 1347, write main() instead of sys.exit(main()) and there will be no more errors.
I have the results that I want now. Thank you so much for everything.

@lucabaronti
Author

The sys.exit(main()) was put there by the original author.
It's not a pause; it's used to provide an exit code (which is the basic way to determine whether the program terminated correctly).

It works well on my machine; however, if it causes you trouble, I think you can safely replace it with main() as you did.
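
To illustrate what that line does (a standalone sketch with a placeholder `main` and a hypothetical `run` helper, not the tool's own code): sys.exit() raises SystemExit carrying the return value of main(), which is presumably why an interactive console reports "SystemExit: 0" rather than silently ending the process.

```python
import sys


def main():
    # Placeholder for the real work; 0 conventionally means success.
    return 0


def run(entry_point):
    """Call sys.exit(entry_point()) and return the exit code it carried."""
    try:
        sys.exit(entry_point())
    except SystemExit as exc:
        # sys.exit() raises SystemExit; a plain interpreter uses it to set
        # the process exit status, while an interactive console may just
        # report the exception instead of terminating.
        return exc.code


if __name__ == "__main__":
    print("exit code:", run(main))
```

Seen this way, a reported exit code of 0 means the program finished normally, not that it failed.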

@Shaam93

Shaam93 commented May 8, 2017

Hello Mr. Baronti,
thank you for your reply. You are right, it is not about sys.exit, because it is happening again. Sometimes I get the required data as output, and other times, for some reason, I get no output at all. Note that I have not changed anything in the code; it is still the same one I used before.

Please tell me if you have any idea why this is happening. I am sure it is not because of the code; you did a great job. My question is: is it related to Google Scholar itself, or is there anything that I am not taking into consideration?
It used to happen, then it worked, and now it is not working again, and I am still using the same code!

Thank you for your time and consideration.

@lucabaronti
Author

lucabaronti commented May 9, 2017

It's possible that you made too many requests in a day and the server blocked them as a result.
When the server detects too many requests from the same user, it may soft-ban them or perform some checks (usually in the form of a captcha).
When this happens, the program stops working because it can't go any further, and this may explain your problem.

It has been a while since I last checked this project's code, but I remember that I couldn't find a way to request all the citations at once.
In order to fetch every citation, I had to request them page by page (e.g. if a paper has 300 citations, I had to make 30 separate requests).
To avoid flooding the server I put a sleep of 1 second between them; however, the server checks may consider the number of requests as well as their frequency.

For this purpose, a user may be blocked server-side by IP address, by cookies, or both.
If I remember correctly, the program resets the cookies at every run; however, the IP address stays the same, so further requests after the program fails the first time are unlikely to succeed.

You can try to mitigate the problem by increasing the sleep time (search for sleep in the code), but keep in mind that since this is a server-side issue there is very little we can do client-side to address it.
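
One way to increase the wait automatically is sketched below, under the assumption that a blocked request shows up as an empty result. `fetch_with_backoff` and its `fetch` argument are hypothetical, and waiting longer cannot undo a block that is already in place.

```python
import time


def fetch_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns something, doubling the wait each time.

    fetch is a hypothetical callable that returns None (or an empty
    result) when the request appears to have been blocked.
    """
    delay = base_delay
    for attempt in range(retries):
        result = fetch()
        if result:
            return result
        if attempt < retries - 1:
            sleep(delay)
            delay *= 2  # wait twice as long before the next attempt
    return None  # every attempt came back empty


if __name__ == "__main__":
    answers = iter([None, ["some paper"]])
    print(fetch_with_backoff(lambda: next(answers), base_delay=0.0))
```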

@Shaam93

Shaam93 commented May 17, 2017

Those are mostly the reasons I was thinking of, but since we cannot fix this, how can I be sure that the server is blocking me and that it is not another issue? I do not receive any warnings; the program just gives no output, as in the very first few times I tried the code. Is it possible to show a warning message when the server is flooded with requests, or when the user is blocked?

@lucabaronti
Author

I'm currently using the original author's functions to query Google Scholar.
I agree that a more informative error message would be helpful; unfortunately I don't have much time to look into this and, in fact, changing that part is well beyond the scope of this pull request.
If I manage to find some extra time for this project, I might work on a more informative error message and submit it in a separate pull request.
However, I can't promise anything in that regard right now, I'm sorry.
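
For whoever picks this up, one rough shape such a check could take is sketched below. Everything in it is an assumption: the function names are hypothetical, the text markers are illustrative guesses, and whether the server really answers with HTTP 429 or 503 when it blocks a client would have to be verified against real responses.

```python
import warnings

# Assumed, illustrative markers; what a blocked response really contains
# would need to be checked against actual server replies.
BLOCK_MARKERS = ("captcha", "unusual traffic")


def looks_blocked(status_code, body):
    """Guess whether a response indicates throttling or a captcha page."""
    if status_code in (429, 503):  # Too Many Requests / Service Unavailable
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


def warn_if_blocked(status_code, body):
    """Emit a warning instead of failing silently; return True if blocked."""
    blocked = looks_blocked(status_code, body)
    if blocked:
        warnings.warn("The server seems to be blocking requests "
                      "(HTTP %d or captcha page)." % status_code)
    return blocked


if __name__ == "__main__":
    print(warn_if_blocked(200, "<html>ordinary results</html>"))
```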

@ivanperez-keera

@lucabaronti It would be fantastic if this could be merged. It's a very useful feature.

@lucabaronti
Author

@ivanperez-keera I'm glad you like it. You should ask the original author, since he's the only one who can merge this pull request.

@ivanperez-keera

I like the idea, but I have not been able to try it yet. Does it work with the latest version of scholar.py?

@lucabaronti
Author

As you can see from the date of my last comment, it has been a while since I last tried it.
However, I think it should work with the current version, so you should give it a try.

4 participants