gummy.journals module¶
These classes get contents from paper pages (html) or files (PDF, TeX).
Supported journals are listed here (Supported journals · iwasakishuto/Translation-Gummy Wiki). If you want support for a new journal, please request it via Twitter DM or GitHub issues.
You can easily get (import) a Journal Crawler Class in the following ways.
>>> from gummy import journals
>>> crawler = journals.get("nature")
>>> crawler
<gummy.journals.NatureCrawler at 0x1256777c0>
>>> from gummy.journals import NatureCrawler
>>> nature = NatureCrawler()
>>> nature
<gummy.journals.NatureCrawler at 0x1253da9a0>
>>> crawler = journals.get(nature)
>>> id(crawler) == id(nature)
True
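The lookup behavior shown above (a string identifier resolves to a crawler, while an existing instance is returned unchanged) can be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: `DemoCrawler`, `CRAWLERS`, and `get_crawler` are hypothetical names introduced for this example.

```python
class DemoCrawler:
    """Hypothetical stand-in for a journal crawler class."""


# Hypothetical registry mapping identifiers to crawler classes.
CRAWLERS = {"demo": DemoCrawler}


def get_crawler(identifier, **kwargs):
    """Return a crawler for ``identifier``.

    A string is looked up in the registry and instantiated;
    anything else is assumed to be a crawler instance and is
    passed through unchanged.
    """
    if isinstance(identifier, str):
        if identifier not in CRAWLERS:
            raise KeyError(f"Unknown crawler identifier: {identifier!r}")
        return CRAWLERS[identifier](**kwargs)
    return identifier


crawler = get_crawler("demo")
same = get_crawler(crawler)
print(type(crawler).__name__, same is crawler)
```

Passing instances through unchanged lets calling code accept either form without checking the type itself.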
-
class
gummy.journals.
GummyAbstJournal
(crawl_type='soup', gateway='useless', sleep_for_loading=3, verbose=True, DecomposeTexTags=['<cit.>', '\xa0', '<ref>'], DecomposeSoupTags=[{'name': 'link'}, {'name': 'meta'}, {'name': 'noscript'}, {'name': 'script'}, {'name': 'style'}, {'name': 'sup'}], subheadTags=[], **kwargs)[source]¶ Bases:
object
If you want to define your own journal crawler, please inherit this class and define these methods:
- if crawl_type=="tex":
  (required) get_contents_tex(self, url, driver=None)
  (required) get_sections_from_tex(tex)
  (required) get_contents_from_tex_sections(tex_sections)
- if crawl_type=="soup":
  (if necessary) get_contents_from_soup_sections(self, soup_sections)
  (required) get_title_from_soup(self, soup)
  (required) get_sections_from_soup(self, soup)
  (required) get_head_from_section(self, section)
  (if necessary) make_elements_visible(self, driver)
  organize_soup_section(self, section, head="", head_is_not_added=True)
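As a rough illustration of how the soup-related methods listed above might fit together, the sketch below uses a stand-in base class instead of GummyAbstJournal and plain dictionaries instead of BeautifulSoup objects, so it runs without gummy, Selenium, or bs4. `SoupCrawlerSketch`, `MyJournalCrawler`, and the dictionary layout are assumptions made for this example; the real base class may call these methods differently.

```python
from abc import ABC, abstractmethod


class SoupCrawlerSketch(ABC):
    """Stand-in base class (NOT GummyAbstJournal) with a template-method flow."""

    @abstractmethod
    def get_title_from_soup(self, soup):
        """Return the page title."""

    @abstractmethod
    def get_sections_from_soup(self, soup):
        """Return a list of page sections."""

    @abstractmethod
    def get_head_from_section(self, section):
        """Return the heading of one section."""

    def get_contents_from_soup_sections(self, soup_sections):
        # Default behaviour in this sketch: one content dict per section.
        contents = []
        for section in soup_sections:
            contents.append({
                "head": self.get_head_from_section(section),
                "raw": section["text"],
            })
        return contents

    def get_contents_soup(self, soup):
        # Fixed pipeline; subclasses only fill in the hooks above.
        title = self.get_title_from_soup(soup)
        sections = self.get_sections_from_soup(soup)
        return title, self.get_contents_from_soup_sections(sections)


class MyJournalCrawler(SoupCrawlerSketch):
    """Hypothetical crawler for a page already parsed into a plain dict."""

    def get_title_from_soup(self, soup):
        return soup["title"]

    def get_sections_from_soup(self, soup):
        return soup["sections"]

    def get_head_from_section(self, section):
        return section["heading"]


page = {
    "title": "An example paper",
    "sections": [
        {"heading": "Abstract", "text": "We study ..."},
        {"heading": "Methods", "text": "We measure ..."},
    ],
}
title, contents = MyJournalCrawler().get_contents_soup(page)
print(title)
print(contents[0])
```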
- Parameters
crawl_type (str) – Crawling type. If not specified, the recommended crawling type is used.
gateway (str, GummyGateWay) – Identifier of the Gummy Gateway Class. See gateways. (default="useless")
sleep_for_loading (int) – Number of seconds to wait for a web page to load. (default=3)
verbose (bool) – Whether to print output or not. (default=True)
DecomposeTexTags (list) – TeX tags to be removed in advance for easier analysis. (default=["<cit.>", "\xa0", "<ref>"])
DecomposeSoupTags (list) – HTML tags to be removed in advance for easier analysis. (default=[{"name": "link"}, {"name": "meta"}, {"name": "noscript"}, {"name": "script"}, {"name": "style"}, {"name": "sup"}])
subheadTags (list) – HTML tag names to identify the subheadings.
kwargs (dict) – Currently unused.
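To make the role of DecomposeTexTags more concrete, here is a minimal sketch that strips the default tags from a string. It assumes that "removing" a tag means plain substring removal; `remove_tex_tags` is a hypothetical helper, not a function of this module.

```python
def remove_tex_tags(text, tags=("<cit.>", "\xa0", "<ref>")):
    """Remove every occurrence of each tag from ``text``.

    Assumption: decomposing is modelled as plain substring removal.
    """
    for tag in tags:
        text = text.replace(tag, "")
    return text


cleaned = remove_tex_tags("Transformers<cit.> are described in Section <ref>.")
print(cleaned)
```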
-
crawling_logs
¶ Crawling logs.
- Type
dict
-
property
class_name
¶ Same as self.__class__.__name__.
-
property
name
¶ Translator service name.
-
property
journal_type
¶ Journal Type.
-
property
default_title
¶ Default title.
-
get_contents
(url, driver=None, crawl_type=None, **gatewaykwargs)[source]¶ Get contents using the method determined by crawl_type.
- Parameters
url (str) – URL of a paper or path/to/local.file.
driver (WebDriver) – Selenium WebDriver.
crawl_type (str) – Crawling type. If not specified, the recommended crawling type is used.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, list)
Examples
>>> from gummy import journals
>>> crawler = journals.get("nature")
>>> title, texts = crawler.get_contents(url="https://www.nature.com/articles/ncb0800_500")
Crawling Type: soup
:
>>> print(title)
Formation of the male-specific muscle in female by ectopic expression
>>> print(texts[:1])
[{'head': 'Abstract', 'raw': 'The () gene product Fru has been ... for the sexually dimorphic actions of the gene.'}]
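Once a title and a list of content dictionaries are available, they can be post-processed with ordinary Python. The following sketch joins them into plain text. `render_plain_text` is a hypothetical helper, and the sample data only imitates the shape of the output above with made-up values.

```python
def render_plain_text(title, contents):
    """Join a title and a list of content dicts into plain text.

    Each dict may carry "head" and "raw" keys; missing keys are skipped.
    """
    lines = [title, "=" * len(title)]
    for item in contents:
        if item.get("head"):
            lines.append("")
            lines.append(f"[{item['head']}]")
        if item.get("raw"):
            lines.append(item["raw"])
    return "\n".join(lines)


# Sample data shaped like the example output (values are made up).
title = "An example paper"
texts = [
    {"head": "Abstract", "raw": "First paragraph."},
    {"raw": "Second paragraph."},
]
print(render_plain_text(title, texts))
```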
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
get_contents_soup
(url, driver=None, **gatewaykwargs)[source]¶ Get contents from the URL of the web page using BeautifulSoup.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, dict)
-
get_soup_source
(url, driver=None, **gatewaykwargs)[source]¶ Scrape and get page source from url.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
-
make_elements_visible
(driver)[source]¶ Make all elements of the page visible.
- Parameters
driver (WebDriver) – Selenium WebDriver.
This function is not necessary for all Journals, but you can trim DecomposeSoupTags from the soup and it will help with debugging.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
soup, a dict showing the number of decomposed tags.
- Return type
tuple (BeautifulSoup, dict)
Register DecomposeSoupTags.
- Parameters
**kwargs – Keyword arguments for the soup.find method.
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
get_sections_from_soup
(soup)[source]¶ Get sections from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
Page sections. Each element is a bs4.element.Tag.
- Return type
list
-
get_head_from_section
(section)[source]¶ Get head from a page section.
- Parameters
section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
- Returns
A section head tag.
- Return type
bs4.element.Tag
-
get_contents_from_soup_sections
(soup_sections)[source]¶ Get contents from each soup section.
- Parameters
soup_sections (list) – Each element is a bs4.element.Tag.
- Returns
Each element is a dict (each key is one of ["raw", "head", "subhead", "img"]).
- Return type
list
-
organize_soup_section
(section, head='', head_is_not_added=True)[source]¶ Organize soup section:
- Extract an image and display it in base64 format in the section.
- Add head only to the initial content.
- Parameters
section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
head (str) – Head word.
head_is_not_added (bool) – Whether the head has not been added yet. (default=True)
-
static
arrange_english
(english)[source]¶ Get rid of extra characters from body (english). This method is used in arrange_english.
- Parameters
english (str) – Raw English.
- Returns
Arranged English
- Return type
str
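The kind of clean-up described here can be sketched with the standard library. Which characters count as "extra" is an assumption of this example (non-breaking spaces, line breaks, repeated whitespace, and spaces before punctuation); `tidy_english` is a hypothetical helper and the real method may do more or less.

```python
import re


def tidy_english(text):
    """Normalise whitespace in a scraped English string.

    Assumption: "extra characters" means non-breaking spaces,
    line breaks, and runs of whitespace.
    """
    text = text.replace("\xa0", " ")            # non-breaking space -> space
    text = re.sub(r"\s+", " ", text)            # collapse whitespace runs
    text = re.sub(r"\s+([.,;:])", r"\1", text)  # drop space before punctuation
    return text.strip()


print(tidy_english("The  gene\xa0product\nFru has been studied ."))
```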
-
static
get_tex_url
(url)[source]¶ Convert the URL to the URL of the tex page you access when
crawl_type=="tex"
-
get_contents_tex
(url, driver=None)[source]¶ Get contents from url by parsing TeX sources.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, dict)
-
get_tex_source
(url, driver=None)[source]¶ Download and get tex source from url.
- Parameters
url (str) – URL of a tex source or path/to/local.tex.
driver (WebDriver) – Selenium WebDriver.
- Returns
Plain text in tex source.
- Return type
str
-
get_title_from_tex
(tex)[source]¶ Get a title from tex source.
- Parameters
tex (str) – Plain text in tex source.
- Returns
TeX title.
- Return type
str
-
get_sections_from_tex
(tex)[source]¶ Get sections from tex source.
- Parameters
tex (str) – Plain text in tex source.
- Returns
Each element is plain text (str)
- Return type
list
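For readers implementing the TeX hooks, the sketch below shows one simple way to pull a title and sections out of TeX text with regular expressions. It assumes the title has no nested braces and that sections are introduced by `\section{...}`. `title_from_tex` and `sections_from_tex` are hypothetical helpers, not the methods documented here.

```python
import re


def title_from_tex(tex):
    """Return the argument of the first \\title{...}, or None if absent.

    Assumption: the title contains no nested braces.
    """
    match = re.search(r"\\title\{([^{}]*)\}", tex)
    return match.group(1).strip() if match else None


def sections_from_tex(tex):
    """Split TeX text into chunks, each starting at a \\section{...}.

    Text before the first \\section is dropped in this sketch.
    """
    chunks = re.split(r"(?=\\section\{)", tex)
    return [c.strip() for c in chunks if c.lstrip().startswith("\\section{")]


tex = r"""
\title{A Tiny Paper}
\section{Introduction}
Hello.
\section{Method}
World.
"""
print(title_from_tex(tex))
for section in sections_from_tex(tex):
    print(repr(section))
```

Splitting on a lookahead keeps each `\section{...}` command at the start of its chunk, so a later step can still read the heading from it.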
-
get_contents_from_tex_sections
(tex_sections)[source]¶ Get text for each tex section.
- Parameters
tex_sections (list) – Each element is plain text (str).
- Returns
Each element is a dict (each key is one of ["raw", "head", "subhead", "img"]).
- Return type
list
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_contents_pdf
(url, driver=None)[source]¶ Get contents from url by parsing PDF file.
- Parameters
url (str) – URL of a paper or path/to/local.pdf.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, dict)
-
get_pdf_source
(url, driver=None)[source]¶ Download and get PDF source from url.
- Parameters
url (str) – URL of a PDF file or path/to/local.pdf.
driver (WebDriver) – Selenium WebDriver.
- Returns
Each element is a list which contains [text, bbox(x0,y0,x1,y1)]
- Return type
list
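A list of [text, bbox] items like the one described here usually needs ordering before it reads naturally. The sketch below sorts boxes top-to-bottom, then left-to-right. It assumes a single page and PDF-style coordinates in which y grows upward; `sort_reading_order` and the sample boxes are hypothetical.

```python
def sort_reading_order(boxes):
    """Sort [text, (x0, y0, x1, y1)] items top-to-bottom, then left-to-right.

    Assumption: y grows upward (PDF-style coordinates), so a larger y1
    means the box sits higher on the page.
    """
    return sorted(boxes, key=lambda item: (-item[1][3], item[1][0]))


boxes = [
    ["Second line", (50, 680, 300, 700)],
    ["Title", (50, 740, 300, 770)],
    ["Right column", (320, 700, 560, 720)],
]
for text, bbox in sort_reading_order(boxes):
    print(text, bbox)
```

Multi-column layouts generally need a smarter grouping step than this flat sort.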
-
class
gummy.journals.
PDFCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_contents_pdf
(url, driver=None)[source]¶ Get contents from url by parsing PDF file.
- Parameters
url (str) – URL of a paper or path/to/local.pdf.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, dict)
-
static
-
class
gummy.journals.
NatureCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ NatureCrawler's default crawl_type is "soup".
- Type
str
-
AvoidAriaLabel
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
arXivCrawler
(sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ arXivCrawler's default crawl_type is "pdf".
- Type
str
-
AvoidAriaLabel
¶ Markers indicating the extra section to remove in
get_sections_from_pdf
- Type
list
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
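As an illustration of what such a URL conversion can look like, the sketch below rewrites an arXiv abstract URL into a PDF URL. It relies on the common arXiv layout in which `/abs/<id>` and `/pdf/<id>` refer to the same paper. `abs_to_pdf_url` is a hypothetical helper and is not taken from this class's source.

```python
def abs_to_pdf_url(url):
    """Turn an arXiv abstract URL into the matching PDF URL.

    URLs that do not contain "/abs/" are returned unchanged.
    """
    return url.replace("/abs/", "/pdf/", 1)


print(abs_to_pdf_url("https://arxiv.org/abs/1706.03762"))
```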
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_tex_url
(url)[source]¶ Convert the URL to the URL of the tex page you access when
crawl_type=="tex"
-
class
gummy.journals.
NCBICrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ NCBICrawler's default crawl_type is "soup".
- Type
str
-
static
arrange_english
(english)[source]¶ Get rid of extra characters from body (english). This method is used in arrange_english.
- Parameters
english (str) – Raw English.
- Returns
Arranged English
- Return type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
PubMedCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ PubMedCrawler's default crawl_type is "soup".
- Type
str
-
AvoidIdsPatterns
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_contents_soup
(url, driver=None, **gatewaykwargs)[source]¶ Get contents from the URL of the web page using BeautifulSoup.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, dict)
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
OxfordAcademicCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ OxfordAcademicCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ScienceDirectCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ScienceDirectCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
SpringerCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ SpringerCrawler's default crawl_type is "soup".
- Type
str
-
AvoidAriaLabel
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
MDPICrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ MDPICrawler's default crawl_type is "soup".
- Type
str
-
AvoidAriaLabel
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_soup_source
(url, driver=None, **gatewaykwargs)[source]¶ Scrape and get page source from url.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
UniOKLAHOMACrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ UniOKLAHOMACrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
LungCancerCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ LungCancerCrawler's default crawl_type is "soup".
- Type
str
-
AvoidHead
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
CellPressCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ CellPressCrawler's default crawl_type is "soup".
- Type
str
Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
WileyOnlineLibraryCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ WileyOnlineLibraryCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
JBCCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ JBCCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
BiologistsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ BiologistsCrawler's default crawl_type is "soup".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
BioMedCentralCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ BioMedCentralCrawler's default crawl_type is "soup".
- Type
str
-
AvoidAriaLabel
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
IEEEXploreCrawler
(gateway='useless', sleep_for_loading=10, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ IEEEXploreCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
JSTAGECrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ JSTAGECrawler's default crawl_type is "soup".
- Type
str
-
get_soup_source
(url, driver=None, **gatewaykwargs)[source]¶ Scrape and get page source from url.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ACSPublicationsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ACSPublicationsCrawler's default crawl_type is "soup".
- Type
str
-
get_soup_source
(url, driver=None, **gatewaykwargs)[source]¶ Scrape and get page source from url.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
StemCellsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ StemCellsCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
UniKeioCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ UniKeioCrawler's default crawl_type is "soup".
- Type
str
-
get_contents_soup
(url, driver=None, **gatewaykwargs)[source]¶ Get contents from the URL of the web page using BeautifulSoup.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
(title, content)
- Return type
tuple (str, dict)
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
PLOSONECrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ PLOSONECrawler's default crawl_type is "soup".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
frontiersCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ frontiersCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
RNAjournalCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ RNAjournalCrawler's default crawl_type is "soup".
- Type
str
-
get_soup_source
(url, driver=None, **gatewaykwargs)[source]¶ Scrape and get page source from url.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
IntechOpenCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ IntechOpenCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
NRCResearchPressCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ NRCResearchPressCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
SpandidosCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ SpandidosCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
TaylorandFrancisOnlineCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ TaylorandFrancisOnlineCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
bioRxivCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ bioRxivCrawler's default crawl_type is "soup".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
RSCPublishingCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ RSCPublishingCrawler's default crawl_type is "soup".
- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
JSSECrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ JSSECrawler's default crawl_type is "pdf".
- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
ScienceAdvancesCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ScienceAdvancesCrawler's default crawl_type is "soup".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
medRxivCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ medRxivCrawler's default crawl_type is "pdf".
- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ACLAnthologyCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ACLAnthologyCrawler's default crawl_type is "pdf".
- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
PNASCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ PNASCrawler's default crawl_type is "soup".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
AMSCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ AMSCrawler's default crawl_type is "soup".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ACMCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
NOTE: If you want to download a PDF, you must run the driver with a browser.
- URL:
-
crawl_type
¶ ACMCrawler's default crawl_type is "pdf".
- Type
str
-
AvoidIDs
¶ Markers indicating the extra section to remove in
get_sections_from_pdf
- Type
list
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
APSCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ APSCrawler's default crawl_type is "soup".
- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
make_elements_visible
(driver)[source]¶ Make all elements of the page visible.
- Parameters
driver (WebDriver) – Selenium WebDriver.
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ASIPCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ASIPCrawler's default crawl_type is "soup".
- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_soup_source
(url, driver=None, **gatewaykwargs)[source]¶ Scrape and get page source from url.
- Parameters
url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.
- Returns
A data structure representing a parsed HTML or XML document.
- Return type
BeautifulSoup
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
AnatomyPubsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ AnatomyPubsCrawler's default crawl_type is "soup".
- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
RenalPhysiologyCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ RenalPhysiologyCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
GeneticsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ GeneticsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
GeneDevCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ GeneDevCrawler's
defaultcrawl_type
is"pdf"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
JAMANetworkCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ JAMANetworkCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
SAGEjournalsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ SAGEjournalsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
MolCellBioCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ MolCellBioCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
JKMSCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ JKMSCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
JKNSCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ JKNSCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
BioscienceCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ BioscienceCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
RadioGraphicsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ RadioGraphicsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
PediatricSurgeryCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ PediatricSurgeryCrawler's
defaultcrawl_type
is"soup"
.- Type
str
Markers indicating the extra section to remove in
get_sections_from_soup
- Type
list
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
AGUPublicationsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ AGUPublicationsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
NEJMCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ NEJMCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
LWWJournalsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ LWWJournalsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ARVOJournalsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ARVOJournalsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
LearningMemoryCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ LearningMemoryCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ScienceMagCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ScienceMagCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
PsyChiArtistCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ PsyChiArtistCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
OncotargetCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ OncotargetCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ClinicalEndoscopyCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ClinicalEndoscopyCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
EMBOPressCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ EMBOPressCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ASPBCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ASPBCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
BiomedGridCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ BiomedGridCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
NRRCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ NRRCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
YMJCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ YMJCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
TheLancetCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ TheLancetCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
FutureScienceCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ FutureScienceCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ScitationCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ScitationCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
IOPScienceCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
Todo
Deal with ShieldSquare Captcha.
-
crawl_type
¶ IOPScienceCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
AACRPublicationsCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
- URL:
-
crawl_type
¶ AACRPublicationsCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
class
gummy.journals.
PsycNetCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ PsycNetCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
make_elements_visible
(driver)[source]¶ Make all elements of the page visible.
- Parameters
driver (WebDriver) – Selenium WebDriver.
-
static
is_request_successful
(soup)[source]¶ Check whether the request was successful.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
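PsycNetCrawler's is_request_successful inspects the parsed page to decide whether the request returned usable content. The documentation does not state what it looks for, so the sketch below only illustrates the general pattern: it works on a raw HTML string rather than a BeautifulSoup object, and the "error-page" class name is a hypothetical marker chosen for this example, not something taken from gummy or from the PsycNet site.

```python
import re

# Hypothetical marker: a class attribute containing the token "error-page".
ERROR_MARKER = re.compile(r'class="[^"]*\berror-page\b[^"]*"')


def is_request_successful(html):
    """Return True when the (assumed) error marker is absent from the page."""
    return ERROR_MARKER.search(html) is None


ok_page = '<div class="article">Full text</div>'
error_page = '<div class="content error-page">Not found</div>'

print(is_request_successful(ok_page))
print(is_request_successful(error_page))
```

A check of this kind lets a crawler stop early instead of trying to extract sections from an error page.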
-
-
class
gummy.journals.
MinervaMedicaCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ MinervaMedicaCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
make_elements_visible
(driver)[source]¶ Make all elements of the page visible.
- Parameters
driver (WebDriver) – Selenium WebDriver.
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
JNeurosciCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ JNeurosciCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
static
get_soup_url
(url)[source]¶ Convert the URL to the URL of the web page you access when
crawl_type=="soup"
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
HindawiCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ HindawiCrawler's
defaultcrawl_type
is"soup"
.- Type
str
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-
-
class
gummy.journals.
ChemRxivCrawler
(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶ Bases:
gummy.journals.GummyAbstJournal
-
crawl_type
¶ ChemRxivCrawler's
defaultcrawl_type
is"pdf"
.- Type
str
-
static
get_pdf_url
(url)[source]¶ Convert the URL to the URL of the PDF page you access when
crawl_type=="pdf"
-
get_title_from_soup
(soup)[source]¶ Get page title from page source.
- Parameters
soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
- Returns
A page title.
- Return type
str
-