gummy.journals module¶

These classes extract contents from paper pages (HTML) or files (PDF, TeX).

Supported journals are listed on the project wiki (Supported journals · iwasakishuto/Translation-Gummy Wiki). To request support for a new journal, send a Twitter DM or open a GitHub issue.

You can obtain a journal crawler class in either of the following ways.

>>> from gummy import journals
>>> crawler = journals.get("nature")
>>> crawler
<gummy.journals.NatureCrawler at 0x1256777c0>
>>> from gummy.journals import NatureCrawler
>>> nature = NatureCrawler()
>>> nature
<gummy.journals.NatureCrawler at 0x1253da9a0>
>>> crawler = journals.get(nature)
>>> id(crawler) == id(nature)
True
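
The doctest above shows that the lookup accepts either an identifier string or an existing crawler instance. The following is a minimal sketch of that pattern only. The `_REGISTRY` dict and the stand-in `NatureCrawler` class are hypothetical and do not reproduce gummy's internals.

```python
class NatureCrawler:
    """Hypothetical stand-in for gummy.journals.NatureCrawler."""


# Hypothetical registry mapping identifiers to crawler classes.
_REGISTRY = {"nature": NatureCrawler}


def get(identifier, **kwargs):
    """Return a crawler for a string identifier, or pass an instance through."""
    if isinstance(identifier, str):
        try:
            return _REGISTRY[identifier.lower()](**kwargs)
        except KeyError:
            raise ValueError(f"Unknown journal identifier: {identifier!r}") from None
    # Already an instance: return the very same object.
    return identifier


crawler = get("nature")
same = get(crawler)
print(same is crawler)
```

Passing instances through unchanged lets callers accept "a name or a crawler" in one parameter without extra branching.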

class gummy.journals.GummyAbstJournal(crawl_type='soup', gateway='useless', sleep_for_loading=3, verbose=True, DecomposeTexTags=['<cit.>', '\xa0', '<ref>'], DecomposeSoupTags=[{'name': 'link'}, {'name': 'meta'}, {'name': 'noscript'}, {'name': 'script'}, {'name': 'style'}, {'name': 'sup'}], subheadTags=[], **kwargs)[source]¶

Bases: object

If you want to define your own journal crawler, inherit from this class and define these methods:

if crawl_type == "tex":
- get_contents_tex(self, url, driver=None)
- (required) get_contents_tex(self, url, driver=None)
- (required) get_sections_from_tex(tex)
- (required) get_contents_from_tex_sections(tex_sections)
if crawl_type == "soup":
- get_soup_source(self, url, driver=None, **gatewaykwargs)
- get_contents_soup(self, url, driver=None, **gatewaykwargs)
- (if necessary) get_contents_from_soup_sections(self, soup_sections)
- (required) get_title_from_soup(self, soup)
- (required) get_sections_from_soup(self, soup)
- (required) get_head_from_section(self, section)
- (if necessary) make_elements_visible(self, driver)
- decompose_soup_tags(self, soup)
- organize_soup_section(self, section, head="", head_is_not_added=True)
if crawl_type == "pdf":

Parameters

crawl_type (str) – Crawling type. If not specified, the recommended crawling type is used.
gateway (str, GummyGateWay) – Identifier of the Gummy Gateway class. See gateways. (default= "useless" )
sleep_for_loading (int) – Number of seconds to wait for a web page to load. (default= 3 )
verbose (bool) – Whether to print output. (default= True )
DecomposeTexTags (list) – TeX tags to be removed in advance for easier analysis. (default= ["<cit.>","\xa0","<ref>"] )
DecomposeSoupTags (list) – HTML tags to be removed in advance for easier analysis, each given as a dict of keyword arguments. (default= [{"name": "link"}, {"name": "meta"}, {"name": "noscript"}, {"name": "script"}, {"name": "style"}, {"name": "sup"}] )
subheadTags (list) – HTML tag names used to identify subheadings.
kwargs (dict) – Currently unused.
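
The soup-related methods listed above follow a template-method layout: the base class drives the crawl and subclasses fill in the required hooks. The sketch below illustrates that layout under simplifying assumptions. `AbstJournalSketch` and `ToyCrawler` are hypothetical names, the "soup" is a plain dict rather than a BeautifulSoup object, and nothing here is the library's actual code.

```python
class AbstJournalSketch:
    """Hypothetical stand-in for GummyAbstJournal; not the library's code."""

    def __init__(self, crawl_type="soup"):
        self.crawl_type = crawl_type

    def get_contents(self, url, crawl_type=None):
        # Dispatch to get_contents_<crawl_type>, falling back to the default.
        crawl_type = crawl_type or self.crawl_type
        method = getattr(self, f"get_contents_{crawl_type}", None)
        if method is None:
            raise NotImplementedError(f"crawl_type={crawl_type!r} is not supported")
        return method(url)

    def get_contents_soup(self, url):
        soup = self.get_soup_source(url)
        title = self.get_title_from_soup(soup)
        sections = self.get_sections_from_soup(soup)
        contents = [
            {"head": self.get_head_from_section(s), "raw": s["body"]}
            for s in sections
        ]
        return title, contents

    # Subclasses are expected to provide the three hooks below.
    def get_title_from_soup(self, soup):
        raise NotImplementedError

    def get_sections_from_soup(self, soup):
        raise NotImplementedError

    def get_head_from_section(self, section):
        raise NotImplementedError


class ToyCrawler(AbstJournalSketch):
    """Toy subclass whose 'soup' is a plain dict instead of BeautifulSoup."""

    def get_soup_source(self, url):
        # No network access: return a canned document for illustration.
        return {"title": "A toy paper",
                "sections": [{"head": "Abstract", "body": "Hello."}]}

    def get_title_from_soup(self, soup):
        return soup["title"]

    def get_sections_from_soup(self, soup):
        return soup["sections"]

    def get_head_from_section(self, section):
        return section["head"]


title, contents = ToyCrawler().get_contents("https://example.com/paper")
print(title, contents)
```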

crawling_logs¶

Crawling logs.

Type: dict

property class_name¶: Same as self.__class__.__name__.

property name¶: Journal name.

property journal_type¶: Journal Type.

property default_title¶: Default title.

get_contents(url, driver=None, crawl_type=None, **gatewaykwargs)[source]¶

Get contents using the method determined by crawl_type.

Parameters

url (str) – URL of a paper or path/to/local.file.
driver (WebDriver) – Selenium WebDriver.
crawl_type (str) – Crawling type. If not specified, the recommended crawling type is used.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

Examples

>>> from gummy import journals
>>> crawler = journals.get("nature")
>>> title, texts = crawler.get_contents(url="https://www.nature.com/articles/ncb0800_500")
Crawling Type: soup
    :
>>> print(title)
Formation of the male-specific muscle in female by ectopic expression
>>> print(texts[:1])
[{'head': 'Abstract', 'raw': 'The  () gene product Fru has been ... for the sexually dimorphic actions of the gene.'}]
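
As the example shows, the content part of the return value is a list of dicts with keys such as "head" and "raw". A minimal sketch of consuming that structure follows. The sample data and the `render` helper are hypothetical and stand in for a real crawl result.

```python
# Hypothetical sample shaped like the (title, content) pair shown above.
title = "A toy title"
texts = [
    {"head": "Abstract", "raw": "First paragraph of the abstract."},
    {"raw": "A second paragraph without its own heading."},
]


def render(title, texts):
    """Flatten (title, content) into a plain-text outline."""
    lines = [title, "=" * len(title)]
    for item in texts:
        if "head" in item:  # only some entries carry a heading
            lines.append("")
            lines.append(f"[{item['head']}]")
        if "raw" in item:
            lines.append(item["raw"])
    return "\n".join(lines)


outline = render(title, texts)
print(outline)
```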

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

get_contents_soup(url, driver=None, **gatewaykwargs)[source]¶

Get contents from url of the web page using BeautifulSoup.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_soup_source(url, driver=None, **gatewaykwargs)[source]¶

Scrape and get page source from url.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

make_elements_visible(driver)[source]¶

Make all elements of the page visible.

Parameters: driver (WebDriver) – Selenium WebDriver.

decompose_soup_tags(soup)[source]¶

Not every journal needs this function, but trimming DecomposeSoupTags from the soup helps with debugging.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: The trimmed soup and a dict showing the number of decomposed tags.
Return type: tuple (BeautifulSoup, dict)
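
The idea of removing unwanted tags and reporting how many were removed can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: it uses `xml.etree.ElementTree` on well-formed markup instead of BeautifulSoup, and `decompose_tags` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def decompose_tags(root, tag_names):
    """Remove every element whose tag is in tag_names; return (root, counts)."""
    counts = {name: 0 for name in tag_names}
    # Materialise the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in counts:
                # Note: ElementTree also drops the removed element's tail text.
                parent.remove(child)
                counts[child.tag] += 1
    return root, counts


html = (
    "<html><head><meta/><style>p{}</style></head>"
    "<body><p>Text<sup>1</sup></p><script>x()</script></body></html>"
)
root, counts = decompose_tags(ET.fromstring(html), ["meta", "script", "style", "sup"])
print(counts)
```

Returning the counts alongside the tree makes it easy to see at a glance which tags a given page actually contained.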

register_decompose_soup_tags(**kwargs)[source]¶

Register additional entries in DecomposeSoupTags.

Parameters: **kwargs – Keyword arguments for the soup.find method.
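
Since each DecomposeSoupTags entry is a dict of keyword arguments, registration can be pictured as appending one more dict to the list. The sketch below assumes exactly that. `TagRegistrySketch` is a hypothetical class and the duplicate check is an illustrative choice, not documented behaviour.

```python
class TagRegistrySketch:
    """Hypothetical holder for DecomposeSoupTags; not the library's code."""

    def __init__(self):
        # Same shape as the default shown in the class signature.
        self.DecomposeSoupTags = [{"name": "script"}, {"name": "style"}]

    def register_decompose_soup_tags(self, **kwargs):
        # Each entry is a dict of keyword arguments for a later find call.
        if kwargs and kwargs not in self.DecomposeSoupTags:
            self.DecomposeSoupTags.append(kwargs)


reg = TagRegistrySketch()
reg.register_decompose_soup_tags(name="div", class_="advert")
reg.register_decompose_soup_tags(name="script")  # already present, ignored
print(reg.DecomposeSoupTags)
```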

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

get_contents_from_soup_sections(soup_sections)[source]¶

Get contents from each soup section.

Parameters: soup_sections (list) – Each element is a bs4.element.Tag.
Returns: Each element is a dict whose keys are drawn from ["raw", "head", "subhead", "img"].
Return type: list

organize_soup_section(section, head='', head_is_not_added=True)[source]¶

Organize soup section:

- Extract images and embed them in the section in base64 format.
- Add the head only to the first content item.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
head (str) – Section heading.
head_is_not_added (bool) – Whether the head has not yet been added. (default= True )
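
The rule of adding the head only to the first content item can be sketched as follows. `attach_head_once` is a hypothetical helper that works on plain strings rather than bs4 tags, so it shows the bookkeeping only and not the library's parsing.

```python
def attach_head_once(paragraphs, head):
    """Build content dicts, attaching `head` only to the first one."""
    contents = []
    head_is_not_added = True
    for text in paragraphs:
        item = {"raw": text}
        if head and head_is_not_added:
            item["head"] = head
            head_is_not_added = False  # later items get no head
        contents.append(item)
    return contents


contents = attach_head_once(["First paragraph.", "Second paragraph."], head="Methods")
print(contents)
```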

static arrange_english(english)[source]¶

Remove extra characters from the English body text (english).

Parameters: english (str) – Raw English.
Returns: Arranged English
Return type: str
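
The exact characters that arrange_english removes are not listed here. As a rough illustration of this kind of cleanup, the sketch below normalises whitespace in extracted text. The rules and the `tidy_english` name are assumptions chosen for the example, not the library's behaviour.

```python
import re


def tidy_english(text):
    """Normalise whitespace in extracted text (illustrative rules only)."""
    text = text.replace("\xa0", " ")            # non-breaking space -> space
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    text = re.sub(r"\s+([,.;:])", r"\1", text)  # no space before punctuation
    return text.strip()


cleaned = tidy_english("The  gene\xa0product ,\n as shown .")
print(cleaned)
```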

static get_tex_url(url)[source]¶: Convert the URL to the URL of the tex page you access when crawl_type=="tex"

get_contents_tex(url, driver=None)[source]¶

Get contents from url by parsing TeX sources.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_tex_source(url, driver=None)[source]¶

Download and get tex source from url.

Parameters

url (str) – URL of a tex source or path/to/local.tex.
driver (WebDriver) – Selenium WebDriver.

Returns

Plain text in tex source.

Return type

str

get_title_from_tex(tex)[source]¶

Get a title from tex source.

Parameters: tex (str) – Plain text in tex source.
Returns: TeX title.
Return type: str

get_sections_from_tex(tex)[source]¶

Get sections from tex source.

Parameters: tex (str) – Plain text in tex source.
Returns: Each element is plain text (str)
Return type: list
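
One plausible way to obtain a list of plain-text sections from TeX source is to split at each `\section` command. The sketch below assumes that approach. `split_tex_sections` is a hypothetical helper, and real crawlers may split on other commands or handle preambles differently.

```python
import re


def split_tex_sections(tex):
    r"""Split TeX source into chunks that each start at a \section command."""
    # The lookahead keeps the \section{...} command at the start of each chunk.
    parts = re.split(r"(?=\\section\*?\{)", tex)
    return [part.strip() for part in parts if part.strip()]


tex = r"""\title{Toy}
\section{Introduction}
Hello.
\section*{Methods}
World.
"""
sections = split_tex_sections(tex)
print(len(sections))
```

Text before the first `\section` (here the title line) ends up as its own leading chunk, which a caller may want to treat separately.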

get_contents_from_tex_sections(tex_sections)[source]¶

Get text for each tex section.

Parameters: tex_sections (list) – Each element is plain text (str).
Returns: Each element is a dict whose keys are drawn from ["raw", "head", "subhead", "img"].
Return type: list

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_contents_pdf(url, driver=None)[source]¶

Get contents from url by parsing PDF file.

Parameters

url (str) – URL of a paper or path/to/local.pdf.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_pdf_source(url, driver=None)[source]¶

Download and get PDF source from url.

Parameters

url (str) – URL of a PDF file or path/to/local.pdf.
driver (WebDriver) – Selenium WebDriver.

Returns

Each element is a list which contains [text, bbox(x0,y0,x1,y1)]

Return type

list

get_title_from_pdf(pdf_pages)[source]¶

Get title from PDF source.

Parameters: pdf_pages (list) – Each element is a list which contains [text, bbox(x0,y0,x1,y1)]
Returns: PDF title.
Return type: str

get_contents_from_pdf_pages(pdf_pages)[source]¶

Get contents from each page.

Parameters: pdf_pages (list) – Each element is a list which contains [text, bbox(x0,y0,x1,y1)]
Returns: (title, content)
Return type: tuple (str, dict)

class gummy.journals.PDFCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_contents_pdf(url, driver=None)[source]¶

Get contents from url by parsing PDF file.

Parameters

url (str) – URL of a paper or path/to/local.pdf.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

class gummy.journals.NatureCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.nature.com

crawl_type¶

NatureCrawler's default crawl_type is "soup".

Type: str

AvoidAriaLabel¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.arXivCrawler(sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://arxiv.org

crawl_type¶

arXivCrawler's default crawl_type is "pdf".

Type: str

AvoidAriaLabel¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_tex_url(url)[source]¶: Convert the URL to the URL of the tex page you access when crawl_type=="tex"

static get_arXivNo(url)[source]¶
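
get_arXivNo has no description here. As a guess at the kind of work it does, the sketch below pulls a new-style arXiv identifier out of a URL and builds the matching PDF URL. `extract_arxiv_no` and `to_pdf_url` are hypothetical helpers, old-style identifiers such as hep-th/9901001 are not handled, and the library's own logic may differ.

```python
import re

# New-style arXiv identifiers look like 2012.01234, optionally with a version (v2).
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def extract_arxiv_no(url):
    """Return the arXiv identifier found in `url`, or None."""
    match = _ARXIV_ID.search(url)
    return match.group(0) if match else None


def to_pdf_url(url):
    """Map an abstract-page URL to the matching PDF URL."""
    arxiv_no = extract_arxiv_no(url)
    if arxiv_no is None:
        raise ValueError(f"No arXiv identifier in {url!r}")
    return f"https://arxiv.org/pdf/{arxiv_no}"


pdf_url = to_pdf_url("https://arxiv.org/abs/1706.03762v5")
print(pdf_url)
```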

get_sections_from_tex(tex)[source]¶

Get sections from tex source.

Parameters: tex (str) – Plain text in tex source.
Returns: Each element is plain text (str)
Return type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

class gummy.journals.NCBICrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.ncbi.nlm.nih.gov

crawl_type¶

NCBICrawler's default crawl_type is "soup".

Type: str

static arrange_english(english)[source]¶

Remove extra characters from the English body text (english).

Parameters: english (str) – Raw English.
Returns: Arranged English
Return type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.PubMedCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://pubmed.ncbi.nlm.nih.gov

crawl_type¶

PubMedCrawler's default crawl_type is "soup".

Type: str

AvoidIdsPatterns¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_contents_soup(url, driver=None, **gatewaykwargs)[source]¶

Get contents from url of the web page using BeautifulSoup.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.OxfordAcademicCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://academic.oup.com

crawl_type¶

OxfordAcademicCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ScienceDirectCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.sciencedirect.com

crawl_type¶

ScienceDirectCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.SpringerCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://link.springer.com

crawl_type¶

SpringerCrawler's default crawl_type is "soup".

Type: str

AvoidAriaLabel¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.MDPICrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.mdpi.com

crawl_type¶

MDPICrawler's default crawl_type is "soup".

Type: str

AvoidAriaLabel¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_soup_source(url, driver=None, **gatewaykwargs)[source]¶

Scrape and get page source from url.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.UniOKLAHOMACrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.ou.edu

crawl_type¶

UniOKLAHOMACrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_contents_from_soup_sections(soup_sections)[source]¶

Get contents from each soup section.

Parameters: soup_sections (list) – Each element is a bs4.element.Tag.
Returns: Each element is a dict whose keys are drawn from ["raw", "head", "subhead", "img"].
Return type: list

class gummy.journals.LungCancerCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.lungcancerjournal.info

crawl_type¶

LungCancerCrawler's default crawl_type is "soup".

Type: str

AvoidHead¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.CellPressCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.cell.com

crawl_type¶

CellPressCrawler's default crawl_type is "soup".

Type: str

AvoidDataLeftHandNavs¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.WileyOnlineLibraryCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

crawl_type¶

WileyOnlineLibraryCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JBCCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.jbc.org

crawl_type¶

JBCCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.BiologistsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

crawl_type¶

BiologistsCrawler's default crawl_type is "soup".

Type: str

AvoidIDs¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.BioMedCentralCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

crawl_type¶

BioMedCentralCrawler's default crawl_type is "soup".

Type: str

AvoidAriaLabel¶

Markers indicating the extra section to remove in get_sections_from_soup

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.IEEEXploreCrawler(gateway='useless', sleep_for_loading=10, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://ieeexplore.ieee.org

crawl_type¶

IEEEXploreCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JSTAGECrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.jstage.jst.go.jp

crawl_type¶

JSTAGECrawler's default crawl_type is "soup".

Type: str

get_soup_source(url, driver=None, **gatewaykwargs)[source]¶

Scrape and get page source from url.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ACSPublicationsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://pubs.acs.org/

crawl_type¶

ACSPublicationsCrawler's default crawl_type is "soup".

Type: str

get_soup_source(url, driver=None, **gatewaykwargs)[source]¶

Scrape and get page source from url.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.StemCellsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://stemcellsjournals.onlinelibrary.wiley.com

crawl_type¶

StemCellsCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.UniKeioCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://keio.pure.elsevier.com

crawl_type¶

UniKeioCrawler's default crawl_type is "soup".

Type: str

get_contents_soup(url, driver=None, **gatewaykwargs)[source]¶

Get contents from url of the web page using BeautifulSoup.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_contents_from_soup_sections(soup_sections)[source]¶

Get contents from each soup section.

Parameters: soup_sections (list) – Page sections. Each element is a bs4.element.Tag.
Returns: Contents of the sections. Each element is a dict whose keys are drawn from ["raw", "head", "subhead", "img"].
Return type: list
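
A minimal sketch of how the returned list could be sanity-checked against the four documented keys. The helper name check_contents is hypothetical and not part of gummy; the values are assumed to be arbitrary objects, since this document only specifies the keys.

```python
# Keys taken from the documented return value above.
ALLOWED_KEYS = {"raw", "head", "subhead", "img"}


def check_contents(contents):
    """Return `contents` unchanged if every dict uses only documented keys."""
    for index, entry in enumerate(contents):
        unknown = set(entry) - ALLOWED_KEYS
        if unknown:
            raise ValueError(
                f"entry {index} has unexpected keys: {sorted(unknown)}"
            )
    return contents


checked = check_contents([{"head": "Introduction"}, {"raw": "Some text."}])
```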

class gummy.journals.PLOSONECrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://journals.plos.org

crawl_type¶

PLOSONECrawler's default crawl_type is "soup".

Type: str

AvoidIDs¶

Markers identifying extra sections that get_sections_from_soup removes.

Type: list
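
The idea of an avoid list can be sketched as a simple filter over section ids. Everything below is illustrative: the ids in AVOID_IDS and the sample markup are made up (the real AvoidIDs values are not listed in this document), and the standard library's xml.etree.ElementTree stands in for BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical ids; the real AvoidIDs values are not listed in this document.
AVOID_IDS = ["references", "acknowledgments"]

PAGE = """
<article>
  <div id="abstract"><h2>Abstract</h2></div>
  <div id="methods"><h2>Methods</h2></div>
  <div id="references"><h2>References</h2></div>
</article>
"""


def keep_sections(root, avoid_ids):
    """Return the child elements of `root` whose id is not in `avoid_ids`."""
    return [element for element in root if element.get("id") not in avoid_ids]


sections = keep_sections(ET.fromstring(PAGE.strip()), AVOID_IDS)
kept_ids = [element.get("id") for element in sections]
```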

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.frontiersCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.frontiersin.org

crawl_type¶

frontiersCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.RNAjournalCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://rnajournal.cshlp.org

crawl_type¶

RNAjournalCrawler's default crawl_type is "soup".

Type: str

get_soup_source(url, driver=None, **gatewaykwargs)[source]¶

Scrape and get page source from url.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.IntechOpenCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.intechopen.com

crawl_type¶

IntechOpenCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.NRCResearchPressCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.nrcresearchpress.com

crawl_type¶

NRCResearchPressCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.SpandidosCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.spandidos-publications.com

crawl_type¶

SpandidosCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.TaylorandFrancisOnlineCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.tandfonline.com

crawl_type¶

TaylorandFrancisOnlineCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.bioRxivCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.biorxiv.org

crawl_type¶

bioRxivCrawler's default crawl_type is "soup".

Type: str

AvoidIDs¶

Markers identifying extra sections that get_sections_from_soup removes.

Type: list

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.RSCPublishingCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://pubs.rsc.org

crawl_type¶

RSCPublishingCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JSSECrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.jsse.org

crawl_type¶

JSSECrawler's default crawl_type is "pdf".

Type: str

static get_jsseNo(url)[source]¶

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ScienceAdvancesCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://advances.sciencemag.org

crawl_type¶

ScienceAdvancesCrawler's default crawl_type is "soup".

Type: str

AvoidIDs¶

Markers identifying extra sections that get_sections_from_soup removes.

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.medRxivCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.medrxiv.org

crawl_type¶

medRxivCrawler's default crawl_type is "pdf".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"
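
A minimal sketch of the kind of conversion that get_soup_url and get_pdf_url perform. The suffix rule below (".full.pdf" appended to the article URL) and the article identifier in the example are assumptions made for illustration; the crawler's actual rule is not shown in this document, and the helper names to_soup_url and to_pdf_url are hypothetical.

```python
# Hypothetical suffixes; longest first so ".full.pdf" wins over ".pdf".
KNOWN_SUFFIXES = (".full.pdf", ".full", ".pdf")


def to_soup_url(url):
    """Strip a known suffix to get the (assumed) HTML article URL."""
    for suffix in KNOWN_SUFFIXES:
        if url.endswith(suffix):
            return url[: -len(suffix)]
    return url


def to_pdf_url(url):
    """Build the (assumed) PDF URL from any form of the article URL."""
    return to_soup_url(url) + ".full.pdf"


# Made-up identifier, used only to exercise the helpers.
article = "https://www.medrxiv.org/content/10.1101/0000.00.00.00000000v1"
pdf = to_pdf_url(article)
```

Normalising to one canonical form first keeps both helpers idempotent: calling to_pdf_url on a URL that is already a PDF link returns it unchanged.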

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ACLAnthologyCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.aclweb.org

crawl_type¶

ACLAnthologyCrawler's default crawl_type is "pdf".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.PNASCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.pnas.org

crawl_type¶

PNASCrawler's default crawl_type is "soup".

Type: str

AvoidIDs¶

Markers identifying extra sections that get_sections_from_soup removes.

Type: list

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.AMSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://journals.ametsoc.org

crawl_type¶

AMSCrawler's default crawl_type is "soup".

Type: str

AvoidIDs¶

Markers identifying extra sections that get_sections_from_soup removes.

Type: list

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ACMCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

NOTE: To download a PDF, you must run the driver with a browser.

URL:

https://dl.acm.org

crawl_type¶

ACMCrawler's default crawl_type is "pdf".

Type: str

AvoidIDs¶

Markers identifying extra sections that get_sections_from_pdf removes.

Type: list

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.APSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://journals.aps.org

crawl_type¶

APSCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

make_elements_visible(driver)[source]¶

Make all elements of the page visible.

Parameters: driver (WebDriver) – Selenium WebDriver.
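
One way such a hook could work is to run a short script in the browser that un-hides collapsed elements before the page source is read. The sketch below is only an illustration under that assumption: the JavaScript and the inline "display: none" selector are hypothetical, not what APSCrawler is known to do. It relies only on execute_script, which Selenium's WebDriver provides, and uses a recording test double so that it runs without a browser.

```python
# Hypothetical script: show elements hidden with an inline "display: none".
SHOW_HIDDEN_JS = """
document.querySelectorAll('[style*="display: none"]').forEach(
    function (el) { el.style.display = 'block'; }
);
"""


def make_elements_visible(driver):
    """Ask the browser behind `driver` to un-hide inline-hidden elements."""
    driver.execute_script(SHOW_HIDDEN_JS)


class RecordingDriver:
    """Test double exposing only execute_script, as a WebDriver does."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script, *args):
        # Record the script instead of running it in a real browser.
        self.scripts.append(script)


driver = RecordingDriver()
make_elements_visible(driver)
```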

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ASIPCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://ajp.amjpathol.org

crawl_type¶

ASIPCrawler's default crawl_type is "soup".

Type: str

static get_asipID(url)[source]¶

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_soup_source(url, driver=None, **gatewaykwargs)[source]¶

Scrape and get page source from url.

Parameters

url (str) – URL of a paper.
driver (WebDriver) – Selenium WebDriver.
gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.AnatomyPubsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://anatomypubs.onlinelibrary.wiley.com

crawl_type¶

AnatomyPubsCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.RenalPhysiologyCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://journals.physiology.org

crawl_type¶

RenalPhysiologyCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.GeneticsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.genetics.org

crawl_type¶

GeneticsCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.GeneDevCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://genesdev.cshlp.org

crawl_type¶

GeneDevCrawler's default crawl_type is "pdf".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JAMANetworkCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://jamanetwork.com

crawl_type¶

JAMANetworkCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.SAGEjournalsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://journals.sagepub.com

crawl_type¶

SAGEjournalsCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.MolCellBioCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://mcb.asm.org

crawl_type¶

MolCellBioCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JKMSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://jkms.org

crawl_type¶

JKMSCrawler's default crawl_type is "soup".

Type: str

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JKNSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.jkns.or.kr

crawl_type¶

JKNSCrawler's default crawl_type is "soup".

Type: str

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.BioscienceCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.bioscience.org

crawl_type¶

BioscienceCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.RadioGraphicsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://pubs.rsna.org

crawl_type¶

RadioGraphicsCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"
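
As a rough illustration of what a pair of static URL converters such as get_soup_url and get_pdf_url can look like, the sketch below rewrites a DOI-style path into a "full" or "pdf" variant. The path layout ("/doi/<kind>/<id>"), the set of kinds, and the example DOI are assumptions for this sketch and are not taken from gummy's source; the real converters may use different rules.

```python
from urllib.parse import urlsplit, urlunsplit


def to_variant_url(url, variant):
    """Rewrite '/doi/<id>' or '/doi/<kind>/<id>' as '/doi/<variant>/<id>'.

    The path layout is an assumption made for this sketch.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if not segments or segments[0] != "doi":
        return url  # unknown layout: leave the URL unchanged
    known_kinds = {"full", "pdf", "abs"}  # hypothetical path kinds
    rest = segments[1:]
    if rest and rest[0] in known_kinds:
        rest = rest[1:]  # drop the existing kind before adding the new one
    new_path = "/" + "/".join(["doi", variant] + rest)
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


def get_soup_url(url):
    """Hypothetical converter to the HTML full-text page."""
    return to_variant_url(url, "full")


def get_pdf_url(url):
    """Hypothetical converter to the PDF page."""
    return to_variant_url(url, "pdf")


example = "https://pubs.rsna.org/doi/10.0000/example.0001"  # made-up DOI
soup_url = get_soup_url(example)
pdf_url = get_pdf_url(example)
print(soup_url)
print(pdf_url)
```

Keeping the converters static and side-effect free means they can be applied to a URL before any request is made, whichever crawl_type is selected.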

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.PediatricSurgeryCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.jpedsurg.org

crawl_type¶

PediatricSurgeryCrawler's default crawl_type is "soup".

Type: str

AvoidDataLeftHandNavs¶

Markers identifying the extra sections to remove in get_sections_from_soup.

Type: list
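
One way to picture how a marker list like AvoidDataLeftHandNavs could be applied is a simple filter over candidate sections. This is a sketch under stated assumptions: the attribute name "data-left-hand-nav" is inferred from the attribute's name, the marker values and markup are invented, and xml.etree.ElementTree stands in for BeautifulSoup so the example needs only the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker values; the real list is defined on the crawler class.
AVOID_DATA_LEFT_HAND_NAVS = ["Related Articles", "References"]

# Hypothetical markup used only for illustration.
PAGE = """
<article>
  <section data-left-hand-nav="Introduction"><h2>Introduction</h2></section>
  <section data-left-hand-nav="Results"><h2>Results</h2></section>
  <section data-left-hand-nav="References"><h2>References</h2></section>
</article>
"""


def get_sections(root, avoid):
    """Return sections whose marker attribute is not in the avoid list."""
    sections = root.findall(".//section")
    return [s for s in sections if s.get("data-left-hand-nav") not in avoid]


root = ET.fromstring(PAGE.strip())
kept = [s.get("data-left-hand-nav") for s in get_sections(root, AVOID_DATA_LEFT_HAND_NAVS)]
print(kept)
```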

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.AGUPublicationsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://agupubs.onlinelibrary.wiley.com

crawl_type¶

AGUPublicationsCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.NEJMCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.nejm.org

crawl_type¶

NEJMCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.LWWJournalsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://journals.lww.com

crawl_type¶

LWWJournalsCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ARVOJournalsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

crawl_type¶

ARVOJournalsCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.LearningMemoryCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://learnmem.cshlp.org/

crawl_type¶

LearningMemoryCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ScienceMagCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://science.sciencemag.org/

crawl_type¶

ScienceMagCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.PsyChiArtistCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.psychiatrist.com/

crawl_type¶

PsyChiArtistCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.OncotargetCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.oncotarget.com/

crawl_type¶

OncotargetCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ClinicalEndoscopyCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.e-ce.org/

crawl_type¶

ClinicalEndoscopyCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.EMBOPressCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.embopress.org

crawl_type¶

EMBOPressCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ASPBCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

http://www.plantphysiol.org/

crawl_type¶

ASPBCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.BiomedGridCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://biomedgrid.com/

crawl_type¶

BiomedGridCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.NRRCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.nrronline.org/

crawl_type¶

NRRCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.YMJCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://eymj.org/

crawl_type¶

YMJCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.TheLancetCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.thelancet.com/

crawl_type¶

TheLancetCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.FutureScienceCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.future-science.com/

crawl_type¶

FutureScienceCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ScitationCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

crawl_type¶

ScitationCrawler's default crawl_type is "soup".

Type: str

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.IOPScienceCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://iopscience.iop.org/

Todo

Deal with ShieldSquare Captcha.

crawl_type¶

IOPScienceCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.AACRPublicationsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

crawl_type¶

AACRPublicationsCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.PsycNetCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://psycnet.apa.org/

crawl_type¶

PsycNetCrawler's default crawl_type is "soup".

Type: str

make_elements_visible(driver)[source]¶

Make all elements of the page visible.

Parameters: driver (WebDriver) – Selenium WebDriver.

static is_request_successful(soup)[source]¶

Check whether the request was successful.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.MinervaMedicaCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.minervamedica.it/

crawl_type¶

MinervaMedicaCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

make_elements_visible(driver)[source]¶

Make all elements of the page visible.

Parameters: driver (WebDriver) – Selenium WebDriver.

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.JNeurosciCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.jneurosci.org/

crawl_type¶

JNeurosciCrawler's default crawl_type is "soup".

Type: str

static get_soup_url(url)[source]¶: Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.HindawiCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://www.hindawi.com/

crawl_type¶

HindawiCrawler's default crawl_type is "soup".

Type: str

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source. The page is expected to have the following structure:

<article>
    <div class="xml-content"> paper content </div>
    <div class="xml-content"> References </div>
    <div class="xml-content"> Copyright </div>
</article>
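
The layout shown above suggests selecting the div elements with class "xml-content" inside the article element. The sketch below does that for the exact snippet from this entry. It uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup, and treating the first block as the paper body is an assumption read off the snippet, not a confirmed detail of HindawiCrawler.

```python
import xml.etree.ElementTree as ET

# The structure documented for get_sections_from_soup, reproduced verbatim.
PAGE = """
<article>
    <div class="xml-content"> paper content </div>
    <div class="xml-content"> References </div>
    <div class="xml-content"> Copyright </div>
</article>
"""

root = ET.fromstring(PAGE.strip())

# Direct children of <article> that carry class="xml-content".
blocks = root.findall("div[@class='xml-content']")
labels = [b.text.strip() for b in blocks]

# Assumption: the first block holds the paper body.
paper_block = blocks[0]
print(labels)
```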

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag

class gummy.journals.ChemRxivCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]¶

Bases: gummy.journals.GummyAbstJournal

URL:

https://chemrxiv.org/

crawl_type¶

ChemRxivCrawler's default crawl_type is "pdf".

Type: str

static get_pdf_url(url)[source]¶: Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]¶

Get page title from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: A page title.
Return type: str

get_sections_from_soup(soup)[source]¶

Get sections from page source.

Parameters: soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.
Returns: Page sections. Each element is a bs4.element.Tag.
Return type: list

get_head_from_section(section)[source]¶

Get head from a page section.

Parameters: section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.
Returns: A section head tag.
Return type: bs4.element.Tag
