gummy.journals module

These classes get contents from paper pages (HTML) or files (PDF, TeX).

Supported journals are listed in the wiki (Supported journals · iwasakishuto/Translation-Gummy Wiki). If you would like support for a new journal, please send a request via Twitter DM or GitHub issues.

You can get (import) a journal crawler class in either of the following ways.

>>> from gummy import journals
>>> crawler = journals.get("nature")
>>> crawler
<gummy.journals.NatureCrawler at 0x1256777c0>
>>> from gummy.journals import NatureCrawler
>>> nature = NatureCrawler()
>>> nature
<gummy.journals.NatureCrawler at 0x1253da9a0>
>>> crawler = journals.get(nature)
>>> id(crawler) == id(nature)
True
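The session above suggests that journals.get accepts either a string identifier or an existing crawler instance, and returns the instance unchanged in the latter case. A minimal sketch of that lookup pattern follows; `_REGISTRY`, `get_crawler`, and the stub classes are hypothetical stand-ins, not part of gummy.

```python
class BaseCrawler:
    """Hypothetical stand-in for a crawler base class."""


class NatureLikeCrawler(BaseCrawler):
    """Hypothetical stand-in for a journal-specific crawler."""


# Hypothetical registry mapping identifiers to crawler classes.
_REGISTRY = {"nature": NatureLikeCrawler}


def get_crawler(identifier, **kwargs):
    """Return a crawler for a string identifier, or pass an instance through."""
    if isinstance(identifier, BaseCrawler):
        return identifier  # already an instance: return the same object
    if isinstance(identifier, str):
        try:
            return _REGISTRY[identifier.lower()](**kwargs)
        except KeyError:
            raise ValueError(f"unknown journal identifier: {identifier!r}") from None
    raise TypeError("identifier must be a str or a crawler instance")


crawler = get_crawler("nature")
same = get_crawler(crawler)
print(type(crawler).__name__, same is crawler)
```

Passing an instance through unchanged lets calling code accept "a name or an object" without checking which one it received.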

class gummy.journals.GummyAbstJournal(crawl_type='soup', gateway='useless', sleep_for_loading=3, verbose=True, DecomposeTexTags=['<cit.>', '\xa0', '<ref>'], DecomposeSoupTags=['link', 'meta', 'noscript', 'script', 'style', 'sup'], subheadTags=[], **kwargs)[source]

Bases: object

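If you want to define your own journal crawler, inherit from this class and override the methods your journal needs. The crawlers documented below typically override get_title_from_soup, get_sections_from_soup, and get_head_from_section.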

Parameters
  • crawl_type (str) – Crawling type. If not specified, the recommended crawling type is used.

  • gateway (str, GummyGateWay) – Identifier of the Gummy Gateway Class. See gateways. (default= "useless" )

  • sleep_for_loading (int) – Number of seconds to wait for a web page to load. (default= 3 )

  • verbose (bool) – Whether to print output. (default= True )

  • DecomposeTexTags (list) – TeX tags to be removed in advance for easier analysis. (default= ["<cit.>","\xa0","<ref>"] )

  • DecomposeSoupTags (list) – HTML tags to be removed in advance for easier analysis. (default= ["link","meta","noscript","script","style","sup"] )

  • subheadTags (list) – HTML tag names used to identify subheadings.

  • kwargs (dict) – Currently unused.
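The subclassing pattern can be sketched without gummy installed. The example below assumes a base class that fixes the pipeline and calls title, section, and head hooks in turn. `CrawlerSketch`, `ExampleJournalCrawler`, `get_body_from_section`, and `PAGE` are hypothetical. Plain strings and regular expressions stand in for BeautifulSoup objects so that the code runs with the standard library only. It is not gummy's actual implementation.

```python
import re


class CrawlerSketch:
    """Hypothetical base class: the pipeline is fixed, subclasses fill in the hooks."""

    def get_contents_soup(self, html):
        title = self.get_title_from_soup(html)
        contents = []
        for section in self.get_sections_from_soup(html):
            contents.append({
                "head": self.get_head_from_section(section),
                "raw": self.get_body_from_section(section),
            })
        return title, contents

    def get_body_from_section(self, section):
        # Hypothetical helper: drop the heading, strip tags, collapse whitespace.
        body = re.sub(r"<h2>.*?</h2>", "", section, flags=re.S)
        return " ".join(re.sub(r"<[^>]+>", " ", body).split())

    # Hooks that a journal-specific subclass is expected to provide.
    def get_title_from_soup(self, html):
        raise NotImplementedError

    def get_sections_from_soup(self, html):
        raise NotImplementedError

    def get_head_from_section(self, section):
        raise NotImplementedError


class ExampleJournalCrawler(CrawlerSketch):
    """Hypothetical crawler for pages shaped like PAGE below."""

    def get_title_from_soup(self, html):
        m = re.search(r"<h1>(.*?)</h1>", html, flags=re.S)
        return m.group(1).strip() if m else "No title"

    def get_sections_from_soup(self, html):
        return re.findall(r"<section>(.*?)</section>", html, flags=re.S)

    def get_head_from_section(self, section):
        m = re.search(r"<h2>(.*?)</h2>", section, flags=re.S)
        return m.group(1).strip() if m else ""


PAGE = (
    "<h1>A Sample Paper</h1>"
    "<section><h2>Abstract</h2><p>Short summary.</p></section>"
    "<section><h2>Methods</h2><p>Step one.</p><p>Step two.</p></section>"
)
title, contents = ExampleJournalCrawler().get_contents_soup(PAGE)
print(title)
print(contents)
```

In the documented API the hooks receive BeautifulSoup objects and get_head_from_section returns a tag rather than a string; the sketch simplifies both.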

crawling_logs

Crawling logs.

Type

dict

property class_name

Same as self.__class__.__name__.

property name

Journal service name.

property journal_type

Journal Type.

property default_title

Default title.

get_contents(url, driver=None, crawl_type=None, **gatewaykwargs)[source]

Get contents using the method determined by crawl_type.

Parameters
  • url (str) – URL of a paper or path/to/local.file.

  • driver (WebDriver) – Selenium WebDriver.

  • crawl_type (str) – Crawling type. If not specified, the recommended crawling type is used.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

Examples

>>> from gummy import journals
>>> crawler = journals.get("nature")
>>> title, texts = crawler.get_contents(url="https://www.nature.com/articles/ncb0800_500")
Crawling Type: soup
    :
>>> print(title)
Formation of the male-specific muscle in female by ectopic expression
>>> print(texts[:1])
[{'head': 'Abstract', 'raw': 'The  () gene product Fru has been ... for the sexually dimorphic actions of the gene.'}]
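get_contents selects a crawling method according to crawl_type. One common way to implement such a dispatch is a name lookup with `getattr`. The sketch below assumes that pattern and uses a hypothetical `DispatchDemo` class with stubbed `get_contents_*` methods, so it does not reflect gummy's actual implementation.

```python
class DispatchDemo:
    """Hypothetical class illustrating dispatch on crawl_type."""

    def __init__(self, crawl_type="soup"):
        self.crawl_type = crawl_type  # recommended type for this crawler

    def get_contents(self, url, crawl_type=None):
        # Fall back to the crawler's recommended type when none is given.
        crawl_type = crawl_type or self.crawl_type
        method = getattr(self, f"get_contents_{crawl_type}", None)
        if method is None:
            raise ValueError(f"unsupported crawl_type: {crawl_type!r}")
        return method(url)

    # Stubs standing in for the real crawling methods.
    def get_contents_soup(self, url):
        return ("title from HTML", [{"head": "Abstract", "raw": url}])

    def get_contents_pdf(self, url):
        return ("title from PDF", [{"head": "Page 1", "raw": url}])


demo = DispatchDemo()
title, contents = demo.get_contents("https://example.com/paper")
print(title)
print(demo.get_contents("paper.pdf", crawl_type="pdf")[0])
```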

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

get_contents_soup(url, driver=None, **gatewaykwargs)[source]

Get contents from url of the web page using BeautifulSoup.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

make_elements_visible(driver)[source]

Make all elements of the page visible.

Parameters

driver (WebDriver) – Selenium WebDriver.

decompose_soup_tags(soup)[source]

Not every journal needs this step, but removing the tags listed in DecomposeSoupTags from the soup makes debugging easier.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

soup, a dict showing the number of decomposed tags.

Return type

tuple (BeautifulSoup, dict)
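The idea of removing unwanted tags and reporting how many were removed can be illustrated as follows. This is a simplified sketch that operates on an HTML string with regular expressions rather than on a BeautifulSoup object. `decompose_tags` is a hypothetical helper, and regex-based HTML handling is fragile and used here only to keep the example free of third-party dependencies.

```python
import re


def decompose_tags(html, tag_names):
    """Remove every element named in tag_names; return (html, counts)."""
    counts = {}
    for name in tag_names:
        n = re.escape(name)
        # Match paired elements (<script>...</script>) or single tags (<meta ...>).
        pattern = re.compile(
            rf"<{n}\b[^>]*>.*?</{n}\s*>|<{n}\b[^>]*/?>",
            flags=re.S | re.I,
        )
        html, removed = pattern.subn("", html)
        counts[name] = removed
    return html, counts


html = (
    '<meta charset="utf-8">'
    "<p>Energy<sup>1</sup> matters.</p>"
    "<script>var a = 1;</script>"
)
cleaned, counts = decompose_tags(html, ["meta", "script", "sup"])
print(cleaned)
print(counts)
```

Returning the counts alongside the cleaned document makes it easy to see which tag names actually occurred on a given page.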

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

get_contents_from_soup_sections(soup_sections)[source]

Get contents from each soup section.

Parameters

soup_sections (list) – Each element is a bs4.element.Tag.

Returns

Each element is a dict whose keys are among ["raw", "head", "subhead", "img"].

Return type

list

organize_soup_section(section, head='', head_is_not_added=True)[source]

Organize soup section:

  • Extract images and embed them in the section in base64 format.

  • Add head only to the initial content.

Parameters
  • section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

  • head (str) – Head word.

  • head_is_not_added (bool) – Whether the head has not yet been added. (default= True)
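The "add head only to the initial content" rule can be sketched as below. `attach_head` is a hypothetical helper working on a list of paragraph strings, not the method's real body; it only shows how a head_is_not_added flag keeps the heading on the first entry.

```python
def attach_head(head, paragraphs):
    """Build content dicts, attaching `head` only to the first entry."""
    contents = []
    head_is_not_added = True
    for text in paragraphs:
        entry = {"raw": text}
        if head_is_not_added and head:
            entry["head"] = head
            head_is_not_added = False  # later entries get no head
        contents.append(entry)
    return contents


print(attach_head("Results", ["First paragraph.", "Second paragraph."]))
```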

static arrange_english(english)[source]

Get rid of extra characters from body (english). This method is used in arrange_english.

Parameters

english (str) – Raw English.

Returns

Arranged English

Return type

str
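Which characters arrange_english removes is not stated here. As an illustration only, the following sketch shows one plausible kind of cleanup: normalizing non-breaking spaces, collapsing whitespace, and removing spaces before punctuation. `tidy_english` is a hypothetical function, not the library's implementation.

```python
import re


def tidy_english(text):
    """Normalize whitespace in extracted English text."""
    text = text.replace("\xa0", " ")            # non-breaking space to plain space
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace and newlines
    text = re.sub(r"\s+([,.;:])", r"\1", text)  # no space before punctuation
    return text.strip()


print(tidy_english("The  gene\nproduct ,\xa0as shown ."))
```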

static get_tex_url(url)[source]

Convert the URL to the URL of the tex page you access when crawl_type=="tex"

get_contents_tex(url, driver=None)[source]

Get contents from url by parsing TeX sources.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_tex_source(url, driver=None)[source]

Download and get tex source from url.

Parameters
  • url (str) – URL of a tex source or path/to/local.tex.

  • driver (WebDriver) – Selenium WebDriver.

Returns

Plain text in tex source.

Return type

str

get_title_from_tex(tex)[source]

Get a title from tex source.

Parameters

tex (str) – Plain text in tex source.

Returns

TeX title.

Return type

str
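A minimal sketch of extracting a title from TeX source, assuming the title is given by a `\title{...}` command without nested braces. `title_from_tex` and its `default` argument are hypothetical, and the real method may handle more cases.

```python
import re


def title_from_tex(tex, default="No title"):
    """Return the argument of the first \\title{...} command, or `default`."""
    m = re.search(r"\\title\s*\{([^{}]*)\}", tex)
    # Collapse line breaks and repeated spaces inside the title.
    return " ".join(m.group(1).split()) if m else default


tex = "\\documentclass{article}\n\\title{A Study of\n  Things}\n\\begin{document}"
print(title_from_tex(tex))
```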

get_sections_from_tex(tex)[source]

Get sections from tex source.

Parameters

tex (str) – Plain text in tex source.

Returns

Each element is plain text (str)

Return type

list
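Splitting TeX source into sections can be sketched with `re.split` on `\section{...}` commands. `sections_from_tex` is a hypothetical helper that assumes headings contain no nested braces. Unlike the documented method, which returns plain-text elements, it returns (heading, body) pairs to make the split visible.

```python
import re


def sections_from_tex(tex):
    """Split TeX source into (heading, body) pairs at each \\section command."""
    parts = re.split(r"\\section\*?\s*\{([^{}]*)\}", tex)
    # parts = [preamble, head1, body1, head2, body2, ...]
    return [
        (parts[i].strip(), parts[i + 1].strip())
        for i in range(1, len(parts) - 1, 2)
    ]


tex = "\\title{T}\n\\section{Intro}\nHello.\n\\section*{Method}\nWorld."
print(sections_from_tex(tex))
```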

get_contents_from_tex_sections(tex_sections)[source]

Get text for each tex section.

Parameters

tex_sections (list) – Each element is plain text (str).

Returns

Each element is a dict whose keys are among ["raw", "head", "subhead", "img"].

Return type

list

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_contents_pdf(url, driver=None)[source]

Get contents from url by parsing PDF file.

Parameters
  • url (str) – URL of a paper or path/to/local.pdf.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_pdf_source(url, driver=None)[source]

Download and get PDF source from url.

Parameters
  • url (str) – URL of a PDF file or path/to/local.pdf.

  • driver (WebDriver) – Selenium WebDriver.

Returns

Each element is text (str) in a page of PDF file.

Return type

list

get_title_from_pdf(pdf_pages)[source]

Get title from PDF source.

Parameters

pdf_pages (list) – Each element is text (str) in a page of PDF file.

Returns

PDF title.

Return type

str

get_contents_from_pdf_pages(pdf_pages)[source]

Get contents from each page.

Parameters

pdf_pages (list) – Each element is text (str) in a page of PDF file.

Returns

(title, content)

Return type

tuple (str, list)

class gummy.journals.PDFCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_contents_pdf(url, driver=None)[source]

Get contents from url by parsing PDF file.

Parameters
  • url (str) – URL of a paper or path/to/local.pdf.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

class gummy.journals.NatureCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

NatureCrawler's default crawl_type is "soup".

Type

str

AvoidAriaLabel

Markers indicating the extra section to remove in get_sections_from_soup

Type

list
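Filtering sections by their aria-label attribute can be sketched with the standard-library HTML parser. The label values in `AVOID_ARIA_LABEL` and the helper names below are hypothetical; the actual contents of AvoidAriaLabel are not listed here.

```python
from html.parser import HTMLParser


class SectionLabelParser(HTMLParser):
    """Collect the aria-label of every <section> element, in document order."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.labels.append(dict(attrs).get("aria-label"))


# Hypothetical markers for sections that should be skipped.
AVOID_ARIA_LABEL = ["References", "Author information"]


def kept_section_labels(html, avoid=AVOID_ARIA_LABEL):
    """Return the labels of sections whose aria-label is not in `avoid`."""
    parser = SectionLabelParser()
    parser.feed(html)
    return [label for label in parser.labels if label not in avoid]


page = (
    '<section aria-label="Abstract"><p>a</p></section>'
    '<section aria-label="References"><p>r</p></section>'
    '<section aria-label="Main"><p>m</p></section>'
)
print(kept_section_labels(page))
```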

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.arXivCrawler(sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

arXivCrawler's default crawl_type is "pdf".

Type

str

AvoidAriaLabel

Markers indicating the extra section to remove in get_sections_from_pdf

Type

list

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_tex_url(url)[source]

Convert the URL to the URL of the tex page you access when crawl_type=="tex"
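The three URL converters can be illustrated for arXiv as follows. The sketch assumes arXiv's `abs`, `pdf`, and `e-print` URL layout and new-style identifiers such as 1706.03762. `arxiv_number` and `arxiv_urls` are hypothetical helpers, not the crawler's actual methods, and old-style identifiers (for example hep-th/9901001) are not handled.

```python
import re


def arxiv_number(url):
    """Extract a new-style arXiv identifier (optionally versioned) from a URL."""
    m = re.search(r"\d{4}\.\d{4,5}(?:v\d+)?", url)
    if m is None:
        raise ValueError(f"no arXiv identifier found in {url!r}")
    return m.group(0)


def arxiv_urls(url):
    """Map each crawl type to the URL that would be accessed for it."""
    no = arxiv_number(url)
    return {
        "soup": f"https://arxiv.org/abs/{no}",
        "pdf": f"https://arxiv.org/pdf/{no}",
        "tex": f"https://arxiv.org/e-print/{no}",
    }


print(arxiv_urls("https://arxiv.org/abs/1706.03762"))
```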

static get_arXivNo(url)[source]

get_sections_from_tex(tex)[source]

Get sections from tex source.

Parameters

tex (str) – Plain text in tex source.

Returns

Each element is plain text (str)

Return type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

class gummy.journals.NCBICrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

NCBICrawler's default crawl_type is "soup".

Type

str

static arrange_english(english)[source]

Get rid of extra characters from body (english). This method is used in arrange_english.

Parameters

english (str) – Raw English.

Returns

Arranged English

Return type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.PubMedCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

PubMedCrawler's default crawl_type is "soup".

Type

str

AvoidIdsPatterns

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_contents_soup(url, driver=None, **gatewaykwargs)[source]

Get contents from url of the web page using BeautifulSoup.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.OxfordAcademicCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

OxfordAcademicCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ScienceDirectCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ScienceDirectCrawler's default crawl_type is "soup".

Type

str

decompose_soup_tags(soup)[source]

Decompose <div class="dropBlock reference-citations">

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.SpringerCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

SpringerCrawler's default crawl_type is "soup".

Type

str

AvoidAriaLabel

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.MDPICrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

MDPICrawler's default crawl_type is "soup".

Type

str

AvoidAriaLabel

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.UniOKLAHOMACrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

UniOKLAHOMACrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_contents_from_soup_sections(soup_sections)[source]

Get contents from each soup section.

Parameters

soup_sections (list) – Each element is a bs4.element.Tag.

Returns

Each element is a dict whose keys are among ["raw", "head", "subhead", "img"].

Return type

list

class gummy.journals.LungCancerCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

LungCancerCrawler's default crawl_type is "soup".

Type

str

AvoidHead

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.CellPressCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

CellPressCrawler's default crawl_type is "soup".

Type

str

AvoidDataLeftHandNavs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

decompose_soup_tags(soup)[source]

Decompose <div class="dropBlock reference-citations">

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.WileyOnlineLibraryCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

WileyOnlineLibraryCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.JBCCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

JBCCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.BiologistsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

BiologistsCrawler's default crawl_type is "soup".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.BioMedCentralCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

BioMedCentralCrawler's default crawl_type is "soup".

Type

str

AvoidAriaLabel

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.IEEEXploreCrawler(gateway='useless', sleep_for_loading=10, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

IEEEXploreCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.JSTAGECrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

JSTAGECrawler's default crawl_type is "soup".

Type

str

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ACSPublicationsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ACSPublicationsCrawler's default crawl_type is "soup".

Type

str

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.StemCellsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

StemCellsCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.UniKeioCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

UniKeioCrawler's default crawl_type is "soup".

Type

str

get_contents_soup(url, driver=None, **gatewaykwargs)[source]

Get contents from url of the web page using BeautifulSoup.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

(title, content)

Return type

tuple (str, list)

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_contents_from_soup_sections(soup_sections)[source]

Get contents from each soup section.

Parameters

soup_sections (list) – Each element is a bs4.element.Tag.

Returns

Each element is a dict whose keys are among ["raw", "head", "subhead", "img"].

Return type

list

class gummy.journals.PLOSONECrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

PLOSONECrawler's default crawl_type is "soup".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag
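To make the idea of a section head concrete, here is a small sketch that returns the first heading element found inside a section. It is an illustration only: the standard library's ElementTree stands in for bs4.element.Tag so the block runs without extra packages, and HEAD_TAGS is an assumed set of heading names rather than what any gummy crawler actually looks for.

```python
import xml.etree.ElementTree as ET

HEAD_TAGS = ("h1", "h2", "h3", "h4")  # assumed heading tag names


def get_head_from_section(section):
    """Return the first heading element inside ``section``, or None."""
    for element in section.iter():
        if element.tag in HEAD_TAGS:
            return element
    return None


section = ET.fromstring(
    "<div id='sec1'><h2>Introduction</h2><p>Some text.</p></div>"
)
head = get_head_from_section(section)
print(head.tag, head.text)
```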

class gummy.journals.frontiersCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

frontiersCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.RNAjournalCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

RNAjournalCrawler's default crawl_type is "soup".

Type

str

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.IntechOpenCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

IntechOpenCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.NRCResearchPressCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

NRCResearchPressCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.SpandidosCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

SpandidosCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.TaylorandFrancisOnlineCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

TaylorandFrancisOnlineCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.bioRxivCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

bioRxivCrawler's default crawl_type is "soup".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.RSCPublishingCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

RSCPublishingCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.JSSECrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

JSSECrawler's default crawl_type is "pdf".

Type

str

static get_jsseNo(url)[source]

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"
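The three static methods above suggest a pattern of extracting an identifier from the URL and rebuilding the HTML and PDF addresses from it. The sketch below shows that pattern under assumed conditions: the host journal.example.org, the /papers/<number> layout, and the helper name get_paper_id are all hypothetical and do not reflect the journal's real URL scheme or the behavior of get_jsseNo.

```python
import re


def get_paper_id(url):
    """Extract a numeric paper ID from a URL (hypothetical pattern)."""
    match = re.search(r"/papers/(\d+)", url)
    if match is None:
        raise ValueError(f"No paper ID found in {url!r}")
    return match.group(1)


def get_soup_url(url):
    """Build the HTML page URL used when crawl_type == "soup"."""
    return f"https://journal.example.org/papers/{get_paper_id(url)}/html"


def get_pdf_url(url):
    """Build the PDF URL used when crawl_type == "pdf"."""
    return f"https://journal.example.org/papers/{get_paper_id(url)}/pdf"


url = "https://journal.example.org/papers/1234/abstract"
print(get_soup_url(url))
print(get_pdf_url(url))
```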

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ScienceAdvancesCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ScienceAdvancesCrawler's default crawl_type is "soup".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.medRxivCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

medRxivCrawler's default crawl_type is "pdf".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"
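Because a crawler exposes one converter per crawl type, the choice between them can be expressed as a lookup keyed by crawl_type. This is a sketch of that idea only. The preprints.example.org host, the ".pdf" suffix rule, and the names to_soup_url, to_pdf_url, and resolve_url are hypothetical and are not the library's actual logic.

```python
def to_soup_url(url):
    """Hypothetical rule: the HTML page is the PDF URL without '.pdf'."""
    return url[: -len(".pdf")] if url.endswith(".pdf") else url


def to_pdf_url(url):
    """Hypothetical rule: the PDF lives at the HTML URL plus '.pdf'."""
    return url if url.endswith(".pdf") else url + ".pdf"


CONVERTERS = {"soup": to_soup_url, "pdf": to_pdf_url}


def resolve_url(url, crawl_type="pdf"):
    """Apply the converter that matches ``crawl_type``."""
    if crawl_type not in CONVERTERS:
        raise ValueError(f"Unsupported crawl_type: {crawl_type!r}")
    return CONVERTERS[crawl_type](url)


page = "https://preprints.example.org/content/abc123v1"
pdf = resolve_url(page, "pdf")
print(pdf)
print(resolve_url(pdf, "soup"))
```

Both converters are idempotent here, so passing a URL that is already in the target form leaves it unchanged.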

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ACLAnthologyCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ACLAnthologyCrawler's default crawl_type is "pdf".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.PNASCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

PNASCrawler's default crawl_type is "soup".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.AMSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

AMSCrawler's default crawl_type is "soup".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_soup

Type

list

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ACMCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

NOTE: If you want to download the PDF, you must run the driver with a browser.

URL:
crawl_type

ACMCrawler's default crawl_type is "pdf".

Type

str

AvoidIDs

Markers indicating the extra section to remove in get_sections_from_pdf

Type

list

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.APSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

APSCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

make_elements_visible(driver)[source]

Make all elements of the page visible.

Parameters

driver (WebDriver) – Selenium WebDriver.
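One plausible way to make hidden elements visible is to run a short script in the browser through the driver's execute_script method, which Selenium WebDriver provides. The sketch below is an assumption about the approach, not the crawler's actual code: SHOW_ALL_JS is a hypothetical script, and FakeDriver is a test double that records scripts so the block runs without a browser.

```python
class FakeDriver:
    """Records scripts instead of running them in a browser (test double)."""

    def __init__(self):
        self.scripts = []

    def execute_script(self, script, *args):
        self.scripts.append(script)


# Hypothetical script: reveal every element hidden via inline display:none.
SHOW_ALL_JS = (
    "document.querySelectorAll('[style*=\"display: none\"]')"
    ".forEach(function (el) { el.style.display = 'block'; });"
)


def make_elements_visible(driver):
    """Ask the browser to reveal hidden elements before the page is parsed."""
    driver.execute_script(SHOW_ALL_JS)


driver = FakeDriver()
make_elements_visible(driver)
print(len(driver.scripts))
```

With a real Selenium WebDriver in place of FakeDriver, the same function would run the script against the loaded page.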

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ASIPCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ASIPCrawler's default crawl_type is "soup".

Type

str

static get_asipID(url)[source]

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_soup_source(url, driver=None, **gatewaykwargs)[source]

Scrape and get page source from url.

Parameters
  • url (str) – URL of a paper.

  • driver (WebDriver) – Selenium WebDriver.

  • gatewaykwargs (dict) – Gateway keyword arguments. See passthrough.

Returns

A data structure representing a parsed HTML or XML document.

Return type

BeautifulSoup

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.AnatomyPubsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

AnatomyPubsCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.RenalPhysiologyCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

RenalPhysiologyCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.GeneticsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

GeneticsCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.GeneDevCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

GeneDevCrawler's default crawl_type is "pdf".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.JAMANetworkCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

JAMANetworkCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.SAGEjournalsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

SAGEjournalsCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.MolCellBioCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

MolCellBioCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.JKMSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

JKMSCrawler's default crawl_type is "soup".

Type

str

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.JKNSCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

JKNSCrawler's default crawl_type is "soup".

Type

str

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.BioscienceCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

BioscienceCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.RadioGraphicsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

RadioGraphicsCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the URL to the URL of the web page you access when crawl_type=="soup"

static get_pdf_url(url)[source]

Convert the URL to the URL of the PDF page you access when crawl_type=="pdf"

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is (bs4.element.Tag)

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.PediatricSurgeryCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

PediatricSurgeryCrawler's default crawl_type is "soup".

Type

str

AvoidDataLeftHandNavs

Markers identifying the extra sections to remove in get_sections_from_soup.

Type

list
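
Filtering sections against a list of markers could be sketched as below. Everything specific here is an assumption: the attribute name data-left-hand-nav is only guessed from the name AvoidDataLeftHandNavs, the marker values are invented, keep_sections is a hypothetical helper, and xml.etree.ElementTree stands in for BeautifulSoup so the snippet has no third-party dependency.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker values; the real list is not shown in this document.
AVOID_MARKERS = ["References", "Supplementary data"]


def keep_sections(root, avoid=AVOID_MARKERS):
    """Return the <section> elements whose head marker is not in `avoid`."""
    kept = []
    for section in root.iter("section"):
        head = section.find("h2")
        # Assumed attribute name, inferred from "AvoidDataLeftHandNavs".
        marker = head.get("data-left-hand-nav") if head is not None else None
        if marker not in avoid:
            kept.append(section)
    return kept


page = ET.fromstring(
    "<article>"
    '<section><h2 data-left-hand-nav="Introduction">Introduction</h2></section>'
    '<section><h2 data-left-hand-nav="References">References</h2></section>'
    "</article>"
)
kept = keep_sections(page)
kept_heads = [section.find("h2").text for section in kept]
print(kept_heads)
```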

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.AGUPublicationsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

AGUPublicationsCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the given URL to the URL of the web page that is accessed when crawl_type=="soup".

static get_pdf_url(url)[source]

Convert the given URL to the URL of the PDF page that is accessed when crawl_type=="pdf".

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.NEJMCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

NEJMCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the given URL to the URL of the web page that is accessed when crawl_type=="soup".

static get_pdf_url(url)[source]

Convert the given URL to the URL of the PDF page that is accessed when crawl_type=="pdf".

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.LWWJournalsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

LWWJournalsCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ARVOJournalsCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ARVOJournalsCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.LearningMemoryCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

LearningMemoryCrawler's default crawl_type is "soup".

Type

str

static get_soup_url(url)[source]

Convert the given URL to the URL of the web page that is accessed when crawl_type=="soup".

static get_pdf_url(url)[source]

Convert the given URL to the URL of the PDF page that is accessed when crawl_type=="pdf".

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ScienceMagCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ScienceMagCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.PsyChiArtistCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

PsyChiArtistCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.OncotargetCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

OncotargetCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ClinicalEndoscopyCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ClinicalEndoscopyCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.EMBOPressCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

EMBOPressCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ASPBCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ASPBCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.BiomedGridCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

BiomedGridCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.NRRCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

NRRCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.YMJCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

YMJCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.TheLancetCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

TheLancetCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.FutureScienceCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

FutureScienceCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.ScitationCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:
crawl_type

ScitationCrawler's default crawl_type is "soup".

Type

str

static get_pdf_url(url)[source]

Convert the given URL to the URL of the PDF page that is accessed when crawl_type=="pdf".

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag

class gummy.journals.IOPScienceCrawler(gateway='useless', sleep_for_loading=3, verbose=True, **kwargs)[source]

Bases: gummy.journals.GummyAbstJournal

URL:

Todo

Deal with ShieldSquare Captcha.
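
One possible first step toward this Todo is to detect the challenge page before trying to parse it. The sketch below is only an assumption-laden illustration: looks_like_captcha and the marker strings are hypothetical, the real challenge page may not contain these words, and nothing here solves or bypasses the captcha.

```python
# Hypothetical marker strings; adjust after inspecting a real challenge page.
CAPTCHA_MARKERS = ("shieldsquare", "captcha")


def looks_like_captcha(html):
    """Return True if the page source contains any assumed captcha marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


blocked = "<html><title>ShieldSquare Captcha</title></html>"
normal = "<html><title>A Toy Paper</title></html>"
print(looks_like_captcha(blocked), looks_like_captcha(normal))
```

A plain substring check can misfire on an article that itself discusses captchas, so a real check would likely need a more specific marker.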

crawl_type

IOPScienceCrawler's default crawl_type is "soup".

Type

str

get_title_from_soup(soup)[source]

Get page title from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

A page title.

Return type

str

get_sections_from_soup(soup)[source]

Get sections from page source.

Parameters

soup (BeautifulSoup) – A data structure representing a parsed HTML or XML document.

Returns

Page sections. Each element is a bs4.element.Tag.

Return type

list

get_head_from_section(section)[source]

Get head from a page section.

Parameters

section (bs4.element.Tag) – Represents an HTML or XML tag that is part of a parse tree, along with its attributes and contents.

Returns

A section head tag.

Return type

bs4.element.Tag