cowidev.utils.web¶
cowidev.utils.web.download¶
- class cowidev.utils.web.download.DESAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)[source]¶
Bases: HTTPAdapter
A TransportAdapter that re-enables 3DES support in Requests.
From: https://stackoverflow.com/a/46186957/5056599
- init_poolmanager(*args, **kwargs)[source]¶
Initializes a urllib3 PoolManager.
This method should not be called from user code, and is only exposed for use when subclassing the HTTPAdapter.
- Parameters:
connections – The number of urllib3 connection pools to cache.
maxsize – The maximum number of connections to save in the pool.
block – Block when no free connections are available.
pool_kwargs – Extra keyword arguments used to initialize the Pool Manager.
- proxy_manager_for(*args, **kwargs)[source]¶
Return urllib3 ProxyManager for the given proxy.
This method should not be called from user code, and is only exposed for use when subclassing the HTTPAdapter.
- Parameters:
proxy – The proxy to return a urllib3 ProxyManager for.
proxy_kwargs – Extra keyword arguments used to configure the Proxy Manager.
- Returns:
ProxyManager
- Return type:
urllib3.ProxyManager
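The adapter above can be sketched as follows, following the linked Stack Overflow answer. The class body and cipher string here are assumptions for illustration, not cowidev's exact implementation:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

# Cipher string that keeps strong ciphers and re-admits 3DES
# (assumed; adapted from the linked Stack Overflow answer).
CIPHERS = "HIGH:!DH:!aNULL:3DES"


class DESAdapterSketch(HTTPAdapter):
    """A TransportAdapter that re-enables 3DES for legacy servers."""

    def init_poolmanager(self, *args, **kwargs):
        # Build an SSL context with the custom cipher list and hand it
        # to the urllib3 PoolManager.
        kwargs["ssl_context"] = create_urllib3_context(ciphers=CIPHERS)
        return super().init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        kwargs["ssl_context"] = create_urllib3_context(ciphers=CIPHERS)
        return super().proxy_manager_for(*args, **kwargs)


# Mount the adapter only for the host that needs the legacy ciphers.
session = requests.Session()
session.mount("https://legacy.example.org", DESAdapterSketch())
```

Mounting on a URL prefix means only requests to that host go through the custom SSL context; all other traffic keeps the default ciphers.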
- cowidev.utils.web.download.download_file_from_url(url, save_path, chunk_size=1048576, timeout=30, verify=True, ciphers_low=False, use_proxy=False)[source]¶
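A minimal stdlib sketch of what download_file_from_url does with its url, save_path, chunk_size and timeout arguments (the verify, ciphers_low and use_proxy options need a configured requests session and are left out):

```python
import shutil
import urllib.request


def download_file_sketch(url, save_path, chunk_size=1048576, timeout=30):
    """Stream the file at `url` to `save_path` in fixed-size chunks."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        with open(save_path, "wb") as f:
            # copyfileobj reads at most `chunk_size` bytes at a time, so
            # large files never have to fit in memory.
            shutil.copyfileobj(resp, f, length=chunk_size)
```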
- cowidev.utils.web.download.get_base_url(url: str, scheme='https') str [source]¶
Parse a URL and return the base URL.
- Parameters:
url (str) – URL to parse. <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
scheme ({‘http’, ‘https’}, optional) – Scheme to use. Defaults to ‘https’.
- Returns:
Base URL, in the form <scheme>://<netloc>/
- Return type:
str
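The documented behavior maps directly onto the stdlib's urllib.parse; a minimal sketch:

```python
from urllib.parse import urlparse


def get_base_url_sketch(url: str, scheme: str = "https") -> str:
    """Return `<scheme>://<netloc>/` for the given URL."""
    netloc = urlparse(url).netloc
    return f"{scheme}://{netloc}/"


get_base_url_sketch("https://example.org/path/page?q=1")  # → "https://example.org/"
```

Note that the path, query and fragment are dropped, and the requested scheme is applied regardless of the scheme in the input URL.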
- cowidev.utils.web.download.read_csv_from_url(url, timeout=30, verify=True, ciphers_low=False, use_proxy=False, **kwargs)[source]¶
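Judging by its signature, read_csv_from_url fetches the URL and hands the bytes to pandas.read_csv; a sketch under that assumption:

```python
import io
import urllib.request

import pandas as pd


def read_csv_from_url_sketch(url, timeout=30, **kwargs):
    """Fetch a CSV over HTTP and parse it with pandas.read_csv.

    `kwargs` are forwarded to pandas.read_csv; the real helper also
    accepts verify/ciphers_low/use_proxy options, omitted here.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return pd.read_csv(io.BytesIO(resp.read()), **kwargs)
```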
- cowidev.utils.web.download.read_xlsx_from_url(url: str, timeout=30, as_series: bool = False, verify=True, drop=False, ciphers_low=False, use_proxy=False, **kwargs) DataFrame [source]¶
Download and load an XLSX file from a URL.
- Parameters:
url (str) – File url.
as_series (bool) – Set to True to return a pandas.Series object. Source file must be of shape 1xN (1 row, N columns). Defaults to False.
kwargs – Arguments for pandas.read_excel.
- Returns:
Data loaded.
- Return type:
pandas.DataFrame
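The as_series option is the only documented twist; a sketch of that post-processing step (assumed to collapse the single row into a Series indexed by the column names):

```python
import pandas as pd


def to_series_if_requested(df, as_series=False):
    """Collapse a 1xN DataFrame to a Series when `as_series` is True."""
    if not as_series:
        return df
    if len(df) != 1:
        raise ValueError("as_series requires a 1-row (1xN) DataFrame")
    # .iloc[0] yields a Series whose index is the frame's column names.
    return df.iloc[0]
```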
cowidev.utils.web.scraping¶
- cowidev.utils.web.scraping.get_driver(headless: bool = True, download_folder: str | None = None, options=None, firefox: bool = False, timeout: int | None = None)[source]¶
- cowidev.utils.web.scraping.get_headers() dict [source]¶
Get generic headers for requests.
- Returns:
Header.
- Return type:
dict
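The exact header set is not documented; a sketch of what a generic, browser-like header dict for requests typically looks like (the User-Agent string is an assumption, not cowidev's actual value):

```python
def get_headers_sketch() -> dict:
    """Return a generic browser-like header set for HTTP requests."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    }
```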
- cowidev.utils.web.scraping.get_response(source: str, request_method: str = 'get', use_proxy: bool = False, **kwargs)[source]¶
- cowidev.utils.web.scraping.get_soup(source: str, from_encoding: str | None = None, parser='lxml', request_method: str = 'get', use_proxy: bool = False, **kwargs) BeautifulSoup [source]¶
Get soup from website.
- Parameters:
source (str) – Website url.
from_encoding (str, optional) – Encoding to use. Defaults to None.
parser (str, optional) – HTML parser. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser. Defaults to ‘lxml’.
request_method (str, optional) – Request method. Options are ‘get’ and ‘post’. Defaults to ‘get’. For POST requests, make sure to specify headers (the default ones do not work).
use_proxy (bool, optional) – Set to True to route the request through a proxy. Defaults to False.
kwargs (dict) – Extra arguments passed to requests.get method. Default values for headers, verify and timeout are used.
- Returns:
Website soup.
- Return type:
BeautifulSoup
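Putting the documented parameters together, the helper presumably fetches the page and parses the response body with BeautifulSoup; a sketch under that assumption:

```python
import requests
from bs4 import BeautifulSoup


def get_soup_sketch(source, from_encoding=None, parser="lxml",
                    request_method="get", **kwargs):
    """Fetch `source` and return its parsed soup."""
    if request_method not in ("get", "post"):
        raise ValueError("request_method must be 'get' or 'post'")
    response = getattr(requests, request_method)(source, **kwargs)
    response.raise_for_status()
    return BeautifulSoup(response.content, parser, from_encoding=from_encoding)
```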
- cowidev.utils.web.scraping.request_json(url, mode='soup', **kwargs) dict [source]¶
Get data from url as a dictionary.
Content at url should be a dictionary.
- Parameters:
url (str) – URL to data.
mode (str) – Mode to use. Accepted values are ‘soup’ (default) and ‘raw’.
kwargs – Check get_soup for the complete list of accepted arguments.
- Returns:
Data
- Return type:
dict
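In ‘raw’ mode this amounts to fetching and JSON-decoding the body; a stdlib sketch (the ‘soup’ mode, which extracts the text via get_soup first, is left out):

```python
import json
import urllib.request


def request_json_sketch(url, timeout=30):
    """Fetch `url` and decode its body as a JSON dictionary."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Content at url should be a dictionary")
    return data
```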
- cowidev.utils.web.scraping.request_text(url, mode='soup', **kwargs) str [source]¶
Get data from url as plain text.
Content at url should be plain text.
- Parameters:
url (str) – URL to data.
mode (str) – Mode to use. Accepted values are ‘soup’ (default) and ‘raw’.
kwargs – Check get_soup for the complete list of accepted arguments.
- Returns:
Data
- Return type:
str
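The ‘raw’ mode again reduces to a plain fetch-and-decode; a stdlib sketch:

```python
import urllib.request


def request_text_sketch(url, timeout=30):
    """Fetch `url` and return its body as decoded text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```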
cowidev.utils.web.utils¶
- cowidev.utils.web.get_base_url(url: str, scheme='https') str [source]¶
Parse a URL and return the base URL.
- Parameters:
url (str) – URL to parse. <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
scheme ({‘http’, ‘https’}, optional) – Scheme to use. Defaults to ‘https’.
- Returns:
Base URL, in the form <scheme>://<netloc>/
- Return type:
str
- cowidev.utils.web.get_driver(headless: bool = True, download_folder: str | None = None, options=None, firefox: bool = False, timeout: int | None = None)[source]¶
- cowidev.utils.web.get_soup(source: str, from_encoding: str | None = None, parser='lxml', request_method: str = 'get', use_proxy: bool = False, **kwargs) BeautifulSoup [source]¶
Get soup from website.
- Parameters:
source (str) – Website url.
from_encoding (str, optional) – Encoding to use. Defaults to None.
parser (str, optional) – HTML parser. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser. Defaults to ‘lxml’.
request_method (str, optional) – Request method. Options are ‘get’ and ‘post’. Defaults to ‘get’. For POST requests, make sure to specify headers (the default ones do not work).
use_proxy (bool, optional) – Set to True to route the request through a proxy. Defaults to False.
kwargs (dict) – Extra arguments passed to requests.get method. Default values for headers, verify and timeout are used.
- Returns:
Website soup.
- Return type:
BeautifulSoup
- cowidev.utils.web.read_xlsx_from_url(url: str, timeout=30, as_series: bool = False, verify=True, drop=False, ciphers_low=False, use_proxy=False, **kwargs) DataFrame [source]¶
Download and load an XLSX file from a URL.
- Parameters:
url (str) – File url.
as_series (bool) – Set to True to return a pandas.Series object. Source file must be of shape 1xN (1 row, N columns). Defaults to False.
kwargs – Arguments for pandas.read_excel.
- Returns:
Data loaded.
- Return type:
pandas.DataFrame
- cowidev.utils.web.request_json(url, mode='soup', **kwargs) dict [source]¶
Get data from url as a dictionary.
Content at url should be a dictionary.
- Parameters:
url (str) – URL to data.
mode (str) – Mode to use. Accepted values are ‘soup’ (default) and ‘raw’.
kwargs – Check get_soup for the complete list of accepted arguments.
- Returns:
Data
- Return type:
dict