Module Reference
mwsql.dump
A set of utilities for processing MediaWiki SQL dump data.
- class mwsql.dump.Dump(database: Optional[str], table_name: Optional[str], col_names: List[str], col_sql_dtypes: Dict[str, str], primary_key: Optional[str], source_file: Union[str, pathlib.Path], encoding: str)
Class for parsing an SQL dump file and processing its contents.
- property dtypes: Dict[str, type]
Mapping between col_names and native Python dtypes.
- Returns
A mapping from the column names in a SQL table to their respective Python data types. Example: {"ct_id": int}
- Return type
Dict[str, type]
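To illustrate how a mapping of this shape can be used, the sketch below converts one row of strings with a dtypes-style dictionary. It is not mwsql's own conversion code: the `convert_row` helper is hypothetical, the second column is only sample data, and keeping the raw string when a non-strict conversion fails is an assumption.

```python
from typing import Any, Dict, List


def convert_row(
    row: List[str],
    col_names: List[str],
    dtypes: Dict[str, type],
    strict: bool = False,
) -> List[Any]:
    """Convert each string in `row` using the Python type mapped to its column."""
    converted: List[Any] = []
    for name, value in zip(col_names, row):
        target = dtypes.get(name, str)  # columns without a mapping stay str
        try:
            converted.append(target(value))
        except ValueError:
            if strict:
                raise
            # Assumption: in non-strict mode the raw string is kept as-is.
            converted.append(value)
    return converted


# Sample mapping in the same shape as the `dtypes` property.
dtypes = {"ct_id": int, "ct_params": str}
print(convert_row(["42", "abc"], ["ct_id", "ct_params"], dtypes))
```

Passing `strict=True` makes a value such as `"x"` in the `ct_id` column raise `ValueError` instead of passing through unchanged.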
- property encoding: str
The encoding used to read the dump file.
- Getter
Returns the current encoding
- Setter
Sets the encoding to a new value
- Returns
Text encoding
- Return type
str
- classmethod from_file(file_path: Union[str, pathlib.Path], encoding: str = 'utf-8') mwsql.dump.T
Initialize Dump object from dump file.
- Parameters
cls (Dump) – The Dump class
file_path (PathObject) – Path to the source SQL dump file. Can be a .gz or an uncompressed file
encoding (str, optional) – Text encoding, defaults to “utf-8”. If you get an encoding error when processing the file, try setting this parameter to ‘Latin-1’
- Returns
A Dump class instance
- Return type
mwsql.dump.T
- head(n_lines: int = 10, convert_dtypes: bool = False) None
Display first n rows.
- Parameters
n_lines (int, optional) – Number of rows to display, defaults to 10
convert_dtypes (bool, optional) – When set to True, numerical values are shown as int or float rather than str. Defaults to False.
- rows(convert_dtypes: bool = False, strict_conversion: bool = False, **fmtparams: Any) Iterator[List[Any]]
Create a generator object from the rows.
- Parameters
convert_dtypes (bool, optional) – When set to True, numerical types are converted from str to int or float. Defaults to False.
strict_conversion (bool, optional) – When set to True, raise an Error exception on bad input when converting from SQL dtypes to Python dtypes. Defaults to False.
fmtparams – Any kwargs you want to pass to the csv.reader() function that does the actual parsing.
- Yield
A generator used to iterate over the rows in the SQL table
- Return type
Iterator[List[Any]]
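Because fmtparams are forwarded to csv.reader(), any keyword that csv.reader() accepts can be used. The standalone sketch below shows what such keywords do to comma-separated values with single-quoted, backslash-escaped strings. The sample data and the chosen parameter values are illustrative assumptions, not mwsql's defaults.

```python
import csv

# Two rows of comma-separated values; strings are single-quoted and
# embedded quotes are escaped with a backslash.
raw_rows = [
    "1,'Hello, world',0",
    r"2,'It\'s here',1",
]

# Ordinary csv.reader() keyword arguments (values chosen for this sample).
fmtparams = {"quotechar": "'", "escapechar": "\\", "doublequote": False}

parsed = list(csv.reader(raw_rows, **fmtparams))
for row in parsed:
    print(row)
```

With these settings the comma inside the quoted string does not split the field, and the escaped quote is kept as a literal character.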
- to_csv(file_path: Union[str, pathlib.Path], **fmtparams: Any) None
Write Dump object to CSV file.
- Parameters
file_path (PathObject) – The file to write to. Will be created if it doesn’t already exist. Will be overwritten if it does exist.
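The methods above can be combined in a short workflow. The sketch below defines a hypothetical helper, `first_rows`, that works with any object exposing the documented rows() method, so it can be tried without a dump file. The commented lines show the intended use with mwsql installed; the file names there are placeholders.

```python
from itertools import islice
from typing import Any, Iterator, List


def first_rows(dump: Any, limit: int = 3) -> List[List[Any]]:
    """Collect up to `limit` converted rows without reading the whole table."""
    rows: Iterator[List[Any]] = dump.rows(convert_dtypes=True)
    return list(islice(rows, limit))


# Intended use with mwsql installed (file names are placeholders):
#   from mwsql.dump import Dump
#   dump = Dump.from_file("simplewiki-latest-change_tag.sql.gz")
#   dump.head(5)                   # display the first five rows
#   print(first_rows(dump))        # first three rows, numbers converted
#   dump.to_csv("change_tag.csv")  # overwrites the file if it exists
```

Since rows() returns an iterator, taking a slice with islice stops reading after `limit` rows rather than loading the full table.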
mwsql.utils
Utility functions used to download, open and display the contents of Wikimedia SQL dump files.
- mwsql.utils.download_file(url: str, file_name: str) Optional[pathlib.Path]
Download a file from a URL and show a progress indicator. Return the path to the downloaded file.
- Parameters
url (str) – URL to download from
file_name (str) – Name of the file to download
- Returns
Path to the downloaded file
- mwsql.utils.head(file_path: Union[str, pathlib.Path], n_lines: int = 10, encoding: str = 'utf-8') None
Display first n lines of a file. Works with both .gz and uncompressed files. Defaults to 10 lines.
- Parameters
file_path (PathObject) – The path to the file
n_lines (int, optional) – Lines to display, defaults to 10
encoding (str, optional) – Text encoding, defaults to “utf-8”
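One way to read the first lines of either a .gz or an uncompressed file is sketched below using only the standard library. The `first_lines` helper is hypothetical and returns the lines instead of displaying them; detecting compression from the ".gz" suffix is an assumption, not necessarily how mwsql decides.

```python
import gzip
from itertools import islice
from pathlib import Path
from typing import List, Union


def first_lines(
    file_path: Union[str, Path], n_lines: int = 10, encoding: str = "utf-8"
) -> List[str]:
    """Return the first `n_lines` lines of a plain or gzip-compressed text file."""
    path = Path(file_path)
    # Assumption: compression is detected from the ".gz" suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding=encoding) as handle:
        # islice stops after n_lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```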
- mwsql.utils.load(database: str, filename: str, date: str = 'latest', extension: str = 'sql') Optional[Union[str, pathlib.Path]]
Load a dump file from a Wikimedia public directory if the user is in a supported environment (PAWS, Toolforge…). Otherwise, download the dump file from the web and save it in the current working directory. In both cases, the function returns a path-like object which can be used to access the file. Does not check if the file already exists on the path.
- Parameters
database (str) – The database backup dump to download a file from, e.g. ‘enwiki’ (English Wikipedia). See a list of available databases here: https://dumps.wikimedia.org/backup-index-bydb.html
filename (str) – The name of the file to download, e.g. ‘page’ loads the file {database}-{date}-page.sql.gz
date (str, optional) – Date the dump was generated, defaults to “latest”. If “latest” is not used, the date format should be “YYYYMMDD”
extension (str, optional) – The file extension. Defaults to ‘sql’
- Returns
Path to dump file
- Return type
Optional[PathObject]
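The parameters above combine into a file name of the form {database}-{date}-{filename}.sql.gz. The sketch below builds that name and a matching download URL. Both helpers are hypothetical; the URL layout follows the public dumps.wikimedia.org site, and appending ".gz" for every extension is an assumption rather than documented mwsql behavior.

```python
import re

DUMPS_URL = "https://dumps.wikimedia.org"  # public Wikimedia dumps site


def dump_file_name(
    database: str, filename: str, date: str = "latest", extension: str = "sql"
) -> str:
    """Build a dump file name such as 'enwiki-latest-page.sql.gz'."""
    if date != "latest" and not re.fullmatch(r"\d{8}", date):
        raise ValueError('date must be "latest" or use the format YYYYMMDD')
    # Assumption: every dump file is gzip-compressed.
    return f"{database}-{date}-{filename}.{extension}.gz"


def dump_url(
    database: str, filename: str, date: str = "latest", extension: str = "sql"
) -> str:
    """Build the download URL, assuming a {database}/{date}/ directory layout."""
    name = dump_file_name(database, filename, date, extension)
    return f"{DUMPS_URL}/{database}/{date}/{name}"


print(dump_file_name("enwiki", "page"))
print(dump_url("enwiki", "page", date="20240101"))
```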