Module Reference

mwsql.dump

A set of utilities for processing MediaWiki SQL dump data.

class mwsql.dump.Dump(database: str | None, table_name: str | None, col_names: List[str], col_sql_dtypes: Dict[str, str], primary_key: str | None, source_file: str | Path, encoding: str)

Class for parsing an SQL dump file and processing its contents.

property dtypes: Dict[str, type]

Mapping between col_names and native Python dtypes.

Returns: A mapping from the column names in a SQL table to their respective Python data types. Example: {"ct_id": int}
Return type: Dict[str, type]
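To make the shape of this mapping concrete, here is a minimal sketch that derives Python types from SQL column types. The helper `sql_to_python_type`, its rules, and the `score` and `title` columns are assumptions made for illustration; this is not mwsql's own conversion table.

```python
from typing import Dict


def sql_to_python_type(sql_dtype: str) -> type:
    """Hypothetical helper: guess a Python type from a SQL column type."""
    # "int(10) unsigned" -> "int"; "double precision" -> "double"
    parts = sql_dtype.lower().split("(")[0].split()
    base = parts[0] if parts else ""
    if base in {"tinyint", "smallint", "mediumint", "int", "bigint"}:
        return int
    if base in {"float", "double", "decimal"}:
        return float
    # Everything else (varbinary, varchar, blob, ...) stays as text.
    return str


# Made-up column definitions; only "ct_id" appears in the reference above.
col_sql_dtypes: Dict[str, str] = {
    "ct_id": "int(10) unsigned",
    "score": "float",
    "title": "varbinary(255)",
}

dtypes: Dict[str, type] = {
    name: sql_to_python_type(sql_dtype)
    for name, sql_dtype in col_sql_dtypes.items()
}
print(dtypes)
```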

property encoding: str

The encoding used to read the dump file.

Getter: Returns the current encoding
Setter: Sets the encoding to a new value
Returns: Text encoding
Return type: str

classmethod from_file(file_path: str | Path, encoding: str = 'utf-8') → T

Initialize Dump object from dump file.

Parameters:

cls (Dump) – The Dump class itself (passed implicitly)
file_path (PathObject) – Path to source SQL dump file. Can be a .gz or an uncompressed file
encoding (str, optional) – Text encoding, defaults to “utf-8”. If you get an encoding error when processing the file, try setting this parameter to ‘Latin-1’.

Returns:

A Dump class instance

Return type:

Dump
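The handling of compressed files and the Latin-1 fallback described above can be sketched with the standard library alone. The helper `open_dump` is hypothetical and only illustrates the idea; it is not taken from mwsql.

```python
import gzip
import tempfile
from pathlib import Path
from typing import IO, Union


def open_dump(file_path: Union[str, Path], encoding: str = "utf-8") -> IO[str]:
    """Hypothetical helper: open a .gz or plain dump file in text mode."""
    path = Path(file_path)
    if path.suffix == ".gz":
        return gzip.open(path, mode="rt", encoding=encoding)
    return open(path, mode="r", encoding=encoding)


# Demo: a gzip file containing a byte that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.sql.gz"
    with gzip.open(sample, "wb") as fh:
        fh.write(b"caf\xe9\n")  # 0xE9 is "é" in Latin-1

    try:
        with open_dump(sample) as fh:
            text = fh.read()
    except UnicodeDecodeError:
        # Retry with the fallback encoding suggested above.
        with open_dump(sample, encoding="latin-1") as fh:
            text = fh.read()

print(repr(text))
```

With mwsql itself, the equivalent step would be to pass a different `encoding` argument to `from_file`.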

head(n_lines: int = 10, convert_dtypes: bool = False) → None

Display first n rows.

Parameters:

n_lines (int, optional) – Number of rows to display, defaults to 10
convert_dtypes (bool, optional) – When True, shows numerical types as int or float instead of str. Defaults to False.

rows(convert_dtypes: bool = False, strict_conversion: bool = False, **fmtparams: Any) → Iterator[List[Any]]

Create a generator object from the rows.

Parameters:

convert_dtypes (bool, optional) – When set to True, numerical types are converted from str to int or float. Defaults to False.
strict_conversion (bool, optional) – When True, raise an exception on bad input when converting from SQL dtypes to Python dtypes. Defaults to False.
fmtparams – Any kwargs you want to pass to the csv.reader() function that does the actual parsing.

Yield:

A generator used to iterate over the rows in the SQL table

Return type:

Iterator[List[Any]]
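The interplay of these parameters can be sketched as follows. The helper `convert_row`, the sample data, and the particular format parameters are assumptions for illustration; only `csv.reader` and its keyword arguments are real standard-library API.

```python
import csv
from typing import Any, Callable, List, Sequence


def convert_row(
    row: Sequence[str],
    converters: Sequence[Callable[[str], Any]],
    strict: bool = False,
) -> List[Any]:
    """Hypothetical helper: apply one converter per column."""
    converted: List[Any] = []
    for value, convert in zip(row, converters):
        try:
            converted.append(convert(value))
        except ValueError:
            if strict:
                raise  # strict mode: surface the bad value
            converted.append(value)  # lenient mode: keep the raw string
    return converted


# Made-up sample data: one SQL value tuple per line, without parentheses.
raw_lines = ["1,'Main_Page',0.5", "2,'It\\'s',n/a"]

# Assumed format parameters for single-quoted, backslash-escaped strings.
fmtparams = {"quotechar": "'", "escapechar": "\\", "doublequote": False}

reader = csv.reader(raw_lines, **fmtparams)
rows = [convert_row(r, [int, str, float]) for r in reader]
print(rows)

# In strict mode the unparseable "n/a" raises instead of passing through.
try:
    convert_row(["2", "It's", "n/a"], [int, str, float], strict=True)
    strict_raised = False
except ValueError:
    strict_raised = True
```

In lenient mode a value that cannot be converted stays a string, so a column may contain mixed types.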

to_csv(file_path: str | Path, **fmtparams: Any) → None

Write Dump object to CSV file.

Parameters: file_path (PathObject) – The file to write to. Will be created if it doesn’t already exist. Will be overwritten if it does exist.
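The create-or-overwrite behavior matches opening a file in "w" mode. The sketch below writes a header and rows with `csv.writer`, assuming (the reference does not say so) that the extra keyword arguments are formatting parameters such as `delimiter`. The column name `ct_params` and the row values are made up.

```python
import csv
import tempfile
from pathlib import Path

col_names = ["ct_id", "ct_params"]  # second column name is hypothetical
rows = [[1, "a"], [2, "b"]]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "table.tsv"
    out_path.write_text("stale content", encoding="utf-8")

    # Mode "w" creates the file or truncates an existing one.
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(col_names)
        writer.writerows(rows)

    # Read the file back to check what was written.
    with open(out_path, newline="", encoding="utf-8") as fh:
        written = list(csv.reader(fh, delimiter="\t"))

print(written)
```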

mwsql.utils

Utility functions used to download, open and display the contents of Wikimedia SQL dump files.

mwsql.utils.download_file(url: str, file_name: str) → Path | None

Download a file from a URL and show a progress indicator. Return the path to the downloaded file.

Parameters:

url (str) – URL to download from
file_name (str) – name of the file to download

Returns:

path to the downloaded file

Return type:

Optional[Path]
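A download with a progress indicator can be sketched with `urllib.request.urlretrieve` and its `reporthook` callback. The functions `format_progress` and `download` are hypothetical, and returning None on a network error is an assumption about what the optional return type means.

```python
import urllib.error
import urllib.request
from pathlib import Path
from typing import Optional


def format_progress(block_num: int, block_size: int, total_size: int) -> str:
    """Hypothetical helper: render progress as a percentage or byte count."""
    done = block_num * block_size
    if total_size <= 0:
        # The server did not report a size, so show bytes received.
        return f"{done} bytes"
    percent = min(100.0, 100.0 * done / total_size)
    return f"{percent:5.1f}%"


def download(url: str, file_name: str) -> Optional[Path]:
    """Hypothetical sketch: fetch url into file_name, printing progress."""

    def report(block_num: int, block_size: int, total_size: int) -> None:
        line = format_progress(block_num, block_size, total_size)
        print("\r" + line, end="", flush=True)

    try:
        urllib.request.urlretrieve(url, file_name, reporthook=report)
    except urllib.error.URLError:
        return None  # assumption: signal failure with None
    print()
    return Path(file_name)
```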

mwsql.utils.head(file_path: str | Path, n_lines: int = 10, encoding: str = 'utf-8') → None

Display first n lines of a file. Works with both .gz and uncompressed files. Defaults to 10 lines.

Parameters:

file_path (PathObject) – The path to the file
n_lines (int, optional) – Lines to display, defaults to 10
encoding (str, optional) – Text encoding, defaults to “utf-8”
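Reading only the first lines of a possibly compressed file can be sketched with `itertools.islice`, which stops after `n_lines` without loading the rest of the file. The helper `first_lines` is hypothetical and returns the lines rather than printing them.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path
from typing import List, Union


def first_lines(
    file_path: Union[str, Path], n_lines: int = 10, encoding: str = "utf-8"
) -> List[str]:
    """Hypothetical helper: return the first n_lines of a text file."""
    path = Path(file_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding=encoding) as fh:
        return [line.rstrip("\n") for line in islice(fh, n_lines)]


# Demo with a plain and a gzip-compressed copy of the same made-up content.
content = "".join(f"line {i}\n" for i in range(20))
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "sample.sql"
    packed = Path(tmp) / "sample.sql.gz"
    plain.write_text(content, encoding="utf-8")
    with gzip.open(packed, "wt", encoding="utf-8") as fh:
        fh.write(content)

    plain_head = first_lines(plain, n_lines=3)
    packed_head = first_lines(packed, n_lines=3)

print(plain_head)
```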

mwsql.utils.load(database: str, filename: str, date: str = 'latest', extension: str = 'sql') → str | Path | None

Load a dump file from a Wikimedia public directory if the user is in a supported environment (PAWS, Toolforge…). Otherwise, download the dump file from the web and save it in the current working directory. In both cases, the function returns a path-like object which can be used to access the file. Does not check if the file already exists on the path.

Parameters:

database (str) – The database backup dump to download a file from, e.g. ‘enwiki’ (English Wikipedia). See a list of available databases here: https://dumps.wikimedia.org/backup-index-bydb.html
filename (str) – The name of the file to download, e.g. ‘page’ loads the file {database}-{date}-page.sql.gz
date (str, optional) – Date the dump was generated, defaults to “latest”. If “latest” is not used, the date format should be “YYYYMMDD”
extension (str, optional) – The file extension, defaults to ‘sql’

Returns:

Path to dump file

Return type:

Optional[PathObject]
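The naming pattern from the filename parameter, {database}-{date}-page.sql.gz, can be sketched as plain string formatting. The helpers `dump_file_name` and `dump_url` are hypothetical, and the URL layout is an assumption about how the public dump site is organized, not something this reference states.

```python
def dump_file_name(
    database: str, filename: str, date: str = "latest", extension: str = "sql"
) -> str:
    """Hypothetical helper: build the dump file name from its parts."""
    return f"{database}-{date}-{filename}.{extension}.gz"


def dump_url(
    database: str, filename: str, date: str = "latest", extension: str = "sql"
) -> str:
    """Hypothetical helper: assumed layout <base>/<database>/<date>/<file>."""
    base = "https://dumps.wikimedia.org"  # assumed base URL
    name = dump_file_name(database, filename, date, extension)
    return f"{base}/{database}/{date}/{name}"


name = dump_file_name("enwiki", "page")
dated = dump_file_name("enwiki", "page", date="20240101")
print(name)
print(dump_url("enwiki", "page"))
```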