Module Reference

mwsql.dump

A set of utilities for processing MediaWiki SQL dump data.

class mwsql.dump.Dump(database: str | None, table_name: str | None, col_names: List[str], col_sql_dtypes: Dict[str, str], primary_key: str | None, source_file: str | Path, encoding: str)

Class for parsing an SQL dump file and processing its contents.

property dtypes: Dict[str, type]

Mapping between col_names and native Python dtypes.

Returns:

A mapping from the column names in a SQL table to their respective Python data types. Example: {"ct_id": int}

Return type:

Dict[str, type]
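As an illustration of what such a mapping can be used for, the sketch below applies a dtypes-style dictionary to one row of raw string values. It is not the library's conversion code. The column name ct_label and the sample values are hypothetical, and the sketch assumes the mapping lists columns in table order.

```python
from typing import Any, Dict, List

# A mapping shaped like the one returned by Dump.dtypes.
# "ct_id" comes from the example above; "ct_label" is a hypothetical column.
dtypes: Dict[str, type] = {"ct_id": int, "ct_label": str}

# One row as it would look before any conversion: every value is a str.
raw_row: List[str] = ["42", "mobile edit"]

# Apply each column's Python type to the value in the same position.
converted: List[Any] = [cast(value) for cast, value in zip(dtypes.values(), raw_row)]
```

Here converted holds an int followed by a str, which is the kind of result the convert_dtypes option of head() and rows() is described as producing.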

property encoding: str

The encoding used to read the dump file.

Getter:

Returns the current encoding

Setter:

Sets the encoding to a new value

Returns:

Text encoding

Return type:

str

classmethod from_file(file_path: str | Path, encoding: str = 'utf-8') T

Initialize Dump object from dump file.

Parameters:
  • cls (Dump) – The Dump class

  • file_path (PathObject) – Path to source SQL dump file. Can be a .gz or an uncompressed file

  • encoding (str, optional) – Text encoding, defaults to “utf-8”. If you get an encoding error when processing the file, try setting this parameter to “Latin-1”.

Returns:

A Dump class instance

Return type:

Dump
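The encoding advice above can be turned into a small retry helper. The following is a minimal sketch, not part of mwsql: load_with_fallback and read_text are hypothetical names, and the sketch assumes the encoding problem surfaces as a UnicodeDecodeError while the loader is reading the file. A plain text reader stands in for Dump.from_file so that the example runs without the package or a real dump.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence, TypeVar, Union

T = TypeVar("T")


def load_with_fallback(
    loader: Callable[..., T],
    file_path: Union[str, Path],
    encodings: Sequence[str] = ("utf-8", "latin-1"),
) -> T:
    """Call loader(file_path, encoding=...) with each encoding until one works."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error: Optional[UnicodeDecodeError] = None
    for encoding in encodings:
        try:
            return loader(file_path, encoding=encoding)
        except UnicodeDecodeError as err:
            # Remember the failure and try the next candidate encoding.
            last_error = err
    assert last_error is not None
    raise last_error


def read_text(file_path: Union[str, Path], encoding: str = "utf-8") -> str:
    """Stand-in loader with the same (file_path, encoding) shape as from_file."""
    return Path(file_path).read_text(encoding=encoding)


# Demonstration: a file that is valid Latin-1 but not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.sql"
    sample.write_bytes("café".encode("latin-1"))
    recovered = load_with_fallback(read_text, sample)
```

With mwsql installed, Dump.from_file could be passed as the loader in place of read_text. If the error only appears later, while iterating over rows(), the same retry would have to wrap that loop instead.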

head(n_lines: int = 10, convert_dtypes: bool = False) None

Display first n rows.

Parameters:
  • n_lines (int, optional) – Number of rows to display, defaults to 10

  • convert_dtypes (bool, optional) – When set to True, numerical values are shown as int or float instead of str. Defaults to False.

rows(convert_dtypes: bool = False, strict_conversion: bool = False, **fmtparams: Any) Iterator[List[Any]]

Create a generator object from the rows.

Parameters:
  • convert_dtypes (bool, optional) – When set to True, numerical types are converted from str to int or float. Defaults to False.

  • strict_conversion (bool, optional) – When set to True, an exception is raised on bad input when converting from SQL dtypes to Python dtypes. Defaults to False.

  • fmtparams – Any kwargs you want to pass to the csv.reader() function that does the actual parsing.

Yield:

The rows of the SQL table, one list of values per row

Return type:

Iterator[List[Any]]
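Since fmtparams are forwarded to csv.reader(), it may help to see what such parameters do on their own. The sketch below uses only the standard library. The sample records and the particular parameter values are assumptions chosen to resemble SQL-style quoting, and are not taken from the mwsql defaults, which this reference does not state.

```python
import csv

# Two hypothetical records, written the way values appear inside an
# INSERT statement: single-quoted strings with backslash escapes.
records = [
    "1,'Main_Page',0",
    "2,'It\\'s_a_title',1",
]

# Formatting parameters of the kind that could be passed as **fmtparams.
reader = csv.reader(
    records,
    delimiter=",",
    quotechar="'",      # strings are wrapped in single quotes
    escapechar="\\",    # a backslash escapes the following character
    doublequote=False,  # quotes are escaped, not doubled
    strict=True,        # raise csv.Error on malformed input
)

parsed = list(reader)
```

Every value in parsed is still a str at this point. Turning "1" into 1 is the job of the convert_dtypes option described above.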

to_csv(file_path: str | Path, **fmtparams: Any) None

Write Dump object to CSV file.

Parameters:
  • file_path (PathObject) – The file to write to. Will be created if it doesn’t already exist. Will be overwritten if it does exist.
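Because an existing file is overwritten without warning, a caller may want to check the target first. The helper below is a hypothetical guard, not something mwsql provides. It only inspects the path and leaves the writing to to_csv.

```python
import tempfile
from pathlib import Path
from typing import Union


def ensure_new_file(file_path: Union[str, Path]) -> Path:
    """Return file_path as a Path, refusing paths that already exist."""
    path = Path(file_path)
    if path.exists():
        raise FileExistsError(f"{path} already exists and would be overwritten")
    return path


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "change_tag.csv"

    # The first check passes because nothing has been written yet.
    fresh_ok = ensure_new_file(target) == target

    # Once the file exists, the guard refuses it.
    target.write_text("ct_id\n", encoding="utf-8")
    try:
        ensure_new_file(target)
        blocked = False
    except FileExistsError:
        blocked = True
```

A call such as dump.to_csv(ensure_new_file("change_tag.csv")) would then fail early instead of replacing an earlier export.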

mwsql.utils

Utility functions used to download, open and display the contents of Wikimedia SQL dump files.

mwsql.utils.download_file(url: str, file_name: str) Path | None

Download a file from a URL and show a progress indicator. Return the path to the downloaded file.

Parameters:
  • url (str) – URL to download from

  • file_name (str) – name of the file to download

Returns:

path to the downloaded file

Return type:

Optional[Path]
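The Optional return type means that a caller cannot assume a path comes back. This reference does not say when None is returned, so the sketch below simply treats None as a failure. require_path is a hypothetical helper name and the file name in the demonstration is made up.

```python
from pathlib import Path
from typing import Optional, Union


def require_path(result: Optional[Union[str, Path]]) -> Path:
    """Turn an optional path-like result into a Path, or fail loudly."""
    if result is None:
        raise FileNotFoundError("no file path was returned")
    return Path(result)


# A string result is normalized to a Path.
normalized = require_path("enwiki-latest-page.sql.gz")

# A missing result raises instead of propagating None further.
try:
    require_path(None)
    none_rejected = False
except FileNotFoundError:
    none_rejected = True
```

Wrapping the call, as in require_path(download_file(url, file_name)), keeps the None check in one place.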

mwsql.utils.head(file_path: str | Path, n_lines: int = 10, encoding: str = 'utf-8') None

Display first n lines of a file. Works with both .gz and uncompressed files. Defaults to 10 lines.

Parameters:
  • file_path (PathObject) – The path to the file

  • n_lines (int, optional) – Lines to display, defaults to 10

  • encoding (str, optional) – Text encoding, defaults to “utf-8”
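Handling both .gz and uncompressed files with one code path can be sketched with the standard library alone. This is an illustration of the idea, not the mwsql implementation: first_lines is a hypothetical name, it returns the lines rather than displaying them, and detecting compression from the ".gz" suffix is an assumption.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path
from typing import List, Union


def first_lines(
    file_path: Union[str, Path], n_lines: int = 10, encoding: str = "utf-8"
) -> List[str]:
    """Return the first n_lines lines of a plain or gzip-compressed text file."""
    path = Path(file_path)
    # Assumption: compression is detected from the ".gz" suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding=encoding) as handle:
        # islice stops after n_lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Demonstration on two throwaway files with identical content.
with tempfile.TemporaryDirectory() as tmp:
    text = "".join(f"line {i}\n" for i in range(20))
    plain = Path(tmp) / "sample.sql"
    packed = Path(tmp) / "sample.sql.gz"
    plain.write_text(text, encoding="utf-8")
    with gzip.open(packed, "wt", encoding="utf-8") as handle:
        handle.write(text)

    plain_head = first_lines(plain, n_lines=3)
    packed_head = first_lines(packed, n_lines=3)
```

Reading lazily matters for dump files, which can be very large: only the requested lines are decompressed and decoded.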

mwsql.utils.load(database: str, filename: str, date: str = 'latest', extension: str = 'sql') str | Path | None

Load a dump file from a Wikimedia public directory if the user is in a supported environment (PAWS, Toolforge…). Otherwise, download the dump file from the web and save it in the current working directory. In both cases, the function returns a path-like object which can be used to access the file. Does not check if the file already exists on the path.

Parameters:
  • database (str) – The database backup dump to download a file from, e.g. ‘enwiki’ (English Wikipedia). See a list of available databases here: https://dumps.wikimedia.org/backup-index-bydb.html

  • filename (str) – The name of the file to download, e.g. ‘page’ loads the file {database}-{date}-page.sql.gz

  • date (str, optional) – Date the dump was generated, defaults to “latest”. If “latest” is not used, the date format should be “YYYYMMDD”

  • extension (str, optional) – The file extension, defaults to “sql”

Returns:

Path to dump file

Return type:

Optional[PathObject]
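The naming pattern and date format described in the parameters can be written out as a small helper. This is a sketch under stated assumptions: dump_file_name is a hypothetical function, and generalizing the documented example {database}-{date}-page.sql.gz to {database}-{date}-{filename}.{extension}.gz is an inference rather than something this reference states.

```python
from datetime import datetime


def dump_file_name(
    database: str, filename: str, date: str = "latest", extension: str = "sql"
) -> str:
    """Build a dump file name following the pattern in the parameter docs."""
    if date != "latest":
        # The documented format is YYYYMMDD: exactly eight digits...
        if len(date) != 8 or not date.isdigit():
            raise ValueError(f"date must be 'latest' or YYYYMMDD, got {date!r}")
        # ...that also form a real calendar date (strptime raises ValueError otherwise).
        datetime.strptime(date, "%Y%m%d")
    # Assumption: the ".gz" suffix always follows the extension.
    return f"{database}-{date}-{filename}.{extension}.gz"


latest_name = dump_file_name("enwiki", "page")
dated_name = dump_file_name("enwiki", "page", date="20220101")

# A date in the wrong format is rejected before any download is attempted.
try:
    dump_file_name("enwiki", "page", date="2022-01-01")
    rejected = False
except ValueError:
    rejected = True
```

Validating the date up front gives a clearer error than a failed lookup or download would.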