Module Reference

mwsql.dump

A set of utilities for processing MediaWiki SQL dump data.

class mwsql.dump.Dump(database: Optional[str], table_name: Optional[str], col_names: List[str], col_sql_dtypes: Dict[str, str], primary_key: Optional[str], source_file: Union[str, pathlib.Path], encoding: str)

Class for parsing an SQL dump file and processing its contents.

property dtypes: Dict[str, type]

Mapping between col_names and native Python dtypes.

Returns

A mapping from the column names in a SQL table to their respective Python data types. Example: {“ct_id”: int}

Return type

Dict[str, type]
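As a rough illustration of what such a mapping could look like, the sketch below derives Python types from SQL type strings. The lookup table and the helper name python_dtypes are hypothetical and are not taken from mwsql; the rules the library actually applies may differ.

```python
import re
from typing import Dict

# Hypothetical lookup table (an assumption, not mwsql's own mapping).
_SQL_TO_PYTHON = {
    "tinyint": int,
    "smallint": int,
    "int": int,
    "bigint": int,
    "float": float,
    "double": float,
    "decimal": float,
}


def python_dtypes(col_sql_dtypes: Dict[str, str]) -> Dict[str, type]:
    """Map column names to Python types, falling back to str."""
    result: Dict[str, type] = {}
    for name, sql_type in col_sql_dtypes.items():
        # Keep only the leading type word: "int(10) unsigned" -> "int".
        match = re.match(r"[a-z]+", sql_type.strip().lower())
        base = match.group(0) if match else ""
        result[name] = _SQL_TO_PYTHON.get(base, str)
    return result
```

Called with an illustrative input such as {"ct_id": "int(10) unsigned"}, this sketch produces the same shape as the example above: a dictionary keyed by column name whose values are Python types.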

property encoding: str

The encoding used to read the dump file.

Getter

Returns the current encoding

Setter

Sets the encoding to a new value

Returns

Text encoding

Return type

str

classmethod from_file(file_path: Union[str, pathlib.Path], encoding: str = 'utf-8') → mwsql.dump.T

Initialize Dump object from dump file.

Parameters
  • cls (Dump) – The Dump class

  • file_path (PathObject) – Path to the source SQL dump file. Can be a .gz or an uncompressed file

  • encoding (str, optional) – Text encoding, defaults to “utf-8”. If you get an encoding error when processing the file, try setting this parameter to ‘Latin-1’

Returns

A Dump class instance

Return type

Dump
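The advice about switching to Latin-1 can be illustrated with the standard library alone. The sketch below is not part of mwsql, and decode_with_fallback is a hypothetical helper name. It shows why a byte sequence that fails as UTF-8 still decodes as Latin-1: Latin-1 assigns a character to every possible byte value.

```python
from typing import Sequence


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the given encodings could decode the data")


# b"caf\xe9" is "café" in Latin-1 but is not valid UTF-8.
text = decode_with_fallback(b"caf\xe9")
```

Because Latin-1 never raises a decoding error, it also hides the case where the file uses some third encoding; the text then decodes without complaint but may contain wrong characters.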

head(n_lines: int = 10, convert_dtypes: bool = False) → None

Display first n rows.

Parameters
  • n_lines (int, optional) – Number of rows to display, defaults to 10

  • convert_dtypes (bool, optional) – When set to True, numerical values are shown as int or float instead of str. Defaults to False.

rows(convert_dtypes: bool = False, strict_conversion: bool = False, **fmtparams: Any) → Iterator[List[Any]]

Create a generator object from the rows.

Parameters
  • convert_dtypes (bool, optional) – When set to True, numerical types are converted from str to int or float. Defaults to False.

  • strict_conversion (bool, optional) – When set to True, an error is raised on bad input when converting from SQL dtypes to Python dtypes. Defaults to False.

  • fmtparams – Any kwargs you want to pass to the csv.reader() function that does the actual parsing.

Yield

A generator used to iterate over the rows in the SQL table

Return type

Iterator[List[Any]]
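To make the role of fmtparams more concrete, here is a minimal sketch of how csv.reader() can parse the value tuples of an SQL INSERT statement. It is not mwsql's implementation: the helper names iter_rows and _maybe_number, the default format parameters, and the naive tuple splitting are all assumptions made for illustration.

```python
import csv
from typing import Any, Iterator, List


def _maybe_number(value: str) -> Any:
    """Return an int or float when Python can parse the text as one."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def iter_rows(
    values: str, convert_dtypes: bool = False, **fmtparams: Any
) -> Iterator[List[Any]]:
    """Yield one list per tuple in a string like "(1,'a'),(2,'b');"."""
    # Drop the outer parentheses and split into one record per tuple.
    # Simplification: assumes "),(" never occurs inside a quoted value.
    records = values.strip().rstrip(";")[1:-1].split("),(")
    # Assumed defaults for SQL-style quoting; callers can override them.
    params = {"quotechar": "'", "escapechar": "\\", "doublequote": False}
    params.update(fmtparams)
    for row in csv.reader(records, **params):
        yield [_maybe_number(v) for v in row] if convert_dtypes else row


sample = "(1,'Main\\'s Page',0.5),(2,'Talk',1);"
for row in iter_rows(sample, convert_dtypes=True):
    print(row)
```

Since the function is a generator, rows are produced one at a time and the whole table never has to be held in memory. Note that this sketch converts any text that looks numeric, including quoted strings, which a real parser would want to avoid.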

to_csv(file_path: Union[str, pathlib.Path], **fmtparams: Any) → None

Write Dump object to CSV file.

Parameters

file_path (PathObject) – The file to write to. Will be created if it doesn’t already exist. Will be overwritten if it does exist.
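The create-or-overwrite behaviour described above matches opening a file in write mode. The following sketch shows that pattern with csv.writer(). The function write_csv is hypothetical, and writing a header row of column names first is an assumption about the output layout rather than a documented guarantee.

```python
import csv
from pathlib import Path
from typing import Any, Iterable, List, Union


def write_csv(
    file_path: Union[str, Path],
    col_names: List[str],
    rows: Iterable[List[Any]],
    **fmtparams: Any,
) -> None:
    """Write a header row followed by data rows to a CSV file."""
    # Mode "w" creates the file if it is missing and truncates it otherwise.
    # newline="" lets the csv module control line endings itself.
    with open(file_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, **fmtparams)
        writer.writerow(col_names)
        writer.writerows(rows)
```

Passing an iterator for rows means that a generator of rows can be streamed straight to disk without building a list first.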

mwsql.utils

Utility functions used to download, open and display the contents of Wikimedia SQL dump files.

mwsql.utils.download_file(url: str, file_name: str) → Optional[pathlib.Path]

Download a file from a URL and show a progress indicator. Return the path to the downloaded file.

Parameters
  • url (str) – URL to download from

  • file_name (str) – Name of the file to download

Returns

Path to the downloaded file

Return type

Optional[pathlib.Path]

mwsql.utils.head(file_path: Union[str, pathlib.Path], n_lines: int = 10, encoding: str = 'utf-8') → None

Display first n lines of a file. Works with both .gz and uncompressed files. Defaults to 10 lines.

Parameters
  • file_path (PathObject) – The path to the file

  • n_lines (int, optional) – Lines to display, defaults to 10

  • encoding (str, optional) – Text encoding, defaults to “utf-8”
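Reading the first lines of a file that may or may not be gzip-compressed can be sketched as follows. This is an illustration using only the standard library, not the code in mwsql.utils: first_lines is a hypothetical name, compression is assumed to be detected from the .gz suffix, and the lines are returned instead of printed so that the result is easy to inspect.

```python
import gzip
from itertools import islice
from pathlib import Path
from typing import List, Union


def first_lines(
    file_path: Union[str, Path], n_lines: int = 10, encoding: str = "utf-8"
) -> List[str]:
    """Return up to n_lines lines from a plain or gzip-compressed text file."""
    path = Path(file_path)
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding=encoding) as f:
        # islice stops after n_lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(f, n_lines)]
```

Stopping early matters for dump files, which can be several gigabytes in size even when compressed.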

mwsql.utils.load(database: str, filename: str, date: str = 'latest', extension: str = 'sql') → Optional[Union[str, pathlib.Path]]

Load a dump file from a Wikimedia public directory if the user is in a supported environment (PAWS, Toolforge…). Otherwise, download the dump file from the web and save it in the current working directory. In both cases, the function returns a path-like object which can be used to access the file. Does not check if the file already exists on the path.

Parameters
  • database (str) – The database backup dump to download a file from, e.g. ‘enwiki’ (English Wikipedia). See a list of available databases here: https://dumps.wikimedia.org/backup-index-bydb.html

  • filename (str) – The name of the file to download, e.g. ‘page’ loads the file {database}-{date}-page.sql.gz

  • date (str, optional) – Date the dump was generated, defaults to “latest”. If “latest” is not used, the date format should be “YYYYMMDD”

  • extension (str, optional) – The file extension. Defaults to ‘sql’

Returns

Path to dump file

Return type

Optional[PathObject]
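The naming scheme implied by the parameters above can be sketched as follows. Both helpers are hypothetical and are not part of mwsql. The trailing .gz suffix is inferred from the {database}-{date}-page.sql.gz example, and the URL layout under https://dumps.wikimedia.org is an assumption about how the public dump site is organised.

```python
import re


def dump_file_name(
    database: str, filename: str, date: str = "latest", extension: str = "sql"
) -> str:
    """Build a name such as "enwiki-latest-page.sql.gz"."""
    if date != "latest" and not re.fullmatch(r"\d{8}", date):
        raise ValueError("date must be 'latest' or use the YYYYMMDD format")
    # Assumption: dump files are always gzip-compressed.
    return f"{database}-{date}-{filename}.{extension}.gz"


def dump_url(
    database: str,
    filename: str,
    date: str = "latest",
    extension: str = "sql",
    base_url: str = "https://dumps.wikimedia.org",
) -> str:
    """Build the assumed download URL for a dump file."""
    name = dump_file_name(database, filename, date, extension)
    # Assumed layout: {base_url}/{database}/{date}/{file name}
    return f"{base_url}/{database}/{date}/{name}"
```

Validating the date before building the name gives an early, readable error instead of a failed download later on.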