Module Reference
mwsql.dump
A set of utilities for processing MediaWiki SQL dump data.
- class mwsql.dump.Dump(database: str | None, table_name: str | None, col_names: List[str], col_sql_dtypes: Dict[str, str], primary_key: str | None, source_file: str | Path, encoding: str)
Class for parsing an SQL dump file and processing its contents.
- property dtypes: Dict[str, type]
Mapping between col_names and native Python dtypes.
- Returns:
A mapping from the column names in an SQL table to their respective Python data types. Example: {"ct_id": int}
- Return type:
Dict[str, type]
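As a rough illustration of what such a mapping makes possible, the sketch below casts a row of string values using a dict shaped like the one returned by dtypes. The helper convert_row, the column ct_name, and the sample values are hypothetical, and this is not the library's own conversion code.

```python
from typing import Any, Dict, List

# Hypothetical mapping shaped like the value returned by Dump.dtypes.
dtypes: Dict[str, type] = {"ct_id": int, "ct_name": str}


def convert_row(row: List[str], dtypes: Dict[str, type]) -> List[Any]:
    """Cast each string value to the Python type mapped to its column."""
    # Dicts preserve insertion order, so types line up with column order.
    return [py_type(value) for py_type, value in zip(dtypes.values(), row)]


converted = convert_row(["42", "mw-reverted"], dtypes)
print(converted)
```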
- property encoding: str
The encoding used to read the dump file.
- Getter:
Returns the current encoding
- Setter:
Sets the encoding to a new value
- Returns:
Text encoding
- Return type:
str
- classmethod from_file(file_path: str | Path, encoding: str = 'utf-8') T
Initialize Dump object from dump file.
- Parameters:
cls (Dump) – The Dump class
file_path (PathObject) – Path to source SQL dump file. Can be a .gz or an uncompressed file
encoding (str, optional) – Text encoding, defaults to “utf-8”. If you get an encoding error when processing the file, try setting this parameter to ‘Latin-1’.
- Returns:
A Dump class instance
- Return type:
Dump
- head(n_lines: int = 10, convert_dtypes: bool = False) None
Display first n rows.
- Parameters:
n_lines (int, optional) – Number of rows to display, defaults to 10
convert_dtypes (bool, optional) – When set to True, numerical types are shown as int or float instead of str. Defaults to False.
- rows(convert_dtypes: bool = False, strict_conversion: bool = False, **fmtparams: Any) Iterator[List[Any]]
Create a generator object from the rows.
- Parameters:
convert_dtypes (bool, optional) – When set to True, numerical types are converted from str to int or float. Defaults to False.
strict_conversion (bool, optional) – When set to True, an exception is raised on bad input when converting from SQL dtypes to Python dtypes. Defaults to False.
fmtparams – Any kwargs you want to pass to the csv.reader() function that does the actual parsing.
- Yield:
A generator used to iterate over the rows in the SQL table
- Return type:
Iterator[List[Any]]
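Since fmtparams is forwarded to csv.reader(), any standard csv formatting parameter can be supplied. The sketch below uses only the standard library to show the effect of such parameters on a line shaped like the contents of one SQL VALUES tuple. The sample line and the chosen parameter values are assumptions for illustration, not necessarily the settings mwsql uses internally.

```python
import csv

# A hypothetical line shaped like the inside of one SQL VALUES tuple.
line = r"1,'Main_Page','it\'s'"

# Formatting parameters of the kind that could be passed as **fmtparams.
fmtparams = {"quotechar": "'", "escapechar": "\\", "doublequote": False}

# csv.reader accepts any iterable of lines; take the single parsed row.
parsed = next(csv.reader([line], **fmtparams))
print(parsed)
```

With a Dump instance, the same keyword arguments would presumably be passed directly in the call, for example dump.rows(convert_dtypes=True, **fmtparams).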
- to_csv(file_path: str | Path, **fmtparams: Any) None
Write Dump object to CSV file.
- Parameters:
file_path (PathObject) – The file to write to. Will be created if it doesn’t already exist. Will be overwritten if it does exist.
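A minimal end-to-end sketch, assuming mwsql is installed and a dump file is available locally, could chain from_file, head, and to_csv as follows. The wrapper function dump_to_csv and the file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Union


def dump_to_csv(source: Union[str, Path], target: Union[str, Path],
                encoding: str = "utf-8") -> Path:
    """Parse an SQL dump file and write its rows to a CSV file."""
    # Imported lazily so this sketch can be loaded without mwsql installed.
    from mwsql.dump import Dump

    dump = Dump.from_file(source, encoding=encoding)
    dump.head(5)         # preview the first five rows before exporting
    dump.to_csv(target)  # creates the file, or overwrites an existing one
    return Path(target)
```

A call might look like dump_to_csv("simplewiki-latest-change_tag.sql.gz", "change_tag.csv"), with encoding="Latin-1" as the fallback suggested above if an encoding error occurs.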
mwsql.utils
Utility functions used to download, open and display the contents of Wikimedia SQL dump files.
- mwsql.utils.download_file(url: str, file_name: str) Path | None
Download a file from a URL and show a progress indicator. Return the path to the downloaded file.
- Parameters:
url (str) – URL to download from
file_name (str) – name of the file to download
- Returns:
path to the downloaded file
- Return type:
Optional[Path]
- mwsql.utils.head(file_path: str | Path, n_lines: int = 10, encoding: str = 'utf-8') None
Display first n lines of a file. Works with both .gz and uncompressed files. Defaults to 10 lines.
- Parameters:
file_path (PathObject) – The path to the file
n_lines (int, optional) – Lines to display, defaults to 10
encoding (str, optional) – Text encoding, defaults to “utf-8”
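One way to handle both .gz and uncompressed files is to choose the opener from the file suffix. The standard-library sketch below shows the idea; the function first_lines and the demo file are hypothetical, and this is not presented as the implementation used by mwsql.utils.head.

```python
import gzip
import tempfile
from itertools import islice
from pathlib import Path
from typing import List, Union


def first_lines(file_path: Union[str, Path], n_lines: int = 10,
                encoding: str = "utf-8") -> List[str]:
    """Return the first n_lines lines of a .gz or uncompressed text file."""
    path = Path(file_path)
    # Pick the opener from the suffix; both return a text-mode handle.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Demo: write a small gzip-compressed file and read back its first two lines.
demo_path = Path(tempfile.mkdtemp()) / "demo.sql.gz"
with gzip.open(demo_path, mode="wt", encoding="utf-8") as out:
    out.write("-- header\nCREATE TABLE t (id int);\nINSERT INTO t VALUES (1);\n")

preview = first_lines(demo_path, n_lines=2)
print(preview)
```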
- mwsql.utils.load(database: str, filename: str, date: str = 'latest', extension: str = 'sql') str | Path | None
Load a dump file from a Wikimedia public directory if the user is in a supported environment (PAWS, Toolforge…). Otherwise, download the dump file from the web and save it in the current working directory. In both cases, the function returns a path-like object that can be used to access the file. Does not check whether the file already exists at that path.
- Parameters:
database (str) – The database backup dump to download a file from, e.g. ‘enwiki’ (English Wikipedia). See a list of available databases here: https://dumps.wikimedia.org/backup-index-bydb.html
filename (str) – The name of the file to download, e.g. ‘page’ loads the file {database}-{date}-page.sql.gz
date (str, optional) – Date the dump was generated, defaults to “latest”. If “latest” is not used, the date format should be “YYYYMMDD”
extension (str, optional) – The file extension, defaults to ‘sql’
- Returns:
Path to dump file
- Return type:
Optional[PathObject]
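The parameters above combine into a file name following the pattern {database}-{date}-{filename}.sql.gz. The sketch below composes such a name; the helper dump_file_name is hypothetical, and appending .gz after the extension is an assumption based on the single example given for the filename parameter.

```python
def dump_file_name(database: str, filename: str, date: str = "latest",
                   extension: str = "sql") -> str:
    """Compose a dump file name following the documented pattern."""
    # Assumption: the file is gzip-compressed, as in the documented example.
    return f"{database}-{date}-{filename}.{extension}.gz"


default_name = dump_file_name("enwiki", "page")
dated_name = dump_file_name("simplewiki", "change_tag", date="20240101")
print(default_name)
print(dated_name)

# With mwsql installed, the corresponding call would be along the lines of:
# from mwsql.utils import load
# path = load("enwiki", "page")
```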