data_manipulation package¶
Submodules¶
data_manipulation.base module¶
- data_manipulation.base.clean_string(string: str, remove_parenthesis: bool = False, remove_brackets: bool = False) str [source]¶
Cleans and standardizes input string.
- Parameters:
string (str) – String to clean.
remove_parenthesis (bool) – Whether to remove content within parentheses.
remove_brackets (bool) – Whether to remove content within square brackets.
- Returns:
Uppercase cleaned string with standardized spacing.
- Return type:
str
- Raises:
ValueError – If input string is None or empty.
Examples
>>> clean_string("  sHawn  tesT ")
'SHAWN TEST'
>>> clean_string("shawn ( te st )")
'SHAWN ( TE ST )'
>>> clean_string("shawn ( te st )", remove_parenthesis=True)
'SHAWN'
>>> clean_string("shawn [ te st ]", remove_brackets=True)
'SHAWN'
- data_manipulation.base.create_encode_url(url: str, query_params: Dict[str, Any] | None = None) str [source]¶
Creates an encoded URL with query parameters.
- Parameters:
url (str) – Base URL.
query_params (dict, optional) – Dictionary of URL query parameters. Defaults to None.
- Returns:
Encoded URL with query parameters.
- Return type:
str
- Raises:
ValueError – If URL is invalid.
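Examples
A minimal sketch; the query string and the exact encoding of spaces ('+' vs '%20') are assumed here, not taken from the source:
>>> create_encode_url("https://example.com/search", {"q": "data manipulation"})
'https://example.com/search?q=data+manipulation'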
- data_manipulation.base.delete_list_indices(list_: list, indices: List[int]) None [source]¶
Deletes multiple indices from a list in-place.
- Parameters:
list_ (list) – Original list to modify.
indices (list) – List of indices to delete.
Examples
>>> values = [0, 1, 2, 3, 4]
>>> delete_list_indices(values, [1, 3])
>>> values
[0, 2, 4]
- data_manipulation.base.get_country_code_variation() dict [source]¶
Returns a dictionary mapping country codes to their name variations.
- Returns:
Dictionary with country codes as keys and lists of country name variations as values.
- Return type:
dict
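Examples
A hedged sketch of the expected shape; the key and its variation list are hypothetical, actual entries come from the package data:
>>> variations = get_country_code_variation()
>>> variations["SG"]
['SINGAPORE', 'REPUBLIC OF SINGAPORE']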
- data_manipulation.base.get_country_name_variation() dict [source]¶
Returns a dictionary mapping country names to their codes.
- Returns:
Dictionary with country names as keys and country codes as values.
- Return type:
dict
- data_manipulation.base.get_none_variation() List[None | str] [source]¶
Returns a list of common variations of None/null values.
- Returns:
List containing None and various string representations of null values.
- Return type:
list
Examples
>>> get_none_variation()
[None, 'NONE', 'NONe', 'NOnE', 'NOne', 'NoNE', 'NoNe', 'NonE', 'None', 'nONE', 'nONe', 'nOnE', 'nOne', 'noNE', 'noNe', 'nonE', 'none', 'NULL', 'NULl', 'NUlL', 'NUll', 'NuLL', 'NuLl', 'NulL', 'Null', 'nULL', 'nULl', 'nUlL', 'nUll', 'nuLL', 'nuLl', 'nulL', 'null', 'NA', 'Na', 'nA', 'na', 'N.A', 'N.a', 'N.A', 'N.a', 'n.A', 'n.a', 'n.A', 'n.a', 'N.A.', 'N.A.', 'N.a.', 'N.a.', 'N.A.', 'N.A.', 'N.a.', 'N.a.', 'n.A.', 'n.A.', 'n.a.', 'n.a.', 'n.A.', 'n.A.', 'n.a.', 'n.a.', 'NAN', 'NAn', 'NaN', 'Nan', 'nAN', 'nAn', 'naN', 'nan', 'NIL', 'NIl', 'NiL', 'Nil', 'nIL', 'nIl', 'niL', 'nil']
- data_manipulation.base.get_path_files(path: str | Path, keywords: List[str]) List[str] [source]¶
Returns sorted list of files from given path that contain specified keywords.
- Parameters:
path (str) – Directory path to search.
keywords (list) – List of keywords to match in filenames.
- Returns:
Sorted list of matching filenames.
- Return type:
List[str]
Examples
>>> get_path_files("test_base_folder", ["py"])
['test1.py', 'test2.py', 'test3.py', 'test4.py', 'test5.py']
- data_manipulation.base.get_string_case_combination(str_: str) List[str] [source]¶
Generates all possible case combinations of a string.
- Parameters:
str_ (str) – Input string to generate combinations for.
- Returns:
List of all possible case combinations.
- Return type:
List[str]
Examples
>>> get_string_case_combination("abc")
['ABC', 'ABc', 'AbC', 'Abc', 'aBC', 'aBc', 'abC', 'abc']
- data_manipulation.base.list_to_file(filepath: str | Path, list_: List, newline: bool = True) None [source]¶
Writes list contents to a file.
- Parameters:
filepath (str) – Path to output file.
list_ (list) – List of values to write.
newline (bool, optional) – Whether to add newline after each item. Defaults to True.
Examples
>>> list_to_file("test.txt", [1, 2, 3])
- data_manipulation.base.list_tuple_without_none(list_tuple: List | Tuple) List | Tuple [source]¶
Removes None variations from a list or tuple.
- Parameters:
list_tuple – Input list or tuple to clean.
- Returns:
Cleaned list or tuple.
- Return type:
Union[List, Tuple]
- Raises:
TypeError – If input is not a list or tuple.
Examples
>>> list_tuple_without_none(["a", "none"])
['a']
>>> list_tuple_without_none(("a", "none"))
('a',)
- data_manipulation.base.parse_ps_aux(ps_aux_commands: str) List[List[str]] [source]¶
Parses Linux ps aux command output into a list of records.
- Parameters:
ps_aux_commands (str) – Linux ps aux command string.
- Returns:
List of process records.
- Return type:
List[List[str]]
Examples
>>> # parse_ps_aux("ps aux | egrep -i '%cpu|anaconda3' | head")
- data_manipulation.base.remove_path_file(path: str | Path, keyword: str, n: int = 2) None [source]¶
Removes all but the n newest files matching the keyword.
- Parameters:
path (str) – Directory path.
keyword (str) – Keyword to match in filenames.
n (int, optional) – Number of newest files to keep. Defaults to 2.
Examples
>>> remove_path_file("test_base_folder", ".py")
- data_manipulation.base.string_boolean_to_int(boolean_str_rep: str) int [source]¶
Converts string boolean representations to integers.
- Parameters:
boolean_str_rep (str) – String representation of boolean value.
- Returns:
1 for true values, 0 for false values.
- Return type:
int
Examples
>>> string_boolean_to_int("true")
1
>>> string_boolean_to_int("True")
1
- data_manipulation.base.string_dlt_to_dlt(dlt_str_rep: str) Dict | List | Tuple [source]¶
Converts string representation of data structures to actual Python objects.
- Parameters:
dlt_str_rep (str) – String representation of dictionary/list/tuple.
- Returns:
Converted Python data structure.
- Return type:
Union[Dict, List, Tuple]
Examples
>>> string_dlt_to_dlt("[1, 2, 3]")
[1, 2, 3]
>>> string_dlt_to_dlt("{'a': 1, 'b': 2}")
{'a': 1, 'b': 2}
>>> string_dlt_to_dlt("('1', '2', '3')")
('1', '2', '3')
- data_manipulation.base.string_str_to_str(string_str_rep: str) str [source]¶
Converts string representation to a clean string by removing quotes.
- Parameters:
string_str_rep (str) – String representation to clean.
- Returns:
Cleaned string with outer quotes removed.
- Return type:
str
Examples
>>> string_str_to_str("'test'")
'test'
>>> string_str_to_str('"test"')
'test'
data_manipulation.beautifulsoup_ module¶
- data_manipulation.beautifulsoup_.build_soup(url: str, features: str = 'lxml', to_preprocess: bool = True, timeout: int = 10, headers: dict | None = None) BeautifulSoup | None [source]¶
Creates a BeautifulSoup object from a given URL.
- Parameters:
url (str) – URL to fetch and parse.
features (str, optional) – Parser to use. Defaults to “lxml”.
to_preprocess (bool, optional) – Whether to preprocess the HTML. Defaults to True.
timeout (int, optional) – Request timeout in seconds. Defaults to 10.
headers (Optional[dict], optional) – Custom headers for the request. Defaults to None.
- Returns:
Parsed BeautifulSoup object, or None if request fails.
- Return type:
Optional[BeautifulSoup]
Examples
>>> a = build_soup("https://google.com")
>>> type(a)
<class 'bs4.BeautifulSoup'>
Note
Requires requests and beautifulsoup4 packages.
- data_manipulation.beautifulsoup_.preprocess(html: str) str | None [source]¶
Removes redundant whitespace and newline characters from an HTML string.
- Parameters:
html (str) – HTML string to be cleaned.
- Returns:
Cleaned HTML string with normalized whitespace.
- Return type:
Optional[str]
Examples
>>> a = "<html> <p> Something </p> </html> "
>>> preprocess(a)
'<html><p>Something</p></html>'
Note
Reference: https://stackoverflow.com/questions/23241641
data_manipulation.boto3_ module¶
- data_manipulation.boto3_.list_s3_bucket_files(bucket: str, to_dateframe: bool = False, prefix: str | None = None) List[str] | pd.DataFrame [source]¶
Lists all files in an S3 bucket.
- Parameters:
bucket (str) – Name of the S3 bucket.
to_dateframe (bool, optional) – Whether to return results as pandas DataFrame. Defaults to False.
prefix (Optional[str], optional) – Filter results to files with this prefix. Defaults to None.
- Returns:
- List of file keys or DataFrame containing file keys.
If to_dateframe is True, returns DataFrame with ‘key’ column. If to_dateframe is False, returns list of file keys.
- Return type:
Union[List[str], pd.DataFrame]
Examples
>>> files = list_s3_bucket_files('my-bucket')
>>> type(files)
<class 'list'>
>>> df = list_s3_bucket_files('my-bucket', to_dateframe=True)
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
- data_manipulation.boto3_.send_aws_ses_email(sender: str, recipient: List[str], subject: str, body_text: str, body_type: str, ses_client: Any, attachment: str | None = None, max_attachment_size: int = 10485760, max_retries: int = 3) Dict[str, Any] | None [source]¶
Sends an email using AWS SES service.
- Parameters:
sender (str) – Sender’s email address.
recipient (list) – List of recipient email addresses.
subject (str) – Email subject line.
body_text (str) – Email body content.
body_type (str) – MIME type of email body (e.g., ‘plain’, ‘html’).
ses_client – AWS SES client instance.
attachment (str, optional) – Path to file to attach. Defaults to None.
max_attachment_size (int, optional) – Maximum allowed attachment size in bytes. Defaults to 10MB.
max_retries (int, optional) – Maximum number of retry attempts for failed sends. Defaults to 3.
- Returns:
AWS SES response dictionary if successful, None if failed.
- Return type:
dict
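Examples
A hypothetical sketch assuming a boto3 SES client; the addresses, region, and attachment path are illustrative:
>>> import boto3
>>> ses_client = boto3.client("ses", region_name="us-east-1")
>>> response = send_aws_ses_email(
...     sender="noreply@example.com",
...     recipient=["user@example.com"],
...     subject="Monthly report",
...     body_text="<p>Report attached.</p>",
...     body_type="html",
...     ses_client=ses_client,
...     attachment="report.pdf",
... )
>>> response is not None  # SES response dict on success, None on failure
True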
data_manipulation.cryptography_ module¶
- data_manipulation.cryptography_.decrypt_fernet_data(keypath: str | Path, filepath: str | Path) bytes [source]¶
Decrypts a file using Fernet symmetric encryption.
- Parameters:
keypath (str) – Path to the Fernet key file.
filepath (str) – Path to the encrypted file.
- Returns:
Decrypted data.
- Return type:
bytes
- Raises:
FileNotFoundError – If key file or input file doesn’t exist
InvalidToken – If the key is invalid or data is corrupted
TypeError – If input types are incorrect
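Examples
A minimal sketch assuming 'key.txt' holds a Fernet key and 'data.txt' was encrypted with it via encrypt_fernet_file(); filenames are illustrative:
>>> decrypted = decrypt_fernet_data('key.txt', 'data.txt')
>>> isinstance(decrypted, bytes)
True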
- data_manipulation.cryptography_.encrypt_fernet_file(keypath: str | Path, filepath: str | Path) bytes [source]¶
Encrypts a file using Fernet symmetric encryption.
- Parameters:
keypath (str) – Path to the Fernet key file.
filepath (str) – Path to the file to be encrypted.
- Returns:
Encrypted data.
- Return type:
bytes
- Raises:
TypeError – If keypath or filepath are not strings.
Examples
>>> encrypted = encrypt_fernet_file('key.txt', 'data.txt')
>>> isinstance(encrypted, bytes)
True
- data_manipulation.cryptography_.generate_fernet_key(output_directory: str | Path, output_filename: str) bytes [source]¶
Generates and saves a Fernet encryption key.
- Parameters:
output_directory (str) – Directory path where the key file will be saved.
output_filename (str) – Name of the key file to be created.
- Returns:
Generated Fernet key in bytes format.
- Return type:
bytes
- Raises:
Exception – If key generation or file writing fails.
Examples
>>> key = generate_fernet_key('/path/to/keys', 'encryption.key')
>>> isinstance(key, bytes)
True
data_manipulation.django_ module¶
- data_manipulation.django_.django_validate_email(email: str, whitelist_domains: List[str] | None = None) str | None [source]¶
Validates an email address using Django’s validator.
- Parameters:
email (str) – Email address to validate.
whitelist_domains (Optional[List[str]]) – List of allowed email domains.
- Returns:
The validated email address if valid, None if invalid.
- Return type:
Optional[str]
Examples
>>> django_validate_email("valid@email.com")
'valid@email.com'
>>> django_validate_email("valid@email.com", whitelist_domains=["email.com"])
'valid@email.com'
>>> print(django_validate_email("valid@email.com", whitelist_domains=["other.com"]))
None
- data_manipulation.django_.django_validate_phone(phone: str, region: str | None = None) str | None [source]¶
Validates and formats a phone number using Django’s phone number field.
- Parameters:
phone (str) – Phone number to validate.
region (Optional[str], optional) – Region code for parsing local numbers. Defaults to None.
- Returns:
The phone number in E.164 format if valid, None if invalid.
- Return type:
Optional[str]
Examples
>>> django_validate_phone("+1234567890")
'+1234567890'
>>> print(django_validate_phone("invalid"))
None
Note
Requires django-phonenumber-field package to be installed.
- data_manipulation.django_.django_validate_url(url: str, allowed_schemes: List[str] | None = None) str | None [source]¶
Validates a URL using Django’s URL validator.
- Parameters:
url (str) – URL to validate.
allowed_schemes (Optional[List[str]]) – List of allowed URL schemes (e.g., [‘http’, ‘https’]).
- Returns:
The validated URL if valid, None if invalid.
- Return type:
Optional[str]
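Examples
A minimal sketch following the return contract described above; the URLs are illustrative:
>>> django_validate_url("https://example.com")
'https://example.com'
>>> print(django_validate_url("ftp://example.com", allowed_schemes=["http", "https"]))
None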
- data_manipulation.django_.get_django_countries_dict() Tuple[Dict[str, str], Dict[str, str]] [source]¶
Retrieves dictionaries mapping country codes to names and vice versa.
- Returns:
- A tuple containing:
code_name: Dict mapping country codes to uppercase country names
name_code: Dict mapping uppercase country names to codes
- Return type:
Tuple[Dict[str, str], Dict[str, str]]
- Raises:
ImportError – If django-countries package is not installed.
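Examples
A hedged sketch of the expected shape; actual entries come from the django-countries package:
>>> code_name, name_code = get_django_countries_dict()
>>> code_name["SG"]
'SINGAPORE'
>>> name_code["SINGAPORE"]
'SG'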
- data_manipulation.django_.init_django(django_dir: str, project_name: str | None = None) None [source]¶
Initializes Django environment for external Python scripts or notebooks.
- Parameters:
django_dir (str) – Path to Django project directory.
project_name (Optional[str]) – Name of the Django project. If None, will check DJANGO_PROJECT env var.
- Raises:
ValueError – If project_name is not provided and DJANGO_PROJECT environment variable is not set.
ImportError – If Django cannot be imported or setup fails.
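Examples
A hypothetical setup for a notebook or standalone script; the directory, project, and app names are illustrative:
>>> init_django("/path/to/django_project", "myproject")
>>> from myapp.models import MyModel  # ORM is now usable outside manage.py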
data_manipulation.flask_ module¶
data_manipulation.geopandas_ module¶
- data_manipulation.geopandas_.haversine_np(longitude1: NumericType, latitude1: NumericType, longitude2: NumericType, latitude2: NumericType) NumericType [source]¶
Calculates the great circle distance between two points on Earth.
- Parameters:
longitude1 (float | ndarray) – Longitude of the first point(s) in decimal degrees.
latitude1 (float | ndarray) – Latitude of the first point(s) in decimal degrees.
longitude2 (float | ndarray) – Longitude of the second point(s) in decimal degrees.
latitude2 (float | ndarray) – Latitude of the second point(s) in decimal degrees.
- Returns:
Distance between the points in kilometers.
- Return type:
float | ndarray
- Raises:
ValueError – If latitude values are outside [-90, 90] or longitude values outside [-180, 180].
Examples
>>> haversine_np(-0.127758, 51.507351, 103.819836, 1.352083)  # London to Singapore
10880.39...
>>> haversine_np(np.array([-0.127758]), np.array([51.507351]),
...              np.array([103.819836]), np.array([1.352083]))
array([10880.39...])
Note
Uses the Haversine formula to calculate great circle distances. Earth radius is assumed to be 6371 km. Supports both scalar and numpy array inputs for vectorized calculations.
data_manipulation.kerberos_ module¶
- data_manipulation.kerberos_.keytab_valid(keytab_filepath: str, principal_name: str) CompletedProcess[str] [source]¶
Validates a Kerberos keytab file using kinit.
- Parameters:
keytab_filepath (str) – Path to the Kerberos keytab file.
principal_name (str) – Kerberos principal name associated with the keytab.
- Returns:
- Result of the kinit command execution containing:
returncode: 0 if successful, non-zero if failed
stdout: Standard output from the command
stderr: Standard error from the command
- Return type:
subprocess.CompletedProcess[str]
- Raises:
ValueError – If keytab file doesn’t exist or inputs are invalid
FileNotFoundError – If kinit command is not available
subprocess.SubprocessError – If command execution fails
Examples
>>> result = keytab_valid("/path/to/keytab", "user@REALM.COM")
>>> result.returncode == 0  # True if keytab is valid
True
data_manipulation.mysql_connector_python_ module¶
- exception data_manipulation.mysql_connector_python_.DatabaseError[source]¶
Bases:
Exception
Custom exception for database-related errors.
- data_manipulation.mysql_connector_python_.create_connection_pool(host: str, dbname: str, user: str, password: str, port: int = 3306, pool_size: int = 5) Any [source]¶
Creates a connection pool for MySQL database.
- Parameters:
host (str) – Database host address.
dbname (str) – Database name to connect to.
user (str) – Username for authentication.
password (str) – Password for authentication.
port (int, optional) – Database port number. Defaults to 3306.
pool_size (int, optional) – Size of the connection pool. Defaults to 5.
- Returns:
Database connection pool object.
- Return type:
Any
- Raises:
DatabaseError – If pool creation fails.
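Examples
A minimal sketch with placeholder credentials:
>>> pool = create_connection_pool(
...     host="localhost",
...     dbname="mydb",
...     user="admin",
...     password="secret",
...     pool_size=5,
... )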
- data_manipulation.mysql_connector_python_.execute_query(connection: Any, sql_query: str, data: dict | tuple | None = None, fetch: bool = False) int | None | List[Dict[str, Any]] [source]¶
Executes a MySQL query with parameters.
- Parameters:
connection (ConnectionType) – Active MySQL connection.
sql_query (str) – SQL query to execute.
data (Optional[Union[dict, tuple]], optional) – Parameters for the SQL query.
fetch (bool, optional) – Whether to fetch results. Defaults to False.
- Returns:
For INSERT: Last inserted row ID
For SELECT: List of dictionaries containing the results
None for other operations or on failure
- Return type:
Union[int, List[Dict[str, Any]], None]
- Raises:
DatabaseError – If query execution fails.
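Examples
A hedged sketch assuming an open connection (e.g. from get_connection()); the table and column names are hypothetical:
>>> rows = execute_query(
...     conn,
...     "SELECT id, name FROM users WHERE id = %(id)s",
...     data={"id": 1},
...     fetch=True,
... )  # e.g. [{'id': 1, 'name': 'Alice'}]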
- data_manipulation.mysql_connector_python_.get_connection(host: str, dbname: str, user: str, password: str, port: int = 3306, use_pool: bool = False, pool: Any | None = None) Generator[Any, None, None] [source]¶
Context manager for database connections.
- Parameters:
host (str) – Database host address.
dbname (str) – Database name to connect to.
user (str) – Username for authentication.
password (str) – Password for authentication.
port (int, optional) – Database port number. Defaults to 3306.
use_pool (bool, optional) – Whether to use connection pooling. Defaults to False.
pool (Optional[Any], optional) – Existing connection pool.
- Yields:
ConnectionType – Database connection object.
- Raises:
DatabaseError – If connection fails.
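Examples
A minimal sketch with placeholder credentials; pairs naturally with execute_query():
>>> with get_connection("localhost", "mydb", "admin", "secret") as conn:
...     rows = execute_query(conn, "SELECT 1 AS one", fetch=True)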
data_manipulation.openldap_ module¶
- data_manipulation.openldap_.ldapsearch(search_base: str, ldap_uri: str, bind_dn: str, password: str, search_filter: str) CompletedProcess [source]¶
Executes an LDAP search query using the ldapsearch command.
- Parameters:
search_base (str) – Base DN for the search operation.
ldap_uri (str) – LDAP server URI (e.g., “ldap://example.com:389”).
bind_dn (str) – Distinguished Name (DN) for binding to the LDAP server.
password (str) – Password for authentication.
search_filter (str) – LDAP search filter (e.g., “(objectClass=person)”).
- Returns:
Result of the ldapsearch command execution.
- Return type:
subprocess.CompletedProcess
- Raises:
ValueError – If any input parameters are empty or invalid.
subprocess.SubprocessError – If the ldapsearch command fails to execute.
FileNotFoundError – If ldapsearch command is not available.
Examples
>>> result = ldapsearch(
...     "dc=example,dc=com",
...     "ldap://ldap.example.com",
...     "cn=admin,dc=example,dc=com",
...     "password",
...     "(objectClass=person)"
... )
>>> result.returncode == 0  # True if search was successful
True
data_manipulation.pandas_ module¶
- data_manipulation.pandas_.add_type_columns(dataframe: DataFrame) DataFrame [source]¶
Adds type information columns for each column in the dataframe.
- Parameters:
dataframe (pd.DataFrame) – Input dataframe to analyze.
- Returns:
- DataFrame with additional type columns. For each original column ‘X’,
adds a column ‘X_type’ containing the Python type of each value.
- Return type:
pd.DataFrame
- Raises:
TypeError – If input is not a pandas DataFrame.
Examples
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
>>> add_type_columns(df)
   a         a_type  b         b_type
0  1  <class 'int'>  x  <class 'str'>
1  2  <class 'int'>  y  <class 'str'>
- data_manipulation.pandas_.aggregate_set_without_none(column: Series, nested_set: bool = False) set [source]¶
Creates a set from a series, excluding None values.
- Parameters:
column (pd.Series) – Series to aggregate into a set.
nested_set (bool, optional) – Whether to handle nested sets. Defaults to False.
- Returns:
Set containing non-None values from the series.
- Return type:
set
Examples
>>> s = pd.Series([1, None, 2, None, 3])
>>> aggregate_set_without_none(s)
{1.0, 2.0, 3.0, nan, nan}
- data_manipulation.pandas_.chunking_dataframe(dataframe: DataFrame, chunk_size: int) List[DataFrame] [source]¶
Splits a dataframe into smaller chunks of specified size.
- Parameters:
dataframe (pd.DataFrame) – DataFrame to split.
chunk_size (int) – Maximum number of rows in each chunk.
- Returns:
List of DataFrames, each containing at most chunk_size rows.
- Return type:
List[pd.DataFrame]
- Raises:
TypeError – If dataframe is not a pandas DataFrame or chunk_size is not an integer.
Examples
>>> df = pd.DataFrame({'a': range(5)})
>>> chunks = chunking_dataframe(df, 2)
>>> [len(chunk) for chunk in chunks]
[2, 2, 1]
- data_manipulation.pandas_.clean_none(dataframe: DataFrame, nan_to_none: bool = True, clean_variation: bool = True, none_variations: List[str] = []) DataFrame [source]¶
Standardizes None values in a dataframe.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame.
nan_to_none (bool, optional) – Convert NaN to None. Defaults to True.
clean_variation (bool, optional) – Clean common None variations. Defaults to True.
none_variations (List[str], optional) – Additional None variations to clean. Defaults to [].
- Returns:
DataFrame with standardized None values.
- Return type:
pd.DataFrame
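Examples
A hedged sketch; the exact behavior depends on the pandas version, per the Note below:
>>> df = pd.DataFrame({"a": [1, "none", "NULL"]})
>>> clean_none(df)["a"].tolist()
[1, None, None]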
Note
Deprecated as of pandas 1.3.0.
- data_manipulation.pandas_.compare_all_list_items(list_: List[Any]) DataFrame [source]¶
Creates a DataFrame comparing all possible pairs of items in a list.
- Parameters:
list_ (List[Any]) – Input list containing items to compare
- Returns:
- DataFrame with columns:
item1: First item in comparison
item2: Second item in comparison
item1_eq_item2: Boolean indicating if items are equal
- Return type:
pd.DataFrame
- Raises:
TypeError – If input is not a list
Examples
>>> compare_all_list_items([1, 1, 2, 3])
  item1 item2  item1_eq_item2
0     1     1            True
1     1     2           False
2     1     3           False
3     1     2           False
4     1     3           False
5     2     3           False
- data_manipulation.pandas_.compare_dataframes(dataframe1: DataFrame, dataframe2: DataFrame) None [source]¶
Compares two dataframes and prints detailed comparison information.
- Parameters:
dataframe1 (pd.DataFrame) – First DataFrame to compare.
dataframe2 (pd.DataFrame) – Second DataFrame to compare.
- Raises:
TypeError – If either input is not a pandas DataFrame.
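Examples
A minimal sketch; the function prints its report rather than returning a value:
>>> df1 = pd.DataFrame({'a': [1, 2]})
>>> df2 = pd.DataFrame({'a': [1, 2, 3]})
>>> compare_dataframes(df1, df2)  # prints row counts and column-wise differences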
Note
Prints comparison results including:
- Row counts
- Column-wise comparison of non-null values
- Detailed analysis of differences when counts don’t match
- data_manipulation.pandas_.config_pandas_display() None [source]¶
Configures pandas display settings for better output readability.
- Sets the following pandas display options:
display.max_columns: 500
display.max_colwidth: 500
display.expand_frame_repr: True
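Examples
A minimal sketch; the option values follow the list above:
>>> config_pandas_display()
>>> import pandas as pd
>>> pd.get_option("display.max_columns")
500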
Note
Modifies global pandas settings.
- data_manipulation.pandas_.dtypes_dictionary(dataframe: DataFrame) Dict[type, List[str]] [source]¶
Creates a dictionary mapping Python types to column names in a DataFrame.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame to analyze
- Returns:
- Dictionary where:
keys are Python types (e.g., str, int, float)
values are lists of column names with that type
- Return type:
Dict[type, List[str]]
- Raises:
TypeError – If input is not a DataFrame
ValueError – If DataFrame is empty
Examples
>>> df = pd.DataFrame({
...     'int_': [0],
...     'str_': ['0'],
...     'list_': [[]],
...     'dict_': [{}],
...     'none_': [None]
... })
>>> dtypes_dictionary(df)
{<class 'int'>: ['int_'], <class 'str'>: ['str_'], <class 'list'>: ['list_'], <class 'dict'>: ['dict_'], <class 'NoneType'>: ['none_']}
- data_manipulation.pandas_.head_tail(dataframe: DataFrame, n: int = 5) DataFrame [source]¶
Returns first and last n rows of a dataframe.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame.
n (int, optional) – Number of rows from top and bottom. Defaults to 5.
- Returns:
Concatenated first and last n rows.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({'a': range(10)})
>>> len(head_tail(df, 2))
4
- data_manipulation.pandas_.index_marks(n_rows: int, chunk_size: int) range [source]¶
Calculates indices for splitting data into chunks of specified size.
- Parameters:
n_rows (int) – Total number of rows in the dataset
chunk_size (int) – Desired size of each chunk
- Returns:
Range object containing indices where chunks should be split
- Return type:
range
- Raises:
TypeError – If either n_rows or chunk_size is not an integer
Examples
>>> index_marks(10, 3)
range(3, 12, 3)
>>> list(index_marks(10, 3))
[3, 6, 9]
Note
Used internally by chunking_dataframe() to determine split points for breaking large datasets into smaller chunks.
- data_manipulation.pandas_.is_running_in_jupyter() bool [source]¶
Checks if code is running in a Jupyter notebook environment.
- Returns:
True if running in Jupyter notebook, False otherwise.
- Return type:
bool
Examples
>>> is_running_in_jupyter()
True   # When running in Jupyter
False  # When running in regular Python
- data_manipulation.pandas_.print_dataframe_overview(dataframe: DataFrame, stats: bool = False) None [source]¶
Prints comprehensive overview of DataFrame contents and statistics.
Displays for each column:
- Number of unique elements
- Number of null elements
- Value counts with percentages
- Optional statistics (min/max) if stats=True
- Parameters:
dataframe (pd.DataFrame) – DataFrame to analyze
stats (bool, optional) – Whether to include min/max statistics. Defaults to False.
- Raises:
TypeError – If input is not a pandas DataFrame
Examples
>>> data = {"int_": [1, 1, 2], "str_": ["a", "b", "b"]}
>>> df = pd.DataFrame(data)
>>> print_dataframe_overview(df)
==============================
int_
==============================
Unique elements: 2
Null elements: 0
      count    percent
int_
1         2  66.666667
2         1  33.333333
==============================
str_
==============================
Unique elements: 2
Null elements: 0
      count    percent
str_
b         2  66.666667
a         1  33.333333
- data_manipulation.pandas_.ps_aux_dataframe(ps_aux_commands: str) DataFrame [source]¶
Converts Linux ps aux command output to a DataFrame.
- Parameters:
ps_aux_commands (str) – ps aux command string to execute.
- Returns:
- DataFrame containing process information with columns matching
ps aux output format.
- Return type:
pd.DataFrame
Examples
>>> df = ps_aux_dataframe("ps aux | head")
>>> 'PID' in df.columns
True
- data_manipulation.pandas_.series_count(series: Series) DataFrame [source]¶
Calculates value counts and percentages for a Series.
Enhanced version of series.value_counts() that includes percentage calculations.
- Parameters:
series (pd.Series) – Input series to analyze
- Returns:
- DataFrame with columns:
count: Frequency of each unique value
percent: Percentage of total (0-100) for each value
- Return type:
pd.DataFrame
- Raises:
TypeError – If input is not a pandas Series
Examples
>>> values = [1, 1, 1, 2, 2, 3]
>>> s = pd.Series(values)
>>> series_count(s)
   count    percent
1      3  50.000000
2      2  33.333333
3      1  16.666667
- data_manipulation.pandas_.series_to_columns(dataframe: DataFrame, column: str) DataFrame [source]¶
Expands a column containing dictionaries into separate columns.
- Parameters:
dataframe (pd.DataFrame) – Input dataframe.
column (str) – Name of column containing dictionaries to expand.
- Returns:
DataFrame with dictionary column expanded into separate columns.
- Return type:
pd.DataFrame
Examples
>>> values = [[1, {'a':1, 'b':2}]]
>>> cols = ["c1", "c2"]
>>> df = pd.DataFrame(values, columns=cols)
>>> df
   c1                c2
0   1  {'a': 1, 'b': 2}
>>> series_to_columns(df, "c2")
   c1  a  b
0   1  1  2
- data_manipulation.pandas_.split_dataframe(dataframe: DataFrame, uuid: str, columns: List[str]) Tuple[DataFrame, DataFrame] [source]¶
Splits a dataframe into two based on specified columns while maintaining a linking ID.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame to split.
uuid (str) – Name of the column containing unique identifiers.
columns (List[str]) – List of column names to split into second DataFrame.
- Returns:
- Tuple containing:
First DataFrame with specified columns removed
Second DataFrame with only specified columns and uuid
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- Raises:
TypeError – If inputs are not of correct types.
Examples
>>> df = pd.DataFrame({'id': [1], 'a': [2], 'b': [3]})
>>> df1, df2 = split_dataframe(df, 'id', ['b'])
>>> df1
   id  a
0   1  2
>>> df2
   b  id
0  3   1
- data_manipulation.pandas_.split_left_merged_dataframe(dataframe1: DataFrame, dataframe2: DataFrame, columns: List[str]) Tuple[DataFrame, DataFrame] [source]¶
Splits result of a left merge into matched and unmatched records.
- Parameters:
dataframe1 (pd.DataFrame) – Left DataFrame for merge.
dataframe2 (pd.DataFrame) – Right DataFrame for merge.
columns (List[str]) – Columns to merge on.
- Returns:
- Tuple containing:
DataFrame with rows that matched in both inputs
DataFrame with rows only present in left input
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo', "zen"], 'value': [1, 2, 3, 5, 9]})
>>> df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
>>> df_both, df_left = split_left_merged_dataframe(df1, df2, ["key"])
            count    percent
_merge
both            6  85.714286
left_only       1  14.285714
right_only      0   0.000000
>>> df_both
   key  value  value_y _merge
0  foo      1      5.0   both
1  foo      1      8.0   both
2  bar      2      6.0   both
3  baz      3      7.0   both
4  foo      5      5.0   both
5  foo      5      8.0   both
>>> df_left
   key  value  value_y     _merge
6  zen      9      NaN  left_only
- data_manipulation.pandas_.to_excel_keep_url(filepath: str, dataframe: DataFrame) None [source]¶
Exports DataFrame to Excel while preserving URLs as clickable links.
Handles Excel’s automatic URL conversion and limitations:
- Preserves URLs without auto-conversion
- Handles Excel’s limit of 65,530 URLs per worksheet
- Manages URLs longer than 255 characters
- Parameters:
filepath (str) – Output Excel file path (including .xlsx extension)
dataframe (pd.DataFrame) – DataFrame to export
- Raises:
TypeError – If filepath is not a string or dataframe is not a DataFrame
IOError – If file cannot be written to specified location
Examples
>>> df = pd.DataFrame({
...     'url': ['https://www.example.com'],
...     'data': ['test']
... })
>>> to_excel_keep_url('output.xlsx', df)
Excel exported ...
Note
Uses xlsxwriter engine with specific options to prevent URL auto-conversion. Make sure the target directory exists before calling this function.
- data_manipulation.pandas_.useless_columns(dataframe: DataFrame) Tuple[List[str], List[str]] [source]¶
Identifies empty and single-value columns in a DataFrame.
- Parameters:
dataframe (pd.DataFrame) – DataFrame to analyze
- Returns:
- Two lists containing:
List of column names that are completely empty
List of column names that contain only a single unique value
- Return type:
Tuple[List[str], List[str]]
- Raises:
TypeError – If input is not a pandas DataFrame
Examples
>>> df = pd.DataFrame({
...     'empty': [None, None],
...     'single': ['A', 'A'],
...     'normal': ['A', 'B']
... })
>>> empty_cols, single_cols = useless_columns(df)
Empty columns: ['empty']
Single value columns: ['single']
>>> print(f"Empty columns: {empty_cols}")
Empty columns: ['empty']
>>> print(f"Single value columns: {single_cols}")
Single value columns: ['single']
Note
Empty columns are those where all values are None/NaN. Single-value columns may be candidates for removal or conversion to constants.
data_manipulation.prometheus_ module¶
data_manipulation.postgres_ module¶
data_manipulation.pyspark_ module¶
- data_manipulation.pyspark_.add_dummy_columns(dataframe: DataFrame, columns: List[str], value: str) DataFrame [source]¶
Adds new columns with default values to a Spark DataFrame.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
columns (List[str]) – List of column names to add
value (str) – Default value for the new columns
- Returns:
DataFrame with added columns
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If arguments are not of correct type:
- dataframe must be a Spark DataFrame
- columns must be a list
- value must be a string
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> new_df = add_dummy_columns(df, ["age", "city"], "unknown")
>>> new_df.show()
+-----+---+-------+-------+
| name| id|    age|   city|
+-----+---+-------+-------+
|Alice|  1|unknown|unknown|
+-----+---+-------+-------+
- data_manipulation.pyspark_.column_into_list(dataframe: DataFrame, column: str) List[Any] [source]¶
Extracts values from a DataFrame column into a Python list.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
column (str) – Name of the column to extract
- Returns:
- List containing all values from the specified column,
including duplicates
- Return type:
List[Any]
- Raises:
TypeError – If dataframe is not a Spark DataFrame or column is not a string
Examples
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ["value"])
>>> column_into_list(df, "value")
[1, 2, 2]
- data_manipulation.pyspark_.column_into_set(dataframe: DataFrame, column: str) Set[Any] [source]¶
Extracts unique values from a DataFrame column into a Python set.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
column (str) – Name of the column to extract
- Returns:
Set containing unique values from the specified column
- Return type:
Set[Any]
- Raises:
TypeError – If dataframe is not a Spark DataFrame or column is not a string
Examples
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ["value"])
>>> column_into_set(df, "value")
{1, 2}
- data_manipulation.pyspark_.columns_prefix(dataframe: DataFrame, prefix: str) DataFrame [source]¶
Adds a prefix to all column names in a DataFrame.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
prefix (str) – Prefix to add to column names
- Returns:
DataFrame with renamed columns
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If dataframe is not a Spark DataFrame or prefix is not a string
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> new_df = columns_prefix(df, "user_")
>>> new_df.show()
+---------+-------+
|user_name|user_id|
+---------+-------+
|    Alice|      1|
+---------+-------+
- data_manipulation.pyspark_.columns_statistics(dataframe: DataFrame, n: int = 10) Tuple[List[str], List[str]] [source]¶
Analyzes column statistics and identifies empty and single-value columns.
Performs comprehensive analysis of each column including:
- Value counts
- Empty value detection
- Single value detection
- Basic statistics
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
n (int, optional) – Number of top values to display for each column. Defaults to 10.
- Returns:
- Two lists containing:
List of empty column names
List of single-value column names
- Return type:
Tuple[List[str], List[str]]
- Raises:
TypeError – If dataframe is not a Spark DataFrame
Examples
>>> df = spark.createDataFrame([
...     ("Alice", None),
...     ("Alice", None)
... ], ["name", "email"])
>>> empty_cols, single_cols = columns_statistics(df)
>>> print(f"Empty columns: {empty_cols}")
Empty columns: ['email']
>>> print(f"Single value columns: {single_cols}")
Single value columns: ['name']
- data_manipulation.pyspark_.config_spark_local(autoset: bool = True) None [source]¶
Configures Spark for local execution with optimized settings based on system resources.
Automatically calculates and sets optimal Spark configuration parameters based on:
- Available CPU cores
- System memory
- Executor allocation
- Memory distribution
- Parameters:
autoset (bool, optional) – Whether to automatically apply the configuration. If False, only prints recommended settings. Defaults to True.
Examples
>>> config_spark_local()
Here is the current computer specs ...
executor_per_node: 1
spark_executor_instances: 1
total_executor_memory: 30
spark_executor_memory: 27
memory_overhead: 3
spark_default_parallelism: 10
spark.sql.execution.arrow.pyspark.enabled recommended by Koalas ...
spark auto-configured ...
Note
Configuration includes:
- Executor cores and memory
- Driver cores and memory
- Memory overhead
- Default parallelism
- Shuffle partitions
- Arrow optimization for PySpark
- data_manipulation.pyspark_.describe(dataframe: DataFrame) None [source]¶
Prints comprehensive information about a DataFrame.
Displays:
- DataFrame type
- Number of columns
- Number of rows
- Schema information
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
- Raises:
TypeError – If dataframe is not a Spark DataFrame
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> describe(df)
The dataframe: <class 'pyspark.sql.dataframe.DataFrame'>
Number of columns: 2
Number of rows: 1
root
 |-- name: string (nullable = true)
 |-- id: long (nullable = true)
- data_manipulation.pyspark_.group_count(dataframe: DataFrame, columns: str | List[str], n: int | float = 10) DataFrame [source]¶
Performs group by operation and calculates count and percentage for each group.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
columns (Union[str, List[str]]) – Column(s) to group by
n (Union[int, float], optional) – Number of top groups to return. Use float(‘inf’) for all groups. Defaults to 10.
- Returns:
- DataFrame with columns:
Group by column(s)
count: Count of records in each group
percent: Percentage of total records in each group
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If arguments are not of correct type
Examples
>>> df = spark.createDataFrame([
...     (1, 'A'), (1, 'B'), (2, 'A')
... ], ["id", "category"])
>>> group_count(df, ["id"]).show()
+---+-----+-------+
| id|count|percent|
+---+-----+-------+
|  1|    2|   66.7|
|  2|    1|   33.3|
+---+-----+-------+
- data_manipulation.pyspark_.rename(dataframe: DataFrame, columns: Dict[str, str]) DataFrame [source]¶
Renames multiple columns in a DataFrame using a mapping dictionary.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
columns (Dict[str, str]) – Dictionary mapping old column names to new names
- Returns:
DataFrame with renamed columns
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If dataframe is not a Spark DataFrame or columns is not a dict
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["_1", "_2"])
>>> new_df = rename(df, {"_1": "name", "_2": "id"})
>>> new_df.show()
+-----+---+
| name| id|
+-----+---+
|Alice|  1|
+-----+---+
data_manipulation.smtplib_ module¶
- data_manipulation.smtplib_.send_email(logname: str, message_subject: str, message_sender: str, message_receiver: str, html: str, smtp_address: str, smtp_port: int = 587, smtp_username: str | None = None, smtp_password: str | None = None) bool [source]¶
Sends an HTML email using SMTP with TLS and logs the operation.
- Parameters:
logname (str) – Path to the log file
message_subject (str) – Subject line of the email
message_sender (str) – Email address of the sender
message_receiver (str) – Email address of the recipient
html (str) – HTML content of the email body
smtp_address (str) – SMTP server address (e.g., ‘smtp.gmail.com’)
smtp_port (int, optional) – SMTP server port. Defaults to 587.
smtp_username (str, optional) – SMTP authentication username
smtp_password (str, optional) – SMTP authentication password
- Returns:
True if email was sent successfully, False otherwise
- Return type:
bool
- Raises:
ValueError – If required parameters are empty or invalid
SMTPException – If email sending fails
IOError – If log file cannot be accessed
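Examples
A hypothetical sketch; server details, credentials, and addresses are illustrative:
>>> sent = send_email(
...     logname="email.log",
...     message_subject="Weekly report",
...     message_sender="bot@example.com",
...     message_receiver="team@example.com",
...     html="<p>All good.</p>",
...     smtp_address="smtp.example.com",
...     smtp_username="bot@example.com",
...     smtp_password="app-password",
... )
>>> sent
True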
data_manipulation.sqlalchemy_ module¶
- exception data_manipulation.sqlalchemy_.DatabaseConnectionError[source]¶
Bases:
Exception
Custom exception for database connection errors.
- data_manipulation.sqlalchemy_.create_sqlalchemy_engine(drivername: str, host: str, dbname: str, user: str, password: str, port: int = 3306, pool_size: int = 5, max_overflow: int = 10, pool_timeout: int = 30, connect_timeout: int = 10, ssl_ca: str | None = None) Engine [source]¶
Creates and tests a SQLAlchemy engine for database operations.
- Parameters:
drivername (str) – Database driver name. Supported options include:
- ‘mysql+mysqlconnector’
- ‘mysql+pymysql’
- ‘postgresql+psycopg2’
- ‘mssql+pyodbc’
- ‘oracle+cx_oracle’
- ‘sqlite3’
host (str) – Database server hostname or IP address
dbname (str) – Name of the target database
user (str) – Database username for authentication
password (str) – Database password for authentication
port (int, optional) – Database server port number. Defaults to 3306.
pool_size (int, optional) – The size of the connection pool. Defaults to 5.
max_overflow (int, optional) – Maximum number of connections above pool_size. Defaults to 10.
pool_timeout (int, optional) – Timeout for getting a connection from pool. Defaults to 30.
connect_timeout (int, optional) – Timeout for database connections. Defaults to 10.
ssl_ca (Optional[str], optional) – Path to SSL CA certificate. Defaults to None.
- Returns:
Configured database engine object
- Return type:
sqlalchemy.engine.base.Engine
- Raises:
DatabaseConnectionError – If engine creation or connection test fails
Examples
>>> engine = create_sqlalchemy_engine(
...     drivername='postgresql+psycopg2',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
...     port=5432
... )
# Logs "create_sqlalchemy_engine: True" on success
# or "create_sqlalchemy_engine: False (error_message)" on failure
Note
The function automatically tests the connection upon creation and logs the result using loguru. A successful connection will be logged as info, while failures will be logged as errors with the specific exception message.
- data_manipulation.sqlalchemy_.create_sqlalchemy_url(drivername: str, host: str, dbname: str, user: str, password: str, port: int = 3306, query: Dict[str, Any] | None = None) URL [source]¶
Creates a SQLAlchemy URL object for database connection.
- Parameters:
drivername (str) – Database driver name. Supported options include:
- ‘mysql+mysqlconnector’
- ‘mysql+pymysql’
- ‘postgresql+psycopg2’
- ‘mssql+pyodbc’
- ‘oracle+cx_oracle’
- ‘sqlite3’
host (str) – Database server hostname or IP address
dbname (str) – Name of the target database
user (str) – Database username for authentication
password (str) – Database password for authentication
port (int, optional) – Database server port number. Defaults to 3306.
query (Optional[Dict[str, Any]], optional) – Additional connection parameters. Useful for SSL configuration. Defaults to None.
- Returns:
Configured URL object for database connection
- Return type:
sqlalchemy.engine.url.URL
Examples
>>> url = create_sqlalchemy_url(
...     drivername='postgresql+psycopg2',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
...     port=5432
... )
>>> str(url)
'postgresql+psycopg2://admin:secret@localhost:5432/mydb'