Structured Data Generation

Schema and Field

To generate data from a schema, create an instance of Field and call it with a string that names the provider method that should produce the value. The name can be written in one of two formats:

  • method: the first provider that has a method named method is used

  • provider.method: explicitly states that the method named method belongs to provider

Any additional keyword arguments (**kwargs) are passed on to that method. Then describe the schema in a lambda function (or any other callable object), pass it to Schema, and call the create() method.
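The two name formats can be pictured with a small, library-independent sketch. The provider classes, the PROVIDERS registry, and the resolve() and field() helpers below are hypothetical stand-ins written for illustration; they are not how Mimesis implements the lookup.

```python
# Illustrative only: toy providers and hypothetical helpers,
# not the lookup code that Mimesis actually uses.

class Person:
    def full_name(self):
        return "Lea Bohn"

    def email(self, domains=None):
        domain = (domains or ["example.org"])[0]
        return f"lea@{domain}"


class Text:
    def word(self):
        return "exit"


# Registration order matters for the bare "method" form.
PROVIDERS = {"person": Person(), "text": Text()}


def resolve(name):
    """Return the bound method that a field name refers to."""
    if "." in name:
        # "provider.method": look the method up on the named provider.
        provider_name, method_name = name.split(".", 1)
        return getattr(PROVIDERS[provider_name], method_name)
    # "method": take the first provider that has such a method.
    for provider in PROVIDERS.values():
        method = getattr(provider, name, None)
        if callable(method):
            return method
    raise LookupError(f"no provider has a method named {name!r}")


def field(name, **kwargs):
    """Resolve the name, then forward **kwargs to the method."""
    return resolve(name)(**kwargs)


print(field("text.word"))                        # explicit provider
print(field("email", domains=["mimesis.name"]))  # first provider with email()
```

The explicit provider.method form avoids any ambiguity when two providers happen to expose a method with the same name.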

Warning

The schema should be wrapped in a callable object so that it is evaluated anew on each iteration. If it were evaluated only once, every iteration would produce the same data.
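The effect of wrapping the schema in a callable can be shown without Mimesis at all. In the sketch below, next_pk() is a hypothetical stand-in for a field call; the point is only the difference between building a dictionary once and building it on every call.

```python
import itertools

counter = itertools.count(1)


def next_pk():
    """Stand-in for a field call: returns a new value each time."""
    return next(counter)


# Evaluated once: the dict is built here, so next_pk() runs a single time.
static_schema = {"pk": next_pk()}
static_rows = [static_schema for _ in range(3)]

# Wrapped in a callable: next_pk() runs again on every call.
dynamic_schema = lambda: {"pk": next_pk()}
dynamic_rows = [dynamic_schema() for _ in range(3)]

print(static_rows)   # [{'pk': 1}, {'pk': 1}, {'pk': 1}]
print(dynamic_rows)  # [{'pk': 2}, {'pk': 3}, {'pk': 4}]
```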

Example of usage:

from mimesis.enums import Gender, TimestampFormat
from mimesis.locales import Locale
from mimesis.schema import Field, Fieldset, Schema

field = Field(locale=Locale.EN)
fieldset = Fieldset(locale=Locale.EN)

schema = Schema(
    schema=lambda: {
        "pk": field("increment"),
        "uid": field("uuid"),
        "name": field("text.word"),
        "version": field("version", pre_release=True),
        "timestamp": field("timestamp", fmt=TimestampFormat.RFC_3339),
        "owner": {
            "email": field("person.email", domains=["mimesis.name"]),
            "token": field("token_hex"),
            "creator": field("full_name", gender=Gender.FEMALE),
        },
        "apps": fieldset(
            "text.word", i=2, key=lambda name: {"name": name, "id": field("uuid")}
        ),
    },
    iterations=2,
)
schema.create()

Output:

[
  {
    "apps": [
      {
        "id": "680b1947-e747-44a5-aec2-3558491cac34",
        "name": "exit"
      },
      {
        "id": "2c030612-229a-4415-8caa-82e070604f02",
        "name": "requirement"
      }
    ],
    "name": "undergraduate",
    "owner": {
      "creator": "Temple Martinez",
      "email": "franklin1919@mimesis.name",
      "token": "18c9c17aa696fd502f27a1e9d5aff5a4e0394133491358fb85c59d07eafd2694"
    },
    "pk": 1,
    "timestamp": "2005-04-30T10:37:26Z",
    "uid": "1d30ca34-349b-4852-a9b8-dc2ecf6c7b20",
    "version": "0.4.8-alpha.11"
  },
  {
    "apps": [
      {
        "id": "e5505358-b090-4784-9148-f2acce8d3451",
        "name": "taste"
      },
      {
        "id": "2903c277-826d-4deb-9e71-7b9fe061fc3f",
        "name": "upcoming"
      }
    ],
    "name": "advisory",
    "owner": {
      "creator": "Arlena Moreno",
      "email": "progress2030@mimesis.name",
      "token": "72f0102513053cd8942eaa85c0e0ffea47eed424e40eeb9cb5ba0f45880c2893"
    },
    "pk": 2,
    "timestamp": "2021-02-24T04:46:00Z",
    "uid": "951cd971-a6a4-4cdc-9c7d-79a2245ac4a0",
    "version": "6.0.0-beta.5"
  }
]

By default, Field works only with the providers supported by Generic. To change this behavior, pass the providers parameter with a sequence of data providers:

from mimesis.schema import Field
from mimesis.locales import Locale
from mimesis import builtins

custom_providers = (
    builtins.RussiaSpecProvider,
    builtins.NetherlandsSpecProvider,
)
field = Field(Locale.EN, providers=custom_providers)

field('snils')
# Output: '239-315-742-84'

field('bsn')
# Output: '657340522'

A Schema object is an iterator, so you can loop over it directly, for example:

from mimesis import Schema, Field
from mimesis.locales import Locale

field = Field(Locale.DE)

schema = Schema(
    schema=lambda: {
        "pk": field("increment"),
        "name": field("full_name"),
        "email": field("email", domains=["example.org"]),
    },
    iterations=100,
)


for obj in schema:
    print(obj)

Output:

{'pk': 1, 'name': 'Lea Bohn', 'email': 'best2045@example.org'}
...
{'pk': 100, 'name': 'Karsten Haase', 'email': 'dennis2024@example.org'}

Field vs Fieldset

The main difference between Field and Fieldset is that Fieldset generates a set (well, actually a list) of values for a given field, while Field generates a single value.

Let’s take a look at the example:

>>> from mimesis import Field, Fieldset
>>> from mimesis.locales import Locale

>>> field = Field(locale=Locale.EN)
>>> fieldset = Fieldset(locale=Locale.EN)

>>> field("name")
'Chase'

>>> [field("name") for _ in range(3)]
['Nicolle', 'Kelvin', 'Adaline']

>>> fieldset("name", i=3)
['Basil', 'Carlee', 'Sheryll']

The keyword argument i is used to specify the number of values to generate. If i is not specified, a reasonable default value (which is 10) is used.

The Fieldset class is a subclass of BaseField and inherits all its methods, attributes and properties. This means that the API of Fieldset is almost the same as that of Field, which is also a subclass of BaseField.

Almost, because an instance of Fieldset accepts keyword argument i.

While it is rarely necessary, you can override the default name of the keyword argument i in a subclass of Fieldset.

Let’s take a look at the example:

>>> from mimesis import Fieldset
>>> from mimesis.locales import Locale
>>> class MyFieldset(Fieldset):
...     fieldset_iterations_kwarg = "wubba_lubba_dub_dub"

>>> fs = MyFieldset(locale=Locale.EN)
>>> fs("name", wubba_lubba_dub_dub=3)
['Janella', 'Beckie', 'Jeremiah']

# The order of keyword arguments doesn't matter.
>>> fs("name", wubba_lubba_dub_dub=3, key=str.upper)
['RICKY', 'LEONORE', 'DORIAN']

Fieldset and Pandas

If your aim is to create synthetic data for your Pandas dataframes, you can make use of Fieldset as well.

With Fieldset, you can create datasets that are similar in structure to your real-world data, allowing you to perform accurate and reliable testing and analysis:

import pandas as pd
from mimesis.schema import Fieldset
from mimesis.locales import Locale

fs = Fieldset(locale=Locale.EN, i=5)

df = pd.DataFrame.from_dict({
    "ID": fs("increment"),
    "Name": fs("person.full_name"),
    "Email": fs("email"),
    "Phone": fs("telephone", mask="+1 (###) #5#-7#9#"),
})

print(df)

Output:

ID             Name                          Email              Phone
1     Jamal Woodard              ford1925@live.com  +1 (202) 752-7396
2       Loma Farley               seq1926@live.com  +1 (762) 655-7893
3  Kiersten Barrera      relationship1991@duck.com  +1 (588) 956-7099
4   Jesus Frederick  troubleshooting1901@gmail.com  +1 (514) 255-7091
5   Blondell Bolton       strongly2081@example.com  +1 (327) 952-7799

Isn’t it cool? Of course, it is!

Key Functions

You can optionally apply a key function to the result returned by a field or fieldset. To do this, simply pass a callable object that returns the final result as the key parameter.

Let’s take a look at the example:

>>> from mimesis import Field, Fieldset
>>> from mimesis.locales import Locale

>>> field = Field(Locale.EN)
>>> field("name", key=str.upper)
'JAMES'

>>> fieldset = Fieldset(i=3)
>>> fieldset("name", key=str.upper)
['PETER', 'MARY', 'ROBERT']

As you can see, a key function can be applied to both fields and fieldsets. For a fieldset, it is applied to each generated value.

Mimesis also provides a set of built-in key functions:

Maybe This, Maybe That

Real-world data can be messy and may contain missing values. This is why generating data with None values may be useful to create more realistic synthetic data.

Luckily, you can achieve this by using the key function maybe().

It has nothing to do with monads; it is just a closure that accepts two arguments: value and probability.

Let’s take a look at the example:

>>> from mimesis import Fieldset
>>> from mimesis.keys import maybe
>>> from mimesis.locales import Locale

>>> fieldset = Fieldset(Locale.EN, i=5)
>>> fieldset("email", key=maybe(None, probability=0.6))

[None, None, None, 'bobby1882@gmail.com', None]

In the example above, the probability of generating a None value instead of email is 0.6, which is 60%.

You can use any other value instead of None:

>>> from mimesis import Fieldset
>>> from mimesis.keys import maybe

>>> fieldset = Fieldset("en", i=5)
>>> fieldset("email", key=maybe('N/A', probability=0.6))

['N/A', 'N/A', 'static1955@outlook.com', 'publish1929@live.com', 'command2060@yahoo.com']
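To make the closure idea concrete, here is a rough sketch of a maybe-like key function built on the standard library. The name maybe_sketch and the inner (result, rng) signature are assumptions chosen for illustration, modeled on the two-parameter key functions described under "Accessing Random Object in Key Functions"; this is not the code of mimesis.keys.maybe.

```python
import random


def maybe_sketch(value, probability):
    """Return a key function that swaps the result for `value`
    with the given probability (0.0 = never, 1.0 = always)."""
    def key(result, rng):
        # rng.random() is uniform on [0.0, 1.0).
        return value if rng.random() < probability else result
    return key


rng = random.Random(0)  # seeded so that runs are repeatable
emails = ["a@example.org", "b@example.org", "c@example.org"]

key = maybe_sketch(None, probability=0.6)
print([key(email, rng) for email in emails])
```

Because the replacement value and the probability are captured when maybe_sketch() is called, the returned key function only needs the generated result and a random object at generation time.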

Romanization of Cyrillic Data

If your locale uses the Cyrillic script but you need locale-specific data in romanized form, you can use the key function romanize().

Let’s take a look at the example:

>>> from mimesis.keys import romanize
>>> from mimesis.locales import Locale
>>> from mimesis.schema import Field, Fieldset

>>> fieldset = Fieldset(Locale.RU, i=5)
>>> fieldset("name", key=romanize(Locale.RU))
['Gerasim', 'Magdalena', 'Konstantsija', 'Egor', 'Alisa']

>>> field = Field(locale=Locale.UK)
>>> field("full_name", key=romanize(Locale.UK))
"Dem'jan Babarychenko"

At the moment, romanize() works only with the Russian (Locale.RU), Ukrainian (Locale.UK) and Kazakh (Locale.KK) locales.

Accessing Random Object in Key Functions

A complex key function may need to make random choices of its own. To keep all key functions working from the same seed, such a function should use the random object supplied by the field rather than create a separate one.

To do this, write a key function that accepts two parameters: result and random. The result argument is the output generated by the field, while random is an instance of the Random class, shared so that all key functions that access it use the same seed.

Here is an example of how to do this:

>>> from mimesis import Field
>>> from mimesis.locales import Locale

>>> field = Field(Locale.EN, seed=42)
>>> foobarify = lambda val, rand: rand.choice(["foo", "bar"]) + val

>>> field("email", key=foobarify)
'fooany1925@gmail.com'
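The same two-parameter shape works for a named function, which is easier to test than a lambda. In the sketch below, tag_email is a hypothetical key function, and random.Random from the standard library stands in for the random object that a field would pass in.

```python
import random


def tag_email(result, rand):
    """Key function: insert a randomly chosen +tag before the @ sign."""
    local, _, domain = result.partition("@")
    tag = rand.choice(["news", "shop", "work"])
    return f"{local}+{tag}@{domain}"


# Two random objects with the same seed make the same choice,
# so the key function gives the same output for the same input.
first = tag_email("any1925@gmail.com", random.Random(42))
second = tag_email("any1925@gmail.com", random.Random(42))

print(first == second)  # True
```

Such a function would be passed in the same way as the lambda above, for example field("email", key=tag_email).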

Export Data to JSON, CSV or Pickle

Data can be exported in JSON or CSV formats, as well as pickled object representations.

Let’s take a look at the example:

from mimesis.enums import TimestampFormat
from mimesis.locales import Locale
from mimesis.keys import maybe
from mimesis.schema import Field, Schema

field = Field(locale=Locale.EN)
schema = Schema(
    schema=lambda: {
        "pk": field("increment"),
        "name": field("text.word", key=maybe("N/A", probability=0.2)),
        "version": field("version"),
        "timestamp": field("timestamp", fmt=TimestampFormat.RFC_3339),
    },
    iterations=1000
)
schema.to_csv(file_path='data.csv')
schema.to_json(file_path='data.json')
schema.to_pickle(file_path='data.obj')

Example of the content of data.csv (truncated):

pk,name,version,timestamp
1,save,6.8.6-alpha.3,2018-09-21T21:30:43Z
2,sponsors,6.9.6-rc.7,2015-03-02T06:18:44Z
3,N/A,4.5.6-rc.8,2022-03-31T02:56:15Z
4,queen,9.0.6-alpha.11,2008-07-22T05:56:59Z
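An exported CSV file can be read back with the standard csv module. The sketch below uses the sample rows shown above, with a header that names the four schema fields, and reads them from an in-memory string so that it runs on its own; a real script would open data.csv instead. Converting "N/A" back to None is an illustrative choice, not something Mimesis does for you.

```python
import csv
import io

# Sample rows from the truncated data.csv; a real script would use
# open("data.csv", newline="") instead of io.StringIO.
sample = """pk,name,version,timestamp
1,save,6.8.6-alpha.3,2018-09-21T21:30:43Z
2,sponsors,6.9.6-rc.7,2015-03-02T06:18:44Z
3,N/A,4.5.6-rc.8,2022-03-31T02:56:15Z
4,queen,9.0.6-alpha.11,2008-07-22T05:56:59Z
"""

rows = []
for row in csv.DictReader(io.StringIO(sample)):
    # CSV values are always strings, so convert the key back to int.
    row["pk"] = int(row["pk"])
    # Map the "N/A" placeholder back to a missing value.
    row["name"] = None if row["name"] == "N/A" else row["name"]
    rows.append(row)

print(rows[2])
```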