Getting started with the Databricks Labs Data Generator

Contents:

- Create a data set without pre-existing schemas
- Creating a data set with a pre-existing schema
- Adding dataspecs to match multiple columns
- A more complex example - building Device IOT Test Data
- Generating Change Data Capture (CDC) data
- Contributing to the Databricks Labs Data Generator
- Using the Databricks Labs data generator

The Databricks Labs data generator (aka dbldatagen) is a Spark based solution for generating test data. It uses the features of Spark dataframes and Spark SQL to generate the data. As the output of the process is a Spark dataframe populated with the generated data, it may be saved to storage in a variety of formats, saved to tables, or generally manipulated using the existing Spark Dataframe APIs. It has no dependencies on any libraries that are not already included in the Databricks runtime, and you can use it from Scala, R or other languages by defining a view over the generated data.

As the data generator is a Spark process, it can scale to generating data with millions or billions of rows in minutes with reasonably sized clusters. For example, at the time of writing, a billion row version of the IOT data set example listed later in the document can be generated and written to a Delta table in under 2 minutes using a 12 node x 8 core cluster (using DBR 8.3).

The Databricks Labs Data Generator is a Python library that can be used in several different ways:

- Generate a synthetic data set without defining a schema in advance
- Generate a synthetic data set for an existing Spark SQL schema
- Generate a synthetic data set, adding columns according to the specifiers provided
- Start with an existing schema and add columns along with specifications as to how values are generated

The data generator includes the following features:

- Specify the number of Spark partitions to distribute data generation across
- Specify numeric, time and date ranges for columns
- Generate column data at random or from repeatable seed values
- Generate column data from one or more seed columns
- Generate column data from discrete sets of values, optionally with weighting of how frequently values occur
- Use SQL based expressions to control or augment column generation
- Script the Spark SQL table creation statement for the dataset
- Specify a statistical distribution for random values

The test data generation process is controlled by a test data generation spec, which can build a schema implicitly, or a schema can be added from an existing table or Spark SQL schema object.

Each column to be generated derives its test data from a set of one or more seed values. By default, this is the id field of the base data frame (generated with spark.range for batch data frames, or using a Rate source for streaming data frames). Each column can be specified as based on the id field or on other columns in the test data generation spec. Columns may be based on the value of one or more base fields, or on a hash of the base values. Column base values may also be generated at random.

All further data generation on a particular column is controlled by a series of transformations on this base value, including:

- Mapping the base value to one of a set of discrete values, optionally with the use of weighting
- Arithmetic transformation of the base value
- Adding string formatting to the base value

There is also support for applying arbitrary SQL expressions, and for generation of common data from templates.

    import dbldatagen as dg
    from pyspark.sql.types import FloatType, IntegerType, StringType

    row_count = 1000 * 100
    column_count = 10

    testDataSpec = (
        dg.DataGenerator(spark, name="test_data_set1", rows=row_count,
                         partitions=4, randomSeedMethod='hash_fieldname',
                         verbose=True)
        .withIdOutput()
        .withColumn("r", FloatType(), expr="floor(rand() * 350) * (86400 + 3600)",
                    numColumns=column_count)
        .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
        .withColumn("code2", IntegerType(), minValue=0, maxValue=10, random=True)
        .withColumn("code3", StringType(), values=['online', 'offline', 'unknown'])
        .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True,
                    percentNulls=0.05)
        .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True,
                    weights=[9, 1, 1])
    )

    dfTestData = testDataSpec.build()

By default, the seed column for each row is the id column. Use of the method withIdOutput() retains the id field in the output data. If this is not called, the id field is used during data generation, but it is dropped from the final data output.

Each of the withColumn method calls introduces a new column (or columns). The example above shows some common uses: specifying minimum and maximum values for a numeric column, generating values at random, and selecting values from a discrete set, optionally with weighting or a percentage of null values. The withColumn method call for the r column introduces multiple columns, labelled r1 to rN, as determined by the numColumns option. Here, we use the expr option to introduce a SQL expression to control the generation of the column.