Table of contents
- Why It’s Great
- Why Your Team Needs OADG
- Setting up & using OADG
- Heads Up
- Add-Ons I Would Love To See
A high-quality application relies on thorough testing. Generating sample data fast is vital for keeping up with today's world of Agile and condensed timelines. Traditional methods for generating data have too many downfalls: manual manipulation of production data is slow and error-prone and 3rd party tools, require a fee-based model or if free, a compromise in features and support.
Enter Oracle APEX Data Generator (OADG), a SQL Workshop utility tool available in Release 22.1 (May 2022). It generates large volumes of data for you, in the millions, in multiple formats (CSV, JSON, SQL) utilizing a vast library of data types. Its simple UI and workflow, along with its price (it's free!), make it an all-around winner over the multitude of alternatives out there.
Continue reading to discover why I think OADG is great, why you should use it over other options and how to set up and use it. I've also thrown a section on hints and tips I picked up along the way along with features I would ❤️ to see in the future.
Why It’s Great
The new tool is feature-rich, compared to the alternatives, both free and paid. And it's just in its first version. Its features include:
- Generates data in multiple formats (JSON, CSV, SQL)
- If you prefer, it inserts data directly into your existing table(s)
- Generates a million rows of sample data, with and without complex formulas, within 5 minutes
- Supports functions available in SQL for data transformations and derivations such as:
- DB functions, e.g., TRIM, LPAD, RPAD, CASE, CONCAT
- DB PL/SQL functions, e.g., DBMS_RANDOM.value
- your own custom PL/SQL functions
- Supports use of SQL query results to populate your columns
- Supports foreign keys to generate consistent data across your tables and columns
- Supports the reuse of previously-defined data definition templates, known as blueprints, so you can generate data over and over again
- Provides ability to preview your data while you are building out the definition
Provides a vast library of data types to generate sample data including date and time, integers, boolean, products, persons, plants, food, cars, animals, etc. And for each data type, a series of attributes are available to provide context to your data. For example,
- Persons data type includes attributes such as first name, last name, biological sex, email address, etc.
- Airport data type includes airport type, city, country, latitude, longitude, etc.
- Car data type includes make, model, drivetrain and many others as illustrated in screenshot below:
APEX makes the sample data types and their respective values available to query via the the GET_EXAMPLE function. For example, the query below will generate all cities available for the data type "airport":
select * from apex_dg_data_gen.get_example (p_friendly_name => 'airport.city', p_rows => 100);
The p_friendly_name variable may be found in the app declaratively:
I downloaded the APEX 22.1 Install and found the script that generates all the data types, attributes and values. You can view all sample data here.
- And if the above data types do not meet your needs, BYOD - import your own data and let the OADG transform it as needed.
Why Your Team Needs OADG
TLDR - Because it saves time. Testers no longer have to search or wait for data.
Oracle ERP, APEX and other app implementations typically require a combination of the following for testing:
- Custom Code - developers build PL/SQL packages to generate large numbers of records for volume testing
- Excel Hell - analysts and end users download a copy of Production records into Excel and then apply a series of formulas and functions by column to generate unique records. Users then hand-off to developers to load the records into the app or interface.
- 3rd-Party Tools - hundreds of open source (free) and pay-as-you-go tools are available for generating test data. However, you need to consider factors like supported databases, data generation methods, data types available, cost and support.
OADG allows your team to bypass all these concerns, and what's more, it's free, fast and easy to use.
A quick note on reusing Production Data - relying on production data introduces risk as this data typically represents 'perfect' data that has already been vetted. To thoroughly test your new app, integration or business logic, your test data needs exceptions, anomalies, negative outliers and other invalid data to rigorously prove out your app and any ETL routines.
Be a hero and provide them OADG!
Tip: Data Generation tools are typically not applicable for ML use cases as you do not want to feed algorithms randomly-generated data. You need some data correlation across your columns so your algorithm can learn.
Setting up & using OADG
The best way to illustrate this feature is through an example. We'll use a simple Product table with a relational dependency to the Product Categories table:
Here's a visual of the workflow for data generation:
1. Create Blueprint
As depicted in the workflow above, the first step to generating data is to capture the definition of your tables and columns via a Blueprint, a template-like object which stores the settings, transformation rules and data sources for your columns.
OADG provides three methods for creating a blueprint. A wizard-like flow guides you through each method, as screenshot illustrates below.
Navigation: SQL Workshop > Utilities > Data Generator
There are benefits to each method:
Use Existing Tables - this option provides you with two big features: 1) it auto-generates your blueprint based on your table's definition, and 2) if you choose, it will automatically load the generated sample data directly into your existing table.
From Scratch - this option provides you ultimate flexibility in the sequence of your table(s) and column(s) and will provide you generated data in a CSV, JSON or SQL file.
Import JSON File - this option serves as my ultimate 'edit' recourse. I typically generate a blueprint from scratch or from an existing table and if I need to edit the sequence of a table or column, I export it as a JSON, make the edits I need, and then reimport it with this option.
2. Define each column's data source
This is where you define the data you want to generate and how to transform it. Once you have defined your table(s) and column(s) with guidance from the wizard in step 1 above, click on your first column to display its Column and Data Source regions, as illustrated in screenshot below.
NOTE: I have uploaded my blueprint's JSON here in case you'd like to follow along.
OADG makes available six data sources, providing almost unlimited flexibility for data transformation.
Let's see each of these in action below.
This option allows the relational dependency (foreign-key) between tables. For my use case, I need to reference the CATEGORY_ID from the PRODUCT_CATEGORY table to the CATEGORY_ID in my PRODUCT table.
Tip: this option is only displayed if you have if you have more than one table defined in your blueprint. If you need to reference Table A.Column A's values for your Table B.Column B, ensure you define Table A and its columns before defining Table B in the Blueprint. This feature is sequence-dependent and will force you to redesign the sequence of your tables if they are not in the right order.
2. Custom Data Source
Select this option if the data generator needs to randomly assign values from your own provided data. Simply create a custom data source by selecting the hamburger bar as shown below:
For my example, I uploaded a CSV file with 1000+ retail brands using the Data Workshop utility. These values will serve as sample data for the BRAND column.
NOTE: The RETAIL_BRANDS.xlsx file I use in this example is available here.
This source option allows you to use any of the pre-seeded data types. I found myself running the "friendly_name" script provided in the "Why It's Great" section above a number of times to get me familiar with the data and its boundless possibilities.
I chose the Built-In option for five columns: CATEGORY_NAME, PRODUCT_NAME, COST, LAUNCH_DATE and ACTIVE. Here is a pic of the LAUNCH_DATE and ACTIVE column settings. Check out the convenience the "Date Between Min and Max" data type provides:
Other "Date" and "Boolean" options include the following:
This is a very handy option for when you have a manageable list of pre-defined values. Instead of having to load your custom data to a table, you simply list out your values in a table provided by this option. For my use case, I defined four values for the UOM column. See screenshot below.
Note the "Frequency" column above, which allows me to indicate the occurrence of one value compared to the others in a generated data set. Neat!
And Note the "Multi Value Options" fields as displayed in the screenshot below. It allows you to generate multiple values for a column. For this specific screenshot, I have instructed OADG to assign up to three values for the UOM column with a comma as my delimiter.
As the name applies, this option will generate a numeric sequence for your column. It lets you specify the starting sequence number as well as the incremental value. I use this for both my CATEGORY_ID and PRODUCT_ID columns.
This one is my favorite because it's powerful and provides so much flexibility. It allows you to apply just about any function to transform built-in or custom data. You can apply a substring function to limit size of values, DBMS_RANDOM or CASE to derive values based on dependencies with other columns, as shown in the screenshot below:
Did you catch the Required flag setting on the screenshot above? If you set a column to not Required, you can indicate the percentage of values you'd like to be set as Null.
This was a quick overview of the six data sources available. You will spend the bulk of your setup in this section. With such flexibility, you can just about build anything you dream up.
3. Preview and correct as needed
The 'Preview' option provides a live preview of the data your blueprint will generate, giving you the opportunity to edit any of your column's definitions before generating your data. Here's a preview of my data. Because I have two tables defined, it provides me two views to review:
4. Generate data (CSV, JSON, SQL)
Finally, OADG offers three file output formats: CSV, JSON or a SQL script ready for upload directly into your table(s).
And here are a few pics of my results in each format.
As you can see, data generation is lightning fast for one million rows with eight columns, three of which had formulas. These test runs were ran in OCI Cloud free tier.
|10,000||13 sec||13 sec|
|100,000||40 sec||48 sec|
|1 Million||5 min||7 min|
Here are a few tips and hints I thought I'd pass along:
- The sequence field is not updatable. To re-sequence a column, you will need to delete the column and recreate it in the position you need it within your table. You can perform this declaratively by right-clicking on the column to delete it and recreate it in the APEX Builder or, an easier option is to re-create the blueprint via JSON.
- If column dependencies between tables exist, it is important to sequence the tables correctly in the blueprint design. For instance, since my PRODUCT table has a dependency on a column in the PRODUCT_CATEGORY table, it is important to sequence the PRODUCT_CATEGORY table in my blueprint first, followed by the PRODUCT table. In my first run, I defined my tables in the wrong sequence. Since the sequence field is not updatable, I had to delete both tables to re-create them in the correct sequence. Alternatively, I could have exported the blueprint design, edited the sequence number in its JSON and then reimport it again.
- The option to load into existing tables is possible for blueprints created from scratch or based on an existing table. If you built your blueprint from scratch, your blueprint's table name must match the table name in your schema exactly and you must comply with typical table constraints such as ensuring your blueprint includes all required columns, etc.
- A blueprint is an all or nothing definition. If you have three tables in your blueprint, it will generate data for all three tables. If for a particular run, you want to create data for just one table, you will need to create a new blueprint because you are unable to temporarily disable the other tables. NOTE: A 'copy' or 'duplicate' blueprint button does not exist, but you can export your current blueprint, edit the JSON and import it back in.
- When designing your column definitions, it appears a column with a formula cannot reference another column with a formula. It appears each column is being evaluated independently. Although it did not throw out an error during my testing, the Preview results in my column were not correct.
- There can be a performance hit if your columns include very complex formulas. The higher the complexity, the bigger the hit.
- Don't skip the APEX Help sections - they provide great examples as they guide through the data generation process.
Add-Ons I Would Love To See
Release 22.1 (May 2022) is the first roll-out of the Data Generator and it is fully-robust as it is. I can't help but dream up of additional iterations the APEX team could deliver in future versions...
- Enable For Everyone - As it's currently an APEX Utility Tool, only developers and administrators have access to the data generator. But business analysts and other end users could benefit too. As designed now, end users have to request a developer to generate the data for them. Making it available outside the APEX Builder would allow users to use it without developer intervention. Alternatively, you could point users to apex.oracle.com so they could upload their own blueprints and custom data.
- Copy Blueprint Button - ideal for when I need a new version of my blueprint. Currently, I export my existing blueprint, edit it in JSON and re-import with a new name. This is not hard, but a button would require less clicks.
Hundreds of fee-based and open-source test data generation tools exist out on the market. When you consider the factors for selecting a tool such as the data types supported, data generation methods, sample data library, performance and cost, the Oracle APEX Data Generator comes out on top. Thanks Oracle APEX team. 🙏