JSON 1: parse numbers, strings
Learning goals
You will learn or practice how to:
- Write programs that use dynamic memory (malloc(…))
- Debug code that uses dynamic memory (malloc(…))
- Parse strings
- Apply test-driven development (TDD).
Overview
This is part 1 of a 2-part sequence in which you will create a decoder for the JSON data format.
JSON is a file format for exchanging hierarchical data between programs, often written in different programming languages and/or on different computers. In web programming, this allows a server application written in any language (e.g., Python or even C) to pass complex structures of information to the user's browser, which then renders it.
A JSON string may represent a number, a string, or a list. (In this context, when we say “list”, we mean a linked list.) Here are some examples:
type | json | data |
---|---|---|
number | 10 | 10 |
string | "ten" | "ten" |
list | [2, 6, 4] | linked list: 2 → 6 → 4 |
HW11: The last item in the table above is a linked list. We will be learning more about those in the next 2-3 lectures.
Getting Started on HW11
- Use 264get HW11 to fetch the starter code.
    you@ecegrid-thin1 ~/ $ cd 264
    you@ecegrid-thin1 ~/264 $ 264get hw11
    you@ecegrid-thin1 ~/264 $ cd hw11
    you@ecegrid-thin1 ~/264/hw11 $
- Implement parse_int(…).
  - Create a trivial test (e.g., 0).
  - Implement just enough of parse_int(…) so that it passes your trivial test. At this point, it should fail most other tests.
  - Add another trivial test (e.g., 9).
  - Implement just enough to pass your two tests so far.
  - Add another test (e.g., 15).
  - Implement just enough to pass your three tests so far.
  - Add another test (e.g., -15).
  - … and so on, until parse_int(…) is finished.
- Test your parse_int(…) completely, using miniunit.h. Get parse_int(…) completely finished and tested to perfection, including error handling, before you do anything else. Seriously… do not do anything else until this is done.
- Submit.
- Implement print_element(…) just enough to support integers.
- Test print_element(…). (This should be easy to test.)
- Submit.
- Implement parse_element(…) just enough to support integers.
- Test parse_element(…). (This should be trivial.)
- Submit.
- Do the same for parse_string(…).
  - First, get the functionality right. Initially, you will have memory leaks.
  - Implement just enough of free_element(…) so that no memory leaks result from testing parse_string(…).
  - Test your parse_string(…) thoroughly and make sure it is 100% perfect, and fully tested, before you do anything else.
- Submit.
- Extend print_element(…) so that it supports strings, too.
- Test print_element(…). (Again, this should be easy.)
- Submit.
- Extend parse_element(…) to support strings.
- Test parse_element(…). (Again, easy, but make sure it's perfect.)
- Submit.
- Test error cases. Make sure parse_element(…), parse_int(…), and parse_string(…) return false for incorrect input.
- Test that there are no memory leaks, even for incorrect input.
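To make the "implement just enough" rhythm concrete, here is a rough sketch of what parse_int(…) might look like after the first two tests (0 and 9). The name parse_int_stage1 is hypothetical and is used only so this intermediate stage is not mistaken for the finished function. It assumes the signature given in the Requirements section.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical intermediate stage (NOT the finished function): handles exactly
// one decimal digit, which is enough for the tests "0" and "9" but not "15".
bool parse_int_stage1(int* a_value, char** a_pos) {
    assert(a_value != NULL && a_pos != NULL && *a_pos != NULL);
    if(! isdigit((unsigned char) **a_pos)) {
        return false;           // leave *a_value and *a_pos untouched
    }
    *a_value = **a_pos - '0';   // convert the digit character to its value
    *a_pos += 1;                // now refers to the character after the digit
    return true;
}
```

A test for 15 would fail against this stage (it would store 1 and stop at the '5'), which is the signal to extend the implementation to a loop over digits.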
How much work is this?
This assignment will only be feasible if you are disciplined in the way you build it. If you try to build it all at once, success is extremely unlikely.
Instructor's solution json.c is 69 sloc for HW11 and 127 sloc for the two halves together. That code is tight, but defensible (meets code quality standards, and uses no methods or language features we haven't covered).
Bonus opportunities
2 bonus points for handling the following escape sequences: \", \\, \/, \b, \f, \n, \r, \t, \u00▒▒.
Interpret these the same as in C. (You may need to search for more information.) These correspond to the nine escape sequences listed at json.org.
For \u00▒▒, you only need to handle \u0020 to \u007e. You can ignore any unicode-specific details, and just treat these as ASCII values (e.g., \u0041 for 'A', etc.).
Add #define BONUS_JSON_ESCAPE_SEQUENCES to your json.h if you believe you have finished this.
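As a rough illustration of the \u00▒▒ case, the sketch below decodes the four characters that follow "\u" into a single ASCII character. Both helper names (hex_digit_value and decode_u00_escape) are hypothetical and not part of the starter code. How you hook something like this into parse_string(…) is up to you.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: convert one hexadecimal digit to its value, or -1 if invalid.
static int hex_digit_value(char ch) {
    if(ch >= '0' && ch <= '9') { return ch - '0'; }
    if(ch >= 'a' && ch <= 'f') { return ch - 'a' + 10; }
    if(ch >= 'A' && ch <= 'F') { return ch - 'A' + 10; }
    return -1;
}

// Hypothetical helper: 'digits' refers to the four characters that follow "\u".
// On success, store the decoded ASCII character in *a_ch and return true.
// Only "0020" through "007e" are accepted; *a_ch is not modified on failure.
bool decode_u00_escape(const char* digits, char* a_ch) {
    assert(digits != NULL && a_ch != NULL);
    if(digits[0] != '0' || digits[1] != '0') {  // short-circuit stops at a null terminator
        return false;
    }
    int high = hex_digit_value(digits[2]);
    int low  = (high >= 0) ? hex_digit_value(digits[3]) : -1;
    if(high < 0 || low < 0) {
        return false;
    }
    int code = high * 16 + low;
    if(code < 0x20 || code > 0x7e) {
        return false;
    }
    *a_ch = (char) code;
    return true;
}
```

Because each character is checked before the next one is read, the helper never reads past a null terminator in a truncated input such as "\u00".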
2 bonus points for handling the constants null, true, and false.
Add ELEMENT_NULL and ELEMENT_BOOL to the enum in json.h, and add void* as_null and bool as_bool to the union.
Add #define BONUS_JSON_CONSTANTS to your json.h if you believe you have finished this.
4 bonus points for supporting JSON “objects” with BSTs. For the BST node, you will have:
    typedef struct _BSTNode {
        struct _BSTNode* left;
        struct _BSTNode* right;
        Element element;
        char* key;
    } BSTNode;
For the parse function:
    bool parse_object(BSTNode** a_root, char** a_pos) { … }
This note may be amended and/or replaced later, but the above details should be sufficient. Refer to the JSON standard.
Bonus options must be submitted with HW12. There will be no partial credit for the bonus options. You may be required to come and explain in person. We'll let you know if that is necessary.
Requirements
- Your submission must contain each of the following files, as specified:
json.c functions:
parse_int(int* a_value, char** a_pos) → return type: bool
Set *a_value to whatever integer value is found at *a_pos.
- *a_pos is initially the address of the first character of the integer literal in the input string.
- *a_value is the (already allocated) location where the parsed int should be stored.
- Return true if a properly formed integer literal was found at *a_pos.
  - *a_pos should refer to the next character in the input string, i.e., after the last digit of the integer literal.
  - Ex: parse_int(…) should return true for 9, -99, and 123AAA.
- Return false if an integer literal was not found.
  - *a_pos should refer to the unexpected character that indicated a problem.
  - Ex: parse_int(…) should return false for A, AAA123, -A, -A9, and -.
- ⚠ Calling parse_int(…) should not result in any calls to malloc(…).
- You do not need to parse hexadecimal, octal, scientific notation, floating point values, or anything other than integers in decimal notation (positive or negative).
- Whenever parse_int(…) returns false, *a_value should not be modified.
parse_string(char** a_string, char** a_pos) → return type: bool
Set *a_string to a copy of the string literal at *a_pos.
- Caller is responsible for freeing the memory.
- A string literal must be surrounded by double quotation marks, and may not contain a newline. In addition, we make two simplifications:
  - Strings may not contain double quotation marks.
  - Backslash is not special. Do not parse escape codes (i.e., "\▒") in the input.
- Calling parse_string(…) should result in exactly one call to malloc(…).
- Return true if a properly formed string literal was found.
  - *a_pos should be set to the next character in the input string, i.e., after the ending double quotation mark.
  - Ex: parse_string(…) should return true for "abc", "abc\", and "abc\z".
- Return false if a string literal was not found.
  - *a_pos should refer to the unexpected character that indicated a problem (e.g., newline or null terminator in the input).
  - Ex: parse_string(…) should return false for "abc and "abc
    def".
- Whenever parse_string(…) returns false, do not modify *a_string, and no heap memory should be allocated.
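One way to satisfy both "exactly one call to malloc(…)" and "no heap memory allocated on failure" is to scan for the closing quotation mark first and allocate only once the literal is known to be well formed. The sketch below shows that ordering. The name parse_string_sketch is hypothetical, and treating a failed malloc(…) as an ordinary error is an assumption that the specification does not address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Sketch only: find the closing quote first, then allocate exactly once.
bool parse_string_sketch(char** a_string, char** a_pos) {
    assert(a_string != NULL && a_pos != NULL && *a_pos != NULL);
    if(**a_pos != '"') {
        return false;                       // *a_pos already refers to the problem
    }
    char* start = *a_pos + 1;               // first character after the opening quote
    char* end = start;
    while(*end != '"') {
        if(*end == '\0' || *end == '\n') {  // unterminated literal
            *a_pos = end;
            return false;                   // nothing was allocated
        }
        end += 1;
    }
    size_t length = (size_t)(end - start);
    char* copy = malloc(length + 1);        // +1 for the null terminator
    if(copy == NULL) {
        return false;                       // assumption: allocation failure is an error
    }
    strncpy(copy, start, length);           // copies exactly 'length' characters
    copy[length] = '\0';
    *a_string = copy;
    *a_pos = end + 1;                       // just after the closing quote
    return true;
}
```

Note that the backslash in an input such as "abc\z" is copied like any other character, which matches the simplification stated above.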
parse_element(Element* a_element, char** a_pos) → return type: bool
- First, eat any whitespace at *a_pos.
  - “Eat whitespace” just means to skip over any whitespace characters (i.e., increment *a_pos until isspace(**a_pos)==false).
- Next, decide what kind of element this is.
  - If it's a digit (isdigit(**a_pos)) or hyphen ('-'), set the element's type to ELEMENT_INT and call parse_int(&(a_element -> as_int), a_pos).
  - If it's a string (**a_pos == '"'), then set the element's type to ELEMENT_STRING and call parse_string(&(a_element -> as_string), a_pos).
  - If it's a list (**a_pos == '['), then set the element's type to ELEMENT_LIST and call parse_list(&(a_element -> as_list), a_pos).
- Return whatever was returned by parse_int(…), parse_string(…), or parse_list(…).
  - If none of those functions was called (i.e., if the next character was neither digit, '-', '"', nor '['), then return false.
- Do not modify *a_pos directly in parse_element(…), except for eating whitespace. *a_pos can, and should, be modified in parse_int(…), parse_string(…), and parse_list(…).
- Caller is responsible for freeing memory by calling free_element(…) whenever parse_element(…) returns true.
- Whenever parse_element(…) returns false, do not modify *a_element, and free any heap memory that was allocated prior to discovery of the error.
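The "eat whitespace" step described above can be sketched as a small helper. The name eat_whitespace is hypothetical (the assignment does not require a separate function), and it uses only isspace(…) from ctype.h.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

// Hypothetical helper: advance *a_pos past any leading whitespace.
void eat_whitespace(char** a_pos) {
    assert(a_pos != NULL && *a_pos != NULL);
    // isspace('\0') is false, so the loop always stops at the null terminator.
    while(isspace((unsigned char) **a_pos)) {
        *a_pos += 1;
    }
}
```

The cast to unsigned char avoids undefined behavior when a char with a negative value is passed to isspace(…).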
print_element(Element element) → return type: void
Given an Element object, print it in JSON notation.
- Spacing is up to you, as long as it is valid JSON.
- If element is a string, then print it (with double-quotes) using printf(…).
- If element is an integer, print it using printf(…).
- If element is a list, print a '['. Then print each element in the list using print_element(…) (recursively), separated by commas. Finally, print ']'.
free_element(Element element) → return type: void
Free the contents of the Element, as needed.
- If it contains a linked list, free the list, including all elements.
- If it contains a string, free the string.
- ⚠ Do not attempt to free the Element object itself. free_element(element) only frees dynamic memory that element refers to.
test_json.c functions:
main(int argc, char* argv[]) → return type: int
Test all of the above functions using your miniunit.h.
- This should consist primarily of calls to mu_run(_test_▒▒▒).
- 100% code coverage is required.
- Your main(…) must return EXIT_SUCCESS.
- You may ignore any trailing characters in the input string, as long as it starts with a well-formed JSON element.
  - Acceptable: 123AAA, "12"AAA, "12",[,
- You only need to support the specific features of JSON that are explicitly required in this assignment description. You do not need to support unicode (e.g., "萬國碼", "يونيكود", "യൂണികോഡ്"), objects/dictionaries (e.g., {"a":1, "b":2}), backslash escapes (e.g., "\n"), embedded quotes (e.g., "He said, \"Roar!\""), floating point numbers (e.g., 3.1415), non-decimal notations (e.g., 0xdeadbeef, 0600), or the constants null, true, and false.
- Do not modify json.h except as explicitly directed.
- There may be no memory faults (e.g., leaks, invalid read/write, etc.), even when parse_▒▒▒(…) returns false.
- The following external header files, functions, and symbols are allowed.
header | functions/symbols | allowed in… |
---|---|---|
stdbool.h | bool, true, false | json.c, test_json.c |
stdio.h | printf, fprintf, fputs, stdout, fflush | json.c, test_json.c |
assert.h | assert | json.c, test_json.c |
ctype.h | isdigit, isspace | json.c, test_json.c |
stdlib.h | EXIT_SUCCESS, abs, malloc, free, size_t | json.c, test_json.c |
string.h | strncpy, strchr, strlen, strcmp | json.c, test_json.c |
limits.h | INT_MIN, INT_MAX | json.c, test_json.c |
miniunit.h | anything | test_json.c |
clog.h | anything | json.c, test_json.c |
- Submissions must meet the code quality standards and the course policies on homework and academic integrity.
Submit
To submit HW11 from within your hw11 directory, type:
    264submit HW11 json.c json.h test_json.c expected.txt miniunit.h clog.h Makefile
If your code does not depend on miniunit.h or clog.h, those may be omitted. Your json.h will most likely be identical to the starter. Makefile will not be checked, but including it may help in case we need to do any troubleshooting.
Pre-tester
The pre-tester for HW11 has been released and is ready to use.
Q&A
How can I structure my tests?
Here's a start. (We may add to this at some point.)
    // OK TO COPY / ADAPT this snippet---but ONLY if you understand it completely.
    // ⚠ Do not copy blindly.
    //
    // This test is nowhere near adequate on its own. It is provided to illustrate how to
    // use helper functions to streamline your test code.
    #include <stdio.h>
    #include <stdlib.h>
    #include "json.h"
    #include "miniunit.h"

    int _test_parse_int_valid() {
        mu_start();
        //──────────────────────────────────────────────────────
        int result;  // will be initialized in parse_int(…)
        char* input = "0";
        char* pos = input;
        bool is_success = parse_int(&result, &pos);
        mu_check(is_success);   // because the input is valid
        mu_check(pos == input + 1);
        mu_check(result == 0);
        //──────────────────────────────────────────────────────
        mu_end();
    }

    int _test_parse_int_invalid() {
        mu_start();
        //──────────────────────────────────────────────────────
        int result;  // should NOT be modified by parse_int(…) because the input is invalid
        char* input = "A";
        char* pos = input;
        bool is_success = parse_int(&result, &pos);
        mu_check(!is_success);  // because the input is invalid
        mu_check(pos == input); // failure should be at the first character in the input
        //──────────────────────────────────────────────────────
        mu_end();
    }

    int main(int argc, char* argv[]) {
        mu_run(_test_parse_int_valid);
        mu_run(_test_parse_int_invalid);
        return EXIT_SUCCESS;
    }
That's a lot of duplication! Can we make our tests more concise?
You could use a helper function and a struct type just for testing. You may copy/adapt this code, but only if you understand it completely. ⚠ Do not copy blindly.
    // FANCY way of testing, using a helper function and struct type just for testing.
    //
    // Okay to copy/adapt, but ONLY IF YOU UNDERSTAND THIS CODE COMPLETELY.
    // ⚠ Do not copy blindly.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "json.h"
    #include "miniunit.h"

    typedef struct {
        bool is_success;
        union {  // anonymous union (C11)
            Element  element;
            long int error_idx;
        };
    } ParseResult;

    ParseResult _parse_json(char* s) {
        Element element;  // Not initialized because parse_element(…) *must* do so.
        char* pos = s;
        bool is_success = parse_element(&element, &pos);
        if(is_success) {
            return (ParseResult) { .is_success = is_success, .element = element };
        }
        else {
            return (ParseResult) { .is_success = is_success, .error_idx = pos - s };
        }
    }

    int _test_int() {
        mu_start();
        //────────────────────
        ParseResult result = _parse_json("0");
        mu_check(result.is_success);
        if(result.is_success) {
            mu_check(result.element.type == ELEMENT_INT);
            mu_check(result.element.as_int == 0);
            free_element(result.element);  // should do nothing
        }
        //────────────────────
        mu_end();
    }

    int _test_string() {
        mu_start();
        //────────────────────
        ParseResult result = _parse_json("\"abc\"");
        mu_check(result.is_success);
        if(result.is_success) {
            mu_check(result.element.type == ELEMENT_STRING);
            mu_check(strcmp(result.element.as_string, "abc") == 0);
            mu_check(strlen(result.element.as_string) == 3);
            free_element(result.element);
        }
        //────────────────────
        mu_end();
    }

    int _test_list_of_ints() {
        mu_start();
        //────────────────────
        ParseResult result = _parse_json("[1, 2]");
        mu_check(result.is_success);
        if(result.is_success) {
            mu_check(result.element.type == ELEMENT_LIST);
            mu_check(result.element.as_list != NULL);
            mu_check(result.element.as_list -> element.as_int == 1);
            mu_check(result.element.as_list -> element.type == ELEMENT_INT);
            mu_check(result.element.as_list -> next != NULL);
            mu_check(result.element.as_list -> next -> element.type == ELEMENT_INT);
            mu_check(result.element.as_list -> next -> element.as_int == 2);
            free_element(result.element);
        }
        //────────────────────
        mu_end();
    }

    int main(int argc, char* argv[]) {
        mu_run(_test_int);
        mu_run(_test_string);
        mu_run(_test_list_of_ints);
        return EXIT_SUCCESS;
    }
⚠ If you do not understand this code, do not use it.
What should be the value of *a_pos after parse_▒▒▒(…) returns?
    _____________________________________________
    # EXAMPLE #1
    INPUT:  123

    BEFORE we call parse_int(…):
      123
      ↑
      *a_pos

    RETURN value from parse_int(…):  true

    After parse_int(…) returns:
      123
         ↑
         *a_pos refers to the null terminator just after the integer literal.

    element.type == ELEMENT_INT
    element.as_int == 123
    _____________________________________________
    # EXAMPLE #2
    INPUT:  123ABC

    BEFORE we call parse_int(…):
      123ABC
      ↑
      *a_pos

    RETURN value from parse_int(…):  true

    After parse_int(…) returns:
      123ABC
         ↑
         *a_pos refers to the non-digit character after the integer literal.

    element.type == ELEMENT_INT
    element.as_int == 123
    _____________________________________________
    # EXAMPLE #3
    INPUT:  -A1

    BEFORE we call parse_int(…):
      -A1
      ↑
      *a_pos

    RETURN value from parse_int(…):  false

    After parse_int(…) returns:
      -A1
       ↑
       *a_pos refers to the first character that informed us this cannot be an integer literal.

    element.type == (don't care)
    element.as_int == (don't care)
What does the output of print_element(…) look like?
It's just the inverse operation to parse_element(…). parse_element(…) takes JSON as input. print_element(…) prints JSON as output.
If you were to parse the output of print_element(…) with parse_element(…), you should get an equivalent object.
If you parse a JSON string and then print it again, you should get an equivalent string.
If you're looking for concrete examples, just look at any example of input to parse_element(…) (except for the trailing characters). There are several at the top of this assignment description page.
The specification says we don't have to handle escape sequences, but then it mentions escape sequences. Do we parse escape sequences or don't we? How do we handle backslash?
We use C escape codes to make C string literals containing certain characters (e.g., double-quote, newline, etc.) in our C code.
You don't have to parse JSON escape codes. Unless you are doing the escape sequence bonus, just treat a backslash like any other character.
Here's an example to illustrate the distinction:
    #include <stdlib.h>
    #include <assert.h>
    #include <string.h>
    #include <stdio.h>
    #include "json.h"

    int main(int argc, char* argv[]) {
        Element element;  // 'element' will be initialized inside parse_element(…)

        // C escape codes used to create a C string literal containing double quotes
        char* json_input = "\"A\"";  // same as:  {'\"', 'A', '\"'}
        char* pos = json_input;
        assert(strlen(json_input) == 3);
        parse_element(&element, &pos);
        printf(">>>|%s|<<<\n", element.as_string);
        // Output:
        // >>>|A|<<<
        free_element(element);  // caller frees the string copied by parse_element(…)

        // JSON escape codes
        json_input = "\"A\\nB\"";  // same as:  {'\"', 'A', '\\', 'n', 'B', '\"'}
        assert(strlen(json_input) == 6);
        pos = json_input;
        parse_element(&element, &pos);
        printf(">>>|%s|<<<\n", element.as_string);
        // Output:
        // >>>|A\nB|<<<
        // Output:  (with escape code bonus)
        // >>>|A
        // B|<<<
        free_element(element);

        // Note:  strlen(…) does not count the null terminator
        return EXIT_SUCCESS;
    }
Should the double quotes be stored in memory?
No. The double quotes are part of the JSON syntax.
This is just like how in C, when you define a string like this:
    char s[] = "abc";
… the double quotes are not stored in memory.
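If you want to convince yourself, a small check like the following shows exactly which bytes the array holds. The function name abc_bytes is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Copy the bytes of  char s[] = "abc";  into 'out' (capacity >= 4) and
// return how many bytes the array occupies.
size_t abc_bytes(char* out) {
    char s[] = "abc";           // array of 4 chars: 'a', 'b', 'c', '\0'
    assert(out != NULL);
    memcpy(out, s, sizeof(s));
    return sizeof(s);           // 4: three letters plus the null terminator
}
```

The array holds three letters and a null terminator, with no quotation marks anywhere. The string that parse_string(…) stores for the JSON input "abc" should look the same.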
How much code should I end up with?
Here's a screenshot of the instructor's solution. The parts that had to be added to support lists (i.e., HW12) are highlighted in yellow.